# EIA Retail Sales Data – Data Quality and Cleaning
### Data Profiling
We began by inspecting basic column structure, data types, and value ranges using `info()`, `describe()`, and unique-value checks.  
The dataset contains 15,147 monthly records of electricity retail sales for all U.S. states, aggregated across all sectors.  
All key fields (`period`, `stateid`, `sales`, `revenue`) were present and contained no missing values.

A missing-value scan showed that **only the `customers` field had missing data**, with 4,284 missing entries (≈28%).  
Because our project does not use customer counts and because the missingness is extensive and systematic across states and years, we removed this column.
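One way to check whether missingness follows a pattern over time is to compute the share of missing `customers` values per year. The sketch below is illustrative only: it runs on a small made-up frame (`toy`) rather than the real file, and the helper name `missing_share_by_year` is our own, not part of the project code.

```python
import pandas as pd

def missing_share_by_year(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of missing values in `column` for each calendar year."""
    year = df["period"].str.slice(0, 4)  # "YYYY-MM" -> "YYYY"
    return df[column].isna().groupby(year).mean()

# Toy rows for illustration only; these are not values from the EIA file.
toy = pd.DataFrame({
    "period": ["2001-01", "2001-02", "2002-01", "2002-02"],
    "stateid": ["AK", "AK", "AK", "AK"],
    "customers": [None, None, 100.0, 120.0],
})

share = missing_share_by_year(toy, "customers")
print(share)
```

Grouping by `stateid` instead of year would give the same view across states.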

Unit-description fields (`sales-units`, `customers-units`, `revenue-units`) were also dropped because they provide metadata rather than analytical values.  
Similarly, `sectorid` and `sectorName` were removed because all records belong to the same category ("ALL" / "all sectors") and therefore do not contribute meaningful variation.
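A unique-value check of this kind can be sketched as follows. The helper `constant_columns` and the `toy` frame are assumptions made for illustration; the real check would run on the full `eia` DataFrame.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list[str]:
    """List columns that hold a single distinct value (NaN counts as a value)."""
    counts = df.nunique(dropna=False)
    return counts[counts == 1].index.tolist()

# Toy rows for illustration only; the numeric values are rounded placeholders.
toy = pd.DataFrame({
    "stateid": ["AK", "AL", "AR"],
    "sectorid": ["ALL", "ALL", "ALL"],
    "sectorName": ["all sectors"] * 3,
    "sales": [521.0, 7362.5, 3804.2],
    "sales-units": ["million kilowatt hours"] * 3,
})

print(constant_columns(toy))
```

Columns returned by such a check carry no variation and are candidates for removal, although a constant unit label may still be worth recording in the project documentation.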

### Handling Missing Values
- Columns with missing values: `customers` (28.28% missing).  
- Decision: **drop the column** rather than impute, because it is not needed for the analysis and because imputation would not improve our integration with NOAA climate data.
- After removal, **no remaining columns contain missing values** (`period`, `stateid`, `sales`, and `revenue` all report zero).
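The "no missing values remain" claim can be turned into a guard that fails loudly if a later data refresh reintroduces gaps. This is a minimal sketch under our own assumptions: `assert_no_missing` is a hypothetical helper, and `toy` is made-up data.

```python
import pandas as pd

def assert_no_missing(df: pd.DataFrame) -> None:
    """Raise ValueError naming any column that still contains NaN."""
    counts = df.isna().sum()
    offenders = counts[counts > 0]
    if not offenders.empty:
        raise ValueError(f"Missing values remain: {offenders.to_dict()}")

# Toy rows for illustration only.
toy = pd.DataFrame({
    "period": ["2001-01", "2001-01"],
    "sales": [521.0, 7362.5],
    "customers": [None, 10.0],
})

cleaned = toy.drop(columns=["customers"])
assert_no_missing(cleaned)  # passes silently: no NaN is left after the drop
```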

### Additional Cleaning Steps
- Removed redundant or constant columns (`sectorid`, `sectorName`, `stateDescription`).
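- Removed metadata fields (`sales-units`, `customers-units`, `revenue-units`).
- Verified the ranges for `sales` and `revenue`: no negative or zero values were found (the minimums are about 392.2 million kWh and 40.2 million dollars), and the observed distribution matches EIA expectations (high values concentrated in large-population states such as CA and TX).
- Ran an IQR-based outlier check on `revenue`: 1,056 values lie above the upper fence of about 1,662 million dollars. The ten largest all belong to CA or TX, so these rows were retained as genuine large-market values rather than removed.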
- Retained the `period` field in `YYYY-MM` format because date parsing will be handled during dataset integration.
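The range verification can be applied to `sales` and `revenue` in one pass. The sketch below assumes a hypothetical helper, `non_positive_counts`, and uses made-up rows that deliberately include one zero and one negative value so the report is not empty.

```python
import pandas as pd

def non_positive_counts(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Count negative and zero entries for each listed numeric column."""
    return pd.DataFrame({
        "negative": (df[columns] < 0).sum(),
        "zero": (df[columns] == 0).sum(),
    })

# Toy rows for illustration only; the 0.0 and -1.0 are planted on purpose.
toy = pd.DataFrame({
    "sales": [521.0, 7362.5, 0.0],
    "revenue": [52.0, 407.6, -1.0],
})

report = non_positive_counts(toy, ["sales", "revenue"])
print(report)
```

On the cleaned EIA data, every cell of this report would be expected to be zero.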

After these steps, the cleaned dataset contains only the variables required for downstream integration and analysis.
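For reference, the deferred parsing of `period` could look roughly like this at integration time. The column names `period_start` and `period_month` are our own placeholders, and the frame is a toy example; the integration step may choose a different representation.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({"period": ["2001-01", "2025-09"], "stateid": ["AK", "WV"]})

# Parse "YYYY-MM" strings; each month maps to a Timestamp on its first day.
toy["period_start"] = pd.to_datetime(toy["period"], format="%Y-%m")

# Alternative: a monthly Period, which keeps the value at month granularity.
toy["period_month"] = toy["period_start"].dt.to_period("M")

print(toy)
```

Passing an explicit `format` makes the parse fail on malformed strings instead of guessing, which is useful before merging on a date key.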


In [16]:
import pandas as pd

eia_raw_path = "../data/raw/eia_retail_sales.csv"

eia = pd.read_csv(eia_raw_path)

eia

Unnamed: 0,period,stateid,stateDescription,sectorid,sectorName,sales,customers,revenue,sales-units,customers-units,revenue-units
0,2001-01,AK,Alaska,ALL,all sectors,521.03566,,51.96404,million kilowatt hours,number of customers,million dollars
1,2001-01,AL,Alabama,ALL,all sectors,7362.47302,,407.61261,million kilowatt hours,number of customers,million dollars
2,2001-01,AR,Arkansas,ALL,all sectors,3804.21013,,216.58535,million kilowatt hours,number of customers,million dollars
3,2001-01,AZ,Arizona,ALL,all sectors,4786.79176,,304.10688,million kilowatt hours,number of customers,million dollars
4,2001-01,CA,California,ALL,all sectors,21744.31668,,1893.25678,million kilowatt hours,number of customers,million dollars
...,...,...,...,...,...,...,...,...,...,...,...
15142,2025-09,VT,Vermont,ALL,all sectors,409.43849,376794.0,80.80580,million kilowatt hours,number of customers,million dollars
15143,2025-09,WA,Washington,ALL,all sectors,6788.16737,3844518.0,762.56994,million kilowatt hours,number of customers,million dollars
15144,2025-09,WI,Wisconsin,ALL,all sectors,5688.77164,3256533.0,773.76893,million kilowatt hours,number of customers,million dollars
15145,2025-09,WV,West Virginia,ALL,all sectors,2546.36356,1029984.0,282.59998,million kilowatt hours,number of customers,million dollars


In [17]:
# DataFrame.info() prints its report directly and returns None,
# so it is called without print() to avoid a stray "None" line.
print("=== Column Info ===")
eia.info()

=== Column Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15147 entries, 0 to 15146
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   period            15147 non-null  object 
 1   stateid           15147 non-null  object 
 2   stateDescription  15147 non-null  object 
 3   sectorid          15147 non-null  object 
 4   sectorName        15147 non-null  object 
 5   sales             15147 non-null  float64
 6   customers         10863 non-null  float64
 7   revenue           15147 non-null  float64
 8   sales-units       15147 non-null  object 
 9   customers-units   15147 non-null  object 
 10  revenue-units     15147 non-null  object 
dtypes: float64(3), object(8)
memory usage: 1.3+ MB


In [18]:
print("=== Missing Values (Count & %) ===")
missing_count = eia.isna().sum()
missing_percent = eia.isna().mean() * 100

missing_df = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percent": missing_percent.round(2)
})

display(missing_df)


=== Missing Values (Count & %) ===


Unnamed: 0,missing_count,missing_percent
period,0,0.0
stateid,0,0.0
stateDescription,0,0.0
sectorid,0,0.0
sectorName,0,0.0
sales,0,0.0
customers,4284,28.28
revenue,0,0.0
sales-units,0,0.0
customers-units,0,0.0
revenue-units,0,0.0


In [19]:
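# Drop the customer-count column (28.28% missing, unused in the analysis)
# together with its units label.
eia_clean = eia.drop(columns=["customers", "customers-units"])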

In [20]:
print("=== Basic Statistics for revenue ===")
display(eia_clean["revenue"].describe())

print("\n=== Check for Negative or Zero Values ===")
print("revenue < 0 :", (eia_clean["revenue"] < 0).sum())
print("revenue == 0:", (eia_clean["revenue"] == 0).sum())

print("\n=== Top 10 Highest revenue Values ===")
display(eia_clean.sort_values("revenue", ascending=False).head(10))

print("\n=== IQR-based Outlier Check ===")
Q1 = eia_clean["revenue"].quantile(0.25)
Q3 = eia_clean["revenue"].quantile(0.75)
IQR = Q3 - Q1
upper_fence = Q3 + 1.5 * IQR

print("Upper fence for outlier detection:", upper_fence)
print("Values above fence:", (eia_clean["revenue"] > upper_fence).sum())


=== Basic Statistics for revenue ===


count    15147.000000
mean       613.349974
std        707.887815
min         40.153100
25%        160.546685
50%        411.986610
75%        761.202385
max       7759.617100
Name: revenue, dtype: float64


=== Check for Negative or Zero Values ===
revenue < 0 : 0
revenue == 0: 0

=== Top 10 Highest revenue Values ===


Unnamed: 0,period,stateid,stateDescription,sectorid,sectorName,sales,revenue,sales-units,revenue-units
14386,2024-07,CA,California,ALL,all sectors,25560.43547,7759.6171,million kilowatt hours,million dollars
14437,2024-08,CA,California,ALL,all sectors,25547.24336,7426.65572,million kilowatt hours,million dollars
15049,2025-08,CA,California,ALL,all sectors,23990.05275,7030.66884,million kilowatt hours,million dollars
13825,2023-08,CA,California,ALL,all sectors,25101.55563,6948.44467,million kilowatt hours,million dollars
15100,2025-09,CA,California,ALL,all sectors,22932.82692,6917.49793,million kilowatt hours,million dollars
14998,2025-07,CA,California,ALL,all sectors,22623.72628,6795.81309,million kilowatt hours,million dollars
14488,2024-09,CA,California,ALL,all sectors,23016.12769,6717.64485,million kilowatt hours,million dollars
13213,2022-08,CA,California,ALL,all sectors,27435.43426,6715.51299,million kilowatt hours,million dollars
13264,2022-09,CA,California,ALL,all sectors,26005.26868,6250.54789,million kilowatt hours,million dollars
13864,2023-08,TX,Texas,ALL,all sectors,54396.52465,6211.17856,million kilowatt hours,million dollars



=== IQR-based Outlier Check ===
Upper fence for outlier detection: 1662.185935
Values above fence: 1056
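The check above reports only how many values exceed the upper fence. To see which states account for the flagged rows, the fences can be computed in a small helper and the flagged rows tallied by `stateid`. This is a sketch on made-up data; `iqr_fences` is a hypothetical helper, and it also returns the lower fence, which the cell above does not compute.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy rows for illustration only; one deliberately large value for CA.
toy = pd.DataFrame({
    "stateid": ["AK", "AL", "AR", "AZ", "CO", "CT", "DE", "CA"],
    "revenue": [50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 900.0],
})

lower, upper = iqr_fences(toy["revenue"])
flagged = toy[(toy["revenue"] < lower) | (toy["revenue"] > upper)]

# Tally flagged rows per state to see where the extreme values come from.
print(flagged["stateid"].value_counts())
```

Because revenue is right-skewed across states of very different size, a per-state fence (grouping by `stateid` before computing quantiles) may be a more informative variant than a single national fence.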


In [21]:
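# Drop the constant sector columns, the remaining unit labels, and the
# state name (redundant with `stateid`), keeping only the analysis fields.
eia_clean = eia_clean.drop(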
    columns=[
        "sectorid",
        "sectorName",
        "sales-units",
        "revenue-units",
        "stateDescription"
    ]
).copy()

eia_clean.head()

Unnamed: 0,period,stateid,sales,revenue
0,2001-01,AK,521.03566,51.96404
1,2001-01,AL,7362.47302,407.61261
2,2001-01,AR,3804.21013,216.58535
3,2001-01,AZ,4786.79176,304.10688
4,2001-01,CA,21744.31668,1893.25678


In [22]:
print("=== Missing Values ===")
print(eia_clean.isna().sum())

print("\n=== Summary Statistics ===")
display(eia_clean.describe())


=== Missing Values ===
period     0
stateid    0
sales      0
revenue    0
dtype: int64

=== Summary Statistics ===


Unnamed: 0,sales,revenue
count,15147.0,15147.0
mean,6095.73192,613.349974
std,6046.233268,707.887815
min,392.21815,40.1531
25%,1853.300335,160.546685
50%,4623.56158,411.98661
75%,8133.53587,761.202385
max,54396.52465,7759.6171
