# Data Wrangling and EDA - Capstone 3


We have three separate datasets:

1) Cities_sensus_data - population and ownership data
2) Complete_all_cities_daily_climate_data - solar irradiation data from NASA Solar
3) Eia_energy_data - monthly energy usage and cost per state

We are going to start by taking a look at each of these datasets individually for Data Wrangling, then merge them before we move on to Exploratory Data Analysis. 

This notebook covers Data Wrangling for the EIA Energy Data.


Column details:
period - the date, in YYYY-MM format
stateid - the 2-letter state ID; will be dropped
stateDescription - the full name of the state
sectorid - RES for every row, because we focus only on the residential market
price - unit of measure: cents per kilowatthour
sales - unit of measure: million kilowatthours
The price-units and sales-units columns will be dropped.
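
Before dropping the ID and unit columns, it can be worth confirming that each of them really holds a single value. The following is a minimal sketch on a small hand-built sample that mirrors the first rows of the file; the `sample` frame and the `candidates` and `constant_cols` names are illustrative and not part of the notebook. On the real data the same check would run against the loaded DataFrame.

```python
import pandas as pd

# Illustrative sample mirroring the first rows of the EIA file (not the full dataset)
sample = pd.DataFrame({
    "period": ["2024-05", "2024-05", "2024-05"],
    "stateid": ["AL", "MN", "MI"],
    "stateDescription": ["Alabama", "Minnesota", "Michigan"],
    "sectorid": ["RES", "RES", "RES"],
    "sectorName": ["residential"] * 3,
    "price": [14.73, 15.69, 19.44],
    "sales": [2532.77329, 1571.5684, 2477.25469],
    "price-units": ["cents per kilowatthour"] * 3,
    "sales-units": ["million kilowatthours"] * 3,
})

# Columns we expect to be constant and therefore safe to drop
candidates = ["sectorid", "sectorName", "price-units", "sales-units"]

# Count distinct values per candidate column and keep those with exactly one
n_unique = sample[candidates].nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

If a candidate column is missing from `constant_cols` on the full data, it holds more than one value and deserves a closer look before it is dropped.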


##### price (cents per kilowatt hour)
##### sales (million kilowatt hours)

### Data Wrangling - EIA Energy Data

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os


In [2]:

eia_df = pd.read_csv('../../raw/eia_energy_data.csv')

eia_df.head()

Unnamed: 0,period,stateid,stateDescription,sectorid,sectorName,price,sales,price-units,sales-units
0,2024-05,AL,Alabama,RES,residential,14.73,2532.77329,cents per kilowatthour,million kilowatthours
1,2024-05,MN,Minnesota,RES,residential,15.69,1571.5684,cents per kilowatthour,million kilowatthours
2,2024-05,MI,Michigan,RES,residential,19.44,2477.25469,cents per kilowatthour,million kilowatthours
3,2024-05,MA,Massachusetts,RES,residential,28.7,1446.62243,cents per kilowatthour,million kilowatthours
4,2024-05,MD,Maryland,RES,residential,17.63,1858.93542,cents per kilowatthour,million kilowatthours


In [3]:
eia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2542 entries, 0 to 2541
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   period            2542 non-null   object 
 1   stateid           2542 non-null   object 
 2   stateDescription  2542 non-null   object 
 3   sectorid          2542 non-null   object 
 4   sectorName        2542 non-null   object 
 5   price             2542 non-null   float64
 6   sales             2542 non-null   float64
 7   price-units       2542 non-null   object 
 8   sales-units       2542 non-null   object 
dtypes: float64(2), object(7)
memory usage: 178.9+ KB


Let's start by dropping the unnecessary columns.

In [14]:
eia_df.drop(columns=['stateid', 'sectorid', 'sectorName', 'price-units', 'sales-units'], inplace=True)

eia_df.head()

Unnamed: 0,period,stateDescription,price,sales
0,2024-05,Alabama,14.73,2532.77329
1,2024-05,Minnesota,15.69,1571.5684
2,2024-05,Michigan,19.44,2477.25469
3,2024-05,Massachusetts,28.7,1446.62243
4,2024-05,Maryland,17.63,1858.93542


In [15]:
missing_values = eia_df.isnull().sum()
summary = eia_df.describe()

print(missing_values)
print(summary)

period              0
stateDescription    0
price               0
sales               0
dtype: int64
             price          sales
count  2542.000000    2542.000000
mean     16.047290    5905.854840
std       6.056787   16079.972529
min       8.830000     138.422540
25%      12.260000     782.913565
50%      13.970000    2144.375185
75%      17.337500    4489.892900
max      45.590000  164276.577890
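
Alongside the null check, a duplicate check on the natural key can be a useful sanity test, assuming each state should appear at most once per period. This is a sketch on an illustrative sample built from the rows shown earlier; `sample`, `dup_mask`, and `n_duplicates` are hypothetical names, and the one-row-per-state-per-period assumption is not confirmed by the notebook.

```python
import pandas as pd

# Illustrative sample; on the real data, run the same check on eia_df
sample = pd.DataFrame({
    "period": ["2024-05", "2024-05", "2024-05"],
    "stateDescription": ["Alabama", "Minnesota", "Michigan"],
    "price": [14.73, 15.69, 19.44],
    "sales": [2532.77329, 1571.5684, 2477.25469],
})

# Flag every row whose (period, state) pair occurs more than once
dup_mask = sample.duplicated(subset=["period", "stateDescription"], keep=False)
n_duplicates = int(dup_mask.sum())
print(n_duplicates)
```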



The period column holds dates stored as strings, so we need to convert it to a datetime dtype. Afterwards we rename the stateDescription column.


In [16]:
eia_df['period'] = pd.to_datetime(eia_df['period'], format='%Y-%m')

In [17]:
# info() prints its report directly, so it does not need to be wrapped in print()
eia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2542 entries, 0 to 2541
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   period            2542 non-null   datetime64[ns]
 1   stateDescription  2542 non-null   object        
 2   price             2542 non-null   float64       
 3   sales             2542 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 79.6+ KB


In [18]:
eia_df.head()

Unnamed: 0,period,stateDescription,price,sales
0,2024-05-01,Alabama,14.73,2532.77329
1,2024-05-01,Minnesota,15.69,1571.5684
2,2024-05-01,Michigan,19.44,2477.25469
3,2024-05-01,Massachusetts,28.7,1446.62243
4,2024-05-01,Maryland,17.63,1858.93542
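
Parsing a YYYY-MM string with `pd.to_datetime` produces a full timestamp, so every value is anchored to the first day of its month. If a month-level representation is preferred, one option is pandas' period dtype. The sketch below shows both on illustrative values; the variable names are hypothetical and the notebook itself keeps the timestamp form.

```python
import pandas as pd

# Illustrative values in the same YYYY-MM format as the period column
periods = pd.Series(["2024-05", "2024-04"])

# Timestamp form: each month is pinned to its first day
as_timestamp = pd.to_datetime(periods, format="%Y-%m")

# Period form: keeps monthly granularity without implying a specific day
as_period = as_timestamp.dt.to_period("M")

print(as_timestamp.dt.day.tolist())
print(as_period.astype(str).tolist())
```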


In [23]:
# Rename stateDescription to state

eia_df.rename(columns={'stateDescription': 'state'}, inplace=True)

In [25]:
eia_df.head()

Unnamed: 0,period,state,price,sales
0,2024-05-01,Alabama,14.73,2532.77
1,2024-05-01,Minnesota,15.69,1571.57
2,2024-05-01,Michigan,19.44,2477.25
3,2024-05-01,Massachusetts,28.7,1446.62
4,2024-05-01,Maryland,17.63,1858.94


In [26]:
# Round sales to 2 decimal places
eia_df['sales'] = eia_df['sales'].round(2)

eia_df.head()

Unnamed: 0,period,state,price,sales
0,2024-05-01,Alabama,14.73,2532.77
1,2024-05-01,Minnesota,15.69,1571.57
2,2024-05-01,Michigan,19.44,2477.25
3,2024-05-01,Massachusetts,28.7,1446.62
4,2024-05-01,Maryland,17.63,1858.94
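
The cleaning steps so far can also be collected into a single function, which avoids `inplace=True` and makes the notebook safe to re-run from any cell. This is only a sketch: `clean_eia` is a hypothetical helper, and `sample` is an illustrative frame built from the first two rows shown earlier.

```python
import pandas as pd

def clean_eia(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning steps and return a new frame."""
    return (
        raw.drop(columns=["stateid", "sectorid", "sectorName", "price-units", "sales-units"])
        # Convert YYYY-MM strings to timestamps (first day of each month)
        .assign(period=lambda d: pd.to_datetime(d["period"], format="%Y-%m"))
        .rename(columns={"stateDescription": "state"})
        # Round sales to 2 decimal places
        .assign(sales=lambda d: d["sales"].round(2))
    )

# Illustrative input mirroring the raw file's columns
sample = pd.DataFrame({
    "period": ["2024-05", "2024-05"],
    "stateid": ["AL", "MN"],
    "stateDescription": ["Alabama", "Minnesota"],
    "sectorid": ["RES", "RES"],
    "sectorName": ["residential", "residential"],
    "price": [14.73, 15.69],
    "sales": [2532.77329, 1571.5684],
    "price-units": ["cents per kilowatthour"] * 2,
    "sales-units": ["million kilowatthours"] * 2,
})

cleaned = clean_eia(sample)
print(cleaned)
```

Because the function returns a new DataFrame, the raw input stays untouched and the result does not depend on the order in which cells were executed.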


In [27]:
# Export and save the data
eia_df.to_csv('eia_data_cleaned.csv', index=False)

print('Exported successfully')

Exported successfully
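
One caveat when reloading the exported file: CSV stores no dtype information, so the period column comes back as text unless it is parsed again on read. The sketch below demonstrates this with an in-memory buffer instead of a file on disk; the `cleaned` frame is an illustrative stand-in for the cleaned data.

```python
import io
import pandas as pd

# Illustrative stand-in for the cleaned data
cleaned = pd.DataFrame({
    "period": pd.to_datetime(["2024-05", "2024-05"], format="%Y-%m"),
    "state": ["Alabama", "Minnesota"],
    "price": [14.73, 15.69],
    "sales": [2532.77, 1571.57],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Reading back without parse_dates leaves period as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates restores the datetime dtype
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["period"])

print(plain["period"].dtype)
print(parsed["period"].dtype)
```

Passing `parse_dates=["period"]` when the merged notebook loads `eia_data_cleaned.csv` would avoid repeating the conversion step there.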
