# Extract, Transform, Load (ETL)
---
* Source `https://www.eia.gov/opendata/browser/`
* Main Route `Electricity`
    * Sub Route 1 - `Electric Power Operations (Annual And Monthly)`    
    * Sub Route 2 - `State Specific Data`
        * Sub Sub Route - `Emissions From Energy Consumption At Conventional Power Plants and Combined-Heat-And-Power Plants`
* Year range `2012` to `2022`
---
An API key can be obtained by signing up at `https://www.eia.gov/opendata/` and should then be assigned to the variable `api_key` in the `config.py` file. The API URL path can be copied from the data browser after choosing the primary route and its sub-routes, then used here.

More information regarding EIA's API documentation can be found at `https://www.eia.gov/opendata/documentation.php`.
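
As a rough illustration of how the copied URL is used later in this notebook, the sketch below fills the `||KEY||`, `||START||`, and `||END||` placeholders in a URL template. The helper name `fill_placeholders`, the key `MY_KEY`, and the shortened template are assumptions made for illustration; only the placeholder tokens match the ones used in the Extract step.

```python
def fill_placeholders(template, api_key, start, end):
    """Return the template with the key and period placeholders substituted."""
    return (
        template
        .replace('||KEY||', api_key)
        .replace('||START||', start)
        .replace('||END||', end)
    )

# Shortened, illustrative template; the real one carries more query parameters
template = (
    'https://api.eia.gov/v2/electricity/electric-power-operational-data/data/'
    '?api_key=||KEY||&frequency=monthly&start=||START||&end=||END||'
)

# 'MY_KEY' is a made-up key used only for this example
example_path = fill_placeholders(template, 'MY_KEY', '2012-01', '2012-01')
print(example_path)
```

Using the same value for start and end requests a single period per call, which is how `request_to_df` loops through the periods.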

**Goal** - Combine the 2 sub-routes for Electricity, covering 2012 to 2022, into a data report that supports the Machine Learning (ML) process of determining the most sustainable fuel category for electricity generation by:<br/>
* Categorizing fuel types into the following bins:
    * `Fossil fuels` - anthracite coal, bituminous coal, bituminous coal and synthetic coal, 'coal, excluding waste coal', distillate fuel oil, fossil fuels, ignite coal, natural gas, natural gas & other gases, other gases, petroleum, petroleum coke, petroleum liquids, refined coal, residual fuel oil, subbituminous coal, lignite coal
        * From Emission data - coal, natural gas, petroleum
    * `Renewables` - biogenic municipal solid waste, biomass, conventional hydroelectric, estimated small scale solar photovoltaic, estimated total solar, estimated total solar photovoltaic, geothermal, hydro-electric pumped storage, landfill gas, municiapl landfill gas, offshore wind turbine, onshore wind turbine, renewable, renewable waste products, solar, solar photovoltaic, solar thermal, waste coal, waste oil and other oils, wind, wood and wood wastes, other renewables
        * From Emission data - total
    * `Others` - nuclear, other (sources not specified by EIA)

* Categorizing `stateDescription` into regional bins and renaming it as `region`:
    * `West Region` - 'Alaska', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana', 'Nevada', 'Oregon', 'Utah', 'Washington', 'Wyoming'
    * `Southwest Region` - 'Arizona', 'New Mexico', 'Oklahoma', 'Texas'
    * `Midwest Region` - 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin'
    * `Southeast Region` - 'Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee', 'Virginia', 'West Virginia'
    * `Northeast Region` - All other states not included in the above regions

*Note:* Puerto Rico is excluded from the data; the focus is on the 50 states.
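
The regional binning described above could be expressed as a plain lookup, sketched here under the assumption that any state not listed falls into the Northeast bin. The names `REGIONS`, `STATE_TO_REGION`, and `to_region` are hypothetical and are not used elsewhere in this notebook.

```python
# State lists copied from the regional bins described above
REGIONS = {
    'West Region': [
        'Alaska', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana',
        'Nevada', 'Oregon', 'Utah', 'Washington', 'Wyoming'
    ],
    'Southwest Region': ['Arizona', 'New Mexico', 'Oklahoma', 'Texas'],
    'Midwest Region': [
        'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota',
        'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin'
    ],
    'Southeast Region': [
        'Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky', 'Louisiana',
        'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee',
        'Virginia', 'West Virginia'
    ],
}

# Invert the mapping so each state points at its region
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states
}

def to_region(state):
    """Return the region for a state, defaulting to the Northeast bin."""
    # Anything not listed above lands here, so non-state rows such as
    # Puerto Rico would need to be filtered out before calling this
    return STATE_TO_REGION.get(state, 'Northeast Region')

print(to_region('Texas'), '|', to_region('Maine'))
```

Because the fallback catches every unlisted value, filtering out non-state rows first matters more than the lookup itself.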

Binning information is based on sources from `https://www.eia.gov/tools/faqs/faq.php?id=427&t=3`, `https://www.eia.gov/electricity/data/browser/`, and `https://www.eia.gov/dnav/pet/TblDefs/pet_cons_821dst_tbldef2.asp`.

In [1]:
# Import dependencies
from config import api_key
import json
import requests
import pandas as pd

### Functions

In [2]:
def request_to_df(url, api_key, years = None):
    '''Request data from the target API once per period in years, collect the rows in a list, then combine them into one DataFrame'''
    data = []
    
    for year in (years or []):
        api_path = url.replace('||KEY||', api_key).replace('||START||', year).replace('||END||', year)
        
        # Send the request
        response = requests.get(api_path).json()
        
        # Verify the response: collect the rows if no warning or error was returned, otherwise raise an error
        if 'warning' not in response and 'error' not in response:
            data += response['response']['data']
        else:
            raise Exception('Bad request submitted or no response received from the source API, verify that the url and/or offset provided is correct')
    
    df = pd.DataFrame(data)
    return df

def category_bin(df, check_col, list_to_bin, bin_name, new_col = ''):
    '''
        Function to create a bin category for a DataFrame based on the provided list then replace existing value with a bin category.
        If new_col is provided, a new column will be created for the binned category
    '''
    tmp_df = df.copy()
    
    for item in list_to_bin:
        if new_col == '' or new_col.isspace():
            tmp_df[check_col] = tmp_df[check_col].replace(item, bin_name)
        else:
            tmp_df.loc[tmp_df[check_col] == item, new_col] = bin_name
    
    return tmp_df

def fix_nan(df, col, fill_value = 0, to_type = 'float'):
    '''Function to fill the selected column's NaN values with the provided value and change its type'''
    tmp_df = df.copy()
    tmp_df[col] = tmp_df[col].fillna(fill_value)
    tmp_df[col] = tmp_df[col].astype(to_type)
    
    return tmp_df
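
As a possible alternative to looping over each item in `category_bin`, the new-column case can be written with a single `isin()` mask. This is only a sketch: `category_bin_isin` and the toy `demo` DataFrame are hypothetical, and it covers only the branch where `new_col` is provided.

```python
import pandas as pd

def category_bin_isin(df, check_col, list_to_bin, bin_name, new_col):
    """Set new_col to bin_name for every row whose check_col value is in list_to_bin."""
    tmp_df = df.copy()
    # One boolean mask replaces the per-item loop
    tmp_df.loc[tmp_df[check_col].isin(list_to_bin), new_col] = bin_name
    return tmp_df

# Toy data for illustration only
demo = pd.DataFrame({'fuelTypeDescription': ['natural gas', 'wind', 'nuclear']})
demo = category_bin_isin(demo, 'fuelTypeDescription', ['natural gas'], 'fossil fuels', 'energySource')
demo = category_bin_isin(demo, 'fuelTypeDescription', ['wind'], 'renewables', 'energySource')
print(demo)
```

Rows that match none of the lists keep a missing value in the new column, which makes unbinned categories easy to spot.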

### Extract

In [3]:
# Set years for the API to go through
epo_years = [str(year) + '-{:02d}'.format(month) for year in range(2008, 2023) for month in range(1, 13)]
emission_years = [str(year) for year in range(2008, 2023)]

# Set up the paths for retrieving the data into DataFrames, with ||START|| and ||END|| as placeholders for the start and end parameters
epo_url = 'https://api.eia.gov/v2/electricity/electric-power-operational-data/data/?api_key=||KEY||&frequency=monthly&data[0]=ash-content&data[1]=consumption-for-eg&data[2]=consumption-for-eg-btu&data[3]=consumption-uto&data[4]=consumption-uto-btu&data[5]=cost&data[6]=cost-per-btu&data[7]=generation&data[8]=heat-content&data[9]=receipts&data[10]=receipts-btu&data[11]=stocks&data[12]=sulfur-content&data[13]=total-consumption&data[14]=total-consumption-btu&start=||START||&end=||END||&sort[0][column]=period&sort[0][direction]=desc&offset=0&length=5000'
emission_url = 'https://api.eia.gov/v2/electricity/state-electricity-profiles/emissions-by-state-by-fuel/data/?api_key=||KEY||&frequency=annual&data[0]=co2-rate-lbs-mwh&data[1]=co2-thousand-metric-tons&data[2]=nox-rate-lbs-mwh&data[3]=nox-short-tons&data[4]=so2-rate-lbs-mwh&data[5]=so2-short-tons&start=||START||&end=||END||&sort[0][column]=period&sort[0][direction]=desc&offset=0&length=5000'
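
Both URLs use `offset=0&length=5000`, so each request asks for at most 5000 rows of a period. If a period could hold more rows than that, the offset would have to advance between calls. The sketch below shows one way to do this. `fetch_all_rows` and `fetch_page` are hypothetical names, and `fetch_page` stands in for a single API call returning the rows of `response['response']['data']`; a fake in-memory source is used so the example runs without network access.

```python
def fetch_all_rows(fetch_page, length=5000):
    """Collect rows page by page until a page comes back shorter than length."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, length)
        rows.extend(page)
        # A short (or empty) page means there is nothing left to request
        if len(page) < length:
            break
        offset += length
    return rows

# Stand-in for the real API: 12 fake rows served in slices
fake_rows = [{'row': i} for i in range(12)]

def fake_fetch_page(offset, length):
    return fake_rows[offset:offset + length]

all_rows = fetch_all_rows(fake_fetch_page, length=5)
print(len(all_rows))
```

In `request_to_df`, the same idea would mean replacing the fixed `offset=0` in the URL with a placeholder that is filled on every iteration.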

In [5]:
# Get the request and create the DataFrames
epo_raw_df = request_to_df(epo_url, api_key, epo_years)
emission_raw_df = request_to_df(emission_url, api_key, emission_years)

# Alternatively, if the raw csv files already exist, load them instead (make sure to comment out the API requests above)
epo_raw_df = pd.read_csv('../static/data/epo_2012_2022_raw.csv')
emission_raw_df = pd.read_csv('../static/data/emission_2012_2022_raw.csv')

In [6]:
# Export out raw data to csv files
epo_raw_df.to_csv('../static/data/epo_2012_2022_raw.csv', index = False)
emission_raw_df.to_csv('../static/data/emission_2012_2022_raw.csv', index = False)

In [7]:
# Print out the keys for epo
epo_raw_df.keys()

Index(['period', 'location', 'stateDescription', 'sectorid',
       'sectorDescription', 'fueltypeid', 'fuelTypeDescription', 'ash-content',
       'ash-content-units', 'consumption-for-eg', 'consumption-for-eg-units',
       'consumption-for-eg-btu', 'consumption-for-eg-btu-units',
       'consumption-uto', 'consumption-uto-units', 'consumption-uto-btu',
       'consumption-uto-btu-units', 'cost', 'cost-units', 'cost-per-btu',
       'cost-per-btu-units', 'generation', 'generation-units', 'heat-content',
       'heat-content-units', 'receipts', 'receipts-units', 'receipts-btu',
       'receipts-btu-units', 'stocks', 'stocks-units', 'sulfur-content',
       'sulfur-content-units', 'total-consumption', 'total-consumption-units',
       'total-consumption-btu', 'total-consumption-btu-units'],
      dtype='object')

In [8]:
# Print out the types
epo_raw_df.dtypes

period                          object
location                        object
stateDescription                object
sectorid                        object
sectorDescription               object
fueltypeid                      object
fuelTypeDescription             object
ash-content                     object
ash-content-units               object
consumption-for-eg              object
consumption-for-eg-units        object
consumption-for-eg-btu          object
consumption-for-eg-btu-units    object
consumption-uto                 object
consumption-uto-units           object
consumption-uto-btu             object
consumption-uto-btu-units       object
cost                            object
cost-units                      object
cost-per-btu                    object
cost-per-btu-units              object
generation                      object
generation-units                object
heat-content                    object
heat-content-units              object
receipts                 

In [9]:
# Print out the keys for emission
emission_raw_df.keys()

Index(['period', 'stateid', 'stateDescription', 'fuelid', 'fuelDescription',
       'co2-rate-lbs-mwh', 'co2-thousand-metric-tons', 'nox-rate-lbs-mwh',
       'nox-short-tons', 'so2-rate-lbs-mwh', 'so2-short-tons',
       'co2-rate-lbs-mwh-units', 'co2-thousand-metric-tons-units',
       'nox-rate-lbs-mwh-units', 'nox-short-tons-units',
       'so2-rate-lbs-mwh-units', 'so2-short-tons-units'],
      dtype='object')

In [10]:
# Print out the types
emission_raw_df.dtypes

period                            object
stateid                           object
stateDescription                  object
fuelid                            object
fuelDescription                   object
co2-rate-lbs-mwh                  object
co2-thousand-metric-tons          object
nox-rate-lbs-mwh                  object
nox-short-tons                    object
so2-rate-lbs-mwh                  object
so2-short-tons                    object
co2-rate-lbs-mwh-units            object
co2-thousand-metric-tons-units    object
nox-rate-lbs-mwh-units            object
nox-short-tons-units              object
so2-rate-lbs-mwh-units            object
so2-short-tons-units              object
dtype: object

### Transform

In [61]:
# Create copies of the DataFrames
epo_cleaned_df = epo_raw_df.copy()
emission_cleaned_df = emission_raw_df.copy()

---
**epo_df**

In [62]:
# Drop the columns not needed for our objective
epo_cleaned_df = epo_cleaned_df.drop([
    'location', 'sectorid', 'fueltypeid', 'sectorDescription', 'ash-content-units', 'consumption-for-eg-units',
    'consumption-for-eg-btu-units', 'consumption-uto-units', 'consumption-uto-btu-units', 'cost-units', 'cost-per-btu-units',
    'generation-units', 'heat-content-units', 'receipts-units', 'receipts-btu-units', 'stocks-units', 'sulfur-content-units',
    'total-consumption-units', 'total-consumption-btu-units'
], axis = 1)
print(epo_cleaned_df.columns)

Index(['period', 'stateDescription', 'fuelTypeDescription', 'ash-content',
       'consumption-for-eg', 'consumption-for-eg-btu', 'consumption-uto',
       'consumption-uto-btu', 'cost', 'cost-per-btu', 'generation',
       'heat-content', 'receipts', 'receipts-btu', 'stocks', 'sulfur-content',
       'total-consumption', 'total-consumption-btu'],
      dtype='object')


In [63]:
# Print out the value_counts() in fuelTypeDescription for epo_cleaned_df
print('epo', epo_cleaned_df['fuelTypeDescription'].value_counts())

epo fuelTypeDescription
biomass                                     75729
all fuels                                   42153
fossil fuels                                40516
natural gas & other gases                   38337
renewable                                   38017
natural gas                                 37957
petroleum                                   36290
petroleum liquids                           36220
distillate fuel oil                         35446
all renewables                              35187
renewable waste products                    28506
all coal products                           26782
coal, excluding waste coal                  26699
other                                       23207
conventional hydroelectric                  23156
bituminous coal                             22200
bituminous coal and synthetic coal          22159
municiapl landfill gas                      21377
landfill gas                                20493
other renewables          

In [64]:
# Remove all rows where fuelTypeDescription for epo_cleaned_df is 'all coal products', 'all fuels', or 'all renewables'
epo_cleaned_df = epo_cleaned_df[
    ~epo_cleaned_df['fuelTypeDescription'].isin([
        'all coal products', 'all fuels', 'all renewables'
    ])
]

# Keep only rows where stateDescription is 'U.S. Total' since we are only interested in the national level, then drop the
# column afterward as it is no longer needed for our objective
epo_cleaned_df = epo_cleaned_df[epo_cleaned_df['stateDescription'].isin(['U.S. Total'])].drop(columns = 'stateDescription')

In [65]:
epo_cleaned_df.head()

Unnamed: 0,period,fuelTypeDescription,ash-content,consumption-for-eg,consumption-for-eg-btu,consumption-uto,consumption-uto-btu,cost,cost-per-btu,generation,heat-content,receipts,receipts-btu,stocks,sulfur-content,total-consumption,total-consumption-btu
35,2008-01,biogenic municipal solid waste,,38.861,0.33181,0,0,0.0,,16.22972,,0.0,0.0,0.0,,38.861,0.33181
36,2008-01,natural gas,0.0,213193.626,218.25615,0,0,8.5,8.3075,25795.45248,1.0237,216900.522,221807.3309,0.0,0.0,213193.626,218.25615
40,2008-01,nuclear,,0.0,400.12862,0,0,0.0,,38151.089,,,,,,0.0,400.12862
41,2008-01,biomass,,312.321,0.24026,0,0,0.0,,19.57995,,0.0,0.0,0.0,,312.321,0.24026
42,2008-01,other gases,0.0,152.24,0.19821,0,0,11.52,19.2457,4.848,1.302,,,0.0,0.0,152.24,0.19821


In [66]:
# Create bins based on fuel type description into 'Fossil Fuels', 'Renewables', and 'Others' for epo_cleaned_df under
# energySource
epo_ff_source = [
    'anthracite coal', 'bituminous coal', 'bituminous coal and synthetic coal', 'coal, excluding waste coal', 
    'distillate fuel oil', 'fossil fuels', 'ignite coal', 'natural gas', 'natural gas & other gases', 'other gases', 
    'petroleum', 'petroleum coke', 'petroleum liquids', 'refined coal', 'residual fuel oil', 'subbituminous coal', 'lignite coal'
]

epo_renew_source = [
    'biogenic municipal solid waste', 'biomass', 'conventional hydroelectric', 'estimated small scale solar photovoltaic', 
    'estimated total solar', 'estimated total solar photovoltaic', 'geothermal', 'hydro-electric pumped storage', 
    'landfill gas', 'municiapl landfill gas', 'offshore wind turbine', 'onshore wind turbine', 'renewable', 
    'renewable waste products', 'solar', 'solar photovoltaic', 'solar thermal', 'waste coal', 'waste oil and other oils', 
    'wind', 'wood and wood wastes', 'other renewables'
]

epo_oth_source = [item for item in epo_cleaned_df['fuelTypeDescription'].value_counts().index if (item not in epo_ff_source and item not in epo_renew_source)]

epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_ff_source, 'fossil fuels', 'energySource')
epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_renew_source, 'renewables', 'energySource')
epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_oth_source, 'others', 'energySource')



In [67]:
# Check the value_counts() again to make sure binning was done correctly
epo_cleaned_df['energySource'].value_counts()

energySource
renewables      13741
fossil fuels    10714
others           1115
Name: count, dtype: int64

In [68]:
# Fill NaN values with 0 then set these columns as float
# 'ash-content', 'consumption-for-eg-btu', 'consumption-uto-btu', 'cost-per-btu', 'generation', 'heat-content',
# 'receipts-btu', 'sulfur-content', 'total-consumption-btu'

# Other units (in case more features are needed): 'consumption-for-eg', 'consumption-uto', 'cost', 'receipts', 'total-consumption',
# 'stocks'
epo_cols = [
    'ash-content', 'consumption-for-eg-btu', 'consumption-uto-btu', 'cost-per-btu', 'generation', 'heat-content',
    'receipts-btu', 'sulfur-content', 'total-consumption-btu', 'consumption-for-eg', 'consumption-uto', 'cost', 
    'receipts', 'total-consumption', 'stocks'
]

# Pass one column at a time rather than the whole list on every iteration
for col in epo_cols:
    epo_cleaned_df = fix_nan(epo_cleaned_df, col)

In [69]:
# Review the cleaned DF before additional cleaning
display(epo_cleaned_df.head())
print('epo\'s shape:', epo_cleaned_df.shape)

Unnamed: 0,period,fuelTypeDescription,ash-content,consumption-for-eg,consumption-for-eg-btu,consumption-uto,consumption-uto-btu,cost,cost-per-btu,generation,heat-content,receipts,receipts-btu,stocks,sulfur-content,total-consumption,total-consumption-btu,energySource
35,2008-01,biogenic municipal solid waste,0.0,38.861,0.33181,0.0,0.0,0.0,0.0,16.22972,0.0,0.0,0.0,0.0,0.0,38.861,0.33181,renewables
36,2008-01,natural gas,0.0,213193.626,218.25615,0.0,0.0,8.5,8.3075,25795.45248,1.0237,216900.522,221807.3309,0.0,0.0,213193.626,218.25615,fossil fuels
40,2008-01,nuclear,0.0,0.0,400.12862,0.0,0.0,0.0,0.0,38151.089,0.0,0.0,0.0,0.0,0.0,0.0,400.12862,others
41,2008-01,biomass,0.0,312.321,0.24026,0.0,0.0,0.0,0.0,19.57995,0.0,0.0,0.0,0.0,0.0,312.321,0.24026,renewables
42,2008-01,other gases,0.0,152.24,0.19821,0.0,0.0,11.52,19.2457,4.848,1.302,0.0,0.0,0.0,0.0,152.24,0.19821,fossil fuels


epo's shape: (25570, 18)


In [70]:
# Drop the 'fuelTypeDescription' column since we are only looking at the high level of energy source
epo_cleaned_df =  epo_cleaned_df.drop(columns = 'fuelTypeDescription')

In [71]:
# Use groupby() and sum() to merge matching rows based on 'period' and 'energySource',
# then round to 2 decimal places.
group_by = ['period', 'energySource']

epo_cleaned_df = epo_cleaned_df.groupby(group_by).agg('sum').round(2).reset_index()
display(epo_cleaned_df.head())
print('epo\'s shape:', epo_cleaned_df.shape)

Unnamed: 0,period,energySource,ash-content,consumption-for-eg,consumption-for-eg-btu,consumption-uto,consumption-uto-btu,cost,cost-per-btu,generation,heat-content,receipts,receipts-btu,stocks,sulfur-content,total-consumption,total-consumption-btu
0,2008-01,fossil fuels,308.89,3734358.85,27878.91,594559.14,1025.72,4477.05,692.55,2854188.91,974.95,4011940.29,26408984.47,2367212.11,81.48,4328918.0,28904.63
1,2008-01,others,0.0,2666.2,2263.2,12391.55,7.95,0.0,0.0,214662.9,0.0,0.0,0.0,0.0,0.0,15057.75,2271.15
2,2008-01,renewables,254.19,51894.93,1381.14,4114.64,453.03,591.66,101.83,126059.98,101.13,4074.26,44771.59,7392.34,8.36,56009.57,1834.17
3,2008-02,fossil fuels,169.75,1476772.66,10823.13,321531.26,499.01,2371.16,374.65,1092655.81,617.48,1660658.05,10715695.93,1002393.72,64.86,1798303.92,11322.14
4,2008-02,others,0.0,913.06,13.82,8851.58,3.47,0.0,0.0,1004.15,0.0,0.0,0.0,0.0,0.0,9764.64,17.29


epo's shape: (539, 17)
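
To make the effect of the grouping step concrete, here is a toy example with made-up numbers (the `toy` DataFrame is hypothetical): two fossil fuel rows for the same period collapse into one summed row, while the renewables row is carried through unchanged.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'period': ['2008-01', '2008-01', '2008-01'],
    'energySource': ['fossil fuels', 'fossil fuels', 'renewables'],
    'generation': [10.123, 5.0, 2.5],
})

# Same chain as in the cell above: group, sum, round to 2 decimals, flatten the index
summed = toy.groupby(['period', 'energySource']).agg('sum').round(2).reset_index()
print(summed)
```

The row count drops from one row per fuel type to one row per period and energy source.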


In [72]:
# Append the first day of the month to the period so it can be converted into date format
epo_cleaned_df['period'] = epo_cleaned_df['period'] + '-01'
epo_cleaned_df.head()

Unnamed: 0,period,energySource,ash-content,consumption-for-eg,consumption-for-eg-btu,consumption-uto,consumption-uto-btu,cost,cost-per-btu,generation,heat-content,receipts,receipts-btu,stocks,sulfur-content,total-consumption,total-consumption-btu
0,2008-01-01,fossil fuels,308.89,3734358.85,27878.91,594559.14,1025.72,4477.05,692.55,2854188.91,974.95,4011940.29,26408984.47,2367212.11,81.48,4328918.0,28904.63
1,2008-01-01,others,0.0,2666.2,2263.2,12391.55,7.95,0.0,0.0,214662.9,0.0,0.0,0.0,0.0,0.0,15057.75,2271.15
2,2008-01-01,renewables,254.19,51894.93,1381.14,4114.64,453.03,591.66,101.83,126059.98,101.13,4074.26,44771.59,7392.34,8.36,56009.57,1834.17
3,2008-02-01,fossil fuels,169.75,1476772.66,10823.13,321531.26,499.01,2371.16,374.65,1092655.81,617.48,1660658.05,10715695.93,1002393.72,64.86,1798303.92,11322.14
4,2008-02-01,others,0.0,913.06,13.82,8851.58,3.47,0.0,0.0,1004.15,0.0,0.0,0.0,0.0,0.0,9764.64,17.29


In [73]:
# Convert the period column to DateTime format
epo_cleaned_df['period'] = pd.to_datetime(epo_cleaned_df['period'], format='%Y-%m-%d')
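
As a side note on the format string, a monthly period such as `2008-01` can also be parsed directly with `'%Y-%m'`, in which case the day defaults to the first of the month. The sketch below uses only the standard library to show that behaviour; `pd.to_datetime` accepts the same `format` argument, so appending `-01` first is one option rather than a requirement.

```python
from datetime import datetime

# Monthly periods in the same shape the API returns them
periods = ['2008-01', '2008-02']

# With no day in the format, strptime fills in day 1
parsed = [datetime.strptime(period, '%Y-%m') for period in periods]
print(parsed[0].date(), parsed[1].date())
```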

In [74]:
# Check the column types
epo_cleaned_df.dtypes

period                    datetime64[ns]
energySource                      object
ash-content                      float64
consumption-for-eg               float64
consumption-for-eg-btu           float64
consumption-uto                  float64
consumption-uto-btu              float64
cost                             float64
cost-per-btu                     float64
generation                       float64
heat-content                     float64
receipts                         float64
receipts-btu                     float64
stocks                           float64
sulfur-content                   float64
total-consumption                float64
total-consumption-btu            float64
dtype: object

In [75]:
epo_cleaned_df.columns

Index(['period', 'energySource', 'ash-content', 'consumption-for-eg',
       'consumption-for-eg-btu', 'consumption-uto', 'consumption-uto-btu',
       'cost', 'cost-per-btu', 'generation', 'heat-content', 'receipts',
       'receipts-btu', 'stocks', 'sulfur-content', 'total-consumption',
       'total-consumption-btu'],
      dtype='object')

In [76]:
# Export the cleaned DataFrame for epo_cleaned_df into csv
epo_cleaned_df.to_csv('../static/data/epo_2012_2022_cleaned.csv', index = False)

**emission_df**

In [86]:
# Drop columns 'stateid', 'fuelid', 'co2-rate-lbs-mwh', 'nox-rate-lbs-mwh', 'so2-rate-lbs-mwh',
# 'co2-rate-lbs-mwh-units', 'nox-rate-lbs-mwh-units', 'so2-rate-lbs-mwh-units' from emission_cleaned_df
emission_cleaned_df = emission_cleaned_df.drop([
    'stateid', 'fuelid', 'co2-rate-lbs-mwh', 'nox-rate-lbs-mwh', 'so2-rate-lbs-mwh',
    'co2-rate-lbs-mwh-units', 'nox-rate-lbs-mwh-units', 'so2-rate-lbs-mwh-units'
], axis = 1)
print(emission_cleaned_df.columns)

Index(['period', 'stateDescription', 'fuelDescription',
       'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons',
       'co2-thousand-metric-tons-units', 'nox-short-tons-units',
       'so2-short-tons-units'],
      dtype='object')


In [87]:
# Print out the value_counts() in fuelDescription for emission_cleaned_df
print('emission', emission_cleaned_df['fuelDescription'].value_counts())

emission fuelDescription
Total          780
Petroleum      775
Natural Gas    762
Other          761
Coal           725
Name: count, dtype: int64


In [88]:
# Keep only rows containing 'United States' since we are working at the national level, then drop the column
emission_cleaned_df = emission_cleaned_df[emission_cleaned_df['stateDescription'].isin(['United States'])] \
                            .drop(columns = 'stateDescription')

In [89]:
# Create bins based on fuel type description into 'Fossil Fuels', 'Renewables', and 'Others' for emission_cleaned_df under
# energySource
emission_ff_source = ['Petroleum', 'Natural Gas', 'Coal']
emission_renew_source = ['Total']
emission_oth_source = ['Other']

emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_ff_source, 'fossil fuels', 'energySource')
emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_renew_source, 'renewables', 'energySource')
emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_oth_source, 'others', 'energySource')

# Drop 'fuelDescription' as it is no longer needed
emission_cleaned_df = emission_cleaned_df.drop(['fuelDescription'], axis = 1)



In [90]:
# Check the value_counts() again to make sure binning was done correctly
emission_cleaned_df['energySource'].value_counts()

energySource
fossil fuels    45
others          15
renewables      15
Name: count, dtype: int64

In [91]:
# Fill NaN values with 0 then set these columns as float
# 'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons'
emission_cols = ['co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons']

# Pass one column at a time rather than the whole list on every iteration
for col in emission_cols:
    emission_cleaned_df = fix_nan(emission_cleaned_df, col)

In [92]:
# Print out the columns
emission_cleaned_df.columns

Index(['period', 'co2-thousand-metric-tons', 'nox-short-tons',
       'so2-short-tons', 'co2-thousand-metric-tons-units',
       'nox-short-tons-units', 'so2-short-tons-units', 'energySource'],
      dtype='object')

In [93]:
emission_cleaned_df.head()

Unnamed: 0,period,co2-thousand-metric-tons,nox-short-tons,so2-short-tons,co2-thousand-metric-tons-units,nox-short-tons-units,so2-short-tons-units,energySource
162,2008,2001806.0,2954350.0,8103512.0,thousand metric tons,short tons,short tons,fossil fuels
163,2008,419599.0,386890.0,3115.0,thousand metric tons,short tons,short tons,fossil fuels
164,2008,14752.0,247542.0,248311.0,thousand metric tons,short tons,short tons,others
165,2008,47855.0,82181.0,275880.0,thousand metric tons,short tons,short tons,fossil fuels
166,2008,2484012.0,3670963.0,8630818.0,thousand metric tons,short tons,short tons,renewables


In [94]:
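# Append '-01-01' to the period before converting to a proper date type, using just the first month and day since this dataset
# is annual. Cast to str first because the period is read back as an integer when loaded from the csv file
emission_cleaned_df['period'] = emission_cleaned_df['period'].astype(str) + '-01-01'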

# Now do the conversion
emission_cleaned_df['period'] = pd.to_datetime(emission_cleaned_df['period'], format='%Y-%m-%d')

In [95]:
# Use groupby() and sum() to merge matching rows based on 'period' and 'energySource', then round to 2 decimal places.
# The UOM columns are included in the grouping so they are kept in the result.
emission_group_by = [
    'period', 'co2-thousand-metric-tons-units', 
    'nox-short-tons-units', 'so2-short-tons-units', 'energySource'
]

emission_cleaned_df = emission_cleaned_df.groupby(emission_group_by).agg('sum').round(2).reset_index()
display(emission_cleaned_df.head())
print('emission\'s shape:', emission_cleaned_df.shape)

Unnamed: 0,period,co2-thousand-metric-tons-units,nox-short-tons-units,so2-short-tons-units,energySource,co2-thousand-metric-tons,nox-short-tons,so2-short-tons
0,2008-01-01,thousand metric tons,short tons,short tons,fossil fuels,2469260.0,3423421.0,8382507.0
1,2008-01-01,thousand metric tons,short tons,short tons,others,14752.0,247542.0,248311.0
2,2008-01-01,thousand metric tons,short tons,short tons,renewables,2484012.0,3670963.0,8630818.0
3,2009-01-01,thousand metric tons,short tons,short tons,fossil fuels,2254958.0,2392188.0,6334871.0
4,2009-01-01,thousand metric tons,short tons,short tons,others,14549.0,248285.0,246242.0


emission's shape: (45, 8)


In [None]:
# Check the column types
emission_cleaned_df.dtypes

In [96]:
# Reorder the columns so energySource follows period
emission_cleaned_df = emission_cleaned_df[[
    'period', 'energySource',
    'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons',
    'co2-thousand-metric-tons-units', 'nox-short-tons-units',
    'so2-short-tons-units'
]]

# Export the cleaned DataFrame for emission_cleaned_df into csv
emission_cleaned_df.to_csv('../static/data/emission_2012_2022_cleaned.csv', index = False)

### Database Storing
---

In [97]:
# Import dependencies for handling the database
from os import path, remove
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

In [98]:
# Setup the db path
db_path = '../static/data/eia_electric.sqlite'

# Delete the existing database if it exists
if path.exists(db_path):
    remove(db_path)

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '../static/data/eia_electric.sqlite'
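
The `PermissionError` above is the kind of error Windows raises when another handle still holds the file. A connection left open by an earlier run of this notebook is one plausible cause, though that is an assumption. The sketch below shows the close-then-delete order using the standard library `sqlite3` module and a temporary file, both of which are stand-ins for the SQLAlchemy engine and `db_path` used here.

```python
import os
import sqlite3
import tempfile

# Hypothetical temporary database standing in for eia_electric.sqlite
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.sqlite')

connection = sqlite3.connect(demo_path)
connection.execute('CREATE TABLE demo (value INTEGER)')
connection.commit()

# Release the file handle before deleting the file
connection.close()

if os.path.exists(demo_path):
    os.remove(demo_path)

removed = not os.path.exists(demo_path)
os.rmdir(tmp_dir)
print(removed)
```

With SQLAlchemy, the corresponding cleanup would be closing any open connection and session and calling `engine.dispose()` before the file is removed.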

In [99]:
# Setup the engine and connect the database
engine = create_engine(f'sqlite:///{db_path}')
conn = engine.connect()

In [100]:
# Create session for querying later to verify tables have been created correctly
session = Session(bind = engine)

In [101]:
# Write the epo_cleaned_df to the database created, replacing the table if it already exists
epo_cleaned_df.to_sql(name = 'epo', con = engine, if_exists = 'replace', index = False)

539

In [102]:
session.execute(text('SELECT * from epo')).fetchone()

('2008-01-01 00:00:00.000000', 'fossil fuels', 308.89, 3734358.85, 27878.91, 594559.14, 1025.72, 4477.05, 692.55, 2854188.91, 974.95, 4011940.29, 26408984.47, 2367212.11, 81.48, 4328918.0, 28904.63)

In [103]:
# Write the emission_cleaned_df to the database created, replacing the table if it already exists
emission_cleaned_df.to_sql(name = 'emission', con = engine, if_exists = 'replace', index = False)

45

In [104]:
session.execute(text('SELECT * from emission')).fetchone()

('2008-01-01 00:00:00.000000', 'fossil fuels', 2469260.0, 3423421.0, 8382507.0, 'thousand metric tons', 'short tons', 'short tons')

In [105]:
# Close out of the connection, session, and engine
conn.close()
session.close()
engine.dispose()
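
Beyond fetching a single row, a row count per table is a quick way to confirm a load matches the number returned by `to_sql()`. The sketch below uses the standard library `sqlite3` module with an in-memory database and two made-up rows, so the table contents and the `count_rows` helper are assumptions for illustration only.

```python
import sqlite3

# In-memory stand-in for the real database file
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE emission (period TEXT, energySource TEXT)')
connection.executemany(
    'INSERT INTO emission VALUES (?, ?)',
    [('2008-01-01', 'fossil fuels'), ('2008-01-01', 'others')]
)

def count_rows(conn, table):
    """Return the number of rows in the given table."""
    # The table name is interpolated directly, so only pass trusted names
    return conn.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]

emission_count = count_rows(connection, 'emission')
connection.close()
print(emission_count)
```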