# Extract, Transform, Load (ETL)
---
* Source `https://www.eia.gov/opendata/browser/`
* Main Route `Electricity`
    * Sub Route 1 - `Electric Power Operations (Annual And Monthly)`    
    * Sub Route 2 - `State Specific Data`
        * Sub Sub Route - `Emissions From Energy Consumption At Conventional Power Plants and Combined-Heat-And-Power Plants`
* Year range `2012` to `2022`
---
An API key can be obtained by signing up at `https://www.eia.gov/opendata/` and then assigned to the variable `api_key` in the `config.py` file. The API URL path can be copied from the data browser after choosing the main route and its sub-routes, then used here.

More information regarding EIA's API documentation can be found at `https://www.eia.gov/opendata/documentation.php`.

**Goal** - Combine the 2 Electricity sub-routes, covering 2012 to 2022, into a data report that supports the Machine Learning (ML) process in determining the most sustainable fuel category for electricity generation by:<br/>
* Categorizing fuel types into the following bins:
    * `Fossil fuels` - anthracite coal, bituminous coal, bituminous coal and synthetic coal, 'coal, excluding waste coal', distillate fuel oil, fossil fuels, lignite coal, natural gas, natural gas & other gases, other gases, petroleum, petroleum coke, petroleum liquids, refined coal, residual fuel oil, subbituminous coal
        * From Emission data - coal, natural gas, petroleum
    * `Renewables` - biogenic municipal solid waste, biomass, conventional hydroelectric, estimated small scale solar photovoltaic, estimated total solar, estimated total solar photovoltaic, geothermal, hydro-electric pumped storage, landfill gas, municipal landfill gas, offshore wind turbine, onshore wind turbine, renewable, renewable waste products, solar, solar photovoltaic, solar thermal, waste coal, waste oil and other oils, wind, wood and wood wastes, other renewables
        * From Emission data - total
    * `Others` - nuclear, other (sources not specified by EIA)

* Categorizing `stateDescription` into regional bins and renaming the column to `region`:
    * `West Region` - 'Alaska', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana', 'Nevada', 'Oregon', 'Utah', 'Washington', 'Wyoming'
    * `Southwest Region` - 'Arizona', 'New Mexico', 'Oklahoma', 'Texas'
    * `Midwest Region` - 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin'
    * `Southeast Region` - 'Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee', 'Virginia', 'West Virginia'
    * `Northeast Region` - All other states not included in the above regions
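
The regional binning above can also be read as a lookup table with a default. The sketch below is only an illustration of that idea in plain Python; `REGIONS`, `STATE_TO_REGION`, and `region_of` are hypothetical names that do not appear in the notebook, which uses the `category_bin` helper instead.

```python
# Region lists copied from the bullets above; Northeast is the catch-all bin
REGIONS = {
    'West Region': ['Alaska', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana',
                    'Nevada', 'Oregon', 'Utah', 'Washington', 'Wyoming'],
    'Southwest Region': ['Arizona', 'New Mexico', 'Oklahoma', 'Texas'],
    'Midwest Region': ['Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota',
                       'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota',
                       'Wisconsin'],
    'Southeast Region': ['Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky',
                         'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina',
                         'Tennessee', 'Virginia', 'West Virginia'],
}

# Invert the lists into a state -> region dictionary
STATE_TO_REGION = {state: region for region, states in REGIONS.items() for state in states}

def region_of(state, default='Northeast Region'):
    """Look up the region for a state name, falling back to the catch-all bin."""
    return STATE_TO_REGION.get(state, default)

print(region_of('Texas'))     # Southwest Region
print(region_of('New York'))  # Northeast Region
```

Because of the fallback, any label that is not listed, including a non-state aggregate, would be assigned to the Northeast bin, so aggregate rows need to be filtered out before binning.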

*Note:* Puerto Rico is excluded from the data; the focus is on the 50 states.

Binning information is based on sources from `https://www.eia.gov/tools/faqs/faq.php?id=427&t=3`, `https://www.eia.gov/electricity/data/browser/`, and `https://www.eia.gov/dnav/pet/TblDefs/pet_cons_821dst_tbldef2.asp`.

In [1]:
# Import dependencies
from config import api_key
import json
import requests
import pandas as pd

### Functions

In [2]:
def request_to_df(url, api_key, years = None):
    '''Request data from the target API once per year in `years`, collect the rows in a list, and return them as one DataFrame'''
    data = []
    
    # Avoid a mutable default argument; treat a missing list as "no years"
    for year in (years or []):
        api_path = url.replace('||KEY||', api_key).replace('||START||', year).replace('||END||', year)
        
        # Send the request
        response = requests.get(api_path).json()
        
        # Raise an error if the response carries a warning or error; otherwise collect its rows
        if ('warning' not in response) and ('error' not in response):
            data += response['response']['data']
        else:
            raise Exception('Bad request submitted or no response received from the source API, verify that the url and/or offset provided is correct')
    
    df = pd.DataFrame(data)
    return df

def category_bin(df, check_col, list_to_bin, bin_name, new_col = ''):
    '''
        Bin a DataFrame column: every value of check_col found in list_to_bin is replaced with bin_name.
        If new_col is provided, check_col is left unchanged and the bin name is written to new_col instead.
    '''
    tmp_df = df.copy()
    
    for item in list_to_bin:
        if new_col == '' or new_col.isspace():
            tmp_df[check_col] = tmp_df[check_col].replace(item, bin_name)
        else:
            tmp_df.loc[tmp_df[check_col] == item, new_col] = bin_name
    
    return tmp_df

def fix_nan(df, col, fill_value = 0, to_type = 'float'):
    '''Fill NaN values in the selected column(s) with fill_value, then cast to to_type'''
    tmp_df = df.copy()
    tmp_df[col] = tmp_df[col].fillna(fill_value)
    tmp_df[col] = tmp_df[col].astype(to_type)
    
    return tmp_df
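
`request_to_df` sends one request per year and raises if the response contains a `warning` key. If a single year ever held more rows than one response returns, one option would be to page through the results by offset. The sketch below shows only the paging loop; `fetch_all` and `fetch_page` are hypothetical names, and the request itself is replaced by a stand-in so the example runs without network access. How the real API signals a truncated response is not covered here.

```python
def fetch_all(fetch_page, page_size=5000):
    """Collect rows page by page until a short (or empty) page is returned.

    fetch_page is a hypothetical callable: fetch_page(offset, length) -> list of rows.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        # A page shorter than page_size means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return rows

# Stand-in for a real request: serves 12 fake rows in slices
fake_rows = [{'id': i} for i in range(12)]

def fake_fetch(offset, length):
    return fake_rows[offset:offset + length]

all_rows = fetch_all(fake_fetch, page_size=5)
print(len(all_rows))  # 12
```

In a real call, `fetch_page` would substitute the offset into the URL in the same way the year placeholders are substituted.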

### Extract

In [3]:
# Set years for the API to go through
years = ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022']

# Set up the paths for retrieving the data into DataFrames with ||START|| and ||END|| for start and end parameter
epo_url = 'https://api.eia.gov/v2/electricity/electric-power-operational-data/data/?api_key=||KEY||&frequency=annual&data[0]=ash-content&data[1]=consumption-for-eg&data[2]=consumption-for-eg-btu&data[3]=consumption-uto&data[4]=consumption-uto-btu&data[5]=cost&data[6]=cost-per-btu&data[7]=generation&data[8]=heat-content&data[9]=receipts&data[10]=receipts-btu&data[11]=stocks&data[12]=sulfur-content&data[13]=total-consumption&data[14]=total-consumption-btu&start=||START||&end=||END||&sort[0][column]=period&sort[0][direction]=desc&offset=0&length=5000'
emission_url = 'https://api.eia.gov/v2/electricity/state-electricity-profiles/emissions-by-state-by-fuel/data/?api_key=||KEY||&frequency=annual&data[0]=co2-rate-lbs-mwh&data[1]=co2-thousand-metric-tons&data[2]=nox-rate-lbs-mwh&data[3]=nox-short-tons&data[4]=so2-rate-lbs-mwh&data[5]=so2-short-tons&start=||START||&end=||END||&sort[0][column]=period&sort[0][direction]=desc&offset=0&length=5000'
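
The `||KEY||`, `||START||`, and `||END||` markers in these paths are plain-text placeholders that `request_to_df` fills with `str.replace`. As a small illustration, the sketch below fills a shortened, hypothetical template (not a real EIA route) and parses the query string back to confirm the substitution; `fill_url` is an illustrative name, not a function from the notebook.

```python
from urllib.parse import urlparse, parse_qs

def fill_url(template, api_key, start, end):
    """Substitute the ||KEY||, ||START|| and ||END|| placeholders in a URL template."""
    return (
        template
        .replace('||KEY||', api_key)
        .replace('||START||', start)
        .replace('||END||', end)
    )

# Shortened, hypothetical template in the same style as epo_url
template = 'https://example.com/data/?api_key=||KEY||&frequency=annual&start=||START||&end=||END||'
url = fill_url(template, 'DEMO_KEY', '2012', '2012')

# Parse the query string back to confirm every placeholder was replaced
params = parse_qs(urlparse(url).query)
print(params['start'], params['end'])  # ['2012'] ['2012']
```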

In [4]:
# Get the request and create the DataFrames
epo_raw_df = request_to_df(epo_url, api_key, years)
emission_raw_df = request_to_df(emission_url, api_key, years)

In [5]:
# Export out raw data to csv files
epo_raw_df.to_csv('../static/data/epo_2012_2022_raw.csv', index = False)
emission_raw_df.to_csv('../static/data/emission_2012_2022_raw.csv', index = False)

In [6]:
# Print out the keys for epo
epo_raw_df.keys()

Index(['period', 'location', 'stateDescription', 'sectorid',
       'sectorDescription', 'fueltypeid', 'fuelTypeDescription', 'ash-content',
       'ash-content-units', 'consumption-for-eg', 'consumption-for-eg-units',
       'consumption-for-eg-btu', 'consumption-for-eg-btu-units',
       'consumption-uto', 'consumption-uto-units', 'consumption-uto-btu',
       'consumption-uto-btu-units', 'cost', 'cost-units', 'cost-per-btu',
       'cost-per-btu-units', 'generation', 'generation-units', 'heat-content',
       'heat-content-units', 'receipts', 'receipts-units', 'receipts-btu',
       'receipts-btu-units', 'stocks', 'stocks-units', 'sulfur-content',
       'sulfur-content-units', 'total-consumption', 'total-consumption-units',
       'total-consumption-btu', 'total-consumption-btu-units'],
      dtype='object')

In [7]:
# Print out the types
epo_raw_df.dtypes

period                          object
location                        object
stateDescription                object
sectorid                        object
sectorDescription               object
fueltypeid                      object
fuelTypeDescription             object
ash-content                     object
ash-content-units               object
consumption-for-eg              object
consumption-for-eg-units        object
consumption-for-eg-btu          object
consumption-for-eg-btu-units    object
consumption-uto                 object
consumption-uto-units           object
consumption-uto-btu             object
consumption-uto-btu-units       object
cost                            object
cost-units                      object
cost-per-btu                    object
cost-per-btu-units              object
generation                      object
generation-units                object
heat-content                    object
heat-content-units              object
receipts                 

In [8]:
# Print out the keys for emission
emission_raw_df.keys()

Index(['period', 'stateid', 'stateDescription', 'fuelid', 'fuelDescription',
       'co2-rate-lbs-mwh', 'co2-thousand-metric-tons', 'nox-rate-lbs-mwh',
       'nox-short-tons', 'so2-rate-lbs-mwh', 'so2-short-tons',
       'co2-rate-lbs-mwh-units', 'co2-thousand-metric-tons-units',
       'nox-rate-lbs-mwh-units', 'nox-short-tons-units',
       'so2-rate-lbs-mwh-units', 'so2-short-tons-units'],
      dtype='object')

In [9]:
# Print out the types
emission_raw_df.dtypes

period                            object
stateid                           object
stateDescription                  object
fuelid                            object
fuelDescription                   object
co2-rate-lbs-mwh                  object
co2-thousand-metric-tons          object
nox-rate-lbs-mwh                  object
nox-short-tons                    object
so2-rate-lbs-mwh                  object
so2-short-tons                    object
co2-rate-lbs-mwh-units            object
co2-thousand-metric-tons-units    object
nox-rate-lbs-mwh-units            object
nox-short-tons-units              object
so2-rate-lbs-mwh-units            object
so2-short-tons-units              object
dtype: object

### Transform

In [10]:
# Create copies of the DataFrames
epo_cleaned_df = epo_raw_df.copy()
emission_cleaned_df = emission_raw_df.copy()

---
**epo_df**

In [11]:
# Drop 'location', 'sectorid', 'fueltypeid', 'sectorDescription' columns from epo_cleaned_df
epo_cleaned_df = epo_cleaned_df.drop(['location', 'sectorid', 'fueltypeid', 'sectorDescription'], axis = 1)
print(epo_cleaned_df.columns)

Index(['period', 'stateDescription', 'fuelTypeDescription', 'ash-content',
       'ash-content-units', 'consumption-for-eg', 'consumption-for-eg-units',
       'consumption-for-eg-btu', 'consumption-for-eg-btu-units',
       'consumption-uto', 'consumption-uto-units', 'consumption-uto-btu',
       'consumption-uto-btu-units', 'cost', 'cost-units', 'cost-per-btu',
       'cost-per-btu-units', 'generation', 'generation-units', 'heat-content',
       'heat-content-units', 'receipts', 'receipts-units', 'receipts-btu',
       'receipts-btu-units', 'stocks', 'stocks-units', 'sulfur-content',
       'sulfur-content-units', 'total-consumption', 'total-consumption-units',
       'total-consumption-btu', 'total-consumption-btu-units'],
      dtype='object')


In [12]:
# Print out the value_counts() in fuelTypeDescription for epo_cleaned_df
print('epo', epo_cleaned_df['fuelTypeDescription'].value_counts())

epo fuelTypeDescription
biomass                                     4520
all fuels                                   2371
fossil fuels                                2368
natural gas & other gases                   2224
renewable                                   2221
petroleum                                   2185
natural gas                                 2159
petroleum liquids                           2122
all renewables                              2107
distillate fuel oil                         2030
renewable waste products                    1735
all coal products                           1554
coal, excluding waste coal                  1484
other                                       1474
conventional hydroelectric                  1399
solar                                       1388
solar photovoltaic                          1343
landfill gas                                1297
other renewables                            1273
bituminous coal and synthetic coal          1

In [13]:
# Remove all rows where fuelTypeDescription for epo_cleaned_df is 'all coal products', 'all fuels', or 'all renewables'
epo_cleaned_df = epo_cleaned_df[
    ~epo_cleaned_df['fuelTypeDescription'].isin([
        'all coal products', 'all fuels', 'all renewables'
    ])
]

# Remove all rows where stateDescription is one of 'U.S. Total', 'West North Central', 'West South Central', 'Pacific Noncontiguous',
# 'South Atlantic', 'Pacific', 'Pacific Contiguous', 'Puerto Rico'
epo_cleaned_df = epo_cleaned_df[
    ~epo_cleaned_df['stateDescription'].isin([
        'U.S. Total', 'West North Central', 'West South Central', 'Pacific Noncontiguous',
        'South Atlantic', 'Pacific', 'Pacific Contiguous', 'Puerto Rico'
    ])
]

In [14]:
# Bin stateDescription into regions instead of individual states for labeling
west_reg = ['Alaska', 'California', 'Colorado', 'Hawaii', 'Idaho', 'Montana', 'Nevada', 'Oregon', 'Utah', 'Washington', 'Wyoming']
southwest_reg = ['Arizona', 'New Mexico', 'Oklahoma', 'Texas']
midwest_reg = ['Illinois', 'Indiana', 'Iowa', 'Kansas', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'Ohio', 'South Dakota', 'Wisconsin']
southeast_reg = ['Alabama', 'Arkansas', 'Florida', 'Georgia', 'Kentucky', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee', 'Virginia', 'West Virginia']
# Every remaining stateDescription value falls into the Northeast bin
northeast_reg = [
    state for state in epo_cleaned_df['stateDescription'].value_counts().index
    if (state not in west_reg)
    and (state not in southwest_reg)
    and (state not in midwest_reg)
    and (state not in southeast_reg)
]

epo_cleaned_df = category_bin(epo_cleaned_df, 'stateDescription', west_reg, 'West Region')
epo_cleaned_df = category_bin(epo_cleaned_df, 'stateDescription', southwest_reg, 'Southwest Region')
epo_cleaned_df = category_bin(epo_cleaned_df, 'stateDescription', midwest_reg, 'Midwest Region')
epo_cleaned_df = category_bin(epo_cleaned_df, 'stateDescription', southeast_reg, 'Southeast Region')
epo_cleaned_df = category_bin(epo_cleaned_df, 'stateDescription', northeast_reg, 'Northeast Region')

In [15]:
# Check the binning
epo_cleaned_df['stateDescription'].value_counts()

stateDescription
Northeast Region    13345
Midwest Region       8308
Southeast Region     8228
West Region          7483
Southwest Region     3153
Name: count, dtype: int64

In [16]:
# Create bins based on fuel type description into 'Fossil Fuels', 'Renewables', and 'Others' for epo_cleaned_df under
# energySource
epo_ff_source = [
    'anthracite coal', 'bituminous coal', 'bituminous coal and synthetic coal', 'coal, excluding waste coal', 
    'distillate fuel oil', 'fossil fuels', 'lignite coal', 'natural gas', 'natural gas & other gases', 'other gases', 
    'petroleum', 'petroleum coke', 'petroleum liquids', 'refined coal', 'residual fuel oil', 'subbituminous coal'
]

epo_renew_source = [
    'biogenic municipal solid waste', 'biomass', 'conventional hydroelectric', 'estimated small scale solar photovoltaic', 
    'estimated total solar', 'estimated total solar photovoltaic', 'geothermal', 'hydro-electric pumped storage', 
    'landfill gas', 'municipal landfill gas', 'offshore wind turbine', 'onshore wind turbine', 'renewable', 
    'renewable waste products', 'solar', 'solar photovoltaic', 'solar thermal', 'waste coal', 'waste oil and other oils', 
    'wind', 'wood and wood wastes', 'other renewables'
]

epo_oth_source = [item for item in epo_cleaned_df['fuelTypeDescription'].value_counts().index if (item not in epo_ff_source and item not in epo_renew_source)]

epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_ff_source, 'fossil fuels', 'energySource')
epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_renew_source, 'renewables', 'energySource')
epo_cleaned_df = category_bin(epo_cleaned_df, 'fuelTypeDescription', epo_oth_source, 'others', 'energySource')



In [17]:
# Check the value_counts() again to make sure binning was done correctly
epo_cleaned_df['energySource'].value_counts()

energySource
renewables      21155
fossil fuels    17652
others           1710
Name: count, dtype: int64
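
Since the `others` bin is built as a catch-all, a label that is misspelled in one of the bin lists would silently end up there. One way to surface this is to check which list entries never occur in the data. The sketch below uses made-up labels and a deliberately misspelled entry; `unmatched_entries` is a hypothetical helper, not part of the notebook.

```python
def unmatched_entries(bin_list, observed_labels):
    """Return entries of bin_list that never occur in observed_labels, preserving order."""
    observed = set(observed_labels)
    return [entry for entry in bin_list if entry not in observed]

# Hypothetical labels as they might appear in the data
observed = ['biomass', 'landfill gas', 'municipal landfill gas', 'solar']

# Hypothetical bin list containing one deliberately misspelled entry
renewables = ['biomass', 'landfill gas', 'municiapl landfill gas', 'solar']

print(unmatched_entries(renewables, observed))  # ['municiapl landfill gas']
```

An entry reported here is not necessarily a typo, since a valid label may simply be absent from the selected years, but it is worth a second look.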

In [18]:
# Fill in NaN columns as 0 then set these columns as float
# 'ash-content', 'consumption-for-eg-btu', 'consumption-uto-btu', 'cost-per-btu', 'generation', 'heat-content',
# 'receipts-btu', 'sulfur-content', 'total-consumption-btu'
#
# Other units (in case needed more features): 'consumption-for-eg', 'consumption-uto', 'cost', 'receipts', 'total-consumption',
# 'stocks'
epo_cols = [
    'ash-content', 'consumption-for-eg-btu', 'consumption-uto-btu', 'cost-per-btu', 'generation', 'heat-content',
    'receipts-btu', 'sulfur-content', 'total-consumption-btu', 'consumption-for-eg', 'consumption-uto', 'cost', 
    'receipts', 'total-consumption', 'stocks'
]

for col in epo_cols:
    epo_cleaned_df = fix_nan(epo_cleaned_df, col)

In [19]:
# Review the cleaned DF before additional cleaning
display(epo_cleaned_df.head())
print('epo\'s shape:', epo_cleaned_df.shape)

Unnamed: 0,period,stateDescription,fuelTypeDescription,ash-content,ash-content-units,consumption-for-eg,consumption-for-eg-units,consumption-for-eg-btu,consumption-for-eg-btu-units,consumption-uto,...,receipts-btu-units,stocks,stocks-units,sulfur-content,sulfur-content-units,total-consumption,total-consumption-units,total-consumption-btu,total-consumption-btu-units,energySource
0,2012,West Region,bituminous coal,8.25,percent,739.31,thousand short tons,15.13525,million MMBtu,14.156,...,billion Btu,0.0,thousand short tons,0.67,percent,753.466,thousand short tons,15.43204,million MMBtu,fossil fuels
4,2012,Southeast Region,distillate fuel oil,0.0,percent,329.909,thousand short tons,1.91416,million MMBtu,0.0,...,billion Btu,0.0,thousand short tons,0.0,percent,329.909,thousand short tons,1.91416,million MMBtu,fossil fuels
5,2012,Southeast Region,biomass,0.0,percent,1225.891,thousand physical units,0.63069,million MMBtu,0.0,...,billion Btu,0.0,thousand physical units,0.0,percent,1225.891,thousand physical units,0.63069,million MMBtu,renewables
6,2012,Southeast Region,biomass,0.0,percent,1225.891,thousand physical units,0.63069,million MMBtu,0.0,...,billion Btu,0.0,thousand physical units,0.0,percent,1225.891,thousand physical units,0.63069,million MMBtu,renewables
7,2012,Northeast Region,bituminous coal,6.87,percent,10.463,thousand short tons,0.27404,million MMBtu,67.694,...,billion Btu,0.0,thousand short tons,0.66,percent,78.157,thousand short tons,2.0778,million MMBtu,fossil fuels


epo's shape: (40517, 34)


In [20]:
# Use groupby() and sum() to merge matching rows based on 'period', 'stateDescription', 'energySource', and 
# 'fuelTypeDescription', then round to 2 decimal places. The unit-of-measure (UOM) columns are included in the grouping keys.
group_by = [
    'period', 'stateDescription', 'energySource', 'fuelTypeDescription', 'ash-content-units', 'consumption-for-eg-units',
    'consumption-for-eg-btu-units', 'consumption-uto-units', 'consumption-uto-btu-units', 'cost-units', 'cost-per-btu-units',
    'generation-units', 'heat-content-units', 'receipts-units', 'receipts-btu-units', 'stocks-units', 'sulfur-content-units',
    'total-consumption-units', 'total-consumption-btu-units'
]

epo_cleaned_df = epo_cleaned_df.groupby(group_by).agg('sum').round(2).reset_index()
display(epo_cleaned_df.head())
print('epo\'s shape:', epo_cleaned_df.shape)

Unnamed: 0,period,stateDescription,energySource,fuelTypeDescription,ash-content-units,consumption-for-eg-units,consumption-for-eg-btu-units,consumption-uto-units,consumption-uto-btu-units,cost-units,...,cost,cost-per-btu,generation,heat-content,receipts,receipts-btu,stocks,sulfur-content,total-consumption,total-consumption-btu
0,2012,Midwest Region,fossil fuels,bituminous coal,percent,thousand short tons,million MMBtu,thousand short tons,million MMBtu,dollars per short tons,...,0.0,0.0,274192.77,567.15,120079.43,2884577.58,0.0,44.36,124368.3,2922.8
1,2012,Midwest Region,fossil fuels,bituminous coal and synthetic coal,percent,thousand short tons,million MMBtu,thousand short tons,million MMBtu,dollars per short tons,...,0.0,0.0,152826.97,586.86,71764.82,1659520.5,0.0,47.63,76672.09,1716.88
2,2012,Midwest Region,fossil fuels,"coal, excluding waste coal",percent,thousand short tons,million MMBtu,thousand short tons,million MMBtu,dollars per short tons,...,0.0,0.0,836185.96,635.46,513119.66,9455272.29,0.0,22.5,494398.23,9087.89
3,2012,Midwest Region,fossil fuels,distillate fuel oil,percent,thousand short tons,million MMBtu,thousand short tons,million MMBtu,dollars per short tons,...,0.0,0.0,538.33,162.77,925.72,5372.27,0.0,0.0,1164.96,6.77
4,2012,Midwest Region,fossil fuels,fossil fuels,percent,thousand physical units,million MMBtu,thousand physical units,million MMBtu,dollars per physical units,...,0.0,0.0,1035747.09,0.0,0.0,10114581.32,0.0,41.91,0.0,10889.38


epo's shape: (1995, 34)
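
To make the aggregation concrete, the sketch below reproduces the same group-then-sum-then-round idea in plain Python on three made-up rows. It is only an illustration of the arithmetic; the notebook itself relies on pandas `groupby`, and the values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (period, region, energySource, generation)
rows = [
    ('2012', 'Midwest Region', 'fossil fuels', 100.5),
    ('2012', 'Midwest Region', 'fossil fuels', 50.25),
    ('2012', 'Midwest Region', 'renewables', 10.0),
]

# Sum generation for every distinct combination of the grouping keys
totals = defaultdict(float)
for period, region, source, generation in rows:
    totals[(period, region, source)] += generation

# Round each total to 2 decimal places, as the notebook does after summing
summary = {key: round(value, 2) for key, value in totals.items()}
print(summary[('2012', 'Midwest Region', 'fossil fuels')])  # 150.75
```

Rows that share all grouping keys collapse into one, which is why the row count drops sharply after this step.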


In [21]:
# Convert the period column to DateTime format
epo_cleaned_df['period'] = pd.to_datetime(epo_cleaned_df['period'], format='%Y')
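
Parsing a year-only string with the `%Y` format leaves the month and day at their defaults, which is why the stored periods later appear as `2012-01-01`. The standard-library sketch below shows the same default behavior with `datetime.strptime`; it is an analogy to the pandas call above, not a replacement for it.

```python
from datetime import datetime

# A year-only string parsed with '%Y' defaults to January 1 at midnight
parsed = datetime.strptime('2012', '%Y')
print(parsed.isoformat())  # 2012-01-01T00:00:00
```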

In [22]:
# Check the column types
epo_cleaned_df.dtypes

period                          datetime64[ns]
stateDescription                        object
energySource                            object
fuelTypeDescription                     object
ash-content-units                       object
consumption-for-eg-units                object
consumption-for-eg-btu-units            object
consumption-uto-units                   object
consumption-uto-btu-units               object
cost-units                              object
cost-per-btu-units                      object
generation-units                        object
heat-content-units                      object
receipts-units                          object
receipts-btu-units                      object
stocks-units                            object
sulfur-content-units                    object
total-consumption-units                 object
total-consumption-btu-units             object
ash-content                            float64
consumption-for-eg                     float64
consumption-f

In [23]:
epo_cleaned_df.columns

Index(['period', 'stateDescription', 'energySource', 'fuelTypeDescription',
       'ash-content-units', 'consumption-for-eg-units',
       'consumption-for-eg-btu-units', 'consumption-uto-units',
       'consumption-uto-btu-units', 'cost-units', 'cost-per-btu-units',
       'generation-units', 'heat-content-units', 'receipts-units',
       'receipts-btu-units', 'stocks-units', 'sulfur-content-units',
       'total-consumption-units', 'total-consumption-btu-units', 'ash-content',
       'consumption-for-eg', 'consumption-for-eg-btu', 'consumption-uto',
       'consumption-uto-btu', 'cost', 'cost-per-btu', 'generation',
       'heat-content', 'receipts', 'receipts-btu', 'stocks', 'sulfur-content',
       'total-consumption', 'total-consumption-btu'],
      dtype='object')

In [24]:
# Rename the stateDescription to region to correctly reflect the column category
epo_cleaned_df.rename(columns = {'stateDescription': 'region'}, inplace = True)
epo_cleaned_df.columns

Index(['period', 'region', 'energySource', 'fuelTypeDescription',
       'ash-content-units', 'consumption-for-eg-units',
       'consumption-for-eg-btu-units', 'consumption-uto-units',
       'consumption-uto-btu-units', 'cost-units', 'cost-per-btu-units',
       'generation-units', 'heat-content-units', 'receipts-units',
       'receipts-btu-units', 'stocks-units', 'sulfur-content-units',
       'total-consumption-units', 'total-consumption-btu-units', 'ash-content',
       'consumption-for-eg', 'consumption-for-eg-btu', 'consumption-uto',
       'consumption-uto-btu', 'cost', 'cost-per-btu', 'generation',
       'heat-content', 'receipts', 'receipts-btu', 'stocks', 'sulfur-content',
       'total-consumption', 'total-consumption-btu'],
      dtype='object')

In [26]:
# Reorder the columns so energySource sits next to fuelTypeDescription and each measure is followed by its units column
epo_cleaned_df = epo_cleaned_df[[
    'period', 'region', 'energySource', 'fuelTypeDescription', 'ash-content', 'ash-content-units',
    'consumption-for-eg', 'consumption-for-eg-units', 'consumption-for-eg-btu', 'consumption-for-eg-btu-units',
    'consumption-uto', 'consumption-uto-units', 'consumption-uto-btu', 'consumption-uto-btu-units', 'cost', 
    'cost-units', 'cost-per-btu', 'cost-per-btu-units', 'generation', 'generation-units', 'heat-content',
    'heat-content-units', 'receipts', 'receipts-units', 'receipts-btu', 'receipts-btu-units', 'stocks', 
    'stocks-units', 'sulfur-content', 'sulfur-content-units', 'total-consumption', 'total-consumption-units',
    'total-consumption-btu', 'total-consumption-btu-units'
]]

# Export the cleaned DataFrame for epo_cleaned_df into csv
epo_cleaned_df.to_csv('../static/data/epo_2012_2022_cleaned.csv', index = False)

**emission_df**

In [27]:
# Drop columns 'stateid', 'fuelid', 'co2-rate-lbs-mwh', 'nox-rate-lbs-mwh', 'so2-rate-lbs-mwh',
# 'co2-rate-lbs-mwh-units', 'nox-rate-lbs-mwh-units', 'so2-rate-lbs-mwh-units' from emission_cleaned_df
emission_cleaned_df = emission_cleaned_df.drop([
    'stateid', 'fuelid', 'co2-rate-lbs-mwh', 'nox-rate-lbs-mwh', 'so2-rate-lbs-mwh',
    'co2-rate-lbs-mwh-units', 'nox-rate-lbs-mwh-units', 'so2-rate-lbs-mwh-units'
], axis = 1)
print(emission_cleaned_df.columns)

Index(['period', 'stateDescription', 'fuelDescription',
       'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons',
       'co2-thousand-metric-tons-units', 'nox-short-tons-units',
       'so2-short-tons-units'],
      dtype='object')


In [28]:
# Print out the value_counts() in fuelDescription for emission_cleaned_df
print('emission', emission_cleaned_df['fuelDescription'].value_counts())

emission fuelDescription
Total          572
Petroleum      567
Natural Gas    561
Other          561
Coal           529
Name: count, dtype: int64


In [29]:
# Remove all rows where stateDescription for emission_cleaned_df is 'United States'
emission_cleaned_df = emission_cleaned_df[
    ~emission_cleaned_df['stateDescription'].isin(['United States'])
]

In [30]:
# Bin the 'stateDescription' using the same region lists defined in the epo section
emission_cleaned_df = category_bin(emission_cleaned_df, 'stateDescription', west_reg, 'West Region')
emission_cleaned_df = category_bin(emission_cleaned_df, 'stateDescription', southwest_reg, 'Southwest Region')
emission_cleaned_df = category_bin(emission_cleaned_df, 'stateDescription', midwest_reg, 'Midwest Region')
emission_cleaned_df = category_bin(emission_cleaned_df, 'stateDescription', southeast_reg, 'Southeast Region')
emission_cleaned_df = category_bin(emission_cleaned_df, 'stateDescription', northeast_reg, 'Northeast Region')

In [31]:
# Check the binning
emission_cleaned_df['stateDescription'].value_counts()

stateDescription
Southeast Region    660
Midwest Region      652
Northeast Region    611
West Region         592
Southwest Region    220
Name: count, dtype: int64

In [32]:
# Create bins based on fuel type description into 'Fossil Fuels', 'Renewables', and 'Others' for emission_cleaned_df under
# energySource
emission_ff_source = ['Petroleum', 'Natural Gas', 'Coal']
emission_renew_source = ['Total']
emission_oth_source = ['Other']

emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_ff_source, 'fossil fuels', 'energySource')
emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_renew_source, 'renewables', 'energySource')
emission_cleaned_df = category_bin(emission_cleaned_df, 'fuelDescription', emission_oth_source, 'others', 'energySource')

# Drop 'fuelDescription' since it is no longer needed
emission_cleaned_df = emission_cleaned_df.drop(['fuelDescription'], axis = 1)



In [33]:
# Check the value_counts() again to make sure binning was done correctly
emission_cleaned_df['energySource'].value_counts()

energySource
fossil fuels    1624
renewables       561
others           550
Name: count, dtype: int64

In [34]:
# Fill in NaN columns as 0 then set these columns as float
# 'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons'
emission_cols = ['co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons']

for col in emission_cols:
    emission_cleaned_df = fix_nan(emission_cleaned_df, col)

In [35]:
# Print out the columns
emission_cleaned_df.columns

Index(['period', 'stateDescription', 'co2-thousand-metric-tons',
       'nox-short-tons', 'so2-short-tons', 'co2-thousand-metric-tons-units',
       'nox-short-tons-units', 'so2-short-tons-units', 'energySource'],
      dtype='object')

In [36]:
# Convert the period column to DateTime format
emission_cleaned_df['period'] = pd.to_datetime(emission_cleaned_df['period'], format='%Y')

In [37]:
# Use groupby() and sum() to merge matching rows based on 'period', 'stateDescription', and 'energySource',
# then round to 2 decimal places. The unit-of-measure (UOM) columns are included in the grouping keys.
emission_group_by = [
    'period', 'stateDescription', 'co2-thousand-metric-tons-units', 
    'nox-short-tons-units', 'so2-short-tons-units', 'energySource'
]

emission_cleaned_df = emission_cleaned_df.groupby(emission_group_by).agg('sum').round(2).reset_index()
display(emission_cleaned_df.head())
print('emission\'s shape:', emission_cleaned_df.shape)

Unnamed: 0,period,stateDescription,co2-thousand-metric-tons-units,nox-short-tons-units,so2-short-tons-units,energySource,co2-thousand-metric-tons,nox-short-tons,so2-short-tons
0,2012-01-01,Midwest Region,thousand metric tons,short tons,short tons,fossil fuels,634744.0,644925.0,1677127.0
1,2012-01-01,Midwest Region,thousand metric tons,short tons,short tons,others,1703.0,61853.0,23587.0
2,2012-01-01,Midwest Region,thousand metric tons,short tons,short tons,renewables,636449.0,706778.0,1700714.0
3,2012-01-01,Northeast Region,thousand metric tons,short tons,short tons,fossil fuels,214554.0,209167.0,360508.0
4,2012-01-01,Northeast Region,thousand metric tons,short tons,short tons,others,7740.0,67017.0,25847.0


emission's shape: (165, 9)


In [38]:
# Check the column types
emission_cleaned_df.dtypes

period                            datetime64[ns]
stateDescription                          object
co2-thousand-metric-tons-units            object
nox-short-tons-units                      object
so2-short-tons-units                      object
energySource                              object
co2-thousand-metric-tons                 float64
nox-short-tons                           float64
so2-short-tons                           float64
dtype: object

In [39]:
# Rename the stateDescription to region to correctly reflect the column category
emission_cleaned_df.rename(columns = {'stateDescription': 'region'}, inplace = True)
emission_cleaned_df.columns

Index(['period', 'region', 'co2-thousand-metric-tons-units',
       'nox-short-tons-units', 'so2-short-tons-units', 'energySource',
       'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons'],
      dtype='object')

In [40]:
# Rearrange the energySource
emission_cleaned_df = emission_cleaned_df[[
    'period', 'region', 'energySource',
    'co2-thousand-metric-tons', 'nox-short-tons', 'so2-short-tons',
    'co2-thousand-metric-tons-units', 'nox-short-tons-units',
    'so2-short-tons-units'
]]

# Export the cleaned DataFrame for emission_cleaned_df into csv
emission_cleaned_df.to_csv('../static/data/emission_2012_2022_cleaned.csv', index = False)

### Database Storing
---

In [41]:
# Import dependencies for handling the database
from os import path, remove
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

In [42]:
# Setup the db path
db_path = '../static/data/eia_electric.sqlite'

# Delete the existing database if it exists
if path.exists(db_path):
    remove(db_path)

In [43]:
# Setup the engine and connect the database
engine = create_engine(f'sqlite:///{db_path}')
conn = engine.connect()

In [44]:
# Create session for querying later to verify tables have been created correctly
session = Session(bind = engine)

In [45]:
# Write epo_cleaned_df to the database as the 'epo' table (replacing the table if it already exists)
epo_cleaned_df.to_sql(name = 'epo', con = engine, if_exists = 'replace', index = False)

1995

In [46]:
session.execute(text('SELECT * from epo')).fetchone()

('2012-01-01 00:00:00.000000', 'Midwest Region', 'fossil fuels', 'bituminous coal', 182.72, 'percent', 120401.56, 'thousand short tons', 2833.32, 'million MMBtu', 3966.74, 'thousand short tons', 89.48, 'million MMBtu', 0.0, 'dollars per short tons', 0.0, 'dollars per million Btu', 274192.77, 'thousand megawatthours', 567.15, 'Btu per short tons', 120079.43, 'thousand short tons', 2884577.58, 'billion Btu', 0.0, 'thousand short tons', 44.36, 'percent', 124368.3, 'thousand short tons', 2922.8, 'million MMBtu')

In [47]:
# Write emission_cleaned_df to the database as the 'emission' table (replacing the table if it already exists)
emission_cleaned_df.to_sql(name = 'emission', con = engine, if_exists = 'replace', index = False)

165

In [48]:
session.execute(text('SELECT * from emission')).fetchone()

('2012-01-01 00:00:00.000000', 'Midwest Region', 'fossil fuels', 634744.0, 644925.0, 1677127.0, 'thousand metric tons', 'short tons', 'short tons')

In [49]:
# Close the session and the open connection, then dispose of the engine
session.close()
conn.close()
engine.dispose()
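
Beyond fetching a single row, the stored tables can be cross-checked with a row count and a list of distinct categories. The sketch below shows the idea with the standard-library `sqlite3` module against an in-memory database and three made-up rows; it does not touch `eia_electric.sqlite`, and the table contents are hypothetical.

```python
import sqlite3

# In-memory database with a cut-down, hypothetical version of the emission table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE emission (period TEXT, region TEXT, energySource TEXT)')
rows = [
    ('2012-01-01', 'Midwest Region', 'fossil fuels'),
    ('2012-01-01', 'Midwest Region', 'others'),
    ('2012-01-01', 'Midwest Region', 'renewables'),
]
conn.executemany('INSERT INTO emission VALUES (?, ?, ?)', rows)

# Row count should match the number of rows written
row_count = conn.execute('SELECT COUNT(*) FROM emission').fetchone()[0]

# Distinct categories should match the bins created during the transform step
distinct_sources = [r[0] for r in conn.execute(
    'SELECT DISTINCT energySource FROM emission ORDER BY energySource')]
conn.close()

print(row_count, distinct_sources)  # 3 ['fossil fuels', 'others', 'renewables']
```

Run against the real database, the counts would be compared with the values returned by `to_sql` above (1995 and 165).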