# How Green is My Grid?

## The Problem of Dirty Electricity

Climate scientists predict that we need to have net zero carbon emissions by [YEAR] in order to avoid the global warming "tipping point" of [BAD THING]. The shift away from using fossil fuels to power cars, stoves, and so on is a critical and readily visible part of this effort. However, moving from fossil fuels to electric power can't help us achieve net zero emissions unless electricity production itself has net zero emissions. Many of our power plants use CO2-emitting sources such as coal and natural gas to produce electricity. 

The problem of CO2-heavy electricity production will only grow. The US is estimated to [increase its electricity consumption](https://www.nationalgrid.com/stories/energy-explained/how-will-our-electricity-supply-change-future) by 50% by 2036 and to double it by 2050. If we continue to use coal and other fossil fuels to produce electricity, our CO2 emissions will be *increasing* at a time when the planet's future depends on our ability to *reduce* emissions.

[[why focus on co2]]

### Questions:

- How much electricity did US power plants produce in 2021? How much power would we expect to need from them by 2036 and 2050?
- Which power plants are operating furthest below capacity? I.e., which power plants would be able to produce more power as our demand for energy increases?
- What will the overall impact of increased demand for electricity be in terms of pollution?
- What percent of power plants use "clean" (or almost clean) power sources? Which ones use power sources that contribute to global warming?
- Are there particular power companies/states producing power with a lower emissions rate that could serve as models for cleaner power production?

## Data Source

The US Environmental Protection Agency (EPA) releases the [eGrid report](https://www.epa.gov/egrid) each year. This report contains data on each of the 11K power plants in the US and Puerto Rico, including power sources, pollution, and efficiency. It also contains a summary of demographic information for the area surrounding each power plant. The most recent data is from 2021.

A full description of all terms and data in the dataset can be found in [this guide](https://www.epa.gov/system/files/documents/2023-01/eGRID2021_technical_guide.pdf).

Federal regulations require power plants to report their emissions and energy use. This is the data that is presented in the eGrid report. The EPA describes the dataset as containing information on "almost all electric power generated in the United States". It's not clear which power plants would be exempt from this reporting rule and what impact that missing data might have on analysis of the dataset. However, eGrid is used throughout the US government and in industry to calculate the environmental impact of power production, so we will follow the consensus that the dataset is representative of the entire country's power production.

## Clean Data

I cleaned the EPA's eGrid data by:
- extracting and renaming the relevant columns
- creating a schema of expected data types to catch irregularities, which led me to:
- casting columns to the correct data types
- removing rows with 'NaN' in critical columns
- normalizing plant owner and utility company names
    
You can find a notebook documenting the full data cleaning process [here](data/clean_egrid_data.ipynb).
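
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the code from the linked notebook: the raw column headers, the sample rows, and the name-normalization rules are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the raw eGRID sheet; these headers and rows are made up.
raw = pd.DataFrame({
    "State": ["AK", "AK", "TX"],
    "Utility Name": ["Alaska Electric Light&Power Co", "  Aniak Light & Power Co Inc ", "Example Utility"],
    "Net Generation (MWh)": ["1200.5", "not reported", "980"],
})

# 1. Extract and rename the relevant columns.
column_names = {
    "State": "state",
    "Utility Name": "utility_name",
    "Net Generation (MWh)": "annual_net_generation_mwh",
}
cleaned = raw[list(column_names)].rename(columns=column_names)

# 2. Declare the expected type of each column (a hypothetical schema).
expected_dtypes = {
    "state": "string",
    "utility_name": "string",
    "annual_net_generation_mwh": "float64",
}

# 3. Cast each column; unparseable numbers become NaN instead of raising.
for col, dtype in expected_dtypes.items():
    if dtype == "float64":
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    else:
        cleaned[col] = cleaned[col].astype(dtype)

# 4. Drop rows that are missing a critical value.
cleaned = cleaned.dropna(subset=["annual_net_generation_mwh"]).reset_index(drop=True)

# 5. Normalize company names: trim, put single spaces around "&", collapse whitespace.
cleaned["utility_name"] = (
    cleaned["utility_name"]
    .str.strip()
    .str.replace(r"\s*&\s*", " & ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)
```

Coercing with `pd.to_numeric(..., errors="coerce")` before dropping rows means a value such as "not reported" is treated the same way as a missing value.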

## Load Libraries

In [245]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import math

## Load Data

In [2]:
egrid_df = pd.read_csv('data/cleaned_egrid_data.csv')

In [4]:
egrid_df.columns

Index(['Unnamed: 0', 'plant_sequence_num', 'state', 'plant_owner',
       'utility_name', 'balancing_auth_code', 'nerc_region', 'egrid_subregion',
       'county', 'latitude', 'longitude', 'primary_fuel',
       'primary_fuel_category', 'capacity_factor', 'nameplate_capacity_mw',
       'annual_net_generation_mwh', 'annual_co2_emissions_tons',
       'annual_co2_equiv_emissions_tons', 'annual_co2_emission_rate_lb/mwh',
       'annual_co2_equiv_emissions_rate_lb_mwh',
       'annual_coal_net_generation_mwh', 'annual_oil_net_generation_mwh',
       'annual_gas_net_generation_mwh', 'annual_nuclear_net_generation_mwh',
       'annual_hydro__net_generation_mwh', 'annual_biomass_net_generation_mwh',
       'annual_wind_net_generation_mwh', 'annual_solar_net_generation_mwh',
       'annual_geothermal_net_generation_mwh',
       'annual_other_fossil_fuel_net_generation_mwh',
       'annual_other_purchased_net_generation_mwh', 'coal_generation_percent',
       'oil_generation_percent', 'gas_gen

## What type of fuel does the majority of our power come from?

In [218]:
fuel_production_columns = [
    'annual_coal_net_generation_mwh', 'annual_oil_net_generation_mwh',
   'annual_gas_net_generation_mwh', 'annual_nuclear_net_generation_mwh',
   'annual_hydro__net_generation_mwh', 'annual_biomass_net_generation_mwh',
   'annual_wind_net_generation_mwh', 'annual_solar_net_generation_mwh',
   'annual_geothermal_net_generation_mwh',
   'annual_other_fossil_fuel_net_generation_mwh',
   'annual_other_purchased_net_generation_mwh'
]

fuel_type_names = [
    'Coal',
    'Oil',
    'Gas',
    'Nuclear',
    'Hydro',
    'Biomass',
    'Wind',
    'Solar',
    'Geothermal',
    'Other fossil fuels',
    'Unknown/purchased'
]

# Sum each fuel's generation across all plants (avoid shadowing the built-in `type`)
fuel_production_amts = [[egrid_df[column].sum()] for column in fuel_production_columns]

fuel_production_by_type = pd.DataFrame(data=dict(zip(fuel_type_names, fuel_production_amts))).transpose()

In [216]:
pd.set_option('display.float_format', '{:.2g}'.format)

In [219]:
fuel_production_by_type = fuel_production_by_type.sort_values(by=0, ascending=False)
fuel_production_by_type.columns = ['Annual Production (MWh)']
fuel_production_by_type.index.name = 'Fuel Type'
fuel_production_by_type

Fuel Type,Annual Production (MWh)
Gas,1600000000.0
Coal,900000000.0
Nuclear,780000000.0
Wind,380000000.0
Hydro,250000000.0
Solar,110000000.0
Biomass,54000000.0
Oil,26000000.0
Other fossil fuels,19000000.0
Geothermal,16000000.0


In [222]:
total_power_production = fuel_production_by_type['Annual Production (MWh)'].sum()
fuel_production_by_type['Percent of annual production'] = fuel_production_by_type['Annual Production (MWh)'] / total_power_production

fuel_production_by_type

Fuel Type,Annual Production (MWh),Percent of annual production
Gas,1600000000.0,0.38
Coal,900000000.0,0.22
Nuclear,780000000.0,0.19
Wind,380000000.0,0.092
Hydro,250000000.0,0.06
Solar,110000000.0,0.028
Biomass,54000000.0,0.013
Oil,26000000.0,0.0063
Other fossil fuels,19000000.0,0.0047
Geothermal,16000000.0,0.0039


In [234]:
dirty_fuel_types = [
    'Coal',
    'Oil',
    'Gas',
    'Other fossil fuels'
]

clean_fuel_types = [
    'Nuclear',
    'Hydro',
    'Biomass',
    'Wind',
    'Solar',
    'Geothermal'
]

clean_fuel_production = fuel_production_by_type.loc[clean_fuel_types].sum()
print('Electricity from clean fuels:')
print(clean_fuel_production)
print('\n')

dirty_fuel_production = fuel_production_by_type.loc[dirty_fuel_types].sum()
print('Electricity from dirty fuels:')
print(dirty_fuel_production)

# % from nuclear and wind is high!
# % oil is low (probably because we use it mostly in heating, not accounted for here)
# % geothermal is low--why? because mostly used on people's property? no idea--research!

Electricity from clean fuels:
Annual Production (MWh)        1.6e+09
Percent of annual production      0.39
dtype: float64


Electricity from dirty fuels:
Annual Production (MWh)        2.5e+09
Percent of annual production      0.61
dtype: float64


In [239]:
# set up notebook to display Bokeh charts

output_notebook()

In [247]:
fuel_x_range = list(fuel_production_by_type.index)
fuel_type_chart = figure(x_range=fuel_x_range, height=350, title="Annual Power Production by Fuel Type", toolbar_location=None, tools="")
fuel_type_chart.vbar(x=fuel_x_range, top=list(fuel_production_by_type['Annual Production (MWh)']), width=0.9)

fuel_type_chart.xgrid.grid_line_color = None
fuel_type_chart.y_range.start = 0
fuel_type_chart.xaxis.major_label_orientation = math.pi/4

show(fuel_type_chart)

Note: geothermal energy was included in clean energy sources because it produces far less CO2 than fossil fuels. However, geothermal energy [pollutes both air and water](https://fws.gov/node/265252#:~:text=Air%20and%20water%20pollution%20are,for%20cooling%20or%20other%20purposes.).
    
Findings:
- 60% of our electricity comes from gas and coal
- nearly 40% of our electricity comes from clean fuel sources, with nuclear and wind as the largest sources of clean energy
- oil accounts for only 0.6% of electricity from power plants. This is likely because oil is used mostly for [heating](https://www.eia.gov/energyexplained/heating-oil/use-of-heating-oil.php#:~:text=Who%20uses%20heating%20oil%3F,the%20U.S.%20Northeast%20Census%20Region.) (although this use is on the decline).
- even though geothermal energy accounts for only 0.39% of our electricity, the US is the [largest producer of geothermal energy](https://www.eia.gov/energyexplained/geothermal/use-of-geothermal-energy.php)

## What are the greenest fuel types?

Moving towards a cleaner power grid will require choices about when to phase out existing power plants that use dirty fuel. To eliminate CO2 as quickly as possible, we should focus on the worst polluters. 

Which dirty fuels produce the most CO2? Are there differences between green fuels? [[do we want this statement?]]
[[something about % produced vs pollution]]

[[use 'annual_co2_emission_rate_lb/mwh' for this]]

We will restrict this analysis to power plants that use only one fuel type; there is not enough data to determine what percent of the CO2 produced is from each fuel source in plants using more than one fuel type.


### Add a column to represent whether or not a power plant uses only one fuel source

In [252]:
fuel_percent_cols = [
'coal_generation_percent',
'oil_generation_percent',
'gas_generation_percent',
'nuclear_generation_percent', 
'hydro_generation_percent',
'biomass_generation_percent', 
'wind_generation_percent',
'solar_generation_percent', 
'geothermal_generation_percent',
'other_fossil_fuel_generation_percent',
'other_purchased_generation_percent'
]

# A plant is single-fuel when exactly one fuel column has a positive share of generation
egrid_df['is_single_fuel_plant'] = egrid_df[fuel_percent_cols].gt(0).sum(axis=1).eq(1)

In [257]:
egrid_df.head()

Unnamed: 0.1,Unnamed: 0,plant_sequence_num,state,plant_owner,utility_name,balancing_auth_code,nerc_region,egrid_subregion,county,latitude,...,gas_generation_percent,nuclear_generation_percent,hydro_generation_percent,biomass_generation_percent,wind_generation_percent,solar_generation_percent,geothermal_generation_percent,other_fossil_fuel_generation_percent,other_purchased_generation_percent,is_single_fuel_plant
0,3,3,AK,"Copper Valley Elec Assn, Inc","Copper Valley Elec Assn, Inc",UNKNOWN,AK,AKMS,Valdez Cordova,61,...,0,0,1,0,0,0,0,0,0,True
1,4,4,AK,"Alaska Village Elec Coop, Inc","Alaska Village Elec Coop, Inc",UNKNOWN,AK,AKMS,Northwest Arctic,67,...,0,0,0,0,0,0,0,0,0,True
2,5,5,AK,"Inside Passage Elec Coop, Inc","Inside Passage Elec Coop, Inc",UNKNOWN,AK,AKMS,Hoonah-Angoon,57,...,0,0,0,0,0,0,0,0,0,True
3,6,6,AK,Aniak Light & Power Co Inc,Aniak Light & Power Co Inc,UNKNOWN,AK,AKMS,Bethel,62,...,0,0,0,0,0,0,0,0,0,True
4,7,7,AK,Alaska Electric Light&Power Co,Alaska Electric Light & Power Co.,UNKNOWN,AK,AKMS,Juneau,58,...,0,0,1,0,0,0,0,0,0,True


In [258]:
single_fuel_plants_df = egrid_df[egrid_df['is_single_fuel_plant']]

### Which fuels are the least efficient?

The `annual_co2_emission_rate_lb/mwh` column can be used to determine which fuels produce the most CO2 emissions relative to the electricity they produce. This ratio of CO2 emissions to power production shows how many pounds of CO2 each fuel emits for every MWh it generates.

Oil, gas, and coal are the biggest polluters when measured by pounds of CO2 produced per MWh. 

The cleanest fuels are wind, solar, and nuclear. We also see that geothermal power produces more than 20 times as much CO2 per MWh as biomass, the next clean fuel on the list. This is in keeping with the debate over whether or not geothermal power is really "green"; it produces much more CO2 than other green fuels in addition to the water and air pollution.
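
A per-fuel emission rate can be summarized in more than one way. The cell below takes the plain mean of each plant's rate, which counts a tiny plant as much as a large one. An alternative, sketched here on made-up numbers, is a generation-weighted rate: total emissions divided by total generation for each fuel. The toy values are hypothetical, and the conversion assumes `annual_co2_emissions_tons` is in short tons (2,000 lb), which should be checked against the eGRID technical guide.

```python
import pandas as pd

# Toy data with made-up numbers; column names mirror the cleaned eGRID frame.
plants = pd.DataFrame({
    "primary_fuel_category": ["GAS", "GAS", "COAL"],
    "annual_net_generation_mwh": [1_000_000.0, 1_000.0, 500_000.0],
    "annual_co2_emissions_tons": [450_000.0, 2_000.0, 500_000.0],
})

LB_PER_SHORT_TON = 2000  # assumption: emissions are reported in short tons

# Per-plant emission rate in lb of CO2 per MWh.
plants["rate_lb_per_mwh"] = (
    plants["annual_co2_emissions_tons"] * LB_PER_SHORT_TON
    / plants["annual_net_generation_mwh"]
)

# Unweighted: every plant counts equally, however little power it produces.
unweighted = plants.groupby("primary_fuel_category")["rate_lb_per_mwh"].mean()

# Generation-weighted: total emissions over total generation per fuel.
totals = plants.groupby("primary_fuel_category")[
    ["annual_co2_emissions_tons", "annual_net_generation_mwh"]
].sum()
weighted = (
    totals["annual_co2_emissions_tons"] * LB_PER_SHORT_TON
    / totals["annual_net_generation_mwh"]
)

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In this toy example the small, inefficient gas plant pulls the unweighted gas rate far above the weighted one, while the single coal plant gives the same value either way.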

In [305]:
fuel_type_efficiency = single_fuel_plants_df.groupby('primary_fuel_category')['annual_co2_emission_rate_lb/mwh'].mean().sort_values(ascending=False)
fuel_type_efficiency.index.name = 'Fuel type'
fuel_type_efficiency = fuel_type_efficiency.reset_index().rename(columns={ 'annual_co2_emission_rate_lb/mwh': 'CO2 emissions lb/MWh' })

In [306]:
# Add a column for metric tons/MWh (this is a more common measure of CO2 emissions)

lbs_to_metric_tons_factor = 0.000453592
fuel_type_efficiency['CO2 emissions metric tons/MWh'] = fuel_type_efficiency['CO2 emissions lb/MWh'] * lbs_to_metric_tons_factor

# Ignoring the "grab bag" categories, because those include multiple fuel types
fuel_type_efficiency[~fuel_type_efficiency['Fuel type'].isin(['OTHF', 'OFSL'])]

Unnamed: 0,Fuel type,CO2 emissions lb/MWh,CO2 emissions metric tons/MWh
0,OIL,4100.0,1.9
1,GAS,3500.0,1.6
2,COAL,1600.0,0.75
4,GEOTHERMAL,100.0,0.048
6,BIOMASS,4.4,0.002
7,HYDRO,0.007,3.2e-06
8,NUCLEAR,0.0032,1.5e-06
9,SOLAR,0.0,0.0
10,WIND,0.0,0.0


In [352]:
coal_to_geothermal = fuel_type_efficiency[fuel_type_efficiency['Fuel type'] == 'COAL']['CO2 emissions lb/MWh'].iloc[0] / fuel_type_efficiency[fuel_type_efficiency['Fuel type'] == 'GEOTHERMAL']['CO2 emissions lb/MWh'].iloc[0]
coal_to_geothermal

15.656479394372145

In [307]:
fuel_type_efficiency.to_csv('co2_emissions_by_fuel_type.csv')

In [None]:
# bar chart?

If we look at what percent of CO2 output each fuel type is responsible for, will we get the same results?

In [335]:
total_co2_production = single_fuel_plants_df['annual_co2_emissions_tons'].sum()

percent_of_net_emissions_by_fuel = single_fuel_plants_df.groupby('primary_fuel_category')['annual_co2_emissions_tons'].sum().sort_values(ascending=False)
percent_of_net_emissions_by_fuel.index.name = 'Fuel type'
percent_of_net_emissions_by_fuel = percent_of_net_emissions_by_fuel.reset_index().rename(columns={ 'annual_co2_emissions_tons': 'Annual CO2 emissions (tons)' })
percent_of_net_emissions_by_fuel['Percent of annual CO2 emissions'] = percent_of_net_emissions_by_fuel['Annual CO2 emissions (tons)'] / total_co2_production

net_power_by_fuel_type = single_fuel_plants_df.groupby('primary_fuel_category')['annual_net_generation_mwh'].sum()
net_power_by_fuel_type.index.name = 'Fuel type'
net_power_by_fuel_type = net_power_by_fuel_type.reset_index().rename(columns={ 'annual_net_generation_mwh': 'Annual power production MWh' })

percentage_emissions_and_production = pd.merge(percent_of_net_emissions_by_fuel, net_power_by_fuel_type, how='left', on=['Fuel type'])
percentage_emissions_and_production['Percent of annual power production'] = percentage_emissions_and_production['Annual power production MWh'] / percentage_emissions_and_production['Annual power production MWh'].sum()


# Ignoring the "grab bag" categories, because those include multiple fuel types
percentage_emissions_and_production[~percentage_emissions_and_production['Fuel type'].isin(['OTHF', 'OFSL'])]

Unnamed: 0,Fuel type,Annual CO2 emissions (tons),Percent of annual CO2 emissions,Annual power production MWh,Percent of annual power production
0,GAS,480000000.0,0.95,1100000000.0,0.41
1,COAL,13000000.0,0.026,12000000.0,0.0045
2,OIL,13000000.0,0.025,14000000.0,0.0053
3,GEOTHERMAL,1300000.0,0.0026,16000000.0,0.0059
5,BIOMASS,77000.0,0.00015,14000000.0,0.0054
7,NUCLEAR,830.0,1.6e-06,750000000.0,0.28
8,HYDRO,490.0,9.6e-07,250000000.0,0.095
9,SOLAR,0.0,0.0,110000000.0,0.043
10,WIND,0.0,0.0,380000000.0,0.14


In [None]:
# pie chart side by side, % emissions and % annual production

## Current Electricity Production and Future Demand Estimate

We will use the total current electricity production from the dataset's power plants as an estimate of the current electricity produced in the US and Puerto Rico.

First, sum each plant's annual net generation to get the total annual production in MWh. A megawatt hour is the energy delivered by 1,000 kilowatts of power sustained for one hour (1,000 kWh). For reference, a megawatt hour of energy can keep [two refrigerators running for a year](https://www.freeingenergy.com/what-is-a-megawatt-hour-of-electricity-and-what-can-you-do-with-it/).
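
As a quick unit check, energy in MWh is power in kW multiplied by hours, divided by 1,000. The sketch below works through that arithmetic and the refrigerator comparison; the helper name `energy_mwh` is made up for this example, and the refrigerator figure is taken at face value from the linked article.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def energy_mwh(power_kw, hours):
    """Energy in MWh delivered by a constant load of power_kw over the given hours."""
    return power_kw * hours / 1000


# 1,000 kW sustained for one hour is exactly 1 MWh.
one_hour_at_1000_kw = energy_mwh(1000, 1)

# If 1 MWh runs two refrigerators for a year, each uses 0.5 MWh (500 kWh),
# which corresponds to this average draw in kW:
avg_fridge_kw = 0.5 * 1000 / HOURS_PER_YEAR

print(one_hour_at_1000_kw, round(avg_fridge_kw, 3))
```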

In [65]:
annual_power_production = egrid_df['annual_net_generation_mwh'].sum()

print(f'The total annual production is {annual_power_production:.2e} MWh.')

The total annual production is 4.12e+09 MWh.


At a minimum, we can expect electricity consumption to [rise](https://www.nationalgrid.com/stories/energy-explained/how-will-our-electricity-supply-change-future) by 50% by 2036 and to double by 2050. Based on current production, we can estimate that the US will need to produce:

In [69]:
est_2036_need = annual_power_production * 1.5
est_2050_need = annual_power_production * 2

print(f'The US and Puerto Rico will need {est_2036_need:.2e} MWh/yr by 2036 and {est_2050_need:.2e} MWh/yr by 2050.')

The US and Puerto Rico will need 6.18e+09 MWh/yr by 2036 and 8.24e+09 MWh/yr by 2050.


In [70]:
shortfall_2036 = est_2036_need - annual_power_production
shortfall_2050 = est_2050_need - annual_power_production

print(f'We will need to produce an additional {shortfall_2036:.2e} MWh/yr by 2036 and an additional {shortfall_2050:.2e} MWh/yr by 2050 to keep up with demand.')

We will need to produce an additional 2.06e+09 MWh/yr by 2036 and an additional 4.12e+09 MWh/yr by 2050 to keep up with demand.


### Estimating additional electric production capacity

An obvious question is whether the current power plants are operating at maximum capacity and, if not, how much more energy we could produce from them.

Unfortunately, I was unable to find sufficient data to answer this question.

We have the capacity factor (the ratio of the energy a plant actually produced to the energy it could have produced running at full capacity) for each plant, and the national average capacity factor for power plants of each fuel type.

The difference between the power that is actually produced and the potential power is due to several factors:

1. downtime for maintenance
2. downtime due to lack of fuel (oil shortages, cloudy days for solar plants, etc.)
3. downtime or reduced production because the grid has sufficient power
    
Without data to differentiate these three reasons, it's impossible to calculate which plants could actually produce more power day in and day out, year after year. So for the purpose of this analysis, we will assume that our current grid is at capacity. 
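
For reference, the capacity factor described above can be written as annual generation divided by the generation a plant would reach running at nameplate capacity for every hour of the year. The sketch below uses made-up plants to show the calculation and the theoretical unused output it implies. That unused figure is only an upper bound, since it cannot separate the three causes of downtime listed above.

```python
import pandas as pd

HOURS_PER_YEAR = 8760  # 365 days x 24 hours (2021 was not a leap year)

# Toy plants with made-up numbers; column names mirror the cleaned eGRID frame.
plants = pd.DataFrame({
    "primary_fuel_category": ["NUCLEAR", "SOLAR"],
    "nameplate_capacity_mw": [1000.0, 100.0],
    "annual_net_generation_mwh": [7_884_000.0, 175_200.0],
})

# Output if the plant ran at nameplate capacity all year.
max_possible_mwh = plants["nameplate_capacity_mw"] * HOURS_PER_YEAR

# Capacity factor: actual generation as a fraction of that maximum.
plants["capacity_factor"] = plants["annual_net_generation_mwh"] / max_possible_mwh

# Upper bound on additional output; not all of it is realistically available.
plants["unused_mwh_upper_bound"] = max_possible_mwh - plants["annual_net_generation_mwh"]

print(plants)
```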

## Environmental impact of increased energy needs

What will the environmental impact of increased energy use be?

We'll use two models:

- assuming that the grid will continue to use the same ratio of renewable and nonrenewable fuels
- assuming any future power plants will be built with renewable energy
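
The two models can be sketched as a simple scaling rule. This is a rough illustration rather than a finished projection: the function name `project_emissions` is made up, the baseline of 100 is a placeholder so results read as a percentage of today's emissions, and the sketch assumes that emission rates per fuel stay constant, that no existing plants retire, and that new renewable capacity emits no CO2.

```python
def project_emissions(current_emissions, demand_multiplier, new_capacity_is_clean):
    """Project annual CO2 emissions under one of the two models."""
    if new_capacity_is_clean:
        # Model 2: existing plants emit as they do today; added generation emits nothing.
        return current_emissions
    # Model 1: the fuel mix is unchanged, so emissions scale with generation.
    return current_emissions * demand_multiplier


# Placeholder baseline: 100 means "100% of today's emissions", not a measured value.
baseline = 100.0

# Demand multipliers from the estimates above: +50% by 2036, double by 2050.
for year, multiplier in [(2036, 1.5), (2050, 2.0)]:
    same_mix = project_emissions(baseline, multiplier, new_capacity_is_clean=False)
    clean_growth = project_emissions(baseline, multiplier, new_capacity_is_clean=True)
    print(f"{year}: same fuel mix = {same_mix:.0f}, clean growth only = {clean_growth:.0f}")
```

Under these assumptions the second model only holds emissions flat; reducing them would also require replacing existing fossil fuel plants.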

In [71]:
# To estimate the energy-to-pollution ratio, average the power:pollution ratio of plants with only one fuel source

egrid_df[['coal_generation_percent',
       'oil_generation_percent', 'gas_generation_percent',
       'nuclear_generation_percent', 'hydro_generation_percent',
       'biomass_generation_percent', 'wind_generation_percent',
       'solar_generation_percent', 'geothermal_generation_percent',
       'other_fossil_fuel_generation_percent',
       'other_purchased_generation_percent']]

Unnamed: 0,coal_generation_percent,oil_generation_percent,gas_generation_percent,nuclear_generation_percent,hydro_generation_percent,biomass_generation_percent,wind_generation_percent,solar_generation_percent,geothermal_generation_percent,other_fossil_fuel_generation_percent,other_purchased_generation_percent
0,0.000000,0.000000,0.000000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,1.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
10693,0.000000,0.000000,0.000000,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
10694,0.997846,0.000000,0.002154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10695,0.995580,0.000000,0.004420,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10696,0.998338,0.000000,0.001662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# run numbers for w/ and w/o clean energy to make up the difference
# make a note about assumptions you're making

In [44]:
egrid_df.groupby('primary_fuel_category')['capacity_factor'].mean().sort_values(ascending=False)

primary_fuel_category
NUCLEAR       0.866785
BIOMASS       0.553806
GEOTHERMAL    0.507146
COAL          0.428093
HYDRO         0.345689
OFSL          0.342017
WIND          0.301482
GAS           0.297236
OTHF          0.187086
SOLAR         0.186999
OIL           0.035228
Name: capacity_factor, dtype: float64

In [5]:
# TODO: maybe try graphing power vs. emissions by type of fuel
# for all single-source plants, and see what's an outlier (what to do/avoid)

array(['HYDRO', 'OIL', 'COAL', 'GAS', 'OTHF', 'WIND', 'BIOMASS', 'SOLAR',
       'NUCLEAR', 'GEOTHERMAL', 'OFSL'], dtype=object)
