# Southwest Hydropower Plants and Energy Generation Data
By Sara Smithers  
July 15<sup>th</sup>, 2022

### Notebook Objectives
- Combine and clean monthly energy generation data for the U.S. Energy Information Administration
- Initial exploratory data analysis (EDA) of monthly energy data 
- Clean US hydroelectric plant data from Oak Ridge National Laboratory
- Join US hydroelectric plants data on county names with FIPS from USDA
- Initial EDA of hydroelectric plant data
- Combine and clean California Large and Small Hydropower Plant production data
- Join California Hydropower production data table with reference table including plant EIA ID numbers

### Combining and Cleaning Monthly Energy Generation Data

Original monthly energy generation xls file was downloaded from https://www.eia.gov/electricity/data/state/. Each sheet in the file was a year of data spanning from 2001 to 2020. The sheets were combined into one CSV using Excel Power Query. 

In [1]:
import pandas as pd

In [19]:
# import 'State_power_generation_monthly'

energy_by_state = pd.read_csv('data/State_power_generation_monthly.csv')

In [20]:
#Rename columns to update format

energy_by_state.rename(columns = {'YEAR':'year', 'MONTH':'month', 'STATE':'state', 'TYPE OF PRODUCER':'type_of_producer', 'ENERGY SOURCE':'energy_source', 'GENERATION\n(Megawatthours)':'generation_mwh'}, inplace = True)

In [21]:
# check data types

energy_by_state.dtypes

year                            int64
month                           int64
state                          object
type_of_producer               object
energy_source                  object
GENERATION (Megawatthours)    float64
dtype: object

In [22]:
energy_by_state

Unnamed: 0,year,month,state,type_of_producer,energy_source,GENERATION (Megawatthours)
0,2001,1,AK,Total Electric Power Industry,Coal,46903.0
1,2001,1,AK,Total Electric Power Industry,Petroleum,71085.0
2,2001,1,AK,Total Electric Power Industry,Natural Gas,367521.0
3,2001,1,AK,Total Electric Power Industry,Hydroelectric Conventional,104549.0
4,2001,1,AK,Total Electric Power Industry,Wind,87.0
...,...,...,...,...,...,...
460761,2020,12,WY,"Electric Generators, Electric Utilities",Coal,2889631.0
460762,2020,12,WY,"Electric Generators, Electric Utilities",Hydroelectric Conventional,76328.0
460763,2020,12,WY,"Electric Generators, Electric Utilities",Natural Gas,24361.0
460764,2020,12,WY,"Electric Generators, Electric Utilities",Petroleum,4166.0


In [23]:
# sort data frame by date

energy_by_state.sort_values(['year','month'], inplace = True)
energy_by_state

Unnamed: 0,year,month,state,type_of_producer,energy_source,GENERATION (Megawatthours)
0,2001,1,AK,Total Electric Power Industry,Coal,46903.0
1,2001,1,AK,Total Electric Power Industry,Petroleum,71085.0
2,2001,1,AK,Total Electric Power Industry,Natural Gas,367521.0
3,2001,1,AK,Total Electric Power Industry,Hydroelectric Conventional,104549.0
4,2001,1,AK,Total Electric Power Industry,Wind,87.0
...,...,...,...,...,...,...
460761,2020,12,WY,"Electric Generators, Electric Utilities",Coal,2889631.0
460762,2020,12,WY,"Electric Generators, Electric Utilities",Hydroelectric Conventional,76328.0
460763,2020,12,WY,"Electric Generators, Electric Utilities",Natural Gas,24361.0
460764,2020,12,WY,"Electric Generators, Electric Utilities",Petroleum,4166.0


In [24]:
# Check dataframe for NULLS

energy_by_state.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 460766 entries, 0 to 460765
Data columns (total 6 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   year                        460766 non-null  int64  
 1   month                       460766 non-null  int64  
 2   state                       460766 non-null  object 
 3   type_of_producer            460766 non-null  object 
 4   energy_source               460766 non-null  object 
 5   GENERATION (Megawatthours)  460766 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 24.6+ MB


In [25]:
# no NULLS

In [26]:
# drop duplicates

energy_by_state.drop_duplicates()

Unnamed: 0,year,month,state,type_of_producer,energy_source,GENERATION (Megawatthours)
0,2001,1,AK,Total Electric Power Industry,Coal,46903.0
1,2001,1,AK,Total Electric Power Industry,Petroleum,71085.0
2,2001,1,AK,Total Electric Power Industry,Natural Gas,367521.0
3,2001,1,AK,Total Electric Power Industry,Hydroelectric Conventional,104549.0
4,2001,1,AK,Total Electric Power Industry,Wind,87.0
...,...,...,...,...,...,...
460761,2020,12,WY,"Electric Generators, Electric Utilities",Coal,2889631.0
460762,2020,12,WY,"Electric Generators, Electric Utilities",Hydroelectric Conventional,76328.0
460763,2020,12,WY,"Electric Generators, Electric Utilities",Natural Gas,24361.0
460764,2020,12,WY,"Electric Generators, Electric Utilities",Petroleum,4166.0


In [27]:
# Check for unique state mispelling and errors. 

energy_by_state['state'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'US-TOTAL', 'US-Total'],
      dtype=object)

In [28]:
# Drop US totals rows

energy_by_state = energy_by_state[energy_by_state['state'].str.contains('US-T') == False]

In [29]:
# check that rows are dropped 

energy_by_state['state'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'], dtype=object)

In [30]:
energy_by_state['type_of_producer'].unique()

array(['Total Electric Power Industry',
       'Electric Generators, Electric Utilities',
       'Combined Heat and Power, Electric Power',
       'Combined Heat and Power, Commercial Power',
       'Combined Heat and Power, Industrial Power',
       'Electric Generators, Independent Power Producers'], dtype=object)

In [31]:
# Drop totals rows

energy_by_state = energy_by_state[energy_by_state['type_of_producer'] != 'Total Electric Power Industry']

In [32]:
energy_by_state['energy_source'].unique()

array(['Coal', 'Petroleum', 'Natural Gas', 'Hydroelectric Conventional',
       'Wind', 'Total', 'Nuclear', 'Wood and Wood Derived Fuels',
       'Other Gases', 'Other Biomass', 'Other',
       'Solar Thermal and Photovoltaic', 'Pumped Storage', 'Geothermal'],
      dtype=object)

In [33]:
# Drop totals rows

energy_by_state = energy_by_state[energy_by_state['energy_source'] != 'Total']

In [34]:
# check for errors in years columns

energy_by_state['year'].unique()

array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

In [35]:
# check for all months in year

energy_by_state['month'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int64)

In [36]:
energy_by_state

Unnamed: 0,year,month,state,type_of_producer,energy_source,GENERATION (Megawatthours)
6,2001,1,AK,"Electric Generators, Electric Utilities",Coal,18410.0
7,2001,1,AK,"Electric Generators, Electric Utilities",Petroleum,64883.0
8,2001,1,AK,"Electric Generators, Electric Utilities",Natural Gas,305277.0
9,2001,1,AK,"Electric Generators, Electric Utilities",Hydroelectric Conventional,104549.0
10,2001,1,AK,"Electric Generators, Electric Utilities",Wind,87.0
...,...,...,...,...,...,...
460761,2020,12,WY,"Electric Generators, Electric Utilities",Coal,2889631.0
460762,2020,12,WY,"Electric Generators, Electric Utilities",Hydroelectric Conventional,76328.0
460763,2020,12,WY,"Electric Generators, Electric Utilities",Natural Gas,24361.0
460764,2020,12,WY,"Electric Generators, Electric Utilities",Petroleum,4166.0


In [32]:
# 267,456 rows

Save to cleaned monthly energy data to CSV

In [37]:
energy_by_state.to_csv(f'data/cleaned/monthly_energy_clean.csv', index = False)

#### Monthly energy Exploratory Data Analysis (EDA)

In [45]:
energy_by_state.describe()

Unnamed: 0,year,month,generation_mwh
count,267456.0,267456.0,267456.0
mean,2010.996635,6.508674,301838.5
std,5.76144,3.452656,1019125.0
min,2001.0,1.0,-316725.0
25%,2006.0,4.0,510.0
50%,2011.0,7.0,7869.0
75%,2016.0,10.0,71923.75
max,2020.0,12.0,18305640.0


In [53]:
energy_by_state.groupby('energy_source')['generation_mwh'].sum()

energy_source
Coal                              32497462630
Geothermal                          304760908
Hydroelectric Conventional         5426877005
Natural Gas                       21129267157
Nuclear                           15848626161
Other                               262421372
Other Biomass                       368584506
Other Gases                         250260689
Petroleum                          1080242337
Pumped Storage                     -127522458
Solar Thermal and Photovoltaic      379540433
Wind                               2533329054
Wood and Wood Derived Fuels         774680389
Name: generation_mwh, dtype: int64

In [97]:
southwest_states = ['NM', 'UT', 'CO', 'AZ', 'CA', 'NV', 'TX']

southwest_energy = energy_by_state[energy_by_state['state'].isin(southwest_states)]

In [98]:
southwest_energy[southwest_energy['year'] == 2020].groupby('energy_source')['generation_mwh'].sum()/(southwest_energy[southwest_energy['year'] == 2020]['generation_mwh'].sum())*100

energy_source
Coal                              15.918544
Geothermal                         1.656481
Hydroelectric Conventional         3.556911
Natural Gas                       48.416487
Nuclear                            9.478554
Other                              0.142491
Other Biomass                      0.351350
Other Gases                        0.443649
Petroleum                          0.012278
Pumped Storage                    -0.013410
Solar Thermal and Photovoltaic     5.947885
Wind                              13.637044
Wood and Wood Derived Fuels        0.451734
Name: generation_mwh, dtype: float64

In [99]:
southwest_energy[southwest_energy['year'] == 2001].groupby('energy_source')['generation_mwh'].sum()/(southwest_energy[southwest_energy['year'] == 2001]['generation_mwh'].sum())*100

energy_source
Coal                              36.134852
Geothermal                         1.668166
Hydroelectric Conventional         4.821841
Natural Gas                       42.173348
Nuclear                           12.338947
Other                              0.204081
Other Biomass                      0.284192
Other Gases                        0.333392
Petroleum                          0.910157
Pumped Storage                    -0.039933
Solar Thermal and Photovoltaic     0.066899
Wind                               0.583737
Wood and Wood Derived Fuels        0.520321
Name: generation_mwh, dtype: float64

In [100]:
renewables = ['Wind', 'Wood and Wood Derived Fuels', 'Other Biomass', 'Solar Thermal and Photovolatic', 'Hydroelectric Conventional', 'Geothermal']

renewable_energy = southwest_energy[southwest_energy['energy_source'].isin(renewables)]

In [101]:
renewable_energy[renewable_energy['year'] == 2020].groupby('energy_source')['generation_mwh'].sum()/(renewable_energy[renewable_energy['year'] == 2020]['generation_mwh'].sum())*100

energy_source
Geothermal                      8.428419
Hydroelectric Conventional     18.098088
Other Biomass                   1.787720
Wind                           69.387283
Wood and Wood Derived Fuels     2.298490
Name: generation_mwh, dtype: float64

In [102]:
renewable_energy[renewable_energy['year'] == 2001].groupby('energy_source')['generation_mwh'].sum()/(renewable_energy[renewable_energy['year'] == 2001]['generation_mwh'].sum())*100

energy_source
Geothermal                     21.174303
Hydroelectric Conventional     61.204412
Other Biomass                   3.607294
Wind                            7.409473
Wood and Wood Derived Fuels     6.604518
Name: generation_mwh, dtype: float64

### Cleaning US Hydroelectric Plant Data

Original US Hydroelectric Plant data was downloaded as a CSV from the Oak Ridge National Laboratory- https://hydrosource.ornl.gov/dataset/EHA2022.

In [3]:
active_hy_plants = pd.read_csv('Data/active_hydroelectric_plants.csv')

In [4]:
active_hy_plants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2321 entries, 0 to 2320
Data columns (total 28 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   EHA_PtID         2321 non-null   object 
 1   PtName           2321 non-null   object 
 2   County           2306 non-null   object 
 3   State            2321 non-null   object 
 4   Lat              2321 non-null   float64
 5   Lon              2321 non-null   float64
 6   Pt_Own           2314 non-null   object 
 7   OwType           2314 non-null   object 
 8   Dam_Own          2161 non-null   object 
 9   Type             2321 non-null   object 
 10  EIA_PtID         1681 non-null   float64
 11  Mode             1526 non-null   object 
 12  Number_of_Units  2321 non-null   int64  
 13  CH_MW            2293 non-null   float64
 14  CH_MWh           1991 non-null   float64
 15  CH_Pf            1983 non-null   float64
 16  CH_OpYear        1497 non-null   float64
 17  PS_MW         

In [5]:
active_hy_plants.head()

Unnamed: 0,EHA_PtID,PtName,County,State,Lat,Lon,Pt_Own,OwType,Dam_Own,Type,...,PS_MWh,PS_Pf,PS_OpYear,Water,HUC,NID_ID,ReEDSPCA,NERC,Sector,Trans
0,hc1428_p01,Squa Pan Hydro Station,Aroostook,ME,46.5563,-68.3257,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,...,,,,Squa Pan Stream,10100040000.0,ME00234,134,NPCC,IPP Non-CHP,Emera Maine
1,hc1427_p01,Caribou Generation Station,Aroostook,ME,46.8488,-68.0022,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,...,,,,Aroostook River,10100040000.0,ME00227,134,NPCC,IPP Non-CHP,Emera Maine
2,hc1497_p01,McKay,Piscataquis,ME,45.8815,-69.1767,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,,,,West Branch Penobscot River,10200010000.0,ME00204,134,NPCC,IPP Non-CHP,Emera Maine
3,hc1495_p01,North Twin,Penobscot,ME,45.6346,-68.7813,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,,,,West Branch Penobscot River,10200010000.0,ME00203,134,NPCC,IPP Non-CHP,Emera Maine
4,hc1494_p01,Millinocket,Penobscot,ME,45.6374,-68.73,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,,,,West Branch Penobscot River,10200010000.0,ME00202,134,NPCC,IPP Non-CHP,Emera Maine


In [6]:
# drop columns not needed for analysis

active_hy_plants.drop(['HUC', 'NID_ID', 'ReEDSPCA', 'NERC', 'PS_MW', 'PS_MWh', 'PS_Pf', 'PS_OpYear'], axis = 1, inplace = True)

In [7]:
# rename columns

active_hy_plants.rename(columns = {'EHA_PtID':'eha_id', 'PtName':'plant_name', 'Pt_Own':'plant_owner', 'OwType':'owner_type', 'Dam_Own':'dam_owner', 'EIA_PtID':'eia_id', 'CH_MW':'capacity_mw', 'CH_MWh':'avg_generation_mwh', 'CH_Pf':'percent_capacity', 'CH_OpYear':'year_online_hy','Water':'waterway', 'Trans':'dist_owner'}, inplace = True)

In [8]:
active_hy_plants.isnull().sum()

eha_id                  0
plant_name              0
County                 15
State                   0
Lat                     0
Lon                     0
plant_owner             7
owner_type              7
dam_owner             160
Type                    0
eia_id                640
Mode                  795
Number_of_Units         0
capacity_mw            28
avg_generation_mwh    330
percent_capacity      338
year_online_hy        824
waterway               28
Sector                800
dist_owner            809
dtype: int64

In [9]:
active_hy_plants['Sector'].unique()

array(['IPP Non-CHP', nan, 'Electric Utility', 'Industrial CHP',
       'Industrial Non-CHP', 'Commercial CHP', 'Industrial',
       'Commercial Non-CHP', 'IPP'], dtype=object)

In [15]:
active_hy_plants['Type'].unique()

array(['HY', 'HY/PS'], dtype=object)

In [14]:
# Drop PS (pumped storage) only plants

active_hy_plants[active_hy_plants['Type'] == 'PS']

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner


In [12]:
active_hy_plants = active_hy_plants[active_hy_plants['Type'] != 'PS']

In [13]:
active_hy_plants

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner
0,hc1428_p01,Squa Pan Hydro Station,Aroostook,ME,46.556300,-68.325700,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,1516.0,Peaking,1,1.500,881.4000,6.707763,1941.0,Squa Pan Stream,IPP Non-CHP,Emera Maine
1,hc1427_p01,Caribou Generation Station,Aroostook,ME,46.848800,-68.002200,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,1513.0,Run-of-river,2,0.800,3765.3700,53.729595,1926.0,Aroostook River,IPP Non-CHP,Emera Maine
2,hc1497_p01,McKay,Piscataquis,ME,45.881500,-69.176700,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Peaking,3,37.600,197413.5164,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
3,hc1495_p01,North Twin,Penobscot,ME,45.634600,-68.781300,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Peaking,3,9.600,50403.4510,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
4,hc1494_p01,Millinocket,Penobscot,ME,45.637400,-68.730000,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Run-of-river/Upstream Peaking,8,41.100,215789.7746,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2316,hc9152_p01,Hidden Falls Lake,Sitka,AK,57.217000,-134.873000,Alaska Dept Of Fish And Games,Municipal,,HY,,Canal/Conduit,2,0.330,1401.6000,,1985.0,Hidden Falls Lake,,
2317,hc9153_p01,Wallowa Lake County Service District Hydro Sta...,Wallowa,OR,45.281000,-117.215000,Wallowa Resources Community Solutions Inc.,Private non-utility,,HY,,Canal/Conduit,1,0.020,,,2020.0,State Park Spring,,
2318,hc9154_p01,B24 Hydroelectric Station Project,Los Angeles,CA,34.035881,-117.970986,San Gabriel Valley Water Company,Private non-utility,,HY,,Canal/Conduit,1,0.150,1200.0000,,2019.0,Ground Water,,
2319,hc9155_p01,Pioneer Valley Hydro Site Project,Gunnison,CO,38.289386,-107.590528,"Pioneer Valley, LLC",Private non-utility,,HY,,Canal/Conduit,2,0.006,45.9000,,2020.0,Un-named spring,,


In [16]:
active_hy_plants['Type'].unique()

array(['HY', 'HY/PS'], dtype=object)

In [17]:
active_hy_plants.isnull().sum()

eha_id                  0
plant_name              0
County                 15
State                   0
Lat                     0
Lon                     0
plant_owner             7
owner_type              7
dam_owner             160
Type                    0
eia_id                639
Mode                  788
Number_of_Units         0
capacity_mw             0
avg_generation_mwh    302
percent_capacity      310
year_online_hy        796
waterway               28
Sector                798
dist_owner            807
dtype: int64

In [16]:
# check for plants where generation is null

active_hy_plants[active_hy_plants['avg_generation_mwh'].isnull()]

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner
8,hc7504_p01,Corriveau,Oxford,ME,45.111631,-69.448711,"GREEN POWER USA, LLC",Industrial,"GREEN POWER USA, LLC",HY,,Run-of-river,3,0.3500,,,,Swift River,,
9,hc7145_p01,Moosehead,Piscataquis,ME,45.183600,-69.230400,Town of Dover-Foxcroft,Publicly Owned Utility,Town of Dover-Foxcroft,HY,,,1,0.3000,,,,Piscataquis River,,
12,hc7131_p01,Milo,Piscataquis,ME,45.251100,-68.988300,Ridgewood Maine Hydro Partners LP,Private Non-utility,Ridgewood Maine Hydro Partners LP,HY,,,3,0.6950,,,,Sebec River,,
21,hc7305_p01,Foss Mill,Waldo,ME,44.531600,-69.145500,Lesia Sochor,Private Non-utility,Lesia Sochor,HY,,,1,0.0150,,,,Marsh Stream,,
37,hc7071_p01,Waverly Avenue,Somerset,ME,44.794700,-69.386900,Christopher M. Anthony,Private Non-utility,Christopher M. Anthony,HY,,,1,0.4000,,,,Sebasticook River,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2296,hc9021_p01,Kohala Ditch Hydro,,HI,20.240462,-155.838224,,,,HY,,Canal/Conduit,1,0.0400,,,,,,
2297,hc9022_p01,Hoowaiwai Farms Hydro,,HI,19.794001,-155.125753,,,,HY,,,1,0.0800,,,,,,
2298,hc9023_p01,Ainako Hydro,,HI,19.711143,-155.123611,,,,HY,,,1,0.0067,,,,,,
2315,hc9151_p01,Enery Recovery Phase 1,Umatilla,OR,45.675921,-118.767138,"City Of Pendleton, or",Municipal,,HY,,Canal/Conduit,4,0.4000,,,2012.0,Umatilla River,,


In [17]:
# check for plants where eia_id is null

active_hy_plants[active_hy_plants['eia_id'].isnull()]

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner
8,hc7504_p01,Corriveau,Oxford,ME,45.111631,-69.448711,"GREEN POWER USA, LLC",Industrial,"GREEN POWER USA, LLC",HY,,Run-of-river,3,0.350,,,,Swift River,,
9,hc7145_p01,Moosehead,Piscataquis,ME,45.183600,-69.230400,Town of Dover-Foxcroft,Publicly Owned Utility,Town of Dover-Foxcroft,HY,,,1,0.300,,,,Piscataquis River,,
10,hc7126_p01,Brown's Mill,Piscataquis,ME,45.183500,-69.219400,Ridgewood Maine Hydro Partners LP,Private Non-utility,Ridgewood Maine Hydro Partners LP,HY,,,2,0.594,2500.0,48.045139,,Piscataquis River,,
11,hc7251_p01,Sebec,Piscataquis,ME,45.270200,-69.116000,Ampersand Sebec Lake Hydro LLC,Private Non-utility,Ampersand Sebec Lake Hydro LLC,HY,,,2,0.867,2160.0,28.440063,,Sebec River,,
12,hc7131_p01,Milo,Piscataquis,ME,45.251100,-68.988300,Ridgewood Maine Hydro Partners LP,Private Non-utility,Ridgewood Maine Hydro Partners LP,HY,,,3,0.695,,,,Sebec River,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2316,hc9152_p01,Hidden Falls Lake,Sitka,AK,57.217000,-134.873000,Alaska Dept Of Fish And Games,Municipal,,HY,,Canal/Conduit,2,0.330,1401.6,,1985.0,Hidden Falls Lake,,
2317,hc9153_p01,Wallowa Lake County Service District Hydro Sta...,Wallowa,OR,45.281000,-117.215000,Wallowa Resources Community Solutions Inc.,Private non-utility,,HY,,Canal/Conduit,1,0.020,,,2020.0,State Park Spring,,
2318,hc9154_p01,B24 Hydroelectric Station Project,Los Angeles,CA,34.035881,-117.970986,San Gabriel Valley Water Company,Private non-utility,,HY,,Canal/Conduit,1,0.150,1200.0,,2019.0,Ground Water,,
2319,hc9155_p01,Pioneer Valley Hydro Site Project,Gunnison,CO,38.289386,-107.590528,"Pioneer Valley, LLC",Private non-utility,,HY,,Canal/Conduit,2,0.006,45.9,,2020.0,Un-named spring,,


In [22]:
# check for plants where owner_type is null

In [23]:
active_hy_plants[active_hy_plants['owner_type'].isnull()]

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner
2287,hc9012_p01,Lutak,,AK,59.279287,-135.477053,,,,HY,,,1,0.285,,,,,,
2289,hc9014_p01,McRoberts Creek,,AK,59.2867,-135.68,,,,HY,,Run-of-river,1,0.125,,,,,,
2290,hc9015_p01,Ten Mile,,AK,52.19,-174.197,,,,HY,,,1,0.6,,,,,,
2295,hc9020_p01,Waimea/Waikoloa Pipeline,,HI,20.038657,-155.674807,,,,HY,,Canal/Conduit,1,0.04,,,,,,
2296,hc9021_p01,Kohala Ditch Hydro,,HI,20.240462,-155.838224,,,,HY,,Canal/Conduit,1,0.04,,,,,,
2297,hc9022_p01,Hoowaiwai Farms Hydro,,HI,19.794001,-155.125753,,,,HY,,,1,0.08,,,,,,
2298,hc9023_p01,Ainako Hydro,,HI,19.711143,-155.123611,,,,HY,,,1,0.0067,,,,,,


In [24]:
active_hy_plants.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2293 entries, 0 to 2320
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   eha_id              2293 non-null   object 
 1   plant_name          2293 non-null   object 
 2   County              2278 non-null   object 
 3   State               2293 non-null   object 
 4   Lat                 2293 non-null   float64
 5   Lon                 2293 non-null   float64
 6   plant_owner         2286 non-null   object 
 7   owner_type          2286 non-null   object 
 8   dam_owner           2133 non-null   object 
 9   Type                2293 non-null   object 
 10  eia_id              1654 non-null   float64
 11  Mode                1505 non-null   object 
 12  Number_of_Units     2293 non-null   int64  
 13  capacity_mw         2293 non-null   float64
 14  avg_generation_mwh  1991 non-null   float64
 15  percent_capacity    1983 non-null   float64
 16  year_o

In [25]:
active_hy_plants.dtypes

eha_id                 object
plant_name             object
County                 object
State                  object
Lat                   float64
Lon                   float64
plant_owner            object
owner_type             object
dam_owner              object
Type                   object
eia_id                float64
Mode                   object
Number_of_Units         int64
capacity_mw           float64
avg_generation_mwh    float64
percent_capacity      float64
year_online_hy        float64
waterway               object
Sector                 object
dist_owner             object
dtype: object

In [None]:
active_hy_plants.isnull().sum()

In [145]:
active_hy_plants.drop_duplicates()

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,eia_id,Mode,Number_of_Units,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner
0,hc1428_p01,Squa Pan Hydro Station,Aroostook,ME,46.556300,-68.325700,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,1516.0,Peaking,1,1.500,881.4000,6.707763,1941.0,Squa Pan Stream,IPP Non-CHP,Emera Maine
1,hc1427_p01,Caribou Generation Station,Aroostook,ME,46.848800,-68.002200,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,1513.0,Run-of-river,2,0.800,3765.3700,53.729595,1926.0,Aroostook River,IPP Non-CHP,Emera Maine
2,hc1497_p01,McKay,Piscataquis,ME,45.881500,-69.176700,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Peaking,3,37.600,197413.5164,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
3,hc1495_p01,North Twin,Penobscot,ME,45.634600,-68.781300,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Peaking,3,9.600,50403.4510,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
4,hc1494_p01,Millinocket,Penobscot,ME,45.637400,-68.730000,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,54134.0,Run-of-river/Upstream Peaking,8,41.100,215789.7746,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2316,hc9152_p01,Hidden Falls Lake,Sitka,AK,57.217000,-134.873000,Alaska Dept Of Fish And Games,Municipal,,HY,,Canal/Conduit,2,0.330,1401.6000,,1985.0,Hidden Falls Lake,,
2317,hc9153_p01,Wallowa Lake County Service District Hydro Sta...,Wallowa,OR,45.281000,-117.215000,Wallowa Resources Community Solutions Inc.,Private non-utility,,HY,,Canal/Conduit,1,0.020,,,2020.0,State Park Spring,,
2318,hc9154_p01,B24 Hydroelectric Station Project,Los Angeles,CA,34.035881,-117.970986,San Gabriel Valley Water Company,Private non-utility,,HY,,Canal/Conduit,1,0.150,1200.0000,,2019.0,Ground Water,,
2319,hc9155_p01,Pioneer Valley Hydro Site Project,Gunnison,CO,38.289386,-107.590528,"Pioneer Valley, LLC",Private non-utility,,HY,,Canal/Conduit,2,0.006,45.9000,,2020.0,Un-named spring,,


In [18]:
# check for errors in county spelling

active_hy_plants['County'].unique()

array(['Aroostook', 'Piscataquis', 'Penobscot', 'Oxford', 'Waldo',
       'Somerset', 'Franklin', 'Kennebec', 'Somerset, Waldo', 'Coos',
       'Androscoggin', 'Cumberland', 'Washington', 'Hancock', 'Knox',
       'Lincoln', 'Carroll', 'York', 'Strafford', 'Grafton', 'Merrimack',
       'Belknap', 'Cheshire', 'Hillsborough', 'Worcester', 'Middlesex',
       'MERRIMACK', 'Essex', 'Caledonia', 'Orange', 'Windsor', 'Rutland',
       'Sullivan', 'Windham', 'Bedford, Pittsylvania', 'Hampden',
       'Bennington', 'Elbert', 'Hampshire', 'Hartford', 'HARTFORD',
       'Suffolk', 'Plymouth', 'Providence', 'Kent', 'New London',
       'Tolland', 'New Haven', 'Dalton', 'Pittsfield', 'Berkshire',
       'Litchfield', 'Putnam, Hancock', 'Fairfield', 'Warren', 'Hamilton',
       'Fulton', 'Saratoga', 'Warren, Saratoga', 'Rensselaer', 'Albany',
       'Herkimer', 'Saratoga, Schenectady', 'Murray', 'Columbia',
       'Greene', 'Ulster', 'Dutchess', 'Putnam', 'Passaic', 'Pike',
       'Cherokee', 'Mon

In [20]:
# update county name to lower case only to compare to fips data 

active_hy_plants['County'] = active_hy_plants['County'].str.lower()

In [21]:
# remove any counties with 'county' at the end of name

active_hy_plants['County']= active_hy_plants['County'].str.replace(' county', '')

In [22]:
# split any counties that contain two counties

active_hy_plants[['county_1','county_2']]= active_hy_plants['County'].str.split(',', expand=True)

In [23]:
active_hy_plants['county_1'].unique()

array(['aroostook', 'piscataquis', 'penobscot', 'oxford', 'waldo',
       'somerset', 'franklin', 'kennebec', 'coos', 'androscoggin',
       'cumberland', 'washington', 'hancock', 'knox', 'lincoln',
       'carroll', 'york', 'strafford', 'grafton', 'merrimack', 'belknap',
       'cheshire', 'hillsborough', 'worcester', 'middlesex', 'essex',
       'caledonia', 'orange', 'windsor', 'rutland', 'sullivan', 'windham',
       'bedford', 'hampden', 'bennington', 'elbert', 'hampshire',
       'hartford', 'suffolk', 'plymouth', 'providence', 'kent',
       'new london', 'tolland', 'new haven', 'dalton', 'pittsfield',
       'berkshire', 'litchfield', 'putnam', 'fairfield', 'warren',
       'hamilton', 'fulton', 'saratoga', 'rensselaer', 'albany',
       'herkimer', 'murray', 'columbia', 'greene', 'ulster', 'dutchess',
       'passaic', 'pike', 'cherokee', 'monroe', 'montgomery', 'otsego',
       'lackawanna', 'tioga', 'dauphin', 'huntingdon', 'lancaster',
       'chester', 'ralls', 'harford', 

In [24]:
# remove the word 'division' from the end of any county names to match fips data

active_hy_plants['county_1']= active_hy_plants['county_1'].str.replace(' division', '')

In [25]:
active_hy_plants['county_1'].unique()

array(['aroostook', 'piscataquis', 'penobscot', 'oxford', 'waldo',
       'somerset', 'franklin', 'kennebec', 'coos', 'androscoggin',
       'cumberland', 'washington', 'hancock', 'knox', 'lincoln',
       'carroll', 'york', 'strafford', 'grafton', 'merrimack', 'belknap',
       'cheshire', 'hillsborough', 'worcester', 'middlesex', 'essex',
       'caledonia', 'orange', 'windsor', 'rutland', 'sullivan', 'windham',
       'bedford', 'hampden', 'bennington', 'elbert', 'hampshire',
       'hartford', 'suffolk', 'plymouth', 'providence', 'kent',
       'new london', 'tolland', 'new haven', 'dalton', 'pittsfield',
       'berkshire', 'litchfield', 'putnam', 'fairfield', 'warren',
       'hamilton', 'fulton', 'saratoga', 'rensselaer', 'albany',
       'herkimer', 'murray', 'columbia', 'greene', 'ulster', 'dutchess',
       'passaic', 'pike', 'cherokee', 'monroe', 'montgomery', 'otsego',
       'lackawanna', 'tioga', 'dauphin', 'huntingdon', 'lancaster',
       'chester', 'ralls', 'harford', 

In [26]:
# concatenate State & county to merge fips data

active_hy_plants['county_state'] = active_hy_plants['county_1'].str.cat(active_hy_plants[['State']], sep=',')

In [27]:
active_hy_plants[active_hy_plants['county_1'].isnull()]

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,...,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner,county_1,county_2,county_state
2274,hc7353_p01,Eklutna Energy Recovery Sta,,AK,61.2194,-149.7292,Municipality of Anchorage,Publicly Owned Utility,Municipality of Anchorage,HY,...,0.75,,,,ANCHORAGE WATER & WASTEWATER PL,,,,,
2279,hc7576_p01,Gartina Falls,,AK,58.06991,-135.382897,Inside Passage Electric Cooperative,Cooperative,Inside Passage Electric Cooperative,HY,...,0.45,1810.0,45.915779,,GARTINA CREEK,,,,,
2286,hc9011_p01,Larsen Bay,,AK,57.513,-153.987,"City of Larsen Bay, AK",Publicly Owned Utility,"City of Larsen Bay, AK",HY,...,0.475,,,,,,,,,
2287,hc9012_p01,Lutak,,AK,59.279287,-135.477053,,,,HY,...,0.285,,,,,,,,,
2288,hc9013_p01,Town Creek,,AK,54.126,-165.774,City of Akutan,Publicly Owned Utility,City of Akutan,HY,...,0.1,775.0,88.47032,,,,,,,
2289,hc9014_p01,McRoberts Creek,,AK,59.2867,-135.68,,,,HY,...,0.125,,,,,,,,,
2290,hc9015_p01,Ten Mile,,AK,52.19,-174.197,,,,HY,...,0.6,,,,,,,,,
2291,hc9016_p01,Chuniisax Creek,,AK,61.6,-148.96,City of Atka,Publicly Owned Utility,City of Atka,HY,...,0.284,,,,,,,,,
2292,hc9017_p01,Southfork Hydro,,AK,61.27,-149.47,Alaska Power & Telephone Co.,Investor-Owned Utility,Alaska Power & Telephone Co.,HY,...,1.2,,,,,,,,,
2293,hc9018_p01,Kahaluu Shaft Tank,,HI,19.581673,-155.954419,Department of Water Supply,Publicly Owned Utility,Department of Water Supply,HY,...,0.04,,,,,,,,,


#### Join Hydropower Plant data with County data so that dataframe include FIPS county codes

In [28]:
# import csv with fips codes to join with plant data. https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697

county_codes = pd.io.parsers.read_csv('data/county_codes/us_county_codes.csv', dtype={'FIPS': 'str'})

In [29]:
county_codes

Unnamed: 0,FIPS,Name,State
0,01001,Autauga,AL
1,01003,Baldwin,AL
2,01005,Barbour,AL
3,01007,Bibb,AL
4,01009,Blount,AL
...,...,...,...
3227,72151,Yabucoa,PR
3228,72153,Yauco,PR
3229,78010,St. Croix,VI
3230,78020,St. John,VI


In [30]:
county_codes['Name'] = county_codes['Name'].str.lower()

In [31]:
county_codes['county_state'] = county_codes['Name'].str.cat(county_codes[['State']], sep=',')

In [32]:
county_codes

Unnamed: 0,FIPS,Name,State,county_state
0,01001,autauga,AL,"autauga,AL"
1,01003,baldwin,AL,"baldwin,AL"
2,01005,barbour,AL,"barbour,AL"
3,01007,bibb,AL,"bibb,AL"
4,01009,blount,AL,"blount,AL"
...,...,...,...,...
3227,72151,yabucoa,PR,"yabucoa,PR"
3228,72153,yauco,PR,"yauco,PR"
3229,78010,st. croix,VI,"st. croix,VI"
3230,78020,st. john,VI,"st. john,VI"


In [33]:
county_codes.drop(['Name','State'], axis =1, inplace= True)

In [34]:
county_codes

Unnamed: 0,FIPS,county_state
0,01001,"autauga,AL"
1,01003,"baldwin,AL"
2,01005,"barbour,AL"
3,01007,"bibb,AL"
4,01009,"blount,AL"
...,...,...
3227,72151,"yabucoa,PR"
3228,72153,"yauco,PR"
3229,78010,"st. croix,VI"
3230,78020,"st. john,VI"


In [35]:
active_hy_plants

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,...,capacity_mw,avg_generation_mwh,percent_capacity,year_online_hy,waterway,Sector,dist_owner,county_1,county_2,county_state
0,hc1428_p01,Squa Pan Hydro Station,aroostook,ME,46.556300,-68.325700,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,...,1.500,881.4000,6.707763,1941.0,Squa Pan Stream,IPP Non-CHP,Emera Maine,aroostook,,"aroostook,ME"
1,hc1427_p01,Caribou Generation Station,aroostook,ME,46.848800,-68.002200,WPS New England Generation Inc,Wholesale Power Marketer,WPS New England Generation Inc,HY,...,0.800,3765.3700,53.729595,1926.0,Aroostook River,IPP Non-CHP,Emera Maine,aroostook,,"aroostook,ME"
2,hc1497_p01,McKay,piscataquis,ME,45.881500,-69.176700,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,37.600,197413.5164,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine,piscataquis,,"piscataquis,ME"
3,hc1495_p01,North Twin,penobscot,ME,45.634600,-68.781300,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,9.600,50403.4510,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine,penobscot,,"penobscot,ME"
4,hc1494_p01,Millinocket,penobscot,ME,45.637400,-68.730000,Great Lakes Hydro America LLC,Wholesale Power Marketer,Great Lakes Hydro America LLC,HY,...,41.100,215789.7746,59.935610,1917.0,West Branch Penobscot River,IPP Non-CHP,Emera Maine,penobscot,,"penobscot,ME"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2316,hc9152_p01,Hidden Falls Lake,sitka,AK,57.217000,-134.873000,Alaska Dept Of Fish And Games,Municipal,,HY,...,0.330,1401.6000,,1985.0,Hidden Falls Lake,,,sitka,,"sitka,AK"
2317,hc9153_p01,Wallowa Lake County Service District Hydro Sta...,wallowa,OR,45.281000,-117.215000,Wallowa Resources Community Solutions Inc.,Private non-utility,,HY,...,0.020,,,2020.0,State Park Spring,,,wallowa,,"wallowa,OR"
2318,hc9154_p01,B24 Hydroelectric Station Project,los angeles,CA,34.035881,-117.970986,San Gabriel Valley Water Company,Private non-utility,,HY,...,0.150,1200.0000,,2019.0,Ground Water,,,los angeles,,"los angeles,CA"
2319,hc9155_p01,Pioneer Valley Hydro Site Project,gunnison,CO,38.289386,-107.590528,"Pioneer Valley, LLC",Private non-utility,,HY,...,0.006,45.9000,,2020.0,Un-named spring,,,gunnison,,"gunnison,CO"


In [36]:
# join plant data with fips data on county_state name to include fips code in dataframe

hy_plants_merged  = pd.merge(left = active_hy_plants, right = county_codes, how = "left", on= 'county_state')

In [169]:
# save joined dataframe as csv 

hy_plants_merged.to_csv(f'data/hy_plants_merged.csv', index = False)

In [173]:
# additional county/fips cleaning in excel --> hy_plants_final in cleaned folder is final, 2217 rows

hy_plants_final = pd.io.parsers.read_csv('data/cleaned/hy_plants_final.csv', dtype={'FIPS': 'str'})

In [31]:
hy_plants_final

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,...,Number_of_Units,capacity_mw,avg_hy_generation,percent_capacity,year_online_hy,waterway,Sector,dist_owner,county_2,FIPS
0,hc2133_p01,Rocky River,abbeville,SC,34.257700,-82.609500,Abbeville City of,Publicly Owned Utility,Abbeville City of,HY,...,2,2.600,4940.22,21.690464,1941.0,Rocky River,Electric Utility,City of Abbeville - (SC),,45001
1,hc0005_p01,Boise R Diversion,ada,ID,43.537685,-116.093756,Reclamation Pacific Northwest Region (PN),Reclamation,Reclamation Pacific Northwest Region (PN),HY,...,3,3.300,8320.35,28.782171,2004.0,Boise River,Electric Utility,Idaho Power Co,,16001
2,hc0135_p01,Mora Drop Hydroelectric Project,ada,ID,43.459700,-116.472600,Boise-Kuna Irrigation District,Publicly Owned Utility,U S Bureau of Reclamation,HY,...,1,1.700,4277.69,28.724752,2006.0,Mora Canal,IPP Non-CHP,Idaho Power Co,,16001
3,hc0182_p01,Main Canal No. 6,ada,ID,43.490468,-116.393658,Boise Project Board of Control,Publicly Owned Utility,,HY,...,3,1.055,,,,MAIN CANAL,,,,16001
4,hc0321_p01,Lucky Peak,ada,ID,43.546700,-116.055000,Boise-Kuna Irrigation District,Publicly Owned Utility,CENWW,HY,...,3,101.200,322896.59,36.423262,1988.0,Boise River,IPP Non-CHP,Idaho Power Co,,16001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2212,hc1337_p02,Fish Power,yuba,CA,39.392100,-121.141100,Yuba County Water Agency,Publicly Owned Utility,Yuba County Water Agency,HY,...,1,0.150,1071.00,81.506849,,North Yuba River,,,,6115
2213,hc1337_p03,Narrows 2,yuba,CA,39.392100,-121.141100,Yuba County Water Agency,Publicly Owned Utility,Yuba County Water Agency,HY,...,1,46.700,180492.94,44.120379,1969.0,North Yuba River,Electric Utility,Pacific Gas & Electric Co,,6115
2214,hc1970_p01,Deadwood Creek,yuba,CA,39.547800,-121.094700,"Hydro Sierra Energy, LLC",Publicly Owned Utility,"Hydro Sierra Energy, LLC",HY,...,1,2.000,2100.50,11.989155,1993.0,Deadwood Creek,Electric Utility,Pacific Gas & Electric Co,,6115
2215,hc7027_p01,Virginia Ranch,yuba,CA,39.323600,-121.309700,Browns Valley Irrigation District,Publicly Owned Utility,Browns Valley Irrigation District,HY,...,2,1.000,,,,French Dry Creek,,,,6115


In [183]:
hy_plants_final[hy_plants_final['State'].isin(southwest_states)]

Unnamed: 0,eha_id,plant_name,County,State,Lat,Lon,plant_owner,owner_type,dam_owner,Type,...,Number_of_Units,capacity_mw,avg_hy_generation,percent_capacity,year_online_hy,waterway,Sector,dist_owner,county_2,FIPS
19,hc7239_p01,Lower Cold Springs Powerhouse,alameda,CA,41.929800,-122.352500,J. N. and H.E. Foster and R.Z. Walker,Private Non-utility,J. N. and H.E. Foster and R.Z. Walker,HY,...,1,0.09000,,,,LOWER COLD SPRINGS,,,,6001
20,hc7257_p01,CPUD Pipeline,alameda,CA,38.266700,-120.633300,Calaveras Public Utility District,Publicly Owned Utility,Calaveras Public Utility District,HY,...,3,0.27975,,,,JEFF DAVIS RESERVOIR,,,,6001
21,hc7445_p01,WTP Powerhouse No. 2,alameda,CA,37.541700,-121.925200,Alameda County Water District,Publicly Owned Utility,Alameda County Water District,HY,...,6,1.25000,6740.00,61.552511,,SOUTH BAY AQUEDUCT,,,,6001
22,hc7110_p01,San Luis Obispo Powerhouse,alameda,CA,35.327000,-120.666500,City of San Luis Obispo,Publicly Owned Utility,City of San Luis Obispo,HY,...,1,0.68000,,,,WATER SUP PIPELINE,,,,6001
37,hc9037_p01,CHYDRO,alpine,CA,38.774211,-119.793056,South Tahoe Public Utility District,Publicly Owned Utility,,HY,...,1,0.05500,,,,Conduit/Canal,,,,6003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2212,hc1337_p02,Fish Power,yuba,CA,39.392100,-121.141100,Yuba County Water Agency,Publicly Owned Utility,Yuba County Water Agency,HY,...,1,0.15000,1071.00,81.506849,,North Yuba River,,,,6115
2213,hc1337_p03,Narrows 2,yuba,CA,39.392100,-121.141100,Yuba County Water Agency,Publicly Owned Utility,Yuba County Water Agency,HY,...,1,46.70000,180492.94,44.120379,1969.0,North Yuba River,Electric Utility,Pacific Gas & Electric Co,,6115
2214,hc1970_p01,Deadwood Creek,yuba,CA,39.547800,-121.094700,"Hydro Sierra Energy, LLC",Publicly Owned Utility,"Hydro Sierra Energy, LLC",HY,...,1,2.00000,2100.50,11.989155,1993.0,Deadwood Creek,Electric Utility,Pacific Gas & Electric Co,,6115
2215,hc7027_p01,Virginia Ranch,yuba,CA,39.323600,-121.309700,Browns Valley Irrigation District,Publicly Owned Utility,Browns Valley Irrigation District,HY,...,2,1.00000,,,,French Dry Creek,,,,6115


In [181]:
hy_plants_final[hy_plants_final['State'].isin(southwest_states)].groupby('State')['plant_name'].count()

State
AZ     14
CA    388
CO     92
NM      7
NV     13
TX     26
UT     69
Name: plant_name, dtype: int64

In [29]:
hy_plants_final.dtypes

eha_id                object
plant_name            object
County                object
State                 object
Lat                  float64
Lon                  float64
plant_owner           object
owner_type            object
dam_owner             object
Type                  object
eia_id               float64
Mode                  object
Number_of_Units        int64
capacity_mw          float64
avg_hy_generation    float64
percent_capacity     float64
year_online_hy       float64
waterway              object
Sector                object
dist_owner            object
county_2              object
FIPS                  object
dtype: object

In [None]:
# Remove eia_id nulls for California plants join

hy_plants_final.dropna(subset = 'eia_id', inplace = True)

In [33]:
hy_plants_final['eia_id'] = hy_plants_final['eia_id'].astype(int)

In [34]:
hy_plants_final.dtypes

eha_id                object
plant_name            object
County                object
State                 object
Lat                  float64
Lon                  float64
plant_owner           object
owner_type            object
dam_owner             object
Type                  object
eia_id                 int32
Mode                  object
Number_of_Units        int64
capacity_mw          float64
avg_hy_generation    float64
percent_capacity     float64
year_online_hy       float64
waterway              object
Sector                object
dist_owner            object
county_2              object
FIPS                  object
dtype: object

In [35]:
hy_plants_final.to_csv(f'data/cleaned/hy_plants_for_join.csv', index = False)

### California Hydropower Plants

California Plant data was downloaded from https://ww2.energy.ca.gov/almanac/renewables_data/hydro/index_cms.php

In [38]:
# import 'California_Large_Hydropower' csv 

CA_large_hy_plants = pd.read_csv('data/California_Large_Hydropower.csv')

In [4]:
CA_large_hy_plants

Unnamed: 0,Year,Company_Name,Plant_ID,Plant_Name,State,Capacity_(MW),Gross_MWh,Net_MWh
0,,,,,,,,
1,2021.0,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260
2,2021.0,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867
3,2021.0,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322
4,2021.0,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424,154911,-18027
...,...,...,...,...,...,...,...,...
1498,2001.0,United States Bureau of Reclamation,H0475,Shasta,CA,714,1648574,1648574
1499,2001.0,United States Bureau of Reclamation,H0493,Spring Creek,CA,180,452757,452652
1500,2001.0,United States Bureau of Reclamation,H0520,Trinity,CA,140,403532,379480
1501,2001.0,Yuba County Water Agency,H0352,Colgate,CA,315,619469,619469


In [5]:
# Add column to identify 'large' plants

CA_large_hy_plants['type']= 'Large'

In [6]:
# import 'California_Small_Hydropower' csv

CA_small_hy_plants = pd.read_csv('data/California_Small_Hydropower.csv')

In [7]:
CA_small_hy_plants

Unnamed: 0,Year,Company_Name,Plant_ID,Plant_Name,State,Capacity_(MW),Gross_MWh,Net_MWh
0,,,,,,,,
1,2021.0,Big Creek Water Works Ltd,H0037,Big Creek Water Works,CA,5.0,4311,4311
2,2021.0,Calaveras County Water District,H0073,Hogan,CA,3.0,277,277
3,2021.0,California Department of Water Resources,H0058,Alamo,CA,19.7,13879,13171
4,2021.0,California Department of Water Resources,H0511,Thermalito Diversion Dam,CA,3.0,10910,10860
...,...,...,...,...,...,...,...,...
4140,2001.0,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188
4141,2001.0,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270
4142,2001.0,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605
4143,2001.0,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0


In [8]:
# add column to identify 'small' plants

CA_small_hy_plants['type']= 'Small'

In [9]:
# Union large & small plants csvs

CA_hy_plants = pd.concat([CA_large_hy_plants, CA_small_hy_plants])

In [14]:
CA_hy_plants

Unnamed: 0,Year,Company_Name,Plant_ID,Plant_Name,State,Capacity_(MW),Gross_MWh,Net_MWh,type
0,,,,,,,,,Large
1,2021.0,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260,Large
2,2021.0,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867,Large
3,2021.0,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322,Large
4,2021.0,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424,154911,-18027,Large
...,...,...,...,...,...,...,...,...,...
4140,2001.0,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188,Small
4141,2001.0,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270,Small
4142,2001.0,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605,Small
4143,2001.0,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0,Small


In [10]:
# check for nulls

CA_hy_plants.isnull().sum()

Year             2
Company_Name     2
Plant_ID         2
Plant_Name       2
State            2
Capacity_(MW)    2
Gross_MWh        2
Net_MWh          2
type             0
dtype: int64

In [11]:
# drop null rows

CA_hy_plants.dropna(inplace = True)

In [12]:
CA_hy_plants.isnull().sum()

Year             0
Company_Name     0
Plant_ID         0
Plant_Name       0
State            0
Capacity_(MW)    0
Gross_MWh        0
Net_MWh          0
type             0
dtype: int64

In [13]:
# check for duplicates

CA_hy_plants.drop_duplicates()

Unnamed: 0,Year,Company_Name,Plant_ID,Plant_Name,State,Capacity_(MW),Gross_MWh,Net_MWh,type
1,2021.0,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260,Large
2,2021.0,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867,Large
3,2021.0,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322,Large
4,2021.0,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424,154911,-18027,Large
5,2021.0,California Department of Water Resources,H0510,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,CA,115.1,82744,80813,Large
...,...,...,...,...,...,...,...,...,...
4140,2001.0,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188,Small
4141,2001.0,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270,Small
4142,2001.0,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605,Small
4143,2001.0,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0,Small


In [14]:
# rename columns to match hydropower plant data

CA_hy_plants.rename(columns = {'Company_Name':'plant_owner', 'Plant_Name':'plant_name', 'Capacity_(MW)':'capacity_mw'}, inplace = True)

In [15]:
# check data types

CA_hy_plants.dtypes

Year           float64
plant_owner     object
Plant_ID        object
plant_name      object
State           object
capacity_mw     object
Gross_MWh       object
Net_MWh         object
type            object
dtype: object

In [16]:
# convert 'capacity_mw', Gross_MWh' and 'Net_MWh' to numeric data type

CA_hy_plants['capacity_mw'].astype(str)

1       276.4
2       644.3
3        32.8
4         424
5       115.1
        ...  
4140      1.4
4141      4.5
4142      3.0
4143      2.5
4144      0.2
Name: capacity_mw, Length: 5646, dtype: object

In [17]:
# remove commas from numbers to convert to numeric data type

CA_hy_plants['capacity_mw'].astype(str).str.replace(',', '')

1       276.4
2       644.3
3        32.8
4         424
5       115.1
        ...  
4140      1.4
4141      4.5
4142      3.0
4143      2.5
4144      0.2
Name: capacity_mw, Length: 5646, dtype: object

In [18]:
# Convert to numeric data type

CA_hy_plants['capacity_mw'] = pd.to_numeric(CA_hy_plants['capacity_mw'].astype(str).str.replace(',', ''))

In [31]:
# remove commas from numbers

CA_hy_plants['Gross_MWh'] = CA_hy_plants['Gross_MWh'].str.replace(',', '')

CA_hy_plants['Net_MWh'] = CA_hy_plants['Net_MWh'].str.replace(',','')

In [32]:
CA_hy_plants

Unnamed: 0,Year,plant_owner,Plant_ID,plant_name,State,capacity_mw,Gross_MWh,Net_MWh,type
1,2021.0,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260,Large
2,2021.0,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867,Large
3,2021.0,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322,Large
4,2021.0,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424.0,154911,-18027,Large
5,2021.0,California Department of Water Resources,H0510,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,CA,115.1,82744,80813,Large
...,...,...,...,...,...,...,...,...,...
4140,2001.0,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188,Small
4141,2001.0,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270,Small
4142,2001.0,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605,Small
4143,2001.0,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0,Small


In [33]:
# Convert to numeric data type

CA_hy_plants['Net_MWh'] = pd.to_numeric(CA_hy_plants['Net_MWh'])

CA_hy_plants['Gross_MWh'] = pd.to_numeric(CA_hy_plants['Gross_MWh'])

In [35]:
CA_hy_plants

Unnamed: 0,Year,plant_owner,Plant_ID,plant_name,State,capacity_mw,Gross_MWh,Net_MWh,type
1,2021.0,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260,Large
2,2021.0,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867,Large
3,2021.0,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322,Large
4,2021.0,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424.0,154911,-18027,Large
5,2021.0,California Department of Water Resources,H0510,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,CA,115.1,82744,80813,Large
...,...,...,...,...,...,...,...,...,...
4140,2001.0,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188,Small
4141,2001.0,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270,Small
4142,2001.0,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605,Small
4143,2001.0,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0,Small


In [36]:
CA_hy_plants.dtypes

Year           float64
plant_owner     object
Plant_ID        object
plant_name      object
State           object
capacity_mw    float64
Gross_MWh        int64
Net_MWh          int64
type            object
dtype: object

In [37]:
CA_hy_plants.isnull().sum()

Year           0
plant_owner    0
Plant_ID       0
plant_name     0
State          0
capacity_mw    0
Gross_MWh      0
Net_MWh        0
type           0
dtype: int64

In [38]:
# convert year data type to int

CA_hy_plants['Year'] = CA_hy_plants['Year'].astype(int)

In [40]:
CA_hy_plants.dtypes

Year             int32
plant_owner     object
Plant_ID        object
plant_name      object
State           object
capacity_mw    float64
Gross_MWh        int64
Net_MWh          int64
type            object
dtype: object

#### Join California plants with reference table so that dataframe include EIA IDs

In [42]:
# import California plant reference table with EIA plant IDs

reference_table = pd.read_csv('data/PlantID_reference.csv', encoding='latin1')

In [43]:
reference_table

Unnamed: 0,CECPlantID,PlantName,EIAPlantID,ResourceID,Resource ID Name,Energy Source Category,City,County,State
0,B0001,Vaca Dixon Battery Storage,59256.0,VACADX_1_NAS,,BATTERY,Vacaville,Solano,CA
1,B0002,Tehachapi Storage Project,59661.0,MONLTH_6_BATTRY,Tehachapi Storage Project,BATTERY,Tehachapi,Kern,CA
2,B0003,Yerba Buena Battery,59257.0,SWIFT_1_NAS,,BATTERY,San Jose,Santa Clara,CA
3,B0004,Millikan Avenue BESS,60760.0,SANTGO_2_MABBT1,Millikan Avenue BESS,BATTERY,Irvine,Orange,CA
4,B0005,Mira Loma BESS A,60661.0,MIRLOM_2_MLBBTA,Mira Loma BESS A,BATTERY,Ontario,San Bernardino,CA
...,...,...,...,...,...,...,...,...,...
1864,W0479,"Coachella Wind Holdings, LLC",64323.0,ALTWD_2_COAWD1,Coachella 1,WIND,Desert Hot Springs,Riverside,CA
1865,W0480,"Oasis Alta, LLC",63941.0,VOYAGR_2_VOAWD5,Voyager Wind Oasis Alta,WIND,Mojave,Kern,CA
1866,W0481,Point Wind,63482.0,TEHAPI_2_PW1WD1,Point Wind 1,WIND,Tehachapi,Kern,CA
1867,W0482,Altamont Winds LLC,64326.0,,,WIND,Livermore,Alameda,CA


In [44]:
# Rename column to match California Plants table

reference_table.rename(columns = {'CECPlantID': 'Plant_ID'}, inplace = True)

In [45]:
reference_table

Unnamed: 0,Plant_ID,PlantName,EIAPlantID,ResourceID,Resource ID Name,Energy Source Category,City,County,State
0,B0001,Vaca Dixon Battery Storage,59256.0,VACADX_1_NAS,,BATTERY,Vacaville,Solano,CA
1,B0002,Tehachapi Storage Project,59661.0,MONLTH_6_BATTRY,Tehachapi Storage Project,BATTERY,Tehachapi,Kern,CA
2,B0003,Yerba Buena Battery,59257.0,SWIFT_1_NAS,,BATTERY,San Jose,Santa Clara,CA
3,B0004,Millikan Avenue BESS,60760.0,SANTGO_2_MABBT1,Millikan Avenue BESS,BATTERY,Irvine,Orange,CA
4,B0005,Mira Loma BESS A,60661.0,MIRLOM_2_MLBBTA,Mira Loma BESS A,BATTERY,Ontario,San Bernardino,CA
...,...,...,...,...,...,...,...,...,...
1864,W0479,"Coachella Wind Holdings, LLC",64323.0,ALTWD_2_COAWD1,Coachella 1,WIND,Desert Hot Springs,Riverside,CA
1865,W0480,"Oasis Alta, LLC",63941.0,VOYAGR_2_VOAWD5,Voyager Wind Oasis Alta,WIND,Mojave,Kern,CA
1866,W0481,Point Wind,63482.0,TEHAPI_2_PW1WD1,Point Wind 1,WIND,Tehachapi,Kern,CA
1867,W0482,Altamont Winds LLC,64326.0,,,WIND,Livermore,Alameda,CA


In [46]:
# Join California ptables

CA_plants_with_eiaID = pd.merge(left = CA_hy_plants, right = reference_table, how = "left", on= 'Plant_ID')

In [47]:
CA_plants_with_eiaID

Unnamed: 0,Year,plant_owner,Plant_ID,plant_name,State_x,capacity_mw,Gross_MWh,Net_MWh,type,PlantName,EIAPlantID,ResourceID,Resource ID Name,Energy Source Category,City,County,State_y
0,2021,California Department of Water Resources,H0137,Devil Canyon,CA,276.4,194773,193260,Large,Devil Canyon,436.0,DVLCYN_1_UNITS,DEVIL CANYON HYDRO UNITS 1-4 AGGREGATE,WATER,San Bernardino,San Bernardino,CA
1,2021,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",CA,644.3,439709,427867,Large,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",437.0,HYTTHM_2_UNITS,HYATT-THERMALITO PUMP-GEN (AGGREGATE),WATER,Oroville,Butte,CA
2,2021,California Department of Water Resources,H0337,Mojave Siphon,CA,32.8,10991,10322,Large,Mojave Siphon,7072.0,MOJAVE_1_SIPHON,MOJAVE SIPHON POWER PLANT,WATER,Hesperia,San Bernardino,CA
3,2021,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),CA,424.0,154911,-18027,Large,W R Gianelli (Pumping-Generating),448.0,SLUISP_2_UNITS,SAN LUIS (GIANELLI) PUMP-GEN (AGGREGATE),WATER,Unincorporated,Merced,CA
4,2021,California Department of Water Resources,H0510,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,CA,115.1,82744,80813,Large,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,438.0,HYTTHM_2_UNITS,HYATT-THERMALITO PUMP-GEN (AGGREGATE),WATER,Oroville,Butte,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5641,2001,Utica Power Authority,H0008,Angels,CA,1.4,4188,4188,Small,Angels,215.0,FROGTN_7_UTICA,,WATER,Angels Camp,Calaveras,CA
5642,2001,Utica Power Authority,H0346,Murphys,CA,4.5,9270,9270,Small,Murphys,261.0,FROGTN_7_UTICA,,WATER,Murphys,Calaveras,CA
5643,2001,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,CA,3.0,8605,8605,Small,Indian Valley Dam,50129.0,INDVLY_1_UNITS,Indian Valley Hydro,WATER,Clearlake Oaks,Lake,CA
5644,2001,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,CA,2.5,0,0,Small,Clear Lake,50128.0,INDVLY_1_UNITS,Indian Valley Hydro,WATER,Clearlake,Lake,CA


In [48]:
# check for nulls

CA_plants_with_eiaID[CA_plants_with_eiaID['EIAPlantID'].isnull()]

Unnamed: 0,Year,plant_owner,Plant_ID,plant_name,State_x,capacity_mw,Gross_MWh,Net_MWh,type,PlantName,EIAPlantID,ResourceID,Resource ID Name,Energy Source Category,City,County,State_y
1520,2021,Friant Power Authority,H0626,Quinten Luallen,CA,7.3,29124,29083,Small,Quinten Luallen,,WOODWR_1_HYDRO,Quinten Luallen,WATER,Friant,Madera,CA
1590,2021,Not Available,H0547,Baker Station Hydro,CA,1.5,1847,1847,Small,Baker Station Hydro,,BRDGVL_7_BAKER,Baker Station Hydro,WATER,Bridgeville,Humboldt,CA
1722,2020,Friant Power Authority,H0626,Quinten Luallen,CA,7.3,24647,24619,Small,Quinten Luallen,,WOODWR_1_HYDRO,Quinten Luallen,WATER,Friant,Madera,CA
1792,2020,Not Available,H0547,Baker Station Hydro,CA,1.5,1779,1779,Small,Baker Station Hydro,,BRDGVL_7_BAKER,Baker Station Hydro,WATER,Bridgeville,Humboldt,CA
1924,2019,Friant Power Authority,H0626,Quinten Luallen,CA,7.3,48053,47778,Small,Quinten Luallen,,WOODWR_1_HYDRO,Quinten Luallen,WATER,Friant,Madera,CA
1993,2019,Not Available,H0547,Baker Station Hydro,CA,1.5,950,950,Small,Baker Station Hydro,,BRDGVL_7_BAKER,Baker Station Hydro,WATER,Bridgeville,Humboldt,CA
2123,2018,Friant Power Authority,H0626,Quinten Luallen,CA,7.3,46986,46986,Small,Quinten Luallen,,WOODWR_1_HYDRO,Quinten Luallen,WATER,Friant,Madera,CA
2192,2018,Not Available,H0547,Baker Station Hydro,CA,1.5,3350,3350,Small,Baker Station Hydro,,BRDGVL_7_BAKER,Baker Station Hydro,WATER,Bridgeville,Humboldt,CA
2322,2017,Friant Power Authority,H0626,Quinten Luallen,CA,7.3,32976,32790,Small,Quinten Luallen,,WOODWR_1_HYDRO,Quinten Luallen,WATER,Friant,Madera,CA
2391,2017,Not Available,H0547,Baker Station Hydro,CA,1.5,4653,4653,Small,Baker Station Hydro,,BRDGVL_7_BAKER,Baker Station Hydro,WATER,Bridgeville,Humboldt,CA


In [49]:
# Remove columns with duplicated data

CA_plants_with_eiaID.drop(['State_x', 'PlantName'], axis = 1, inplace = True)

In [50]:
CA_plants_with_eiaID.dtypes

Year                        int32
plant_owner                object
Plant_ID                   object
plant_name                 object
capacity_mw               float64
Gross_MWh                   int64
Net_MWh                     int64
type                       object
EIAPlantID                float64
ResourceID                 object
Resource ID Name           object
Energy Source Category     object
City                       object
County                     object
State_y                    object
dtype: object

In [51]:
# update column names after dropping duplicate columns

CA_plants_with_eiaID.rename(columns = {'State_y': 'state'}, inplace = True)

CA_plants_with_eiaID.rename(columns = {'EIAPlantID': 'eia_id'}, inplace = True)

In [53]:
# drop data with nulls in eia_id column

CA_plants_with_eiaID.dropna(subset = 'eia_id', inplace = True)

In [54]:
# update eia id data type to int

CA_plants_with_eiaID['eia_id'] = CA_plants_with_eiaID['eia_id'].astype(int)

In [56]:
CA_plants_with_eiaID

Unnamed: 0,Year,plant_owner,Plant_ID,plant_name,capacity_mw,Gross_MWh,Net_MWh,type,eia_id,ResourceID,Resource ID Name,Energy Source Category,City,County,state
0,2021,California Department of Water Resources,H0137,Devil Canyon,276.4,194773,193260,Large,436,DVLCYN_1_UNITS,DEVIL CANYON HYDRO UNITS 1-4 AGGREGATE,WATER,San Bernardino,San Bernardino,CA
1,2021,California Department of Water Resources,H0164,"Edward C Hyatt (Unit 1,3,5 Pumping-Generating)",644.3,439709,427867,Large,437,HYTTHM_2_UNITS,HYATT-THERMALITO PUMP-GEN (AGGREGATE),WATER,Oroville,Butte,CA
2,2021,California Department of Water Resources,H0337,Mojave Siphon,32.8,10991,10322,Large,7072,MOJAVE_1_SIPHON,MOJAVE SIPHON POWER PLANT,WATER,Hesperia,San Bernardino,CA
3,2021,California Department of Water Resources,H0452,W R Gianelli (Pumping-Generating),424.0,154911,-18027,Large,448,SLUISP_2_UNITS,SAN LUIS (GIANELLI) PUMP-GEN (AGGREGATE),WATER,Unincorporated,Merced,CA
4,2021,California Department of Water Resources,H0510,Thermalito (Unit 1 HY Unit 2-3-4 Pumping-Gener...,115.1,82744,80813,Large,438,HYTTHM_2_UNITS,HYATT-THERMALITO PUMP-GEN (AGGREGATE),WATER,Oroville,Butte,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5641,2001,Utica Power Authority,H0008,Angels,1.4,4188,4188,Small,215,FROGTN_7_UTICA,,WATER,Angels Camp,Calaveras,CA
5642,2001,Utica Power Authority,H0346,Murphys,4.5,9270,9270,Small,261,FROGTN_7_UTICA,,WATER,Murphys,Calaveras,CA
5643,2001,Yolo County Flood Control & Water Conservation...,H0243,Indian Valley Dam,3.0,8605,8605,Small,50129,INDVLY_1_UNITS,Indian Valley Hydro,WATER,Clearlake Oaks,Lake,CA
5644,2001,Yolo County Flood Control & Water Conservation...,H0576,Clear Lake,2.5,0,0,Small,50128,INDVLY_1_UNITS,Indian Valley Hydro,WATER,Clearlake,Lake,CA


In [None]:
# save to csv

CA_plants_with_eiaID.to_csv(f'data/cleaned/CA_plants_eiaID.csv', index = False)