# Working with the EIA Extract / Transform
This notebook lets you inspect the dataframes produced by the extract and transform dagster assets for the EIA 860 and 923 datasets. It makes it easier to test and add new years of data, or new tables from the various spreadsheets that haven't been integrated yet.

**Note: This notebook does not rerun the ETL steps. It just loads the dataframes returned by an asset of the most recent dagster run.** To debug the EIA ETL:

    1. Materialize all EIA assets in dagit.
    2. Load and inspect the dataframe for an asset of interest in this notebook.
    3. Make some code changes to that asset.
    4. Rematerialize the asset in dagit. No need to rematerialize assets that you didn't update.
    5. Load and inspect the dataframe for the asset of interest again.
    6. Repeat steps 3-5 until the ETL works!

Some assets are written to the database, in which case you can pull the tables into pandas or explore them in the database directly. However, many assets use the default IO Manager, which writes asset values to the `$DAGSTER_HOME/storage/` directory as pickle files. Dagster's `load_asset_value` method, used throughout this notebook, loads an asset's value no matter which IO Manager the asset uses.
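
For assets handled by the default IO Manager, the pickle files can also be read directly. The following is a minimal sketch, assuming each asset value is stored at `$DAGSTER_HOME/storage/<asset_key>`, which is the layout suggested by the "Loading file from" log lines further down in this notebook. `load_pickled_asset` is a hypothetical helper written for illustration; it is not part of dagster or pudl.

```python
import os
import pickle
from pathlib import Path


def load_pickled_asset(asset_key, dagster_home=None):
    """Read one pickled asset value from <DAGSTER_HOME>/storage/<asset_key>.

    Assumes the default IO Manager layout described above. Hypothetical
    helper, not a dagster or pudl function.
    """
    home = dagster_home or os.environ.get("DAGSTER_HOME")
    if not home:
        raise RuntimeError("DAGSTER_HOME is not set.")
    path = Path(home) / "storage" / asset_key
    if not path.is_file():
        raise FileNotFoundError(
            f"No pickled value found at {path}; has the asset been materialized?"
        )
    # Only unpickle files written by your own dagster runs.
    with path.open("rb") as f:
        return pickle.load(f)
```

This only works for assets that use the default IO Manager, so loading through dagster remains the more general route.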

In [1]:
import os

assert os.environ.get("DAGSTER_HOME"), (
    "The DAGSTER_HOME env var is not set so dagster won't be able to find the assets. "
    "Set the DAGSTER_HOME env var in this notebook or kill the jupyter server and set"
    " the DAGSTER_HOME env var in your shell and relaunch jupyter."
)
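
If the assertion fails, one option is to set the variable from inside the notebook before any dagster code runs. The sketch below assumes you already have a dagster home directory on disk. `set_dagster_home` is a hypothetical helper, and the path in the commented usage line is a placeholder.

```python
import os
from pathlib import Path


def set_dagster_home(path):
    """Point DAGSTER_HOME at an existing directory unless it is already set.

    Hypothetical helper for illustration. Returns the value in effect.
    """
    current = os.environ.get("DAGSTER_HOME")
    if current:
        # Leave an existing setting untouched.
        return current
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"{resolved} is not an existing directory.")
    os.environ["DAGSTER_HOME"] = str(resolved)
    return str(resolved)


# Placeholder path; replace it with your own dagster home directory:
# set_dagster_home("~/path/to/dagster_home")
```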

In [2]:
%load_ext autoreload
%autoreload 3
import pudl
import logging
import sys
from pathlib import Path
import pandas as pd
pd.options.display.max_columns = None

pudl_settings is being deprecated in favor of environment variables PUDL_OUTPUT and PUDL_INPUT. For more info see: https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html
sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html

In [3]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [4]:
from dagster import AssetSelection, AssetKey

import pudl
from pudl.etl import default_assets, defs
from pudl.resources import dataset_settings
from pudl.helpers import get_asset_group_keys

# EIA-860

## Inspect the raw EIA-860 / EIA-860m tables

In [5]:
get_asset_group_keys("eia860_raw_assets", default_assets)

['raw_plant_eia860',
 'raw_boiler_so2_eia860',
 'raw_generator_existing_eia860',
 'raw_multifuel_existing_eia860',
 'raw_boiler_info_eia860',
 'raw_fgp_equipment_eia860',
 'raw_boiler_cooling_eia860',
 'raw_boiler_nox_eia860',
 'raw_utility_eia860',
 'raw_boiler_generator_assn_eia860',
 'raw_emission_control_strategies_eia860',
 'raw_boiler_stack_flue_eia860',
 'raw_generator_retired_eia860',
 'raw_ownership_eia860',
 'raw_multifuel_retired_eia860',
 'raw_fgd_equipment_eia860',
 'raw_emissions_control_equipment_eia860',
 'raw_generator_proposed_eia860',
 'raw_boiler_particulate_eia860',
 'raw_stack_flue_equipment_eia860',
 'raw_cooling_equipment_eia860',
 'raw_boiler_mercury_eia860',
 'raw_generator_eia860']

In [6]:
%%time
asset_key = "raw_generator_retired_eia860"
df = defs.load_asset_value(AssetKey(asset_key))

df.head()

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Context impl SQLiteImpl.
Will assume non-transactional DDL.


2023-03-17 16:21:17 -0800 - dagster - DEBUG - system - Loading file from: /Users/bendnorman/catalyst/dagster-pudl-work/dagster_home/storage/raw_generator_retired_eia860


CPU times: user 384 ms, sys: 77 ms, total: 461 ms
Wall time: 1.1 s


Unnamed: 0,associated_combined_heat_power,balancing_authority_code_eia,bypass_heat_recovery,capacity_mw,carbon_capture,cofire_fuels,county,data_maturity,deliver_power_transgrid,duct_burners,energy_source_code_1,energy_source_code_2,energy_source_code_3,energy_source_code_4,energy_source_code_5,energy_source_code_6,energy_storage_capacity_mwh,fluidized_bed_tech,generator_id,generator_operating_month,generator_operating_year,generator_retirement_month,generator_retirement_year,latitude,longitude,map_bing,map_google,minimum_load_mw,multiple_fuels,nameplate_power_factor,net_capacity_mwdc,operating_month,operating_year,operational_status_code,other_combustion_tech,other_modifications_month,other_modifications_year,other_planned_modifications,ownership_code,planned_derate_month,planned_derate_year,planned_energy_source_code_1,planned_modifications,planned_net_summer_capacity_derate_mw,planned_net_summer_capacity_uprate_mw,planned_net_winter_capacity_derate_mw,planned_net_winter_capacity_uprate_mw,planned_new_prime_mover_code,planned_repower_month,planned_repower_year,planned_uprate_month,planned_uprate_year,plant_id_eia,plant_name_eia,prime_mover_code,pulverized_coal_tech,report_year,retirement_month,retirement_year,rto_iso_lmp_node_id,rto_iso_location_wholesale_reporting_id,sector_id_eia,sector_name,sector_name_eia,solid_fuel_gasification,startup_source_code_1,startup_source_code_2,startup_source_code_3,startup_source_code_4,state,stoker_tech,subcritical_tech,summer_capacity_mw,supercritical_tech,switch_oil_gas,syncronized_transmission_grid,technology_description,time_cold_shutdown_full_load_code,topping_bottoming_code,turbines_inverters_hydrokinetics,turbines_num,ultrasupercritical_tech,unit_id_eia,uprate_derate_completed_month,uprate_derate_completed_year,uprate_derate_during_year,utility_id_eia,utility_name_eia,winter_capacity_mw
0,N,,X,272.0,,,Mobile,final,,X,BIT,,,,,,,,3,7,1959,8,2015,,,,,130,,0.85,,,,RE,,,,,S,,,,,,,,,,,,,,3.0,Barry,ST,Y,2020.0,,,,,1.0,,Electric Utility,,NG,,,,AL,,Y,249.0,,,X,Conventional Steam Coal,OVER,X,,,,,,,N,195,Alabama Power Co,249.0
1,N,,X,788.8,,,Walker,final,,X,BIT,,,,,,,,10,10,1972,4,2019,,,,,600,N,0.85,,,,RE,,,,,S,,,,,,,,,,,,,,8.0,Gorgas,ST,Y,2020.0,,,,,1.0,,Electric Utility,N,DFO,,,,AL,,,727.7,Y,,X,Conventional Steam Coal,OVER,X,,,,,,,N,195,Alabama Power Co,727.7
2,N,,X,125.0,,,Walker,final,,X,BIT,,,,,,,,6,4,1951,8,2015,,,,,50,,0.85,,,,RE,,,,,S,,,,,,,,,,,,,,8.0,Gorgas,ST,Y,2020.0,,,,,1.0,,Electric Utility,N,DFO,,,,AL,,Y,103.0,,,X,Conventional Steam Coal,OVER,X,,,,,,,N,195,Alabama Power Co,103.0
3,N,,X,125.0,,,Walker,final,,X,BIT,,,,,,,,7,7,1952,8,2015,,,,,50,,0.85,,,,RE,,,,,S,,,,,,,,,,,,,,8.0,Gorgas,ST,Y,2020.0,,,,,1.0,,Electric Utility,N,DFO,,,,AL,,Y,104.0,,,X,Conventional Steam Coal,OVER,X,,,,,,,N,195,Alabama Power Co,104.0
4,N,,X,187.5,,,Walker,final,,X,BIT,,,,,,,,8,5,1956,4,2019,,,,,90,N,0.85,,,,RE,,,,,S,,,,,,,,,,,,,,8.0,Gorgas,ST,Y,2020.0,,,,,1.0,,Electric Utility,N,DFO,,,,AL,,Y,163.0,,,X,Conventional Steam Coal,OVER,X,,,,,,,N,195,Alabama Power Co,163.0


## Inspect the clean pre-harvested EIA-860 / EIA-860m tables

In [7]:
%%time
get_asset_group_keys("pre_harvested_eia860_assets", default_assets)

CPU times: user 2.18 ms, sys: 58 µs, total: 2.23 ms
Wall time: 2.3 ms


['clean_boiler_generator_assn_eia860',
 'clean_generators_eia860',
 'clean_utilities_eia860',
 'clean_ownership_eia860',
 'clean_plants_eia860',
 'clean_boilers_eia860']

In [8]:
%%time
asset_key = "clean_generators_eia860"
df = defs.load_asset_value(AssetKey(asset_key))

df.head()

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Context impl SQLiteImpl.
Will assume non-transactional DDL.


2023-03-17 16:21:17 -0800 - dagster - DEBUG - system - Loading file from: /Users/bendnorman/catalyst/dagster-pudl-work/dagster_home/storage/clean_generators_eia860


CPU times: user 251 ms, sys: 75.6 ms, total: 327 ms
Wall time: 416 ms


Unnamed: 0,associated_combined_heat_power,balancing_authority_code_eia,bypass_heat_recovery,capacity_mw,carbon_capture,cofire_fuels,county,data_maturity,deliver_power_transgrid,distributed_generation,duct_burners,energy_source_1_transport_1,energy_source_1_transport_2,energy_source_1_transport_3,energy_source_2_transport_1,energy_source_2_transport_2,energy_source_2_transport_3,energy_source_code_1,energy_source_code_2,energy_source_code_3,energy_source_code_4,energy_source_code_5,energy_source_code_6,energy_storage_capacity_mwh,ferc_cogen_docket_no,ferc_cogen_status,ferc_exempt,ferc_exempt_wholesale_generator,ferc_exempt_wholesale_generator_docket_no,ferc_other_generator,ferc_other_generator_docker_no,ferc_qualifying_facility,ferc_qualifying_facility_docket_no,ferc_small_power_producer,ferc_small_power_producer_docket_no,fluidized_bed_tech,generator_id,latitude,longitude,map_bing,map_google,minimum_load_mw,multiple_fuels,nameplate_power_factor,net_capacity_mwdc,operating_switch,operational_status_code,other_combustion_tech,other_planned_modifications,owned_by_non_utility,ownership_code,planned_energy_source_code_1,planned_modifications,planned_net_summer_capacity_derate_mw,planned_net_summer_capacity_uprate_mw,planned_net_winter_capacity_derate_mw,planned_net_winter_capacity_uprate_mw,planned_new_capacity_mw,planned_new_prime_mover_code,plant_id_eia,plant_name_eia,previously_canceled,prime_mover_code,pulverized_coal_tech,reactive_power_output_mvar,rto_iso_lmp_node_id,rto_iso_location_wholesale_reporting_id,sector_id_eia,sector_name,sector_name_eia,solid_fuel_gasification,startup_source_code_1,startup_source_code_2,startup_source_code_3,startup_source_code_4,state,stoker_tech,subcritical_tech,summer_capacity_estimate,summer_capacity_mw,summer_estimated_capability_mw,supercritical_tech,switch_oil_gas,syncronized_transmission_grid,technology_description,time_cold_shutdown_full_load_code,topping_bottoming_code,turbines_inverters_hydrokinetics,turbines_num,ultrasupercritical_tech,unit_id_eia,uprate_derate_during_year,utility_id_eia,utility_name_eia,winter_capacity_estimate,winter_capacity_mw,winter_estimated_capability_mw,current_planned_generator_operating_date,current_planned_operating_date,generator_operating_date,generator_retirement_date,operating_date,original_planned_generator_operating_date,other_modifications_date,planned_derate_date,planned_generator_retirement_date,planned_repower_date,planned_retirement_date,planned_uprate_date,retirement_date,uprate_derate_completed_date,report_date,fuel_type_code_pudl,operational_status
0,False,,False,0.9,,,Aleutians East,final,,,False,,,,,,,DFO,,,,,,,,,,,,,,,,,,,1,,,,,0.4,False,0.8,,,SB,,,,S,,,,,,,,,1.0,Sand Point,,IC,,,,,2.0,,IPP Non-CHP,,,,,,AK,,,,0.4,,,,False,Petroleum Liquids,10M,X,,,,,False,63560.0,"TDX Sand Point Generating, LLC",,0.4,,NaT,NaT,2000-12-01,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,2020-01-01,oil,existing
1,False,,False,0.9,,,Aleutians East,final,,,False,,,,,,,DFO,,,,,,,,,,,,,,,,,,,2,,,,,0.3,False,0.8,,,OP,,,,S,,,,,,,,,1.0,Sand Point,,IC,,,,,2.0,,IPP Non-CHP,,,,,,AK,,,,0.3,,,,False,Petroleum Liquids,10M,X,,,,,False,63560.0,"TDX Sand Point Generating, LLC",,0.3,,NaT,NaT,2000-12-01,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,2020-01-01,oil,existing
2,False,,False,0.5,,,Aleutians East,final,,,False,,,,,,,DFO,,,,,,,,,,,,,,,,,,,3,,,,,0.3,False,0.8,,,OP,,,,S,,,,,,,,,1.0,Sand Point,,IC,,,,,2.0,,IPP Non-CHP,,,,,,AK,,,,0.3,,,,False,Petroleum Liquids,10M,X,,,,,False,63560.0,"TDX Sand Point Generating, LLC",,0.3,,NaT,NaT,2010-12-01,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,2020-01-01,oil,existing
3,False,,False,0.7,,,Aleutians East,final,,,False,,,,,,,DFO,,,,,,,,,,,,,,,,,,,5,,,,,0.3,False,0.8,,,OA,,,,S,,,,,,,,,1.0,Sand Point,,IC,,,,,2.0,,IPP Non-CHP,,,,,,AK,,,,0.4,,,,False,Petroleum Liquids,10M,X,,,,,False,63560.0,"TDX Sand Point Generating, LLC",,0.3,,NaT,NaT,2000-12-01,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,2020-01-01,oil,existing
4,False,,False,0.5,,,Aleutians East,final,,,False,,,,,,,WND,,,,,,,,,,,,,,,,,,,WT1,,,,,0.1,,0.89,,,OA,,,,S,,,,,,,,,1.0,Sand Point,,WT,,,,,2.0,,IPP Non-CHP,,,,,,AK,,,,0.1,,,,False,Onshore Wind Turbine,,X,1.0,,,,False,63560.0,"TDX Sand Point Generating, LLC",,0.1,,NaT,NaT,2011-10-01,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,2020-01-01,wind,existing


# EIA-923

## Inspect the raw EIA-923 tables

In [9]:
get_asset_group_keys("eia923_raw_assets", default_assets)

['raw_generation_fuel_eia923',
 'raw_boiler_fuel_eia923',
 'raw_generator_eia923',
 'raw_stocks_eia923',
 'raw_fuel_receipts_costs_eia923']

In [10]:
%%time
asset_key = "raw_generator_eia923"
df = defs.load_asset_value(AssetKey(asset_key))

df.head()

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Context impl SQLiteImpl.
Will assume non-transactional DDL.


2023-03-17 16:21:18 -0800 - dagster - DEBUG - system - Loading file from: /Users/bendnorman/catalyst/dagster-pudl-work/dagster_home/storage/raw_generator_eia923


CPU times: user 47.4 ms, sys: 13.2 ms, total: 60.6 ms
Wall time: 118 ms


Unnamed: 0,balancing_authority_code_eia,census_region,combined_heat_power,data_maturity,early_release,generator_id,naics_code,nerc_region,net_generation_mwh_april,net_generation_mwh_august,net_generation_mwh_december,net_generation_mwh_february,net_generation_mwh_january,net_generation_mwh_july,net_generation_mwh_june,net_generation_mwh_march,net_generation_mwh_may,net_generation_mwh_november,net_generation_mwh_october,net_generation_mwh_september,net_generation_mwh_year_to_date,operator_id,operator_name,plant_id_eia,plant_name_eia,plant_state,prime_mover_code,report_year,reporting_frequency_code,sector_id_eia,sector_name_eia
0,SOCO,ESC,N,final,,1,22.0,SERC,-374,-247,-335,1837,.,-323,-293,-364,-277,-321,-249,-238,-1184.0,195.0,Alabama Power Co,3,Barry,AL,ST,2020.0,M,1.0,Electric Utility
1,SOCO,ESC,N,final,,2,22.0,SERC,-163,-635,-437,-184,.,-432,-363,-174,-192,-390,-539,-652,-4161.0,195.0,Alabama Power Co,3,Barry,AL,ST,2020.0,M,1.0,Electric Utility
2,SOCO,ESC,N,final,,5,22.0,SERC,263605,353641,151277,-4596,.,348463,218139,76013,200505,344663,347333,339452,2638495.0,195.0,Alabama Power Co,3,Barry,AL,ST,2020.0,M,1.0,Electric Utility
3,SOCO,ESC,N,final,,A1ST,22.0,SERC,116621,132943,133317,119322,.,120148,123583,134425,64790,124768,87361,125618,1282896.0,195.0,Alabama Power Co,3,Barry,AL,CA,2020.0,M,1.0,Electric Utility
4,SOCO,ESC,N,final,,A1CT,22.0,SERC,113058,128868,132744,118310,.,122745,118551,122484,9175,127101,83187,124760,1200983.0,195.0,Alabama Power Co,3,Barry,AL,CT,2020.0,M,1.0,Electric Utility


## Inspect the clean pre-harvested EIA-923 tables

In [11]:
get_asset_group_keys("pre_harvested_eia923_assets", default_assets)

['clean_boiler_fuel_eia923',
 'clean_fuel_receipts_costs_eia923',
 'clean_generation_eia923',
 'clean_generation_fuel_eia923',
 'clean_generation_fuel_nuclear_eia923',
 'clean_coalmine_eia923']

In [12]:
%%time
asset_key = "clean_generation_eia923"
df = defs.load_asset_value(AssetKey(asset_key))

df.head()

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Context impl SQLiteImpl.
Will assume non-transactional DDL.


2023-03-17 16:21:18 -0800 - dagster - DEBUG - system - Loading file from: /Users/bendnorman/catalyst/dagster-pudl-work/dagster_home/storage/clean_generation_eia923


CPU times: user 55.6 ms, sys: 16.6 ms, total: 72.2 ms
Wall time: 121 ms


Unnamed: 0,balancing_authority_code_eia,data_maturity,generator_id,plant_id_eia,prime_mover_code,reporting_frequency_code,sector_id_eia,sector_name_eia,net_generation_mwh,report_date
0,SOCO,final,1,3,ST,M,1.0,Electric Utility,,2020-01-01
0,SOCO,final,1,3,ST,M,1.0,Electric Utility,1837.0,2020-02-01
0,SOCO,final,1,3,ST,M,1.0,Electric Utility,-364.0,2020-03-01
0,SOCO,final,1,3,ST,M,1.0,Electric Utility,-374.0,2020-04-01
0,SOCO,final,1,3,ST,M,1.0,Electric Utility,-277.0,2020-05-01
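
Comparing `raw_generator_eia923` with `clean_generation_eia923` suggests that the transform reshapes the twelve `net_generation_mwh_<month>` columns into one row per month, keyed by `report_date`. It also appears to turn the `.` placeholder into a missing value. The sketch below reproduces that reshaping on a three-month sample taken from the first raw row above. It is an illustration in plain pandas, not PUDL's actual transform code, and `monthly_wide_to_long` is a hypothetical name.

```python
import calendar

import pandas as pd

# Map "january" -> 1, ..., "december" -> 12.
MONTHS = {name.lower(): num for num, name in enumerate(calendar.month_name) if name}


def monthly_wide_to_long(wide, prefix="net_generation_mwh_"):
    """Reshape one-column-per-month data into one row per month (illustration)."""
    # Keep only columns whose suffix is a month name, which skips year_to_date.
    month_cols = [
        c for c in wide.columns
        if c.startswith(prefix) and c[len(prefix):] in MONTHS
    ]
    id_cols = [c for c in wide.columns if c not in month_cols]
    long = wide.melt(
        id_vars=id_cols,
        value_vars=month_cols,
        var_name="month",
        value_name="net_generation_mwh",
    )
    month_num = long["month"].str[len(prefix):].map(MONTHS)
    long["report_date"] = pd.to_datetime(
        pd.DataFrame({"year": long["report_year"], "month": month_num, "day": 1})
    )
    # Non-numeric placeholders such as "." become NaN.
    long["net_generation_mwh"] = pd.to_numeric(
        long["net_generation_mwh"], errors="coerce"
    )
    key_cols = [c for c in id_cols if c != "report_year"]
    return (
        long.drop(columns=["month", "report_year"])
        .sort_values(key_cols + ["report_date"])
        .reset_index(drop=True)
    )


# Sample values copied from the first raw row shown earlier (plant 3, generator 1).
wide = pd.DataFrame({
    "plant_id_eia": [3],
    "generator_id": ["1"],
    "report_year": [2020],
    "net_generation_mwh_january": ["."],
    "net_generation_mwh_february": ["1837"],
    "net_generation_mwh_march": ["-364"],
})
long = monthly_wide_to_long(wide)
```

For this sample, `long` has three rows dated 2020-01-01 through 2020-03-01 with a missing value, 1837.0 and -364.0, which matches the first three rows of the clean table above.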


## Inspect the final harvested EIA tables

In [13]:
get_asset_group_keys("eia_harvested_assets", default_assets)

['generators_eia860',
 'ownership_eia860',
 'coalmine_eia923',
 'plants_eia860',
 'utilities_eia860',
 'generation_fuel_eia923',
 'generation_eia923',
 'fuel_receipts_costs_eia923',
 'boilers_eia860',
 'generators_entity_eia',
 'boiler_fuel_eia923',
 'boiler_generator_assn_eia860',
 'generation_fuel_nuclear_eia923',
 'boilers_entity_eia',
 'plants_entity_eia',
 'utilities_entity_eia']

In [14]:
%%time
asset_key = "coalmine_eia923"
df = defs.load_asset_value(AssetKey(asset_key))

df.head()

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Context impl SQLiteImpl.
Will assume non-transactional DDL.
CPU times: user 880 ms, sys: 21.4 ms, total: 901 ms
Wall time: 955 ms


Unnamed: 0,mine_id_pudl,mine_name,mine_type_code,state,county_id_fips,mine_id_msha,data_maturity
0,0,town creek,S,AL,1127,103376,final
1,1,dolet hills lignite company,S,LA,22031,1601031,final
2,2,calvert city terminal llc,S,KY,21157,1518639,final
3,3,cordero mine,S,WY,56005,4800992,final
4,4,el segundo,S,NM,35031,2902257,final
