# Install PUDL
Until we get our custom Docker image built, the software needs to be installed in your user environment each session. If you are using this notebook on the Catalyst JupyterHub, uncomment the commands in the following cell and run it before anything else.

In [None]:
#!conda install --yes --quiet python-snappy
#!pip install --quiet git+https://github.com/catalyst-cooperative/pudl.git@dev
#!cp ~/shared/shared-pudl.yml ~/.pudl.yml

## How to use PUDL's output layer

The PUDL database tables are a standard, <a href="https://en.wikipedia.org/wiki/Database_normalization">normalized</a> way to store and access electricity data. Normalized tables are great for databases and storage, but when actually working with the data we often want denormalized tables that carry names and associated information in every record. Mostly this means including the names of referenced utilities and plants instead of just their IDs, plus some frequently calculated columns (such as calculating `total_fuel_cost` from `total_heat_content_mmbtu` and `fuel_cost_per_mmbtu`). The Catalyst team developed a tool for accessing these denormalized tables, which we call the PUDL output object.
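
To make the idea of denormalization concrete, here is a minimal sketch using an in-memory SQLite database. The three-table schema, the table names, and the sample values are hypothetical and far smaller than the real PUDL tables. The cost column assumes `total_fuel_cost` is the product of the other two columns.

```python
import sqlite3

# Hypothetical miniature schema; the real PUDL tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE utilities (utility_id INTEGER PRIMARY KEY, utility_name TEXT);
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        plant_name TEXT,
        utility_id INTEGER REFERENCES utilities (utility_id)
    );
    CREATE TABLE fuel_receipts (
        plant_id INTEGER REFERENCES plants (plant_id),
        total_heat_content_mmbtu REAL,
        fuel_cost_per_mmbtu REAL
    );
    INSERT INTO utilities VALUES (1, 'Example Power Co');
    INSERT INTO plants VALUES (10, 'Example Station', 1);
    INSERT INTO fuel_receipts VALUES (10, 1000.0, 2.5);
""")

# Denormalize: attach names to every record and add a derived cost column.
denormalized = conn.execute("""
    SELECT p.plant_id, p.plant_name, u.utility_id, u.utility_name,
           f.total_heat_content_mmbtu, f.fuel_cost_per_mmbtu,
           f.total_heat_content_mmbtu * f.fuel_cost_per_mmbtu AS total_fuel_cost
    FROM fuel_receipts AS f
    JOIN plants AS p ON p.plant_id = f.plant_id
    JOIN utilities AS u ON u.utility_id = p.utility_id
""").fetchall()
conn.close()
print(denormalized)
```

The normalized tables store each name once. The joined result repeats the names on every record, which is more convenient for analysis.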

There are three main layers of the PUDL output object:
- denormalized tables
- compiled analysis
- preliminary access to partially integrated PUDL datasets

Some benefits of using the outputs:
 - Not having to merge the same tables together over and over again for analysis.
 - Caching tables: many analyses use the same table multiple times. The PUDL output object caches tables so that (within the same instance of the object) you don't have to read them from the database over and over again.
 - Standardizing the frequency of tables: some tables are annual, some monthly, some hourly. The PUDL output object takes a requested frequency and ensures the tables are aggregated to that frequency.
 - Standardizing the filling-in methodology: there is a lot of missing or incomplete data. We try to fill some of that in with the output methods.
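
The caching behavior described in the list above can be sketched as follows. This is a toy illustration, not PUDL's implementation: `CachedOutputs`, `slow_reader`, and the returned data are all hypothetical.

```python
class CachedOutputs:
    """Toy stand-in for an output object that caches each table after the first read."""

    def __init__(self, read_table):
        self._read_table = read_table  # callable: table name -> table
        self._cache = {}

    def table(self, name):
        # Only call the (slow) reader the first time a table is requested.
        if name not in self._cache:
            self._cache[name] = self._read_table(name)
        return self._cache[name]


reads = []  # records every call that actually reaches the reader


def slow_reader(name):
    reads.append(name)
    return [{"table": name}]


outputs = CachedOutputs(slow_reader)
first = outputs.table("gens_eia860")
second = outputs.table("gens_eia860")  # served from the cache
print(reads)
```

Because the cache lives on the instance, a new `CachedOutputs` object starts out empty. This matches the "within the same instance" caveat above.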

This notebook assumes you have access to an instance of the PUDL database, have the `pudl` Python package installed, and have an EIA API key stored in an environment variable named `API_KEY_EIA`. If you don't have an EIA API key, <a href="https://www.eia.gov/opendata/register.php">register for one here</a>. If you'd rather set the environment variable just for this notebook, you can do so below with `%env API_KEY_EIA={your key}`; otherwise follow the instructions for setting an <a href="https://www.twilio.com/blog/2017/01/how-to-set-environment-variables.html">environment variable</a> for your setup.
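
If you are unsure whether the key is visible to the notebook kernel, a quick check like the one below can help. The helper name `eia_api_key_is_set` is hypothetical and not part of PUDL.

```python
import os


def eia_api_key_is_set(environ=os.environ):
    """Return True if a non-empty API_KEY_EIA is present in the given environment."""
    return bool(environ.get("API_KEY_EIA", "").strip())


if not eia_api_key_is_set():
    print("API_KEY_EIA is not set; set it before using the fill=True option.")
```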

If you have any questions please reach out to: pudl@catalyst.coop

In [None]:
# import the necessary packages
%load_ext autoreload
%autoreload 2

import pandas as pd
import sqlalchemy as sa
import random
import pudl

# Set EIA API key. If you want to set the API key in this notebook, add your key and remove comment (#)
# %env API_KEY_EIA={your key}

In [None]:
# setup for python logging
import logging
import sys

logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [None]:
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

### Baseline Access to pudl_out

In [None]:
# this configuration will return tables without aggregating by a time frequency... we'll explore that more below.
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine)

In [None]:
# if you want to see all of the docstrings for the public functions in pudl_out
# you can run this (it is commented out because it is quite long):
#help(pudl_out)

In [None]:
# this is the master list of all of the methods in the pudl_out object
# they all return a table corresponding to their name
methods_pudl_out = [
    method_name for method_name in dir(pudl_out)
    if callable(getattr(pudl_out, method_name))    # if it is a method
    and '__' not in method_name                    # remove the internal methods
]
methods_pudl_out
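
To see exactly what the name filter above keeps and drops, here is the same comprehension applied to a small hypothetical class (`ToyOutputs` is not part of PUDL). Note that a single-underscore helper passes the `'__'` check. If you want to hide those too, you could filter with `not name.startswith('_')` instead.

```python
class ToyOutputs:
    """Hypothetical class used only to show how the name filter behaves."""

    freq = "AS"  # plain attribute, not callable: dropped

    def gens_eia860(self):  # public method: kept
        return "gens"

    def _helper(self):  # single underscore, no double underscore: also kept
        return "helper"


toy = ToyOutputs()
toy_methods = [
    name for name in dir(toy)
    if callable(getattr(toy, name)) and "__" not in name
]
print(toy_methods)
```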

In [None]:
# you can run any of them to get their table
gens_eia860 = pudl_out.gens_eia860()
gens_eia860.head()

### Exploring pudl_out Arguments
Below, we'll explore the main arguments that are used to customize the PUDL output object. You can mix and match these options.

In [None]:
# here are the default arguments for the pudl_out object
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine, # we always need a pudl_engine
    freq=None,               # Desired time grouping to aggregate PUDL tables to.
    start_date=None,         # Beginning date for data to pull from the PUDL DB.
    end_date=None,           # End date for data to pull from the PUDL DB.
    fill=False,              # Whether or not to fill in missing fuel costs with EIA monthly state-level averages
    roll=False,              # Whether or not to fill in monthly missing fuel costs with a 12-month rolling average.
)

#### Frequency Exploration

The PUDL output object accepts any <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases">frequency</a>, but the only tested and supported options are `AS` (annual, starting at the beginning of the calendar year) and `MS` (monthly, starting at the beginning of the month).
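
As a rough illustration of what aggregating to an annual, year-start frequency means, the sketch below rolls hypothetical monthly records up to one record per generator per year, labeled with January 1. The records and column meanings are made up. Summing is an assumption that suits a quantity like net generation. Other columns may need a different aggregation.

```python
from collections import defaultdict
from datetime import date

# Hypothetical monthly records: (report_date, generator_id, net_generation_mwh).
monthly = [
    (date(2019, 1, 1), "gen_a", 100.0),
    (date(2019, 2, 1), "gen_a", 120.0),
    (date(2020, 1, 1), "gen_a", 90.0),
    (date(2019, 1, 1), "gen_b", 10.0),
]

# Label each record with the first day of its calendar year (the "AS" convention)
# and sum within each (year start, generator) group.
annual = defaultdict(float)
for report_date, generator_id, mwh in monthly:
    year_start = report_date.replace(month=1, day=1)
    annual[(year_start, generator_id)] += mwh

for key in sorted(annual):
    print(key, annual[key])
```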

In [None]:
pudl_out_as = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine, # we always need a pudl_engine
    freq='AS',               # Aggregate tables annually
)

In [None]:
gen_eia_923_as = pudl_out_as.gen_eia923()
gen_eia_923_as.head()

In [None]:
pudl_out_ms = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine, # we always need a pudl_engine
    freq='MS',               # Aggregate tables monthly
)

In [None]:
gen_ms = pudl_out_ms.gen_eia923()
gen_ms.head()

#### Filling in Missing Values

The `fill=True` argument below is where the EIA API key is required. If you don't have an EIA API key, <a href="https://www.eia.gov/opendata/register.php">register for one here</a> and set it as an <a href="https://www.twilio.com/blog/2017/01/how-to-set-environment-variables.html">environment variable</a>.

In [None]:
pudl_out_fill = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine, # we always need a pudl_engine
    freq='MS',               # Aggregate tables monthly
    fill=True,               # Fill in missing fuel cost records with state-level averages from EIA's API
    roll=True,               # Fill in missing fuel cost records with a 12-month rolling average.
)

In [None]:
frc_eia923 = pudl_out_fill.frc_eia923()
frc_eia923.head()
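
For intuition about what filling gaps with a rolling average involves, here is a minimal sketch in plain Python. It is not PUDL's implementation. The function name, the trailing-window choice, and the sample costs are assumptions made for illustration.

```python
def fill_with_rolling_mean(values, window=12):
    """Replace None entries with the mean of the known values in the trailing window."""
    filled = []
    for i, value in enumerate(values):
        if value is None:
            # Look back over the current window in the original (unfilled) series.
            recent = [v for v in values[max(0, i - window + 1):i + 1] if v is not None]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


# Hypothetical monthly fuel costs ($/MMBtu) with two gaps, using a short window.
costs = [2.0, 4.0, None, 6.0, None]
print(fill_with_rolling_mean(costs, window=3))
```

A gap with no known values anywhere in its window stays empty. This is one reason it can make sense to combine a rolling fill with another source, such as the state-level averages used by `fill=True`.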

### Denormalized PUDL Output Tables
Below, we'll extract and show a sample of each of the denormalized PUDL output tables. We'll do this with an annual frequency, but you can customize the object below for any/all of these tables.

In [None]:
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine, # we always need a pudl_engine
    freq='AS',               # Aggregate tables annually
)

#### EIA Tables

In [None]:
# here are all of the EIA tables
tables_eia = [
    t for t in methods_pudl_out 
    if '_eia' in t 
    and '_eia861' not in t       # skip the EIA 861 tables for now because they are preliminary
]
tables_eia

In [None]:
# Pull a dataframe of EIA plant-utility associations.
pu_assn_eia = pudl_out.pu_eia860()
pu_assn_eia.sample(4)

In [None]:
# Pull a dataframe of boiler-generator associations from EIA 860.
bga_eia860 = pudl_out.bga_eia860()
bga_eia860.sample(4)

In [None]:
# Pull a dataframe of plant level info reported in EIA 860.
plants_eia860 = pudl_out.plants_eia860()
plants_eia860.sample(4)

In [None]:
# Pull a dataframe describing generators, as reported in EIA 860.
gens_eia860 = pudl_out.gens_eia860()
gens_eia860.sample(4)

In [None]:
# Pull a dataframe of generator level ownership data from EIA 860.
own_eia860 = pudl_out.own_eia860()
own_eia860.sample(4)

In [None]:
# Pull EIA 923 generation and fuel consumption data.
gf_eia923 = pudl_out.gf_eia923()
gf_eia923.sample(4)

In [None]:
# Pull EIA 923 fuel receipts and costs data.
frc_eia923 = pudl_out.frc_eia923()
frc_eia923.sample(4)

In [None]:
# Pull EIA 923 boiler fuel consumption data.
bf_eia923 = pudl_out.bf_eia923()
bf_eia923.sample(4)

In [None]:
# Pull EIA 923 net generation data by generator.
gen_eia923 = pudl_out.gen_eia923()
gen_eia923.sample(4)

#### FERC Form 1 Tables

In [None]:
# here are all of the FERC Form 1 tables
tables_ferc1 = [
    t for t in methods_pudl_out 
    if '_ferc1' in t 
]
tables_ferc1

In [None]:
# Pull the FERC Form 1 steam plants data.
plants_steam_ferc1 = pudl_out.plants_steam_ferc1()
plants_steam_ferc1.sample(4)

In [None]:
# Summarize FERC Form 1 fuel usage by plant.
fbp_ferc1 = pudl_out.fbp_ferc1()
fbp_ferc1.sample(4)

### Pull Analysis Tables
Catalyst typically does not add calculated fields to the PUDL database, but many calculated fields and analyses are useful, and some are necessary to answer certain questions. One focus of our analysis has been compiling MCOE (marginal cost of electricity) data.
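
The quantities returned by the analysis methods below follow standard definitions, sketched here with made-up numbers. The function and parameter names are hypothetical, and the real methods work on whole tables and handle many data-quality details that this sketch ignores.

```python
def heat_rate_mmbtu_mwh(fuel_consumed_mmbtu, net_generation_mwh):
    """Fuel energy burned per unit of electricity produced."""
    return fuel_consumed_mmbtu / net_generation_mwh


def fuel_cost_per_mwh(heat_rate, fuel_cost_per_mmbtu):
    """Fuel cost of producing one MWh at the given heat rate."""
    return heat_rate * fuel_cost_per_mmbtu


def capacity_factor(net_generation_mwh, capacity_mw, hours):
    """Actual generation as a fraction of running at full capacity for `hours`."""
    return net_generation_mwh / (capacity_mw * hours)


# Hypothetical generator-year: a 100 MW unit that produced 438,000 MWh
# from 4,380,000 MMBtu of fuel purchased at $2.50/MMBtu.
hr = heat_rate_mmbtu_mwh(4_380_000.0, 438_000.0)
cost = fuel_cost_per_mwh(hr, 2.5)
cf = capacity_factor(438_000.0, 100.0, 8760)
print(hr, cost, cf)
```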


In [None]:
# Calculate and return generation unit level heat rates.
hr_by_unit = pudl_out.hr_by_unit()
hr_by_unit.sample(4)

In [None]:
# Calculate and return generator level heat rates (mmBTU/MWh).
hr_by_gen = pudl_out.hr_by_gen()
hr_by_gen.sample(4)

In [None]:
# Calculate and return generator level fuel costs per MWh.
fuel_cost = pudl_out.fuel_cost()
fuel_cost.sample(4)

In [None]:
# Calculate and return generator level capacity factors.
capacity_factor = pudl_out.capacity_factor()
capacity_factor.sample(4)

In [None]:
# Calculate and return generator level MCOE based on EIA data.
mcoe = pudl_out.mcoe()
mcoe.sample(4)

### Access preliminary tables
Integrating new datasets into the PUDL database requires many steps (datastore, extract, transform, load, outputs). Sometimes we need to use tables from new datasets for analysis as soon as possible. The output layer allows us to skip loading tables into the database, so we can access these data tables in a format similar to the one they will have once the datasets are fully integrated. Buyer beware, though: we make no guarantees that these tables will be exactly the same after that transition.

As of December 2020, we have preliminarily integrated EIA 861 and FERC 714 in this format.

#### Preliminary EIA 861

In [None]:
# here are all of the EIA 861 tables
methods_eia861 = [t for t in methods_pudl_out if '_eia861' in t]
methods_eia861

In [None]:
# The first EIA 861 method runs an ET (extract/transform) process for all EIA 861 tables.
# It takes ~2 minutes... but after that they are all cached

# grab the balancing authority table
ba_eia861 = pudl_out.balancing_authority_eia861()
ba_eia861.sample(4)

In [None]:
# grab the advanced metering infrastructure table
ami_eia861 = pudl_out.advanced_metering_infrastructure_eia861()
ami_eia861.head()

#### Preliminary FERC 714 Tables

In [None]:
# here are all of the FERC 714 tables
methods_ferc714 = [t for t in methods_pudl_out if '_ferc714' in t]
methods_ferc714

In [None]:
# like EIA 861, the first table you try to grab will run the ET (extract/transform) process
# it should take ~15 minutes, but all of the 714 tables will then be cached and ready
respondent_id_ferc714 = pudl_out.respondent_id_ferc714()
respondent_id_ferc714.head()

In [None]:
demand_hourly_pa_ferc714 = pudl_out.demand_hourly_pa_ferc714()
demand_hourly_pa_ferc714.head()