# CS2316 Final Project — Phase II  
### Data cleaning, inflation adjustment, and merge prep

**Team Members**  
- Sushanth Naga Sai Chunduri  
- Jackson O'Connell

**What this notebook is for**  
This is our Phase II notebook. The goal is to collect the required datasets, clean them, get them on the same structure, fix inconsistencies, and export cleaned versions for Phase III.  
We are doing everything inside this one notebook since the TA grades only what is shown here.

**Work split between us (based on our messages)**  
- **Jackson**: EIA API code, CPI inflation adjustment, and setting up the merge structure  
- **Sushanth**: Airfare CSV cleaning and BTS baggage-fee data from the web (HTML/Excel scrape, or an alternate source per TA feedback)

**Expected exported files from this notebook**  
Jackson's work:
- `cpi_clean.csv`
- `eia_clean.csv`

Teammate work (coming later):
- `airfare_clean.csv` (**TBD**)
- `baggage_clean.csv` (**TBD**)
- if TA makes us change baggage source we will update here (**TBD**)

**What happens after cleaning**  
Everything will be converted to quarterly format with matching keys so we can combine them in Phase III and start the actual analysis.  
At the bottom of this notebook we will also list the 3–5 inconsistencies we fixed since that is required for Phase II.

Once both parts are dropped in here and run top to bottom the notebook will be ready to submit for the Phase II deadline.

## Part 0: Prep Work
Imports, keys, helper functions, and other basic setup.

Step 1: Imports

In [None]:
import json
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs4

# NOTE to self: Getting a lot of warnings about NumPy deprecations due to other things on my laptop requiring different versions.
# If this happens again, make a new kernel specific to this folder.

Step 2: Set Keys (for CPI, EIA, BTS, etc.)

In [None]:
CPI_URL = "https://api.stlouisfed.org/fred/series/observations?series_id=CPIAUCSL&api_key=8089a399255548ca15cfbc5a36571116&file_type=json"
EIA_URL = "https://api.eia.gov/v2/petroleum/pri/spt/data/?frequency=monthly&data[0]=value&facets[series][]=PET.RWTC&api_key=gc3m0emd44qaMHTNQxwmLqupbzdCtHHB2G6aOCNl"
# BTS = #not sure we need this, TBD

Step 3: Helper functions (date formatting, file IO, etc.)

In [8]:
# formatting dates, need to use universally
# date format: YYYYQn (e.g. 2023Q1, 2025Q2)
def to_quarter_label(ts):
    q = ((ts.month - 1) // 3) + 1
    return f"{ts.year}Q{q}"

# save dataframe to csv with an explicit mention of the path
def save_csv(df, path):
    df.to_csv(path, index=False)
    print(f"rewrote {path}")


# Add more here when needed. Potentially for another type of data normalization or math for inflation adjustment.
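
Quick check of the quarter label helper, as a sketch. It repeats the same arithmetic as `to_quarter_label` and uses plain `datetime.date` objects (an assumption for the demo; the notebook passes pandas timestamps, which expose the same `.year` and `.month`). It also shows that `YYYYQn` labels sort in chronological order as plain strings, which is what the later `sort_values("quarter")` calls rely on.

```python
from datetime import date

def to_quarter_label(ts):
    # Same arithmetic as the notebook helper: months 1-3 -> Q1, 4-6 -> Q2, etc.
    q = ((ts.month - 1) // 3) + 1
    return f"{ts.year}Q{q}"

# First and last days around quarter boundaries
for d in [date(2023, 1, 1), date(2023, 3, 31), date(2023, 4, 1), date(2025, 12, 31)]:
    print(d.isoformat(), "->", to_quarter_label(d))

# String sort matches chronological order for this label format
labels = ["2024Q4", "2023Q2", "2025Q1"]
print(sorted(labels))
```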


## CPI Functionality
For handling CPI intake and the following:
- Get the JSON data from the FRED API
- Convert the monthly to a quarterly mean
- Maintain table setup for joining other tables later

Notes to save time / error logging:
- first try I used response.json(); switching to json.loads() to match our plan
- CPI values come as strings and sometimes "."; coerce then drop
- base CPI = latest 2025 quarter if present, else use the latest available
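
A small sketch of the coerce-then-drop step and the monthly-to-quarterly mean described above. The observations are made up (not real CPI values) and are only shaped like the `date`/`value` records we expect from the API. Here the quarter label comes from pandas' own period string, which is an alternative to our helper, not what the function below does.

```python
import pandas as pd

# Made-up observations shaped like {"date": ..., "value": ...} records
obs = [
    {"date": "2024-01-01", "value": "100.0"},
    {"date": "2024-02-01", "value": "."},      # "." marks a missing value
    {"date": "2024-03-01", "value": "102.0"},
    {"date": "2024-04-01", "value": "103.0"},
]
demo = pd.DataFrame(obs)
demo["date"] = pd.to_datetime(demo["date"])
demo["value"] = pd.to_numeric(demo["value"], errors="coerce")  # "." becomes NaN
demo = demo.dropna(subset=["value"])                            # drop the NaN rows

# Label each month with its quarter, then average within the quarter
demo["quarter"] = demo["date"].dt.to_period("Q").astype(str)
quarterly = demo.groupby("quarter", as_index=False)["value"].mean()
print(quarterly)
```

Note that the first quarter's mean here is built from two months only, because the dropped month no longer counts toward the average.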

In [14]:
def get_cpi_quarterly(cpi_url, base_year=2025, export_path="cpi_clean.csv"):
    r = requests.get(cpi_url, timeout=30)
    r.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
    raw = json.loads(r.text)

    obs = raw.get("observations", [])
    cpi = pd.DataFrame(obs)[["date", "value"]].copy()

    cpi["date"] = pd.to_datetime(cpi["date"])
    cpi["value"] = pd.to_numeric(cpi["value"], errors="coerce")
    cpi = cpi.dropna(subset=["value"])

    # monthly to quarterly normalization
    # if we need to do this later, we might want to figure out how to make it a call-able helper function instead
    cpi["qstart"] = cpi["date"].dt.to_period("Q").dt.start_time
    cpi_q = (
        cpi.groupby("qstart", as_index=False)["value"]
           .mean()
           .rename(columns={"qstart": "quarter_start_date", "value": "cpi_index"})
    )
    cpi_q["quarter"] = cpi_q["quarter_start_date"].apply(to_quarter_label)

    # choose the base quarter: latest quarter in base_year if present, else the latest available
    has_base_year = cpi_q["quarter"].str.startswith(str(base_year)).any()
    if has_base_year:
        base_row = cpi_q[cpi_q["quarter"].str.startswith(str(base_year))].iloc[-1]
    else:
        base_row = cpi_q.iloc[-1]

    cpi_clean = cpi_q[["quarter", "cpi_index"]].copy()
    cpi_clean["cpi_base_year"] = base_year
    cpi_clean["cpi_base_value"] = float(base_row["cpi_index"])
    cpi_clean["cpi_base_quarter_used"] = base_row["quarter"]

    save_csv(cpi_clean, export_path)
    return cpi_clean

## EIA WTI Function
For handling EIA data and doing the following:
- Get JSON data (monthly)
- Convert monthly to quarterly mean
- Merge CPI from outside (per our split)
- Compute real 2025 dollars: real = nominal * (CPI_base / CPI_q)
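
Sanity check on the formula above, using made-up numbers (not real CPI or WTI values). `to_real_dollars` is just an illustrative name and is not defined elsewhere in this notebook.

```python
def to_real_dollars(nominal, cpi_quarter, cpi_base):
    """Convert a nominal price to base-period dollars: nominal * (cpi_base / cpi_quarter)."""
    return nominal * (cpi_base / cpi_quarter)

# Prices were lower when CPI was lower, so the old price gets scaled up
print(to_real_dollars(50.0, 200.0, 300.0))  # 75.0

# In the base quarter itself the ratio is 1, so nominal == real
print(to_real_dollars(50.0, 300.0, 300.0))  # 50.0
```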

Notes / error logging:
- first attempt assumed top-level "data"; some responses nest under response.data so handle both in a very basic way
- if CPI is missing a quarter, real will be NaN; fine until teammate data lands
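
One way to keep the two-shape handling in one place is a small helper. This is only a sketch: `extract_rows` is a hypothetical name, and the payloads below are made up to mimic the two shapes mentioned in the notes.

```python
def extract_rows(parsed):
    """Return the list of records from either payload shape, or [] if neither matches."""
    if not isinstance(parsed, dict):
        return []
    response = parsed.get("response")
    # Nested shape: {"response": {"data": [...]}}
    if isinstance(response, dict) and "data" in response:
        return response["data"]
    # Flat shape: {"data": [...]}; anything else falls back to an empty list
    return parsed.get("data", [])

nested = {"response": {"data": [{"period": "2024-01", "value": "1.0"}]}}
flat = {"data": [{"period": "2024-02", "value": "2.0"}]}
print(extract_rows(nested))
print(extract_rows(flat))
print(extract_rows({"error": "something else"}))
```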


In [15]:
def get_eia_wti_quarterly(eia_url, cpi_df, export_path="eia_clean.csv", base_year=2025):
    r = requests.get(eia_url)
    parsed = json.loads(r.text)

    # try both shapes
    if isinstance(parsed, dict) and "response" in parsed and "data" in parsed["response"]:
        rows = parsed["response"]["data"]
    elif isinstance(parsed, dict) and "data" in parsed:
        rows = parsed["data"]
    else:
        rows = []

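    eia = pd.DataFrame(rows)

    # no rows (e.g. the facet matched nothing): export an empty table
    # instead of hitting a KeyError on "period" below
    if eia.empty:
        empty = pd.DataFrame(columns=["quarter", "wti_usd_nominal", "wti_usd_real_2025"])
        save_csv(empty, export_path)
        return empty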

    # normalize expected columns
    if "period" not in eia.columns and "date" in eia.columns:
        eia["period"] = eia["date"]

    eia["period"] = pd.to_datetime(eia["period"], errors="coerce")
    eia["value"] = pd.to_numeric(eia.get("value", np.nan), errors="coerce")
    eia = eia.dropna(subset=["period", "value"])

    # monthly -> quarterly
    eia["qstart"] = eia["period"].dt.to_period("Q").dt.start_time
    wti_q = (
        eia.groupby("qstart", as_index=False)["value"]
           .mean()
           .rename(columns={"qstart": "quarter_start_date", "value": "wti_usd_nominal"})
    )
    wti_q["quarter"] = wti_q["quarter_start_date"].apply(to_quarter_label)

    # bring in CPI
    cpi_small = cpi_df[["quarter", "cpi_index", "cpi_base_value"]].copy()
    merged = wti_q.merge(cpi_small, on="quarter", how="left")

    # inflation adjust to constant 2025 dollars
    merged["wti_usd_real_2025"] = merged["wti_usd_nominal"] * (merged["cpi_base_value"] / merged["cpi_index"])

    eia_clean = merged[["quarter", "wti_usd_nominal", "wti_usd_real_2025"]].sort_values("quarter").reset_index(drop=True)

    save_csv(eia_clean, export_path)
    return eia_clean


## Merging Tables and Normalizing Data
This part is for aligning and normalizing all the data by doing the following:
- show the plan to join once teammate drops in their CSVs
- NOTE TO COME BACK TO LATER: for now just try to read and print status if not there


Notes / error logging:
- first run failed when files were missing; now it's optional
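
A sketch of how the left join on `quarter` could be checked once the teammate files exist. The `avg_fare` column and all values are placeholders, since the real airfare columns are still TBD. `indicator=True` is a standard pandas merge option that adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Placeholder tables; only the shared "quarter" key matters here
base = pd.DataFrame({
    "quarter": ["2024Q1", "2024Q2", "2024Q3"],
    "wti_usd_nominal": [1.0, 2.0, 3.0],
})
other = pd.DataFrame({
    "quarter": ["2024Q2", "2024Q3", "2024Q4"],
    "avg_fare": [10.0, 20.0, 30.0],  # hypothetical column name
})

# Left join keeps every base quarter; unmatched ones get NaN in avg_fare
merged = base.merge(other, on="quarter", how="left", indicator=True)
print(merged)

unmatched = merged.loc[merged["_merge"] == "left_only", "quarter"].tolist()
print("quarters with no match:", unmatched)
```

Listing the unmatched quarters this way would give us a ready-made inconsistency to report, instead of finding the NaN rows by eye.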

In [16]:
def merge_quarterly_skeleton(eia_df, airfare_path="airfare_clean.csv", baggage_path="baggage_clean.csv"):
    out = eia_df.copy()
    print("base:", out.shape)

    try:
        airfare = pd.read_csv(airfare_path)
        out = out.merge(airfare, on="quarter", how="left")
        print("merged airfare:", out.shape)
    except FileNotFoundError:
        print(f"airfare not ready yet -> expecting {airfare_path}")

    try:
        baggage = pd.read_csv(baggage_path)
        out = out.merge(baggage, on="quarter", how="left")
        print("merged baggage:", out.shape)
    except FileNotFoundError:
        print(f"baggage not ready yet -> expecting {baggage_path}")

    return out

## Printing / Result Checking
- COME BACK TO: need to clean up all formatting and such to make it look nice

In [31]:
#Need to reset URL vars in this part:
CPI_URL = "https://api.stlouisfed.org/fred/series/observations?series_id=CPIAUCSL&api_key=8089a399255548ca15cfbc5a36571116&file_type=json"

# EIA WTI URL (monthly spot price)
EIA_URL = "https://api.eia.gov/v2/petroleum/pri/spt/data/?frequency=monthly&data[0]=value&facets[series][]=PET.RWTC&api_key=gc3m0emd44qaMHTNQxwmLqupbzdCtHHB2G6aOCNl"


# 1) CPI quarterly table
cpi_df = get_cpi_quarterly(CPI_URL, base_year=2025, export_path="cpi_clean.csv")
print(cpi_df.head())

# 2) EIA quarterly table with real-dollar column
# having issues with the period col
eia_df = get_eia_wti_quarterly(EIA_URL, cpi_df, export_path="eia_clean.csv", base_year=2025)
print(eia_df.head())

# 3) Merge scaffold (will print notes if teammate files are not there yet)
combined_preview = merge_quarterly_skeleton(eia_df)
print(combined_preview.head())


rewrote cpi_clean.csv
  quarter  cpi_index  cpi_base_year  cpi_base_value cpi_base_quarter_used
0  1947Q1  21.700000           2025         323.288                2025Q3
1  1947Q2  22.010000           2025         323.288                2025Q3
2  1947Q3  22.490000           2025         323.288                2025Q3
3  1947Q4  23.126667           2025         323.288                2025Q3
4  1948Q1  23.616667           2025         323.288                2025Q3
rewrote eia_clean.csv
Empty DataFrame
Columns: [quarter, wti_usd_nominal, wti_usd_real_2025]
Index: []
base: (0, 3)
airfare not ready yet -> expecting airfare_clean.csv
baggage not ready yet -> expecting baggage_clean.csv
Empty DataFrame
Columns: [quarter, wti_usd_nominal, wti_usd_real_2025]
Index: []


In [None]:
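# After like ten attempts, I gave up on the prebuilt URL and just pass params instead.
# Likely cause (not confirmed): the facet value should be RWTC, not PET.RWTC,
# since the series column in the TESTING output lists RWTC.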
def get_eia_wti_quarterly_from_params(api_key, cpi_df, export_path="eia_clean.csv"):
    base = "https://api.eia.gov/v2/petroleum/pri/spt/data/"
    params = {
        "api_key": api_key,
        "frequency": "monthly",
        "data[0]": "value",
        # pull all series, filter in pandas after
        "length": 5000,  # default page size is small; this avoids truncation
        # optional: start at a reasonable date to reduce payload
        "start": "2000-01-01"
    }
    r = requests.get(base, params=params)
    parsed = json.loads(r.text)

    rows = []
    if isinstance(parsed, dict):
        if "response" in parsed and isinstance(parsed["response"], dict) and "data" in parsed["response"]:
            rows = parsed["response"]["data"]
        elif "data" in parsed:
            rows = parsed["data"]

    eia = pd.DataFrame(rows)

    # quick prints for sanity during Phase II
    print("EIA total rows:", len(eia))
    if not eia.empty:
        print("EIA columns:", list(eia.columns)[:10])

    if eia.empty:
        print("EIA returned no rows. Double-check API key and endpoint. Will export empty table.")
        empty = pd.DataFrame(columns=["quarter","wti_usd_nominal","wti_usd_real_2025"])
        save_csv(empty, export_path)
        return empty

    # keep only the WTI spot price series
    if "series" in eia.columns:
        # .copy() so the column assignments below do not trigger SettingWithCopyWarning
        eia = eia[eia["series"].astype(str).str.upper().eq("RWTC")].copy()


    # normalize date column
    if "period" not in eia.columns:
        if "date" in eia.columns:
            eia["period"] = eia["date"]
        else:
            print("No period/date column found in EIA data. Exporting empty.")
            empty = pd.DataFrame(columns=["quarter","wti_usd_nominal","wti_usd_real_2025"])
            save_csv(empty, export_path)
            return empty

    eia["period"] = pd.to_datetime(eia["period"], errors="coerce")
    eia["value"] = pd.to_numeric(eia.get("value", np.nan), errors="coerce")
    eia = eia.dropna(subset=["period", "value"])

    # monthly -> quarterly mean
    eia["qstart"] = eia["period"].dt.to_period("Q").dt.start_time
    wti_q = (
        eia.groupby("qstart", as_index=False)["value"]
           .mean()
           .rename(columns={"qstart": "quarter_start_date", "value": "wti_usd_nominal"})
    )
    wti_q["quarter"] = wti_q["quarter_start_date"].apply(to_quarter_label)

    # inflation adjust with CPI already built
    cpi_small = cpi_df[["quarter", "cpi_index", "cpi_base_value"]].copy()
    merged = wti_q.merge(cpi_small, on="quarter", how="left")
    merged["wti_usd_real_2025"] = merged["wti_usd_nominal"] * (merged["cpi_base_value"] / merged["cpi_index"])

    eia_clean = merged[["quarter", "wti_usd_nominal", "wti_usd_real_2025"]].sort_values("quarter").reset_index(drop=True)
    save_csv(eia_clean, export_path)
    return eia_clean
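
The function above pulls every series and filters in pandas. A possible alternative is to filter on the server with a facet, sketched below. This assumes the route accepts `facets[series][]` with the bare id `RWTC` (the way the id appears in the `series` column), which we have not confirmed against the API. `build_eia_url` and the `YOUR_KEY` placeholder are hypothetical, and the snippet only builds the URL; it does not call the API.

```python
from urllib.parse import urlencode

BASE = "https://api.eia.gov/v2/petroleum/pri/spt/data/"

def build_eia_url(api_key, series="RWTC"):
    """Build the query URL; bracketed parameter names get percent-encoded."""
    params = {
        "api_key": api_key,
        "frequency": "monthly",
        "data[0]": "value",
        "facets[series][]": series,  # assumption: bare series id, not PET.RWTC
        "length": 5000,
    }
    return BASE + "?" + urlencode(params)

url = build_eia_url("YOUR_KEY")  # placeholder key, not a real one
print(url)
```

Passing the same dict to `requests.get(BASE, params=params)` would encode it the same way, so we would not have to hand-write the brackets in a URL string.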


In [30]:
# use this instead of passing a prebuilt URL
EIA_API_KEY = "gc3m0emd44qaMHTNQxwmLqupbzdCtHHB2G6aOCNl"

cpi_df = get_cpi_quarterly(CPI_URL, base_year=2025, export_path="cpi_clean.csv")
eia_df = get_eia_wti_quarterly_from_params(EIA_API_KEY, cpi_df, export_path="eia_clean.csv")
print(eia_df.head())


rewrote cpi_clean.csv
EIA total rows: 3204
EIA columns: ['period', 'duoarea', 'area-name', 'product', 'product-name', 'process', 'process-name', 'series', 'series-description', 'value']
rewrote eia_clean.csv
  quarter  wti_usd_nominal  wti_usd_real_2025
0  2000Q1        28.823333          54.780939
1  2000Q2        28.776667          54.266873
2  2000Q3        31.613333          59.076366
3  2000Q4        31.990000          59.357087
4  2001Q1        28.816667          52.962379


### TESTING

In [28]:
r = requests.get("https://api.eia.gov/v2/petroleum/pri/spt/data/", params={
    "api_key": EIA_API_KEY,
    "frequency": "monthly",
    "data[0]": "value",
    "length": 5000
})
parsed = json.loads(r.text)
rows = parsed["response"]["data"]
tmp = pd.DataFrame(rows)
print(tmp["series"].unique()[:30])


['EER_EPMRU_PF4_RGC_DPG' 'RBRTE' 'EER_EPD2DC_PF4_Y05LA_DPG'
 'EER_EPJK_PF4_RGC_DPG' 'EER_EPMRR_PF4_Y05LA_DPG'
 'EER_EPD2DXL0_PF4_Y35NY_DPG' 'EER_EPLLPA_PF4_Y44MB_DPG'
 'EER_EPMRU_PF4_Y35NY_DPG' 'EER_EPD2F_PF4_Y35NY_DPG'
 'EER_EPD2DXL0_PF4_RGC_DPG' 'RWTC']


In [32]:
# testing raw data fetch

EIA_URL = "https://api.eia.gov/v2/petroleum/pri/spt/data/?frequency=monthly&data[0]=value&facets[series][]=PET.RWTC&api_key=gc3m0emd44qaMHTNQxwmLqupbzdCtHHB2G6aOCNl"

_raw = json.loads(requests.get(EIA_URL).text)
if "response" in _raw and isinstance(_raw["response"], dict) and "data" in _raw["response"]:
    sample = _raw["response"]["data"][:3]
elif "data" in _raw:
    sample = _raw["data"][:3]
else:
    sample = _raw
print("sample rows/keys from EIA:", sample[:1])

sample rows/keys from EIA: []
