# 05 U.S. Recession Indicator: Cleaning & Export

Goal: Clean the U.S. recession indicator (`USREC`), create `date_key`, and export `usrec_cleaned.csv` plus an optional `dim_usrec.csv` for easy joins in Power BI.

Guardrails: work on copies, avoid `inplace=True`, use explicit type conversions, validate early, clean conservatively, and document decisions.


What: Import libraries and define paths.
Why: Standardize the environment and the output location.


In [1]:
import os
import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 160)

ROOT = os.path.abspath(os.path.join(os.getcwd(), "..")) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
RAW = os.path.join(ROOT, "original_data")
CLEAN = os.path.join(ROOT, "data", "cleaned")
os.makedirs(CLEAN, exist_ok=True)


What: Load CSV into a working copy and inspect.
Why: Validate structure and dtypes before transformations.


In [2]:
# Load
raw_path = os.path.join(RAW, "USREC.csv")
df_raw = pd.read_csv(raw_path)
df = df_raw.copy()
print('Shape:', df.shape)
print(df.head())
print(df.info())


Shape: (2051, 2)
  observation_date  USREC
0       1854-12-01      1
1       1855-01-01      0
2       1855-02-01      0
3       1855-03-01      0
4       1855-04-01      0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   observation_date  2051 non-null   object
 1   USREC             2051 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 32.2+ KB
None
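
Beyond shape and dtypes, it can help to confirm that the series is a gap-free monthly sequence with no duplicate dates. The sketch below is a minimal illustration on a small hypothetical frame that mirrors the `observation_date`/`USREC` layout shown above; the helper name `check_monthly_series` is an assumption, not part of the notebook.

```python
import pandas as pd

def check_monthly_series(frame, date_col="observation_date"):
    """Return simple structural checks for a monthly series (hypothetical helper)."""
    dates = pd.to_datetime(frame[date_col], errors="coerce")
    # Expected calendar: every month start between the first and last observation
    expected = pd.date_range(dates.min(), dates.max(), freq="MS")
    observed = pd.DatetimeIndex(dates.dropna())
    return {
        "unparseable_dates": int(dates.isna().sum()),
        "duplicate_dates": int(dates.duplicated().sum()),
        "missing_months": int(len(expected.difference(observed))),
    }

# Illustrative data only (not the real USREC file): February 1855 is deliberately missing
sample = pd.DataFrame({
    "observation_date": ["1854-12-01", "1855-01-01", "1855-03-01"],
    "USREC": [1, 0, 0],
})
report = check_monthly_series(sample)
print(report)
```

A non-zero count in any field is a cue to inspect the raw file before transforming it.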


What: Detect date and recession columns; coerce; create date_key.
Why: Build reliable joins and a clean binary flag for modeling.


In [3]:
# Detect date and recession columns (robust)
date_col = next((c for c in df.columns if 'date' in c.strip().lower()), None)
rec_col = next((c for c in df.columns if 'rec' in c.strip().lower()), None)
assert date_col and rec_col, 'Could not detect date or recession indicator columns; set manually.'

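# Parse date and flag; fail loudly rather than silently filling bad values with 0
s_date = pd.to_datetime(df[date_col], errors='coerce')
flag = pd.to_numeric(df[rec_col], errors='coerce')
assert s_date.notna().all() and flag.notna().all(), 'Unparseable dates or flags; inspect before continuing.'
flag = flag.astype(int)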

fact = df.copy()
fact[date_col] = s_date
fact[rec_col] = flag

# Create date_key
fact['date_key'] = fact[date_col].dt.strftime('%Y%m%d').astype(int)

# Tidy and export
fact.columns = [str(c).strip().lower().replace(' ', '_') for c in fact.columns]
out_path = os.path.join(CLEAN, 'usrec_cleaned.csv')
fact.to_csv(out_path, index=False, encoding='utf-8')
print('Wrote:', out_path)

# Optional dim_usrec (column names were tidied above, so apply the same rule to rec_col)
rec_key = str(rec_col).strip().lower().replace(' ', '_')
dim = fact[['date_key', rec_key]].drop_duplicates().rename(columns={rec_key: 'recession_flag'})
dim_path = os.path.join(CLEAN, 'dim_usrec.csv')
dim.to_csv(dim_path, index=False, encoding='utf-8')
print('Wrote:', dim_path)


Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\usrec_cleaned.csv
Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\dim_usrec.csv
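
The `date_key` built in this cell is a `YYYYMMDD` integer. As a minimal sketch, the same key can be computed arithmetically, which avoids the string round trip and rejects missing dates explicitly. The helper name `make_date_key` is hypothetical and the sample dates are illustrative.

```python
import pandas as pd

def make_date_key(dates):
    """Build YYYYMMDD integer keys from a datetime Series (hypothetical helper)."""
    if dates.isna().any():
        raise ValueError("date_key cannot be built from missing dates")
    return (dates.dt.year * 10000 + dates.dt.month * 100 + dates.dt.day).astype("int64")

sample_dates = pd.to_datetime(pd.Series(["1854-12-01", "2025-10-01"]))
keys = make_date_key(sample_dates)
print(keys.tolist())  # [18541201, 20251001]

# Round trip: each key parses back to the date it was built from
restored = pd.to_datetime(keys.astype(str), format="%Y%m%d")
print(bool((restored == sample_dates).all()))
```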


What: Export the fact table and the optional `dim_usrec` CSV.
Why: Provide a simple relationship for recession overlays in Power BI.


In [4]:
# Validation
print('Shape:', fact.shape)
print('Min/Max date_key:', fact['date_key'].min(), fact['date_key'].max())
print('Recession flag values:', fact['usrec'].unique())  # tidied column name is 'usrec'
print('Nulls (top 10):')
print(fact.isna().sum().sort_values(ascending=False).head(10))


Shape: (2051, 3)
Min/Max date_key: 18541201 20251001
Recession flag values: [1 0]
Nulls (top 10):
observation_date    0
usrec               0
date_key            0
dtype: int64


What: Validate shape, date_key ranges, recession values, and nulls.
Why: Confirm the export is correct and joins cleanly to visuals in Power BI.
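
The printed checks above depend on someone reading the output. A stricter variant, sketched here under the assumption that the cleaned frame has the `observation_date`, `usrec`, and `date_key` columns listed in the null summary, raises as soon as an expectation fails. The function name `validate_usrec` and the sample frame are illustrative.

```python
import pandas as pd

def validate_usrec(frame, flag_col="usrec", key_col="date_key"):
    """Raise AssertionError if the cleaned frame breaks a basic expectation (hypothetical helper)."""
    assert frame[key_col].is_unique, "date_key must be unique to act as a join key"
    assert frame[key_col].is_monotonic_increasing, "rows should be in chronological order"
    assert set(frame[flag_col].unique()) <= {0, 1}, "flag must contain only 0 and 1"
    assert not frame.isna().any().any(), "no nulls expected after cleaning"
    return True

# Illustrative rows only, shaped like the cleaned output
sample = pd.DataFrame({
    "observation_date": pd.to_datetime(["1854-12-01", "1855-01-01", "1855-02-01"]),
    "usrec": [1, 0, 0],
    "date_key": [18541201, 18550101, 18550201],
})
print(validate_usrec(sample))
```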


## Decisions & Notes

- Parsed `observation_date` to datetime; generated `date_key` as a `YYYYMMDD` integer.
- Recession indicator coerced to integer 0/1. The raw column was already `int64` with no nulls, so no values were corrected.
- Exported optional `dim_usrec.csv` for a simple relationship to `dim_date` (or for use as a standalone dim).
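
To illustrate the relationship described in the last bullet, the sketch below mimics the Power BI join in pandas. The `dim_date` frame is a hypothetical stand-in (its `year` and `month` columns are assumptions); only the `date_key` and `recession_flag` column names come from the export above, and all values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for a date dimension keyed on date_key
dim_date = pd.DataFrame({
    "date_key": [18541201, 18550101, 18550201],
    "year": [1854, 1855, 1855],
    "month": [12, 1, 2],
})

# Same columns as the exported dim_usrec.csv; values are illustrative
dim_usrec = pd.DataFrame({
    "date_key": [18541201, 18550101],
    "recession_flag": [1, 0],
})

# validate="one_to_one" raises if either side has duplicate keys
joined = dim_date.merge(dim_usrec, on="date_key", how="left", validate="one_to_one")
print(joined)
```

Dates without a match come back with a missing `recession_flag`, which makes coverage gaps between the two tables easy to spot.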
