# Exploratory Data Analysis: US Covariates

This notebook provides a detailed exploratory data analysis (EDA) for the United States (US) macroeconomic time series used as covariates in the forecasting models.

### Key Objectives:
1.  **Data Loading**: Load the final (raw levels), transformed (differenced), and winsorized datasets.
2.  **Availability Analysis**: Systematically check for missing values across all variables and visualize data availability over the entire time span.
3.  **Grouped Time Series Visualization**: Group related variables and plot their time series to visually inspect trends, seasonality, and structural breaks.
4.  **Transformation Impact**: Create side-by-side comparisons of the transformed vs. winsorized data to understand the effect of the data preparation steps.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
from pathlib import Path

# --- Setup Paths ---
# The notebook is in 'notebooks/', so we go up one level to the project root
ROOT_DIR = Path.cwd().parent
sys.path.append(str(ROOT_DIR))

from helpers import split_group, plot_time_series, plot_group_side_by_side

# --- Load all three datasets ---
DATA_DIR = ROOT_DIR / 'data' / 'processed'

df_final = pd.read_csv(DATA_DIR / 'us_data_final.csv')
df_transformed = pd.read_csv(DATA_DIR / 'us_data_transformed.csv')
df_winsorized = pd.read_csv(DATA_DIR / 'us_data_transformed_win.csv')

# Convert date columns to datetime objects
df_final['date'] = pd.to_datetime(df_final['date'])
df_transformed['date'] = pd.to_datetime(df_transformed['date'])
df_winsorized['date'] = pd.to_datetime(df_winsorized['date'])

print("Dataframes loaded successfully.")
print("Final (Levels) Shape:", df_final.shape)
print("Transformed Shape:", df_transformed.shape)
print("Winsorized Shape:", df_winsorized.shape)

### Data Availability Analysis

First, we'll analyze the raw data (`us_data_final.csv`) to determine the time range of our target variable (`cpi_all_yoy`) and check for missing values across all potential covariates. This helps us understand the effective sample size available for modeling.

In [None]:
# --- Analyze Time Spans and Splits ---
cpi_all_yoy = df_final[['date', 'cpi_all_yoy']].dropna()
length = len(cpi_all_yoy)
splits = [int(length * p) for p in [0.2, 0.4, 0.6, 0.8]]
split_dates = [cpi_all_yoy.iloc[split]['date'] for split in splits]

start_date = cpi_all_yoy['date'].iloc[0]
end_date = cpi_all_yoy['date'].iloc[-1]

print(f"Target Variable ('cpi_all_yoy') available from: {start_date.strftime('%Y-%m')} to {end_date.strftime('%Y-%m')}")
print(f"Evaluation splits occur at: {[d.strftime('%Y-%m') for d in split_dates]}")

# --- Visualize Data Availability ---
availability_table = df_final.set_index('date').notnull().astype(int).T
date_range = availability_table.columns

plt.figure(figsize=(18, 12))
plt.imshow(availability_table, aspect='auto', cmap='Greens', interpolation='none')

# Format axes
tick_positions = list(range(0, len(date_range), 60))  # Every 60 rows, i.e. every 5 years for monthly data
tick_labels = [date_range[pos].strftime('%Y') for pos in tick_positions]
plt.xticks(tick_positions, tick_labels, rotation=45)
plt.yticks(range(len(availability_table.index)), availability_table.index)
plt.title("Variable Availability Over Time (United States)", fontsize=16)
plt.xlabel("Year")
plt.ylabel("Variables")

# Add vertical lines for the split dates
for split_date in split_dates:
    try:
        split_pos = date_range.get_loc(split_date)
        plt.axvline(x=split_pos, color='red', linestyle='--', linewidth=1)
    except KeyError:
        continue # Ignore if split date is not in index

plt.tight_layout()
plt.show()
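
The heatmap shows availability at a glance, but a per-variable table makes the missing-value check explicit. The following is a minimal sketch on a small synthetic frame. The function name `availability_summary` and the demo values are illustrative assumptions, not part of the project's `helpers` module. The same function could be applied to `df_final`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_final (hypothetical values, monthly dates)
demo = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=6, freq="MS"),
    "cpi_all_yoy": [np.nan, 2.0, 2.1, 2.3, 2.2, 2.4],
    "fedfunds": [5.5, 5.7, np.nan, 6.0, 6.3, np.nan],
})

def availability_summary(df, date_col="date"):
    """Per-variable first/last valid date and missing-value counts."""
    indexed = df.set_index(date_col)
    summary = pd.DataFrame({
        "first_valid": indexed.apply(lambda s: s.first_valid_index()),
        "last_valid": indexed.apply(lambda s: s.last_valid_index()),
        "n_missing": indexed.isna().sum(),
        "pct_missing": indexed.isna().mean() * 100,
    })
    # Variables with the most gaps come first
    return summary.sort_values("pct_missing", ascending=False)

summary = availability_summary(demo)
print(summary)
```

Sorting by `pct_missing` surfaces the covariates that most restrict the effective sample, and `first_valid` shows which series start later than the target variable.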

### Grouped Time Series Plots

To make the data easier to inspect, we group the variables into logical categories and plot them. This helps in visually identifying trends, seasonality, and potential co-movements.
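
Large groups are split into two parts before plotting so that each figure stays readable. The project's `split_group` helper is not shown in this notebook, so the sketch below is only a hypothetical stand-in (named `split_in_half` to avoid confusion) that assumes the helper simply halves the column list.

```python
def split_in_half(columns):
    """Split a list of column names into two halves.

    Hypothetical stand-in for helpers.split_group; the real helper may differ.
    For an odd number of columns, the first half gets the extra element.
    """
    mid = (len(columns) + 1) // 2
    return columns[:mid], columns[mid:]

# Example with nine placeholder column names
cols = [f"var{i}" for i in range(9)]
part1, part2 = split_in_half(cols)
print(len(part1), len(part2))
```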

In [None]:
# --- Define Variable Groups Once ---
VARIABLE_GROUPS = {
    "Monetary Stocks": [col for col in df_final.columns if col.lower() in ["m1sl", "m2sl", "m2real"]],
    "Credit": [col for col in df_final.columns if col.lower() in ["busloans", "realln", "conspi", "invest", "nonrevsl", "dtcolnvhfnm", "dtcthfnm"]],
    "Interest Rates": [col for col in df_final.columns if col.lower() in ["fedfunds", "tb3ms", "tb6ms", "gs1", "gs5", "gs10"]],
    "Price Indices": [col for col in df_final.columns if col.lower() in ["ppicmm", "oilpricex", "cpiappsl", "cpitrnsl", "cpimedsl", "cusr0000sac", "cusr0000sad", "cusr0000sas", "pcepi"]]
}

# --- Plot Time Series for Each Group from the Raw Data ---
for group_name, columns in VARIABLE_GROUPS.items():
    # Split group if it's too large to plot clearly
    if len(columns) > 8:
        group1, group2 = split_group(columns)
        plot_time_series(group1, f"{group_name} - Part 1", df_final)
        plot_time_series(group2, f"{group_name} - Part 2", df_final)
    else:
        plot_time_series(columns, group_name, df_final)

### Transformation and Winsorization Comparison

The models are trained on transformed (differenced) and winsorized data. Here, we plot each transformed series next to its winsorized counterpart to inspect what winsorization changes. This helps confirm that the transformations have stabilized the series and that extreme outliers are capped as expected.
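
For intuition, the two preparation steps can be sketched on a toy series. This is a minimal illustration, not the project's pipeline: it assumes first differencing as the transformation and quantile clipping as the winsorization rule. The function name `winsorize` and the 10%/90% cutoffs are assumptions chosen so the effect is visible on a short series.

```python
import pandas as pd

def winsorize(series, lower=0.01, upper=0.99):
    """Clip values outside the [lower, upper] quantile range to the quantile bounds."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)

# Toy level series with one spike (hypothetical values)
levels = pd.Series([100.0, 101.0, 102.0, 150.0, 104.0, 105.0, 106.0, 107.0])

# Step 1: first difference (the spike becomes one large positive and one large negative change)
transformed = levels.diff().dropna()

# Step 2: winsorize the differenced series (wide cutoffs, for illustration only)
winsorized = winsorize(transformed, lower=0.10, upper=0.90)

print(pd.DataFrame({"transformed": transformed, "winsorized": winsorized}))
```

Winsorization keeps every observation and only caps the tails, so the series length and the central values are unchanged while the extremes shrink.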

In [None]:
# --- Create Side-by-Side Plots for Transformed vs. Winsorized Data ---
# Columns in the transformed and winsorized dataframes carry a "_t" suffix
for group_name, columns in VARIABLE_GROUPS.items():
    transformed_cols = [f"{col}_t" for col in columns]
    # Split group if it's too large
    if len(transformed_cols) > 8:
        group1, group2 = split_group(transformed_cols)
        plot_group_side_by_side(group1, f"{group_name} - Part 1", df_transformed, df_winsorized)
        plot_group_side_by_side(group2, f"{group_name} - Part 2", df_transformed, df_winsorized)
    else:
        plot_group_side_by_side(transformed_cols, group_name, df_transformed, df_winsorized)