# EDA - Week 4: Feature Analysis and Model Insights

### Overview
This notebook extends our exploratory data analysis with a focus on feature importance and correlation analysis for our economic forecasting models. We analyze three key economic indicators: CPI (Consumer Price Index), GDP (Gross Domestic Product), and FEDFUNDS (Federal Funds Rate).

### Objectives
- Analyze feature correlations to understand relationships between economic indicators
- Identify the most influential features for each target variable
- Generate visualizations to communicate findings effectively

### Contents
1. **Data Loading**: Load and preprocess the feature datasets
2. **Correlation Analysis**: Generate heatmaps to visualize feature correlations
3. **Feature Importance**: Train Random Forest models to estimate each feature's importance
4. **Visualizations**: Create plots for analysis (feature importance and correlation heatmaps)
5. **Saved Plots**: Plots saved to `docs/figures/week4/`
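
As a quick illustration of what the correlation step looks for, the sketch below ranks features by their absolute correlation with a target. The data is synthetic and the column names (`target`, `strong_feature`, `noise_feature`) are made up for this example; they are not columns from our feature tables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): one feature drives the target,
# the other is pure noise.
rng = np.random.default_rng(0)
n = 200
driver = rng.normal(size=n)
toy = pd.DataFrame({
    "target": driver + 0.1 * rng.normal(size=n),
    "strong_feature": driver,
    "noise_feature": rng.normal(size=n),
})

# Correlation of every feature with the target, ranked by absolute value.
corr_with_target = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

The heatmaps produced later show the full pairwise matrix; this ranking is just the single row or column belonging to the target.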

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import os

# Set plotting style
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

# Create output directory for week4 figures
output_dir = '../docs/figures/week4'
os.makedirs(output_dir, exist_ok=True)

In [96]:
def load_target_data(target):
    """Load the feature data for a specific target."""
    file_path = f'../data/features_{target.lower()}.parquet'
    df = pd.read_parquet(file_path)
    
    # Drop datetime columns (naive and timezone-aware) so they are not
    # passed to the correlation and model steps
    datetime_cols = df.select_dtypes(include=['datetime', 'datetimetz']).columns
    df = df.drop(columns=datetime_cols)
    
    return df
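
The datetime-dropping step in `load_target_data` can be checked on a small in-memory frame. This is a minimal sketch: the column names `date` and `lag_1` are hypothetical and only stand in for whatever the parquet files actually contain.

```python
import pandas as pd

# Toy frame (assumption): one datetime column plus two numeric columns.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=3, freq="D"),
    "CPI": [1.0, 2.0, 3.0],
    "lag_1": [0.5, 1.0, 2.0],
})

# Same idea as in load_target_data: find datetime columns and drop them.
datetime_cols = toy.select_dtypes(include=["datetime"]).columns
cleaned = toy.drop(columns=datetime_cols)

print(list(datetime_cols))
print(list(cleaned.columns))
```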

In [97]:
def plot_correlation_heatmap(df, target_name):
    """Plot correlation heatmap in the style of eda_targets.ipynb"""
    # Calculate correlations
    corr = df.corr()
    
    # Create figure
    plt.figure(figsize=(10, 8))
    
    # Create heatmap
    sns.heatmap(
        corr,
        cmap='coolwarm',
        center=0,
        square=True,
        linewidths=0.5,
        annot=True,
        fmt='.2f',
    )
    
    # Customize plot
    plt.title(f'Correlation Heatmap - {target_name}', pad=15, fontsize=14)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    
    # Save figure
    plt.savefig(f'{output_dir}/correlation_heatmap_{target_name}.png', dpi=300)
    plt.close()

def plot_feature_importance(X, y, target_name):
    """Train a Random Forest model and plot feature importance."""
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
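    # Split without shuffling to keep the original row order; with
    # time-ordered data the model trains on the earliest 80% of rows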
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=False
    )
    
    # Train the model
    print(f"Training Random Forest model for {target_name}...")
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Get feature importances
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    
    # Plot the feature importances, most important first
    plt.figure(figsize=(12, 8))
    plt.title(f'Feature Importances - {target_name}')
    plt.bar(range(X.shape[1]), importances[indices], align='center')
    plt.xticks(range(X.shape[1]), [X.columns[i] for i in indices], rotation=90)
    plt.ylabel('Importance')
    plt.tight_layout()
    
    # Save the figure
    plt.savefig(f'{output_dir}/feature_importance_{target_name}.png', dpi=300)
    plt.close()
    
    return model
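
To make the split-and-rank logic in `plot_feature_importance` concrete, here is a minimal sketch on synthetic data. The feature names (`signal`, `noise_a`, `noise_b`) and the data-generating rule are assumptions for illustration only. It also scores the model on the held-out rows, which the function above does not do.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target depends only on "signal".
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3.0 * X["signal"] + 0.1 * rng.normal(size=n)

# shuffle=False keeps row order, so the test set is the final 20% of rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, labeled and sorted from most to least important.
ranking = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(ranking)
print("held-out R^2:", model.score(X_test, y_test))
```

Because the rows are not shuffled, every test index comes after every training index, which is the property we rely on when the rows are ordered in time.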

In [98]:
# Main execution
if __name__ == "__main__":
    targets = ['CPI', 'GDP', 'FEDFUNDS']
    
    for target in targets:
        try:
            print(f"Processing {target}...")
            
            # Load the data
            df = load_target_data(target)
            
            # Prepare features and target
            X = df.drop(columns=[target])
            y = df[target]
            
            # Generate visualizations
            print("\nGenerating correlation heatmap...")
            plot_correlation_heatmap(df, target)
            
            print("\nGenerating feature importance plot...")
            plot_feature_importance(X, y, target)
            
            print(f"\nCompleted processing for {target}")
            
        except Exception as e:
            print(f"Error processing {target}: {e}")
            import traceback
            traceback.print_exc()
    
    print("\nAll visualizations have been generated and saved to:", output_dir)

Processing CPI...

Generating correlation heatmap...

Generating feature importance plot...
Training Random Forest model for CPI...

Completed processing for CPI
Processing GDP...

Generating correlation heatmap...

Generating feature importance plot...
Training Random Forest model for GDP...

Completed processing for GDP
Processing FEDFUNDS...

Generating correlation heatmap...

Generating feature importance plot...
Training Random Forest model for FEDFUNDS...

Completed processing for FEDFUNDS

All visualizations have been generated and saved to: ../docs/figures/week4
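
The final message is printed even if a target raised an exception, so it can be worth confirming that the figures exist. The sketch below rebuilds the expected file names from the `savefig` patterns used above and reports any that are missing. It assumes the same relative `output_dir` as the notebook.

```python
from pathlib import Path

# Same output location and targets as in the cells above.
output_dir = Path("../docs/figures/week4")
targets = ["CPI", "GDP", "FEDFUNDS"]

# One correlation heatmap and one feature importance plot per target.
expected = [
    output_dir / f"{kind}_{target}.png"
    for target in targets
    for kind in ("correlation_heatmap", "feature_importance")
]

missing = [path for path in expected if not path.exists()]
print(f"{len(expected) - len(missing)} of {len(expected)} figures found")
for path in missing:
    print("missing:", path)
```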
