# Employment Prediction Model

This notebook builds a linear regression model to predict next week's employment levels based on spending patterns.

## Overview
- **Objective**: Predict employment (`emp_next_week`) from spending features
- **Approach**: Linear regression with lagged spending features
- **Data**: Affinity (daily spending) and Employment (weekly) datasets
- **Train/Test Split**: 2021 and earlier for training, 2022+ for testing

## 1. Import Libraries

In [None]:
"""
Import all necessary libraries for data processing, modeling, and utilities.
"""

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

import joblib
import os
import logging
from datetime import datetime
import json
import yaml
from pathlib import Path
import time

## 2. Load and Prepare Data

### Load Raw Data
Load Affinity (daily spending) and Employment (weekly) datasets from CSV files.

In [None]:
# Load raw data from CSV files
affinity_df = pd.read_csv('data/Affinity - State - Daily.csv')
employment_df = pd.read_csv('data/Employment - State - Weekly.csv')

print(f"Affinity data shape: {affinity_df.shape}")
print(f"Employment data shape: {employment_df.shape}")

### Affinity Data Cleaning

Convert daily spending data to weekly aggregates by:
1. Creating proper datetime column from year/month/day
2. Grouping by week ending on Friday
3. Computing mean spending per state per week

In [None]:
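# Convert year, month, day columns to datetime
affinity_df['date'] = pd.to_datetime(affinity_df[['year', 'month', 'day']])

# Coerce spend_all to numeric before aggregating: if the raw column holds
# placeholders such as '.', pandas reads it as object dtype and
# mean(numeric_only=True) below would silently drop it
affinity_df['spend_all'] = pd.to_numeric(affinity_df['spend_all'], errors='coerce')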

# Convert daily data to weekly by grouping by week ending on Friday
# and calculating mean spending per state
affinity_df['week_end'] = (
    affinity_df['date']
    .dt.to_period('W-FRI')
    .dt.end_time
    .dt.normalize()
)
aff_weekly = (
    affinity_df
    .groupby(['statefips', 'week_end'])
    .mean(numeric_only=True)
    .reset_index()
)

print(f"Weekly affinity data shape: {aff_weekly.shape}")
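To make the week-ending-Friday grouping concrete, here is a minimal sketch on three made-up dates (they are not taken from the Affinity file). It applies the same `to_period('W-FRI')` chain used above. A Thursday and the Friday after it share a `week_end`, while the Saturday rolls into the following week.

```python
import pandas as pd

# Toy dates: a Thursday, a Friday, and the following Saturday
dates = pd.Series(pd.to_datetime(['2021-01-07', '2021-01-08', '2021-01-09']))

# Same chain as in the cleaning cell: weekly period anchored on Friday,
# then the period's end timestamp truncated to midnight
week_end = dates.dt.to_period('W-FRI').dt.end_time.dt.normalize()

for d, w in zip(dates, week_end):
    print(d.date(), d.day_name(), '->', w.date(), w.day_name())
```

Each `W-FRI` period runs from Saturday through Friday, so every daily row is assigned to the Friday that closes its week.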

### Employment Data Cleaning

Convert employment data types and create datetime column to align with affinity data.

In [None]:
# Convert year, month, day_endofweek to integer type
for col in ['year', 'month', 'day_endofweek']:
    employment_df[col] = employment_df[col].astype('int64')

# Create datetime column from year, month, day components
employment_df['date'] = pd.to_datetime({
    'year': employment_df['year'],
    'month': employment_df['month'],
    'day': employment_df['day_endofweek']
})

print(f"Employment data date range: {employment_df['date'].min()} to {employment_df['date'].max()}")

### Merge Datasets

Align the employment and affinity data on state and date for feature engineering.

In [None]:
# Merge employment and affinity data on state and date
merged = employment_df.merge(
    aff_weekly, 
    left_on=['statefips', 'date'], 
    right_on=['statefips', 'week_end'], 
    how='left'
)

# Drop the redundant week_end column
merged = merged.drop(columns=['week_end'])

print(f"Merged data shape: {merged.shape}")
print(f"Date range: {merged['date'].min()} to {merged['date'].max()}")
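Because the merge is a left join on exact dates, any employment week that does not end on a Friday would keep its row but receive `NaN` spending. The sketch below shows one way to measure this with `indicator=True`. The two toy frames are hypothetical stand-ins for `employment_df` and `aff_weekly`, with made-up values.

```python
import pandas as pd

# Toy stand-ins for employment_df and aff_weekly (values are made up)
emp_toy = pd.DataFrame({
    'statefips': [1, 1],
    'date': pd.to_datetime(['2021-01-08', '2021-01-14']),  # a Friday, a Thursday
    'emp': [0.01, 0.02],
})
aff_toy = pd.DataFrame({
    'statefips': [1, 1],
    'week_end': pd.to_datetime(['2021-01-08', '2021-01-15']),  # both Fridays
    'spend_all': [0.10, 0.12],
})

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for employment rows with no spending match
check = emp_toy.merge(
    aff_toy,
    left_on=['statefips', 'date'],
    right_on=['statefips', 'week_end'],
    how='left',
    indicator=True,
)

match_rate = (check['_merge'] == 'both').mean()
print(f"Matched rows: {match_rate:.0%}")

# Weekdays of the unmatched employment dates hint at a misaligned week anchor
unmatched = check.loc[check['_merge'] == 'left_only', 'date']
print(unmatched.dt.day_name().value_counts())
```

A low match rate on the real data would suggest the two datasets close their weeks on different days, in which case the `W-FRI` anchor would need to change.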

## 3. Feature Engineering

### Process Spending Data

Convert spending to numeric format and create lagged features for the model.

In [None]:
# Convert spend_all to numeric format (handles non-numeric values like '.')
merged['spend_all'] = pd.to_numeric(merged['spend_all'].replace('.', pd.NA), errors='coerce')

# Check for missing values
missing_count = merged['spend_all'].isna().sum()
print(f"Missing values in spend_all: {missing_count}")

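# Sort by state and date so that shifts step through consecutive weeks
merged = merged.sort_values(['statefips', 'date']).reset_index(drop=True)

# Create lagged features (previous week and 3 weeks ago spending)
merged['spend_all_lag_1'] = merged.groupby('statefips')['spend_all'].shift(1)
merged['spend_all_lag_3'] = merged.groupby('statefips')['spend_all'].shift(3)

# Create target variable (employment for next week)
# Convert emp to numeric first, then shift within each state so the last
# week of one state never picks up the first week of the next
merged['emp'] = pd.to_numeric(merged['emp'], errors='coerce')
merged['emp_next_week'] = merged.groupby('statefips')['emp'].shift(-1)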
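The lag and target columns are only meaningful if each shift stays inside a single state and the rows are in date order. The following sketch uses a hypothetical two-state frame with made-up values to compare a grouped shift with an ungrouped one.

```python
import pandas as pd

# Toy frame: two states, three consecutive Fridays each (values are made up)
toy = pd.DataFrame({
    'statefips': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2021-01-08', '2021-01-15', '2021-01-22'] * 2),
    'emp': [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})
toy = toy.sort_values(['statefips', 'date']).reset_index(drop=True)

# Grouped shift: the last week of each state has no next-week target
toy['emp_next_week'] = toy.groupby('statefips')['emp'].shift(-1)

# Ungrouped shift, for comparison only: the last row of state 1
# picks up the first value of state 2
toy['emp_next_ungrouped'] = toy['emp'].shift(-1)

print(toy)
```

In the grouped column the final row of each state is `NaN` and is dropped by later filtering. In the ungrouped column that row carries a value from a different state. Both versions shift by rows, not calendar weeks, so they also assume no weeks are missing within a state.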

### Data Validation and Cleaning

Remove rows with missing values in any feature or in the target so that the model trains only on complete data.

In [None]:
# Define feature columns for clarity
feature_cols = ['spend_all', 'spend_all_lag_1', 'spend_all_lag_3']

# Remove rows where ANY feature or target is NaN
# This ensures we only train on complete data
rows_before = len(merged)
valid_rows = merged[feature_cols + ['emp_next_week']].notna().all(axis=1)
merged = merged[valid_rows]

rows_removed = rows_before - len(merged)
print(f"Rows before filtering: {rows_before}")
print(f"Rows after filtering: {len(merged)}")
print(f"Rows removed due to missing values: {rows_removed} ({rows_removed/rows_before*100:.1f}%)")
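When the share of removed rows is large, it can help to see which column accounts for the loss. This is a minimal sketch on a hypothetical four-row frame that reuses the notebook's column names. The values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with the same feature and target names (values are made up)
toy = pd.DataFrame({
    'spend_all':       [0.1, np.nan, 0.3, 0.4],
    'spend_all_lag_1': [np.nan, 0.1, np.nan, 0.3],
    'spend_all_lag_3': [np.nan, np.nan, np.nan, 0.1],
    'emp_next_week':   [0.01, 0.02, 0.03, np.nan],
})

# Missing values per column, largest first
missing_by_col = toy.isna().sum().sort_values(ascending=False)
print(missing_by_col)

# Rows that would survive the all-columns filter
complete_rows = int(toy.notna().all(axis=1).sum())
print(f"Complete rows: {complete_rows} of {len(toy)}")
```

Some loss is expected by construction: a 3-week lag leaves the first three weeks of every state empty, and the next-week target leaves the last week empty.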

## 4. Train/Test Split

Split data into training (≤2021) and test (>2021) periods to simulate production performance.

In [None]:
"""
Split data based on temporal cutoff: 2021 and earlier for training, 2022+ for testing.
This simulates real-world scenarios where we predict future data.
"""

print(f"\nDate range in data: {merged['date'].min()} to {merged['date'].max()}")

# Split by year cutoff
train = merged[merged['date'] <= '2021-12-31']
test = merged[merged['date'] > '2021-12-31']

print(f"Train period: {train['date'].min()} to {train['date'].max()}")
print(f"Train samples: {len(train)}")
print(f"Test period: {test['date'].min() if len(test) > 0 else 'N/A'} to {test['date'].max() if len(test) > 0 else 'N/A'}")
print(f"Test samples: {len(test)}")

## 5. Prepare Features for Modeling

Extract feature and target matrices, with safety checks to ensure no missing values.

In [None]:
# Extract feature and target matrices
X_train = train[feature_cols]
y_train = train['emp_next_week']

X_test = test[feature_cols]
y_test = test['emp_next_week']

# Safety check: remove any remaining NaN rows (should be none after earlier filtering)
train_mask = X_train.notna().all(axis=1) & y_train.notna()
test_mask = X_test.notna().all(axis=1) & y_test.notna()

X_train = X_train[train_mask]
y_train = y_train[train_mask]
X_test = X_test[test_mask]
y_test = y_test[test_mask]

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
print(f"Target stats - Mean: {y_train.mean():.6f}, Std: {y_train.std():.6f}, Range: [{y_train.min():.6f}, {y_train.max():.6f}]")

## 6. Train Linear Regression Model

Fit the model on the training data and evaluate its performance on the test set.

In [None]:
# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on test set
y_pred = model.predict(X_test)

# Calculate mean absolute error
mae = mean_absolute_error(y_test, y_pred)

print(f'\nModel Performance:')
print(f'Mean Absolute Error (MAE): {mae:.6f}')
if y_test.mean() != 0:
    print(f'MAE as % of mean target: {mae / abs(y_test.mean()):.2%}')
else:
    print('MAE as % of mean target: N/A (mean is 0)')
print(f'MAE as % of std target: {mae / y_test.std():.2%}')

# Display model coefficients
print(f'\nModel Coefficients:')
for col, coef in zip(feature_cols, model.coef_):
    print(f'  {col}: {coef:.6f}')
print(f'Intercept: {model.intercept_:.6f}')

# Calculate baseline performance for comparison
# (predict the training-set mean for every test row)
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)
improvement = ((baseline_mae - mae) / baseline_mae * 100) if baseline_mae > 0 else 0

print(f'\nBaseline Comparison:')
print(f'Baseline (mean) MAE: {baseline_mae:.6f}')
print(f'Model improvement over baseline: {improvement:.1f}%')
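The baseline comparison reduces to a few lines of arithmetic. This sketch works through it on four made-up values using only NumPy. The predictions and the constant 2.0 standing in for the training-set mean are hypothetical, not results from this notebook.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])    # made-up targets
y_model = np.array([1.5, 2.0, 2.5, 4.0])   # hypothetical model predictions
y_base = np.full(len(y_true), 2.0)         # hypothetical training-set mean

# MAE is the mean of absolute errors
mae_model = np.mean(np.abs(y_true - y_model))  # (0.5 + 0 + 0.5 + 0) / 4
mae_base = np.mean(np.abs(y_true - y_base))    # (1 + 0 + 1 + 2) / 4

# Relative reduction in MAE versus the constant baseline
improvement = (mae_base - mae_model) / mae_base * 100

print(f"Model MAE: {mae_model:.2f}")
print(f"Baseline MAE: {mae_base:.2f}")
print(f"Improvement: {improvement:.1f}%")
```

A negative improvement would mean the model does worse than always predicting the training mean.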