# 1. Modeling Strategy & Implementation

## 1.1 Model Selection Strategy

### Target (y)
- **unemployment_rate** (county-level continuous variable)

### Features (X)
- `foreign_born_pct`
- `sanctuary_status` (1 = sanctuary, 0 = non-sanctuary)
- (optional) `poverty_rate`, `median_income`, etc.

### Why Regression?
The target, `unemployment_rate`, is a continuous numeric variable, so this is a **regression** problem rather than a classification problem.

### Candidate Models
1. **Linear Regression**  
   - Baseline model  
   - Highly interpretable  
   - Good for understanding overall relationships  

2. **Random Forest Regressor**  
   - Captures interactions & non-linear patterns  
   - Handles mixed feature types  
   - Often outperforms linear models when relationships are non-linear  

3. **Gradient Boosting Regressor (Optional)**  
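   - Builds trees sequentially, each one correcting the errors of the previous ones  
   - Often the strongest of the three on tabular data, but more sensitive to tuning  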

### Tradeoffs
- Linear Regression → Simple, fast, interpretable  
- Random Forest → Often more accurate, less interpretable  
- Gradient Boosting → Often the most accurate, highest complexity  
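
The interpretability tradeoff above can be illustrated with a minimal sketch. It assumes a synthetic stand-in for the county data (the values, the coefficients used to generate them, and the `*_demo` names are hypothetical, not taken from `BLS_clean.csv`). Linear regression yields signed coefficients, while a random forest yields only unsigned importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the county data (illustrative values only)
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "foreign_born_pct": rng.uniform(0, 40, n),
    "sanctuary_status": rng.integers(0, 2, n),
})
y_demo = (
    5.0
    + 0.05 * X_demo["foreign_born_pct"]
    - 0.5 * X_demo["sanctuary_status"]
    + rng.normal(0, 0.5, n)
)

# Linear regression: one signed coefficient per feature
# (change in y per one-unit change in that feature)
linreg_demo = LinearRegression().fit(X_demo, y_demo)
coefs = pd.Series(linreg_demo.coef_, index=X_demo.columns)

# Random forest: unsigned importances that sum to 1, with no direction of effect
rf_demo = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_demo, y_demo)
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)

print("Linear coefficients:\n", coefs)
print("Forest importances:\n", importances)
```

The coefficients can be read directly as "a one-point change in the feature shifts the predicted rate by this much", which the importances cannot express.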

### Why these models?
These models are widely used in policy and economics modeling and are appropriate for mixed demographic + policy datasets.  




## 1.2 Hyperparameter & Design Decisions

### Linear Regression
- Uses default scikit-learn settings (`fit_intercept=True`)
- Ordinary least squares has no regularization hyperparameters to tune

### Random Forest Regressor
Key hyperparameters:
- `n_estimators = 200`
- `max_depth = None` (let each tree grow fully)
- `min_samples_split = 2`

Reasoning:
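- `max_depth` and `min_samples_split` are left at their scikit-learn defaults, which is sufficient for initial model testing  
- `n_estimators` is raised from the default of 100 to 200 so that predictions are more stable  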
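
One way to check whether the larger forest actually helps is to compare out-of-bag (OOB) scores for 100 and 200 trees. The sketch below assumes synthetic data in place of the real dataset; `X_demo`, `y_demo`, and `oob_scores` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 40, size=(300, 2))
y_demo = 5 + 0.1 * X_demo[:, 0] + rng.normal(0, 0.5, 300)

oob_scores = {}
for n_trees in (100, 200):
    rf_demo = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=None,        # grow each tree fully
        min_samples_split=2,   # scikit-learn default
        oob_score=True,        # score each sample using only trees that did not see it
        random_state=42,
    )
    rf_demo.fit(X_demo, y_demo)
    oob_scores[n_trees] = rf_demo.oob_score_
    print(f"n_estimators={n_trees}: OOB R² = {rf_demo.oob_score_:.3f}")
```

If the two scores are nearly identical, the extra trees mainly cost training time. A visible gap suggests the smaller forest was still noisy.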

### Gradient Boosting (optional)
- `n_estimators = 300`
- `learning_rate = 0.05`
- `max_depth = 3`

Reasoning:
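- A lower learning rate (0.05 vs. the default 0.1), paired with more trees, usually generalizes better  
- Shallow trees (`max_depth = 3`, the default) limit each tree's complexity and reduce overfitting  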
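
To see how the learning rate and tree count interact, test error can be tracked after each boosting stage with `staged_predict`. This is a minimal sketch on synthetic data (the data and the `*_demo` names are assumptions, not the project dataset).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a mild non-linearity (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 40, size=(400, 2))
y_demo = 5 + 0.05 * X_demo[:, 0] + np.sin(X_demo[:, 1] / 5) + rng.normal(0, 0.5, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gb_demo = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Test MSE after 1 tree, 2 trees, ..., 300 trees
test_mse = [mean_squared_error(y_te, pred) for pred in gb_demo.staged_predict(X_te)]
best_n = int(np.argmin(test_mse)) + 1
print(f"Lowest test MSE at {best_n} trees: {min(test_mse):.3f}")
```

If `best_n` falls well below 300, the later trees are not improving test error and `n_estimators` could be reduced.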


## 1.3 Data Splitting Strategy

### Train/Test Split
- `train_test_split(test_size=0.2, random_state=42)`
- 80% training, 20% testing

### Why?
- A conventional ratio for datasets of this size  
- Ensures enough data for training while keeping a representative test set  

### Cross-Validation
- Using **5-fold cross-validation** for Random Forest and Gradient Boosting  
- Provides more stable performance estimates  
- Helps detect overfitting that a single train/test split could hide  
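
The 5-fold procedure described above could look roughly like the following. This is a sketch under assumptions: the data is synthetic, and the shuffled `KFold` splitter and the `cv_results` dictionary are illustrative choices rather than the project's confirmed setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 40, size=(300, 2))
y_demo = 5 + 0.05 * X_demo[:, 0] + rng.normal(0, 0.5, 300)

# Shuffled 5-fold splitter, seeded so the folds are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42
    ),
}

cv_results = {}
for name, model in models.items():
    # One R² score per fold; the spread shows how stable the estimate is
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.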


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the cleaned county-level dataset
df = pd.read_csv("../data/BLS_clean.csv")

# Select features and target
X = df[["foreign_born_pct", "sanctuary_status"]]
y = df["unemployment_rate"]

# 80/20 train/test split, seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model
linreg = LinearRegression()
linreg.fit(X_train, y_train)
pred_lr = linreg.predict(X_test)

# Random Forest model
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

# Optional Gradient Boosting (random_state set so results are reproducible)
gb = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42
)
gb.fit(X_train, y_train)
pred_gb = gb.predict(X_test)

# Report R², MAE, and RMSE on the held-out test set
results = [("Linear Regression", pred_lr), ("Random Forest", pred_rf), ("Gradient Boosting", pred_gb)]
for name, pred in results:
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: R² = {r2_score(y_test, pred):.3f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```



## 1.2 Hyperparameter & Design Decisions

Before training the models, I identified the key settings that control how each algorithm learns. Some hyperparameters were left at their default values (due to the small dataset), while others were selected based on best practices in regression tasks.

### Linear Regression

- No major hyperparameters
- Chosen as a baseline because it is simple, fast, and highly interpretable
- No tuning required

### Random Forest Regressor

Key hyperparameters considered:
- `n_estimators` (number of trees): starting with 100
- `max_depth`: controls how deep each tree can split
- `min_samples_split`: minimum number of samples needed to split a node
- `random_state`: ensures reproducibility

Design choice:
- For Sprint 3, I will begin with default parameters plus a small test adjustment to `n_estimators` (e.g., 100 → 200)
- A full grid search (`GridSearchCV`) is possible in Sprint 4 but may be slow given the dataset size
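
For reference, a Sprint 4 grid search over the Random Forest hyperparameters listed above might be set up as follows. The grid values and the synthetic data are assumptions chosen to keep the sketch fast. They are not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 40, size=(200, 2))
y_demo = 5 + 0.05 * X_demo[:, 0] + rng.normal(0, 0.5, 200)

# Hypothetical grid: 2 x 2 x 2 = 8 combinations, each fit 5 times
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV R²: {search.best_score_:.3f}")
```

The cost grows multiplicatively with each added hyperparameter, which is why the grid here is deliberately small.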

### Gradient Boosting Regressor (optional)

Key hyperparameters:
- `learning_rate`
- `n_estimators`
- `max_depth`

Design choice:
- For Sprint 3, settings stay close to the defaults (`max_depth = 3` is the default; only `n_estimators` and `learning_rate` are adjusted) to avoid overfitting and because GBMs can be sensitive to tuning
- If performance improves meaningfully, deeper tuning can be done in Sprint 4
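
As a quick reference point, the scikit-learn defaults for the three hyperparameters above can be read from an untrained estimator. The `defaults` dictionary is an illustrative name.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Read the library defaults for the three key hyperparameters
params = GradientBoostingRegressor().get_params()
defaults = {k: params[k] for k in ("learning_rate", "n_estimators", "max_depth")}
print(defaults)
# {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 3}
```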
