# Notebook 4: Modeling & Optimization

## Purpose
This notebook focuses on building, training, and optimizing machine learning models to predict movie box office revenue. Multiple algorithms are compared to identify the best-performing model.

## Objectives
1. Prepare train/test split (time-based: 2010-2021 train, 2022-2024 test)
2. Create preprocessing pipelines (scaling, encoding)
3. Train baseline Linear Regression model
4. Train and tune Random Forest Regressor
5. Train and tune XGBoost/Gradient Boosting model
6. Train Ridge Regression for comparison
7. Perform hyperparameter tuning with cross-validation
8. Compare all models on consistent metrics
9. Select best model and save to file

## Success Criteria
- R-squared > 0.70 on test set
- Mean Absolute Error (MAE) < $25M
- Model generalizes well (small gap between train and test performance)

## Evaluation Metrics
- **R-squared**: Proportion of variance explained
- **MAE**: Mean Absolute Error (in millions of dollars)
- **RMSE**: Root Mean Squared Error (penalizes large errors)
- **Cross-validation scores**: 5-fold CV on training set
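
For reference, the three point metrics above follow the standard definitions below. The symbols are assumptions introduced here, not taken from the notebook: $y_i$ is the actual revenue of movie $i$, $\hat{y}_i$ the predicted revenue, $\bar{y}$ the mean actual revenue, and $n$ the number of movies in the evaluated set.

$$
\begin{aligned}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\end{aligned}
$$

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and the difference between the two grows when a few movies are badly mispredicted.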

## Outputs
- `models/best_model.pkl`
- Model comparison table
- Training/validation performance metrics

## Notes
- Use a time-based split to simulate a real prediction scenario
- Document all preprocessing decisions
- Track training time for each model
- Save hyperparameter configurations

---
## Setup and Imports

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import pickle

# Scikit-learn imports
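from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder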
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# XGBoost
import xgboost as xgb

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---
## Load Feature-Engineered Data

In [None]:
# Load dataset from previous notebook
# df = pd.read_csv('data/processed/movies_features.csv')
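
A possible way to make the load step fail early is to check for the columns that later cells rely on. This is a minimal sketch: the column names `release_year` and `revenue` and the helper `load_features` are assumptions for illustration, since the real schema is defined in the previous notebook.

```python
import pandas as pd

# Hypothetical column names; replace with the schema from the previous notebook.
REQUIRED_COLUMNS = ["release_year", "revenue"]


def load_features(path):
    """Read the feature table and raise if an expected column is missing."""
    df = pd.read_csv(path)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

In the notebook this would be called as `df = load_features('data/processed/movies_features.csv')`.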

---
## Train/Test Split

In [None]:
# Time-based split: 2010-2021 for training, 2022-2024 for testing
# This simulates a real-world prediction scenario
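
The split described above could look roughly like the following. The toy frame and its column names (`release_year`, `budget`, `revenue`) are assumptions standing in for the real feature table.

```python
import pandas as pd

# Toy frame standing in for the feature table; column names are assumptions.
df = pd.DataFrame({
    "release_year": [2010, 2015, 2019, 2021, 2022, 2023, 2024],
    "budget": [50, 80, 120, 30, 200, 90, 60],
    "revenue": [120, 150, 400, 45, 700, 210, 95],
})

# Series.between is inclusive on both ends by default.
train_mask = df["release_year"].between(2010, 2021)
test_mask = df["release_year"].between(2022, 2024)
train_df, test_df = df[train_mask], df[test_mask]

# Guard against leakage: every training row must predate every test row.
assert train_df["release_year"].max() < test_df["release_year"].min()

X_train, y_train = train_df.drop(columns="revenue"), train_df["revenue"]
X_test, y_test = test_df.drop(columns="revenue"), test_df["revenue"]
```

Splitting on the year rather than shuffling means the model is always evaluated on movies released after everything it was trained on.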

---
## Preprocessing Pipelines

In [None]:
# Create preprocessing pipelines for different model types
# - Linear models: StandardScaler + One-Hot Encoding
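# - Tree-based models: integer (ordinal) encoding of categorical features, e.g. OrdinalEncoder; no scaling needed
#   (scikit-learn's LabelEncoder is intended for target labels rather than feature columns)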
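
One way to set up the two pipelines is with `ColumnTransformer`. This is a sketch under assumptions: the column names (`budget`, `runtime`, `genre`) are hypothetical, and `OrdinalEncoder` is used in place of `LabelEncoder` for the tree-based branch because it operates on feature columns and can map unseen categories to a sentinel value.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical feature names, for illustration only.
numeric_cols = ["budget", "runtime"]
categorical_cols = ["genre"]

# Linear models: scale numeric features, one-hot encode categoricals.
linear_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Tree-based models: leave numeric features as-is, integer-encode categoricals.
tree_preprocessor = ColumnTransformer([
    ("num", "passthrough", numeric_cols),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_cols),
])

X_train = pd.DataFrame({
    "budget": [50.0, 80.0, 120.0, 30.0],
    "runtime": [95.0, 110.0, 140.0, 88.0],
    "genre": ["Action", "Comedy", "Drama", "Action"],
})
# "Horror" does not appear in the training data.
X_test = pd.DataFrame({
    "budget": [200.0, 60.0],
    "runtime": [150.0, 100.0],
    "genre": ["Horror", "Comedy"],
})

# Fit on the training set only, then reuse the fitted transformers on the test set.
X_train_lin = linear_preprocessor.fit_transform(X_train)
X_test_lin = linear_preprocessor.transform(X_test)
X_train_tree = tree_preprocessor.fit_transform(X_train)
X_test_tree = tree_preprocessor.transform(X_test)
```

Both encoders are configured to tolerate categories that first appear in the test years, which is likely with a time-based split.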

---
## Model 1: Linear Regression (Baseline)

In [None]:
# Train baseline Linear Regression
# Evaluate on train and test sets
# Analyze residuals
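
The baseline steps above might be sketched as follows. The data here is synthetic and the helper `evaluate` is a hypothetical name; in the notebook, `X_train`, `X_test`, `y_train`, and `y_test` would come from the split and preprocessing cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in data; not real box office figures.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 100 + X @ np.array([40.0, -15.0, 5.0]) + rng.normal(scale=10.0, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def evaluate(model, X, y):
    """Return R-squared, MAE, and RMSE for a fitted model on (X, y)."""
    pred = model.predict(X)
    return {
        "r2": r2_score(y, pred),
        "mae": mean_absolute_error(y, pred),
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
    }


baseline = LinearRegression().fit(X_train, y_train)
train_metrics = evaluate(baseline, X_train, y_train)
test_metrics = evaluate(baseline, X_test, y_test)

# Residuals (actual minus predicted) on the test set, for later plotting.
residuals = y_test - baseline.predict(X_test)

print("train:", train_metrics)
print("test: ", test_metrics)
```

Plotting `residuals` against the predictions is a common next step: a funnel shape would suggest the error grows with revenue.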

---
## Model 2: Random Forest Regressor

### Initial Training

In [None]:
# Train Random Forest with default parameters
# Get baseline performance

### Hyperparameter Tuning

In [None]:
# RandomizedSearchCV for hyperparameter tuning
# Parameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
# 5-fold cross-validation
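
A minimal sketch of this search, using the four parameters listed above. The candidate values, the number of sampled combinations (`n_iter=5`), and the scoring metric are assumptions chosen to keep the example fast; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training set.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(120, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate values are illustrative, not tuned recommendations.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                           # number of sampled combinations
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_  # refit on the full training set by default
print(search.best_params_)
print("CV MAE:", -search.best_score_)
```

`search.best_params_` is the dictionary to record when saving hyperparameter configurations.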

### Best Random Forest Model

In [None]:
# Train with best parameters
# Evaluate on test set

---
## Model 3: XGBoost

### Initial Training

In [None]:
# Train XGBoost with default parameters

### Hyperparameter Tuning

In [None]:
# RandomizedSearchCV for XGBoost
# Parameters: learning_rate, max_depth, n_estimators, subsample
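
The same search pattern applies to a boosted model. To keep this sketch runnable without the `xgboost` package, it uses scikit-learn's `GradientBoostingRegressor` as a stand-in; `xgb.XGBRegressor` accepts the same four parameter names and could be substituted for the estimator. The candidate values and data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training set.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(120, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate values are illustrative, not tuned recommendations.
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 5],
    "n_estimators": [50, 100, 200],
    "subsample": [0.6, 0.8, 1.0],
}

boost_search = RandomizedSearchCV(
    # Stand-in estimator; swap in xgb.XGBRegressor(random_state=42) if xgboost is installed.
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
boost_search.fit(X_train, y_train)

best_boost = boost_search.best_estimator_
print(boost_search.best_params_)
```

Lower learning rates generally need more estimators to reach the same fit, so these two parameters are usually tuned together.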

### Best XGBoost Model

In [None]:
# Train with best parameters
# Evaluate on test set

---
## Model 4: Ridge Regression

In [None]:
# Train Ridge Regression with alpha tuning
# Compare with Linear Regression baseline
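
Alpha tuning could be done with a grid search over a log-spaced range, as sketched below. The grid bounds, the use of `GridSearchCV`, and the synthetic data are assumptions; the notebook does not specify a tuning method for Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 5))
y_train = X_train @ np.array([5.0, -3.0, 0.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=100)

# Log-spaced candidates from 0.001 to 1000.
alphas = np.logspace(-3, 3, 13)

ridge_search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": alphas},  # make_pipeline names the step "ridge"
    cv=5,
    scoring="r2",
)
ridge_search.fit(X_train, y_train)
best_alpha = ridge_search.best_params_["ridge__alpha"]

# Same folds and metric for the unregularized baseline.
linear_cv_r2 = cross_val_score(
    make_pipeline(StandardScaler(), LinearRegression()),
    X_train, y_train, cv=5, scoring="r2",
).mean()

print("best alpha:", best_alpha)
print("Ridge CV R2:", ridge_search.best_score_, "| Linear CV R2:", linear_cv_r2)
```

If the selected alpha sits at the low end of the grid and the two scores are nearly equal, regularization is adding little over the baseline.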

---
## Model Comparison

In [None]:
# Create comparison table with all models
# Metrics: R2 (train), R2 (test), MAE, RMSE, Training Time
# Visualize model comparison
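
A comparison table with the metrics listed above might be assembled like this. The models use untuned placeholder settings and the data is synthetic; in the notebook, the tuned estimators from earlier cells would take their place.

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in data; not real box office figures.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 100 + X @ np.array([40.0, -15.0, 5.0]) + rng.normal(scale=10.0, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Placeholder settings; substitute the tuned models.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    test_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "R2 (train)": r2_score(y_train, model.predict(X_train)),
        "R2 (test)": r2_score(y_test, test_pred),
        "MAE": mean_absolute_error(y_test, test_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, test_pred))),
        "Training Time (s)": train_seconds,
    })

comparison = (
    pd.DataFrame(rows)
    .set_index("Model")
    .sort_values("R2 (test)", ascending=False)
)
print(comparison.round(3))
```

With matplotlib available, `comparison["R2 (test)"].plot.barh()` gives a quick visual of the ranking. A large gap between the `R2 (train)` and `R2 (test)` columns points to overfitting.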

---
## Select Best Model

In [None]:
# Based on test set performance, select best model
# Document selection rationale
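
Selection can be made explicit by applying the success criteria (test R-squared > 0.70, MAE < $25M) before ranking. The helper `select_best` is a hypothetical name, and the table values below are placeholders to exercise the logic, not results from this project.

```python
import pandas as pd


def select_best(comparison, min_r2=0.70, max_mae=25.0):
    """Pick the model with the highest test R2 among those meeting both criteria.

    Falls back to the highest test R2 overall if no model passes.
    Returns (model_name, criteria_met).
    """
    passing = comparison[
        (comparison["R2 (test)"] > min_r2) & (comparison["MAE"] < max_mae)
    ]
    candidates = passing if not passing.empty else comparison
    return candidates["R2 (test)"].idxmax(), not passing.empty


# Placeholder numbers for illustration only (MAE in millions of dollars).
comparison = pd.DataFrame(
    {"R2 (test)": [0.65, 0.78, 0.81], "MAE": [30.0, 22.0, 27.0]},
    index=["Model A", "Model B", "Model C"],
)

best_name, criteria_met = select_best(comparison)
print(best_name, criteria_met)
```

In this placeholder table, Model C has the highest R-squared but misses the MAE threshold, so Model B is chosen. Recording that kind of trade-off is the selection rationale.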

---
## Save Best Model

In [None]:
# Save model to pickle file
# with open('models/best_model.pkl', 'wb') as f:
#     pickle.dump(best_model, f)
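
A slightly fuller version of the save step might create the output directory first and store the hyperparameters alongside the pickle. The helper `save_model` and the file name `best_model_params.json` are assumptions; the example writes to a temporary directory so it can run anywhere.

```python
import json
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import Ridge


def save_model(model, out_dir="models"):
    """Pickle the model and write its hyperparameters as JSON next to it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # avoid failing on a missing folder

    model_path = out_dir / "best_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Hypothetical file name; default=str covers values JSON cannot encode.
    params_path = out_dir / "best_model_params.json"
    params_path.write_text(json.dumps(model.get_params(), indent=2, default=str))
    return model_path, params_path


# Tiny stand-in model so the round trip can be checked.
best_model = Ridge(alpha=1.0).fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

with tempfile.TemporaryDirectory() as tmp:
    model_path, params_path = save_model(best_model, tmp)
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)
    saved_params = json.loads(params_path.read_text())

print(saved_params["alpha"])
```

In the notebook, calling `save_model(best_model)` with the default `out_dir` writes to `models/best_model.pkl`. A pickle should be loaded with the same scikit-learn version that created it.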

---
## Summary

In [None]:
# Document:
# - Best model and hyperparameters
# - Final test set performance
# - Whether success criteria were met
# - Key insights from model comparison