## Model Optimization and Refinement

### Introduction

In this notebook, we focus on optimizing and refining the models developed in previous steps. While the initial models provided a good baseline, further fine-tuning is required to improve their performance and ensure they generalize well to new data.

### Objectives:
- **Hyperparameter Tuning**: We will explore different sets of hyperparameters using methods such as grid search or random search to identify the best configurations for each model.
- **Cross-Validation**: To check that the models are robust and not overfitting, we will use k-fold cross-validation to evaluate their performance across multiple train/validation splits of the training data.
- **Model Comparison**: After optimization, we will compare the models on regression metrics, namely mean squared error (MSE), mean absolute error (MAE), and R², and select the best-performing model.
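
The tuning and cross-validation steps listed above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `RandomizedSearchCV` wrapped around a `RandomForestRegressor`. The synthetic data from `make_regression`, the names `X_demo` and `y_demo`, the parameter ranges, and the values `n_iter=5` and `cv=3` are placeholder assumptions chosen to keep the example fast, not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (assumption: the notebook itself loads pre-split data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=0.1, random_state=0
)

# Hypothetical search space; these ranges are illustrative only
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# Randomized search samples n_iter configurations and scores each with k-fold CV
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    random_state=0,
)
search.fit(X_demo, y_demo)

# Flip the sign to recover a positive MSE for the best configuration
best_cv_mse = -search.best_score_

# Cross-validate the selected configuration again as an independent check
cv_scores = cross_val_score(
    search.best_estimator_, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error"
)
refit_cv_mse = -cv_scores.mean()

print(f"Best params: {search.best_params_}")
print(f"Best CV MSE: {best_cv_mse:.4f}, re-checked CV MSE: {refit_cv_mse:.4f}")
```

Randomized search is used here instead of an exhaustive grid because it caps the number of model fits at `n_iter` times the number of folds, which keeps the cost predictable as the search space grows.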

In [16]:
# Imports
import warnings

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from tabulate import tabulate
from xgboost import XGBRegressor

warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [17]:
# Load the pre-split data
X_train = joblib.load('../outputs/X_train_encoded.joblib')
X_test = joblib.load('../outputs/X_test_encoded.joblib')
y_train = joblib.load('../outputs/y_train.joblib')
y_test = joblib.load('../outputs/y_test.joblib')

def evaluate_model(model, X, y):
    """Return (MSE, R2, MAE) for a fitted model's predictions on X against y."""
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    mae = mean_absolute_error(y, y_pred)
    return mse, r2, mae

# Initialize results tracking
results = []

In [18]:
# Load and evaluate baseline models
print("Loading and evaluating baseline models...")
baseline_models = {
    'RandomForest': joblib.load('../models/random_forest_model.joblib'),
    'XGBoost': joblib.load('../models/xgboost_model.joblib')
}

for name, model in baseline_models.items():
    train_mse, train_r2, train_mae = evaluate_model(model, X_train, y_train)
    test_mse, test_r2, test_mae = evaluate_model(model, X_test, y_test)
    
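    # 5-fold cross-validation on the training set. cross_val_score clones and
    # refits the model per fold, and returns negated MSE, so flip the sign.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_mse = -cv_scores.mean()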
    
    results.append({
        'Model': name,
        'Iteration': 'Baseline',
        'Train MSE': train_mse,
        'Train R2': train_r2,
        'Train MAE': train_mae,
        'Test MSE': test_mse,
        'Test R2': test_r2,
        'Test MAE': test_mae,
        'CV MSE': cv_mse
    })
    print(f"Baseline {name} - Test MSE: {test_mse:.4f}, Test R2: {test_r2:.4f}, Test MAE: {test_mae:.4f}, CV MSE: {cv_mse:.4f}")

Loading and evaluating baseline models...
Baseline RandomForest - Test MSE: 0.0018, Test R2: 0.9003, Test MAE: 0.0314, CV MSE: 0.0016
Baseline XGBoost - Test MSE: 0.0021, Test R2: 0.8804, Test MAE: 0.0344, CV MSE: 0.0018
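
MSE is expressed in squared units of the target, so taking its square root (RMSE) gives an error on the same scale as MAE. The snippet below is a small sketch that applies this to the test MSE values printed above. Those values are rounded to four decimals, so the resulting RMSE figures are approximate, and the names `reported_test_mse` and `rmse` are introduced here only for illustration.

```python
import math

# Test MSE values copied from the printed output above (rounded to 4 decimals)
reported_test_mse = {"RandomForest": 0.0018, "XGBoost": 0.0021}

# RMSE is the square root of MSE, which puts it in the target's own units
rmse = {name: math.sqrt(mse) for name, mse in reported_test_mse.items()}

for name, value in rmse.items():
    print(f"{name} - approximate Test RMSE: {value:.4f}")
```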
