## Model Optimization and Refinement

### Introduction

In this notebook, we focus on optimizing and refining the models developed in previous steps. While the initial models provided a good baseline, further fine-tuning is required to improve their performance and ensure they generalize well to new data.

### Objectives:
- **Hyperparameter Tuning**: We will explore different sets of hyperparameters using methods such as grid search or random search to identify the best configurations for each model.
- **Cross-Validation**: To ensure the models are robust and not overfitting, we will use cross-validation to evaluate their performance on multiple data subsets.
- **Model Comparison**: After optimization, we will compare the models on regression metrics such as mean squared error (MSE) and R², since both candidates are regressors, and select the best-performing model.
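
As a rough illustration of the cross-validation and comparison objectives above, the sketch below scores two regressors with 5-fold cross-validation. The synthetic data from `make_regression`, the use of `GradientBoostingRegressor` as a stand-in for a boosted model, and the names `candidates` and `cv_summary` are assumptions for illustration only, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: real features are numeric and already encoded)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Hypothetical candidates; small n_estimators keeps the demo fast
candidates = {
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, estimator in candidates.items():
    # Each model is fit and scored on 5 different train/validation splits
    scores = cross_validate(
        estimator, X_demo, y_demo, cv=5,
        scoring=('neg_root_mean_squared_error', 'r2'),
    )
    cv_summary[name] = {
        # scikit-learn negates error metrics so that higher is always better
        'rmse': -scores['test_neg_root_mean_squared_error'].mean(),
        'r2': scores['test_r2'].mean(),
    }

for name, metrics in cv_summary.items():
    print(f"{name}: RMSE={metrics['rmse']:.2f}, R2={metrics['r2']:.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, which is why it is a reasonable basis for comparing models before touching the held-out test set.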

In [1]:
# Imports
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Load the pre-split data
X_train = joblib.load('../data/X_train_encoded.joblib')
X_test = joblib.load('../data/X_test_encoded.joblib')
y_train = joblib.load('../outputs/y_train.joblib')
y_test = joblib.load('../outputs/y_test.joblib')

In [None]:
# Define models and their parameter grids
models = {
    'RandomForest': (RandomForestRegressor(random_state=42), {
        'n_estimators': [100, 200, 300, 400, 500],
        'max_depth': [10, 20, 30, 40, 50, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
        'max_features': [1.0, 'sqrt', 'log2']
    }),
    'XGBoost': (XGBRegressor(random_state=42), {
        'n_estimators': [100, 200, 300, 400, 500],
        'max_depth': [3, 4, 5, 6, 7],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'subsample': [0.8, 0.9, 1.0],
        'colsample_bytree': [0.8, 0.9, 1.0]
    })
}
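
One way the `models` dictionary above could be consumed is sketched below: a loop that runs `RandomizedSearchCV` over each `(estimator, param_grid)` pair. The helper name `tune_models`, the `n_iter` and `cv` values, the MSE scoring choice, and the small synthetic demo data are all assumptions made for illustration; the demo uses only a reduced RandomForest grid so that it runs quickly without xgboost installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_models(models, X, y, n_iter=5, cv=3, random_state=42):
    """Hypothetical helper: randomized search for each (estimator, grid) pair."""
    results = {}
    for name, (estimator, param_grid) in models.items():
        search = RandomizedSearchCV(
            estimator,
            param_distributions=param_grid,
            n_iter=n_iter,                      # number of sampled configurations
            cv=cv,                              # folds used to score each configuration
            scoring='neg_mean_squared_error',
            random_state=random_state,
        )
        search.fit(X, y)
        results[name] = {
            'best_params': search.best_params_,
            'best_cv_mse': -search.best_score_,  # flip sign back to a plain MSE
            'best_estimator': search.best_estimator_,
        }
    return results


# Tiny demo on synthetic data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)
demo_models = {
    'RandomForest': (RandomForestRegressor(random_state=42), {
        'n_estimators': [10, 20],
        'max_depth': [3, None],
        'min_samples_leaf': [1, 2],
    }),
}
demo_results = tune_models(demo_models, X_demo, y_demo, n_iter=3, cv=3)
print(demo_results['RandomForest']['best_params'])
```

With the full grids defined above, a call along the lines of `tune_models(models, X_train, y_train, n_iter=20, cv=5)` would sample 20 configurations per model; the right `n_iter` depends on the available compute budget.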