# Hyperparameter Tuning

With the best-performing model(s) identified, the next step is to optimize their performance through hyperparameter tuning and cross-validation. This process helps ensure that the model is as accurate and generalizable as possible.
## Approach to Hyperparameter Tuning

**Hyperparameter Research:**
Initial hyperparameter ranges were chosen by reviewing commonly recommended settings for each model type, which gave the search a sensible starting point.

**Tuning Methods:**
Two primary methods were considered for hyperparameter optimization:
- GridSearchCV: Systematically evaluates every combination in a predefined set of hyperparameter values.
- RandomizedSearchCV: Randomly samples a fixed number of combinations from the defined ranges, offering a faster alternative for larger search spaces.

**Cross-Validation:**
Both tuning methods use cross-validation to score each candidate. Validating across multiple data splits gives a more reliable estimate of generalization than a single split, and it reduces the risk of choosing hyperparameters that only suit one particular split.
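To make the difference between the two search methods concrete, here is a minimal sketch on synthetic data. The decision-tree estimator, the `demo_grid` values, and the `make_regression` data are illustrative assumptions, not the project's model or dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the real training set)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Small illustrative grid: 4 x 4 = 16 combinations
demo_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 2, 4, 8]}

# Grid search scores every combination with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), demo_grid, scoring="r2", cv=5)
grid.fit(X_demo, y_demo)

# Randomized search scores only n_iter sampled combinations
rand = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    demo_grid,
    n_iter=6,
    scoring="r2",
    cv=5,
    random_state=0,
)
rand.fit(X_demo, y_demo)

print(len(grid.cv_results_["params"]))  # 16 candidates evaluated
print(len(rand.cv_results_["params"]))  # 6 candidates evaluated
print(grid.best_params_, rand.best_params_)
```

The randomized search does less work, so it may miss the best combination that the full grid would find. That trade-off matters more as the grid grows.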

## Considerations for This Dataset

- Due to the complexity and size of the dataset, this approach can be computationally expensive, and an extensive grid search carries a risk of overfitting to the validation folds.
- For this iteration, scikit-learn's standard tuning methods were used to establish a basic, functional pipeline. More advanced and customized tuning strategies may be explored in future iterations.

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended that you complete it if you have time!**

But we have a problem: if we encoded city as a numerical value computed on the training data (such as the mean sale price in that city), we can't cross-validate with sklearn's built-in tools.
- The rows in each validation fold were part of the original calculation of the mean for that city, which means we're leaking information!
- While sklearn's built-in functions are extremely useful, sometimes it is necessary to do things ourselves.
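The leak is easiest to see on a toy example. The sketch below assumes hypothetical column names `city` and `sold_price` and a hand-picked split. It only illustrates the problem and is not the notebook's actual encoding code.

```python
import pandas as pd

# Toy "training set" (assumption: column names are hypothetical)
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "sold_price": [100.0, 200.0, 600.0, 50.0, 70.0, 90.0],
})

# One hand-picked train/validation split of that training set
train_part = toy.iloc[[0, 1, 3, 4]]
val_part = toy.iloc[[2, 5]]

# Leaky: city means computed on ALL rows, including the validation rows
leaky_means = toy.groupby("city")["sold_price"].mean()

# Safe: city means computed on the training part of the split only
safe_means = train_part.groupby("city")["sold_price"].mean()

comparison = pd.DataFrame({
    "actual": val_part["sold_price"],
    "leaky_encoding": val_part["city"].map(leaky_means),
    "safe_encoding": val_part["city"].map(safe_means),
})
print(comparison)
```

For city A, the leaky encoding is 300.0 because it already includes the validation row's own price of 600.0, while the safe encoding is 150.0. A model validated on the leaky feature has effectively seen part of the answer.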

You need to create two functions to replicate what `GridSearchCV` does under the hood. This is a challenging, real-world data problem! To help you out, we've created some pseudocode and docstrings to get you started.

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold

This pseudocode may help you fill in the function:

```python
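# body of custom_cross_validation(); X_train is the training dataframe
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)  # splitter; kfold.split() yields row positions in X_train
train_folds = []
val_folds = []
for training_index, val_index in kfold.split(X_train):
    # the indexes are positional, so select rows with .iloc
    train_fold = X_train.iloc[training_index].copy()
    val_fold = X_train.iloc[val_index].copy()

    # recompute the city means on train_fold only, like you did in notebook 1
    # merge those means onto val_fold (never compute them from val_fold itself)

    train_folds.append(train_fold)
    val_folds.append(val_fold)

# return after the loop has processed every split, not inside it
return train_folds, val_folds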
```


**`hyperparameter_search()`**
- Should take the training and validation splits from your previous function, along with your dictionary of hyperparameter values.
- For each combination of hyperparameter values, fit your chosen model on each training fold, score it on the matching validation fold, and take the average of your chosen scoring metric across folds. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values.
- Your function should output the hyperparameter values corresponding to the highest average score across all folds. Alternatively, it could output a model object fit on the full training dataset with these parameters.
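As a small illustration of the `itertools.product()` step, the sketch below expands the random forest grid from the docstring example further down into one dictionary per combination. The variable names `names` and `combos` are illustrative choices.

```python
import itertools

# Same grid as the docstring example for hyperparameter_search()
param_grid = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

names = list(param_grid)  # keeps the key order of the dictionary

# product() yields one tuple of values per combination; zip it back to the names
combos = [dict(zip(names, values)) for values in itertools.product(*param_grid.values())]

print(len(combos))  # 72, i.e. 4 * 3 * 3 * 2
print(combos[0])    # {'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}
```

Each dictionary in `combos` can be unpacked straight into a model constructor with `**`, which keeps the parameter names attached to their values.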


This pseudocode may help you fill in the function:

```python
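import itertools
import numpy as np

# every combination of the values in param_grid, one tuple per combination
hyperparams = list(itertools.product(*param_grid.values()))
hyperparam_scores = []
for hyperparam_combo in hyperparams:
    params = dict(zip(param_grid.keys(), hyperparam_combo))
    scores = []

    for train_fold, val_fold in zip(training_folds, validation_folds):
        # fit the model with `params` on train_fold and score it on val_fold;
        # score_fold is a placeholder for that step, not an existing function
        scores.append(score_fold(train_fold, val_fold, params))

    hyperparam_scores.append(np.mean(scores))

# the best combination sits at the same index as the highest average score
best_combo = hyperparams[int(np.argmax(hyperparam_scores))]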
```

Docstrings have been provided below to get you started. Once you're done developing your functions, move them to `functions_variables.py` to keep your notebook clean.

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [1]:
# develop your custom functions here

def custom_cross_validation(training_data, n_splits=5):
    '''creates n_splits sets of training and validation folds

    Args:
      training_data: the dataframe of features and target to be divided into folds
      n_splits: the number of sets of folds to be created

    Returns:
      A tuple of lists, where the first index is a list of the training folds, 
      and the second the corresponding validation fold

    Example:
        >>> output = custom_cross_validation(train_df, n_splits = 10)
        >>> output[0][0] # The first training fold
        >>> output[1][0] # The first validation fold
        >>> output[0][1] # The second training fold
        >>> output[1][1] # The second validation fold... etc.
    '''

    return training_folds, validation_folds

def hyperparameter_search(training_folds, validation_folds, param_grid):
    '''outputs the best combination of hyperparameter settings in the param grid, 
    given the training and validation folds

    Args:
      training_folds: the list of training fold dataframes
      validation_folds: the list of validation fold dataframes
      param_grid: the dictionary of possible hyperparameter values for the chosen model

    Returns:
      A list of the best hyperparameter settings based on the chosen metric

    Example:
        >>> param_grid = {
          'max_depth': [None, 10, 20, 30],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 4],
          'max_features': ['sqrt', 'log2']} # for random forest
        >>> hyperparameter_search(output[0], output[1], param_grid = param_grid) 
        # assuming 'output' is the output of custom_cross_validation()
        [20, 5, 2, 'log2'] # hyperparams in order
    '''

    return hyperparameters


## Hyperparameter Tuning with GridSearchCV

In [2]:
#Import preprocessed data
import pandas as pd
from functions import get_error_scores, display_results_sample, find_best_regression_model

# Independent variable (feature) training data
X_train = pd.read_csv("../data/processed/X_train_selected.csv")
X_train = X_train.drop(columns=["Unnamed: 0"])
print(f"X_train shape: {X_train.shape}")

#Target training data
y_train = pd.read_csv("../data/preprocessed/y_train.csv")
y_train = y_train.drop(columns=["Unnamed: 0"])
print(f"y_train shape: {y_train.shape}")

# Independent variable (feature) test data
X_test = pd.read_csv("../data/processed/X_test_selected.csv")
X_test = X_test.drop(columns=["Unnamed: 0"])
print(f"X_test shape: {X_test.shape}")

#Target test data
y_test = pd.read_csv("../data/preprocessed/y_test.csv")
y_test = y_test.drop(columns=["Unnamed: 0"])
print(f"y_test shape: {y_test.shape}")

X_train shape: (3381, 15)
y_train shape: (3381, 1)
X_test shape: (1450, 15)
y_test shape: (1450, 1)


In [3]:
from xgboost import XGBRegressor

# Create a new model instance
loaded_xg = XGBRegressor()

# Load the saved model
loaded_xg.load_model("../models/xgboost_model.json")

#Run model
y_train_pred = loaded_xg.predict(X_train)
y_test_pred = loaded_xg.predict(X_test)

# Check if it performs the same
from sklearn.metrics import r2_score
print("TRAIN R² Score (Loaded Model):", r2_score(y_train, y_train_pred))
print("TEST R² Score (Loaded Model):", r2_score(y_test, y_test_pred))


TRAIN R² Score (Loaded Model): 0.9886824488639832
TEST R² Score (Loaded Model): 0.9008165597915649


In [12]:
#Create grid for desired selection of hyperparameters
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees
    'max_depth': [3, 5, 7],  # Tree depth
    'learning_rate': [0.01, 0.05, 0.1],  # Step size shrinkage
    'subsample': [0.7, 0.8, 1.0],  # Fraction of samples per tree
    'colsample_bytree': [0.7, 0.8, 1.0],  # Fraction of features per tree
    'reg_alpha': [0, 0.1, 0.5],   # L1 regularization (to reduce overfitting)
    'reg_lambda': [0.1, 0.5, 1.0]  # L2 regularization
}


In [None]:
# GridSearchCV is commented out to avoid a roughly 10-minute run of 10,935 fits
# The best parameters found by the search are listed below

# # Initialize XGBoost model
# xg = XGBRegressor()

# # Perform Grid Search with 5-fold cross-validation
# grid_search = GridSearchCV(
#     estimator=xg,
#     param_grid=param_grid,
#     scoring='r2',  # Using R² as the evaluation metric
#     cv=5,  # 5-fold cross-validation
#     verbose=1,  # Show training process
#     n_jobs=-1  # Use all available CPU cores
# )

# # Fit GridSearchCV to the training data
# grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 2187 candidates, totalling 10935 fits
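The candidate count in that log line follows directly from the grid: seven hyperparameters with three values each. As a quick check, the sketch below rebuilds the same `param_grid` and counts the combinations with sklearn's `ParameterGrid`. The names `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
    "reg_alpha": [0, 0.1, 0.5],
    "reg_lambda": [0.1, 0.5, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 ** 7 combinations
n_folds = 5

print(n_candidates)            # 2187
print(n_candidates * n_folds)  # 10935
```

Adding a single extra value to any one hyperparameter would raise the count to 2,916 candidates (14,580 fits), which is why grid size deserves a check before launching a search.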


In [None]:
# Commented out: this cell extracts the best parameters from the grid search. Uncomment only when running a fresh search.

# # Print the best parameters found
# print("Best Parameters:", grid_search.best_params_)

# # Get the best model
# best_xg_model = grid_search.best_estimator_

# print(grid_search.best_estimator_)


Best Parameters: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300, 'reg_alpha': 0.5, 'reg_lambda': 0.5, 'subsample': 0.7}
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=300, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)


In [None]:
# Best hyperparameters from GridSearchCV
best_params = {
    'colsample_bytree': 0.7,
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 300,
    'reg_alpha': 0.5,
    'reg_lambda': 0.5,
    'subsample': 0.7
}

# Initialize and train the model with the best parameters
best_xg_model = XGBRegressor(**best_params)
best_xg_model.fit(X_train, y_train)


In [30]:
# Get predictions using the best model
y_train_pred_best = best_xg_model.predict(X_train)
y_test_pred_best = best_xg_model.predict(X_test)

# Check R² score
print("TRAIN R² Score (Best Model):", r2_score(y_train, y_train_pred_best))
print("TEST R² Score (Best Model):", r2_score(y_test, y_test_pred_best))

get_error_scores(y_train, y_train_pred_best, y_test, y_test_pred_best)

TRAIN R² Score (Best Model): 0.9966115951538086
TEST R² Score (Best Model): 0.9219232797622681
R SQUARED
	Train R²:	0.9966
	Test R²:	0.9219
MEAN AVERAGE ERROR
	Train MAE:	7535.09
	Test MAE:	32921.31
ROOT MEAN SQUARED ERROR
	Train RMSE:	10644.53
	Test RMSE:	51815.83

10 Randomly selected results.
Index: 742 	- 	Prediction: $740,569 	Actual: $619,000 	Difference: 121,569, -16.42%
Index: 176 	- 	Prediction: $109,355 	Actual: $110,000 	Difference: -645, 0.59%
Index: 358 	- 	Prediction: $433,974 	Actual: $485,000 	Difference: -51,026, 11.76%
Index: 281 	- 	Prediction: $488,667 	Actual: $450,000 	Difference: 38,667, -7.91%
Index: 1362 	- 	Prediction: $210,587 	Actual: $260,000 	Difference: -49,413, 23.46%
Index: 203 	- 	Prediction: $113,529 	Actual: $106,900 	Difference: 6,629, -5.84%
Index: 10 	- 	Prediction: $219,036 	Actual: $220,000 	Difference: -964, 0.44%
Index: 49 	- 	Prediction: $236,012 	Actual: $230,000 	Difference: 6,012, -2.55%
Index: 683 	- 	Prediction: $88,389 	Actual: $89,900 
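The metric summary above comes from the project's `get_error_scores` helper, whose implementation is not shown in this notebook. As a rough sketch of how the same three metrics can be computed with `sklearn.metrics`, assuming a hypothetical helper named `summarize_errors` (not the project's function) and made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def summarize_errors(y_true, y_pred):
    """Hypothetical helper: return R², MAE and RMSE for one set of predictions."""
    y_true = np.ravel(y_true)  # flatten (n, 1) targets such as a one-column dataframe
    y_pred = np.ravel(y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),  # mean absolute error
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }

# Made-up values purely to exercise the helper
demo = summarize_errors([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(demo)
```

MAE and RMSE are in the same units as the target (here, dollars), and RMSE penalizes large misses more heavily than MAE does.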

In [8]:
# Save the best model
best_xg_model.save_model("../models/xgboost_best_model.json")

## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended that you complete it if you have time!**

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
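- Beware that every step in a pipeline except the last must be an object with `fit` and `transform` methods; a plain function can't be used as a step directly.
- Custom classes can be used to get around this, and sklearn also provides a wrapper for user-defined functions (`FunctionTransformer`).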
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 
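As a minimal sketch of wrapping a user-defined function, the example below places a plain function inside a pipeline with `FunctionTransformer`. The function `add_log_sqft`, the columns `sqft` and `beds`, the toy prices, and the linear model are all illustrative assumptions, not the project's features or tuned model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_log_sqft(df):
    """Hypothetical feature step: add a log-transformed copy of the sqft column."""
    out = df.copy()
    out["log_sqft"] = np.log1p(out["sqft"])
    return out

pipe = Pipeline([
    ("features", FunctionTransformer(add_log_sqft)),  # wraps the plain function
    ("scale", StandardScaler()),                      # fitted on training data only
    ("model", LinearRegression()),                    # stand-in for the tuned model
])

# Toy data purely for illustration
toy_X = pd.DataFrame({"sqft": [800, 1200, 1500, 2000, 2600], "beds": [1, 2, 3, 3, 4]})
toy_y = [150_000, 210_000, 260_000, 330_000, 420_000]

pipe.fit(toy_X, toy_y)
preds = pipe.predict(toy_X)
print(preds.shape)
```

`FunctionTransformer` is stateless: it learns nothing during `fit`. A step that must remember values from the training data, such as per-city means, would still need a custom class with its own `fit` and `transform`.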

In [9]:
# Build pipeline here

Pipelines come from sklearn (`sklearn.pipeline.Pipeline`). When a pipeline is pickled, every fitted step is stored with it. For example, if we fit a scaler on the training data, we want that same, already fitted scaler to transform new data at deployment time, and pickling the pipeline preserves it.
- Save your final pipeline in your `models/` folder.
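The round trip can be sketched as follows. The data is random, the `Ridge` model is a stand-in, and the file is written to a temporary directory rather than `models/`. All of these are assumptions made so the example runs anywhere.

```python
import pathlib
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption: not the project's dataset)
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(40, 3))
y_fit = X_fit @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=40)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())]).fit(X_fit, y_fit)

# Save and reload the fitted pipeline; in the project this path would sit under models/
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "pipeline.pkl"
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler carries the statistics learned from X_fit, not from new data
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))
same_scaler = np.allclose(restored.named_steps["scale"].mean_, pipe.named_steps["scale"].mean_)
same_preds = np.allclose(restored.predict(X_new), pipe.predict(X_new))
print(same_scaler, same_preds)  # True True
```

Pickle stores custom functions and classes by reference, so any user-defined steps must be importable (for example from `functions_variables.py`) wherever the pipeline is loaded.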

In [10]:
# save your pipeline here