## Hyperparam Tuning

Now that we know which models perform better, it's time to cross-validate and tune hyperparameters.
- Do a Google search for sensible hyperparameter ranges for each type of model.

`GridSearchCV` and `RandomizedSearchCV` are great methods for handling both of these tasks at once.

There is a fairly significant issue with this approach for this particular problem (described below). But in the interest of creating a basic functional pipeline, you can just use the default sklearn methods for now.

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended you complete it if you have time!**

BUT we have a problem: if we calculated a numerical value to encode the city (such as the mean of sale prices in that city) on the full training data, we can't cross-validate with sklearn's built-in tools without leaking information.
- The rows in each validation fold were part of the original calculation of the mean for that city, which means we're leaking information!
- While sklearn's built-in functions are extremely useful, sometimes it is necessary to do things ourselves.
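
To make the leak concrete, here is a minimal sketch on toy data. The column names `city` and `sold_price` and all of the prices are made up for illustration; they are not taken from the project data.

```python
import pandas as pd

# Toy data; `city` and `sold_price` are illustrative column names
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "sold_price": [100, 200, 600, 50, 60, 70],
})

# Pretend the third row (city A, price 600) lands in a validation fold
val_idx = [2]
train_part = df.drop(index=val_idx)

# Leaky: the mean is computed on every row, including the validation row
leaky_mean_a = df.groupby("city")["sold_price"].mean()["A"]

# Leak-free: the mean is computed on the training part only
clean_mean_a = train_part.groupby("city")["sold_price"].mean()["A"]

print(leaky_mean_a)  # 300.0
print(clean_mean_a)  # 150.0
```

The leaky encoding already "knows" about the 600 sale that the model is later asked to predict, so the validation score looks better than it should.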

You need to create two functions to replicate what `GridSearchCV` does under the hood. This is a challenging, real-world data problem! To help you out, we've created some pseudocode and docstrings to get you started.

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, compute the city means on the training folds just like you did in Notebook 1 (you may have to re-join the city column to do this), and then join these values to the validation fold.

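This pseudocode may help you fill in the function:

```python
from sklearn.model_selection import KFold

def custom_cross_validation(X_train):
    kfold = KFold()  # sklearn k-fold splitter (5 folds by default)
    train_folds = []
    val_folds = []
    for training_index, val_index in kfold.split(X_train):
        # select this split's rows with .iloc and the loop variables
        train_fold = X_train.iloc[training_index]
        val_fold = X_train.iloc[val_index]

        # recompute training city means like you did in notebook 1
        # merge to validation fold

        train_folds.append(train_fold)
        val_folds.append(val_fold)

    # return only after every split has been processed
    return train_folds, val_folds
```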
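
One way to fill in the two commented steps is sketched below. It is only an illustration under assumptions: the function name `cv_with_city_means`, the column names `city` and `sold_price`, the new `city_mean` column, and the fallback for unseen cities are all hypothetical choices, not the Notebook 1 implementation.

```python
import pandas as pd
from sklearn.model_selection import KFold


def cv_with_city_means(data, city_col="city", target_col="sold_price", n_splits=5):
    """Split `data` into folds, encoding the city from training rows only."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    train_folds, val_folds = [], []

    for train_idx, val_idx in kf.split(data):
        train_fold = data.iloc[train_idx].copy()
        val_fold = data.iloc[val_idx].copy()

        # City means come from the training rows of this split only
        city_means = train_fold.groupby(city_col)[target_col].mean()
        # Assumed fallback for cities that never appear in the training rows
        fallback = train_fold[target_col].mean()

        train_fold["city_mean"] = train_fold[city_col].map(city_means)
        val_fold["city_mean"] = val_fold[city_col].map(city_means).fillna(fallback)

        train_folds.append(train_fold)
        val_folds.append(val_fold)

    return train_folds, val_folds


# Toy data to exercise the function
toy = pd.DataFrame({
    "city": ["A", "B"] * 5,
    "sold_price": [float(p) for p in range(100, 1100, 100)],
})
train_folds, val_folds = cv_with_city_means(toy, n_splits=5)
print(val_folds[0])
```

Because the validation rows never contribute to `city_means`, the encoded value they receive is the same one a genuinely unseen listing would get.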


**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values.
- For each set of hyperparameter values, fit your chosen model on each training fold, score it on the matching validation fold, and take the average of your chosen scoring metric. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values.
- Your function should output the hyperparameter values corresponding to the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.
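
As a small illustration of the `itertools.product()` step, the sketch below expands a dictionary of value lists into one dictionary per combination. The two-parameter grid is a shortened, illustrative example.

```python
import itertools

# Shortened, illustrative grid
param_grid = {
    "max_depth": [10, 20],
    "max_features": ["sqrt", "log2"],
}

keys = list(param_grid)
# product() yields one tuple of values per combination, rightmost key varying fastest
combos = [dict(zip(keys, values)) for values in itertools.product(*param_grid.values())]

for combo in combos:
    print(combo)
# {'max_depth': 10, 'max_features': 'sqrt'}
# {'max_depth': 10, 'max_features': 'log2'}
# {'max_depth': 20, 'max_features': 'sqrt'}
# {'max_depth': 20, 'max_features': 'log2'}
```

Each dictionary can be unpacked straight into a model constructor with `**combo`.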


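This pseudocode may help you fill in the function:

```python
import itertools
import numpy as np

# Generate every combination of hyperparameter values with itertools
keys, values = zip(*param_grid.items())
hyperparams = [dict(zip(keys, combo)) for combo in itertools.product(*values)]
hyperparam_scores = []
for hyperparam_combo in hyperparams:

    scores = []

    for train_fold, val_fold in zip(train_folds, val_folds):
        # fit the model with hyperparam_combo on train_fold and score it on val_fold;
        # `score_fold` is a placeholder for a helper you write yourself
        fold_score = score_fold(train_fold, val_fold, hyperparam_combo)
        scores.append(fold_score)

    score = np.mean(scores)
    hyperparam_scores.append(score)

# After the loop, find the max of hyperparam_scores (use argmin for an error metric such as MSE).
# The best params are at the same index in `hyperparams`.
best_params = hyperparams[int(np.argmax(hyperparam_scores))]
```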

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean.

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [2]:
import os
import pickle
import warnings

import numpy as np
import pandas as pd

from sklearn import set_config
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

import functions_variables as fv

warnings.filterwarnings("ignore", category=FutureWarning)
set_config(display="diagram")

X = pd.DataFrame()  # empty placeholder passed to the loading pipeline below
source_folder_name = '../data'

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import itertools

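# Custom Cross-Validation Function

def custom_cross_validation(training_data, n_splits=5):
    '''Creates n_splits sets of training and validation folds.

    Note: this version only splits the rows. It does not recompute the
    city means on each training fold, so any city encoding already present
    in training_data was computed before the split.
    '''

    # shuffle before splitting; the fixed seed keeps the folds reproducible
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    training_folds, validation_folds = [], []

    for train_idx, val_idx in kf.split(training_data):
        # kf.split yields positional indices, hence .iloc
        train_fold = training_data.iloc[train_idx]
        val_fold = training_data.iloc[val_idx]

        training_folds.append(train_fold)
        validation_folds.append(val_fold)

    return training_folds, validation_folds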

# Hyperparameter Search Function

def hyperparameter_search(training_folds, validation_folds, param_grid):
    '''Finds the best hyperparameter settings using cross-validation'''

    best_score = float('inf')  # MSE is minimized, so start from +infinity
    best_params = None

    # create all hyperparameter combinations
    keys, values = zip(*param_grid.items())
    param_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

    for params in param_combinations:
        scores = []

        for train_fold, val_fold in zip(training_folds, validation_folds):
            X_train, y_train = train_fold.drop(columns=["target"]), train_fold["target"]
            X_val, y_val = val_fold.drop(columns=["target"]), val_fold["target"]

            model = RandomForestRegressor(**params, random_state=42)
            model.fit(X_train, y_train)
            predictions = model.predict(X_val)
            score = mean_squared_error(y_val, predictions)
            scores.append(score)

        avg_score = np.mean(scores)

        if avg_score < best_score:
            best_score = avg_score
            best_params = params

    return best_params

# loading dataset 
X_train = pd.read_pickle("/Users/rimbarbar/LHL_DS-Midterm-Project/DS-Midterm-Project/data/processed/X_train.pkl")
y_train = pd.read_pickle("/Users/rimbarbar/LHL_DS-Midterm-Project/DS-Midterm-Project/data/processed/Y_train.pkl")

# combine into single dataframe
training_data = X_train.copy()
training_data["target"] = y_train  

# Cross-Validation
train_folds, val_folds = custom_cross_validation(training_data)

# Define Hyperparameter Grid

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Run Hyperparameter Search
best_hyperparams = hyperparameter_search(train_folds, val_folds, param_grid)

# Print the best hyperparameters
print("Best Hyperparameters:", best_hyperparams)

Best Hyperparameters: {'max_depth': 20, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt'}


## Hyperparam Tuning

In [30]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define parameter grid
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Initialize RandomForestRegressor
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Perform GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Best hyperparameters found by GridSearchCV
best_grid_params = grid_search.best_params_
print("Best Hyperparameters from GridSearchCV:", best_grid_params)

# Get best model
best_model_grid = grid_search.best_estimator_

# Align X_test with the training columns before predicting:
# reindex adds any missing column (filled with NaN) and matches the column order of X_train
X_test = X_test.reindex(columns=X_train.columns)

# predict on the test set using the best model
y_pred_grid = best_model_grid.predict(X_test)


Best Hyperparameters from GridSearchCV: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 10}


In [31]:
# RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

# Define parameter distribution
param_dist = {
    'max_depth': [None, 10, 20, 30, 50, 100],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

# Initialize RandomForestRegressor
rf = RandomForestRegressor(random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=100, cv=5, 
                                   scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)

# Perform RandomizedSearchCV on the training data
random_search.fit(X_train, y_train)

# Best hyperparameters found by RandomizedSearchCV
best_random_params = random_search.best_params_
print("Best Hyperparameters from RandomizedSearchCV:", best_random_params)

# Get best model
best_model_random = random_search.best_estimator_

# predict on the test set with the best model
y_pred_random = best_model_random.predict(X_test)


Best Hyperparameters from RandomizedSearchCV: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 18}
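
Both searches above use `scoring='neg_mean_squared_error'`, so their `best_score_` attribute is a negative MSE. The sketch below shows how that score can be turned into an RMSE in the target's units. It runs on synthetic data from `make_regression` with a deliberately tiny grid; the data and grid are illustrative assumptions, not the project's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, None]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate.
# With this scoring it is the negative MSE, so flip the sign before taking the root.
cv_rmse = np.sqrt(-search.best_score_)
print("Best params:", search.best_params_)
print("Cross-validated RMSE:", cv_rmse)
```

Applying the same conversion to `grid_search.best_score_` and `random_search.best_score_` would put the two searches on a comparable footing.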


We want to make sure that we save our models. Historically, models were simply pickled (serialized). Now, however, certain model types have their own save format. A model from sklearn can be pickled; an xgboost model, for example, can also be pickled, but its newest save format is JSON. It's a good idea to stay with the most current methods.
- You may want to create a new `models/` subdirectory in your repo to stay organized.

In [33]:
# Save your best model here
model_dir = '../data/models'
filename = 'best_model.pkl'
filepath = os.path.join(model_dir, filename)

os.makedirs(model_dir, exist_ok=True)  # create the directory if it does not exist

# Save model using pickle
try:
    with open(filepath, 'wb') as file:
        pickle.dump(best_model_grid, file)
        print(f"Model saved to: {filepath}")
except Exception as e:
    print(f"Error saving model to {filepath}: {e}")

Model saved to: ../data/models/best_model.pkl


## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended you complete it if you have time!**

Once you've identified which model works best, implement a prediction pipeline to make sure that you haven't leaked any data and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions.
- Assume that the new data is in the same JSON format as your original data. You can use your original data to check that the pipeline works correctly.
- Beware that every intermediate step of a pipeline must have `fit` and `transform` methods, so a plain function cannot be used as a step directly.
- Classes can be used to get around this, but sklearn now also has a wrapper for user-defined functions, `FunctionTransformer`.
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py`.
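
As a rough sketch of the function-wrapper route, the example below wraps a plain function in sklearn's `FunctionTransformer` so it can sit inside a `Pipeline`. The function `add_log_sqft` and the columns `sqft` and `beds` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_log_sqft(df):
    """Hypothetical processing step: add a log-transformed copy of `sqft`."""
    out = df.copy()
    out["log_sqft"] = np.log1p(out["sqft"])
    return out


pipe = Pipeline([
    # FunctionTransformer supplies the fit/transform interface around the function
    ("add_log_sqft", FunctionTransformer(add_log_sqft)),
    ("scale", StandardScaler()),
])

demo = pd.DataFrame({"sqft": [800.0, 1200.0, 2000.0], "beds": [1.0, 2.0, 3.0]})
scaled = pipe.fit_transform(demo)
print(scaled.shape)  # (3, 3)
```

This suits stateless steps. A step that has to learn something from the training data (such as per-city means) still needs a class with its own `fit`.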

### Create the classes for loading and cleaning the data
- This could have been done with a `FunctionTransformer`; however, I wanted the practice of writing classes.
  - The code for the `FunctionTransformer` version would have been much simpler.

In [8]:
fv.DataLoader

functions_variables.DataLoader

In [9]:
fv.DataCleaner

functions_variables.DataCleaner

### Create separate classes to scale the numeric values with a `StandardScaler` and to reduce the number of Boolean columns with PCA
- Again, this is fairly unnecessary at the moment; it was mostly an exercise in building more complex classes and stringing them into a union.

In [10]:
fv.PCA_bool

functions_variables.PCA_bool

In [11]:
fv.Scalar_numeric

functions_variables.Scalar_numeric
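
The source of `fv.Scalar_numeric` and `fv.PCA_bool` is not shown in this notebook, so the sketch below is only a guess at the general shape of such classes. The class names `NumericScaler` and `BoolPCA`, and the toy columns, are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler


class NumericScaler(BaseEstimator, TransformerMixin):
    """Standardize the non-boolean numeric columns (hypothetical class)."""

    def fit(self, X, y=None):
        # "number" does not match bool columns, so those are left for BoolPCA
        self.columns_ = X.select_dtypes(include="number").columns.tolist()
        self.scaler_ = StandardScaler().fit(X[self.columns_])
        return self

    def transform(self, X):
        return self.scaler_.transform(X[self.columns_])


class BoolPCA(BaseEstimator, TransformerMixin):
    """Compress the boolean columns with PCA (hypothetical class)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y=None):
        self.columns_ = X.select_dtypes(include="bool").columns.tolist()
        self.pca_ = PCA(n_components=self.n_components).fit(X[self.columns_].astype(float))
        return self

    def transform(self, X):
        return self.pca_.transform(X[self.columns_].astype(float))


demo = pd.DataFrame({
    "sqft": [800.0, 1200.0, 2000.0, 1500.0],
    "beds": [1, 2, 3, 2],
    "garage": [True, False, True, True],
    "pool": [False, False, True, False],
    "view": [True, True, False, False],
})

# FeatureUnion runs both transformers on the same input and stacks their outputs side by side
union = FeatureUnion([("numeric", NumericScaler()), ("boolean", BoolPCA(n_components=2))])
features = union.fit_transform(demo)
print(features.shape)  # (4, 4)
```

Two scaled numeric columns plus two principal components give the four output columns.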

### Build our basic pipelines: one to load and clean the data, the next to scale and learn from the data
- The load-and-clean pipeline just strings two basic steps together, so there is no real need for it as a standalone pipeline. However, keeping it separate leaves room for a train/test split between the two pipelines, which would not be possible if they were tied together.

In [34]:
# Build pipeline here

pipe_load_clean = Pipeline([
    ('loader', fv.DataLoader(source_folder_name)),
    ('cleaner', fv.DataCleaner(num_type_to_drop=20))
])

union_pipe = FeatureUnion(transformer_list=[
    ('numeric', fv.Scalar_numeric()),
    ('boolean', fv.PCA_bool(n_components=10))
])

processing_pipe = Pipeline([
    ('scaling_union', union_pipe),
    ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42))
])

In [35]:
cleaned_data = pipe_load_clean.fit_transform(X)
pipe_load_clean

Not a Json: uscities.csv
Not a Json: .gitkeep
Not a Json: models
Not a Json: license.txt
Not a Json: processed
Geocoding attempt 1 of 4...


In [36]:
processing_pipe

In [37]:
X = cleaned_data.drop('sold_price', axis=1)
y = cleaned_data['sold_price']
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.8, random_state=42)


model = processing_pipe.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [38]:
print('rmse:', rmse,'\nmae:', mae, '\nr2:',r2)

rmse: 435989.7805124466 
mae: 156973.93950605326 
r2: -0.5339755670345716


Pipelines come from sklearn. When a pipeline is pickled, all of the information in the pipeline is stored with it. For example, if we were deploying a model and had fit a scaler on the training data, we would want that same, already fitted scaler to transform the new data. This is all stored when the pipeline is pickled.
- Save your final pipeline in your `models/` folder.

In [39]:
model_dir = '../data/models'
filename = 'best_pipeline.pkl'

filepath = os.path.join(model_dir, filename)

os.makedirs(model_dir, exist_ok=True)  # makes the directory if it does not exist

try:
    with open(filepath, 'wb') as file:
        pickle.dump(model, file)
        print(f"Model saved to {filepath}")
except Exception as e:
    print(f"Error saving model: {e}")


Model saved to ../data/models/best_pipeline.pkl
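
To check that a pickled pipeline really carries its fitted steps with it, a round trip like the sketch below can help. It uses synthetic data, a throwaway pipeline, and a temporary directory, all of which are illustrative assumptions rather than the project's pipeline or paths.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("random_forest", RandomForestRegressor(n_estimators=10, random_state=42)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    with open(path, "wb") as file:
        pickle.dump(pipe, file)
    with open(path, "rb") as file:
        loaded = pickle.load(file)

# The reloaded pipeline reuses the fitted scaler and forest, so predictions match
same = np.allclose(pipe.predict(X_demo), loaded.predict(X_demo))
print(same)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. It is also safest to load a pipeline with the same sklearn version that saved it.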
