## Hyperparam Tuning

Now that we know which models perform best, it's time to cross-validate them and tune their hyperparameters.
- Search online for sensible hyperparameter ranges for each type of model.

`GridSearchCV` and `RandomizedSearchCV` are great tools for handling both of these tasks at once.

There is a fairly significant issue with this approach for this particular problem (described below). But in the interest of creating a basic functional pipeline, you can just use the default sklearn methods for now.
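
As a minimal sketch of the default approach, the cell below runs `GridSearchCV` on synthetic data. The dataset from `make_regression`, the names `X_demo`, `y_demo`, and `demo_grid`, and the small grid values are all assumptions for illustration, not the project's data or recommended ranges.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing features and sale prices (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A deliberately small grid: 2 x 2 = 4 candidate combinations
demo_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    demo_grid,
    cv=3,          # 3-fold cross validation for every candidate
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean validation score of the best candidate across the folds, so tuning and cross validation happen in the same call.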

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended that you complete it if you have time!**

BUT we have a problem: if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the full training data, we can't cross-validate without leaking information.
- The rows in each validation fold were part of the original calculation of the mean for that city, so each validation row's own sale price has already influenced one of its features.
- While sklearn's built-in functions are extremely useful, sometimes it is necessary to do things ourselves.
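
To make the leak concrete, here is a toy illustration with made-up numbers. The column names `city` and `price` and the hand-picked fold indices are assumptions. It compares a city mean computed on all rows with one computed on the training rows only.

```python
import pandas as pd

# Toy data: two cities, three sales each (made-up values)
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "price": [100.0, 200.0, 600.0, 50.0, 70.0, 90.0],
})

val_idx = [2, 5]          # rows held out for validation
train_idx = [0, 1, 3, 4]  # rows the model is allowed to learn from

# Leaky: the means include the validation rows' own prices
leaky_means = toy.groupby("city")["price"].mean()

# Clean: the means are computed from the training rows only
clean_means = toy.iloc[train_idx].groupby("city")["price"].mean()

val = toy.iloc[val_idx].copy()
val["city_mean_leaky"] = val["city"].map(leaky_means)
val["city_mean_clean"] = val["city"].map(clean_means)
print(val)
```

The leaky feature for the city A validation row is pulled toward that row's own price of 600, which is exactly the information the model should not see during validation.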

You need to create two functions to replicate what `GridSearchCV` does under the hood. This is a challenging, real-world data problem! To help you out, we've created some pseudocode and docstrings to get you started.

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits.
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data.
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold

This pseudocode may help you fill in the function:

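```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)  # splitter; the fold indices come from kfold.split(X_train)
train_folds = []
val_folds = []
for training_index, val_index in kfold.split(X_train):
    # select the rows for this split by position
    train_fold = X_train.iloc[training_index].copy()
    val_fold = X_train.iloc[val_index].copy()

    # recompute the city means on train_fold only, like you did in Notebook 1
    # merge those means onto val_fold

    train_folds.append(train_fold)
    val_folds.append(val_fold)

# inside your function, return once the loop has finished:
# return train_folds, val_folds
```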
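
One way the steps above could be filled in is sketched below. It is only one possible approach, not a reference solution. The function name `custom_cross_validation_sketch`, the column names `city` and `price`, the new `city_mean` column, and the fallback to the overall training-fold mean for unseen cities are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import KFold


def custom_cross_validation_sketch(training_data, n_splits=5, city_col="city", target_col="price"):
    """Split into folds and encode the city using training-fold means only."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    training_folds, validation_folds = [], []

    for train_index, val_index in kfold.split(training_data):
        train_fold = training_data.iloc[train_index].copy()
        val_fold = training_data.iloc[val_index].copy()

        # City means come from the training fold only, never from validation rows
        city_means = train_fold.groupby(city_col)[target_col].mean()
        fallback = train_fold[target_col].mean()  # used for cities unseen in this training fold

        train_fold["city_mean"] = train_fold[city_col].map(city_means)
        val_fold["city_mean"] = val_fold[city_col].map(city_means).fillna(fallback)

        # Drop the raw city strings so every remaining column is numeric
        training_folds.append(train_fold.drop(columns=[city_col]))
        validation_folds.append(val_fold.drop(columns=[city_col]))

    return training_folds, validation_folds


# Small made-up frame to exercise the function
demo = pd.DataFrame({
    "city": ["A", "B", "C"] * 10,
    "sqft": range(30),
    "price": [100.0 + 10 * i for i in range(30)],
})
train_folds, val_folds = custom_cross_validation_sketch(demo, n_splits=5)
print(len(train_folds), train_folds[0].shape, val_folds[0].shape)
```

Shuffling inside `KFold` is a design choice here: if the rows are sorted (for example by city or date), unshuffled folds could leave whole cities out of a training fold.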


**`hyperparameter_search()`**
- Should take the training and validation splits from your previous function, along with your dictionary of hyperparameter values.
- For each set of hyperparameter values, fit your chosen model on each training fold, score it on the matching validation fold, and take the average of your chosen scoring metric. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values.
- Your function should output the hyperparameter values corresponding to the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.


This pseudocode may help you fill in the function:

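```python
import itertools
import numpy as np

# every combination of the values in param_grid, one tuple per combination
hyperparams = list(itertools.product(*param_grid.values()))
hyperparam_scores = []
for hyperparam_combo in hyperparams:

    scores = []

    for train_fold, val_fold in zip(training_folds, validation_folds):
        # fit the model with hyperparam_combo on train_fold, then score it on val_fold
        fold_score = ...  # placeholder: replace with your own scoring code
        scores.append(fold_score)

    # average the fold scores for this combination
    hyperparam_scores.append(np.mean(scores))
# After the loop, find the max of hyperparam_scores. The best params are at the same index in `hyperparams`
```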
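
A minimal sketch of this loop is shown below, under several assumptions: the model is a `RandomForestRegressor`, the metric is R², the target column is named `price`, and the folds are plain DataFrames that still contain the target. The function name `hyperparameter_search_sketch` and the hand-built demo folds are hypothetical. Unlike the docstring suggestion of a list, it returns the best settings as a dictionary so they can be passed straight back to the model.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def hyperparameter_search_sketch(training_folds, validation_folds, param_grid, target_col="price"):
    """Return the parameter combination with the highest mean validation R^2."""
    names = list(param_grid)
    best_params, best_score = None, -np.inf

    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_fold, val_fold in zip(training_folds, validation_folds):
            model = RandomForestRegressor(random_state=42, **params)
            model.fit(train_fold.drop(columns=[target_col]), train_fold[target_col])
            preds = model.predict(val_fold.drop(columns=[target_col]))
            scores.append(r2_score(val_fold[target_col], preds))

        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score

    return best_params, best_score


# Made-up data and three hand-built folds, standing in for the output of a custom CV function
rng = np.random.default_rng(0)
frame = pd.DataFrame({"sqft": rng.uniform(500, 3000, size=60)})
frame["price"] = 100 * frame["sqft"] + rng.normal(0, 5000, size=60)

parts = np.array_split(frame.index.to_numpy(), 3)
demo_val = [frame.loc[p] for p in parts]
demo_train = [frame.drop(index=p) for p in parts]

best_params, best_score = hyperparameter_search_sketch(
    demo_train, demo_val, {"n_estimators": [10, 30], "max_depth": [2, None]}
)
print(best_params, round(best_score, 3))
```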

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean.

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [None]:
# develop your custom functions here

def custom_cross_validation(training_data, n_splits=5):
    '''creates n_splits sets of training and validation folds

    Args:
      training_data: the dataframe of features and target to be divided into folds
      n_splits: the number of sets of folds to be created

    Returns:
      A tuple of lists, where the first index is a list of the training folds,
      and the second the corresponding validation fold

    Example:
        >>> output = custom_cross_validation(train_df, n_splits = 10)
        >>> output[0][0] # The first training fold
        >>> output[1][0] # The first validation fold
        >>> output[0][1] # The second training fold
        >>> output[1][1] # The second validation fold... etc.
    '''

    return training_folds, validation_folds

def hyperparameter_search(training_folds, validation_folds, param_grid):
    '''outputs the best combination of hyperparameter settings in the param grid,
    given the training and validation folds

    Args:
      training_folds: the list of training fold dataframes
      validation_folds: the list of validation fold dataframes
      param_grid: the dictionary of possible hyperparameter values for the chosen model

    Returns:
      A list of the best hyperparameter settings based on the chosen metric

    Example:
        >>> param_grid = {
          'max_depth': [None, 10, 20, 30],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 4],
          'max_features': ['sqrt', 'log2']} # for random forest
        >>> hyperparameter_search(output[0], output[1], param_grid = param_grid)
        # assuming 'output' is the output of custom_cross_validation()
        [20, 5, 2, 'log2'] # hyperparams in order
    '''

    return hyperparameters


## Hyperparam Tuning

In [None]:
# perform tuning and cross validation here
# using GridSearchCV / RandomizedSearchCV (MVP)
# or your custom functions
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
import pickle

# Load all models from the pickle file
with open("trained_models.pkl", "rb") as f:
    loaded_models = pickle.load(f)

print("✅ Models loaded successfully!")
print("Available models:", list(loaded_models.keys()))  # Check which models are in the file


✅ Models loaded successfully!
Available models: ['Linear Regression', 'Support Vector Regression', 'Random Forest', 'XGBoost']


In [None]:
rf_model = loaded_models["Random Forest"]
xgb_model = loaded_models["XGBoost"]
svr_model = loaded_models["Support Vector Regression"]
lr_model = loaded_models["Linear Regression"]

In [None]:
import pickle
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split  # Import train_test_split
import pandas as pd

# Load all models from the pickle file
with open("trained_models.pkl", "rb") as f:
    loaded_models = pickle.load(f)

print("✅ Models loaded successfully!")
print("Available models:", list(loaded_models.keys()))  # Check which models are in the file
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")

# Identify and drop or convert columns containing date-time strings or 'sold' in X_train
for col in X_train.columns:
    if X_train[col].dtype == 'object':  # Check if column is of object type (likely string)
        # Attempt to convert to datetime, if successful, extract numerical features
        try:
            X_train[col] = pd.to_datetime(X_train[col])
            X_train[col + '_year'] = X_train[col].dt.year
            X_train[col + '_month'] = X_train[col].dt.month
            X_train[col + '_day'] = X_train[col].dt.day
            X_train = X_train.drop(columns=[col])  # Drop original datetime column
            print(f"Converted datetime column '{col}' to numerical features")
        except ValueError:
            # If conversion fails, it's likely not a datetime, proceed to check for 'sold' or other non-numeric values
            if X_train[col].str.contains('sold', na=False).any() or not pd.api.types.is_numeric_dtype(X_train[col]):
                print(f"Dropping column '{col}' as it contains non-numeric values or 'sold'")
                X_train = X_train.drop(columns=[col])  # Remove this from the DataFrame for X

cv_folds = 5  # Define number of folds
cv_results = {}

for name, model in loaded_models.items():
    scores = cross_val_score(model, X_train, y_train.values.ravel(), cv=cv_folds, scoring="r2") # ravel y_train to 1D array
    cv_results[name] = {
        "Mean R²": scores.mean(),
        "Std Dev": scores.std()
    }

print("✅ Cross-validation completed!")

✅ Models loaded successfully!
Available models: ['Linear Regression', 'Support Vector Regression', 'Random Forest', 'XGBoost']
Dropping column 'status' as it contains non-numeric values or 'sold'
Dropping column 'list_date' as it contains non-numeric values or 'sold'
Converted datetime column 'sold_date' to numerical features
Dropping column 'type' as it contains non-numeric values or 'sold'
Dropping column 'address' as it contains non-numeric values or 'sold'
Dropping column 'city' as it contains non-numeric values or 'sold'
Dropping column 'state' as it contains non-numeric values or 'sold'
Dropping column 'latitude' as it contains non-numeric values or 'sold'
Dropping column 'longitude' as it contains non-numeric values or 'sold'


✅ Cross-validation completed!


In [None]:
for name, metrics in cv_results.items():
    print(f"{name} Cross-Validation Performance:")
    print(f"  - Mean R²: {metrics['Mean R²']:.4f}")
    print(f"  - Std Dev: {metrics['Std Dev']:.4f}")
    print("-" * 50)

Linear Regression Cross-Validation Performance:
  - Mean R²: 0.9884
  - Std Dev: 0.0077
--------------------------------------------------
Support Vector Regression Cross-Validation Performance:
  - Mean R²: -0.0132
  - Std Dev: 0.0077
--------------------------------------------------
Random Forest Cross-Validation Performance:
  - Mean R²: 0.9691
  - Std Dev: 0.0610
--------------------------------------------------
XGBoost Cross-Validation Performance:
  - Mean R²: 0.9767
  - Std Dev: 0.0465
--------------------------------------------------


In [None]:
import numpy as np
y_train = np.ravel(y_train)  # flatten the target to a 1D array

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rf_param_grid = {
    "n_estimators": [100, 200],  # Reduce from 3 values → 2
    "max_depth": [10, 20],  # Reduce from 3 values → 2
    "min_samples_split": [2, 5]  # Reduce from 3 values → 2
}

grid_search_rf = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    rf_param_grid,
    cv=3,
    n_iter=6,
    scoring="r2",
    n_jobs=-1,
    verbose=2
)

grid_search_rf.fit(X_train, y_train)

print("✅ Best Parameters for Random Forest:", grid_search_rf.best_params_)
print("📊 Best R² Score:", grid_search_rf.best_score_)


Fitting 3 folds for each of 6 candidates, totalling 18 fits
✅ Best Parameters for Random Forest: {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 20}
📊 Best R² Score: 0.9525979278968667


In [None]:
# Define hyperparameters for XGBoost
from xgboost import XGBRegressor
xgb_param_grid = {
    "n_estimators": [100, 200],  # Reduce from 3 values → 2
    "learning_rate": [0.01, 0.1],  # Reduce from 3 values → 2
    "max_depth": [3, 6]  # Reduce from 3 values → 2
}

grid_search_xgb = RandomizedSearchCV(
    XGBRegressor(random_state=42),
    xgb_param_grid,
    cv=3,
    n_iter=6,
    scoring="r2",
    n_jobs=-1,
    verbose=2
)

grid_search_xgb.fit(X_train, y_train)

print("✅ Best Parameters for XGBoost:", grid_search_xgb.best_params_)
print("📊 Best R² Score:", grid_search_xgb.best_score_)


Fitting 3 folds for each of 6 candidates, totalling 18 fits
✅ Best Parameters for XGBoost: {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.1}
📊 Best R² Score: 0.9720388188054608


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Define hyperparameters for SVR
svr_param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10],  # Reduced from 3 values to 2
    "epsilon": [0.01, 0.1]  # Reduced from 3 to 2
}

# Apply Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Load X_test data - Assuming it's in a CSV file named 'X_test.csv'
X_test = pd.read_csv("X_test.csv")

# Preprocess X_test similarly to X_train
# Ensure X_test has the same columns as X_train used during fitting
# 1. Identify and drop or convert columns containing date-time strings or 'sold' in X_test
for col in X_test.columns:
    if X_test[col].dtype == 'object':
        try:
            # Attempt to convert to datetime, if successful, extract numerical features
            X_test[col] = pd.to_datetime(X_test[col])
            X_test[col + '_year'] = X_test[col].dt.year
            X_test[col + '_month'] = X_test[col].dt.month
            X_test[col + '_day'] = X_test[col].dt.day
            X_test = X_test.drop(columns=[col])  # Drop original datetime column
            print(f"Converted datetime column '{col}' to numerical features")
        except ValueError:
            # If conversion fails, it's likely not a datetime, proceed to check for 'sold' or other non-numeric values
            if X_test[col].str.contains('sold', na=False).any() or not pd.api.types.is_numeric_dtype(X_test[col]):
                print(f"Dropping column '{col}' as it contains non-numeric values or 'sold'")
                X_test = X_test.drop(columns=[col])  # Remove this from the DataFrame for X


# 2. Apply the same scaling as used for X_train
X_test_scaled = scaler.transform(X_test)

Dropping column 'status' as it contains non-numeric values or 'sold'
Dropping column 'list_date' as it contains non-numeric values or 'sold'
Converted datetime column 'sold_date' to numerical features
Dropping column 'type' as it contains non-numeric values or 'sold'
Dropping column 'address' as it contains non-numeric values or 'sold'
Dropping column 'city' as it contains non-numeric values or 'sold'
Dropping column 'state' as it contains non-numeric values or 'sold'
Dropping column 'latitude' as it contains non-numeric values or 'sold'
Dropping column 'longitude' as it contains non-numeric values or 'sold'


In [None]:
print("✅ Best Parameters for SVR:", grid_search_svr.best_params_)
print("📊 Best R² Score:", grid_search_svr.best_score_)

✅ Best Parameters for SVR: {'kernel': 'rbf', 'epsilon': 0.1, 'C': 10}
📊 Best R² Score: -0.009309857432915089


We want to make sure that we save our models. Traditionally, you would simply pickle (serialize) the model. Now, however, certain model types have their own save formats. sklearn models can be pickled. XGBoost models can also be pickled, but the library's newest native save format is JSON. It's a good idea to stay with the most current methods.
- You may want to create a new `models/` subdirectory in your repo to stay organized.
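
The idea could be sketched as a small helper like the one below. The function name `save_model_to_dir` and the file names are hypothetical. The helper assumes that any model exposing a `save_model()` method (as XGBoost's scikit-learn estimators do) should use it with a `.json` file name, and that everything else is pickled. The demo only exercises the pickle branch, using a temporary directory in place of your repo's `models/` folder.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LinearRegression


def save_model_to_dir(model, directory, name):
    """Save a model with its native save_model() when it has one, otherwise pickle it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    if hasattr(model, "save_model"):
        # Assumption: an xgboost-style estimator, saved in its native JSON format
        path = directory / f"{name}.json"
        model.save_model(str(path))
    else:
        # Plain sklearn estimators are pickled
        path = directory / f"{name}.pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)
    return path


# Round trip with a tiny sklearn model fit on made-up data
demo_model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model_to_dir(demo_model, Path(tmp) / "models", "linear_regression")
    with open(saved_path, "rb") as f:
        restored = pickle.load(f)
    restored_prediction = float(restored.predict([[3.0]])[0])

print(saved_path.name, restored_prediction)
```

A model saved through `save_model()` would need to be read back with the matching `load_model()` method on a fresh estimator rather than with `pickle.load`.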

In [None]:

import pickle


with open("best_xgb_model.pkl", "wb") as f:
    pickle.dump(grid_search_xgb.best_estimator_, f)


## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's highly recommended that you complete it if you have time!**

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
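- Beware that every step in a pipeline except the last must be an object with `fit` and `transform` methods, so a plain function can't be used as a step directly.
- Custom classes can be used to get around this, but sklearn now also provides a wrapper for user-defined functions: `FunctionTransformer`.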
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py`
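
As a rough sketch of the function-wrapper route, the cell below puts a plain function into a `Pipeline` with `FunctionTransformer`. The helper `add_sold_date_parts`, the step names, the choice of `LinearRegression`, and the made-up `sqft` / `sold_date` data are assumptions for illustration. A real pipeline would load your saved tuned model and your own processing functions instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_sold_date_parts(df):
    """Hypothetical helper: replace 'sold_date' with numeric year and month columns."""
    out = df.copy()  # leave the caller's DataFrame untouched
    dates = pd.to_datetime(out.pop("sold_date"))
    out["sold_year"] = dates.dt.year
    out["sold_month"] = dates.dt.month
    return out


demo_pipeline = Pipeline([
    ("dates", FunctionTransformer(add_sold_date_parts)),  # wraps the plain function as a step
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# Made-up data in place of the real listings
demo_X = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2400, 3000],
    "sold_date": ["2023-01-15", "2023-02-20", "2023-03-05",
                  "2023-04-10", "2023-05-25", "2023-06-30"],
})
demo_y = np.array([160.0, 240.0, 300.0, 400.0, 480.0, 600.0])

demo_pipeline.fit(demo_X, demo_y)
demo_preds = demo_pipeline.predict(demo_X)
print(demo_preds.shape)
```

Because the function is stateless, `FunctionTransformer` is enough here. A step that has to learn something from the training data, such as city means, would still need a class with its own `fit` and `transform`.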

In [None]:
# Build pipeline here

Pipelines come from sklearn. When a fitted pipeline is pickled, everything it has learned is stored with it. For example, if we were deploying a model and had fit a scaler on the training data, we would want that same, already-fitted scaler to transform the new data. All of this is stored when the pipeline is pickled.
- Save your final pipeline in your `models/` folder.
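
The short check below illustrates this on made-up data: a fitted pipeline is pickled and restored, and the restored copy carries the scaler's learned means and gives the same predictions. The `Ridge` model and the random data are assumptions, and the round trip uses in-memory bytes (`pickle.dumps`) for brevity. Saving to disk would use `pickle.dump` with a file opened in your `models/` folder.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(40, 2))
y_fit = X_fit @ np.array([3.0, -2.0]) + rng.normal(size=40)

fitted = Pipeline([("scale", StandardScaler()), ("model", Ridge())]).fit(X_fit, y_fit)

# Serialize and restore the whole fitted pipeline
payload = pickle.dumps(fitted)
restored_pipeline = pickle.loads(payload)

# The restored scaler holds the means learned from the training data
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 2))
same_means = np.allclose(
    restored_pipeline.named_steps["scale"].mean_,
    fitted.named_steps["scale"].mean_,
)
same_preds = np.allclose(restored_pipeline.predict(X_new), fitted.predict(X_new))
print(same_means, same_preds)
```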

In [None]:
# save your pipeline here