# Hyperparameter Tuning

With the best-performing model(s) identified, the next step is to optimize their performance through hyperparameter tuning and cross-validation. This process helps ensure that the model is as accurate and generalizable as possible.
## Approach to Hyperparameter Tuning

Hyperparameter Research:
Initial hyperparameter ranges were determined through research and reviewing commonly recommended settings for each model type. This provided a solid foundation for the tuning process.

Tuning Methods:
Two primary methods were considered for hyperparameter optimization:

- GridSearchCV: Systematically explores a predefined set of hyperparameter combinations.
- RandomizedSearchCV: Randomly samples hyperparameter combinations within defined ranges, offering a faster alternative for larger search spaces.

After consideration, GridSearchCV was determined to be practical for this dataset.
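The difference between the two strategies can be sketched with the standard library alone. This is an illustration of the idea, not scikit-learn's implementation: the small grid and the helper names `grid_candidates` and `random_candidates` are assumptions made for the example.

```python
import itertools
import random

# Small illustrative grid (values are assumptions, not the project's full grid)
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 200, 300],
}


def grid_candidates(grid):
    """Yield every combination, as an exhaustive grid search would."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def random_candidates(grid, n_iter, seed=0):
    """Draw up to n_iter distinct combinations, as a randomized search might."""
    all_combos = list(grid_candidates(grid))
    rng = random.Random(seed)
    return rng.sample(all_combos, min(n_iter, len(all_combos)))


full = list(grid_candidates(param_grid))
sampled = random_candidates(param_grid, n_iter=10)
print(len(full), len(sampled))
```

The exhaustive search evaluates all 27 combinations of this toy grid, while the randomized variant evaluates only the 10 it draws. That trade-off matters more as the number of hyperparameters grows.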

<!-- Cross-Validation:
   Both tuning methods were combined with cross-validation to ensure model robustness and avoid overfitting. This approach validates the model's performance across multiple data splits, providing a reliable estimate of its generalization capabilities. -->

## Considerations for This Dataset

Due to the complexity and size of the dataset, this approach carries potential challenges, such as high computational cost and the risk of overfitting during extensive grid searches.

For this iteration, scikit-learn's standard tuning methods were used to establish a basic, functional pipeline. More advanced and customized tuning strategies may be explored in future iterations.

## Hyperparam Tuning

In [1]:
#Import preprocessed data
import pandas as pd
from functions_variables import get_error_scores, display_results_sample, find_best_regression_model

# Independent variable training data
X_train = pd.read_csv("../data/processed/X_train_selected.csv", index_col=0)
print(f"X_train shape: {X_train.shape}")

#Target training data
y_train = pd.read_csv("../data/preprocessed/y_train.csv", index_col=0)
print(f"y_train shape: {y_train.shape}")

# Independent variable test data
X_test = pd.read_csv("../data/processed/X_test_selected.csv", index_col=0)
print(f"X_test shape: {X_test.shape}")

#Target test data
y_test = pd.read_csv("../data/preprocessed/y_test.csv", index_col=0)
print(f"y_test shape: {y_test.shape}")

X_train shape: (3458, 10)
y_train shape: (3458, 1)
X_test shape: (1482, 10)
y_test shape: (1482, 1)


In [2]:
from xgboost import XGBRegressor

# Create a new model instance
loaded_xg = XGBRegressor()

# Load the saved model
loaded_xg.load_model("../models/xgboost_model.json")

#Run model
y_train_pred = loaded_xg.predict(X_train)
y_test_pred = loaded_xg.predict(X_test)

# Check if it performs the same
from sklearn.metrics import r2_score
print("TRAIN R² Score (Loaded Model):", r2_score(y_train, y_train_pred))
print("TEST R² Score (Loaded Model):", r2_score(y_test, y_test_pred))


TRAIN R² Score (Loaded Model): 0.9931779503822327
TEST R² Score (Loaded Model): 0.9858717322349548


In [3]:
#Create grid for desired selection of hyperparameters
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees
    'max_depth': [3, 5, 7],  # Tree depth
    'learning_rate': [0.01, 0.05, 0.1],  # Step size shrinkage
    'subsample': [0.7, 0.8, 1.0],  # Fraction of samples per tree
    'colsample_bytree': [0.7, 0.8, 1.0],  # Fraction of features per tree
    'reg_alpha': [0, 0.1, 0.5],   # L1 regularization (to reduce overfitting)
    'reg_lambda': [0.1, 0.5, 1.0]  # L2 regularization
}


In [4]:
# Run the grid search; the best parameters are printed in the next cell

# Initialize XGBoost model
xg = XGBRegressor()

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
     estimator=xg,
     param_grid=param_grid,
     scoring='r2',  # Using R² as the evaluation metric
     cv=5,  # 5-fold cross-validation
     verbose=1,  # Show training process
     n_jobs=-1  # Use all available CPU cores
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 2187 candidates, totalling 10935 fits
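
The candidate and fit counts reported above follow directly from the grid: seven hyperparameters with three values each, evaluated on five folds. A minimal check of that arithmetic, using only the standard library:

```python
import math

# Same grid as defined in the notebook
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
    "reg_alpha": [0, 0.1, 0.5],
    "reg_lambda": [0.1, 0.5, 1.0],
}
cv_folds = 5

# Number of candidates is the product of the number of values per parameter
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 2187 10935
```

Adding one more value to any single parameter would raise the total to 2916 candidates (14580 fits), which is why exhaustive search becomes expensive quickly.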


In [5]:
# Print the best parameters found
print("Best Parameters:", grid_search.best_params_)

# Get the best model
best_xg_model = grid_search.best_estimator_

print(grid_search.best_estimator_)


Best Parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300, 'reg_alpha': 0, 'reg_lambda': 0.1, 'subsample': 1.0}
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=300, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)


In [6]:
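# Hyperparameters used for the final model.
# Note: colsample_bytree, reg_alpha, reg_lambda, and subsample differ from the
# grid_search.best_params_ values printed above.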
best_params = {
    'colsample_bytree': 0.7,
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 300,
    'reg_alpha': 0.5,
    'reg_lambda': 0.5,
    'subsample': 0.7
}

# Initialize and train the model with the best parameters
best_xg_model = XGBRegressor(**best_params)
best_xg_model.fit(X_train, y_train)

In [7]:
# Get predictions using the best model
y_train_pred_best = best_xg_model.predict(X_train)
y_test_pred_best = best_xg_model.predict(X_test)

# Check R² score
print("TRAIN R² Score (Best Model):", r2_score(y_train, y_train_pred_best))
print("TEST R² Score (Best Model):", r2_score(y_test, y_test_pred_best))

get_error_scores(y_train, y_train_pred_best, y_test, y_test_pred_best)

TRAIN R² Score (Best Model): 0.998171329498291
TEST R² Score (Best Model): 0.9939860105514526
R SQUARED
	Train R²:	0.9982
	Test R²:	0.994
MEAN AVERAGE ERROR
	Train MAE:	5379.17
	Test MAE:	8567.57
ROOT MEAN SQUARED ERROR
	Train RMSE:	7872.19
	Test RMSE:	14489.44

10 Randomly selected results.
Index: 987 	- 	Prediction: $356,311 	Actual: $350,000 	Difference: 6,311, -1.77%
Index: 396 	- 	Prediction: $139,923 	Actual: $145,000 	Difference: -5,077, 3.63%
Index: 92 	- 	Prediction: $352,221 	Actual: $330,000 	Difference: 22,221, -6.31%
Index: 1178 	- 	Prediction: $336,512 	Actual: $335,000 	Difference: 1,512, -0.45%
Index: 483 	- 	Prediction: $235,708 	Actual: $210,000 	Difference: 25,708, -10.91%
Index: 248 	- 	Prediction: $264,346 	Actual: $305,000 	Difference: -40,654, 15.38%
Index: 846 	- 	Prediction: $344,682 	Actual: $350,000 	Difference: -5,318, 1.54%
Index: 921 	- 	Prediction: $169,174 	Actual: $165,000 	Difference: 4,174, -2.47%
Index: 1474 	- 	Prediction: $215,483 	Actual: $205,000
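
For reference, the three metrics reported above can be sketched in plain Python. The internals of `get_error_scores` are not shown in this notebook, so the functions below are an assumption about what it computes, with MAE taken to mean mean absolute error; the sample values are made up for illustration.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Illustrative values only, not taken from the dataset
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

print(r_squared(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
```

RMSE penalizes large misses more heavily than MAE, which is consistent with the test RMSE above being noticeably larger than the test MAE.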

In [8]:
# Save the best model
best_xg_model.save_model("../models/xgboost_best_model.json")

## Preventing Data Leakage in Tuning

Since we used the entire X_train set to target encode the cities and states, standard cross-validation would leak information: the target values in each validation fold have already contributed to the encodings used in the training folds. Cross-validation scores would then be optimistically biased, because the model would fit the validation fold better than it would fit truly unseen data.

To prevent this, we wrote two custom functions. The first takes the training data, splits it into k folds, and then target encodes the city and state for each fold. The second replicates the behavior of GridSearchCV but takes the output of the first function as its argument.
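
The idea behind the first function can be sketched as follows. This is not the code in `custom_cross_validation`, whose internals are not shown here; the helper names, the unshuffled contiguous folds, and the toy data are all assumptions for illustration. The key point is that category means are learned from the training rows of each split only.

```python
from collections import defaultdict


def kfold_indices(n_rows, k):
    """Split row positions 0..n_rows-1 into k contiguous validation folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def target_encode_fold(categories, targets, train_idx, val_idx):
    """Encode categories using target means learned from train_idx only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i in train_idx:
        sums[categories[i]] += targets[i]
        counts[categories[i]] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets[i] for i in train_idx) / len(train_idx)

    def encode(indices):
        # Categories unseen in the training rows fall back to the training mean
        return [means.get(categories[i], global_mean) for i in indices]

    return encode(train_idx), encode(val_idx)


# Toy data (hypothetical): city labels and sale prices
cities = ["A", "A", "B", "B", "C", "A"]
prices = [100.0, 200.0, 300.0, 500.0, 900.0, 300.0]

for val_idx in kfold_indices(len(cities), k=3):
    train_idx = [i for i in range(len(cities)) if i not in val_idx]
    train_enc, val_enc = target_encode_fold(cities, prices, train_idx, val_idx)
    print(val_idx, val_enc)
```

Because the validation rows never contribute to the means they are encoded with, the validation score reflects how the encoding behaves on unseen data. A real implementation would typically shuffle rows before splitting.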

In [9]:
# Redoing transformations from the EDA notebook, starting from before the data leakage occurred
X_train = pd.read_csv("../data/preprocessed/X_train.csv", index_col=0)
X_train = X_train.drop(columns=['location.address.city', 'location.address.state'])

#Setting Cities and States back to original string values
df = pd.read_csv("../data/preprocessed/combined_df.csv", index_col=[0])
X_train = X_train.join(df[['location.address.city', 'location.address.state']])

#Selecting same features from model_selection
selected_features = pd.read_csv('../data/preprocessed/selected_feature.csv', index_col=0)
selected_features = selected_features['Feature']
X_train = X_train[selected_features]

X_train = X_train.join(df['description.sold_price'])

In [10]:
from functions_variables import custom_cross_validation
output = custom_cross_validation(X_train, 5)

In [11]:
OHE_cols = ['fireplace', 'condos', 'townhomes', 'single_family']

In [12]:
#Standardizing data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#Fitting scaler to train fold and transforming train and val folds
for i in range(len(output[0])):
    scaled_train = pd.DataFrame(scaler.fit_transform(output[0][i]), index=output[0][i].index, columns=output[0][i].columns)
    scaled_val = pd.DataFrame(scaler.transform(output[1][i]), index=output[1][i].index, columns=output[1][i].columns)
    scaled_train[OHE_cols] = output[0][i][OHE_cols]
    scaled_val[OHE_cols] = output[1][i][OHE_cols]
    output[0][i] = scaled_train
    output[1][i] = scaled_val

In [13]:
from functions_variables import hyperparameter_search
hyperparameter_search(output, param_grid)

Parameters with lowest RMSE: {'n_estimators': 300, 'max_depth': 7, 'learning_rate': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.7, 'reg_alpha': 0, 'reg_lambda': 0.1}
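
A search of this kind, which scores each parameter combination on precomputed folds and keeps the one with the lowest mean RMSE, can be sketched as below. This is not the code in `hyperparameter_search`; the fold layout (a list of train/validation tuples), the `fit_predict` callback, and the toy `shrunken_mean` model are assumptions made to keep the example self-contained.

```python
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def search_over_folds(folds, param_grid, fit_predict):
    """Return (best_params, best_mean_rmse) across precomputed folds.

    folds: list of (X_train, y_train, X_val, y_val) tuples, already encoded
    fit_predict: callable(params, X_train, y_train, X_val) -> predictions
    """
    keys = list(param_grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = [
            rmse(y_val, fit_predict(params, X_tr, y_tr, X_val))
            for X_tr, y_tr, X_val, y_val in folds
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def shrunken_mean(params, X_train, y_train, X_val):
    """Toy stand-in model: predict the training mean scaled by `shrink`."""
    mean_y = sum(y_train) / len(y_train)
    return [params["shrink"] * mean_y for _ in X_val]


# Hypothetical folds with a constant target, so shrink=1.0 is exact
folds = [
    ([[0], [1]], [10.0, 10.0], [[2]], [10.0]),
    ([[0], [2]], [10.0, 10.0], [[1]], [10.0]),
]
best_params, best_rmse = search_over_folds(folds, {"shrink": [0.5, 1.0]}, shrunken_mean)
print(best_params, best_rmse)
```

In the notebook's setting, `fit_predict` would wrap fitting an `XGBRegressor` with the given parameters on the training fold and predicting on the validation fold.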


In [14]:
# Resetting train and test data back to the original processed files
X_train = pd.read_csv("../data/processed/X_train_selected.csv", index_col=0)
y_train = pd.read_csv("../data/preprocessed/y_train.csv", index_col=0)
X_test = pd.read_csv("../data/processed/X_test_selected.csv", index_col=0)
y_test = pd.read_csv("../data/preprocessed/y_test.csv", index_col=0)

In [17]:
custom_function_params = {
    'colsample_bytree': 0.7,
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 300,
    'reg_alpha': 0,
    'reg_lambda': 0.1,
    'subsample': 0.7
}

# Initialize and train the model with the hyperparameter search values
test_xg_model = XGBRegressor(**custom_function_params)
test_xg_model.fit(X_train, y_train)

In [18]:
# Get predictions using the model trained with the custom-search parameters
y_train_pred_test = test_xg_model.predict(X_train)
y_test_pred_test = test_xg_model.predict(X_test)

# Check R² score
print("TRAIN R² Score (Best Model):", r2_score(y_train, y_train_pred_test))
print("TEST R² Score (Best Model):", r2_score(y_test, y_test_pred_test))

get_error_scores(y_train, y_train_pred_test, y_test, y_test_pred_test)

TRAIN R² Score (Best Model): 0.9985064268112183
TEST R² Score (Best Model): 0.994156539440155
R SQUARED
	Train R²:	0.9985
	Test R²:	0.9942
MEAN AVERAGE ERROR
	Train MAE:	4952.11
	Test MAE:	8104.39
ROOT MEAN SQUARED ERROR
	Train RMSE:	7114.4
	Test RMSE:	14282.57

10 Randomly selected results.
Index: 117 	- 	Prediction: $583,661 	Actual: $575,000 	Difference: 8,661, -1.48%
Index: 1349 	- 	Prediction: $546,007 	Actual: $550,000 	Difference: -3,993, 0.73%
Index: 1112 	- 	Prediction: $687,268 	Actual: $685,000 	Difference: 2,268, -0.33%
Index: 834 	- 	Prediction: $226,263 	Actual: $226,500 	Difference: -237, 0.1%
Index: 18 	- 	Prediction: $796,317 	Actual: $827,522 	Difference: -31,205, 3.92%
Index: 1099 	- 	Prediction: $213,457 	Actual: $207,000 	Difference: 6,457, -3.02%
Index: 1253 	- 	Prediction: $305,439 	Actual: $292,000 	Difference: 13,439, -4.4%
Index: 1164 	- 	Prediction: $759,304 	Actual: $780,000 	Difference: -20,696, 2.73%
Index: 978 	- 	Prediction: $428,576 	Actual: $415,000 	D