## 4th approach - Using PCA and z-score normalization

#### Overview

- This notebook demonstrates an approach to building an ML model from a cleaned dataset. The key steps are feature selection, scaling, Principal Component Analysis (PCA), model training, and hyperparameter tuning using Grid Search.

#### Steps in the first code block

**Feature selection**

- Numerical Features: We select a set of numerical features that are likely to influence the target variable.
- Categorical Features: Categorical features are also selected, and will later be encoded.

**Data Preprocessing**

- Z-score Normalization - We apply Z-score normalization to the numerical features to standardize their values.
- Correlation Matrix - We compute the correlation of each numerical feature with the target and keep the features whose absolute correlation exceeds 0.15.
- Categorical Encoding - We one-hot encode the categorical features and merge them with the selected numerical features.
- Feature Scaling - We apply standard scaling to the full feature matrix (the numerical and the one-hot encoded columns) so that all features contribute equally to PCA.
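
Z-score normalization and standard scaling perform the same operation: subtract the mean, then divide by the standard deviation. On columns that are already Z-scored, a second pass therefore changes nothing beyond floating-point rounding. The sketch below illustrates this in plain Python with made-up values and a hypothetical helper `z_score`. It uses the population standard deviation, which is also the default in both `scipy.stats.zscore` and `StandardScaler`.

```python
import statistics

def z_score(values):
    """Standardize values to mean 0 and unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up living-space sizes in m2, for illustration only
living_space = [80.0, 120.0, 95.0, 150.0, 60.0]

once = z_score(living_space)
twice = z_score(once)  # standardizing already standardized values

print([round(v, 4) for v in once])
# Largest difference between the first and second pass (rounding error only)
print(max(abs(a - b) for a, b in zip(once, twice)))
```

In this notebook the second scaling step still matters for the one-hot encoded columns, which are not Z-scored beforehand.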

**Principal Component Analysis (PCA)**

- PCA is applied to reduce the dimensionality of the dataset while retaining 95% of the variance.
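
When `n_components` is a fraction such as 0.95, scikit-learn's PCA keeps the smallest number of leading components whose cumulative explained-variance ratio exceeds that fraction. The sketch below reproduces only that selection rule in plain Python. The ratios are hypothetical, and `components_for_variance` is an illustrative helper, not part of scikit-learn.

```python
def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio exceeds the threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    return len(explained_variance_ratio)

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = [0.50, 0.30, 0.10, 0.06, 0.04]
n_kept = components_for_variance(ratios, threshold=0.95)
print(n_kept)  # 4
```

After fitting, the number of components actually kept can be read from `pca.n_components_`.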

**Train-Test Split**

- We split the data into training and testing sets.
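
In the code below, the scaler and PCA are fitted on the full dataset before the split, so the test rows influence the preprocessing. A common alternative, not used in this notebook, is to split first and learn the preprocessing statistics from the training rows only. The sketch below shows the idea for scaling alone, in plain Python with made-up numbers and hypothetical helper names. In scikit-learn the same effect is typically obtained by wrapping the steps in a `Pipeline`.

```python
import statistics

def fit_scaler(train_values):
    """Learn mean and population standard deviation from the training rows only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up feature values, already split into train and test for illustration
train = [10.0, 12.0, 14.0, 16.0]
test = [20.0]

mean, std = fit_scaler(train)  # the test row is not seen here
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)  # reuses the training statistics

print(mean, std)
print(test_scaled)
```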

**Model Training and Evaluation**

- We initialize and train three models: Gradient Boosting, Random Forest, and XGBoost.
- We evaluate the models using Mean Squared Error (MSE) and R².
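
For reference, the two metrics can be computed by hand as follows. This is a minimal sketch with made-up values and a hypothetical helper `regression_metrics`. The notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, R2) for two equally long sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared prediction errors
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of the targets
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Made-up targets and predictions
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
mse, r2 = regression_metrics(y_true, y_pred)
print(mse, r2)
```

MSE is expressed in squared units of the target, so its square root is easier to interpret. The XGBoost MSE reported in the output below corresponds to an RMSE of roughly 186,000 in the units of price.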

**Hyperparameter Tuning with Grid Search**

- We perform Grid Search to find the best hyperparameters for the model with the highest test-set R² score.


In [None]:
# Import libraries

import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV


# Load the dataset

df_cleaned_path = '/Users/alexandreribeiro/Downloads/df_cleaned.csv'
df_cleaned = pd.read_csv(df_cleaned_path)

# Select numerical and categorical features

numerical_features = [
    'living_space_size_(m2)', 'lot_size_(m2)', 'build_year', 
    'estimated_neighbourhood_price_per_m2', 'toilet', 'rooms', 'bathroom'
]
categorical_features = [
    'city', 'build_type', 'house_type', 'house_type_detail', 
    'roof', 'floors', 'energy_label', 'position', 'garden'
]
target = 'price'

# Apply Z-score normalization to numerical features

df_cleaned[numerical_features] = df_cleaned[numerical_features].apply(zscore)

# Calculate correlation matrix using only numerical features

correlation_matrix = df_cleaned[numerical_features + [target]].corr()
correlated_features = correlation_matrix.index[abs(correlation_matrix[target]) > 0.15].tolist()

# Include categorical features for encoding
# (dict.fromkeys drops duplicates while keeping a stable column order, unlike set)

selected_features = list(dict.fromkeys(correlated_features + categorical_features + [target]))
df_selected = df_cleaned[selected_features]

# Encode categorical variables

df_encoded = pd.get_dummies(df_selected, columns=categorical_features, drop_first=True)

# Separate features and target variable

X = df_encoded.drop(columns=[target])
y = df_encoded[target]

# Scale numerical features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA

pca = PCA(n_components=0.95)  # Retain 95% of variance

X_pca = pca.fit_transform(X_scaled)

# Split the data

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Initialize models

models = {
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

# Train and evaluate each model

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2}

# Display results

for name, metrics in results.items():
    print(f"{name}: MSE = {metrics['MSE']}, R2 = {metrics['R2']}")

# Perform Grid Search for the best model

param_grids = {
    'GradientBoosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05],
        'max_depth': [3, 4, 5]
    },
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [10, 20],
        'min_samples_split': [2, 5]
    },
    'XGBoost': {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05],
        'max_depth': [3, 4, 5]
    }
}

# Find the best model

best_model_name = max(results, key=lambda name: results[name]['R2'])
best_model = models[best_model_name]
param_grid = param_grids[best_model_name]

grid_search = GridSearchCV(estimator=best_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='r2')
grid_search.fit(X_train, y_train)

# Get best parameters and evaluate on test set

best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Model: {best_model_name}")
print(f"Best Parameters: {best_params}")
print(f"Test Set Performance: MSE = {mse}, R2 = {r2}")

GradientBoosting: MSE = 40826593558.49962, R2 = 0.6466229452631963
RandomForest: MSE = 48316334735.84451, R2 = 0.5817950366060929
XGBoost: MSE = 34466279683.907974, R2 = 0.7016749978065491
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   5.3s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   5.5s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   5.6s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   9.7s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   9.8s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   9.9s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=  10.2s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=  10.8s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=  10.0s
[CV] END ...learning_

#### Grouping the results into a dataframe

- After comparing the other approaches, we found this attempt to be the most successful, with a higher R² for XGBoost.

In [8]:
import pandas as pd

# Results data
results = [
    {'Model': 'GradientBoosting', 'MSE': 40826593558.49962, 'R2': 0.6466229452631963, 'MAE': 115976.443411, 'Best Parameters': None},
    {'Model': 'RandomForest', 'MSE': 48316334735.84451, 'R2': 0.5817950366060929, 'MAE': 120205.811967, 'Best Parameters': None},
    {'Model': 'XGBoost', 'MSE': 34466279683.907974, 'R2': 0.7016749978065491, 'MAE': 112525.067799, 'Best Parameters': None},
    {'Model': 'XGBoost (Grid Search)', 'MSE': 41162374909.99318, 'R2': 0.64371657371521, 'MAE': 112348.109583, 'Best Parameters': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}}
]

# Convert to DataFrame
results_df = pd.DataFrame(results)

results_df

| | Model | MSE | R2 | MAE | Best Parameters |
|---|---|---|---|---|---|
| 0 | GradientBoosting | 40826590000.0 | 0.646623 | 115976.443411 | |
| 1 | RandomForest | 48316330000.0 | 0.581795 | 120205.811967 | |
| 2 | XGBoost | 34466280000.0 | 0.701675 | 112525.067799 | |
| 3 | XGBoost (Grid Search) | 41162370000.0 | 0.643717 | 112348.109583 | {'learning_rate': 0.1, 'max_depth': 3, 'n_esti... |


#### Hyperparameter tuning approach


- Randomized Search CV: This is used first to quickly explore a wide range of hyperparameters. It is more efficient than Grid Search when you have many parameters to tune and you want to get a rough idea of what works best.
- Grid Search CV: After identifying a promising region of the hyperparameter space with Randomized Search, Grid Search is used to finely tune the model by systematically exploring the hyperparameters around the best values found.
- Evaluation: The final performance of the models found by Randomized Search and Grid Search is evaluated on the test set, allowing you to compare how well each approach worked.
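
The search space used in the next cell contains 4 × 4 × 6 × 3 × 3 = 864 combinations, so an exhaustive 5-fold search would need 4,320 fits. With `n_iter=100`, Randomized Search evaluates only 100 of them (500 fits, as the log shows). The sketch below mimics that sampling step in plain Python. It uses `random.sample` as a stand-in, so the combinations drawn will not match scikit-learn's own sampler.

```python
import itertools
import random

# Same search space as the XGBoost grid used with Randomized Search
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Enumerate every combination in the grid
names = list(param_grid)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*param_grid.values())]

rng = random.Random(42)  # plain-Python stand-in, not scikit-learn's sampler
sampled = rng.sample(all_combinations, k=100)  # draws without replacement

print(len(all_combinations))  # 864
print(len(sampled))           # 100
```

When every parameter is given as a list, as here, scikit-learn also samples without replacement, so no combination is evaluated twice.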

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import numpy as np

# Define the parameter grid for XGBoost
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Initialize the XGBoost model
xgb_model = XGBRegressor(random_state=42)

# Randomized Search
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=100,  # number of different combinations to try
    cv=5,
    verbose=2,
    n_jobs=-1,
    scoring='r2',
    random_state=42
)

random_search.fit(X_train, y_train)
best_params_random = random_search.best_params_
best_estimator_random = random_search.best_estimator_
print(f"Best Parameters from Randomized Search: {best_params_random}")

# Evaluate the best estimator
y_pred_random = best_estimator_random.predict(X_test)
mse_random = mean_squared_error(y_test, y_pred_random)
r2_random = r2_score(y_test, y_pred_random)
print(f"Randomized Search - Test Set Performance: MSE = {mse_random}, R2 = {r2_random}")

# Grid Search using a refined grid around the best parameters from Randomized Search.
# subsample and colsample_bytree must stay in (0, 1], so the upper neighbour is capped
# at 1.0 and the set removes the duplicate that capping can create.
best = best_params_random
param_grid_refined = {
    'learning_rate': [best['learning_rate'] / 2, best['learning_rate'], best['learning_rate'] * 2],
    'n_estimators': [best['n_estimators'] // 2, best['n_estimators'], best['n_estimators'] * 2],
    'max_depth': [best['max_depth'] - 1, best['max_depth'], best['max_depth'] + 1],
    'subsample': sorted({best['subsample'] - 0.1, best['subsample'], min(1.0, best['subsample'] + 0.1)}),
    'colsample_bytree': sorted({best['colsample_bytree'] - 0.1, best['colsample_bytree'], min(1.0, best['colsample_bytree'] + 0.1)})
}

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid_refined,
    cv=5,
    verbose=2,
    n_jobs=-1,
    scoring='r2'
)

grid_search.fit(X_train, y_train)
best_params_grid = grid_search.best_params_
best_estimator_grid = grid_search.best_estimator_
print(f"Best Parameters from Grid Search: {best_params_grid}")

# Evaluate the best estimator
y_pred_grid = best_estimator_grid.predict(X_test)
mse_grid = mean_squared_error(y_test, y_pred_grid)
r2_grid = r2_score(y_test, y_pred_grid)
print(f"Grid Search - Test Set Performance: MSE = {mse_grid}, R2 = {r2_grid}")

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.6; total time=   4.3s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.6; total time=   4.4s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.6; total time=   4.9s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.6; total time=   3.4s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.6; total time=   3.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=7, n_estimators=200, subsample=0.6; total time= 3.0min
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=7, n_estimators=200, subsample=0.6; total time= 3.1min
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=7, n_estimators=200, subsample=0.6; total time= 3.1min
[CV] END colsample_byt



[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.8; total time=  11.4s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.8; total time=  12.2s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.8; total time=  12.3s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=3, n_estimators=150, subsample=0.7000000000000001; total time=   6.2s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.9; total time=  12.4s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.9; total time=  12.2s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, max_depth=2, n_estimators=600, subsample=0.9; total time=  12.5s
[CV] END colsample_bytree=0.7000000000000001, learning_rate=0.05, ma

-----------------------------------------------------------------------------------------------------------------------------

#### Initial Model Training and Cross-Validation:

- Objective: To train and evaluate three different models (GradientBoosting, RandomForest, and XGBoost) using cross-validation and identify the best-performing model based on cross-validation results.

Steps:

1. Cross-Validation: For each model, the code performs 5-fold cross-validation on the training data using the cross_val_score function, which calculates the R² score (scoring='r2') for each fold.
2. Model Training: After cross-validation, each model is fitted on the entire training set.
3. Evaluation: The model’s performance is evaluated on the test set by calculating MSE (Mean Squared Error), R² score, and MAE (Mean Absolute Error).
4. Results Storage: The results of the cross-validation (mean and standard deviation of R² scores) and the evaluation on the test set are stored in dictionaries.
5. Best Model Selection: The best model is selected based on the highest mean R² score from the cross-validation.

- Outcome: This approach identifies the best model based on cross-validation performance and provides a baseline evaluation of each model on the test set.
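
With `cv=5` and a regressor, `cross_val_score` splits the training data into five consecutive, unshuffled folds. Each fold serves once as the validation set while the other four are used for training. The sketch below reproduces only the index bookkeeping in plain Python. `k_fold_indices` is a hypothetical helper, and the sample count of 11 is chosen just to show how uneven fold sizes are handled.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(11, n_folds=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

The mean and standard deviation of the five validation scores are what the cell below stores as `CV R2 Mean` and `CV R2 Std`.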

In [4]:
import numpy as np  # needed for np.mean and np.std below
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Assuming X_train, X_test, y_train, y_test are already defined

# Initialize models
models = {
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

# Train and evaluate each model using cross-validation
results = {}
cv_results = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2, 'MAE': mae}
    cv_results[name] = {'CV R2 Mean': np.mean(cv_scores), 'CV R2 Std': np.std(cv_scores)}

# Display results
for name, metrics in results.items():
    print(f"{name}: MSE = {metrics['MSE']}, R2 = {metrics['R2']}, MAE = {metrics['MAE']}")
    print(f"{name} CV: R2 Mean = {cv_results[name]['CV R2 Mean']}, R2 Std = {cv_results[name]['CV R2 Std']}")

# Find the best model based on cross-validation R2 mean
best_model_name = max(cv_results, key=lambda name: cv_results[name]['CV R2 Mean'])
best_model = models[best_model_name]
print(f"Best Model based on CV: {best_model_name}")

GradientBoosting: MSE = 40826593558.49962, R2 = 0.6466229452631963, MAE = 116476.07786641359
GradientBoosting CV: R2 Mean = 0.6229178561084371, R2 Std = 0.03253142260968761
RandomForest: MSE = 48316334735.84451, R2 = 0.5817950366060929, MAE = 119963.07312500001
RandomForest CV: R2 Mean = 0.571062829242394, R2 Std = 0.027650427755007805
XGBoost: MSE = 34466279683.907974, R2 = 0.7016749978065491, MAE = 112782.751809042
XGBoost CV: R2 Mean = 0.5511950969696044, R2 Std = 0.051661427635945216
Best Model based on CV: GradientBoosting


#### Grid Search CV for Hyperparameter Tuning:

- Objective: To fine-tune the hyperparameters of the best-performing model identified in the cross-validation step above, using Grid Search with cross-validation.

Steps:

1. Parameter Grid Definition: A grid of hyperparameters is defined for each model (GradientBoosting, RandomForest, and XGBoost). These grids specify the hyperparameter values to be tested.
2. Grid Search: The best model from the cross-validation step is passed to GridSearchCV, which evaluates every combination of the hyperparameters in its grid, running 5-fold cross-validation for each combination.
3. Best Parameters Identification: After the Grid Search completes, the best combination of hyperparameters is identified.
4. Final Model Evaluation: The best estimator (refit with the best hyperparameters) is evaluated on the test set to measure its final performance.

- Outcome: This approach refines the best model by searching for the best-scoring set of hyperparameters within the grid, aiming to improve the model’s performance on the test set.
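
The cost of an exhaustive search follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below enumerates the GradientBoosting grid in plain Python and reproduces the "12 candidates, totalling 60 fits" line of the log. It only counts the work and does not train anything.

```python
import itertools

# Grid used for GradientBoosting (and XGBoost) in this notebook
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.1, 0.05],
    'max_depth': [3, 4, 5],
}
cv_folds = 5

# Every combination of the listed values is one candidate
candidates = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 12 60
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the whole training set, which is the estimator returned as `best_estimator_`.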

In [5]:
from sklearn.model_selection import GridSearchCV

# Define parameter grids for GridSearchCV
param_grids = {
    'GradientBoosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05],
        'max_depth': [3, 4, 5]
    },
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [10, 20],
        'min_samples_split': [2, 5]
    },
    'XGBoost': {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05],
        'max_depth': [3, 4, 5]
    }
}

# Perform Grid Search for the best model
param_grid = param_grids[best_model_name]

grid_search = GridSearchCV(estimator=best_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='r2')
grid_search.fit(X_train, y_train)

# Get best parameters and evaluate on test set
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Model: {best_model_name}")
print(f"Best Parameters: {best_params}")
print(f"Test Set Performance: MSE = {mse}, R2 = {r2}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time= 2.6min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 5.2min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 5.3min
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time= 5.3min
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time= 3.4min
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time= 3.4min
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time= 3.4min
[CV] END ...learning_rate=0.1, max_depth=3, n_es