# Project 1 - Tehran House Price Prediction - Scikit-Learn - All Models

- Modelling House Price Prediction with **Scikit-Learn**

- Course Name :         Applied Machine Learning
- Course instructor:    Sohail Tehranipour
- Student Name :        Afshin Masoudi Ashtiani
- Project 1 -           Tehran House Price Prediction
- Date :                September 2024
- File(ipynb) :         4/5

## Step 1 : Install required libraries

In [None]:
%pip install pandas numpy joblib
%pip install lightgbm xgboost catboost

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting catboost
  Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Using cached graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting plotly (from catboost)
  Using cached plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl (101.7 MB)
Using cached graphviz-0.20.3-py3-none-any.whl (47 kB)
Using cached plotly-5.24.1-py3-none-any.whl (19.1 MB)
Note: you may need to restart the kernel to use updated packages.


ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 645082f23762c281a7e14fdc23b88e47a3e3bbf8655f5246d80194b104a8ada9
             Got        a6576c510a804292d5087144de0cd58c70d70b5f3e4b871b051d606fc5d09ba1



## Step 2 : Import required libraries

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

## Step 3 : Load and Prepare data

In [4]:
# Load dataset
dataset_path = r'C:/Users/Afshin/Desktop/Project_1/datasets/cleaned_housePrice.csv'
# dataset_path = r'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/datasets/cleaned_housePrice.csv'
df = pd.read_csv(dataset_path)
df.info()

# Define features and target variables
X = df.drop(columns=['Price'])  # Features
y = df['Price'] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2884 entries, 0 to 2883
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Area       2884 non-null   int64  
 1   Room       2884 non-null   int64  
 2   Parking    2884 non-null   bool   
 3   Warehouse  2884 non-null   bool   
 4   Elevator   2884 non-null   bool   
 5   Address    2884 non-null   object 
 6   Price      2884 non-null   float64
dtypes: bool(3), float64(1), int64(2), object(1)
memory usage: 98.7+ KB


## Step 4 : Data Preprocessing

- Convert boolean columns to int

In [4]:
# Convert boolean columns to int
# boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
# X[boolean_features] = X[boolean_features].astype(int)
# X.info()

- Create a preprocessor

In [5]:
# Identify feature types
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f'> Numeric Features     is {numeric_features}')
print(f'> Boolean Features     is {boolean_features}')
print(f'> Categorical Features is {categorical_features}')

> Numeric Features     is ['Area', 'Room']
> Boolean Features     is ['Parking', 'Warehouse', 'Elevator']
> Categorical Features is ['Address']


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numeric_features + boolean_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ], remainder= 'drop')
preprocessor.fit(X)
preprocessor

- Save the preprocessor

In [None]:
# Save the preprocessor to a file
import joblib

preprocessor_path = r'C:/Users/Afshin/Desktop/Project_1/models/tehran_house_price_preprocessor.joblib'
# preprocessor_path = r'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/models/tehran_house_price_preprocessor.joblib'

joblib.dump(preprocessor, preprocessor_path)
print(f'> The preprocessor saved successfully in {preprocessor_path}')

> The preprocessor saved successfully in C:/Users/Afshin/Desktop/Project_1/models/tehran_house_price_preprocessor.joblib


- Split the dataset into **train** and **test** sets

In [58]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 123)
print(f'> Shape of X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}')

> Shape of X_train: (2307, 6), X_test: (577, 6), y_train: (2307,), y_test: (577,)


## Step 5 : Train model and Tune **Hyperparameters**

- Define a function using **GridSearchCV**

In [9]:
import time
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, GridSearchCV

def tune_model(model, param_grid, X_train, y_train, X_test, y_test) -> pd.DataFrame:
    start_time = time.time()

    kf = KFold(n_splits= 20, shuffle= True, random_state= 123)

    # GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(model, param_grid= param_grid, scoring= 'r2', refit= True, cv= kf, n_jobs= -1)
    grid_search.fit(X_train, y_train)

    best_estimator = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    best_estimator_name = str(best_estimator).rsplit('(', 1)[0].rsplit(',', 1)[1].strip()
    
    # Make predictions
    y_train_pred = best_estimator.predict(X_train)
    y_test_pred = best_estimator.predict(X_test)

    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    runtime = time.time() - start_time

    print(f">> Results from Grid Search for {best_estimator_name} " + "--" * 10)
    print(f"> Best Score      : {best_score:0.4f}")
    print(f"> Train R2        : {train_r2:0.4f}")
    print(f"> Test R2         : {test_r2:0.4f}")
    print(f"> Runtime         : {runtime:0.4f}")

    return pd.DataFrame({
        'Best_Score' : [best_score], 
        'Train_R2' : [train_r2], 
        'Test_R2' : [test_r2], 
        'Difference_R2' : [np.abs(train_r2 - test_r2)], 
        'Runtime' : [runtime], 
        'Best_Estimator' : [best_estimator]
    })

- Define models and their respective parameter grids

In [10]:
from sklearn.pipeline import Pipeline
# from catboost import CatBoostRegressor
from sklearn.linear_model import (LinearRegression, Lasso, Ridge, ElasticNet)
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor)
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define a function to create a model pipeline
def create_pipeline(model):
    return Pipeline(steps=[('preprocessor', preprocessor), ('regressor', model)])

# Define models and their respective parameter grids
models_param_grids = {
    'Linear Regression': (create_pipeline(LinearRegression()), {
        'regressor__fit_intercept': [True, False],
    }),

    'Lasso': (create_pipeline(Lasso()), {
        'regressor__alpha': [0.01, 0.1, 1.0, 10.0],
        'regressor__max_iter': [500, 1000, 2000],
        'regressor__tol': [1e-3, 1e-4, 1e-5]
    }),

    'Ridge': (create_pipeline(Ridge()), {
        'regressor__alpha': [0.01, 0.1, 1.0, 10.0],
        'regressor__solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'saga'],
        'regressor__max_iter': [500, 1000, 2000],
        'regressor__tol': [1e-3, 1e-4, 1e-5]
    }),

    'Elastic Net': (create_pipeline(ElasticNet()), {
        'regressor__alpha': [0.01, 0.1, 1.0, 10.0],
        'regressor__l1_ratio': [0.1, 0.5, 0.9, 1.0],
        'regressor__max_iter': [500, 1000, 2000],
        'regressor__tol': [1e-3, 1e-4, 1e-5]
    }),

    'Decision Tree Regressor': (create_pipeline(DecisionTreeRegressor()), {
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__min_samples_split': [2, 5, 10],
        'regressor__min_samples_leaf': [1, 2, 4],
        'regressor__max_features': ['auto', 'sqrt', 'log2'],
    }),

    'KNeighbors Regressor': (create_pipeline(KNeighborsRegressor()), {
        'regressor__n_neighbors': [3, 5, 7, 9, 11],
        'regressor__weights': ['uniform', 'distance'],
        'regressor__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'regressor__leaf_size': [20, 30, 40, 50],
        'regressor__p': [1, 2]
    }),

    'Kernel Ridge': (create_pipeline(KernelRidge()), {
        'regressor__alpha': [0.1, 1.0, 10.0, 100.0],
        'regressor__kernel': ['linear', 'polynomial', 'rbf'],
        'regressor__gamma': [0.1, 1.0, 10.0]
    }),

    'Random Forest Regressor': (create_pipeline(RandomForestRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__max_features': ['auto', 'sqrt', 'log2'],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__bootstrap': [True, False]
    }),

    'Gradient Boosting Regressor': (create_pipeline(GradientBoostingRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__subsample': [0.8, 1.0]
    }),

    'Extra Trees Regressor': (create_pipeline(ExtraTreesRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__max_features': ['auto', 'sqrt', 'log2'],
        'regressor__bootstrap': [True, False]
    }),

    'XGBoost Regressor': (create_pipeline(XGBRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__subsample': [0.8, 1.0]
    }),

    'LGBM Regressor': (create_pipeline(LGBMRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__num_leaves': [31, 63, 127],
        'regressor__boosting_type': ['gbdt', 'dart']
    }),

    #'CatBoost Regressor': (create_pipeline(CatBoostRegressor(verbose=0)), {
    #    'regressor__n_estimators': [100, 200, 300],
    #    'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
    #    'regressor__depth': [6, 8, 10],
    #    'regressor__l2_leaf_reg': [3, 5, 7]
    #}),
}

- Tune each model

In [11]:
# Store results
results_df = pd.DataFrame()

# Tune each model
for model_name, (model_pipeline, param_grid) in models_param_grids.items():
    print(f"\n>> Tuning {model_name} ...")
    new_row = tune_model(model_pipeline, param_grid, X_train, y_train, X_test, y_test)
    new_row.insert(0, 'Name', [model_name])
    new_row
    results_df = pd.concat([results_df, new_row], axis=0, ignore_index=True)

results_df


>> Tuning Linear Regression ...
>> Results from Grid Search for LinearRegression --------------------
> Best Score      : 0.8320
> Train R2        : 0.8716
> Test R2         : 0.8450
> Runtime         : 5.2005

>> Tuning Lasso ...
>> Results from Grid Search for Lasso --------------------
> Best Score      : 0.8296
> Train R2        : 0.8716
> Test R2         : 0.8407
> Runtime         : 189.9701

>> Tuning Ridge ...
>> Results from Grid Search for Ridge --------------------
> Best Score      : 0.8327
> Train R2        : 0.8659
> Test R2         : 0.8374
> Runtime         : 27.8812

>> Tuning Elastic Net ...
>> Results from Grid Search for ElasticNet --------------------
> Best Score      : 0.8296
> Train R2        : 0.8716
> Test R2         : 0.8407
> Runtime         : 231.9063

>> Tuning Decision Tree Regressor ...
>> Results from Grid Search for DecisionTreeRegressor --------------------
> Best Score      : 0.7110
> Train R2        : 0.7571
> Test R2         : 0.6604
> Runtime     

Unnamed: 0,Name,Best_Score,Train_R2,Test_R2,Difference_R2,Runtime,Best_Estimator
0,Linear Regression,0.831952,0.871592,0.845027,0.026565,5.200489,"(ColumnTransformer(transformers=[('num', MinMa..."
1,Lasso,0.82964,0.871592,0.840737,0.030855,189.97011,"(ColumnTransformer(transformers=[('num', MinMa..."
2,Ridge,0.832692,0.865903,0.837437,0.028466,27.881236,"(ColumnTransformer(transformers=[('num', MinMa..."
3,Elastic Net,0.82964,0.871592,0.840737,0.030855,231.906334,"(ColumnTransformer(transformers=[('num', MinMa..."
4,Decision Tree Regressor,0.711,0.757065,0.660363,0.096702,12.135073,"(ColumnTransformer(transformers=[('num', MinMa..."
5,KNeighbors Regressor,0.818835,0.985937,0.820416,0.165521,62.731019,"(ColumnTransformer(transformers=[('num', MinMa..."
6,Kernel Ridge,0.861786,0.917461,0.87557,0.041891,153.998145,"(ColumnTransformer(transformers=[('num', MinMa..."
7,Random Forest Regressor,0.802,0.911061,0.790561,0.120501,357.469936,"(ColumnTransformer(transformers=[('num', MinMa..."
8,Gradient Boosting Regressor,0.83338,0.965524,0.839464,0.12606,1574.656969,"(ColumnTransformer(transformers=[('num', MinMa..."
9,Extra Trees Regressor,0.796823,0.890737,0.7959,0.094837,284.516606,"(ColumnTransformer(transformers=[('num', MinMa..."


- Store the results in a DataFrame for analysis

In [31]:
# Sort the results in a DataFrame for analysis
sorted_results_df = results_df[results_df.Best_Score > 0.83].sort_values(by=['Best_Score'], ascending=[False])
best_score_index = results_df.Best_Score.idxmax()
train_r2_index = results_df.Train_R2.idxmax()
test_r2_index = results_df.Test_R2.idxmax()
difference_r2_index = results_df.Difference_R2.idxmin()
best_score_model = results_df.loc[best_score_index, 'Best_Estimator']
train_r2_model = results_df.loc[train_r2_index, 'Best_Estimator']
test_r2_model = results_df.loc[test_r2_index, 'Best_Estimator']
difference_r2_model = results_df.loc[difference_r2_index, 'Best_Estimator']

- Summary of Model Tuning Results

In [32]:
print(">> Summary of Model Tuning Results :")
print(f"> Maximum Best Score Model    : {results_df.loc[best_score_index, 'Name']}")
print(f"> Maximum Train R2 Model      : {results_df.loc[train_r2_index, 'Name']}")
print(f"> Maximum Test R2 Model       : {results_df.loc[test_r2_index, 'Name']}")
print(f"> Minimum Difference R2 Model : {results_df.loc[difference_r2_index, 'Name']}")
sorted_results_df

>> Summary of Model Tuning Results :
> Maximum Best Score Model    : Kernel Ridge
> Maximum Train R2 Model      : KNeighbors Regressor
> Maximum Test R2 Model       : Kernel Ridge
> Minimum Difference R2 Model : Linear Regression


Unnamed: 0,Name,Best_Score,Train_R2,Test_R2,Difference_R2,Runtime,Best_Estimator
6,Kernel Ridge,0.861786,0.917461,0.87557,0.041891,153.998145,"(ColumnTransformer(transformers=[('num', MinMa..."
10,XGBoost Regressor,0.838707,0.94966,0.845919,0.103741,737.797046,"(ColumnTransformer(transformers=[('num', MinMa..."
8,Gradient Boosting Regressor,0.83338,0.965524,0.839464,0.12606,1574.656969,"(ColumnTransformer(transformers=[('num', MinMa..."
2,Ridge,0.832692,0.865903,0.837437,0.028466,27.881236,"(ColumnTransformer(transformers=[('num', MinMa..."
0,Linear Regression,0.831952,0.871592,0.845027,0.026565,5.200489,"(ColumnTransformer(transformers=[('num', MinMa..."


## Step 6 : Save the tuned best model

- Save the tuned best model

In [None]:
import joblib

for index, row in sorted_results_df[:3].iterrows():
    tuned_best_pipeline_name = row.Name.replace(' ', '')
    tuned_best_pipeline= row.Best_Estimator

    pipeline_path = f"C:/Users/Afshin/Desktop/Project_1/models/{tuned_best_pipeline_name}_pipeline.joblib"
    # pipeline_path = f'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/models/{tuned_best_pipeline_name}_pipeline.joblib'
    
    joblib.dump(tuned_best_pipeline, pipeline_path)
    print(f"> Tuned {row.Name} pipeline Saved to {pipeline_path}")

> Tuned best pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/KernelRidge_pipeline.joblib
> Tuned best pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/XGBoostRegressor_pipeline.joblib
> Tuned best pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/GradientBoostingRegressor_pipeline.joblib


## Step 7 : Make Predictions

- Set display option for floating-point numbers

In [39]:
# Set display option for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.f' % x)

- Load the tuned best pipeline

In [54]:
pipeline_path = 'C:/Users/Afshin/Desktop/Project_1/models/KernelRidge_pipeline.joblib'
loaded_tuned_best_pipeline = joblib.load(pipeline_path)
loaded_tuned_best_pipeline

- Predict the random sample 

In [56]:
# Make predictions
df_sample = df.sample(1)
df_sample['Predicted Price'] = loaded_tuned_best_pipeline.predict(df_sample.drop('Price', axis= 1))
df_sample[['Price','Predicted Price']]

Unnamed: 0,Price,Predicted Price
1668,156802500000,148617753430


- Predict the new sample 

In [57]:
# Make predictions on new data
new_data = pd.DataFrame({
    'Area': [100],  # Example data
    'Room': [2],
    'Parking': [True],
    'Warehouse': [True],
    'Elevator': [True],
    'Address': ['Punak']
})
new_data['Predicted Price'] = loaded_tuned_best_pipeline.predict(new_data)
new_data

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Predicted Price
0,100,2,True,True,True,Punak,105940873714
