# Project 1 - Tehran House Price Prediction - Scikit-Learn

- Modelling House Price Prediction with **Scikit-Learn**

- Course Name :         Applied Machine Learning
- Course instructor:    Sohail Tehranipour
- Student Name :        Afshin Masoudi Ashtiani
- Project 1 -           Tehran House Price Prediction
- Date :                September 2024
- File(ipynb) :         5/5

## Step 1 : Install required libraries

In [9]:
%pip install pandas numpy joblib scikit-learn
%pip install lightgbm xgboost catboost

Note: you may need to restart the kernel to use updated packages.
Collecting catboost
  Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Using cached graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting plotly (from catboost)
  Using cached plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Using cached catboost-1.2.7-cp312-cp312-win_amd64.whl (101.7 MB)
Using cached graphviz-0.20.3-py3-none-any.whl (47 kB)
Using cached plotly-5.24.1-py3-none-any.whl (19.1 MB)
Note: you may need to restart the kernel to use updated packages.


ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 645082f23762c281a7e14fdc23b88e47a3e3bbf8655f5246d80194b104a8ada9
             Got        a6576c510a804292d5087144de0cd58c70d70b5f3e4b871b051d606fc5d09ba1



## Step 2 : Import required libraries

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

## Step 3 : Load and Prepare data

In [3]:
# Load dataset
dataset_path = r'C:/Users/Afshin/Desktop/Project_1/datasets/cleaned_housePrice.csv'
# dataset_path = r'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/datasets/cleaned_housePrice.csv'
df = pd.read_csv(dataset_path)
df.info()

# Define features and target variables
X = df.drop(columns=['Price'])  # Features
y = df['Price']  # Target

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2884 entries, 0 to 2883
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Area       2884 non-null   int64  
 1   Room       2884 non-null   int64  
 2   Parking    2884 non-null   bool   
 3   Warehouse  2884 non-null   bool   
 4   Elevator   2884 non-null   bool   
 5   Address    2884 non-null   object 
 6   Price      2884 non-null   float64
dtypes: bool(3), float64(1), int64(2), object(1)
memory usage: 98.7+ KB


## Step 4 : Data Preprocessing

- Convert boolean columns to int

In [4]:
# Convert boolean columns to int
# boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
# X[boolean_features] = X[boolean_features].astype(int)
# X.info()

- Create a preprocessor

In [4]:
# Identify feature types
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f'> Numeric Features     is {numeric_features}')
print(f'> Boolean Features     is {boolean_features}')
print(f'> Categorical Features is {categorical_features}')

> Numeric Features     is ['Area', 'Room']
> Boolean Features     is ['Parking', 'Warehouse', 'Elevator']
> Categorical Features is ['Address']


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numeric_features + boolean_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ], remainder= 'drop')
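# Note: this fits on all rows (train and test). GridSearchCV in Step 5 clones the
# pipeline and refits its preprocessor on the training folds only.
preprocessor.fit(X)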
preprocessor
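
To make the effect of this preprocessor concrete, here is a minimal sketch on a made-up three-row frame. The rows, the address `'Unknown Street'`, and the names `toy`, `toy_pre`, and `unseen` are illustrative assumptions, not taken from the dataset. It shows min-max scaling of the numeric and boolean columns and how `handle_unknown='ignore'` treats an address that was not present during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Area': [50, 100, 150],
    'Room': [1, 2, 3],
    'Parking': [False, True, True],
    'Address': ['Punak', 'Shahran', 'Punak'],
})

# sparse_threshold=0 forces a dense array so the result is easy to print
toy_pre = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), ['Area', 'Room', 'Parking']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Address']),
    ], remainder='drop', sparse_threshold=0)
toy_pre.fit(toy)

# An address never seen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({
    'Area': [100], 'Room': [2], 'Parking': [True], 'Address': ['Unknown Street']
})
encoded = toy_pre.transform(unseen)
print(encoded)
```

The first three values are the scaled `Area`, `Room`, and `Parking`; the remaining two are the one-hot columns for the two fitted addresses, both zero for the unseen one.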

- Save the preprocessor

In [6]:
# Save the preprocessor to a file
import joblib

preprocessor_path = r'C:/Users/Afshin/Desktop/Project_1/models/tehran_house_price_preprocessor.joblib'
# preprocessor_path = r'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/models/tehran_house_price_preprocessor.joblib'

joblib.dump(preprocessor, preprocessor_path)
print(f'> The preprocessor saved successfully in {preprocessor_path}')

> The preprocessor saved successfully in C:/Users/Afshin/Desktop/Project_1/models/tehran_house_price_preprocessor.joblib


- Split the dataset into **train** and **test** sets

In [7]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 123)
print(f'> Shape of X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}')

> Shape of X_train: (2307, 6), X_test: (577, 6), y_train: (2307,), y_test: (577,)
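
The reported shapes follow from how `train_test_split` sizes a fractional test set: the test count is rounded up and the training set receives the remainder. A short check of that arithmetic, using the row count from `df.info()`:

```python
import math

n_samples = 2884   # rows reported by df.info()
test_size = 0.2

# A fractional test_size is rounded up; the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```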


## Step 5 : Train model and Tune **Hyperparameters**

- Define a function using **GridSearchCV**

In [8]:
import time
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, GridSearchCV

def tune_model(model, param_grid, X_train, y_train, X_test, y_test) -> pd.DataFrame:
    start_time = time.time()

    kf = KFold(n_splits= 20, shuffle= True, random_state= 123)

    # GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(model, param_grid= param_grid, scoring= 'r2', refit= True, cv= kf, n_jobs= -1)
    grid_search.fit(X_train, y_train)

    best_estimator = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
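    # Name of the final estimator (the regressor when a Pipeline is passed in)
    final_estimator = best_estimator.steps[-1][1] if hasattr(best_estimator, 'steps') else best_estimator
    best_estimator_name = type(final_estimator).__name__
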
    # Make predictions
    y_train_pred = best_estimator.predict(X_train)
    y_test_pred = best_estimator.predict(X_test)

    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    runtime = time.time() - start_time

    print(f">> Results from Grid Search for {best_estimator_name} " + "--" * 10)
    print(f"> Best Score      : {best_score:0.4f}")
    print(f"> Train R2        : {train_r2:0.4f}")
    print(f"> Test R2         : {test_r2:0.4f}")
    print(f"> Runtime         : {runtime:0.4f}")

    return pd.DataFrame({
        'Best_Score' : [best_score], 
        'Train_R2' : [train_r2], 
        'Test_R2' : [test_r2], 
        'Difference_R2' : [np.abs(train_r2 - test_r2)], 
        'Runtime' : [runtime], 
        'Best_Estimator' : [best_estimator]
    })
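
As a rough illustration of what `Best_Score` means in this function, the sketch below runs `GridSearchCV` on synthetic data with a `Ridge` model. Both are stand-ins chosen for speed and are assumptions, not part of the project. It shows that `best_score_` is the mean cross-validated R2 of the winning candidate, which is why it can differ from the train and test R2 computed afterwards.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data and Ridge stand in for the housing pipeline (assumptions)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    scoring='r2',
    refit=True,
    cv=KFold(n_splits=5, shuffle=True, random_state=123),
)
search.fit(X_demo, y_demo)

# best_score_ is the mean R2 over the validation folds for the best candidate,
# not a score computed on the full training set
mean_scores = search.cv_results_['mean_test_score']
matches_best_mean = bool(np.isclose(search.best_score_, mean_scores.max()))
print(search.best_params_, round(search.best_score_, 4), matches_best_mean)
```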

- Define models and their respective parameter grids

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.kernel_ridge import KernelRidge
# from catboost import CatBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define a function to create a model pipeline
def create_pipeline(model):
    return Pipeline(steps=[('preprocessor', preprocessor), ('regressor', model)])

# Define models and their respective parameter grids
models_param_grids = {
    'Kernel Ridge': (create_pipeline(KernelRidge()), {
        'regressor__alpha': [0.1, 1.0, 10.0, 100.0],
        'regressor__kernel': ['linear', 'polynomial', 'rbf'],
        'regressor__gamma': [0.1, 1.0, 10.0]
    }),

    'Gradient Boosting Regressor': (create_pipeline(GradientBoostingRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [10, 20, 30],  # -1 is LightGBM's "no limit" value and is not a valid depth for this estimator
        'regressor__subsample': [0.8, 1.0]
    }),

    'XGBoost Regressor': (create_pipeline(XGBRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [10, 20, 30],  # -1 is LightGBM's "no limit" value; XGBoost expects max_depth >= 0
        'regressor__subsample': [0.8, 1.0]
    }),

    'LGBM Regressor': (create_pipeline(LGBMRegressor()), {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__num_leaves': [31, 63, 127],
        'regressor__boosting_type': ['gbdt', 'dart']
    }),

    #'CatBoost Regressor': (create_pipeline(CatBoostRegressor(verbose=0)), {
    #    'regressor__n_estimators': [100, 200, 300],
    #    'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
    #    'regressor__depth': [6, 8, 10],
    #    'regressor__l2_leaf_reg': [3, 5, 7]
    #}),
}
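
The size of each grid largely determines the runtime, because every candidate is fitted once per fold. The sketch below counts candidates with `ParameterGrid` for the Kernel Ridge and LGBM grids defined above, assuming the 20-fold `KFold` used in `tune_model`. The names `n_splits`, `grids`, and `fit_counts` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

n_splits = 20  # same number of folds as the KFold inside tune_model

grids = {
    'Kernel Ridge': {
        'regressor__alpha': [0.1, 1.0, 10.0, 100.0],
        'regressor__kernel': ['linear', 'polynomial', 'rbf'],
        'regressor__gamma': [0.1, 1.0, 10.0],
    },
    'LGBM Regressor': {
        'regressor__n_estimators': [100, 200, 300],
        'regressor__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'regressor__max_depth': [-1, 10, 20, 30],
        'regressor__num_leaves': [31, 63, 127],
        'regressor__boosting_type': ['gbdt', 'dart'],
    },
}

fit_counts = {}
for name, grid in grids.items():
    n_candidates = len(ParameterGrid(grid))  # product of the list lengths
    fit_counts[name] = n_candidates * n_splits
    print(f'{name}: {n_candidates} candidates x {n_splits} folds = {fit_counts[name]} fits')
```

With `refit=True`, one additional fit on the full training set follows for the winning candidate.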

- Tune each model and store the results in a DataFrame for analysis

In [11]:
# Store results
results_df = pd.DataFrame()

# Tune each model
for model_name, (model_pipeline, param_grid) in models_param_grids.items():
    print(f"\n>> Tuning {model_name} ...")
    new_row = tune_model(model_pipeline, param_grid, X_train, y_train, X_test, y_test)
    new_row.insert(0, 'Name', [model_name])
    results_df = pd.concat([results_df, new_row], axis=0, ignore_index=True)

results_df


>> Tuning Kernel Ridge ...
>> Results from Grid Search for KernelRidge --------------------
> Best Score      : 0.8618
> Train R2        : 0.9175
> Test R2         : 0.8756
> Runtime         : 147.4552

>> Tuning Gradient Boosting Regressor ...
>> Results from Grid Search for GradientBoostingRegressor --------------------
> Best Score      : 0.8316
> Train R2        : 0.9659
> Test R2         : 0.8334
> Runtime         : 1613.3065

>> Tuning XGBoost Regressor ...
>> Results from Grid Search for XGBRegressor --------------------
> Best Score      : 0.8387
> Train R2        : 0.9497
> Test R2         : 0.8459
> Runtime         : 685.1159

>> Tuning LGBM Regressor ...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000468 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 202
[LightGBM] [Info] Number of data points in the train set: 2307, nu

| | Name | Best_Score | Train_R2 | Test_R2 | Difference_R2 | Runtime | Best_Estimator |
|---|---|---|---|---|---|---|---|
| 0 | Kernel Ridge | 0.861786 | 0.917461 | 0.87557 | 0.041891 | 147.455173 | (ColumnTransformer(transformers=[('num', MinMa... |
| 1 | Gradient Boosting Regressor | 0.831557 | 0.965927 | 0.83335 | 0.132577 | 1613.306533 | (ColumnTransformer(transformers=[('num', MinMa... |
| 2 | XGBoost Regressor | 0.838707 | 0.94966 | 0.845919 | 0.103741 | 685.115917 | (ColumnTransformer(transformers=[('num', MinMa... |
| 3 | LGBM Regressor | 0.737713 | 0.816738 | 0.743522 | 0.073216 | 2615.955143 | (ColumnTransformer(transformers=[('num', MinMa... |


- Sort the results by cross-validated score (`Best_Score`), best first

In [12]:
# Sort the results in a DataFrame for analysis
results_df.sort_values(by= ['Best_Score'], ascending= [False], inplace= True)
print(">> Summary of Model Tuning Results :")
results_df

>> Summary of Model Tuning Results :


| | Name | Best_Score | Train_R2 | Test_R2 | Difference_R2 | Runtime | Best_Estimator |
|---|---|---|---|---|---|---|---|
| 0 | Kernel Ridge | 0.861786 | 0.917461 | 0.87557 | 0.041891 | 147.455173 | (ColumnTransformer(transformers=[('num', MinMa... |
| 2 | XGBoost Regressor | 0.838707 | 0.94966 | 0.845919 | 0.103741 | 685.115917 | (ColumnTransformer(transformers=[('num', MinMa... |
| 1 | Gradient Boosting Regressor | 0.831557 | 0.965927 | 0.83335 | 0.132577 | 1613.306533 | (ColumnTransformer(transformers=[('num', MinMa... |
| 3 | LGBM Regressor | 0.737713 | 0.816738 | 0.743522 | 0.073216 | 2615.955143 | (ColumnTransformer(transformers=[('num', MinMa... |
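
The gap between `Train_R2` and `Test_R2` is a rough indicator of overfitting. The sketch below rebuilds the comparison from the R2 values reported in the summary table and checks which model has the smallest gap and the highest test score. The names `summary`, `smallest_gap`, and `best_test` are introduced here for illustration.

```python
import pandas as pd

# R2 values copied from the summary table in this notebook
summary = pd.DataFrame({
    'Name': ['Kernel Ridge', 'XGBoost Regressor', 'Gradient Boosting Regressor', 'LGBM Regressor'],
    'Train_R2': [0.917461, 0.949660, 0.965927, 0.816738],
    'Test_R2': [0.875570, 0.845919, 0.833350, 0.743522],
})

# Same definition as the Difference_R2 column returned by tune_model
summary['Difference_R2'] = (summary['Train_R2'] - summary['Test_R2']).abs()

smallest_gap = summary.loc[summary['Difference_R2'].idxmin(), 'Name']
best_test = summary.loc[summary['Test_R2'].idxmax(), 'Name']
print(smallest_gap, '|', best_test)
```

On these numbers Kernel Ridge leads on both criteria, which is consistent with loading its pipeline for the predictions in Step 7.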


## Step 6 : Save the tuned best model

- Save every tuned pipeline, one file per model (the best one, Kernel Ridge, is loaded in Step 7)

In [14]:
import joblib

for index, row in results_df.iterrows():
    tuned_best_pipeline_name = row.Name.replace(' ', '')
    tuned_best_pipeline = row.Best_Estimator

    pipeline_path = f"C:/Users/Afshin/Desktop/Project_1/models/{tuned_best_pipeline_name}_pipeline.joblib"
    # pipeline_path = f'/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/models/{tuned_best_pipeline_name}_pipeline.joblib'
    
    joblib.dump(tuned_best_pipeline, pipeline_path)
    print(f"> Tuned {row.Name} pipeline Saved to {pipeline_path}")

> Tuned Kernel Ridge pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/KernelRidge_pipeline.joblib
> Tuned XGBoost Regressor pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/XGBoostRegressor_pipeline.joblib
> Tuned Gradient Boosting Regressor pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/GradientBoostingRegressor_pipeline.joblib
> Tuned LGBM Regressor pipeline Saved to C:/Users/Afshin/Desktop/Project_1/models/LGBMRegressor_pipeline.joblib
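
A saved pipeline can be checked with a quick round trip: dump it, load it back, and compare predictions. The sketch below does this with a small stand-in pipeline on random data in a temporary directory. The data, the `Ridge` model, and the file name `demo_pipeline.joblib` are assumptions for illustration, not the project's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data (not the housing dataset)
rng = np.random.default_rng(123)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

demo_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('regressor', Ridge())])
demo_pipeline.fit(X_demo, y_demo)

# Write to a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_pipeline.joblib')
    joblib.dump(demo_pipeline, demo_path)
    restored = joblib.load(demo_path)

# The restored pipeline should reproduce the original predictions
same_predictions = bool(np.allclose(demo_pipeline.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Files written by `joblib.dump` are pickles, so they should be loaded with library versions compatible with those used to save them.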


## Step 7 : Make Predictions

- Set display option for floating-point numbers

In [None]:
# Set display option for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.f' % x)
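
The `'%.f'` format prints floats with zero decimal places, which keeps large prices readable. A minimal sketch of the effect, using `pd.option_context` so the setting applies only inside the `with` block (the frame `prices` is a one-row example introduced here):

```python
import pandas as pd

prices = pd.DataFrame({'Price': [42950250000.0]})  # one price-sized example value

# option_context restores the previous display setting when the block exits
with pd.option_context('display.float_format', lambda x: '%.f' % x):
    rendered = prices.to_string()

print(rendered)
```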

- Load the tuned best pipeline

In [None]:
pipeline_path = 'C:/Users/Afshin/Desktop/Project_1/models/KernelRidge_pipeline.joblib'
# pipeline_path = '/content/drive/My Drive/Applied Machine Learning/Project 1 : Tehran House Price Prediction/models/KernelRidge_pipeline.joblib'

loaded_tuned_best_pipeline = joblib.load(pipeline_path)
loaded_tuned_best_pipeline

- Predict a random sample from the dataset

In [18]:
# Predict one random row from the full dataset (it may be a row the model was trained on)
df_sample = df.sample(1)
df_sample['Predicted Price'] = loaded_tuned_best_pipeline.predict(df_sample.drop('Price', axis= 1))
df_sample[['Price','Predicted Price']]

| | Price | Predicted Price |
|---|---|---|
| 1349 | 42950250000 | 49416240041 |


- Predict a new, unseen sample

In [20]:
# Make predictions on new data
new_data = pd.DataFrame({
    'Area': [100],  # Example data
    'Room': [2],
    'Parking': [True],
    'Warehouse': [True],
    'Elevator': [True],
    'Address': ['Punak']
})
new_data['Predicted Price'] = loaded_tuned_best_pipeline.predict(new_data)
new_data

| | Area | Room | Parking | Warehouse | Elevator | Address | Predicted Price |
|---|---|---|---|---|---|---|---|
| 0 | 100 | 2 | True | True | True | Punak | 105940873714 |
