**<h1><center>Model Selection and Tuning</center></h1>**

In this notebook we compare several regression models, select the one with the lowest RMSE, and then fine-tune it.

**<h2>Contents</h2>**

1. Importing Libraries

2. Loading Data

3. Splitting train and test set

4. Initializing Models

5. Training Models

6. Selecting and fine-tuning the model

7. Saving the Model



**1. Importing Libraries**

In [1]:
# Accessing drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp37-none-manylinux1_x86_64.whl (76.6 MB)
     |████████████████████████████████| 76.6 MB 1.2 MB/s
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [3]:
# Importing the libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score

from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.neighbors import KNeighborsRegressor

from xgboost import XGBRegressor

from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings('ignore')


**2. Loading Data**

Here we load the preprocessed dataset on which the models will be fitted.

In [4]:
# here you can change the path
total_data_one_hot_encoder = pd.read_csv('/content/drive/My Drive/\
Real Time Projects/House Price Prediction/Data Files/processed_one_\
hot_encoder_encoder.csv')

total_data_one_hot_encoder.shape

(2919, 121)

In [5]:
# Split the combined data back into the original train and test sets
# (the last 1459 rows are the test set, so SalePrice is dropped from them)
train_one_hot_encoder = total_data_one_hot_encoder.iloc[:-1459].copy()
test_one_hot_encoder = total_data_one_hot_encoder.iloc[-1459:].drop("SalePrice",
                                                                    axis=1)

In [6]:
test_one_hot_encoder.head()

Unnamed: 0,Id,LotFrontage,Street,LotShape,OverallQual,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,...,2Types,NA.1,CarPort,Attchd,Partial,Abnorml,Family,Alloca,AdjLand,Normal
1460,1461,80.0,2,4,5,1961,1961,0.0,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1461,1462,81.0,2,2,6,1958,1958,108.0,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1462,1463,74.0,2,2,5,1997,1998,0.0,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1463,1464,78.0,2,2,6,1998,1998,20.0,3,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1464,1465,43.0,2,2,8,1992,1992,0.0,4,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [7]:
train_one_hot_encoder.drop(['Id'],
                        axis = 1, inplace = True)

In [8]:
train_one_hot_encoder.head()

Unnamed: 0,LotFrontage,Street,LotShape,OverallQual,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,BsmtQual,...,2Types,NA.1,CarPort,Attchd,Partial,Abnorml,Family,Alloca,AdjLand,Normal
0,65.0,2,4,7,2003,2003,196.0,4,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,80.0,2,4,6,1976,1976,0.0,3,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,68.0,2,2,7,2001,2002,162.0,4,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,60.0,2,2,7,1915,1970,0.0,3,3,3,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,84.0,2,2,8,2000,2000,350.0,4,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


**3. Splitting train and test**

Here we split the dataset into a train set and a test set in the ratio 0.8 : 0.2, so the train set holds 80% of the data and the test set holds 20%.

We use the train_test_split function for this.
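As a minimal sketch of the 80/20 split described above, the snippet below applies train_test_split to a small synthetic array rather than the house-price data. The array shapes and the names `X_demo` and `y_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the real dataset)
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# test_size=0.20 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```

Fixing random_state keeps the split reproducible between runs, which matters when several models are compared on the same held-out rows.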

In [9]:
# Features (X) and target (y) from the train.csv portion of the data
X = train_one_hot_encoder.drop(columns=['SalePrice'])
y = train_one_hot_encoder['SalePrice']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, 
                                                    random_state=1)

In [11]:
X_test

Unnamed: 0,LotFrontage,Street,LotShape,OverallQual,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,BsmtQual,...,2Types,NA.1,CarPort,Attchd,Partial,Abnorml,Family,Alloca,AdjLand,Normal
258,80.0,2,4,7,2001,2001,172.0,4,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
267,60.0,2,4,5,1939,1997,0.0,3,3,3,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
288,68.0,2,2,5,1967,1967,31.0,3,4,3,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
649,21.0,2,4,4,1970,1970,0.0,3,3,4,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1233,68.0,2,2,5,1959,1959,180.0,3,3,3,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,55.0,2,4,4,1956,1956,0.0,3,3,3,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
47,84.0,2,4,8,2006,2006,0.0,4,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1432,60.0,2,4,4,1927,2007,0.0,3,3,3,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
98,85.0,2,4,5,1920,1950,0.0,3,3,3,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


**4. Initializing Models**

In [12]:
models = {'Linear_Regression' : LinearRegression(),
          "catboost": CatBoostRegressor(verbose=0),
          "ridge": Ridge(),
          "xg": XGBRegressor(),
          "grb": GradientBoostingRegressor(),
          "rfr": RandomForestRegressor()
}

**5. Fitting Models** 

In [13]:
# Define a function to calculate RMSE
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true-y_pred)**2))

In [14]:
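for name, model in models.items():
    # 10-fold cross-validated R2 on the training split
    cross_val_log = cross_val_score(model, X_train, y_train,
                                    cv=10, scoring='r2')

    # Fit on the training split and evaluate on the held-out split
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    result = rmse(y_test, pred)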
    
    print("\n\nCross validation of {} is {}".format(name, cross_val_log))
    print("Cross validation of {} mean is {}".format(name, 
                                                     np.mean(cross_val_log)))

    print("\n{} has {}R2 score".format(name,r2_score(y_test,pred)*100))
    print(f'{name}: {result}')



Cross validation of Linear_Regression is [0.87030545 0.58455493 0.86946671 0.88159161 0.88051644 0.79674744
 0.8840334  0.84203076 0.90595545 0.90430745]
Cross validation of Linear_Regression mean is 0.8419509652584475

Linear_Regression has 87.589056794541R2 score
Linear_Regression: 0.1467637867634783


Cross validation of catboost is [0.90733103 0.8235024  0.88243077 0.89392191 0.91048213 0.84841927
 0.90617668 0.87883191 0.89837878 0.92183624]
Cross validation of catboost mean is 0.8871311126341219

catboost has 90.06379618370764R2 score
catboost: 0.13131875612887223


Cross validation of ridge is [0.86906194 0.57403913 0.87321077 0.88690587 0.8844266  0.79793067
 0.88400157 0.8521006  0.90312598 0.9121591 ]
Cross validation of ridge mean is 0.8436962228486597

ridge has 88.34704041620395R2 score
ridge: 0.14221147299521947


Cross validation of xg is [0.90665597 0.81894239 0.89127813 0.88622152 0.89554915 0.83063392
 0.89960024 0.87464442 0.89074533 0.91056282]
Cross validation of
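The loop above scores each fold with R2, while the final comparison is made on RMSE. One possible way to get a cross-validated RMSE directly is scikit-learn's `neg_root_mean_squared_error` scorer, sketched here on synthetic data with `Ridge` as a stand-in model. The data, the `cv=5` setting, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption: stands in for X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# scikit-learn maximizes scores, so RMSE is reported as a negative number
neg_rmse = cross_val_score(
    Ridge(), X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)

# Flip the sign to get the usual (positive) RMSE per fold
fold_rmse = -neg_rmse
print("RMSE per fold:", fold_rmse)
print("Mean RMSE:", fold_rmse.mean())
```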

**6. Selecting and Fine-Tuning the Model**

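- CatBoost is selected because it gives the lowest RMSE on the held-out split (about 0.131, compared with about 0.147 for linear regression and 0.142 for ridge).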
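The selection rule can be written down explicitly. The sketch below uses only the RMSE values visible in the output above (rounded); the dictionary name `rmse_by_model` is hypothetical, and the remaining models would need to be added from the full run.

```python
# RMSE values copied (rounded) from the output shown above;
# results for the other models are not included here
rmse_by_model = {
    "Linear_Regression": 0.1468,
    "catboost": 0.1313,
    "ridge": 0.1422,
}

# Lower RMSE is better, so pick the key with the smallest value
best_name = min(rmse_by_model, key=rmse_by_model.get)
print(best_name)  # catboost
```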

In [15]:
catboost = CatBoostRegressor(verbose=0, loss_function='RMSE')

catboost.fit(X_train, y_train)

# Prediction (on the log scale)
y_predict = catboost.predict(X_test)

# Antilog to retrieve the actual values
y_predictions = np.exp(y_predict)

# Evaluating the model's accuracy (r2_score was imported above)
print("R2 score",r2_score(y_test,y_predict)*100)
print(rmse(y_test,y_predict))

R2 score 90.06379618370764
0.13131875612887223
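Because the predictions are passed through np.exp, the target appears to be the natural log of the sale price, so the RMSE above is measured on the log scale. The sketch below, on made-up prices, shows how such an error might be read on the original scale. The price values and variable names are assumptions, and if the preprocessing used `np.log1p` instead, `np.expm1` would be the matching inverse.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, same definition as in the notebook."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up prices for illustration only (not from the dataset)
true_price = np.array([120000.0, 185000.0, 250000.0, 310000.0])
pred_price = np.array([130000.0, 170000.0, 260000.0, 300000.0])

# What a model trained on log prices would output
pred_log = np.log(pred_price)

# Error on the log scale, which is what the notebook reports
log_rmse = rmse(np.log(true_price), pred_log)

# Error in price units, after undoing the log with np.exp
price_rmse = rmse(true_price, np.exp(pred_log))

print(f"RMSE on log scale: {log_rmse:.4f}")
print(f"RMSE in price units: {price_rmse:.0f}")
```

A log-scale RMSE is roughly a relative error: a value near 0.13 corresponds to predictions that are typically off by a factor of about exp(0.13) ≈ 1.14.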


In [16]:
import pickle

# here you can change the path
filename = '/content/drive/My Drive/Real Time Projects/House Price \
Prediction/Data Files/catboost_model.csv'

# The file holds a pickled model, despite the .csv extension
with open(filename, 'wb') as f:
    pickle.dump(catboost, f)

**Parameter Tuning**

In [17]:
from sklearn.model_selection import GridSearchCV

parameters = {'depth'         : [6,8,10,12,14],
              'learning_rate' : [0.01, 0.05, 0.1,0.06],
              'iterations'    : [20, 30, 50,75,100]
              }
catboost = CatBoostRegressor(verbose=1, loss_function='RMSE')

grid = GridSearchCV(estimator=catboost, 
                    param_grid = parameters, 
                    cv = 2, 
                    n_jobs=-1,
                    verbose = True)

grid.fit(X_train, y_train)

Fitting 2 folds for each of 100 candidates, totalling 200 fits
0:	learn: 0.3708344	total: 4.15ms	remaining: 411ms
1:	learn: 0.3494370	total: 8.31ms	remaining: 407ms
2:	learn: 0.3270400	total: 11.7ms	remaining: 379ms
3:	learn: 0.3078793	total: 14.7ms	remaining: 352ms
4:	learn: 0.2898411	total: 18ms	remaining: 341ms
5:	learn: 0.2739436	total: 21.9ms	remaining: 343ms
6:	learn: 0.2600715	total: 25.2ms	remaining: 335ms
7:	learn: 0.2475390	total: 28.6ms	remaining: 329ms
8:	learn: 0.2362440	total: 31.9ms	remaining: 323ms
9:	learn: 0.2262360	total: 35.3ms	remaining: 318ms
10:	learn: 0.2171882	total: 38.5ms	remaining: 312ms
11:	learn: 0.2094750	total: 41.9ms	remaining: 307ms
12:	learn: 0.2017088	total: 45ms	remaining: 301ms
13:	learn: 0.1951364	total: 48.2ms	remaining: 296ms
14:	learn: 0.1889614	total: 51.3ms	remaining: 291ms
15:	learn: 0.1833183	total: 54.5ms	remaining: 286ms
16:	learn: 0.1785399	total: 57.8ms	remaining: 282ms
17:	learn: 0.1735753	total: 60.9ms	remaining: 277ms
18:	learn: 0.16

GridSearchCV(cv=2,
             estimator=<catboost.core.CatBoostRegressor object at 0x7f558d89ed50>,
             n_jobs=-1,
             param_grid={'depth': [6, 8, 10, 12, 14],
                         'iterations': [20, 30, 50, 75, 100],
                         'learning_rate': [0.01, 0.05, 0.1, 0.06]},
             verbose=True)

In [18]:
grid.best_params_

{'depth': 6, 'iterations': 100, 'learning_rate': 0.1}
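The grid search above relies on the estimator's default score method. A variant that optimizes RMSE explicitly could look like the sketch below. It uses scikit-learn's `GradientBoostingRegressor` on synthetic data as a stand-in for CatBoost, and the small grid, `cv=3`, and all variable names are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 2.0 - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# A deliberately small grid: 2 x 2 x 2 = 8 candidates
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [20, 50],
}

grid_demo = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print("Best CV RMSE:", -grid_demo.best_score_)
```

With only two folds, as in the notebook, each candidate is judged on a single pair of fits, so the ranking between close candidates can be noisy; more folds trade run time for a steadier estimate.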

In [19]:
catboost = CatBoostRegressor(depth = 6, 
                             iterations = 100, 
                             learning_rate = 0.1
                             )

catboost.fit(X, y)

test = test_one_hot_encoder.drop(['Id'], axis = 1)

# Make predictions on the test set
y_pred = np.exp(catboost.predict(test))

output = pd.DataFrame({'Id': test_one_hot_encoder['Id'], 'SalePrice': y_pred})

# here you can change the path to store
output.to_csv('/content/drive/My Drive/Real Time Projects/House Price \
Prediction/Data Files/prediction_catboost_one_hot_encoding_full.csv', 
index=False)


0:	learn: 0.3743686	total: 3.78ms	remaining: 374ms
1:	learn: 0.3510708	total: 7.52ms	remaining: 369ms
2:	learn: 0.3304346	total: 10.8ms	remaining: 351ms
3:	learn: 0.3107235	total: 14.2ms	remaining: 340ms
4:	learn: 0.2946938	total: 17.6ms	remaining: 334ms
5:	learn: 0.2793593	total: 21.9ms	remaining: 343ms
6:	learn: 0.2652312	total: 25.4ms	remaining: 338ms
7:	learn: 0.2527222	total: 28.7ms	remaining: 330ms
8:	learn: 0.2421597	total: 32.1ms	remaining: 324ms
9:	learn: 0.2323837	total: 35.6ms	remaining: 320ms
10:	learn: 0.2224395	total: 39.1ms	remaining: 316ms
11:	learn: 0.2137261	total: 42.3ms	remaining: 311ms
12:	learn: 0.2063710	total: 45.9ms	remaining: 307ms
13:	learn: 0.1994169	total: 49.5ms	remaining: 304ms
14:	learn: 0.1928475	total: 53ms	remaining: 300ms
15:	learn: 0.1866887	total: 56.3ms	remaining: 295ms
16:	learn: 0.1815646	total: 60ms	remaining: 293ms
17:	learn: 0.1769987	total: 63.5ms	remaining: 290ms
18:	learn: 0.1726418	total: 67ms	remaining: 286ms
19:	learn: 0.1684561	total: 

**7. Saving the Model**

We save the best model for later predictions, using the pickle package.

In [20]:
import pickle

# here you can change the path
filename = '/content/drive/My Drive/Real Time Projects/House Price \
Prediction/Data Files/after_tuning_catboost_model.csv'

# The file holds a pickled model, despite the .csv extension
with open(filename, 'wb') as f:
    pickle.dump(catboost, f)
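To use the saved model later, it has to be loaded back with `pickle.load`. The round trip is sketched below with a plain dictionary standing in for the fitted model and a temporary file standing in for the Drive path; both are assumptions made so the snippet runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model: any picklable Python object works the same way
model_stand_in = {"depth": 6, "iterations": 100, "learning_rate": 0.1}

# Temporary location instead of the Google Drive path used in the notebook
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write in binary mode; the with-block closes the file even if dump fails
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Read it back, also in binary mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

Only unpickle files from trusted sources, since loading a pickle can execute arbitrary code.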