**REGRESSION TASK WITH MACHINE LEARNING**

House Prices - Advanced Regression Techniques
Dataset available on Kaggle

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or
the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences
price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition
challenges you to predict the final price of each home.

Goal:
To predict the sales price for each house.
For each Id in the test set, you must predict the value of the SalePrice variable.

In [1]:
##Import Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [2]:
##Load dataset

train_df = pd.read_csv(r"/content/train.csv")
test_df = pd.read_csv(r"/content/test.csv")

In [3]:
##Visualise dataset

train_df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
test_df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [5]:
# split the Target column
target=train_df.SalePrice

#combine train and test sets for preprocessing
HousePrice_df=pd.concat([train_df.drop('SalePrice',axis=1),test_df])
# preview the combined frame
HousePrice_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal


In [6]:
HousePrice_df.shape

(2919, 80)

In [7]:
HousePrice_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallC

In [8]:
print(HousePrice_df.columns.values)

['Id' 'MSSubClass' 'MSZoning' 'LotFrontage' 'LotArea' 'Street' 'Alley'
 'LotShape' 'LandContour' 'Utilities' 'LotConfig' 'LandSlope'
 'Neighborhood' 'Condition1' 'Condition2' 'BldgType' 'HouseStyle'
 'OverallQual' 'OverallCond' 'YearBuilt' 'YearRemodAdd' 'RoofStyle'
 'RoofMatl' 'Exterior1st' 'Exterior2nd' 'MasVnrType' 'MasVnrArea'
 'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual' 'BsmtCond' 'BsmtExposure'
 'BsmtFinType1' 'BsmtFinSF1' 'BsmtFinType2' 'BsmtFinSF2' 'BsmtUnfSF'
 'TotalBsmtSF' 'Heating' 'HeatingQC' 'CentralAir' 'Electrical' '1stFlrSF'
 '2ndFlrSF' 'LowQualFinSF' 'GrLivArea' 'BsmtFullBath' 'BsmtHalfBath'
 'FullBath' 'HalfBath' 'BedroomAbvGr' 'KitchenAbvGr' 'KitchenQual'
 'TotRmsAbvGrd' 'Functional' 'Fireplaces' 'FireplaceQu' 'GarageType'
 'GarageYrBlt' 'GarageFinish' 'GarageCars' 'GarageArea' 'GarageQual'
 'GarageCond' 'PavedDrive' 'WoodDeckSF' 'OpenPorchSF' 'EnclosedPorch'
 '3SsnPorch' 'ScreenPorch' 'PoolArea' 'PoolQC' 'Fence' 'MiscFeature'
 'MiscVal' 'MoSold' 'YrSold' 'SaleTy

In [9]:
description = HousePrice_df.describe().T
description['num_of_unique']=HousePrice_df.nunique()
description['NULLS']=HousePrice_df.isna().sum()
description

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,num_of_unique,NULLS
Id,2919.0,1460.0,842.787043,1.0,730.5,1460.0,2189.5,2919.0,2919,0
MSSubClass,2919.0,57.137718,42.517628,20.0,20.0,50.0,70.0,190.0,16,0
LotFrontage,2433.0,69.305795,23.344905,21.0,59.0,68.0,80.0,313.0,128,486
LotArea,2919.0,10168.11408,7886.996359,1300.0,7478.0,9453.0,11570.0,215245.0,1951,0
OverallQual,2919.0,6.089072,1.409947,1.0,5.0,6.0,7.0,10.0,10,0
OverallCond,2919.0,5.564577,1.113131,1.0,5.0,5.0,6.0,9.0,9,0
YearBuilt,2919.0,1971.312778,30.291442,1872.0,1953.5,1973.0,2001.0,2010.0,118,0
YearRemodAdd,2919.0,1984.264474,20.894344,1950.0,1965.0,1993.0,2004.0,2010.0,61,0
MasVnrArea,2896.0,102.201312,179.334253,0.0,0.0,0.0,164.0,1600.0,444,23
BsmtFinSF1,2918.0,441.423235,455.610826,0.0,0.0,368.5,733.0,5644.0,991,1


In [10]:
HousePrice_df.describe(include=['O']).T ##Describe the categorical variable

Unnamed: 0,count,unique,top,freq
MSZoning,2915,5,RL,2265
Street,2919,2,Pave,2907
Alley,198,2,Grvl,120
LotShape,2919,4,Reg,1859
LandContour,2919,4,Lvl,2622
Utilities,2917,2,AllPub,2916
LotConfig,2919,5,Inside,2133
LandSlope,2919,3,Gtl,2778
Neighborhood,2919,25,NAmes,443
Condition1,2919,9,Norm,2511


**DATA PREPROCESSING**



Fill in missing values

cat_1 ==> 'NAN' carries meaning: the house does not have the feature

e.g.

Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access
BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement
       
cat_2 ==> 'NAN' is a genuinely missing value and is filled with the most frequently occurring value (the mode)
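The two strategies can be sketched on a toy frame. The values below are made up for illustration; only the column names `Alley` and `MSZoning` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one column of each kind
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],   # NaN means "no alley access"
    "MSZoning": ["RL", "RL", np.nan, "RM"],      # NaN is a genuinely missing entry
})

# cat_1 style: NaN becomes its own category, the string "None"
toy["Alley"] = toy["Alley"].fillna("None")

# cat_2 style: NaN is replaced by the most frequent value (the mode)
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])

print(toy)
```

Keeping "None" as a separate category lets the later one-hot encoding give "no alley" its own column instead of hiding it behind the most common alley type.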




In [11]:
## Fill 'NAN' with the string "None" where NaN means the feature is absent

cat_1=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu','GarageType'
   ,'GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']
for column in cat_1:
    HousePrice_df[column] = HousePrice_df[column].fillna("None")

In [12]:
## Fill missing values with the most prevalent value (the mode)

cat_2=['MasVnrType','MSZoning','Functional','Utilities','SaleType','Exterior2nd','Exterior1st',
         'Electrical' ,'KitchenQual']
for column in cat_2:
    HousePrice_df[column] = HousePrice_df[column].fillna(HousePrice_df[column].mode()[0])

In [13]:
### Check which columns still have missing values
num_1 = []
for column, count in HousePrice_df.isnull().sum().items():
    if count > 0:
        print(f"Column {column} has {count} missing values.")
        num_1.append(column)

Column LotFrontage has 486 missing values.
Column MasVnrArea has 23 missing values.
Column BsmtFinSF1 has 1 missing values.
Column BsmtFinSF2 has 1 missing values.
Column BsmtUnfSF has 1 missing values.
Column TotalBsmtSF has 1 missing values.
Column BsmtFullBath has 2 missing values.
Column BsmtHalfBath has 2 missing values.
Column GarageYrBlt has 159 missing values.
Column GarageCars has 1 missing values.
Column GarageArea has 1 missing values.


In [14]:
##Check the dtype of columns with missing values

missing_df = pd.DataFrame(range(1, len(num_1)+1))
missing_df['column'] = num_1
Cdtype = []
for column in num_1:
    Cdtype.append(HousePrice_df[column].dtypes)

missing_df['dtype'] = Cdtype
missing_df

Unnamed: 0,0,column,dtype
0,1,LotFrontage,float64
1,2,MasVnrArea,float64
2,3,BsmtFinSF1,float64
3,4,BsmtFinSF2,float64
4,5,BsmtUnfSF,float64
5,6,TotalBsmtSF,float64
6,7,BsmtFullBath,float64
7,8,BsmtHalfBath,float64
8,9,GarageYrBlt,float64
9,10,GarageCars,float64


In [15]:
# Replace missing numerical values with the mean of each feature
for i in num_1:
    Replace_Value = HousePrice_df[i].mean()
    HousePrice_df[i] = HousePrice_df[i].fillna(Replace_Value)


# Confirm that no missing values remain (this loop should print nothing)
for column, count in HousePrice_df.isnull().sum().items():
    if count > 0:
        print(f"Column {column} has {count} missing values.")

**Feature Engineering**

In [16]:
HousePrice_df.shape

(2919, 80)

In [17]:
##Create new columns

NewHousePrice_df = HousePrice_df.copy()

NewHousePrice_df['TotalArea']=NewHousePrice_df['LotFrontage']+NewHousePrice_df['LotArea']

NewHousePrice_df['Total_Home_Quality'] = NewHousePrice_df['OverallQual'] + NewHousePrice_df['OverallCond']

NewHousePrice_df['Total_Bathrooms'] = (NewHousePrice_df['FullBath'] + (0.5 * NewHousePrice_df['HalfBath']) +
                               NewHousePrice_df['BsmtFullBath'] + (0.5 * NewHousePrice_df['BsmtHalfBath']))
NewHousePrice_df["AllSF"] = NewHousePrice_df["GrLivArea"] + NewHousePrice_df["TotalBsmtSF"]

NewHousePrice_df["AvgSqFtPerRoom"] = NewHousePrice_df["GrLivArea"] / (NewHousePrice_df["TotRmsAbvGrd"] +
                                                       NewHousePrice_df["FullBath"] +
                                                       NewHousePrice_df["HalfBath"] +
                                                       NewHousePrice_df["KitchenAbvGr"])

NewHousePrice_df["totalFlrSF"] = NewHousePrice_df["1stFlrSF"] + NewHousePrice_df["2ndFlrSF"]

In [18]:
NewHousePrice_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,MoSold,YrSold,SaleType,SaleCondition,TotalArea,Total_Home_Quality,Total_Bathrooms,AllSF,AvgSqFtPerRoom,totalFlrSF
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,2,2008,WD,Normal,8515.0,12,3.5,2566.0,142.5,1710
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,5,2007,WD,Normal,9680.0,14,2.5,2524.0,140.222222,1262
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,9,2008,WD,Normal,11318.0,12,3.5,2706.0,178.6,1786
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,2,2006,WD,Abnorml,9610.0,12,2.0,2473.0,190.777778,1717
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,12,2008,WD,Normal,14344.0,13,3.5,3343.0,169.076923,2198


In [19]:
NewHousePrice_df.shape

(2919, 86)

**Feature Encoding**

In [20]:
NewEHousePrice_df=pd.get_dummies(NewHousePrice_df)
NewEHousePrice_df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,0,0,0,1,0,0,0,0,1,0
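As a small illustration of what `pd.get_dummies` does, the sketch below encodes a toy frame with made-up rows. Numeric columns pass through unchanged, and each category of a text column becomes its own indicator column. Depending on the pandas version, the indicators are 0/1 integers (as in the output above) or booleans.

```python
import pandas as pd

# Toy frame (hypothetical rows): one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# One indicator column is created per category of "Street"
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```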


**Feature Scaling**

In [21]:
## Scale every column (all numeric after encoding) with StandardScaler
sc = StandardScaler()

NewESHousePrice_df=pd.DataFrame(sc.fit_transform(NewEHousePrice_df), index=NewEHousePrice_df.index, columns=NewEHousePrice_df.columns)
NewESHousePrice_df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,-1.731458,0.067331,-0.202068,-0.217879,0.646183,-0.507284,1.046258,0.896833,0.525202,0.580907,...,-0.052423,-0.298629,-0.049029,0.394439,-0.263861,-0.064249,-0.09105,-0.126535,0.463937,-0.302693
1,-1.730271,-0.873616,0.50187,-0.072044,-0.063185,2.188279,0.154764,-0.395604,-0.57225,1.178112,...,-0.052423,-0.298629,-0.049029,0.394439,-0.263861,-0.064249,-0.09105,-0.126535,0.463937,-0.302693
2,-1.729084,0.067331,-0.06128,0.137197,0.646183,-0.507284,0.980221,0.848965,0.334828,0.097873,...,-0.052423,-0.298629,-0.049029,0.394439,-0.263861,-0.064249,-0.09105,-0.126535,0.463937,-0.302693
3,-1.727897,0.302568,-0.436714,-0.078385,0.646183,-0.507284,-1.859351,-0.682812,-0.57225,-0.494941,...,-0.052423,-0.298629,-0.049029,0.394439,3.789876,-0.064249,-0.09105,-0.126535,-2.155466,-0.302693
4,-1.726711,0.067331,0.689587,0.518903,1.355551,-0.507284,0.947203,0.753229,1.387486,0.468931,...,-0.052423,-0.298629,-0.049029,0.394439,-0.263861,-0.064249,-0.09105,-0.126535,0.463937,-0.302693
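The cell above fits the scaler on the train and test rows together. A common alternative is to learn the mean and standard deviation from the training rows only and reuse them for the test rows. The sketch below shows that pattern on synthetic arrays; `train_part` and `test_part` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(0)
train_part = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
test_part = rng.normal(loc=100.0, scale=20.0, size=(10, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # statistics learned from train only
test_scaled = scaler.transform(test_part)        # same statistics reused for test

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```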


In [22]:
len(target)

1460

In [23]:
###Split dataset to train and test again

X_train=NewESHousePrice_df.iloc[:1460]
y_train=target
X_test= NewESHousePrice_df.iloc[1460:]

In [24]:
##Split train dataset into train and validation

XTrain, XVal, yTrain, yVal = train_test_split(X_train, y_train, test_size = 0.2)
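Without a fixed seed, `train_test_split` draws a different validation set on every run, so the scores reported below change from run to run. A minimal sketch of a reproducible split, using toy arrays and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X_train and y_train
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same partition every time
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```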

**Model Training**

Models:
1. Decision Tree Regressor
2. Random Forest Regressor
3. XGB Regressor

In [25]:
##Decision Tree Regressor
decision_model = DecisionTreeRegressor()
decision_model.fit(XTrain, yTrain)
predicted_decision_trees = decision_model.predict(XVal)
print("Mean Absolute Error using Decision Tree:", mean_absolute_error(yVal, predicted_decision_trees))
print("RMSE using Decision Tree:", mean_squared_error(yVal, predicted_decision_trees) ** 0.5)  # RMSE = sqrt(MSE)
print("R2 Score using Decision Tree:", r2_score(yVal, predicted_decision_trees))

Mean Absolute Error using Decision Tree: 25100.02397260274
RMSE using Decision Tree: 37566.32503798567
R2 Score using Decision Tree: 0.802569782741895


In [26]:
##Random Forest Regressor

forest_model = RandomForestRegressor(n_estimators=100, max_depth=10)
forest_model.fit(XTrain, yTrain)
predicted_random_forest = forest_model.predict(XVal)
print("Mean Absolute Error using Random Forest:", mean_absolute_error(yVal, predicted_random_forest))
print("RMSE using Random Forest:", mean_squared_error(yVal, predicted_random_forest) ** 0.5)  # RMSE = sqrt(MSE)
print("R2 Score using Random Forest:", r2_score(yVal, predicted_random_forest))

Mean Absolute Error using Random Forest: 17973.227576562942
RMSE using Random Forest: 31955.29126208004
R2 Score using Random Forest: 0.8571429227905216


In [27]:
## XGB Regressor

xg_model = XGBRegressor(n_estimators=100)
xg_model.fit(XTrain, yTrain)
predicted_XGBoost = xg_model.predict(XVal)
print("Mean Absolute Error using XGBoost:", mean_absolute_error(yVal, predicted_XGBoost))
# RMSE and R2 must be computed from the XGBoost predictions, not from another model's
print("RMSE using XGBoost:", mean_squared_error(yVal, predicted_XGBoost) ** 0.5)
print("R2 Score using XGBoost:", r2_score(yVal, predicted_XGBoost))

Mean Absolute Error using XGBoost: 18127.81664704623
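The three evaluation cells above repeat the same metric code, which makes it easy to pass the wrong predictions or print the wrong label. One way to avoid that is a small helper, sketched here with a hypothetical name `report_metrics` and toy numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_metrics(name, y_true, y_pred):
    """Print and return MAE, RMSE and R2 for one model's predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE = sqrt(MSE)
    r2 = r2_score(y_true, y_pred)
    print(f"{name}: MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Toy values for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])
scores = report_metrics("toy model", y_true, y_pred)
```

In the notebook this would be called once per model, for example with `yVal` and `predicted_XGBoost`, so the label and the predictions always travel together.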


**Hyperparameter Tuning**

1. Manual Search
2. Grid Search
3. Random Search

_Hyperparameter Tuning with Decision Tree Regressor._

In [57]:
##Listing possible Parameters below.
#See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

parameters={"criterion": ["squared_error", "friedman_mse", "absolute_error", "poisson"],
            "splitter":["best","random"],
            "max_depth" : [1,3,5,7,9,11,12],
            "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
            # min_weight_fraction_leaf must lie in [0.0, 0.5]; larger values are rejected
            "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5],
            # "auto" is not accepted by recent scikit-learn releases; None uses all features
            "max_features":["log2","sqrt",None],
            "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }

1. Manual Search

In [58]:
#sets of hyperparameters

params_1 = {'criterion': 'squared_error', 'splitter': 'best', 'max_depth': None, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0,
            'max_features': None, 'max_leaf_nodes' : None}
params_2 = {'criterion': 'absolute_error', 'splitter': 'best', 'max_depth': 5, 'min_samples_leaf': 4, 'min_weight_fraction_leaf': 0.2,
            'max_features': 'log2', 'max_leaf_nodes' : 40}
params_3 = {'criterion': 'poisson', 'splitter': 'best', 'max_depth': 7, 'min_samples_leaf': 6, 'min_weight_fraction_leaf': 0.3,
            'max_features': 'sqrt', 'max_leaf_nodes' : 60}
params_4 = {'criterion': 'friedman_mse', 'splitter': 'random', 'max_depth':11, 'min_samples_leaf': 7, 'min_weight_fraction_leaf': 0.4,
            'max_features': None, 'max_leaf_nodes' : 70}
params_5 = {'criterion': 'squared_error', 'splitter': 'random', 'max_depth':12, 'min_samples_leaf': 9, 'min_weight_fraction_leaf': 0.5,
            'max_features': None, 'max_leaf_nodes' : 80}
# Separate models

model_1 = DecisionTreeRegressor(**params_1)
model_2 = DecisionTreeRegressor(**params_2)
model_3 = DecisionTreeRegressor(**params_3)
model_4 = DecisionTreeRegressor(**params_4)
model_5 = DecisionTreeRegressor(**params_5)


model_1.fit(XTrain, yTrain)
model_2.fit(XTrain, yTrain)
model_3.fit(XTrain, yTrain)
model_4.fit(XTrain, yTrain)
model_5.fit(XTrain, yTrain)

# Prediction sets

preds_1 = model_1.predict(XVal)
preds_2 = model_2.predict(XVal)
preds_3 = model_3.predict(XVal)
preds_4 = model_4.predict(XVal)
preds_5 = model_5.predict(XVal)


print('R2 Score on Model 1: ', round(r2_score(yVal, preds_1), 3))
print('R2 Score on Model 2: ', round(r2_score(yVal, preds_2), 3))
print('R2 Score on Model 3: ', round(r2_score(yVal, preds_3), 3))
print('R2 Score on Model 4: ', round(r2_score(yVal, preds_4), 3))
print('R2 Score on Model 5: ', round(r2_score(yVal, preds_5), 3))

R2 Score on Model 1:  0.81
R2 Score on Model 3:  0.277
R2 Score on Model 4:  0.257
R2 Score on Model 5:  -0.001


A higher R-squared (closer to 1) indicates a better fit. A value near 0, or a negative value as for Model 5, means the model predicts no better than always guessing the mean sale price.
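To make the interpretation concrete, R-squared is one minus the ratio of the model's squared error to the squared error of a constant mean prediction. The sketch below computes it by hand on toy numbers (the helper name `r2_manual` is hypothetical) and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y, y_hat):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)     # squared error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy targets and two toy prediction sets
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.5, 5.5, 7.0, 9.0])  # close to the targets
bad_pred = np.array([9.0, 3.0, 8.0, 4.0])   # worse than predicting the mean

print(r2_manual(y_true, good_pred), r2_score(y_true, good_pred))
print(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred))
```

The second pair of values is negative, which is what happens whenever the model's squared error exceeds that of the mean baseline.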

2. Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

decision_model = DecisionTreeRegressor()
tuning_model=GridSearchCV(estimator=decision_model, param_grid=parameters,scoring='neg_mean_squared_error',cv=3,verbose=3)

tuning_model.fit(XTrain, yTrain)
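Grid search fits one model per parameter combination per cross-validation fold, so the cost grows multiplicatively with every list in the grid. Before launching a long search it can help to count the candidates with `ParameterGrid`. The grid below is a deliberately small, hypothetical one, not the notebook's `parameters` dictionary.

```python
from sklearn.model_selection import ParameterGrid

# Small illustrative grid (assumed values)
small_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
    "splitter": ["best", "random"],
}

n_candidates = len(ParameterGrid(small_grid))  # 3 * 3 * 2 combinations
cv_folds = 3
total_fits = n_candidates * cv_folds           # one fit per combination per fold
print(n_candidates, total_fits)
```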


In [None]:
# best hyperparameters
tuning_model.best_params_

In [62]:
## Refit with the best-scoring hyperparameters (identical to params_1 from the manual search)

tuned_dec_model= DecisionTreeRegressor(criterion= 'squared_error', splitter= 'best', max_depth= None, min_samples_leaf= 1, min_weight_fraction_leaf= 0.0,
            max_features= None, max_leaf_nodes = None)
tuned_dec_model.fit(XTrain, yTrain)

predicted_tuned_decision_trees=tuned_dec_model.predict(XVal)

print("Mean Absolute Error using Decision Trees with Hyperparameters:", mean_absolute_error(yVal, predicted_tuned_decision_trees))
print("RMSE using Decision Trees with Hyperparameters:", mean_squared_error(yVal, predicted_tuned_decision_trees) ** 0.5)  # RMSE = sqrt(MSE)
print("R2 Score using Decision Trees with Hyperparameters:", r2_score(yVal, predicted_tuned_decision_trees))

Mean Absolute Error using Decision Trees with Hyperparameters: 25173.972602739726
RMSE using Decision Trees with Hyperparameters: 36575.44295577238
R2 Score using Decision Trees with Hyperparameters: 0.8128476052413407


In [63]:
##Compare Tuned and Untuned result.

Hyper_param_result = {
    'Models': ['Untuned Decision Tree Model', 'Tuned Decision Tree Model'],
    'R2 Score': [round(r2_score(yVal, predicted_decision_trees), 3), round(r2_score(yVal, predicted_tuned_decision_trees), 3)],
    # RMSE is the square root of MSE for both models
    'RMSE': [round(mean_squared_error(yVal, predicted_decision_trees) ** 0.5, 3), round(mean_squared_error(yVal, predicted_tuned_decision_trees) ** 0.5, 3)],
    'MAE': [round(mean_absolute_error(yVal, predicted_decision_trees), 3), round(mean_absolute_error(yVal, predicted_tuned_decision_trees), 3)],
}

df_Hyper_param_result = pd.DataFrame(Hyper_param_result)
df_Hyper_param_result


Unnamed: 0,Models,R2 Score,RMSE,MAE
0,Untuned Decision Tree Model,0.803,37566.325,25100.024
1,Tuned Decision Tree Model,0.813,36575.443,25173.973
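The third strategy listed earlier, random search, samples a fixed number of combinations instead of trying all of them. The sketch below uses `RandomizedSearchCV` on synthetic data from `make_regression`; the parameter ranges, `n_iter=10` and the seeds are assumptions chosen for illustration, not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for XTrain and yTrain
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate values to sample from (illustrative ranges)
param_distributions = {
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "splitter": ["best", "random"],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                         # only 10 of the 40 combinations are tried
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `XTrain` and `yTrain`, and the best estimator could then be scored on `XVal` in the same way as the models above.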
