# House Price Prediction

> - 1.0 Introduction
    - 1.1 Importing libraries
    - 1.2 Collecting the data
  
> - 2.0 Preprocessing
    - 2.1 Dropping irrelevant features
    - 2.2 Null Value Removal
        - 2.2.1 Null Values
        - 2.2.2 Legit
            - 2.2.2.1 Numeric
            - 2.2.2.2 Object
            - 2.2.2.3 Complex
    - 2.3 Data Encoding
        - 2.3.1 One Hot Encoding
    - 2.4 Feature Selection
        - 2.4.1 High Correlation Filter (Resolved the Dummy Variable Trap)
        - 2.4.2 Correlation of the target variable with all the features
    - 2.5 Dimensionality Reduction
        - 2.5.1 Low Variance Filter
   
> - 3.0 Model Training
    - 3.0.1 Splitting the data
    - 3.0.2 Standardizing the Data
    - 3.1 Multiple Linear Regression
    - 3.2 Decision Tree
    - 3.3 Random Forest
    - 3.4 Support Vector Machine
    - 3.5 Gradient Boosting
    - 3.6 Ada Boosting
    - 3.7 Light GBM

> - 4.0 Final Result

> - 5.0 Submit

# 1.0 Introduction


> In this project we predict house sale prices from the features provided in the dataset

# 1.1 Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV

In [4]:
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import sklearn.metrics as metrics
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score

In [6]:
# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm

In [7]:
import time
from collections import Counter

# 1.2 Collecting the data

In [10]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [11]:
df = pd.concat([train, test], ignore_index=True)
df.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500.0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500.0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500.0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000.0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000.0
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000.0
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000.0
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000.0
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900.0
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000.0


In [13]:
# copying for future purposes
data_0 = df.copy()

In [14]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

> There are a lot of features with non-numeric data, which will need to be encoded before the models can use them.

> - To Drop:
    - Id (Irrelevant Data)

# 2.0 Preprocessing
# 2.1 Dropping irrelevant features

In [15]:
df = df.drop(['Id'], axis=1)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     2919 non-null   int64  
 1   MSZoning       2915 non-null   object 
 2   LotFrontage    2433 non-null   float64
 3   LotArea        2919 non-null   int64  
 4   Street         2919 non-null   object 
 5   Alley          198 non-null    object 
 6   LotShape       2919 non-null   object 
 7   LandContour    2919 non-null   object 
 8   Utilities      2917 non-null   object 
 9   LotConfig      2919 non-null   object 
 10  LandSlope      2919 non-null   object 
 11  Neighborhood   2919 non-null   object 
 12  Condition1     2919 non-null   object 
 13  Condition2     2919 non-null   object 
 14  BldgType       2919 non-null   object 
 15  HouseStyle     2919 non-null   object 
 16  OverallQual    2919 non-null   int64  
 17  OverallCond    2919 non-null   int64  
 18  YearBuil

# 2.2 Null Value Removal:

> - Null (genuinely missing values that have to be imputed):
	- MSZoning, Utilities, Exterior1st, Exterior2nd, Electrical, BsmtFullBath(No Bsmt), BsmtHalfBath(No Bsmt), KitchenQual, Functional, GarageYrBlt(Not all), GarageFinish(Not all), GarageQual(Not all), GarageCond(Not all), SaleType


> - Legit (a null means the feature is absent):
	- Numeric: LotFrontage, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageYrBlt(Not all), GarageFinish(Not all), GarageCars, GarageArea, GarageQual(Not all), GarageCond(Not all)
	- Object: Alley, MasVnrType, GarageType, MiscFeature
	- Obj-Num: BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, PoolQC, Fence

In [17]:
# Count the columns that contain at least one null value
print(df.isnull().any().sum())

35
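The count alone does not say which columns are affected or how badly. A minimal sketch of a per-column breakdown follows; the frame `toy` and its values are made up for illustration and only stand in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are invented.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Grvl"],
})

# Null count per column, keeping only the columns that have any nulls
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

n_null_columns = len(null_counts)
print(null_counts)
print(n_null_columns)
```

Applied to `df`, the same two lines would list the affected columns with the worst ones first, which makes it easier to sort them into the "Null" and "Legit" groups above.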


## 2.2.1 Null Values:
> These values are genuinely missing, so they are filled with the median (numeric columns) or the mode (object columns)
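As a small illustration of median and mode imputation, here is a sketch on a made-up frame. The column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "BsmtFullBath": [1.0, 0.0, np.nan, 1.0, 2.0],
    "MSZoning": ["RL", "RM", "RL", None, "RL"],
})

# Numeric column: the median ignores the missing entry
toy["BsmtFullBath"] = toy["BsmtFullBath"].fillna(toy["BsmtFullBath"].median())

# Categorical column: mode() returns a Series, so take its first entry
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a column selection, avoids relying on chained in-place modification.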

In [18]:
null_num = ['BsmtFullBath', 'BsmtHalfBath']
null_com = ['GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
null_obj = ['MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'Electrical', 'KitchenQual', 'Functional', 'SaleType']

In [19]:
# Numeric columns: fill with the median
for x in null_num:
    df[x] = df[x].fillna(df[x].median())

# Object columns: fill with the most frequent value
for x in null_obj:
    df[x] = df[x].fillna(df[x].mode()[0])

In [20]:
# Placeholder fill; these garage columns are revisited in 2.2.2.3
for x in null_com:
    df[x] = df[x].fillna(0)

## 2.2.2 Legit:
> These nulls are not missing data: they indicate that the feature is absent, so they are replaced with a placeholder value
### 2.2.2.1 Numeric:
> Here null values are directly replaced by 0

In [21]:
# A null in these numeric columns means the feature is absent, so use 0
legit_num = ["LotFrontage", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
             "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "GarageArea"]
df[legit_num] = df[legit_num].fillna(0)
df["LotFrontage"].value_counts()

0.0      486
60.0     276
80.0     137
70.0     133
50.0     117
        ... 
111.0      1
138.0      1
182.0      1
168.0      1
133.0      1
Name: LotFrontage, Length: 129, dtype: int64

### 2.2.2.2 Object:
> Here null values are directly replaced by 'No'

In [22]:
# A null in these object columns means the feature is absent, so use 'No'
legit_obj = ["Alley", "MasVnrType", "GarageType", "MiscFeature",
             "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
             "BsmtFinType2", "FireplaceQu", "PoolQC", "Fence"]
df[legit_obj] = df[legit_obj].fillna('No')

### 2.2.2.3 Complex
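> The garage columns need a rule that depends on `GarageType`: a null is expected when the house has no garage, but is a genuinely missing value when a garage exists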

In [23]:
# Houses without a garage ('No' in GarageType) get 0 in every garage column.
# Rows that do have a garage keep the 0 placeholder assigned in 2.2.1
# wherever the original value was missing.
no_garage = df["GarageType"] == 'No'
df.loc[no_garage, null_com] = 0
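The rule described for the garage columns (0 when there is no garage, a typical value when a garage exists but the entry is missing) can also be expressed with boolean masks taken before any placeholder is written. The following is only a sketch under that assumption, on invented data, and is not the cell that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'No' marks a house without a garage as in 2.2.2.2
toy = pd.DataFrame({
    "GarageType": ["Attchd", "No", "Detchd", "Attchd"],
    "GarageYrBlt": [2003.0, np.nan, np.nan, 1998.0],
})

has_garage = toy["GarageType"] != "No"
missing = toy["GarageYrBlt"].isnull()

# No garage: the year is not applicable, so use the placeholder 0
toy.loc[~has_garage & missing, "GarageYrBlt"] = 0

# Garage present but year unknown: use the median year of houses with a garage
median_year = toy.loc[has_garage, "GarageYrBlt"].median()
toy.loc[has_garage & missing, "GarageYrBlt"] = median_year

print(toy)
```

For the categorical garage columns a median is not defined, so a variant of this sketch would presumably use the mode instead.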

# 2.3 Data Encoding:
> One-hot encoded all the object features with `pd.get_dummies`

> Total Columns: 303
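To make the effect of one-hot encoding concrete, here is a minimal sketch on an invented three-row frame. Each object column is replaced by one indicator column per category, which is why the column count grows so much.

```python
import pandas as pd

# Hypothetical toy data; the column names are borrowed from the dataset
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "LotShape": ["Reg", "IR1", "Reg"],
})

object_cols = ["Street", "LotShape"]

# One call encodes every listed column and drops the originals
encoded = pd.get_dummies(toy, columns=object_cols)

print(encoded.columns.tolist())
print(encoded.shape)
```

Passing `columns=` to a single `pd.get_dummies` call should give the same layout as encoding the columns one at a time in a loop: untouched columns first, then the indicator columns in the order the object columns were listed.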

In [24]:
columns_numeric = list(df.dtypes[(df.dtypes=='int64') | (df.dtypes=='float64') ].index)
columns_object = list(df.dtypes[df.dtypes=='object'].index)
print(f"numeric columns: {len(columns_numeric)} \nobject columns: {len(columns_object)}")

numeric columns: 37 
object columns: 43


## 2.3.1 One Hot Encoder

In [25]:
df2 = df.copy()
for x in columns_object:
    temp = pd.get_dummies(df2[x],prefix=x)
    df2 = pd.concat([df2,temp],axis=1)
    df2.drop(x,axis=1,inplace=True)
df2.shape

(2919, 303)

#### Splitting Target and Feature Variables

In [26]:
X = df2.drop(['SalePrice'], axis=1)
y = df2['SalePrice']
train = df2.iloc[:1460,:]

# 2.4 Feature Selection:

*The heatmap was not interpretable because there are too many features, so it is not used*
> plt.subplots(figsize = (25,20))
sns.heatmap(df2.corr(method='pearson'), annot=False, linewidths=0.2)

## 2.4.1 High Correlation Filter (Resolved the Dummy Variable Trap)
> Calculated the pairwise Pearson correlation between all the feature variables and, for each pair with a correlation of 0.9 or higher, removed the later column of the pair

> Removed 10, Total Columns Remaining: 293

In [27]:
corr = X.corr(method='pearson')
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = X.columns[columns]
X2 = X[selected_columns]
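The same idea can be written with the upper triangle of the correlation matrix, so that each pair is inspected once. This sketch uses synthetic data and, unlike the cell above, takes the absolute value of the correlation so that strongly negative pairs are caught as well; that is an assumption of the sketch, not what produced the counts reported here.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is almost identical to "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr = toy.corr(method="pearson").abs()

# Mask everything except the strict upper triangle (k=1 excludes the diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```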

## 2.4.2 Correlation of the target variable with all the features 
> Calculated the correlation of every feature variable with the target variable and removed the features whose absolute correlation is below 0.05

> Removed 110, Total Columns Remaining: 183

In [28]:
df3 = X2.copy()
df3['SalePrice'] = y
corr = df3.corr(method='pearson')['SalePrice']

In [29]:
flag = 0
# Look correlations up by column name rather than by position
for col in X2.columns:
    if abs(corr[col]) < 0.05:
        flag += 1
        # print(f"Dropping column: {col}: {corr[col]}")
        df3 = df3.drop([col], axis=1)
print(f"Columns dropped: {flag}")

Columns dropped: 110
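A compact way to express the same filter is to select the weak features directly from the correlation Series. The data below is invented so that one feature is perfectly correlated with the target and the other has exactly zero correlation.

```python
import pandas as pd

# Hypothetical toy data: "strong" drives the target, "unrelated" does not
toy = pd.DataFrame({
    "strong": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "unrelated": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
    "target": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr(method="pearson")["target"].drop("target")

# Features whose absolute correlation falls below the threshold
weak = target_corr[target_corr.abs() < 0.05].index.tolist()
filtered = toy.drop(columns=weak)

print(weak)
```

Note that Pearson correlation only measures linear association, so a feature with a weak linear correlation can still be useful to tree-based models.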


# 2.5 Dimensionality Reduction

## 2.5.1 Low Variance Filter

> Calculated the variance of all the feature columns and removed those with a variance below 0.005

> Removed 11, Total Columns Remaining: 172

In [30]:
df11 = df3.copy()
var = df11.var()
i = 0
# Look variances up by column name rather than by position
for col in df3.columns:
    if var[col] < 0.005:
        i += 1
        df11 = df11.drop([col], axis=1)
        # print(f"dropping: {col}")
print(f'Columns dropped: {i}')

Columns dropped: 11
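scikit-learn ships a transformer for this kind of filter, `VarianceThreshold`. The sketch below applies it to invented indicator columns with the same 0.005 threshold; it is shown as an alternative, not as the code that produced the count above. One difference to keep in mind is that `VarianceThreshold` uses the population variance, while `DataFrame.var()` uses the sample variance by default, so borderline columns may be treated differently.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical indicator columns, similar to what one-hot encoding produces
toy = pd.DataFrame({
    "rare_flag": [0] * 999 + [1],   # set in 1 of 1000 rows, variance about 0.001
    "common_flag": [0, 1] * 500,    # set in half of the rows, variance 0.25
})

selector = VarianceThreshold(threshold=0.005)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```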


# Outliers Handling
> Have not yet found a method to remove the outliers

# 3.0 Model Training:

## 3.0.1 Splitting the Data

In [31]:
dataset = df11

In [32]:
X = dataset.drop(['SalePrice'], axis=1)
y = dataset['SalePrice']
X_t = X.iloc[:1460,:]
y_t = y.iloc[:1460]
X_test = X.iloc[1460:,:]
y_test = y.iloc[1460:]

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_t, y_t, test_size = 0.3, random_state = 0)

## 3.0.2 Standardizing the Data

In [34]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
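The important detail in the cell above is that the scaler is fitted on the training split only and then reused for the other splits. A minimal sketch on synthetic data (all names here are illustrative) shows the consequence: the training split ends up with mean 0 and standard deviation 1 exactly, while the validation split is only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit spread
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_tr, X_va = train_test_split(X_toy, test_size=0.3, random_state=0)

# Mean and standard deviation are learned from the training rows only
toy_scaler = StandardScaler().fit(X_tr)
X_tr_std = toy_scaler.transform(X_tr)
X_va_std = toy_scaler.transform(X_va)

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
print(X_va_std.mean(axis=0).round(3), X_va_std.std(axis=0).round(3))
```

Fitting the scaler on the validation or test rows as well would leak information about those rows into training.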

# 3.1 Multiple Linear Regression

In [35]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
start = time.time()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_val)
time_ML = time.time() - start
acc01 = round(r2_score(y_val, y_pred),4)
print('Linear regression accuracy : ' ,acc01)

Linear regression accuracy :  0.7873


In [36]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
print('Root Mean Log Squared Error:', np.sqrt(mean_squared_log_error(y_val, y_pred)))
RMLSE_ML = np.sqrt(mean_squared_log_error(y_val, y_pred))

Mean Absolute Error: 18852.53155857433
Mean Squared Error: 1443799837.6942825
Root Mean Squared Error: 37997.36619417565
Root Mean Log Squared Error: 0.15530373225385524
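For reference, the "accuracy" printed for each model is the R² score, and the last metric is the root mean squared logarithmic error. The sketch below recomputes both by hand on three invented prices and compares them with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# Hypothetical true and predicted prices
y_true = np.array([200000.0, 150000.0, 300000.0])
y_hat = np.array([210000.0, 140000.0, 330000.0])

# RMSLE: root of the mean squared difference of log(1 + y)
rmsle_manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

print(rmsle_manual, rmsle_sklearn)
print(r2_manual, r2_sklearn)
```

Because the metric takes logarithms, `mean_squared_log_error` rejects negative values, which can matter for models such as linear regression that are free to predict a negative price.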


# 3.2 Decision Tree

In [37]:
from sklearn.tree import DecisionTreeRegressor
regr = DecisionTreeRegressor(max_depth=2, random_state=0, max_leaf_nodes=2)
start = time.time()
regr.fit(X_train, y_train)
y_pred01 = regr.predict(X_val)
time_DT = time.time() - start
acc02 = round(r2_score(y_val, y_pred01),4)
print('Decision tree regression accuracy : ' ,acc02)

Decision tree regression accuracy :  0.4683


In [38]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred01))
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred01))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred01)))
print('Root Mean Log Squared Error:', np.sqrt(mean_squared_log_error(y_val, y_pred01)))
RMLSE_DT = np.sqrt(mean_squared_log_error(y_val, y_pred01))

Mean Absolute Error: 44119.82194267656
Mean Squared Error: 3609701201.211137
Root Mean Squared Error: 60080.78895296846
Root Mean Log Squared Error: 0.3075830173527023


# 3.3 Random Forest

In [39]:
# randomforest = RandomForestRegressor(n_estimators=200, random_state=2)
randomforest = RandomForestRegressor(n_estimators=400, random_state=2, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=None, bootstrap=False)
# randomforest = RandomForestRegressor(n_estimators=110, random_state=2, min_samples_split=6, min_samples_leaf=2, max_features='auto', max_depth=20, bootstrap=True)
start = time.time()
randomforest.fit(X_train, y_train)
y_pred02= randomforest.predict(X_val)
time_RF = time.time() - start
acc03 = round(r2_score(y_val, y_pred02),4)
print('Random Forest Regression accuracy : ' ,acc03)

Random Forest Regression accuracy :  0.8662


### Hyperparameter Tuning(RF)

In [40]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
Fitting 3 folds for each of 100 candidates, totalling 300 fits
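Once a search has finished, the tuned model does not have to be rebuilt by hand from the printed parameters. A much smaller sketch on synthetic data follows; the grid values are placeholders chosen so that it runs quickly, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the housing data
X_toy, y_toy = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Deliberately tiny search space (illustrative values only)
param_distributions = {
    "n_estimators": [20, 40],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation per combination
    random_state=42,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_      # dict: parameter name -> chosen value
best_model = search.best_estimator_    # refit on all of X_toy, y_toy
print(best_params)
```

`best_params_` is a dictionary, so it can be unpacked into a new estimator with `RandomForestRegressor(**best_params)`; converting it with `tuple(...)` keeps only the parameter names.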


In [None]:
tuple(rf_random.best_params_)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred02))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred02))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred02)))
print('Root Mean Log Squared Error:', np.sqrt(mean_squared_log_error(y_val, y_pred02)))
RMLSE_RF = np.sqrt(mean_squared_log_error(y_val, y_pred02))

# 3.4 Support Vector Machine

In [None]:
from sklearn.svm import SVR

regr01 = SVR(kernel='linear')
start = time.time()
regr01.fit(X_train, y_train)
y_pred03 = regr01.predict(X_val)
time_SV = time.time() - start
acc04 = round(r2_score(y_val, y_pred03),4)
print('SVR accuracy : ' ,acc04)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred03))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred03))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred03)))
print('Root Mean Log Squared Error:', np.sqrt(mean_squared_log_error(y_val, y_pred03)))
RMLSE_SV = np.sqrt(mean_squared_log_error(y_val, y_pred03))

# 3.5 Gradient Boosting
> Model with the highest accuracy (R² score)

In [None]:
gb = GradientBoostingRegressor(n_estimators=1400, random_state=4, min_samples_split=10, min_samples_leaf=1, max_features='sqrt', max_depth=20, learning_rate=0.01)
start = time.time()
gb.fit(X_train, y_train)
y_pred04= gb.predict(X_val)
time_GB = time.time() - start
acc05 = round(r2_score(y_val, y_pred04),4)
print('Gradient Boosting accuracy : ' ,acc05)

### Hyperparameter Tuning(GB)

In [None]:
# Number of boosting stages
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# (1.0 means all features; recent scikit-learn releases no longer accept 'auto')
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Learning rate
learning_rate = [1, 0.5, 0.25, 0.1, 0.05, 0.01]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}
print(random_grid)

gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 1, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
gb_random.fit(X_train, y_train)
gb_random.best_params_

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred04))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred04))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred04)))
print('Root Mean Log Squared Error:', np.sqrt(mean_squared_log_error(y_val, y_pred04)))
RMLSE_GB = np.sqrt(mean_squared_log_error(y_val, y_pred04))

# 4.0 Final Result

In [None]:
models= pd.DataFrame({ 
"Model" : ["MultipleLinearRegression", "DecisionTreeRegression", "RandomForestRegression", "SVR","Gradient Boosting"],
"Accuracy" : [acc01, acc02, acc03, acc04, acc05],
"Time" : [time_ML, time_DT, time_RF, time_SV, time_GB],
"RMLSE" : [RMLSE_ML, RMLSE_DT, RMLSE_RF, RMLSE_SV, RMLSE_GB]
})
model_notime = pd.DataFrame({ 
"Model" : ["MultipleLinearRegression", "DecisionTreeRegression", "RandomForestRegression", "SVR","Gradient Boosting"],
"Accuracy" : [acc01, acc02, acc03, acc04, acc05]
})
model_time = pd.DataFrame({ 
"Model" : ["MultipleLinearRegression", "DecisionTreeRegression", "RandomForestRegression", "SVR","Gradient Boosting"],
"Time" : [time_ML, time_DT, time_RF, time_SV, time_GB]
})
models

In [None]:
models.sort_values(by="RMLSE")

# 5.0 Submissions
> Used Gradient Boosting as it gives the highest accuracy

In [None]:
gb = GradientBoostingRegressor(n_estimators=1400, random_state=2, min_samples_split=10, min_samples_leaf=1, max_features='sqrt', max_depth=20, learning_rate=0.01)
# X_test was standardized in 3.0.2, so the training features must be on the same scale
gb.fit(scaler.transform(X_t), y_t)
y_pred = gb.predict(X_test)
y_t
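Because `X_test` was standardized in 3.0.2, the final model has to see features on the same scale when it is fitted and when it predicts. One way to make that hard to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data and a small number of estimators so that it runs quickly; the names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first 100 rows play the labelled set, the rest the unlabelled set
X_toy, y_toy = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)
X_fit, X_new = X_toy[:100], X_toy[100:]
y_fit = y_toy[:100]

# The pipeline fits the scaler on X_fit and applies it again inside predict()
pipe = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, random_state=2),
)
pipe.fit(X_fit, y_fit)
preds = pipe.predict(X_new)

print(preds[:5])
```

With a pipeline, raw (unscaled) features are passed to both `fit` and `predict`, so there is no separate scaled copy of the test set to keep in sync.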

In [None]:
submit = data_0.iloc[1460:,0]
submit = pd.DataFrame(submit)

In [None]:
submit['SalePrice'] = y_pred

In [None]:
submit

In [None]:
submit.to_csv('Submission.csv', index=False)

In [None]:
submit.shape