# House Prices: Preprocessing and Modeling

In this notebook, I have filled in missing values, converted the dataframe, split the data, and then have started the modeling process. Within each model I have looked for the mean squared error across the six models of: Linear Regression, L1 Lasso, L2 Ridge, Random Forest, Desision Tree, and Gradiiant Boosting. Later towards the end of the notebook we took the lowest score by building a table to easily identify which model is best. We take the lowest score because it provides us the highest accuracy of the predition between the actual and the predicited data.


# Imports 

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import linear_model

# Load Data 

In [2]:
df = pd.read_csv('../House Prices Advance Regression Technique/train.csv')

# Preprocessing

We first remove the categorical features and account for the missing values. We then check on the DataFrame to check if everything is consistant. 

In [3]:
#Remove the categorical features 
dfTrain = df.select_dtypes(include = ['float64', 'int64'])

#fill in missing values  
dfTrain.drop(['LotFrontage', 'MasVnrArea', 'GarageYrBlt'],axis = 1, inplace = True)

dfTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   BsmtFinSF1     1460 non-null   int64
 8   BsmtFinSF2     1460 non-null   int64
 9   BsmtUnfSF      1460 non-null   int64
 10  TotalBsmtSF    1460 non-null   int64
 11  1stFlrSF       1460 non-null   int64
 12  2ndFlrSF       1460 non-null   int64
 13  LowQualFinSF   1460 non-null   int64
 14  GrLivArea      1460 non-null   int64
 15  BsmtFullBath   1460 non-null   int64
 16  BsmtHalfBath   1460 non-null   int64
 17  FullBath       1460 non-null   int64
 18  HalfBath       1460 non-null   int64
 19  Bedroo

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [4]:
dfTrain['SalePrice']

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [5]:
dfTrain.shape

(1460, 35)

## Convert Dataframe into dummies to include the categorical values 

Here we convert the categorical values into the numberical values with dummies, confirm the shape, and check the first five rows of the data.

In [6]:
dfTrain1 = pd.get_dummies(df)

In [7]:
dfTrain1.shape

(1460, 290)

In [8]:
dfTrain1.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,0,0,0,1,0,0,0,0,1,0


In [9]:
dfTrain1.isna().sum().sum()

348

In [10]:
dfTrain1.fillna(0, inplace = True)

# Split and Train 

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(dfTrain1.drop(columns = ['SalePrice']),
                                                    dfTrain1.SalePrice,random_state =5)

In [12]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((1095, 289), (365, 289), (1095,), (365,))

In [13]:
model_acc_score = pd.DataFrame(columns=["model", "MSE"])

# Model 1:  Linear Regression  

In [14]:
#r^2
lr = LinearRegression()
lr.fit(x_train,y_train)
print(r2_score(y_test,lr.predict(x_test)))

#mse
y_pred = lr.predict(x_test)

lr_mse = mean_squared_error(y_test,y_pred)

model_acc_score = model_acc_score.append({"model":"Linear Regression", "MSE":lr_mse}, ignore_index = True)


0.464440082802685


In [15]:
model_acc_score.head()

Unnamed: 0,model,MSE
0,Linear Regression,3313708000.0


# Model  2:  L2 Ridge 

In [16]:
#r^2
reg = linear_model.Ridge(alpha=.5)
reg.fit(x_train,y_train)
print(r2_score(y_test,reg.predict(x_test)))

#mse
y_pred = lr.predict(x_test)
l2_mse = mean_squared_error(y_test,y_pred)

model_acc_score = model_acc_score.append({"model":"L2 Ridge", "MSE":l2_mse}, ignore_index = True)

0.8803196417692194


# Model 3: L1 Lasso 

In [17]:
#r^2
regl2 = linear_model.Lasso(alpha=0.1)
regl2.fit(x_train,y_train)
print(r2_score(y_test,regl2.predict(x_test)))

#mse
y_pred = regl2.predict(x_test)
l1_mse = mean_squared_error(y_test,y_pred)

model_acc_score = model_acc_score.append({"model":"L1 Lasso", "MSE":l1_mse}, ignore_index = True)


0.8844448631088188


  positive)


# Model 4 - Random Forest Regressor

In [18]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [19]:
RF = RandomForestRegressor(random_state=30, max_depth = 85, n_estimators = 375)

In [20]:
rf1 = RF.fit(x_train, y_train)
ypred = rf1.predict(x_test)

In [21]:
rf_mse = mean_squared_error(y_test,ypred)

model_acc_score = model_acc_score.append({"model":"Random Forest Regressor", "MSE":rf_mse}, ignore_index = True)

# Model 5 - Decision Tree Regressor 

In [22]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree, metrics

In [23]:
reg = DecisionTreeRegressor(random_state=30, max_depth = 10)
reg.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=10, random_state=30)

In [24]:
ypred = reg.predict(x_test)

In [25]:
dt_mse = mean_squared_error(y_test,ypred)

model_acc_score = model_acc_score.append({"model":"Decision Tree Regressor", "MSE":dt_mse}, ignore_index = True)

# Model 6 - Gradient Boosting Regressor

In [26]:
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor()
reg.fit(x_train, y_train)

GradientBoostingRegressor()

In [27]:
ypred = reg.predict(x_test)
gd_mse = mean_squared_error(y_test,ypred)

model_acc_score = model_acc_score.append({"model":"Gradiant Boosting Regressor", "MSE":gd_mse}, ignore_index = True)

In [28]:
model_acc_score

Unnamed: 0,model,MSE
0,Linear Regression,3313708000.0
1,L2 Ridge,3313708000.0
2,L1 Lasso,714982600.0
3,Random Forest Regressor,733079200.0
4,Decision Tree Regressor,1285078000.0
5,Gradiant Boosting Regressor,607661100.0


In [29]:
model_acc_score[model_acc_score.MSE == model_acc_score.MSE.min()]

Unnamed: 0,model,MSE
5,Gradiant Boosting Regressor,607661100.0


After calculating the models for the mean squared error, the best model I got was the Gradient Boosting Regressor, with a score of 0.606766. 