### **House Prices - Advanced Regression Techniques**

*A perfect competition for data science students. This competition challenges you to predict the final price of each home.*

*For each Id in the test set, you must predict the value of the SalePrice variable.*

IMPORT LIBRARIES

In [196]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns

IMPORT DATA

In [197]:
train = pd.read_csv('data/house-prices/train.csv')
test = pd.read_csv('data/house-prices/test.csv')

In [198]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [199]:
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


USING THE AVERAGE TRAINING SALE PRICE AS A **SIMPLE FIRST PREDICTION** FOR EVERY HOUSE IN THE TEST DATASET

In [200]:
mean = train['SalePrice'].mean()

test['SalePrice'] = mean

test

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,6,2010,WD,Normal,180921.19589
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,,,Gar2,12500,6,2010,WD,Normal,180921.19589
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,,0,3,2010,WD,Normal,180921.19589
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2010,WD,Normal,180921.19589
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,0,,,,0,1,2010,WD,Normal,180921.19589
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2006,WD,Normal,180921.19589
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2006,WD,Abnorml,180921.19589
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,9,2006,WD,Abnorml,180921.19589
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,180921.19589


In [201]:
test[['Id', 'SalePrice']].to_csv('first_average_prediction.csv', index=False)
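
To get a feel for how far off a constant prediction can be, the error of the mean baseline can be measured directly. The sketch below is only an illustration: it uses just the five SalePrice values shown in the train.head() output above, and the names sample_prices, baseline and baseline_mae are hypothetical, not part of the notebook.

```python
import numpy as np

# The five SalePrice values from the train.head() output (illustration only,
# not the full training set)
sample_prices = np.array([208500, 181500, 223500, 140000, 250000], dtype=float)

# The constant baseline predicts the same mean value for every house
baseline = sample_prices.mean()

# Mean absolute error of that constant prediction on the same five rows
baseline_mae = np.abs(sample_prices - baseline).mean()

print(baseline, baseline_mae)
```

Any model trained later should beat this kind of baseline error to be worth using.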

-------------

NOW, LET'S WORK ON THE DATA SET

In [202]:
# Use only the type of dwelling (MSSubClass), the lot frontage and the sale price to train the model
train = train[['MSSubClass', 'LotFrontage', 'SalePrice']].copy()  # .copy() avoids a SettingWithCopyWarning when a column is added later
train

Unnamed: 0,MSSubClass,LotFrontage,SalePrice
0,60,65.0,208500
1,20,80.0,181500
2,60,68.0,223500
3,70,60.0,140000
4,60,84.0,250000
...,...,...,...
1455,60,62.0,175000
1456,20,85.0,210000
1457,70,66.0,266500
1458,20,68.0,142125


FILL MISSING VALUES

In [203]:
train.isnull().sum()

MSSubClass       0
LotFrontage    259
SalePrice        0
dtype: int64

In [204]:
train.mean()

MSSubClass         56.897260
LotFrontage        70.049958
SalePrice      180921.195890
dtype: float64

In [205]:
train = train.fillna(train.mean())

In [206]:
train.isnull().sum()

MSSubClass     0
LotFrontage    0
SalePrice      0
dtype: int64

SPLITTING THE DATA TO CREATE THE INPUT (X) AND OUTPUT (y) VARIABLES

In [207]:
data = train.to_numpy()
X_train, y_train = data[:, :-1], data[:, -1]  # all columns but the last are features; the last is SalePrice
print(X_train)
print()
print(y_train)

[[60. 65.]
 [20. 80.]
 [60. 68.]
 ...
 [70. 66.]
 [20. 68.]
 [20. 75.]]

[208500. 181500. 223500. ... 266500. 142125. 147500.]


LINEAR REGRESSION WITH SCIKIT-LEARN

In [208]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)  # X_train holds the features, y_train the target values

lr

In [209]:
print(lr.coef_)  # estimated coefficients (slopes), one per feature in X_train
print()
print(lr.intercept_)  # the point where the regression line crosses the y-axis:
# it is the predicted value when all features are zero

[  75.96942776 1260.11397803]

88327.8118861935
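
The meaning of the intercept can be checked on a toy problem. The sketch below is an assumption-laden illustration, not the notebook's data: X_demo and y_demo are made-up values built from a known rule (y = 5 + 3*x1 + 2*x2), so the fitted model should recover those numbers and predict the intercept when every feature is zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features and a target generated from a known linear rule
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_demo = 5.0 + 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1]

demo = LinearRegression().fit(X_demo, y_demo)

# Predicting for a row of zeros returns the intercept
pred_at_zero = demo.predict(np.zeros((1, 2)))[0]

print(demo.coef_, demo.intercept_, pred_at_zero)
```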


**The *predicted selling price* is the MSSubClass value multiplied by 75.96942776, plus the LotFrontage value multiplied by 1260.11397803, plus the intercept (88327.8118861935).**

In [210]:
train.head()

Unnamed: 0,MSSubClass,LotFrontage,SalePrice
0,60,65.0,208500
1,20,80.0,181500
2,60,68.0,223500
3,70,60.0,140000
4,60,84.0,250000


In [211]:
60*75.96942776 + 65*1260.11397803 + 88327.8118861935

174793.3861237435
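
The same arithmetic can be applied to several rows at once with a matrix product. This is a minimal sketch that hard-codes the coefficients and intercept printed above and the five rows from train.head(); the names coef, intercept, X_head and manual_predictions are hypothetical.

```python
import numpy as np

# Coefficients and intercept copied from the lr.coef_ / lr.intercept_ output
coef = np.array([75.96942776, 1260.11397803])
intercept = 88327.8118861935

# MSSubClass and LotFrontage for the five rows shown by train.head()
X_head = np.array([
    [60, 65.0],
    [20, 80.0],
    [60, 68.0],
    [70, 60.0],
    [60, 84.0],
])

# One prediction per row: X @ coef + intercept
manual_predictions = X_head @ coef + intercept

print(manual_predictions)
```

The first value reproduces the hand calculation for row 0, and this is the same computation that lr.predict performs for a linear model.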

LET'S SEE THE ACCURACY OF THE PREDICTION

In [212]:
train['linear_predictions'] = lr.predict(X_train)
train

Unnamed: 0,MSSubClass,LotFrontage,SalePrice,linear_predictions
0,60,65.0,208500,174793.386124
1,20,80.0,181500,190656.318684
2,60,68.0,223500,178573.728058
3,70,60.0,140000,169252.510511
4,60,84.0,250000,198735.551706
...,...,...,...,...
1455,60,62.0,175000,171013.044190
1456,20,85.0,210000,196956.888574
1457,70,66.0,266500,176813.194379
1458,20,68.0,142125,175534.950948


In [213]:
from sklearn.metrics import mean_absolute_error

# mean absolute error: the average absolute difference between the actual and the predicted sale price
mean_absolute_error(train['SalePrice'], train['linear_predictions'])

54381.25047406032

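*On average, the predictions are off by about 54,381. Note that this error is measured on the same data the model was trained on.*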
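
An error measured on the training rows tends to be optimistic, so a common alternative is to hold out part of the data for validation and compare against a constant mean prediction. The sketch below is one possible setup, not the notebook's method: the data is synthetic (generated with made-up coefficients), and the names X_demo, y_demo, model and dummy are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the two features and the sale price (made-up numbers)
rng = np.random.default_rng(0)
n = 500
subclass = rng.choice([20, 60, 120], size=n).astype(float)
frontage = rng.uniform(20, 150, size=n)
X_demo = np.column_stack([subclass, frontage])
y_demo = 50_000 + 80 * subclass + 1_200 * frontage + rng.normal(0, 5_000, size=n)

# Keep 20% of the rows aside; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_fit, y_fit)
dummy = DummyRegressor(strategy="mean").fit(X_fit, y_fit)  # always predicts the mean

mae_model = mean_absolute_error(y_val, model.predict(X_val))
mae_dummy = mean_absolute_error(y_val, dummy.predict(X_val))

print(f"linear model MAE: {mae_model:.0f}, mean baseline MAE: {mae_dummy:.0f}")
```

Comparing the two numbers shows how much the features actually add over the constant prediction.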

PREDICT THE SELLING PRICE OF THE TEST DATASET

In [214]:
test_final = test[['MSSubClass', 'LotFrontage']]  # the same two feature columns the model was trained on
test_final

Unnamed: 0,MSSubClass,LotFrontage
0,20,80.0
1,20,81.0
2,60,74.0
3,60,78.0
4,120,43.0
...,...,...
1454,160,21.0
1455,160,21.0
1456,20,160.0
1457,85,62.0


In [215]:
test_final.isnull().sum()

MSSubClass       0
LotFrontage    227
dtype: int64

In [216]:
test_final.mean()

MSSubClass     57.378341
LotFrontage    68.580357
dtype: float64

In [217]:
test_final = test_final.fillna(test_final.mean())
test_final.isnull().sum()

MSSubClass     0
LotFrontage    0
dtype: int64
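
Here the test set is filled with its own column means. A common alternative is to reuse the means computed on the training data, so that both datasets are transformed with the same values. The sketch below shows that variant on tiny made-up frames; train_demo, test_demo and train_means are hypothetical names, not the notebook's variables.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames with a missing LotFrontage in each
train_demo = pd.DataFrame(
    {"MSSubClass": [60, 20, 60], "LotFrontage": [65.0, np.nan, 68.0]}
)
test_demo = pd.DataFrame(
    {"MSSubClass": [20, 120], "LotFrontage": [np.nan, 43.0]}
)

# Compute the fill values once, on the training data only
train_means = train_demo.mean()

# Apply the same fill values to both frames
train_filled = train_demo.fillna(train_means)
test_filled = test_demo.fillna(train_means)

print(test_filled)
```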

In [218]:
X_test = test_final.to_numpy()
X_test

array([[ 20.,  80.],
       [ 20.,  81.],
       [ 60.,  74.],
       ...,
       [ 20., 160.],
       [ 85.,  62.],
       [ 60.,  74.]])

In [219]:
test['SalePrice'] = lr.predict(X_test) #lr is the previously trained linear regression model
test

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,6,2010,WD,Normal,190656.318684
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,,,Gar2,12500,6,2010,WD,Normal,191916.432662
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,,0,3,2010,WD,Normal,186134.411926
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2010,WD,Normal,191174.867838
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,0,,,,0,1,2010,WD,Normal,151629.044272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2006,WD,Normal,126945.313866
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2006,WD,Abnorml,126945.313866
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,9,2006,WD,Abnorml,291465.436927
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,172912.279884


In [221]:
test[['Id', 'SalePrice']].to_csv('final_prediction.csv', index=False)

----------------