# Kaggle house price prediction problem

Description of the data:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

1. Load data_combined_cleaned.csv.
2. Drop all records where the sale price is NaN (the code below refers to this column as SalesPrice).
3. One-hot encode the full data set.
4. Remove the Id column from the data set.
5. Split the data into training and test sets, with test size = 0.3 and random state = 1.
6. Create a pipeline that scales the data and fits a model.
7. Compute the R2 score on both the training data and the test data.

The cleaned dataset is available at the link below:
https://github.com/abulbasar/data/tree/master/kaggle-houseprice
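
The steps listed above can be sketched end to end on a small synthetic table, which makes the flow easy to run without downloading the CSV. This is only an illustration under stated assumptions: the `toy` frame, its three feature columns, and the price formula are made up for the example, and `dtype=float` is passed to `get_dummies` purely so the encoded columns are numeric regardless of pandas version.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection, pipeline, preprocessing

# Hypothetical stand-in for data_combined_cleaned.csv (not the real data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "LotArea": rng.integers(5000, 15000, size=n),
    "OverallQual": rng.integers(1, 11, size=n),
    "Street": rng.choice(["Pave", "Grvl"], size=n),
})
toy["SalesPrice"] = (10.0 * toy["LotArea"] + 20000.0 * toy["OverallQual"]
                     + rng.normal(0, 5000, size=n))
toy.loc[n // 2:, "SalesPrice"] = np.nan  # rows without a known price

# Step 2: keep only rows where the target is known.
toy = toy[toy["SalesPrice"].notnull()]

# Steps 3-4: split off the target, drop Id, one-hot encode what remains.
y = toy["SalesPrice"]
X = pd.get_dummies(toy.drop(columns=["Id", "SalesPrice"]),
                   drop_first=True, dtype=float)

# Step 5: 70/30 split with a fixed seed.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=1)

# Step 6: scale the features, then fit a linear model.
pipe = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.LinearRegression()),
])
pipe.fit(X_train, y_train)

# Step 7: R2 on both splits.
train_r2 = pipe.score(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print("train R2:", train_r2, "test R2:", test_r2)
```

Dropping Id before or after the encoding gives the same feature matrix, since Id is numeric and `get_dummies` leaves numeric columns untouched.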



In [2]:
import pandas as pd
import numpy as np
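# Import the submodules used below explicitly instead of relying on a wildcard import.
from sklearn import linear_model, model_selection, pipeline, preprocessing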
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("/data/kaggle/house-prices/data_combined_cleaned.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 80 columns):
Id               2919 non-null int64
MSSubClass       2919 non-null int64
MSZoning         2919 non-null object
LotFrontage      2919 non-null float64
LotArea          2919 non-null int64
Street           2919 non-null object
Alley            2919 non-null object
LotShape         2919 non-null object
LandContour      2919 non-null object
LotConfig        2919 non-null object
LandSlope        2919 non-null object
Neighborhood     2919 non-null object
Condition1       2919 non-null object
Condition2       2919 non-null object
BldgType         2919 non-null object
HouseStyle       2919 non-null object
OverallQual      2919 non-null int64
OverallCond      2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
RoofStyle        2919 non-null object
RoofMatl         2919 non-null object
Exterior1st      2919 non-null object
Exterior2nd      2919 non

In [5]:
df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 80 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non

In [8]:
# Keep only the rows that have a known sale price.
df = df[df.SalesPrice.notnull()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 80 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non

In [9]:
# np.object was removed in NumPy 1.24; the string "object" selects the same columns.
df.select_dtypes(include=["object"])

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,RL,Pave,,Reg,Lvl,Inside,Gtl,CollgCr,Norm,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1,RL,Pave,,Reg,Lvl,FR2,Gtl,Veenker,Feedr,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,RL,Pave,,IR1,Lvl,Inside,Gtl,CollgCr,Norm,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,RL,Pave,,IR1,Lvl,Corner,Gtl,Crawfor,Norm,Norm,...,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
4,RL,Pave,,IR1,Lvl,FR2,Gtl,NoRidge,Norm,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
5,RL,Pave,,IR1,Lvl,Inside,Gtl,Mitchel,Norm,Norm,...,Attchd,Unf,TA,TA,Y,,MnPrv,Shed,WD,Normal
6,RL,Pave,,Reg,Lvl,Inside,Gtl,Somerst,Norm,Norm,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
7,RL,Pave,,IR1,Lvl,Corner,Gtl,NWAmes,PosN,Norm,...,Attchd,RFn,TA,TA,Y,,,Shed,WD,Normal
8,RM,Pave,,Reg,Lvl,Inside,Gtl,OldTown,Artery,Norm,...,Detchd,Unf,Fa,TA,Y,,,,WD,Abnorml
9,RL,Pave,,Reg,Lvl,Corner,Gtl,BrkSide,Artery,Artery,...,Attchd,RFn,Gd,TA,Y,,,,WD,Normal


In [10]:
del df["Id"]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 79 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1460 no

In [13]:
# Separate the target from the features, then one-hot encode the categorical columns.
y = df.SalesPrice
X = df.drop(columns=["SalesPrice"])
X_dummy = pd.get_dummies(X, drop_first=True)
X_dummy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Columns: 258 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(11), int64(25), uint8(222)
memory usage: 738.6 KB


In [18]:
%%time
# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy, y, test_size=0.3, random_state=1)

# degree=1 without a bias column passes the features through unchanged,
# so this is a plain linear regression on scaled features.
pipe = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.LinearRegression()),
])

pipe.fit(X_train, y_train)
print("train R2", pipe.score(X_train, y_train),
      "test R2:", pipe.score(X_test, y_test))

train R2 0.933719310799 test R2: -7.35962213432e+17
CPU times: user 79.5 ms, sys: 11 ms, total: 90.5 ms
Wall time: 58.5 ms


In [19]:
%%time
# Same 70/30 split, same seed.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy, y, test_size=0.3, random_state=1)

# degree=2 adds squares and pairwise products of the encoded columns
# before scaling and fitting the linear regression.
pipe = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.LinearRegression()),
])

pipe.fit(X_train, y_train)
print("train R2", pipe.score(X_train, y_train),
      "test R2:", pipe.score(X_test, y_test))

train R2 1.0 test R2: 0.833664447949
CPU times: user 12.6 s, sys: 1.54 s, total: 14.1 s
Wall time: 9.61 s


In [27]:
%%time
# Same 70/30 split, same seed.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy, y, test_size=0.3, random_state=1)

# Replace the unregularized regression with Lasso (L1 penalty, alpha=1).
pipe = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=1, tol=0.0001)),
])

pipe.fit(X_train, y_train)
print("train R2", pipe.score(X_train, y_train),
      "test R2:", pipe.score(X_test, y_test))

train R2 0.933703571611 test R2: 0.729775659641
CPU times: user 204 ms, sys: 4.08 ms, total: 208 ms
Wall time: 207 ms




In [28]:
# Coefficients of the fitted Lasso step, looked up by its pipeline name.
list(pipe.named_steps["est"].coef_)

[2997.1905245084981,
 293.18363270766031,
 8311.2746626163935,
 7461.8392280623393,
 5804.5810608004658,
 7868.6850575869021,
 1708.4140260333522,
 2316.132119988024,
 12069.596696690414,
 1814.4012603972906,
 2888.1798767361911,
 3128.060920904938,
 14346.471491788003,
 21974.891167688274,
 1463.3155364429244,
 3812.2303838039361,
 -557.66935895730592,
 549.9381984864533,
 2105.2556540378732,
 1939.2192127486051,
 -797.64046788253268,
 -1663.9276453818018,
 1178.0251910499458,
 4375.1559592107524,
 -2777.1477343819593,
 1284.0473767520191,
 8072.1207674438992,
 1238.4015645709592,
 457.67766233786239,
 712.35838853234088,
 844.23166728626006,
 1807.357870972784,
 7932.9540675725239,
 1682.2411341263892,
 -1198.3304033719992,
 -748.5823678130181,
 8348.7905708720846,
 4060.6085834420933,
 14803.397363955672,
 13795.948719421138,
 -547.90450667993491,
 -1384.2548103360984,
 264.19435280675953,
 626.16750698714714,
 830.46363279796162,
 565.57203741735157,
 2286.5331152133745,
 -1656.057
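
A bare list of coefficients is hard to read without the column names next to it. With `degree=1` and `include_bias=False`, `PolynomialFeatures` emits the input columns in their original order, so the coefficients line up with the columns of the feature frame. The sketch below shows the idea on made-up data; `X_demo`, `y_demo`, `pipe_demo`, and the column names `a` to `d` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, pipeline, preprocessing

# Hypothetical data: only columns "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["b"] + rng.normal(0, 0.1, size=100)

pipe_demo = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=0.1)),
])
pipe_demo.fit(X_demo, y_demo)

# Label each coefficient with its column and sort by absolute size.
coefs = pd.Series(pipe_demo.named_steps["est"].coef_, index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

In the notebook the same pattern would use `X_dummy.columns` as the index. It would not carry over to the `degree=2` pipeline, where the polynomial step produces more columns than the input frame has.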

In [35]:
%%time
# Same 70/30 split, same seed.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy, y, test_size=0.3, random_state=1)

# Stronger L1 penalty (alpha=400): more coefficients are driven to exactly zero.
pipe = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=400, tol=0.0001)),
])

pipe.fit(X_train, y_train)
print("train R2", pipe.score(X_train, y_train),
      "test R2:", pipe.score(X_test, y_test))

print(list(pipe.named_steps["est"].coef_))

train R2 0.911205914278 test R2: 0.875452497179
[-4262.7381867007452, -904.51206198238128, 5167.7215990038967, 11924.58043927877, 5211.7369590058443, 6328.7748157905271, 1554.0144812972158, 2921.373128728726, 3915.9799614815252, 876.516627710281, -0.0, 326.94657026453251, 767.51211135901804, 0.0, -286.69024360324198, 22054.560464265774, 1214.8819032738384, 0.0, 2332.7094505219229, 394.76167376519146, -406.5154024577991, -2042.8472878416169, 3419.4848350622583, 2902.9599383778564, 0.0, 6929.4457410940286, 1621.5088447160197, 1924.6127940656932, 719.85260272140488, 180.88241061211801, 38.76910330820094, 1808.8105221310855, 2272.1952003294095, -0.0, -565.80614171758202, -0.0, 29.730088755319368, 0.0, 0.0, -1128.1569376149964, 0.0, -0.0, 731.55488532928848, 992.58444563844489, -725.62889706217675, -0.0, 2548.7489312254984, -108.15284397975969, 4298.0466110404996, 1901.1300154425023, -128.58790480725509, -408.44918187254916, -325.45894359039852, 1327.4360498334029, -593.56760862552801, 0.0,
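
The two Lasso runs above compare alpha = 1 and alpha = 400 by hand. A loop over several values makes the trade-off between fit and sparsity easier to see. The sketch below uses synthetic data from `make_regression`, not the house price data; the alpha grid, `max_iter`, and all `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Synthetic problem: 50 features, only 5 of which carry signal.
X_demo, y_demo = datasets.make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

results = []
for alpha in [0.01, 1, 10, 100]:
    pipe_demo = pipeline.Pipeline([
        ("scaler", preprocessing.StandardScaler()),
        ("est", linear_model.Lasso(alpha=alpha, max_iter=10000)),
    ])
    pipe_demo.fit(X_tr, y_tr)
    # Count how many coefficients the L1 penalty left non-zero.
    n_nonzero = int(np.count_nonzero(pipe_demo.named_steps["est"].coef_))
    results.append((alpha, pipe_demo.score(X_tr, y_tr),
                    pipe_demo.score(X_te, y_te), n_nonzero))

for alpha, train_r2, test_r2, n_nonzero in results:
    print(f"alpha={alpha:<6} train R2={train_r2:.3f} "
          f"test R2={test_r2:.3f} non-zero coefs={n_nonzero}")
```

Picking alpha by its test-split score makes that score optimistic; a separate validation split or cross-validation is the usual guard.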