# Kaggle house price prediction problem

Description of the data:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

1. Load data_combined_cleaned.csv
2. Filter out all records where SalePrice is NaN (the cleaned file names this column `SalesPrice`)
3. Perform one hot encoding on full data set
4. Remove Id column from dataset
5. Divide the data into training and test datasets, use test size = 0.3 and random state = 1
6. Create pipeline to scale the data and fit model
7. Find r2 score based on training data and testing data

The cleaned dataset is available at the link below:
https://github.com/abulbasar/data/tree/master/kaggle-houseprice


In [3]:
import pandas as pd
import numpy as np
from sklearn import linear_model, model_selection, pipeline, preprocessing
import matplotlib.pyplot as plt

In [17]:
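df = pd.read_csv("/data/kaggle/house-prices/data_combined_cleaned.csv")

# Keep only rows with a known target, then drop the identifier column.
df = df[df["SalesPrice"].notnull()]
del df["Id"]

# Separate the target from the features.
y = df["SalesPrice"]
X = df.drop(columns="SalesPrice")

# One-hot encode categorical columns; drop_first avoids a redundant
# indicator column per category.
X_dummy = pd.get_dummies(X, drop_first=True)
X_dummy.info()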

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Columns: 258 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(11), int64(25), uint8(222)
memory usage: 738.6 KB
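
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame. The column names `LotArea` and `Street` come from the Kaggle data description, but the values are invented for illustration. A categorical column with k levels becomes k - 1 indicator columns, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with invented values; not taken from the real dataset.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# drop_first=True drops the first level in sorted order ("Grvl"),
# leaving a single indicator column for the two-level category.
toy_dummy = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummy.columns))  # ['LotArea', 'Street_Pave']
```

This is why the 258 columns reported above are fewer than the total number of category levels plus numeric columns.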


In [20]:
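%%time

# Hold out 30% of the rows for testing. The output below was produced
# with random_state=1230 (the task list above asks for 1).
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_dummy, y, test_size=0.3, random_state=1230)

pipe = pipeline.Pipeline([
    # degree=1 without a bias column leaves the features unchanged;
    # higher degrees would add polynomial and interaction terms.
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=450, tol=0.0001)),
])

pipe.fit(X_train, y_train)
print("train R2", pipe.score(X_train, y_train),
      "test R2:", pipe.score(X_test, y_test))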


train R2 0.918867817322 test R2: 0.838123384039
CPU times: user 74.3 ms, sys: 5.24 ms, total: 79.5 ms
Wall time: 78.4 ms
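
Lasso can drive coefficients exactly to zero, so one way to read a fitted pipeline is to count how many features survive. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for `X_dummy` and `y`, and the names `X_demo`, `y_demo`, `demo_pipe` and the value `alpha=5.0` are assumptions, so its numbers are not comparable to the results above. On the real pipeline the same `named_steps["est"].coef_` lookup should apply.

```python
import numpy as np
from sklearn import datasets, linear_model, pipeline, preprocessing

# Synthetic stand-in for the house-price data (assumption for illustration).
X_demo, y_demo = datasets.make_regression(
    n_samples=200, n_features=50, n_informative=5,
    noise=10.0, random_state=0)

demo_pipe = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=5.0)),
])
demo_pipe.fit(X_demo, y_demo)

# Coefficients of the final estimator, one per input feature.
coef = demo_pipe.named_steps["est"].coef_
n_kept = int(np.sum(coef != 0))
print("non-zero coefficients:", n_kept, "of", coef.size)
```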


In [23]:
# 5-fold cross-validation of the whole pipeline; for a regressor the
# default score is R2.
scores = model_selection.cross_val_score(pipe, X_dummy, y, cv=5, verbose=True)

np.mean(scores)

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished


0.82456069007898147

In [24]:
# Same preprocessing for both pipelines; only the estimator differs.
pipe1 = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Lasso(alpha=450, tol=0.0001)),  # L1 penalty
])

pipe2 = pipeline.Pipeline([
    ("poly", preprocessing.PolynomialFeatures(degree=1, include_bias=False)),
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Ridge(alpha=40, tol=0.0001)),  # L2 penalty
])

scores1 = model_selection.cross_val_score(pipe1, X_dummy, y, cv=5, verbose=True)
scores2 = model_selection.cross_val_score(pipe2, X_dummy, y, cv=5, verbose=True)

# Mean 5-fold R2 for Lasso and Ridge respectively.
np.mean(scores1), np.mean(scores2)

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


(0.82456069007898147, 0.8179883345147877)
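
The penalties `alpha = 450` and `alpha = 40` are fixed by hand above. One possible way to search for such values is `GridSearchCV` over the pipeline, addressing the estimator's parameter as `est__alpha`. The sketch below runs on synthetic stand-in data; the grid values and the names `X_demo`, `y_demo`, `search_pipe` are arbitrary assumptions, not settings used in this notebook.

```python
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Synthetic stand-in for X_dummy and y (assumption for illustration).
X_demo, y_demo = datasets.make_regression(
    n_samples=200, n_features=30, n_informative=5,
    noise=10.0, random_state=0)

search_pipe = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("est", linear_model.Ridge()),
])

# "<step name>__<parameter>" routes the value to that pipeline step.
param_grid = {"est__alpha": [0.1, 1.0, 10.0, 100.0]}

search = model_selection.GridSearchCV(search_pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["est__alpha"])
print("best mean CV R2:", search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the search does not leak scaling statistics from the validation folds.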