# House Prices - Advanced Regression Techniques
## Predict sales prices and practice feature engineering, RFs, and gradient boosting

This competition is aimed at data science students who have completed an online course in machine learning and want to expand their skill set before trying a featured competition.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

### Acknowledgments

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

Photo by Tom Thain on Unsplash.


### Analysing the data

In [1]:
import pandas as pd

train_house_data = pd.read_csv('data/train.csv')
test_house_data = pd.read_csv('data/test.csv')

train_house_data.shape

(1460, 81)

In [2]:
train_house_data.head()

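| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |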


In [3]:
test_house_data.shape

(1459, 80)

### Preprocessing the data

1. Find the columns with the most NULL values.

In [4]:
train_house_data.isnull().sum().sort_values(ascending=False).head(25)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageCond        81
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
BsmtExposure      38
BsmtFinType2      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
Utilities          0
YearRemodAdd       0
MSSubClass         0
Foundation         0
ExterCond          0
ExterQual          0
dtype: int64

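2. Remove the columns with the most missing data. The six columns dropped below are each missing roughly 18% or more of their values (LotFrontage, at 259 of 1460 rows, is the least sparse of them).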

In [5]:
house_columns_Removed = ['PoolQC',           
'MiscFeature',      
'Alley',            
'Fence',
'FireplaceQu',
'LotFrontage']

# Pass axis as a keyword: pandas 2.x no longer accepts it as a positional argument.
processed_train_house_data = train_house_data.drop(house_columns_Removed, axis=1)
processed_test_house_data = test_house_data.drop(house_columns_Removed, axis=1)

# Display columns that still have null values.
processed_train_house_data.isnull().sum().sort_values(ascending=False).head(25)

GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageCond      81
GarageQual      81
BsmtExposure    38
BsmtFinType2    38
BsmtFinType1    37
BsmtCond        37
BsmtQual        37
MasVnrType       8
MasVnrArea       8
Electrical       1
RoofMatl         0
RoofStyle        0
SalePrice        0
Exterior1st      0
Exterior2nd      0
YearBuilt        0
ExterQual        0
ExterCond        0
Foundation       0
YearRemodAdd     0
HouseStyle       0
OverallCond      0
dtype: int64

In [6]:
processed_train_house_data.shape

(1460, 75)
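The column list above was written out by hand. As a minimal sketch, the same kind of selection can be derived from the missing-value fractions. The toy frame and the threshold value here are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0, 350.0],
})

# Fraction of missing values per column (mean of the boolean null mask).
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns where more than half the values are missing.
threshold = 0.5
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=columns_to_drop)
print(columns_to_drop)
print(reduced.shape)
```

Applying the same list of columns to both the training and the test frame keeps the two datasets consistent, as the notebook does.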

3. Drop rows which contain missing values.

In [7]:
processed_train_house_data = processed_train_house_data.dropna(axis=0)
processed_train_house_data.shape

(1338, 75)

### Attributes Selection

For this task, the J48 decision tree view in Weka was used. The resulting decision tree places the attributes that are most valuable for making decisions near the top, so those attributes were selected as the relevant attributes for this regression task.

In [7]:
house_features = ['YearBuilt',
'YearRemodAdd',
'LotArea',
'LotShape',
'LotConfig',
'MSSubClass',
'MSZoning',
'Neighborhood',
'OverallQual',
'OverallCond',
'HeatingQC',
'HouseStyle',
'Exterior1st',
'Exterior2nd',
'TotRmsAbvGrd',
'BedroomAbvGr',
'RoofStyle',
'MasVnrArea',
'Fireplaces',
'BldgType',
'BsmtExposure',
'GarageFinish',
'FullBath',
'MasVnrType',
'Condition1']
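The selection above was done in Weka, which is not reproduced here. As a rough analogue only, a tree-based ranking can be sketched in Python with scikit-learn's DecisionTreeRegressor and its feature_importances_ attribute. This swaps in a different algorithm from J48, and the synthetic data below is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic features: two informative columns and one pure-noise column.
toy_X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "LotArea": rng.integers(5000, 15000, size=n),
    "Noise": rng.normal(size=n),
})

# Synthetic price driven mostly by OverallQual (made-up relationship).
toy_y = (20000 * toy_X["OverallQual"]
         + 2 * toy_X["LotArea"]
         + rng.normal(scale=1000, size=n))

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(toy_X, toy_y)

# Importances sum to 1; features used in high-impact splits rank first.
importances = pd.Series(tree.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features that the tree splits on early and often receive the largest importances, which mirrors the idea of reading the most valuable attributes from the top of the tree.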

Select the training and test datasets using the relevant attributes.

In [None]:
train_house_DF = processed_train_house_data[house_features].copy()
#processed_train_house_data.isnull().sum().sort_values(ascending=False).head(25)
train_house_DF.shape

(1338, 25)

In [8]:
test_house_DF = processed_test_house_data[house_features].copy()
#processed_test_house_data.isnull().sum().sort_values(ascending=False).head(25)
test_house_DF.shape

(1459, 25)

#### One Hot Encoding

For categorical variables with no ordinal relationship, integer encoding is not enough, because it would imply an ordering between categories that does not exist. Instead, encode the categorical features as a one-hot numeric array: each categorical column is removed and a new binary column is added for each of its unique values.
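A small illustration of what one-hot encoding does, using a made-up two-column frame. Only the object column is expanded, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg"],
    "OverallQual": [7, 6, 8],
})

# get_dummies replaces LotShape with one indicator column per category.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one active indicator among the LotShape_* columns, so no artificial ordering between Reg and IR1 is introduced.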

In [9]:
train_house_DF.dtypes

YearBuilt         int64
YearRemodAdd      int64
LotArea           int64
LotShape         object
LotConfig        object
MSSubClass        int64
MSZoning         object
Neighborhood     object
OverallQual       int64
OverallCond       int64
HeatingQC        object
HouseStyle       object
Exterior1st      object
Exterior2nd      object
TotRmsAbvGrd      int64
BedroomAbvGr      int64
RoofStyle        object
MasVnrArea      float64
Fireplaces        int64
BldgType         object
BsmtExposure     object
GarageFinish     object
FullBath          int64
MasVnrType       object
Condition1       object
dtype: object

In [10]:
OHE_train_house_data = pd.get_dummies(train_house_DF)
OHE_train_house_data.shape

(1338, 124)

In [11]:
OHE_train_house_data.head()

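| | YearBuilt | YearRemodAdd | LotArea | MSSubClass | OverallQual | OverallCond | TotRmsAbvGrd | BedroomAbvGr | MasVnrArea | Fireplaces | ... | MasVnrType_Stone | Condition1_Artery | Condition1_Feedr | Condition1_Norm | Condition1_PosA | Condition1_PosN | Condition1_RRAe | Condition1_RRAn | Condition1_RRNe | Condition1_RRNn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 2003 | 8450 | 60 | 7 | 5 | 8 | 3 | 196.0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1976 | 1976 | 9600 | 20 | 6 | 8 | 6 | 3 | 0.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2001 | 2002 | 11250 | 60 | 7 | 5 | 6 | 3 | 162.0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1915 | 1970 | 9550 | 70 | 7 | 5 | 7 | 3 | 0.0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2000 | 2000 | 14260 | 60 | 8 | 5 | 9 | 4 | 350.0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |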


In [12]:
OHE_test_house_data = pd.get_dummies(test_house_DF)
#OHE_test_house_data = OHE_test_house_data.fillna(0)
OHE_test_house_data.shape

(1459, 121)

In [None]:
OHE_test_house_data.head()

Fill NA values with the mean of the corresponding column.

In [13]:
OHE_test_house_data = OHE_test_house_data.fillna(OHE_test_house_data.mean())
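A toy sketch of the mean imputation used above, with made-up values. Passing the Series returned by mean() to fillna fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (values are made up).
toy = pd.DataFrame({
    "MasVnrArea": [100.0, np.nan, 300.0],
    "Fireplaces": [0, 1, 2],
})

# toy.mean() is a Series indexed by column name; fillna matches on that index.
filled = toy.fillna(toy.mean())
print(filled)
```

Note that this imputes with statistics of the frame being filled. Using the training-set means for the test set instead is a common alternative, but it is not what the notebook does.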

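Align both datasets so that they have the same columns. With join='left', the training columns are kept; any of them missing from the test set are added there as NaN, and test-only columns are dropped.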

In [14]:
final_train, final_test = OHE_train_house_data.align(OHE_test_house_data, join='left', axis=1)

In [15]:
final_test.shape

(1459, 124)

In [16]:
final_train.shape

(1338, 124)
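The effect of the alignment step can be seen on two small frames whose dummy columns differ. The frames and column names below are made up for illustration.

```python
import pandas as pd

# Toy encoded frames: train has B_y, test has B_z instead.
train_toy = pd.DataFrame({"A": [1, 2], "B_x": [1, 0], "B_y": [0, 1]})
test_toy = pd.DataFrame({"A": [3], "B_x": [1], "B_z": [1]})

# join='left' keeps the columns of the left frame (train) for both results.
aligned_train, aligned_test = train_toy.align(test_toy, join="left", axis=1)

# B_y did not exist in the test frame, so it appears there as NaN.
print(aligned_test)

# Filling with 0 marks the category as absent for every test row.
aligned_test = aligned_test.fillna(0)
print(aligned_test)
```

This is why the aligned test set still needs a fillna(0) before prediction: the columns created by the alignment carry no values of their own.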

## Linear Regression

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. It is a classic machine learning model for predicting house prices.

Since this is supervised learning, the labelled target y must be defined first. Then define the training set X.

In [None]:
y = processed_train_house_data.SalePrice

In [None]:
#from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#scaler.fit(final_train)
#X = scaler.transform(final_train)

X = final_train

Use sklearn for the Linear Regression model and for the supporting utilities, such as train_test_split and mean_absolute_error.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Split data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
salePrice_model = LinearRegression()

# Fit model
salePrice_model.fit(train_X, train_y)

# get predicted prices on validation data
salePrice_predictions = salePrice_model.predict(val_X)

Calculate the Mean Absolute Error for the generated house price predictions.

In [17]:
print(mean_absolute_error(val_y, salePrice_predictions))

24673.84602615535
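For intuition, the Mean Absolute Error is the average of the absolute differences between true and predicted prices. A small check with made-up numbers shows that the manual computation agrees with sklearn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted prices.
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# Absolute errors are 10000, 10000 and 30000; their mean is the MAE.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Read this way, the value printed above means the validation predictions are off by about 24,700 in the units of SalePrice on average.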


The score method returns the coefficient of determination ${R^2}$ of the prediction, computed below on the training split.
The coefficient ${R^2}$ is defined as ${1-\frac{u}{v}}$, where $u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().

The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a ${R^2}$ score of 0.0.
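The definition can be checked by hand on a few made-up numbers, including the constant model that always predicts the mean:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2 = 1 - u / v

# A constant model that always predicts the mean of y_true.
constant_pred = np.full_like(y_true, y_true.mean())
r2_constant = 1 - ((y_true - constant_pred) ** 2).sum() / v

print(r2, r2_score(y_true, y_pred), r2_constant)
```

Here u is 0.5 and v is 20, so the score is 0.975, while the constant model has u equal to v and therefore scores exactly 0.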

In [21]:
salePrice_model.score(train_X,train_y)

0.8603137068302544
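The score above is measured on the same rows the model was fitted on. As a hedged sketch on synthetic data (the data, coefficients and toy_* names are assumptions, not the notebook's), the same score method can also be applied to the held-out split, which is the usual way to estimate performance on unseen data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic linear data with a little noise (made up for illustration).
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    X_toy, y_toy, random_state=0
)

model = LinearRegression().fit(toy_train_X, toy_train_y)

# R^2 on the rows used for fitting versus rows the model has not seen.
train_r2 = model.score(toy_train_X, toy_train_y)
val_r2 = model.score(toy_val_X, toy_val_y)
print(train_r2, val_r2)
```

In the notebook, the equivalent call would be salePrice_model.score(val_X, val_y). A validation score well below the training score would suggest overfitting.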

## Price Prediction of Test Dataset

In [22]:
#X_Predict = final_test.fillna(0)
#scaler.fit(X_Predict)
#X_Predict = scaler.transform(X_Predict)
#X_Predict = preprocessing.scale(X_Predict)
final_test.isnull().sum().sort_values(ascending=False)
X_Predict = final_test.fillna(0)

In [23]:
predictions = salePrice_model.predict(X_Predict)
predictions

array([110641.20231566, 149594.00682913, 147005.00914765, ...,
       157435.15027046, 120949.25927442, 241876.2271057 ])

Attach the predicted prices to the test dataset.

In [25]:
test_house_data['SalePrice'] = predictions

In [26]:
test_house_data

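| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 110641.202316 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 149594.006829 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 147005.009148 |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | 178434.423910 |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal | 221295.052909 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 2915 | 160 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal | 70466.480980 |
| 1455 | 2916 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Abnorml | 90382.570898 |
| 1456 | 2917 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Abnorml | 157435.150270 |
| 1457 | 2918 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal | 120949.259274 |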


In [27]:
# Select the Id and predicted SalePrice columns
results = test_house_data[['Id','SalePrice']]

In [28]:
results

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 110641.202316 |
| 1 | 1462 | 149594.006829 |
| 2 | 1463 | 147005.009148 |
| 3 | 1464 | 178434.423910 |
| 4 | 1465 | 221295.052909 |
| ... | ... | ... |
| 1454 | 2915 | 70466.480980 |
| 1455 | 2916 | 90382.570898 |
| 1456 | 2917 | 157435.150270 |
| 1457 | 2918 | 120949.259274 |


Save results in a CSV file

In [30]:
results.to_csv('Results.csv', index=False)
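A quick way to sanity-check the written file is to read it back and confirm the header and row count. The sketch below uses an in-memory buffer and two made-up rows instead of the real Results.csv.

```python
import io
import pandas as pd

# Two made-up result rows in the same layout as the results frame.
toy_results = pd.DataFrame({
    "Id": [1461, 1462],
    "SalePrice": [110641.20, 149594.01],
})

# index=False keeps the row index out of the file, leaving only Id and SalePrice.
buffer = io.StringIO()
toy_results.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the same columns and row count.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.shape)
```

For the real file, the same check would be pd.read_csv('Results.csv'), which should return one row per test Id.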