# Competition Description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

# DecisionTree Regressor of Housing Price Competition
Link to competition: \ submitted by: Juan Fakri

This code example uses sklearn's linear_model module to predict the price of a house from some of its characteristics.

The notebook follows these steps:

1. Import libraries and get the data
2. Clean up the data
3. Build, fit, and evaluate sklearn linear regression models that make predictions
4. Run hyperparameter tuning
5. Submit your work



### 1.   Import the Libraries and get the Data

In [1]:
# Author: Juan Pablo Contreras
# This is the notebook used for my submission for the House Prices - Advanced Regression Techniques competition

import math
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

In [2]:
# read the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [3]:
train_data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


## 2. Data Cleaning

- NaN values in the dataset: Yes
- Categorical values in the dataset: Yes
- Numeric values in the dataset: Yes

The only difference between the training dataset and the test dataset is that the test dataset does not contain the SalePrice column. To clean the data, do the following:

- 2.1. The SalePrice column is removed from the training dataset so that the training and test DataFrames have the same columns.
- 2.2. The test DataFrame is appended to the training DataFrame, so that cleanup is applied to both at once.
- 2.3. Categorical values are one-hot encoded using the pandas get_dummies() function.
- 2.4. NaN values are converted to -1.
- 2.5. The dataset is split back into training and test sets.
- 2.6. The training dataset is split into two parts (training and validation) to see how well the trained model performs.
- 2.7. Data are normalized using the training data. Validation and test data are transformed using the normalization parameters fitted on the training data.
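
Steps 2.1 to 2.5 can be sketched on a toy example. The two small DataFrames below are hypothetical stand-ins for train.csv and test.csv, not competition data. The point of stacking the frames before encoding is that get_dummies() then produces the same dummy columns for both, even when a category appears in only one of them.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": ["Pave", np.nan, "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotArea": [11622, np.nan],
    "Alley": [np.nan, "Pave"],
})

# 2.1 separate the target so both frames share the same columns
y = train.pop("SalePrice")
n_train = len(train)

# 2.2 stack train and test so they are encoded identically
combined = pd.concat([train, test], ignore_index=True)

# 2.3 one-hot encode categorical columns, 2.4 replace remaining NaN with -1
combined = pd.get_dummies(combined).fillna(-1)

# 2.5 split back by row position
X_train = combined.iloc[:n_train]
X_test = combined.iloc[n_train:]

print(list(X_train.columns))
print(X_train.shape, X_test.shape)
```

Note that "Grvl" never appears in the toy test frame, yet X_test still has an Alley_Grvl column, which is what a model fitted on X_train expects.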




### 2.1 Delete SalePrice from Training Dataset



In [4]:
# find number of rows
num_rows_train = len(train_data.index)
print("num rows in training data: " + str(num_rows_train))
print("num rows in test data: " + str(len(test_data.index)))

# store and drop the training predictions
y_train = train_data.iloc[:,-1]
train_data.drop(columns=train_data.columns[-1], 
        axis=1, 
        inplace=True)

num rows in training data: 1460
num rows in test data: 1459


### 2.2 Merge Train and Test Data

In [5]:
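# merge all data for cleaning
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
all_data = pd.concat([train_data, test_data], ignore_index=True)
print("num rows in all data: " + str(len(all_data.index)))

# intended to delete the Id column because Id is not a predictor of house price
# NOTE: the result is stored in `ids`, so all_data still contains Id
# (it appears in the outputs below); all_data = all_data.drop(columns='Id') would remove it
ids = all_data.drop('Id', axis=1)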

num rows in all data: 2919




### 2.3 & 2.4 Handle Categorical and NaN values

In [6]:
# convert categorical data to numerical (0 or 1)
all_data = pd.get_dummies(data=all_data)

# replace NaN values
all_data = all_data.fillna(-1)

all_data.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,0,0,0,1,0,0,0,0,1,0


In [7]:
all_data.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,0,0,0,1,0,0,0,0,1,0


In [8]:
all_data.columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       ...
       'SaleType_ConLw', 'SaleType_New', 'SaleType_Oth', 'SaleType_WD',
       'SaleCondition_Abnorml', 'SaleCondition_AdjLand',
       'SaleCondition_Alloca', 'SaleCondition_Family', 'SaleCondition_Normal',
       'SaleCondition_Partial'],
      dtype='object', length=289)

In [9]:
def missing_value_describe(data):
    # Check for missing values in data
    missing_value_stats = (data.isnull().sum() / len(data) * 100)
    missing_value_col_count = sum(missing_value_stats > 0)
    missing_value_stats = missing_value_stats.sort_values(ascending=False)[:missing_value_col_count]
    print("Number of columns with missing values:", missing_value_col_count)
    if missing_value_col_count != 0:
        # Print column names with percentage of missing values
        print("\nPercentage lost (decreasing):")
        print(missing_value_stats)
    else:
        print("No data is lost!!!")

missing_value_describe(all_data)

Number of columns with missing values: 0
No data is lost!!!


### 2.5 Split data back into Train, and Test

In [10]:
# split data back into train and test data
X_train = all_data.iloc[0:num_rows_train,:]
print("num rows in X train: " + str(len(X_train.index)))
print("num rows in y train: " + str(len(y_train.index)))
X_test = all_data.iloc[num_rows_train:,:]
print("num rows in X test: " + str(len(X_test.index)))

print("Training data features head: ")
print(X_train.head())
print()
print("Training data predictions head: ")
print(y_train.head())
print()
print("Test data features head:")
print(X_test.head())

num rows in X train: 1460
num rows in y train: 1460
num rows in X test: 1459
Training data features head: 
   Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0   1          60         65.0     8450            7            5       2003   
1   2          20         80.0     9600            6            8       1976   
2   3          60         68.0    11250            7            5       2001   
3   4          70         60.0     9550            7            5       1915   
4   5          60         84.0    14260            8            5       2000   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  ...  SaleType_ConLw  SaleType_New  \
0          2003       196.0       706.0  ...               0             0   
1          1976         0.0       978.0  ...               0             0   
2          2002       162.0       486.0  ...               0             0   
3          1970         0.0       216.0  ...               0             0   
4          2000       

### 2.6 Split train data into train and validation
The new training set is 80% of the original training dataset, and the validation set is the remaining 20%.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [12]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_val_transf = scaler.transform(X_val)
X_test_transf = scaler.transform(X_test)
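
As a small illustration of step 2.7, the sketch below fits a StandardScaler on a toy training array and reuses the fitted mean and standard deviation on new rows. The arrays are made-up values chosen for easy arithmetic. Note that in this notebook the later cells pass the unscaled X_train and X_val to the models, so the scaled arrays created above are not what the reported scores are based on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data and one "unseen" row (hypothetical values)
X_tr = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_new = np.array([[3.0, 70.0]])

# Fit on the training rows only, then reuse the same parameters everywhere
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_new_std = scaler.transform(X_new)

print(scaler.mean_)            # per-column means learned from X_tr
print(X_tr_std.mean(axis=0))   # training columns are centered at 0
print(X_new_std)               # new row expressed in training units
```

Fitting the scaler on the training rows alone keeps information about the validation and test sets out of the preprocessing step.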

## 3. Build, fit, and evaluate linear regression models
A sklearn linear regression model is used.

1. The model is built using the sklearn library
2. The model is fitted using the training data
3. The model is validated against validation data
4. Predictions are made based on test data 

In [20]:
# linear model
regr = linear_model.LinearRegression()
# fit model using training data
regr.fit(X_train, y_train)

LinearRegression()

### 3.3 Evaluating the Model

<b>How well did the model fit the training data?</b>

First, find the R^2 value of the training data predictions to see how well the linear model fits the training data. 

In [21]:
r2_train = regr.score(X_train,y_train)
print("R^2 on training data: " + str(r2_train))

R^2 on training data: 0.9401454003464302


The model fit is very good: R^2 is about 0.94, close to the maximum value of 1, which would mean that the linear model fits the training data perfectly. Let's hope this model works well on unseen data!
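
For reference, the value returned by score() for a regressor is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a tiny made-up dataset (the numbers are illustrative only) and compares the result with score().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny, nearly linear toy dataset (hypothetical values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.2])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))    # the two values agree
```

An R^2 of 1 means the predictions match the targets exactly, 0 means the model does no better than always predicting the mean, and negative values are possible on data the model was not fitted on.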

<b>How good is the model on new data?</b>

Now that you know how the model behaves on training data, you want to know how it behaves on data it has never seen before. Use the score() function to make predictions on the validation data and evaluate how good those predictions are. 

In [22]:
r2_validation = regr.score(X_val,y_val)
print("R^2 on validation data: " + str(r2_validation))

R^2 on validation data: 0.44377267188435565


The R^2 on the validation dataset is about 0.44, which is disappointing. Next, we predict the sale prices of the homes in the validation dataset and see how far these predictions are from the actual sale prices.

In [27]:
# predict prices on the validation data
pred_val = regr.predict(X_val)
print("Five first predictions: ")
print(pred_val[:5])
print()

Five first predictions: 
[160536.64307101 343707.37847009  89843.95120924 177402.14668826
 320387.72170794]



In [28]:
# Compare the predicted values on the validation set to the actual SalePrices using the root mean squared error
MSE = np.square(np.subtract(y_val, pred_val)).mean()

rmse = math.sqrt(MSE)
print("Root Mean Square Error:\n")
print(rmse)

Root Mean Square Error:

65318.030068267515


This roughly means that, on average, the predicted home sale price is off by about $65,318. This error is huge! Such predictions would probably never be used in the real world. Therefore, we need to do some hyperparameter tuning to make the model better at predicting house prices.
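
The "on average" reading of RMSE is only approximate, because squaring gives large errors more weight. A minimal worked example with two made-up prices shows the arithmetic and compares RMSE with the mean absolute error (MAE).

```python
import numpy as np

# Two hypothetical homes: actual vs. predicted sale price
actual = np.array([200000.0, 150000.0])
predicted = np.array([203000.0, 146000.0])

errors = actual - predicted       # [-3000, 4000]
mse = np.mean(errors ** 2)        # (9e6 + 16e6) / 2 = 12.5e6
rmse = np.sqrt(mse)               # about 3535.53
mae = np.mean(np.abs(errors))     # 3500.0

print(rmse, mae)
```

RMSE is never smaller than MAE, and the gap grows when a few predictions are far off, so a large RMSE can be driven by a small number of badly mispriced homes.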

## 4. Hyperparameter tuning
In this section, we perform some hyperparameter tuning to improve the model's house price predictions. Here's what I do:


* 4.1 What is the problem? 
* 4.2 Solving overfitting using PCA 
* 4.3 Retraining the model 
* 4.4 Evaluating a new model 

<b>What's the problem?</b>

This model fits the training data well (R^2 = 0.94). The problem is that it is very poor at making predictions on new data (R^2 = 0.44) and therefore does not generalize well. A model that fits the training data well but new data poorly is overfitting: it is tuned too closely to the training data. This means that the model relies on features that are very important for predicting house prices in the training set, but that are less important for determining house prices in general. 
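
The gap between training and validation scores can be reproduced on synthetic data. In the sketch below (all data randomly generated, not from the competition), only one of 50 features carries signal, and there are fewer training rows than features, so a plain linear regression can fit the training rows essentially perfectly while doing worse on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 60, 50

# Only the first feature carries signal; the other 49 are pure noise
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
r2_train = model.score(X_tr, y_tr)
r2_val = model.score(X_va, y_va)

print("train R^2:", r2_train)
print("validation R^2:", r2_val)
```

The housing data is less extreme, but the one-hot encoding does produce hundreds of columns for about 1,168 training rows, which gives the model plenty of room to fit noise.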


<b>Use PCA to combat overfitting</b>

One way to deal with overfitting in a linear regression model is to keep the model from learning patterns in the data that are purely random. For this, we use principal component analysis (PCA). PCA can be used to reduce the number of features used for prediction. 

Features (columns) can be interdependent, so PCA combines them into uncorrelated components. The goal is to keep enough components to explain most of the variance in the training data, but not so many that the model learns patterns that are unique to the training data.

We try several numbers of components and check how well each resulting model explains the variance in sale price (R^2). 
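
One common way to judge how many components are worth keeping is the explained_variance_ratio_ attribute of a fitted PCA. The sketch below uses synthetic data built from two hidden factors (an assumption made purely for illustration) and prints the cumulative share of variance captured as components are added.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 5 columns driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=5).fit(X)
ratios = pca.explained_variance_ratio_   # share of variance per component
cumulative = np.cumsum(ratios)           # running total

print(np.round(cumulative, 3))
```

Here the first two components capture nearly all of the variance, which matches how the data was constructed. Keep in mind that this ratio describes the variance of the features only; whether the retained components predict SalePrice well still has to be checked with a validation score.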

In [88]:
# Principal Component Analysis (PCA)
# Create the PCA models
pca_1 = PCA(1)      # use only 1 component
pca_3 = PCA(3)      # use only 3 components
pca_5 = PCA(5)      # use only 5 components
pca_6 = PCA(6)      # use only 6 components
pca_7 = PCA(7)      # use only 7 components
pca_8 = PCA(8)      # use only 8 components
pca_10 = PCA(10)    # use only 10 components

# fit the PCA models
pca_1.fit(X_train)
pca_3.fit(X_train)
pca_5.fit(X_train)
pca_6.fit(X_train)
pca_7.fit(X_train)
pca_8.fit(X_train)
pca_10.fit(X_train)


PCA(n_components=10)

In [89]:
# Transform the data using the PCA models

# using 1 component
pca_1_train_img = pca_1.transform(X_train)
pca_1_val_img = pca_1.transform(X_val)

# using 3 components
pca_3_train_img = pca_3.transform(X_train)
pca_3_val_img = pca_3.transform(X_val)

# using 5 components
pca_5_train_img = pca_5.transform(X_train)
pca_5_val_img = pca_5.transform(X_val)

# using 6 components
pca_6_train_img = pca_6.transform(X_train)
pca_6_val_img = pca_6.transform(X_val)

# using 7 components
pca_7_train_img = pca_7.transform(X_train)
pca_7_val_img = pca_7.transform(X_val)

# using 8 components
pca_8_train_img = pca_8.transform(X_train)
pca_8_val_img = pca_8.transform(X_val)

# using 10 components
pca_10_train_img = pca_10.transform(X_train)
pca_10_val_img = pca_10.transform(X_val)

In [90]:
# Create and fit the new Linear Regression Models to fit the transformed data

# using 1 component
regr_1 = linear_model.LinearRegression()
regr_1.fit(pca_1_train_img, y_train)

# using 3 components
regr_3 = linear_model.LinearRegression()
regr_3.fit(pca_3_train_img, y_train)

# using 5 components
regr_5 = linear_model.LinearRegression()
regr_5.fit(pca_5_train_img, y_train)

# using 6 components
regr_6 = linear_model.LinearRegression()
regr_6.fit(pca_6_train_img, y_train)

# using 7 components
regr_7 = linear_model.LinearRegression()
regr_7.fit(pca_7_train_img, y_train)

# using 8 components
regr_8 = linear_model.LinearRegression()
regr_8.fit(pca_8_train_img, y_train)

# using 10 components
regr_10 = linear_model.LinearRegression()
regr_10.fit(pca_10_train_img, y_train)

LinearRegression()

In [91]:
# Find How good the new Models are

# using 1 component
r2 = regr_1.score(pca_1_train_img, y_train)
print("R^2 of fit on training data for 1 component: " + str(r2))
r2 = regr_1.score(pca_1_val_img, y_val)
print("R^2 of fit on validation data for 1 component: " + str(r2))
print()

# using 3 components
r2 = regr_3.score(pca_3_train_img, y_train)
print("R^2 of fit on training data for 3 components " + str(r2))
r2 = regr_3.score(pca_3_val_img, y_val)
print("R^2 of fit on validation data for 3 components " + str(r2))
print()

# using 5 components
r2 = regr_5.score(pca_5_train_img, y_train)
print("R^2 of fit on training data for 5 components " + str(r2))
r2 = regr_5.score(pca_5_val_img, y_val)
print("R^2 of fit on validation data for 5 components " + str(r2))
print()

# using 6 components
r2 = regr_6.score(pca_6_train_img, y_train)
print("R^2 of fit on training data for 6 components " + str(r2))
r2 = regr_6.score(pca_6_val_img, y_val)
print("R^2 of fit on validation data for 6 components " + str(r2))
print()

# using 7 components
r2 = regr_7.score(pca_7_train_img, y_train)
print("R^2 of fit on training data for 7 components " + str(r2))
r2 = regr_7.score(pca_7_val_img, y_val)
print("R^2 of fit on validation data for 7 components " + str(r2))
print()

# using 8 components
r2 = regr_8.score(pca_8_train_img, y_train)
print("R^2 of fit on training data for 8 components " + str(r2))
r2 = regr_8.score(pca_8_val_img, y_val)
print("R^2 of fit on validation data for 8 components " + str(r2))
print()

# using 10 components
r2 = regr_10.score(pca_10_train_img, y_train)
print("R^2 of fit on training data for 10 components " + str(r2))
r2 = regr_10.score(pca_10_val_img, y_val)
print("R^2 of fit on validation data for 10 components " + str(r2))
print()

R^2 of fit on training data for 1 component: 0.07143060084311081
R^2 of fit on validation data for 1 component: 0.06350840667036806

R^2 of fit on training data for 3 components 0.6075148468250726
R^2 of fit on validation data for 3 components 0.7026626008715731

R^2 of fit on training data for 5 components 0.6205204707232319
R^2 of fit on validation data for 5 components 0.7236616660606632

R^2 of fit on training data for 6 components 0.6206670153574783
R^2 of fit on validation data for 6 components 0.7244516662858383

R^2 of fit on training data for 7 components 0.6217540090610203
R^2 of fit on validation data for 7 components 0.7250567593127974

R^2 of fit on training data for 8 components 0.6239504191395515
R^2 of fit on validation data for 8 components 0.7275278479880866

R^2 of fit on training data for 10 components 0.6512775995792758
R^2 of fit on validation data for 10 components 0.7516055295817703



Using only one component gave very low R^2 values on both the training and validation data. On the other hand, the R^2 on the validation data does not change much from 3 to 10 components. We choose 7 components, a value between 3 and 10, to limit overfitting. 
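
The same comparison can be written more compactly as a loop over component counts, which avoids the copy-and-paste labels. This is a minimal sketch on synthetic data from make_regression (an assumption; it is not the housing data), using make_pipeline so that PCA and the regression are fitted together.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression problem standing in for the housing features
X, y = make_regression(n_samples=300, n_features=30, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for n in (1, 3, 5, 6, 7, 8, 10):
    # PCA is fitted on the training rows only, then feeds the linear model
    model = make_pipeline(PCA(n_components=n), LinearRegression())
    model.fit(X_tr, y_tr)
    scores[n] = (model.score(X_tr, y_tr), model.score(X_va, y_va))

for n, (r2_tr, r2_va) in scores.items():
    print(f"{n:>2} components: train R^2 = {r2_tr:.3f}, validation R^2 = {r2_va:.3f}")
```

With the pipeline, model.predict() applies the fitted PCA transform automatically, so the test data does not need a separate transform step.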

<b>Submission</b>
First, predict the SalePrice of the homes in the test dataset. Then write the predictions to the working directory /kaggle/working. 

In [103]:
# transform the test data so that it only has 7 components
pca_7_test_img = pca_7.transform(X_test)

In [104]:
# predict prices of houses in the test dataset
pred_test_7 = regr_7.predict(pca_7_test_img)
print("First five predictions for test dataset: ")
print(pred_test_7[:5])
print()

First five predictions for test dataset: 
[125152.29591648 158436.28549833 186369.0416353  181869.61110209
 168184.60899383]



In [105]:
# submission
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['SalePrice'] = pred_test_7
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head(10)

Unnamed: 0,Id,SalePrice
0,1461,125152.295916
1,1462,158436.285498
2,1463,186369.041635
3,1464,181869.611102
4,1465,168184.608994
5,1466,168392.627254
6,1467,164611.734966
7,1468,153729.075397
8,1469,180267.001026
9,1470,126091.964269
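
If sample_submission.csv is not at hand, an equivalent file can be built directly from the test Ids and the predictions. The sketch below uses three hypothetical Ids and placeholder prices, and writes to an in-memory buffer so that it runs anywhere; in the notebook, the Ids would come from the Id column of test_data and the prices from pred_test_7.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data['Id'] and the model predictions
test_ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([125000.0, 158000.0, 186000.0])

# One row per test home, with exactly the two expected columns
submission = pd.DataFrame({"Id": test_ids, "SalePrice": preds})

# Write without the DataFrame index so the file has only Id and SalePrice
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Replacing the buffer with a filename such as 'submission.csv' writes the file to the current working directory.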
