# Predicting House Prices 

![alt text](https://image.flaticon.com/icons/svg/70/70016.svg)

[Leonardo Fuchs](https://www.kaggle.com/leofuchs) - May 2020

---

In this notebook, we will go through the basic steps of making predictions based on a given dataset. The fifteen steps that I followed in this notebook are as follows:

**01)** Importing Libraries and Datasets

**02)** Data Description

**03)** Finding Correlation Features

**04)** Removing Outliers

**05)** Imputation of Missing Values

**06)** Correcting Features

**07)** Adding Features

**08)** Skewness and Kurtosis

**09)** Label Encoding

**10)** Transformation and Scaling

**11)** Feature Selection

**12)** Principal Component Analysis

**13)** Testing Different Models

**14)** Hyper-Parameter Tuning

**15)** Making Predictions and Submission


*Observation: This Notebook is based on the notebook presented by [Majdoubi Ahmed Amine](https://www.kaggle.com/mjidiba).*

## 01) Importing Libraries and Datasets

In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import scipy.stats as st
import seaborn as sns
import warnings

%matplotlib inline

pd.options.display.max_columns = None
warnings.filterwarnings('ignore')
color = sns.color_palette()
sns.set_style('darkgrid')

# Importing the train and test datasets in pandas dataframes
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

# Drop the 'Id' column from the train dataframe
train_data.drop(columns='Id', inplace=True)

y_train = train_data['SalePrice']

## 02) Data Description

Let's start by taking a general look at the data to get an initial idea of what we have.

In [None]:
# The shape of the data
train_data.shape, test_data.shape, y_train.shape

In [None]:
# Display the first five rows of the training dataset.
train_data.head()

In [None]:
# The description of the train dataset
train_data.describe()

In [None]:
# Looking at the types of the columns in the dataset
train_data.info()

## 03) Finding Correlation Features

Let's look at how the numeric features correlate with `SalePrice` and with each other.

In [None]:
# Showing the numerical variables with the highest correlation with 'SalePrice', sorted from highest to lowest
correlation = train_data.select_dtypes(include=[np.number]).corr()

print(correlation['SalePrice'].sort_values(ascending=False))

In [None]:
# Heatmap of correlation of numeric features
fig, ax = plt.subplots(figsize = (14,14))

plt.title('Correlation Between Numeric Features', size=15)
sns.heatmap(correlation, square=True, vmax=0.8, cmap='coolwarm', linewidths=0.01);

- We observe two red blocks **(2x2 and 3x3)** in the heatmap, indicating high correlation. The first group of highly correlated variables is `TotalBsmtSF` and `1stFlrSF`. The second group is `GarageYrBlt`, `GarageCars` and `GarageArea`. This indicates the presence of multicollinearity.
- The other four red squares **(1x1)** indicate an unsurprising correlation between `GarageYrBlt` and `YearBuilt` and between `TotRmsAbvGrd` and `GrLivArea`.

In [None]:
# Zoomed heatmap of the most correlated variables
zoomed_correlation = correlation.loc[['SalePrice','GrLivArea','TotalBsmtSF','OverallQual','FullBath','TotRmsAbvGrd','YearBuilt', 'YearRemodAdd', '1stFlrSF','GarageYrBlt','GarageCars','GarageArea'],
                                     ['SalePrice','GrLivArea','TotalBsmtSF','OverallQual','FullBath','TotRmsAbvGrd','YearBuilt', 'YearRemodAdd', '1stFlrSF','GarageYrBlt','GarageCars','GarageArea']]

fig , ax = plt.subplots(figsize = (14,14))
plt.title('Zoomed Correlation Between Numeric Features', size=15)
sns.heatmap(zoomed_correlation, square=True, vmax=0.8, annot=True, cmap='coolwarm', linewidths=0.01);

We conclude that:
- `TotalBsmtSF` and `1stFlrSF` are strongly correlated (0.82)
- `TotRmsAbvGrd` and `GrLivArea` are strongly correlated (0.83)
- `YearBuilt` and `GarageYrBlt` are strongly correlated (0.83)
- `GarageCars` and `GarageArea` are strongly correlated (0.88)
- `OverallQual` and `GrLivArea` are correlated with `SalePrice` (0.79 and 0.71)
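
Reading pairs off an annotated heatmap works for a dozen features, but the same list can also be extracted programmatically. The sketch below is only an illustration on invented toy data: the helper name `top_correlated_pairs` and the 0.8 threshold are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data, one could pass `train_data.select_dtypes(include=[np.number])` to the same helper to list candidate multicollinear pairs.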

In [None]:
# Pair plot
cols = ['SalePrice','GrLivArea','TotalBsmtSF','OverallQual','FullBath','TotRmsAbvGrd','YearBuilt', 'YearRemodAdd', '1stFlrSF','GarageYrBlt','GarageCars','GarageArea']

sns.set()
# 'size' was renamed to 'height' in seaborn 0.9
sns.pairplot(train_data[cols], height=2, kind='scatter', diag_kind='kde');

- We observe that `SalePrice` increases almost quadratically with `TotalBsmtSF`, `GrLivArea` and `1stFlrSF`. So the price of a house appears to increase roughly quadratically with its surface area.
- We also observe that `SalePrice` increases roughly exponentially with `OverallQual`.
- In the (`GrLivArea`, `1stFlrSF`) and (`1stFlrSF`, `TotalBsmtSF`) plots, the points lie above the identity line. This means that the above-ground living area is at least as large as the first floor area, and that the first floor area is generally larger than the basement area.
- We observe the same pattern for (`GarageYrBlt`, `YearBuilt`), which makes sense since the garage is usually built with or after the house, although there are some exceptions in the data.

## 04) Removing Outliers

From the previous pair plots, we can see that there are outliers for `TotalBsmtSF`, `1stFlrSF` and `GrLivArea`. Let's use scatter plots to observe these outliers more precisely.

In [None]:
plt.figure(figsize=(25,5))

ax1 = plt.subplot(1, 3, 1)
plt.scatter(x=train_data.TotalBsmtSF, y=train_data.SalePrice)
plt.title('TotalBsmtSF x SalePrice', size=15)

ax2 = plt.subplot(1, 3, 2)
plt.scatter(x=train_data['1stFlrSF'], y=train_data.SalePrice)
plt.title('1stFlrSF x SalePrice', size=15)

ax3 = plt.subplot(1, 3, 3)
plt.scatter(x = train_data.GrLivArea, y=train_data.SalePrice)
plt.title('GrLivArea x SalePrice', size=15)

plt.show()

In [None]:
print(train_data.shape)

# Removing the outliers found (the printed shapes show how many rows are dropped)
train_data.drop(train_data[train_data['TotalBsmtSF'] > 5000].index, inplace=True)
train_data.drop(train_data[train_data['1stFlrSF'] > 4000].index,inplace=True)
train_data.drop(train_data[(train_data['GrLivArea'] > 4000) & (train_data['SalePrice'] < 300000)].index, inplace = True)

print(train_data.shape)

Since only two rows were dropped even though three filters were applied, the filters overlap: the same rows are outliers in more than one of these features.
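
To make the overlap explicit, we can count the rows matched by each filter and the rows matched by their union. This is a minimal sketch on a four-row toy dataframe whose values are invented for illustration; only the thresholds are taken from the cell above.

```python
import pandas as pd

# Toy data (hypothetical values): row 1 is extreme in all three features, row 2 only in GrLivArea
toy = pd.DataFrame({
    'TotalBsmtSF': [800, 6000, 900, 1000],
    '1stFlrSF':    [850, 4600, 950, 1100],
    'GrLivArea':   [1500, 5600, 4700, 1600],
    'SalePrice':   [150000, 160000, 185000, 170000],
})

# Same thresholds as in the outlier-removal cell
mask_bsmt = toy['TotalBsmtSF'] > 5000
mask_first = toy['1stFlrSF'] > 4000
mask_living = (toy['GrLivArea'] > 4000) & (toy['SalePrice'] < 300000)

# Rows matched by each filter on its own
per_filter = [int(mask_bsmt.sum()), int(mask_first.sum()), int(mask_living.sum())]

# Rows matched by at least one filter: this is what is actually dropped
any_outlier = mask_bsmt | mask_first | mask_living
n_dropped = int(any_outlier.sum())

cleaned = toy[~any_outlier]
print(per_filter, n_dropped, cleaned.shape)
```

Combining the masks first and dropping once also keeps the removal in a single, easy-to-audit step.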

## 05) Imputation of Missing Values

Let's look at the missing values in our data. We will be using the `missingno` library (imported as `msno`). This library provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that give a quick visual summary of the completeness (or lack thereof) of a dataset.

In [None]:
# Visualising missing values of numeric features
msno.matrix(train_data.select_dtypes(include=[np.number]));

In [None]:
# Visualising percentage of missing values of the top 5 numeric variables
total = train_data.select_dtypes(include=[np.number]).isnull().sum().sort_values(ascending=False)
percent = (train_data.select_dtypes(include=[np.number]).isnull().sum() / train_data.select_dtypes(include=[np.number]).isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, join='outer', keys=['Missing Count', 'Missing Percentage'])
missing_data.index.name=' Numeric Feature'
missing_data.head(5)

We observe that `LotFrontage`, `GarageYrBlt` and `MasVnrArea` are the only numeric features with missing values.

In [None]:
# Visualising missing values of categorical features
# 'np.object' was removed from NumPy; the string 'object' selects the same columns
msno.matrix(train_data.select_dtypes(include=['object']));

In [None]:
# Visualising percentage of missing values of the top 20 categorical variables
total = train_data.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)
percent = (train_data.select_dtypes(include=['object']).isnull().sum() / train_data.select_dtypes(include=['object']).isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1,join='outer', keys=['Missing Count', 'Missing Percentage'])
missing_data.index.name =' Object Feature'
missing_data.head(20)

We observe that `PoolQC`, `MiscFeature`, `Alley`, `Fence` and `FireplaceQu` have a significant amount of missing values.

Before imputing, we combine the training and test sets into a single dataset so that missing values are replaced consistently in both.

In [None]:
# Concatenate the training and test datasets into a single dataframe
data_full = pd.concat([train_data,test_data], ignore_index=True)
data_full.drop('Id', axis=1, inplace=True)

data_full.shape

In [None]:
# Sum of missing values by numeric features
sum_missing_values = data_full.select_dtypes(include=[np.number]).isnull().sum()

sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

In [None]:
# Numeric features with small number of NaNs: replace with 0
# (assigning the result back avoids relying on in-place fillna on a column selection)
for col in ['BsmtHalfBath', 'BsmtFullBath', 'GarageArea', 'GarageCars', 'TotalBsmtSF', 'BsmtUnfSF', 'BsmtFinSF2', 'BsmtFinSF1']:
    data_full[col] = data_full[col].fillna(0)

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=[np.number]).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

Since `MasVnrArea` has only 23 missing values, we can replace them with the mean of the column.

In [None]:
# Numeric features with medium number of NaNs: replace with the mean
data_full['MasVnrArea'] = data_full['MasVnrArea'].fillna(data_full['MasVnrArea'].mean())

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=[np.number]).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

Based on the previous correlation heatmap, `GarageYrBlt` is highly correlated with `YearBuilt`. Since `YearBuilt` is a numeric variable, we first cut it into ten quantile-based bins, then replace each missing `GarageYrBlt` with the median `GarageYrBlt` of the houses in the same `YearBuilt` bin.
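
The binning-then-group-median idea is easier to follow on a tiny example. The sketch below uses eight invented rows and two bins instead of ten; the column names match the dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two eras of construction, one missing garage year in each
toy = pd.DataFrame({
    'YearBuilt':   [1950, 1952, 1955, 1958, 2000, 2002, 2005, 2008],
    'GarageYrBlt': [1950, np.nan, 1956, 1960, 2000, 2003, np.nan, 2008],
})

# Bin the numeric grouping variable into quantile-based intervals (2 bins here)
toy['YearBuiltCut'] = pd.qcut(toy['YearBuilt'], 2)

# Within each bin, fill missing garage years with that bin's median garage year
toy['GarageYrBlt'] = (
    toy.groupby('YearBuiltCut', observed=True)['GarageYrBlt']
       .transform(lambda s: s.fillna(s.median()))
)

# The helper column is no longer needed
toy = toy.drop(columns='YearBuiltCut')
print(toy)
```

Each missing value is filled from houses of a similar age, which is usually closer to the truth than a single global median.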

In [None]:
# Cut 'YearBuilt' into 10 quantile-based bins
data_full['YearBuiltCut'] = pd.qcut(data_full['YearBuilt'], 10)

# Impute the missing values of 'GarageYrBlt' with the median 'GarageYrBlt' of each 'YearBuilt' bin
data_full['GarageYrBlt'] = data_full.groupby(['YearBuiltCut'])['GarageYrBlt'].transform(lambda x : x.fillna(x.median()))

# Convert the values to integers
data_full['GarageYrBlt'] = data_full['GarageYrBlt'].astype(int)

# Drop 'YearBuiltCut' column
data_full.drop('YearBuiltCut', axis=1, inplace=True)

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=[np.number]).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

`LotFrontage` is related to `LotArea` and to `Neighborhood` (a categorical feature, so it does not appear in the heatmap). So let's use the same method to fill the missing values, grouping by both variables.

In [None]:
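# Cut 'LotArea' into 10 quantile-based bins
data_full['LotAreaCut'] = pd.qcut(data_full['LotArea'], 10)

# Impute the missing values of 'LotFrontage' with the median of each ('LotArea' bin, 'Neighborhood') group,
# then fall back to the median of the 'LotArea' bin for groups that have no observed value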
data_full['LotFrontage'] = data_full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x : x.fillna(x.median()))
data_full['LotFrontage'] = data_full.groupby(['LotAreaCut'])['LotFrontage'].transform(lambda x : x.fillna(x.median()))

# Drop 'LotAreaCut' column
data_full.drop('LotAreaCut',axis=1,inplace=True)

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=[np.number]).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

The only missing values left are in `SalePrice`, and their count is exactly the number of rows in the test data (the values that we need to predict).

In [None]:
# Sum of missing values by feature (object)
sum_missing_values = data_full.select_dtypes(include=['object']).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

In [None]:
# Categorical features with fewer than 5 missing values: replace with the mode (most frequently occurring value)
for col in ['MSZoning', 'Functional', 'Utilities', 'Exterior1st', 'SaleType', 'Exterior2nd', 'KitchenQual', 'Electrical']:
    data_full[col] = data_full[col].fillna(data_full[col].mode()[0])

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=['object']).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

In [None]:
# Categorical features with more than 5 missing values: replace with 'None'
for col in ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','GarageQual','GarageCond','GarageFinish','GarageType','BsmtExposure','BsmtCond','BsmtQual','BsmtFinType2','BsmtFinType1','MasVnrType']:
    data_full[col] = data_full[col].fillna('None')

# Check if missing values are imputed successfully
sum_missing_values = data_full.select_dtypes(include=['object']).isnull().sum()
sum_missing_values[sum_missing_values > 0].sort_values(ascending=False)

## 06) Correcting Features

If we take a look at the numeric variables, we see that some of them, such as the year-related features, don't make sense as numeric values. Let's take a closer look at each one of them in the data description file and see which ones need to be converted to a categorical type.

In [None]:
data_full.select_dtypes(include=[np.number]).columns

In [None]:
# Converting numeric features to categorical features
str_cols = ['YrSold','YearRemodAdd','YearBuilt','MoSold','MSSubClass','GarageYrBlt']

for col in str_cols:
    data_full[col] = data_full[col].astype(str)

## 07) Adding Features

First, we will map the categorical variables that represent some sort of rating to an integer score.
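
One thing to keep in mind with `Series.map` and a dictionary: any category that is missing from the dictionary silently becomes `NaN`. The sketch below shows a small guard against that. The helper name `map_ordinal` and the toy values are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def map_ordinal(series, mapping):
    """Map categories to integer scores, failing loudly on unmapped categories."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped categories: {sorted(unmapped)}")
    return series.map(mapping)

# A five-level quality scale, ordered from worst to best
quality_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

toy = pd.Series(['TA', 'Gd', 'Ex', 'Fa'])
scores = map_ordinal(toy, quality_scale)
print(scores.tolist())

# A plain .map silently yields NaN for a category that is missing from the dictionary
silent = pd.Series(['TA', 'Po']).map({'TA': 2})
print(silent.isna().tolist())
```

Checking `data_full[col].unique()` against the keys of each dictionary, as done for `GarageCond` in the next cells, serves the same purpose by hand.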

In [None]:
data_full.select_dtypes(include=['object']).columns

In [None]:
data_full['GarageCond'].unique()

In [None]:
# ExterQual = Evaluates the quality of the material on the exterior: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor)
data_full["oExterQual"] = data_full['ExterQual'].map({'Fa':1, 'TA':2, 'Gd':3, 'Ex':4})

# ExterCond = Evaluates the present condition of the material on the exterior: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor)
data_full["oExterCond"] = data_full['ExterCond'].map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})

# BsmtQual = Evaluates the height of the basement: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor), NA(No Basement)
data_full["oBsmtQual"] = data_full['BsmtQual'].map({'None':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})

# BsmtExposure = Refers to walkout or garden level walls: Gd(Good), Av(Average), Mn(Minimum), No(No Exposure), NA(No Basement)
# Each level gets its own score ('Mn' and 'Av' previously shared the value 3)
data_full["oBsmtExposure"] = data_full['BsmtExposure'].map({'None':1, 'No':2, 'Mn':3, 'Av':4, 'Gd':5})

# BsmtCond = Evaluates the general condition of the basement: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor), NA(No Basement)
# Each level gets its own score ('Fa' and 'TA' previously shared the value 3)
data_full["oBsmtCond"] = data_full['BsmtCond'].map({'None':1, 'Po':2, 'Fa':3, 'TA':4, 'Gd':5})

# HeatingQC = Heating quality and condition: Ex(Excellent), Gd(Good), TA(Average), Fa(Fair), Po(Poor)
data_full["oHeatingQC"] = data_full['HeatingQC'].map({'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5})

# KitchenQual: Kitchen quality: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor)
data_full["oKitchenQual"] = data_full['KitchenQual'].map({'Fa':1, 'TA':2, 'Gd':3, 'Ex':4})

# FireplaceQu: Fireplace quality: Ex(Excellent), Gd(Good), TA(Average), Fa(Fair), Po(Poor), NA(No Fireplace)
data_full["oFireplaceQu"] = data_full['FireplaceQu'].map({'None':1, 'Po':2, 'Fa':3, 'TA':4, 'Gd':5, 'Ex':6})

# GarageFinish: Interior finish of the garage: Fin(Finished), RFn(Rough Finished), Unf(Unfinished), NA(No Garage)
data_full["oGarageFinish"] = data_full['GarageFinish'].map({'None':1, 'Unf':2, 'RFn':3, 'Fin':4})

# GarageQual: Garage quality: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor), NA(No Garage)
data_full["oGarageQual"] = data_full['GarageQual'].map({'None':1, 'Po':2, 'Fa':3, 'TA':4, 'Gd':5, 'Ex':6})

# GarageCond: Garage condition: Ex(Excellent), Gd(Good), TA(Typical), Fa(Fair), Po(Poor), NA(No Garage)
data_full["oGarageCond"] = data_full['GarageCond'].map({'None':1, 'Po':2, 'Fa':3, 'TA':4, 'Gd':5, 'Ex':6})

# PavedDrive: Paved driveway: Y(Paved), P(Partial Pavement), N(Dirt)
data_full["oPavedDrive"] = data_full['PavedDrive'].map({'N':1, 'P':2, 'Y':3})

Next, we will add up some related numeric features to create new, more meaningful features.

In [None]:
data_full.select_dtypes(include=[np.number]).columns

In [None]:
# House square feet = First floor square feet + Second floor square feet + Total square feet of basement area
data_full['HouseSF'] = data_full['1stFlrSF'] + data_full['2ndFlrSF'] + data_full['TotalBsmtSF']

# Porch square feet = Three season porch area + Enclosed porch area + Open porch area + Screen porch area (all in square feet)
data_full['PorchSF'] = data_full['3SsnPorch'] + data_full['EnclosedPorch'] + data_full['OpenPorchSF'] + data_full['ScreenPorch']

# Total square feet = House square feet + Porch square feet + Garage area
data_full['TotalSF'] = data_full['HouseSF'] + data_full['PorchSF'] + data_full['GarageArea']

## 08) Skewness and Kurtosis

**Skewness** is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. 

**Kurtosis** is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.
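
To make the two definitions concrete, here is a sketch that computes both quantities directly from central moments: skewness as the third central moment divided by the 1.5 power of the second, and excess kurtosis as the fourth central moment divided by the square of the second, minus 3. The function names and the synthetic samples are assumptions for illustration. Note that `scipy.stats` uses these biased moment estimators by default, while the pandas `.skew()` and `.kurt()` methods apply a bias correction, so the numbers can differ slightly on small samples.

```python
import numpy as np
from scipy import stats

def moment_skewness(x):
    """Biased sample skewness: m3 / m2**1.5, with m_k the k-th central moment."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def moment_excess_kurtosis(x):
    """Biased sample excess kurtosis: m4 / m2**2 - 3 (zero for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0

# Synthetic samples (hypothetical): one symmetric, one with a long right tail
rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)
right_skewed = rng.exponential(size=5000)

for name, sample in [('normal', symmetric), ('exponential', right_skewed)]:
    print(name,
          round(moment_skewness(sample), 3), round(stats.skew(sample), 3),
          round(moment_excess_kurtosis(sample), 3), round(stats.kurtosis(sample), 3))
```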

In [None]:
# Estimate Skewness of the data
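# numeric_only=True is required on recent pandas versions, which no longer skip text columns silently
train_data.skew(numeric_only=True)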

In [None]:
# Estimate Kurtosis of the data
train_data.kurt(numeric_only=True)

In [None]:
# Plot the Skewness and Kurtosis of the data
plt.figure(figsize=(15,5))

ax1 = plt.subplot(1, 2, 1)
sns.distplot(train_data.skew(numeric_only=True), axlabel ='Skewness')

ax2 = plt.subplot(1, 2, 2)
sns.distplot(train_data.kurt(numeric_only=True), axlabel ='Kurtosis')

plt.show()

Most columns show little kurtosis, but skewness is clearly present, meaning that many of the distributions are not symmetrical.

## 09) Label Encoding

For this section we will use Pipelines, which are a way to streamline a lot of the routine processes. A pipeline chains transformation steps so that the same code can be fitted on the training data and applied to the test data without copying and pasting everything.

- **Skewness:** Applying a log transformation (`log1p`) to the numeric features whose absolute skewness is at least 1, in order to reduce their skew.

- **Label Encoder and One Hot Encoder:** These two encoders convert categorical (text) data into numbers, which our predictive models can work with. Label encoding uses scikit-learn's `LabelEncoder`, and one-hot encoding uses `pd.get_dummies`.
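
The difference between the two encodings is easiest to see on a toy dataframe. The sketch below is only an illustration with invented rows; the column names are borrowed from the dataset. It also shows a detail worth knowing: `LabelEncoder` sorts its classes, and since `MoSold` was converted to strings earlier, the order is lexicographic (`'11'` comes before `'2'`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    'MoSold': ['2', '11', '2', '7'],
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
})

encoded = toy.copy()

# Label encoding: one integer code per category, in sorted (here lexicographic) order
encoded['MoSold'] = LabelEncoder().fit_transform(encoded['MoSold'])

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(encoded, columns=['Street'], dtype=int)

print(encoded['MoSold'].tolist())
print(one_hot)
```

Label encoding keeps a single column but imposes an order on the categories, while one-hot encoding makes no ordering assumption at the cost of more columns.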

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from scipy.stats import skew

# Label encoding class
class labenc(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        label = LabelEncoder()
        
        X['YrSold'] = label.fit_transform(X['YrSold'])
        X['YearRemodAdd'] = label.fit_transform(X['YearRemodAdd'])
        X['YearBuilt'] = label.fit_transform(X['YearBuilt'])
        X['MoSold'] = label.fit_transform(X['MoSold'])
        X['GarageYrBlt'] = label.fit_transform(X['GarageYrBlt'])
        
        return X
    
# Skewness transform class
class skewness(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        skewness = X.select_dtypes(include=[np.number]).apply(lambda x: skew(x))
        skewness_features = skewness[abs(skewness) >= 1].index
        
        X[skewness_features] = np.log1p(X[skewness_features])
        
        return X

# One hot encoding class
class onehotenc(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = pd.get_dummies(X)
        
        return X

In [None]:
# Creating a copy of the full dataset
data_full_copy = data_full.copy()

# Creating a new data with the applied transformations using a Pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('labenc', labenc()), ('skewness', skewness()), ('onehotenc', onehotenc())])

data_pipeline = pipeline.fit_transform(data_full_copy)

data_full.shape, data_pipeline.shape

We can see now that the number of features increases from 95 to 332 because of the feature transformations.

In [None]:
data_full.head()

In [None]:
data_pipeline.head()

Now we split the data into training and testing datasets again.

In [None]:
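# Number of training rows (they come first in the combined dataset)
n_train = train_data.shape[0]

# Split back into train and test sets; .drop returns a new dataframe instead of modifying a slice in place
y_train = data_pipeline['SalePrice'][:n_train]
X_train = data_pipeline[:n_train].drop(columns='SalePrice')
X_test = data_pipeline[n_train:].drop(columns='SalePrice')

X_train.shape, y_train.shape, X_test.shape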

## 10) Transformation and Scaling

In [None]:
plt.figure(figsize=(25,5))

ax1 = plt.subplot(1, 3, 1)
sns.distplot(y_train, kde=False, fit=st.norm)
plt.title('Normal', size = 15)

ax2 = plt.subplot(1, 3, 2)
sns.distplot(y_train, kde=False, fit=st.lognorm)
plt.title('Log Normal', size = 15)

ax3 = plt.subplot(1, 3, 3)
sns.distplot(y_train, kde=False, fit=st.johnsonsu)
plt.title('Johnson SU', size = 15)

plt.show()

The normal distribution doesn't fit, so `SalePrice` needs to be transformed before creating the model. The best fit is the unbounded Johnson distribution, although the log-normal distribution also fits well.
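
If a variable is approximately log-normal, taking its logarithm should give something close to a normal distribution, and the exponential function undoes the transformation. The sketch below checks both points on synthetic prices; the sample and its parameters are invented for illustration and are not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (hypothetical): exponentiated normal draws
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

# The log transform pulls in the long right tail
log_prices = np.log(prices)

# The exponential maps values back to the original scale
recovered = np.exp(log_prices)

print('skew before:', round(prices.skew(), 3))
print('skew after: ', round(log_prices.skew(), 3))
```

The same round trip is what allows a model trained on the log of the target to produce predictions on the original price scale.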

In [None]:
# Log-transforming 'SalePrice' to bring it closer to a normal distribution
y_train_transformed = np.log(y_train)

y_train_transformed.skew(), y_train_transformed.kurt()

In [None]:
# Plotting 'SalePrice' before and after the transformation
plt.figure(figsize=(15,5))

ax1 = plt.subplot(1, 2, 1)
sns.distplot(y_train)
plt.title('Before Transformation', size=15)

ax2 = plt.subplot(1, 2, 2)
sns.distplot(y_train_transformed)
plt.title('After Transformation', size=15)

plt.show()

In [None]:
# Using RobustScaler to transform X_train and X_test
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()

X_train_scaled = robust_scaler.fit(X_train).transform(X_train)
X_test_scaled = robust_scaler.transform(X_test)

In [None]:
# Shape of final data we will be working on
X_train_scaled.shape, y_train_transformed.shape, X_test_scaled.shape

## 11) Feature Selection

We will use lasso regression (an L1 regularization method). Lasso shrinks the coefficients of the less important features to exactly zero, thus removing some features altogether. We can also use it to find the most important features in our dataset.
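
The sketch below illustrates this selection effect on synthetic data where only the first two of five features influence the target. The data, the coefficients and the value `alpha=0.1` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only features 0 and 1 drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks all coefficients and sets the uninformative ones to zero
lasso_demo = Lasso(alpha=0.1).fit(X, y)

# Indices of the features that kept a non-zero coefficient
selected = np.flatnonzero(lasso_demo.coef_)

print('coefficients:', np.round(lasso_demo.coef_, 3))
print('selected features:', selected.tolist())
```

A larger `alpha` removes more features, and a smaller one keeps more of them, so the value controls how aggressive the selection is.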

In [None]:
# Display features by their importance (lasso regression coefficient)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.001)

lasso.fit(X_train_scaled, y_train_transformed)

y_pred_lasso = lasso.predict(X_test_scaled)

lasso_coeff = pd.DataFrame({'Feature Importance':lasso.coef_}, index=data_pipeline.drop(columns='SalePrice').columns)
lasso_coeff.sort_values('Feature Importance', ascending=False)

In [None]:
# Plot features by importance (feature coefficient in the model)
lasso_coeff[lasso_coeff['Feature Importance'] != 0].sort_values('Feature Importance').plot(kind='barh',figsize=(20,20))

What's interesting here is that two of the variables we created, `HouseSF` and `PorchSF`, actually perform poorly compared to their components. But when we sum all the surfaces into `TotalSF`, which is just a combination of features that are individually unimportant in this model, we suddenly obtain the most important feature in the dataset.

## 12) Principal Component Analysis

Principal Component Analysis (PCA) is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance.
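
When `PCA` receives a float between 0 and 1 as `n_components`, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that fraction. Here is a small sketch on synthetic data, where ten observed features are generated from three hidden factors; the data are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 10 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 10))

# Keep as many components as needed to retain at least 95% of the variance
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X)

retained = pca_demo.explained_variance_ratio_.sum()
print(X.shape, '->', X_reduced.shape, 'retained variance:', round(retained, 4))
```

Because the ten features carry only three independent directions of variation, a handful of components is enough to retain almost all of the variance.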

In [None]:
from sklearn.decomposition import PCA

# Concatenate the scaled training and test datasets into a single array
data_full_2 = np.concatenate([X_train_scaled, X_test_scaled])

# Choose the number of principal components such that 95% of the variance is retained
pca = PCA(0.95)
data_full_2 = pca.fit_transform(data_full_2)

var_PCA = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

# Principal Component Analysis of data
print(var_PCA)

In [None]:
# Principal Component Analysis plot of the data
plt.figure(figsize=(15,5))

plt.bar(x=range(1, len(var_PCA) + 1), height=var_PCA)
plt.ylabel("Explained Variance (%)", size=15)
plt.xlabel("Principal Components", size=15)
plt.title("Principal Component Analysis Plot : Training Data", size=15)
plt.show()

In [None]:
# Shape of final data we will be working on
X_train_scaled = data_full_2[:train_data.shape[0]]

X_test_scaled = data_full_2[train_data.shape[0]:]

X_train_scaled.shape, y_train_transformed.shape, X_test_scaled.shape

## 13) Testing Different Models

Now that we have finished preparing our data, it's time to test different models to see which one performs best.
The models we will be testing are:
- Linear regression
- Support vector regression
- Stochastic gradient descent
- Gradient boosting tree
- Random forest
- Lasso regression
- Ridge regression
- Bayesian ridge regression
- Kernel ridge regression
- Elastic net regularization
- Extra trees regression

In [None]:
# Importing the models
from sklearn.linear_model import LinearRegression, BayesianRidge, ElasticNet, Lasso, SGDRegressor, Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import LinearSVR, SVR

# kfolds = KFold(n_splits=10, shuffle=True, random_state=42)


#alphas_alt = [14.5, 14.6, 14.7, 14.8, 14.9, 15, 15.1, 15.2, 15.3, 15.4, 15.5]
#alphas2 = [5e-05, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]
#e_alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007]
#e_l1ratio = [0.8, 0.85, 0.9, 0.95, 0.99, 1]


# Add RidgeCV(alpha=alphas_alt, cv=kfolds)
# Add LassoCV(max_iter=1e7, alphas=alphas2, random_state=42, cv=kfolds)
# Add ElasticNetCV(max_iter=1e7, alphas=e_alphas, cv=kfolds, l1_ratio=e_l1ratio)
# Add SVR(C= 20, epsilon= 0.008, gamma=0.0003,)
# Add GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt', min_samples_leaf=15, min_samples_split=10, loss='huber', random_state =42)
# Add LGBMRegressor(objective='regression', num_leaves=4, learning_rate=0.01, n_estimators=5000, max_bin=200, bagging_fraction=0.75, bagging_freq=5, bagging_seed=7, feature_fraction=0.2,
# feature_fraction_seed=7, verbose=-1,)

# Add XGBRegressor(learning_rate=0.01,n_estimators=3460,
                                    # max_depth=3, min_child_weight=0,
                                    # gamma=0, subsample=0.7,
                                    # colsample_bytree=0.7,
                                    # objective='reg:linear', nthread=-1,
                                    # scale_pos_weight=1, seed=27,
                                    # reg_alpha=0.00006)
                        
# StackingCVRegressor(regressors=(ridge, lasso, elasticnet, gbr, xgboost, lightgbm),
                               # meta_regressor=xgboost,
                               # use_features_in_secondary=True)
                        
# Creating the models
models = [LinearRegression(), 
          SVR(),
          SGDRegressor(),
          SGDRegressor(max_iter=1000, tol=1e-3),
          GradientBoostingRegressor(),
          RandomForestRegressor(),
          Lasso(),
          Lasso(alpha=0.01, max_iter=10000),
          Ridge(),
          BayesianRidge(),
          KernelRidge(),
          KernelRidge(alpha=0.6, kernel='polynomial',degree=2, coef0=2.5),
          ElasticNet(),
          ElasticNet(alpha=0.001, max_iter=10000),
          ExtraTreesRegressor()
         ]

names = ['Linear Regression',
         'Support Vector Regression',
         'Stochastic Gradient Descent',
         'Stochastic Gradient Descent 2',
         'Gradient Boosting Tree',
         'Random Forest',
         'Lasso Regression',
         'Lasso Regression 2',
         'Ridge Regression',
         'Bayesian Ridge Regression',
         'Kernel Ridge Regression',
         'Kernel Ridge Regression 2',
         'Elastic Net Regularization',
         'Elastic Net Regularization 2',
         'Extra Trees Regression'
        ]

In [None]:
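# Define a cross-validated root mean square error function
from sklearn.model_selection import cross_val_score

def rmse(model, X, y):
    # cross_val_score returns the negated MSE of each fold, so negate it before taking the root
    scores = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return scores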

In [None]:
from sklearn.model_selection import KFold, cross_val_score
warnings.filterwarnings('ignore')

# Perform 5-fold cross-validation to evaluate the models
for model, name in zip(models, names):
    # Root mean square error
    score = rmse(model, X_train_scaled, y_train_transformed)
    print("- {}: Mean: {:.6f}, Std: {:.6f}".format(name, score.mean(), score.std()))

Surprisingly, the Random forest and Extra trees regression models are the ones that performed the worst, while the linear regression model performed quite well relative to the other models.
By running the above code several times and observing the different scores each time, we can rank the models by cross-validated error (lowest RMSE first):

- 1st: Kernel ridge regression
- 2nd: Elastic net regularization and Bayesian ridge regression
- 3rd: Ridge regression and Linear regression
- 4th: Support vector regression
- 5th: Gradient boosting tree
- 6th: Stochastic gradient descent and Lasso regression
- 7th: Random forest and Extra trees regression

I think we got a good score with Elastic net regularization, Lasso regression and Stochastic gradient descent because we chose some good parameters: their scores above are much worse when no parameter values are specified. So if we really want to know the best model, we need to choose optimal parameters for all the models, and that's what we will do in the next section.

## 14) Hyper-parameter Tuning

To choose good hyper-parameters, we will perform grid search. The class `GridSearchCV` exhaustively considers all parameter combinations, generating candidates from a grid of parameter values specified with the `param_grid` parameter.
Since we will use the same procedure for all models, we will start by creating a small helper class whose `grid_get` method takes the parameter grid as input.
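
Before wrapping the procedure in a class, here is what a single grid search looks like on synthetic data. This is a sketch under stated assumptions: the data, the choice of `Ridge` and the `alpha` grid are invented for the example. Because scikit-learn maximizes scores, the scoring is the negated mean squared error, and we negate it again before taking the square root to obtain an RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Every value in the grid is evaluated with 5-fold cross-validation
param_grid = {'alpha': [0.01, 1.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

# best_score_ is the mean negated MSE of the best candidate
best_rmse = np.sqrt(-search.best_score_)
print('Best Parameters: {}, Best Score: {:.4f}'.format(search.best_params_, best_rmse))
```

The cost grows with the product of the grid sizes, so it is common to start with a coarse grid and then refine it around the best values.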

In [None]:
from sklearn.model_selection import GridSearchCV

class gridSearch():
    def __init__(self, model):
        self.model = model
    def grid_get(self, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring='neg_mean_squared_error')
        grid_search.fit(X_train_scaled, y_train_transformed)
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        
        #print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
        print('Best Parameters: {}, \nBest Score: {}'.format(grid_search.best_params_, np.sqrt(-grid_search.best_score_)))

### 1. Kernel Ridge Regression

In [None]:
gridSearch(KernelRidge()).grid_get({'alpha':[3.5, 4, 4.5, 5, 5.5, 6, 6.5], 'kernel':["polynomial"], 'degree':[3], 'coef0':[1, 1.5, 2, 2.5, 3, 3.5]})

### 2. Elastic Net Regularization

In [None]:
gridSearch(ElasticNet()).grid_get({'alpha':[0.006, 0.0065, 0.007, 0.0075, 0.008], 'l1_ratio':[0.070, 0.075, 0.080, 0.085, 0.09, 0.095], 'max_iter':[10000]})

### 3. Ridge regression

In [None]:
gridSearch(Ridge()).grid_get({'alpha':[10, 20, 25, 30, 35, 40, 45, 50, 55, 57, 60, 65, 70, 75, 80, 100], 'max_iter':[10000]})

### 4. Support vector regression

In [None]:
gridSearch(SVR()).grid_get({'C':[13, 15, 17, 19, 21], 'kernel':["rbf"], "gamma":[0.0005, 0.001, 0.002, 0.01], "epsilon":[0.01, 0.02, 0.03, 0.1]})

### 5. Lasso regression

In [None]:
gridSearch(Lasso()).grid_get({'alpha':[0.01, 0.001, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009], 'max_iter':[10000]})

We see that the models perform almost the same, with a score of about 0.116. Let's define these models with their respective best hyper-parameters.

In [None]:
ker = KernelRidge(alpha=6.5, coef0=2.5, degree=3, kernel='polynomial')
ela = ElasticNet(alpha=0.007, l1_ratio=0.07, max_iter=10000)
ridge = Ridge(alpha=35, max_iter= 10000)
svr = SVR(C=13, epsilon=0.03, gamma=0.001, kernel='rbf')
lasso = Lasso(alpha=0.0006, max_iter=10000)
bay = BayesianRidge()

## 15) Making Predictions and Submission
Now it's time to make predictions and store them in a CSV file with the corresponding Ids. After making the predictions, we need to transform them back to the original price scale with the exponential function, since the model was trained on the logarithm of `SalePrice`.

In [None]:
# Create the model (Support Vector Regression with the tuned parameters) and fit it on the training data
model = SVR(C=13, epsilon=0.03, gamma=0.001, kernel='rbf')
model.fit(X_train_scaled, y_train_transformed)

# Generate the predictions running the model in the test data
predictions = np.exp(model.predict(X_test_scaled))

# Create the output file 
output = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': predictions})
output.to_csv('submission.csv', index=False)

print("Your submission was successfully saved!")