<a href="https://colab.research.google.com/github/Pataweepr/applyML_vistec_2019/blob/master/hw3_House_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Housing Price prediction using linear regression

This homework uses data from [Kaggle's House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

**Goal**

The goal of this homework is to predict the sales price for each house.

In [0]:
# import library 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score,mean_squared_error
from google.colab import files
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from IPython.display import display
import json
import seaborn as sns

from scipy import stats

Add the data to your Google Drive by opening this [link](https://drive.google.com/open?id=1jgEPVsZ8CWjoKDrD0VB75Jb0igPCU9vc) and clicking "Add to Drive".

In [0]:
from google.colab import drive
drive.mount('/content/gdrive/')

In [0]:
!unzip '/content/gdrive/My Drive/house-prices-advanced-regression-techniques.zip'
!ls

You will notice multiple files. Here are the main ones:

1.   data_description.txt explains the data
2.   train.csv contains the training data
3.   test.csv contains the test set for Kaggle



## Splitting the data

For convenience we will ignore the test set of the Kaggle competition and create our own validation set using the training data.

Split the training data into training and validation sets with a 10:1 ratio (that is, hold out 1/11 of the rows for validation).

Use [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to do so. Set the random state to 30, so that we can reproduce the same split every time we re-run the code.

In [0]:
## TODO#1 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
train_data = pd.read_csv('train.csv')
train_data, val_data = train_test_split(train_data, test_size=1/11, random_state=30)
test_data = pd.read_csv('test.csv')
        </code>
      </pre>
</details>

Try looking around the data using the pandas methods DataFrame.describe, DataFrame.head, and DataFrame.tail.

How many examples are in the training set?

** Ans: **

How many examples are in the validation set?

** Ans: **

How many examples are in the test set?

** Ans: **

How many features are in the data?

** Ans: **

## Cleaning the data

Fill the missing values in the training and validation data for features:

1.   LotFrontage
2.   GarageYrBlt
3.  MasVnrArea

with their mode values. (Remember to use the mode values from the training data, not the validation data.)

In [0]:
## TODO#2 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
train_data["LotFrontage"] = train_data["LotFrontage"].fillna(train_data["LotFrontage"].mode().iloc[0])
        </code>
      </pre>
</details>
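The same pattern extends to all three columns and to the validation split. The following is a minimal sketch on made-up toy frames (`toy_train` and `toy_val` are illustrative stand-ins, not the homework data); the point it illustrates is that each mode is computed once on the training split and then reused for validation.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data / val_data (values are made up for illustration).
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, 60.0, np.nan, 80.0],
    "GarageYrBlt": [2005.0, np.nan, 2005.0, 1990.0],
    "MasVnrArea": [0.0, 0.0, np.nan, 120.0],
})
toy_val = pd.DataFrame({
    "LotFrontage": [np.nan, 70.0],
    "GarageYrBlt": [np.nan, 2001.0],
    "MasVnrArea": [50.0, np.nan],
})

cols_to_fill = ["LotFrontage", "GarageYrBlt", "MasVnrArea"]

# Compute each mode on the training split only (mode() skips NaN by default).
train_modes = {col: toy_train[col].mode().iloc[0] for col in cols_to_fill}

# Reuse the training modes for both splits so no validation statistics leak in.
for col in cols_to_fill:
    toy_train[col] = toy_train[col].fillna(train_modes[col])
    toy_val[col] = toy_val[col].fillna(train_modes[col])

print(train_modes)
```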

## Data exploration and visualization

Since we have many variables in this dataset, we will try to visualize it using  [sns.FacetGrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) via histograms.

However, to use FacetGrid, we first need to reshape our data into long format using [pd.melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html).

Look at an example on how to do so [here](https://seaborn.pydata.org/tutorial/axis_grids.html).
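To see what the reshaping does, here is a small sketch on a made-up two-column frame (the values are illustrative only). `pd.melt` stacks the selected columns into `variable`/`value` pairs, which is the layout FacetGrid uses to draw one panel per variable.

```python
import pandas as pd

# Toy wide-format frame; the values are illustrative only.
wide_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
})

# melt produces one row per (column, cell) pair with columns 'variable' and 'value'.
long_df = pd.melt(wide_df, value_vars=["LotArea", "YearBuilt"])
print(long_df)
```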

Before we can plot histograms, we first need to know which feature columns are categorical and which feature columns are numerical. 

The function below takes in the training data frame and returns two lists. The first contains the names of the numerical features and the second contains the names of the categorical features.

In [0]:
def get_feature_groups(ames_df):
    # Numerical Features
    num_features = ames_df.select_dtypes(include=['int64','float64']).columns
    # We drop ID and SalePrice since these are not input features
    num_features = num_features.drop(['Id','SalePrice']) 

    # Categorical Features
    cat_features = ames_df.select_dtypes(include=['object']).columns
    return list(num_features), list(cat_features)
  
num_features, cat_features = get_feature_groups(train_data)

In [0]:
## TODO#3 ##
## Plot the two groups of features. Use sns.distplot for the numerical features
## and sns.countplot for the categorical features


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
# Grid of distribution plots of all numerical features
f = pd.melt(train_data, value_vars=sorted(num_features))
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

# Grid of frequency plots of all categorical features
f = pd.melt(train_data, value_vars=sorted(cat_features))
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
plt.xticks(rotation='vertical')
g = g.map(sns.countplot, 'value')
[plt.setp(ax.get_xticklabels(), rotation=60) for ax in g.axes.flat]
g.fig.tight_layout()
plt.show()
        </code>
      </pre>
</details>

From the plots, name one feature that is unlikely to help predict the price, and explain why. (A useless feature is one that does not contain any information.)

** Ans: **


Next we will look into the distribution of house prices. Create a histogram of the SalePrice with 100 bins.

In [0]:
## TODO#4 ##


Write down any observation you have from the histogram. What direction is the tail of the distribution?

** Ans: **

One of the problems we can notice is that there are extremely high values that can cause problems for any prediction model.

One way to normalize data with high spread is to squish it using a logarithmic function.

Normalize the price values using the function:

$y=\log(1+x)$

*The addition by one is so that we will not have problems if x = 0.*

Plot the histogram of the price after the log normalization.

Note: Numpy includes a function that does log(1+x) for you called [np.log1p](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html). The inverse of the function is called [np.expm1](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expm1.html#numpy.expm1).
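As a quick sanity check of the two functions, the sketch below applies them to a few made-up prices (not taken from the dataset) and confirms that `np.expm1` undoes `np.log1p`, including at zero.

```python
import numpy as np

# Made-up prices, including 0 to show why the "+1" matters.
prices = np.array([0.0, 50_000.0, 200_000.0, 755_000.0])

log_prices = np.log1p(prices)    # log(1 + x), defined at x = 0
recovered = np.expm1(log_prices) # exp(y) - 1, the inverse transform

print(log_prices)
print(recovered)
```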


In [0]:
## TODO#5 ##


Compare and contrast this new histogram with the previous one.

** Ans: **

In the next sections we will start building models. Before we get into that, we created a normalizer function for you to use.

In [0]:
# If you want to scale a 1-D array such as the SalePrice, set is_one_dim to True.
def normalizer(data,is_one_dim = False):
  if (is_one_dim):
    min_max_scale = preprocessing.MinMaxScaler().fit(data.reshape((len(data),1)))
  else:
    min_max_scale = preprocessing.MinMaxScaler().fit(data)
  return min_max_scale

We also have a function that evaluates the predictions in various ways.

Function inputs:
*   *predict_data_train_norm* and *predict_data_val_norm* are the predicted prices for the training and validation sets (normalized using log1p and min-max scaled)
*   *real_price_train* and *real_price_val* are the unnormalized true prices
*   *p_normalizer* is the min-max scaler for the price data

Function outputs:

*   *RMSE* is the Root Mean Square Error of the predicted price.
*   *Mean percentage error* is the mean of the prediction error in percent (relative to the correct price).
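
The two output metrics can be illustrated on a tiny made-up example. This is only a sketch of the formulas, computed directly with NumPy on prices that are already in the original (unnormalized) scale; the arrays are hypothetical.

```python
import numpy as np

# Hypothetical true and predicted prices in the original scale.
true_price = np.array([100_000.0, 200_000.0, 400_000.0])
pred_price = np.array([110_000.0, 180_000.0, 400_000.0])

# Root mean square error: square the errors, average, then take the root.
rmse = np.sqrt(np.mean((pred_price - true_price) ** 2))

# Mean percentage error: absolute error relative to the true price, in percent.
mean_pct_err = np.mean(np.abs(pred_price - true_price) / true_price) * 100

print(rmse, mean_pct_err)
```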



In [0]:
def evaluate_lin_reg(predict_data_train_norm,predict_data_val_norm,real_price_train,real_price_val,p_normalizer):
  
  train_predict_log1p = p_normalizer.inverse_transform(predict_data_train_norm.reshape((len(predict_data_train_norm),1))).reshape(len(predict_data_train_norm))
  train_predict = np.expm1(train_predict_log1p)
  
  val_predict_log1p = p_normalizer.inverse_transform(predict_data_val_norm.reshape((len(predict_data_val_norm),1))).reshape(len(predict_data_val_norm))
  val_predict = np.expm1(val_predict_log1p)
  
  rmse_train = np.sqrt(mean_squared_error(train_predict,real_price_train))
  mean_per_train = np.mean(np.absolute(train_predict - real_price_train)/real_price_train)*100
  
  rmse_val = np.sqrt(mean_squared_error(val_predict,real_price_val))
  mean_per_val = np.mean(np.absolute(val_predict - real_price_val)/real_price_val)*100
  
  
  print('rmse train set : ', rmse_train)
  print('mean percentage error train set : ', mean_per_train)
  print('rmse val set : ', rmse_val)
  print('mean percentage error val set : ', mean_per_val)
  
  return rmse_train,mean_per_train,rmse_val,mean_per_val

## Linear Regression model

In this section we will use linear regression to do housing price prediction.

See [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for examples on how to use linear regression.

Normalize the input features using min-max normalizer, and normalize the price (output target) using log1p followed by a min-max normalizer.

Use the function evaluate_lin_reg() to report your results.

In [0]:
## TODO#6 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
train_np = np.array(train_data[num_features].values)
train_price_data = np.array(train_data['SalePrice'].values)
train_log1p_price = np.log1p(train_price_data)
train_normalizer = normalizer(train_np)
train_np_norm = train_normalizer.transform(train_np)
price_normalizer = normalizer(train_log1p_price,is_one_dim = True)
train_log1p_price_norm = price_normalizer.transform(train_log1p_price.reshape(-1,1)).reshape(len(train_log1p_price))

lin_reg = LinearRegression().fit(train_np_norm, train_log1p_price_norm)
train_predict_log1p_norm = lin_reg.predict(train_np_norm)

val_np = np.array(val_data[num_features].values)
val_price_data = np.array(val_data['SalePrice'].values)
val_log1p_price = np.log1p(val_price_data)
val_np_norm = train_normalizer.transform(val_np)
val_log1p_price_norm = price_normalizer.transform(val_log1p_price.reshape(-1,1)).reshape(len(val_log1p_price))

val_predict_log1p_norm = lin_reg.predict(val_np_norm)

rms_train,mp_train,rms_val,mp_val = evaluate_lin_reg(train_predict_log1p_norm,val_predict_log1p_norm,train_price_data,val_price_data,price_normalizer)
        </code>
      </pre>
</details>

In order to get a sense of how good our model is we can compare it against baselines.

One such baseline is a model that always outputs the mean of the SalePrice. What is the RMSE of the sale price predicted this way?

** Ans: **

Compute the standard deviation of the SalePrice. You will find that it is very close to the RMSE value. Explain why.

** Ans: **
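
The mean baseline and the standard deviation can be computed along the following lines. This sketch uses synthetic prices drawn from a log-normal distribution (an assumption made purely for illustration), so the numbers will differ from those of the real data.

```python
import numpy as np

# Synthetic right-skewed "prices"; not the homework data.
rng = np.random.default_rng(0)
toy_price = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Baseline model: predict the mean price for every house.
baseline_pred = np.full_like(toy_price, toy_price.mean())
baseline_rmse = np.sqrt(np.mean((baseline_pred - toy_price) ** 2))

# Standard deviation with NumPy's default (ddof=0).
price_std = toy_price.std()

print(baseline_rmse, price_std)
```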

### Feature selection

One of the important factors for good prediction results is the quality of our input features.

One way to look at feature quality is to calculate the correlation coefficients (np.corrcoef from HW1).

Calculate the correlation between each pair of data columns in the training data (including the SalePrice). This is called a correlation matrix.

Visualize it using [sns.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)


In [0]:
## TODO#7 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
data_to_sel_feature = np.vstack((train_np,val_np))
log1p_price_all = np.hstack((train_log1p_price,val_log1p_price))

data_train = np.hstack((data_to_sel_feature,log1p_price_all.reshape((len(log1p_price_all),1))))
print(data_train.shape)

corre_mat = np.corrcoef(data_train.T)
ax = sns.heatmap(corre_mat, vmin=0, vmax=1)
        </code>
      </pre>
</details>

Sort the input columns by the correlation coefficient with the SalePrice. 

What feature has the highest correlation with the SalePrice? Does it make sense?

** Ans: **

In [0]:
## TODO#8 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
corre_mat_abs = np.absolute(corre_mat[36,:36])
ind_feature_imp = np.flip(np.argsort(corre_mat_abs) ,axis = 0)
num_features_np = np.array(num_features)
print(num_features_np[ind_feature_imp])
        </code>
      </pre>
</details>

#### Feature selection using correlation values (univariate analysis)

Use only the 6 features with the highest correlation values to train a new model. Compare the RMSE (both training and validation) with the previous model. Why do the results differ? (Think about overfitting and underfitting.)

** Ans: **

In [0]:
## TODO#9 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
# sort by correlation
num_feature_sel = 6
feature_sel = num_features_np[ind_feature_imp[:num_feature_sel]]

train_cut_np = np.array(train_data[feature_sel].values)
train_cut_normalizer = normalizer(train_cut_np)
train_cut_norm = train_cut_normalizer.transform(train_cut_np)

lin_reg_cut = LinearRegression().fit(train_cut_norm, train_log1p_price_norm)
train_cut_predict_log1p_norm = lin_reg_cut.predict(train_cut_norm)

val_cut_np = np.array(val_data[feature_sel].values)
val_cut_norm = train_cut_normalizer.transform(val_cut_np)
val_cut_predict_log1p_norm = lin_reg_cut.predict(val_cut_norm)

rms_train,mp_train,rms_val,mp_val = evaluate_lin_reg(train_cut_predict_log1p_norm,val_cut_predict_log1p_norm,train_price_data,val_price_data,price_normalizer)
        </code>
      </pre>
</details>

#### Feature selection using machine learning models (multivariate analysis)

Selecting the right features can have a huge impact on the performance of the model. In the previous section, we measured the correlation between a *single* variable and the prediction target. This disregards the interaction with other features and can lead to suboptimal solutions.

Another method to measure the importance of features is to look at the coefficients $w_i$ of the regression model.

$y = \sum_i x_i w_i$

where $i$ is the feature index.

The size of a weight indicates how important the corresponding feature is. Because each weight is fitted jointly with all the others, it also takes the other features into account.

Rank the features by the **absolute value** of the regression weights. Compare it with the ranking given by the correlation coefficient.

You can access the weights by looking at LinearRegression.coef_ in a trained model.
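
As a rough illustration of ranking by weight, the sketch below fits a model on synthetic data with hypothetical feature names (`f0` to `f3`) and sorts the features by the absolute value of `coef_`. The features are drawn on the same 0 to 1 scale, mirroring the min-max scaling used in this homework, since weights are only comparable when the inputs share a scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3"])  # hypothetical names

# Synthetic features in [0, 1]; the target ignores f3 entirely.
X = rng.uniform(0.0, 1.0, size=(200, 4))
y = 0.1 * X[:, 0] - 3.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.01, 200)

model = LinearRegression().fit(X, y)

# Sort feature indices by |weight|, largest first.
order = np.argsort(np.abs(model.coef_))[::-1]
ranked = feature_names[order]

print(list(zip(ranked, model.coef_[order])))
```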

Which makes more sense?

** Ans: **

In [0]:
## TODO#10 ##


Use only the 6 features with the highest absolute weight values to train a new model. Compare the RMSE (both training and validation) with the previous model.

In [0]:
## TODO#11 ##


## Non-linear features

Linear regression is a linear model, meaning the output of the model is a linear function (a flat plane) with respect to the input features.

To make linear regression more powerful, we usually add non-linear features in the form of higher-order terms. The additional features are usually polynomial combinations of the existing features. For example, if we add polynomial terms of degree 2 to the features $\{x_1, x_2\}$, we get the features $\{x_1^2, x_2^2, x_1x_2, x_1, x_2\}$.

scikit-learn provides a class that helps you create [polynomial features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

From the 6 features in the previous step, construct polynomial features of degree 2 with only interaction terms (no $x_1^2$ and $x_2^2$ terms). Then, train a linear regression model using those non-linear features.
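
The difference between the full expansion and the interaction-only expansion is easiest to see on a single made-up sample with $x_1 = 2$ and $x_2 = 3$. Note that, with its default settings, PolynomialFeatures also prepends a constant bias column of ones.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with x1 = 2 and x2 = 3 (made-up values).
x = np.array([[2.0, 3.0]])

# All degree-2 terms: [1, x1, x2, x1^2, x1*x2, x2^2]
full = PolynomialFeatures(degree=2).fit_transform(x)

# Interaction terms only: [1, x1, x2, x1*x2]
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(x)

print(full)   # [[1. 2. 3. 4. 6. 9.]]
print(inter)  # [[1. 2. 3. 6.]]
```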

Compare the RMSE (both training and validation) with the previous model.

In [0]:
## TODO#11 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
train_poly_np = preprocessing.PolynomialFeatures(2,interaction_only=True).fit_transform(train_cut_np)
train_poly_normalizer = normalizer(train_poly_np)
train_poly_norm = train_poly_normalizer.transform(train_poly_np)

lin_reg_poly = LinearRegression().fit(train_poly_norm, train_log1p_price_norm)
train_poly_predict_log1p_norm = lin_reg_poly.predict(train_poly_norm)

val_poly_np = preprocessing.PolynomialFeatures(2,interaction_only=True).fit_transform(val_cut_np)
val_poly_norm = train_poly_normalizer.transform(val_poly_np)
val_poly_predict_log1p_norm = lin_reg_poly.predict(val_poly_norm)

rms_train,mp_train,rms_val,mp_val = evaluate_lin_reg(train_poly_predict_log1p_norm,val_poly_predict_log1p_norm,train_price_data,val_price_data,price_normalizer)
        </code>
      </pre>
</details>

### Overfitting

With higher order features, it is easy to overfit to the training data. 

From the previous step, now construct polynomial features of degree 2 with **all terms**. Then, train a linear regression model using those non-linear features.

Compare the RMSE (both training and validation) with the previous model.

In practice, the features used (feature selection) and the polynomial terms are hyperparameters that you need to tune using the validation set.

In [0]:
## TODO#12 ##


## Elastic Net and Lasso

In this part of the homework we will explore the use of L1 regularization. In machine learning we can reduce the overfitting by introducing regularization terms. This is accomplished by changing the objective function to force desirable properties.

There are two kinds of regularization that are often used: L1 and L2 regularization.

Lasso uses L1 regularization, while Elastic Net uses both L1 and L2 at the same time.

An interesting property of L1 regularization is that it can be used to select only meaningful features. The features that are not useful will be ignored (the associated weights are zero).
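
The zeroing effect can be seen on synthetic data. In this sketch, only the first 3 of 20 features influence the target; the data, the noise level, and `alpha=0.05` are assumptions chosen for illustration and are not the settings used later in this homework.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)

# 20 synthetic features, but only the first 3 affect the target.
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 200)

plain = LinearRegression().fit(X, y)
# l1_ratio=1.0 makes the Elastic Net penalty pure L1 (i.e. Lasso).
lasso = ElasticNet(alpha=0.05, l1_ratio=1.0).fit(X, y)

# Count weights that are exactly zero in each model.
n_zero_plain = int(np.sum(plain.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("zero weights without regularization:", n_zero_plain)
print("zero weights with L1 regularization:", n_zero_lasso)
```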

Create a linear regression model using polynomial features of degree 3.

Note: you will not be able to evaluate the model using RMSE because you will be getting NaN values. Diagnose the problem by looking at the histogram of the predicted log1p normalized values of the validation set. Cap the predicted values to a reasonable value and evaluate the model.
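
One possible way to cap the predictions is `np.clip`, sketched below on hypothetical log1p-scale predictions. The cap of 15 is an assumed "reasonable value" chosen for illustration; pick your own after inspecting the histogram. If you clip in the min-max normalized space instead, the bounds would need to be adjusted accordingly.

```python
import numpy as np

# Hypothetical log1p-scale predictions; the last one has blown up.
pred_log1p = np.array([11.5, 12.3, 900.0])

# Inverting the log transform directly overflows to inf for the last entry.
with np.errstate(over="ignore"):
    raw_price = np.expm1(pred_log1p)

# Cap to an assumed plausible range before inverting the transform.
capped_log1p = np.clip(pred_log1p, 0.0, 15.0)
capped_price = np.expm1(capped_log1p)

print(raw_price)
print(capped_price)
```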

In [0]:
## TODO#13 ##


Plot a histogram of the absolute values of the weights used in the model.

In [0]:
## TODO#14 ##


Use [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) to perform L1 regularization. Set alpha = 0.00005 and l1_ratio = 1. This setting turns the Elastic Net into a Lasso model. Evaluate the model. Compare the RMSE (both training and validation) with the previous model.

In [0]:
## TODO#15 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
train_poly_np = preprocessing.PolynomialFeatures(3).fit_transform(train_np)
train_poly_normalizer = normalizer(train_poly_np)
train_poly_norm = train_poly_normalizer.transform(train_poly_np)

val_poly_np = preprocessing.PolynomialFeatures(3).fit_transform(val_np)
val_poly_norm = train_poly_normalizer.transform(val_poly_np)

alpha_sel = 0.00005
l1_ratio_sel = 1.0

lin_reg_l1 = ElasticNet(alpha = alpha_sel, l1_ratio = l1_ratio_sel).fit(train_poly_norm, train_log1p_price_norm)
train_predict_l1_log1p_norm = lin_reg_l1.predict(train_poly_norm)

val_predict_l1_log1p_norm = lin_reg_l1.predict(val_poly_norm)

rms_train,mp_train,rms_val,mp_val = evaluate_lin_reg(train_predict_l1_log1p_norm,val_predict_l1_log1p_norm,train_price_data,val_price_data,price_normalizer)
        </code>
      </pre>
</details>

## Comparing model weights between models with regularization and without

Finally, we will look at the weights of the Lasso model. Plot a histogram of the weights and compare it to that of the model without regularization. Describe how this is related to feature selection.

** Ans: **

In [0]:
## TODO#16 ##
