# Exercise 1 - Deep Neural Networks for Standard Classification Problems - Price Prediction

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. The Kaggle challenge can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).


Load required libraries for modeling and data processing

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

## Load the dataset

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authenticate and create the PyDrive client.

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

When prompted, click on the link to get authentication to allow Google to access your Drive. You should see a screen with “Google Cloud SDK wants to access your Google Account” at the top. After you allow permission, copy the given verification code and paste it in the box in Colab.

In [None]:
dataset_file_id = '1RBxydSbuwpMCVaJ2t6cHyNFYMGYvaD8P'

In [None]:
downloaded = drive.CreateFile({'id':dataset_file_id}) 
downloaded.GetContentFile('kaggle_housing_cleaned.csv')

Load the dataset file into a dataframe using the pandas library.

In [None]:
df = pd.read_csv('kaggle_housing_cleaned.csv')

In [None]:
df.head()

The data description is available at [this link](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

We will use the `SalePrice` variable as the target (the value to be predicted) and store the target array in the variable *Y*.

In [None]:
Y = df[['SalePrice']]

## Exploratory Analytics and pre-processing

First we will remove the `Id` column and the target column, so that the dataframe holds only the input features.

In [None]:
df.drop(['Id', 'SalePrice'], inplace=True, axis=1)

In [None]:
print('Dataset size: Rows - {}, Columns - {}'.format(df.shape[0], df.shape[1]))

Inspect the column names, data types and non-null counts.

In [None]:
df.info()

For the experiment, we will only use the continuous variables. The cleaned file is expected to contain only numeric columns, so we simply copy the dataframe.

In [None]:
df_numerical = df.copy()
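
The copy above keeps every column, which only works if the file is already all numeric. If it were not, a sketch like the following could keep just the numeric columns. The toy dataframe and its values are made up for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (not the real file).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "Street": ["Pave", "Pave", "Grvl"],   # categorical (string) column
})

# Keep only integer and float columns; string columns are dropped.
toy_numerical = toy.select_dtypes(include=[np.number]).copy()
print(list(toy_numerical.columns))
```

In the notebook, the same call on `df` would replace the plain `df.copy()`.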

### Normalization

First we will look at the data distribution using a box plot.

In [None]:
df_numerical.boxplot(rot=90)

We will normalize the continuous variables using the Min-Max normalization technique.  
Separate scaler objects are used for the features and the target variable(s), so that predictions can later be mapped back to prices.

In [None]:
from sklearn.preprocessing import MinMaxScaler
feature_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler = MinMaxScaler(feature_range=(0, 1))

In [None]:
df_numerical = feature_scaler.fit_transform(df_numerical)

In [None]:
Y_scaled = target_scaler.fit_transform(Y)
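
To make the transformation concrete, here is a minimal NumPy sketch of the min-max formula, x' = (x - min) / (max - min), applied per column. The helper `min_max_scale` and the prices are made up for illustration; this is not scikit-learn's implementation, and it does not guard against a constant column (where max equals min).

```python
import numpy as np

def min_max_scale(x):
    """Scale each column of x to [0, 1] using that column's own min and max."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min)

prices = np.array([[100000.0], [150000.0], [300000.0]])  # made-up sale prices
scaled = min_max_scale(prices)
print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.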

## Modeling

In this workshop, we use the Keras API, running on top of the TensorFlow framework, to develop the deep neural network (DNN).  
Further details on the Keras API and how to customize models can be found in [the official Keras Guide](https://keras.io/getting-started/functional-api-guide/).  
  
The DNN model we will use is shown below.

![alt text](https://i.imgur.com/4cyoPiL.png)

Import Keras (with the TensorFlow backend) and scikit-learn utilities for model development.

In [None]:
%tensorflow_version 1.x
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error 

Split the dataset (use 70/30 for train/test)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_numerical, Y_scaled, test_size=0.3, random_state=2)
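
Conceptually, the split shuffles the row indices once and cuts them into two parts. The sketch below illustrates the idea with NumPy only. `simple_split` is a hypothetical helper, not scikit-learn's implementation, so it will not select the same rows as `train_test_split` even with the same seed.

```python
import numpy as np

def simple_split(X, y, test_size=0.3, seed=2):
    """Shuffle row indices once, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y_demo = np.arange(10)                  # made-up targets, aligned with the rows
Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape)
```

Features and targets are indexed with the same shuffled indices, which keeps each row paired with its own target.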

First we will define a sequential model, which is the container that holds the layers of our deep learning model.

In [None]:
NN_model = Sequential()

Next we will set up the first layer of our deep neural network (DNN). Here we specify the input dimensions.  
Note that the first hidden layer uses 36 nodes. You may vary the layer sizes in your own experiments in order to improve the accuracy.

In [None]:
NN_model.add(Dense(36, kernel_initializer='normal', input_dim = X_train.shape[1], activation='relu'))

After adding the first layer, we will define the 2nd, 3rd and 4th layers similarly, with 24, 12 and 8 nodes.  
However, we do not need to define the input dimensions in the subsequent layers, as they are inferred automatically from the preceding layer.

In [None]:
NN_model.add(Dense(24, kernel_initializer='normal',activation='relu'))
NN_model.add(Dense(12, kernel_initializer='normal',activation='relu'))
NN_model.add(Dense(8, kernel_initializer='normal',activation='relu'))

Next we will define the output layer. As our output is a predicted housing price (a single continuous value), we will use a single output node with a linear activation.

In [None]:
NN_model.add(Dense(1, kernel_initializer='normal',activation='linear'))

Now we have completely defined the DNN model.  
The next step is to compile the DNN with a [loss function](https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23), an [optimizer](https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3) and metrics.  
In our experiment, we will use mean absolute error (MAE) as the loss function and Adam as the optimizer. Mean squared error (MSE) is tracked as an additional metric.


In [None]:
# Compile the DNN
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
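
The difference between the loss (MAE) and the tracked metric (MSE) is easiest to see on a small example. The values below are made up and live on the scaled [0, 1] range. One prediction is far off, and it dominates the MSE much more than the MAE.

```python
import numpy as np

y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.3])   # the last prediction is far off

abs_err = np.abs(y_true - y_pred)
mae = abs_err.mean()            # mean absolute error
mse = (abs_err ** 2).mean()     # mean squared error

# Share of each total error that comes from the single large miss
share_in_mae = abs_err[-1] / abs_err.sum()
share_in_mse = abs_err[-1] ** 2 / (abs_err ** 2).sum()
print(mae, mse, share_in_mae, share_in_mse)
```

Because squaring amplifies large errors, training on MAE is less sensitive to a few unusual houses than training on MSE would be.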

In [None]:
# Visualize the model summary
NN_model.summary()
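
The parameter counts reported by the summary can be checked by hand: a dense layer with n inputs and m units has (n + 1) * m trainable parameters (weights plus one bias per unit). A minimal sketch, assuming 36 input features. That number is a placeholder, and the real value is `X_train.shape[1]`.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Return per-layer parameter counts for a stack of fully connected layers."""
    counts = []
    for n_units in layer_sizes:
        counts.append((n_inputs + 1) * n_units)  # weights plus one bias per unit
        n_inputs = n_units                        # this layer feeds the next one
    return counts

n_features = 36  # assumed input width; use X_train.shape[1] in the notebook
counts = dense_param_count(n_features, [36, 24, 12, 8, 1])
print(counts, sum(counts))
```

If the totals differ from the summary, the most likely cause is a different number of input features.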

Now we will plot the model in a diagram.

In [None]:
# Plot the model
from keras.utils import plot_model
plot_model(NN_model, to_file='model.png', show_shapes=True, show_layer_names=True)

Once the plotting is completed, you can go to Files tab and double click on the model.png file to visualize the model diagram.

## Model Training

Now, we will work on training the model.  
First we need to define 3 parameters:  


1.   Number of training epochs
2.   [The batch size](https://radiopaedia.org/articles/batch-size-machine-learning), i.e., how many training samples are processed before each update of the model weights.
3.   Validation split (what percentage of data to keep as validation data)



In [None]:
epochs = 100             # Number of full passes over the training data
batch_size = 32          # Number of samples per weight update; larger batches mean fewer updates per epoch
validation_split = 0.3   # Fraction of the training data held out for validation
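
These three numbers together determine how many weight updates happen during training. The arithmetic below is a rough sketch. The row count of 1022 is an assumption (70% of a 1460-row dataset), and the exact rounding Keras applies to the validation split may differ by a row.

```python
import math

n_train_rows = 1022        # assumed number of rows in X_train
epochs = 100
batch_size = 32
validation_split = 0.3

# Keras holds out the last fraction of the rows for validation.
n_fit = int(n_train_rows * (1 - validation_split))   # rows used for weight updates
n_val = n_train_rows - n_fit                          # rows used for validation only
steps_per_epoch = math.ceil(n_fit / batch_size)       # the last batch may be smaller
total_updates = steps_per_epoch * epochs
print(n_fit, n_val, steps_per_epoch, total_updates)
```

Halving the batch size roughly doubles the number of updates per epoch, which usually makes each epoch slower.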

Calling `NN_model.fit()` starts the training of the DNN. It returns a history object that records the loss for every epoch.

In [None]:
# Train the model
history = NN_model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split = validation_split)

Plot the learning curves: the training and validation loss against the epoch number.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
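
Besides looking at the plot, the same history can be summarized numerically. The sketch below uses made-up loss values. In the notebook they would come from `history.history['loss']` and `history.history['val_loss']`.

```python
import numpy as np

# Made-up loss values for illustration only.
train_loss = [0.090, 0.060, 0.045, 0.038, 0.033, 0.030]
val_loss = [0.095, 0.068, 0.052, 0.049, 0.051, 0.055]

best_epoch = int(np.argmin(val_loss))      # zero-based epoch with the lowest validation loss
final_gap = val_loss[-1] - train_loss[-1]  # validation minus training loss at the last epoch
print(best_epoch, final_gap)
```

A validation loss that starts rising while the training loss keeps falling is a common sign of overfitting.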

## Model Evaluation

Now we will test the trained DNN model with respect to the test dataset.

In [None]:
# Predicted outputs for the test and train datasets
yhat_test = NN_model.predict(X_test)
yhat_train = NN_model.predict(X_train)

Recall that we normalized the data using `MinMaxScaler` from scikit-learn. Now we will inverse transform the predictions back to their original range.

In [None]:
# inverse transform test dataset
inv_yhat = target_scaler.inverse_transform(yhat_test)
inv_y_test = target_scaler.inverse_transform(y_test)

In [None]:
# inverse transform train dataset
yhat_train = target_scaler.inverse_transform(yhat_train)
inv_y_train = target_scaler.inverse_transform(y_train)

Evaluate the root mean squared error (RMSE).

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
error_train = np.sqrt(mean_squared_error(inv_y_train, yhat_train))
error_test = np.sqrt(mean_squared_error(inv_y_test, inv_yhat))
print('Train RMSE: ', error_train)
print('Test RMSE: ', error_test)
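
Because the inverse min-max transform only rescales and shifts the values, the RMSE in price units equals the RMSE on the scaled data multiplied by (max - min). The sketch below checks this with made-up numbers. `inverse_min_max` and `rmse` are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np

y_min, y_max = 50000.0, 450000.0            # made-up price range
y_true_scaled = np.array([0.10, 0.50, 0.90])
y_pred_scaled = np.array([0.15, 0.40, 0.85])

def inverse_min_max(x_scaled, lo, hi):
    """Undo min-max scaling: map values in [0, 1] back to [lo, hi]."""
    return x_scaled * (hi - lo) + lo

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_scaled = rmse(y_true_scaled, y_pred_scaled)
rmse_prices = rmse(inverse_min_max(y_true_scaled, y_min, y_max),
                   inverse_min_max(y_pred_scaled, y_min, y_max))
print(rmse_scaled, rmse_prices)
```

This is why a small-looking loss during training can still correspond to an error of tens of thousands in price units.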

Relate the train and test errors to bias and variance.  
*  What is the problem we have here?
*  What options can we take to improve the accuracy?

In [None]:
# Create a single figure with two side-by-side panels
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt.subplots_adjust(wspace=0.5)
ax[0].scatter(inv_y_test, inv_yhat, c='g')
ax[0].set(title='Test data', xlabel='Actual Sale Price', ylabel='Predicted Sale Price')
ax[1].scatter(inv_y_train, yhat_train, c='b')
ax[1].set(title='Train data', xlabel='Actual Sale Price', ylabel='Predicted Sale Price')

## Hyperparameter Tuning

Adjusting/finding good values for hyperparameters is a slow process. You have to wait for the whole training process to complete, evaluate the results and adjust the value(s).  
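
To see why this gets slow, it helps to count how many training runs even a small search requires. The sketch below enumerates a grid of settings. The search space and the helper `iter_configs` are hypothetical, and the values are illustrative rather than recommendations.

```python
import itertools

# Hypothetical search space; the values are illustrative only.
search_space = {
    "first_layer_units": [18, 36, 72],
    "batch_size": [16, 32],
    "epochs": [50, 100],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))   # each configuration requires one full training run
```

Every extra hyperparameter multiplies the number of runs, which is why tuning libraries try to sample the space more cleverly than an exhaustive grid.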

As a rough rule of thumb, hyperparameter tuning can improve performance on the test data by about 5-15%, although the actual gain depends on the model and the dataset.  

There are a number of libraries that ease the process of hyperparameter tuning, for example [Hyperas](https://github.com/maxpumperla/hyperas).  

Credits to Nils Schlüter for the guide on [running hyperas with Google Colab](https://towardsdatascience.com/keras-hyperparameter-tuning-in-google-colab-using-hyperas-624fa4bbf673).