# Introduction

This workshop introduces the Keras open-source library with a TensorFlow backend.
Keras has become popular as an alternative to pure TensorFlow because it is easy to use and its API resembles that of other popular open-source ML libraries. The downside is that the extra layer of abstraction can make Keras slower than pure TensorFlow. We will illustrate an approach to a regression problem using a deep neural network, describing each step along the way.

## The Data 

The dataset for this workshop was web-scraped from a Danish online auction house and contains product information. The data has been preprocessed for your convenience, so the focus can be on the machine learning. The features in question are the sales price (price), the presale valuation (valuation), a product title (titles) and 106 binary features for the product category. We will try to predict the sales price of the products based on the valuation and the product category.

### Import necessary modules

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from keras import optimizers
from keras import regularizers
from keras.utils import plot_model
from IPython.display import Image
from sklearn.metrics import mean_squared_error, r2_score
sns.set(style="whitegrid")

### Setting random seed

In [None]:
RANDOM_SEED = 123 

A random seed is important for reproducibility when running on a CPU. It helps less when running on a GPU, because the GPU sets a number of random seeds of its own, which makes exact reproduction a challenge.
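A minimal sketch of what seeding buys you, using only Python's and NumPy's generators. The helper name `set_seeds` is hypothetical, and the backend keeps its own generator that has to be seeded separately, which is not shown here.

```python
import random

import numpy as np

RANDOM_SEED = 123


def set_seeds(seed):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seeds(RANDOM_SEED)
first_draw = np.random.rand(3)

set_seeds(RANDOM_SEED)
second_draw = np.random.rand(3)

# Same seed, same sequence of "random" numbers
print(np.array_equal(first_draw, second_draw))
```

Re-seeding before each draw gives identical numbers, which is what makes a train/test split or a weight initialization repeatable.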

### Import the data

In [None]:
data = pd.read_csv('input/data.csv')

In [None]:
data.info()

In [None]:
data.head()

As you can see, many of the features consist mostly of zeros. These are the binary product category features. Check for yourself.
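One way to check this is sketched below on a small toy array rather than the real data; the array `toy_categories` is made up for illustration.

```python
import numpy as np

# Toy stand-in for the category columns: rows are products, columns are categories
toy_categories = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# A column is binary if every value in it is either 0 or 1
is_binary = np.isin(toy_categories, [0, 1]).all(axis=0)

# The share of zeros per column shows how sparse each category is
zero_share = (toy_categories == 0).mean(axis=0)

print(is_binary)
print(zero_share)
```

The same two checks can be run on the category columns of `data` once it is converted to an array.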

### Check some of the product titles, prices and valuation

In [None]:
data[['titles', 'price', 'valuation']].head()

### Delete the titles

In [None]:
del data['titles']

We don't need the product titles in this application, as we are simply trying to predict prices based on valuation and category.

### Get a feel for the distribution of the data

In [None]:
sns.violinplot(x=data["price"])
plt.show()

In [None]:
sns.violinplot(x=data["valuation"])
plt.show()

Most of the products are relatively low priced, and then there are some outliers. Maybe we could restrict the input to products within a smaller price range. That also means we would only build an application for relatively low-priced products.

## Restricting input based on valuation

In [None]:
# .copy() gives an independent frame, so later column deletions do not touch `data`
data_ir = data[data['valuation'] <= 25000].copy()

We are removing all products with a valuation above the chosen threshold (25,000 DKK in the cell above). You are welcome to play around with it.
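To get a feel for how the threshold trades data volume against outliers, here is a small sketch on made-up valuations. The numbers and the helper `share_kept` are hypothetical.

```python
import numpy as np

# Hypothetical valuations in DKK, including a few expensive outliers
toy_valuations = np.array([500, 1200, 3000, 8000, 15000, 40000, 250000])


def share_kept(values, threshold):
    """Fraction of rows whose valuation is at or below the threshold."""
    return float((values <= threshold).mean())


for threshold in (5000, 10000, 25000):
    print(f"threshold {threshold}: keeps {share_kept(toy_valuations, threshold):.0%} of the rows")
```

Running the same loop on `data['valuation']` shows how many products each candidate threshold would leave for training.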

### Check the data again to see the impact of restricting the input

In [None]:
data_ir.info()

In [None]:
sns.violinplot(x=data_ir["price"])
plt.show()

In [None]:
sns.violinplot(x=data_ir["valuation"])
plt.show()

What does the picture tell you? If you have restricted the input like me, to valuations of at most 10,000 DKK (note that the cell above uses 25,000), then there is some discrepancy between the valuation and the sales price. It tells us that some products were valued a great deal lower than the actual sales price turned out to be. So is the valuation really a good predictor of sales prices at all?

### Checking the correlation coefficient between price and valuation

In [None]:
np.corrcoef(data_ir[['price', 'valuation']], rowvar=False)[0,1]

A fairly high correlation of 67 pct. This serves as evidence that the valuation is a useful predictor of the price. Keep in mind that this number will probably change as you change the input restrictions.
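For reference, the coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers; `pearson_r` and the toy arrays are illustrative names, not part of the workshop data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


toy_valuation = np.array([1000, 2000, 3000, 4000, 5000])
toy_price = np.array([900, 2600, 2500, 5200, 4300])

r = pearson_r(toy_valuation, toy_price)
print(round(r, 3), round(r ** 2, 3))
```

The squared coefficient is the share of variance a straight-line fit on valuation alone would explain, so a correlation of 0.67 corresponds to roughly 45 pct. (0.67² ≈ 0.45).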

### How about the distribution of the product categories?

In neural nets you generally want each category to be represented more than a few times in order to substantiate a pattern. Therefore we delete product categories that occur fewer than 10 times, just to be sure. This threshold is a hyperparameter as well, so you can play with it.

In [None]:
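for col in list(data_ir.columns):
    if col in ('price', 'valuation'):
        continue  # only the binary category columns are candidates for removal
    count = data_ir[col].sum()
    if count < 10:
        print('Removing column %s that only occurs %i times' % (col, count))
        data_ir = data_ir.drop(columns=col)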
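The same filter can be expressed without a loop. This is a sketch on a toy array with a toy threshold; the category names are hypothetical.

```python
import numpy as np

MIN_COUNT = 2  # toy threshold; the workshop uses 10

toy_categories = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
toy_names = np.array(["furniture", "stamps", "lamps"])  # hypothetical category names

# Column sums count how many products belong to each category
counts = toy_categories.sum(axis=0)
keep = counts >= MIN_COUNT

filtered = toy_categories[:, keep]
print(toy_names[keep], filtered.shape)
```

Note that only columns are removed. A product whose single category was dropped stays in the data with all-zero category features, as the third row does here.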

### Split the data for validation purposes

In [None]:
train, test = train_test_split(data_ir, test_size=0.2, random_state=RANDOM_SEED)

We are splitting the data in order to validate the model on unseen data. For now we split the data into a training set and a test set. Keras will independently split the training data into a training and a validation set. That means we fit the model weights on the training data, monitor progress on the validation data (this is what early stopping watches), and finally test the model on the test data.
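The two splits stack, so it helps to work out how much of the full dataset ends up in each part. A small sketch, using the 20 pct. test size from above and the 10 pct. validation split passed to `fit` later on; the function name `split_shares` is made up.

```python
def split_shares(test_size, validation_split):
    """Return the (train, validation, test) shares of the full dataset."""
    train_and_validation = 1.0 - test_size
    validation = train_and_validation * validation_split
    train = train_and_validation - validation
    return train, validation, test_size


train_share, val_share, test_share = split_shares(test_size=0.2, validation_split=0.1)
print(f"train {train_share:.2f}, validation {val_share:.2f}, test {test_share:.2f}")
```

So about 72 pct. of the rows are used to fit the weights, 8 pct. to monitor training and 20 pct. for the final test.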

## Creating a Keras NN 

#### Initializing the graph

In [None]:
graph = Sequential()

If you are familiar with TensorFlow, you have probably heard the word 'graph' a few times. Calling Sequential is like initializing this graph. It is basically like starting with an empty piece of paper: it has boundaries in the form of edges, but it contains nothing. The graph still doesn't know how many input features we add to it, the structure of the hidden layers or how many dimensions the output layer will have.

#### Creating a hidden layer 

In [None]:
# Number of input features: every column except the target
n_features = train.drop('price', axis=1).shape[1]
nodesHidden1 = int((n_features + 1) / 10)
graph.add(Dense(units=nodesHidden1, kernel_initializer="normal",
                # kernel_regularizer=regularizers.l2(0.01),
                # activity_regularizer=regularizers.l1(0.01),
                activation='relu', input_shape=(n_features,)))

When we add layers to our graph, we draw connections between the nodes of adjacent layers. These connections are often referred to as weights. When adding a layer we need to define its dimensions (the number of nodes or neurons), the activation function and the dimensions of the prior layer. Play around with the activation function.
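What such a layer computes can be sketched in a few lines of NumPy: a weighted sum of the inputs plus a bias, passed through the activation. The sizes and the name `dense_relu` below are toy assumptions, not the workshop's actual layer.

```python
import numpy as np


def dense_relu(inputs, weights, bias):
    """One fully connected layer: weighted sum plus bias, then ReLU."""
    return np.maximum(0.0, inputs @ weights + bias)


rng = np.random.default_rng(0)
n_features, n_units = 6, 2  # toy sizes
batch = rng.normal(size=(4, n_features))          # 4 rows of input
weights = rng.normal(scale=0.05, size=(n_features, n_units))
bias = np.zeros(n_units)

hidden = dense_relu(batch, weights, bias)

# Every input-to-node connection is one weight, plus one bias per node
n_parameters = weights.size + bias.size
print(hidden.shape, n_parameters)
```

Each of the 4 rows is mapped to 2 hidden values, and the layer has 6 × 2 weights plus 2 biases to learn.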

#### Adding dropout to prevent overfitting

In [None]:
# drop_rate = 0.25
# graph.add(Dropout(drop_rate))

Dropout is a way to prevent overfitting. During training it randomly drops nodes/neurons from the graph, so the graph cannot simply memorize the training data.
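A rough sketch of the mechanism, assuming the common "inverted dropout" formulation in which surviving activations are scaled up so their expected value is unchanged. The function below is illustrative and not the Keras implementation.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero a random share of the units and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 10))
dropped = dropout(activations, rate=0.25, rng=rng)

print((dropped == 0).mean())  # share of units that were dropped
print(dropped.mean())         # average activation stays close to the original
```

About a quarter of the units are zeroed on each pass, and a different quarter every time, so no single node can be relied on too heavily.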

#### Adding the output layer

In [None]:
graph.add(Dense(units=1, kernel_initializer='normal'))

Since we are dealing with a one-dimensional output, which is the case in regression problems, we only define one output dimension. Had we been dealing with a classification problem, we would typically define one output node per category with a softmax activation (or a single sigmoid node for a binary problem).

#### Compiling the graph

In [None]:
graph.compile(optimizer='adam', loss='mean_squared_error')

Compiling configures the graph for training: it defines the optimization algorithm and the loss function that training minimizes, which is also how we measure our results. We are using the Adam optimizer with its default parameters, but you can play around with this as well.
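To make the optimizer less of a black box, here is a sketch of a single Adam update applied to a toy one-parameter loss. The learning rate and beta defaults are the values commonly cited for Adam; the epsilon value, the toy loss and the name `adam_step` are assumptions for illustration.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)

print(round(w, 2))
```

Because the step is divided by the root of the squared-gradient average, the first updates move by roughly the learning rate regardless of how large the gradient is.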

#### Using Early stopping

In [None]:
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3, verbose=1, mode='auto')
callbacks_list = [earlystop]  # fit() takes a list because several callbacks can be used at once

We use early stopping because we don't want to keep training our graph when there is no advantage to gain. The important keyword here is 'patience'. It tells the graph how many epochs it should train without improving before stopping.
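The patience logic can be sketched in plain Python. This is a simplified stand-in for what the callback does, not the Keras source; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the zero-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


toy_val_losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.0]
print(early_stopping_epoch(toy_val_losses))
```

Here the best loss of 7.0 is followed by three epochs without improvement, so training stops before the later, better value of 6.0 is ever seen. A larger patience trades training time for a chance to catch such late improvements. Also note that with a squared-error loss on prices in DKK, a `min_delta` of 0.0001 is effectively zero.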

#### Fitting the model

When fitting the model, we pass the input, the output and the batch size (the number of rows we want to train on at a time). The epochs are the maximum number of passes over the training data, and the validation split defines the size of the validation set. Remember that the validation set is for measuring improvement after each epoch.

In [None]:
trainingHist = graph.fit(train.drop('price', axis=1).values, train['price'].values,
                         batch_size=50, epochs=100, callbacks=callbacks_list,
                         validation_split=0.1)
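Batch size and epochs together determine how many weight updates the model can make. A small sketch of the arithmetic, with a hypothetical row count:

```python
import math


def steps_per_epoch(n_rows, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_rows / batch_size)


n_train_rows = 720  # hypothetical number of rows left after the validation split
steps = steps_per_epoch(n_train_rows, batch_size=50)
max_updates = steps * 100  # upper bound if early stopping never triggers

print(steps, max_updates)
```

The last batch of each epoch is smaller when the row count is not a multiple of the batch size. Smaller batches give more, noisier updates per epoch; larger batches give fewer, smoother ones.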

In [None]:
plot_model(graph, to_file='graph.png', show_shapes=True)
Image('graph.png')

In [None]:
print(trainingHist.history.keys())
# summarize the loss history
plt.plot(trainingHist.history['loss'])
plt.plot(trainingHist.history['val_loss'])
plt.ylabel('loss (mean squared error)')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

#### Measuring the performance of the graph

Firstly, we calculate the predictions based on the training and test data

In [None]:
y_pred_test = graph.predict(test.drop('price', axis=1).values)
y_pred_train = graph.predict(train.drop('price', axis=1).values)

Secondly, we calculate some scoring statistics. We use the root mean squared error (RMSE) and the coefficient of determination (R²), which is the share of the variance in price that the model explains.

In [None]:
meanPriceArray = np.ones(len(train)) * train['price'].mean()

In [None]:
rmseBenchmark = np.sqrt(mean_squared_error(y_true = train['price'].values, y_pred = meanPriceArray))
rmseTrain = np.sqrt(mean_squared_error(y_true = train['price'].values, y_pred = y_pred_train))
rmseTest = np.sqrt(mean_squared_error(y_true = test['price'].values, y_pred = y_pred_test))

In [None]:
accuraciesBenchmark = r2_score(y_true = train['price'].values, y_pred = meanPriceArray)
accuraciesTest = r2_score(y_true = test['price'].values, y_pred = y_pred_test)
accuraciesTrain = r2_score(y_true = train['price'].values, y_pred = y_pred_train)
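To see what the two scores measure, and why the mean price is a natural benchmark, here is a hand-rolled sketch on made-up prices. The functions `rmse` and `r_squared` are illustrative and are not the scikit-learn implementations.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


toy_prices = np.array([100.0, 200.0, 300.0, 400.0])
mean_prediction = np.full_like(toy_prices, toy_prices.mean())

print(r_squared(toy_prices, mean_prediction))
print(round(rmse(toy_prices, mean_prediction), 2))
```

Always predicting the mean gives an R² of exactly 0 and an RMSE equal to the standard deviation of the prices, so any useful model should beat both. If you use hand-rolled functions like these on Keras predictions, flatten the predictions first (for example with `.ravel()`), since a column of shape `(n, 1)` would broadcast against a target of shape `(n,)`.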

Thirdly, we print the scores to check performance and the fitting of the graph

In [None]:
print('\n------------------------------------')
print('\nR2 benchmark: ', round(accuraciesBenchmark, 2))
print('\nRMSE benchmark: ', round(rmseBenchmark, 2))
print('\n------------------------------------')
print('\nR2 train: ', round(accuraciesTrain, 2))
print('\nRMSE train: ', round(rmseTrain, 2))
print('\n------------------------------------')
print('\nR2 test: ', round(accuraciesTest, 2))
print('\nRMSE test: ', round(rmseTest, 2))
print('\n------------------------------------')

How did we do? And can you improve our results? How would you go about dealing with the overfitting?