# Modeling with Keras

In this notebook, I use TensorFlow (through Keras) to try to build a neural network that performs better than the GLM.

# Todo

Section 3.2 - Load data as dataset type.

In [1]:
# Import libraries

import pandas as pd
import numpy as np

import keras
from keras.layers import Dense
from keras.models import Sequential
from tensorflow.data.experimental import make_csv_dataset

# Loading

## As pandas

In [2]:
dfInput = pd.read_csv('../output/dataReady.csv', index_col = 'idx')
dfInput.quality = pd.Categorical(dfInput.quality)
dfInput.quality = dfInput.quality.cat.codes

columns = [column for column in dfInput.columns if column not in ['Set','citric acid', 'free sulfur dioxide']]

dfTrain = dfInput.loc[dfInput.Set == 'train', columns]
dfVal = dfInput.loc[dfInput.Set == 'valid', columns]
dfTest = dfInput.loc[dfInput.Set == 'test', columns]

In [3]:
dfTrain

| idx | fixed acidity | volatile acidity | residual sugar | chlorides | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 1.9 | 0.076 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 2 |
| 1 | 7.8 | 0.880 | 2.6 | 0.098 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 2 |
| 2 | 7.8 | 0.760 | 2.3 | 0.092 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 2 |
| 3 | 11.2 | 0.280 | 1.9 | 0.075 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 3 |
| 5 | 7.4 | 0.660 | 1.8 | 0.075 | 40.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1591 | 5.4 | 0.740 | 1.7 | 0.089 | 26.0 | 0.99402 | 3.67 | 0.56 | 11.6 | 3 |
| 1593 | 6.8 | 0.620 | 1.9 | 0.068 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 3 |
| 1595 | 5.9 | 0.550 | 2.2 | 0.062 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 3 |
| 1596 | 6.3 | 0.510 | 2.3 | 0.076 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 3 |


## As tf data type

I want to practice with `tf.data` objects (https://www.tensorflow.org/guide/data), which are the closest TensorFlow analogue to PyTorch's `DataLoader`.  
https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset  
https://www.tensorflow.org/tutorials/load_data/csv#using_tfdata

In [4]:
"""tensorTrain = make_csv_dataset('../output/dataReady.csv',
                              label_name = 'quality',
                              batch_size = 512,
                              select_columns = columns)"""

This does not work yet. I'm not sure the call above is correct, so I'll revisit it later.

# Building a Keras Model


## First Architecture

Recall that the GLM reached an accuracy of 63%. Let's try to beat that with a simple neural network.  
I will also reuse the conclusions of the feature analysis in the first notebook, which means removing the features `citric acid` and `free sulfur dioxide`. This step was already done in section 3, when the data was loaded.

In [5]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the target; quality was converted to integer codes in section 3
n_classes = 6  # matches the 6 units of the softmax output layer
target = to_categorical(dfTrain.quality.to_numpy(), num_classes = n_classes)
predictors = dfTrain.drop(columns = 'quality')
n_cols = predictors.shape[1]

# Validation and test predictors; fixing num_classes keeps the one-hot width
# at 6 even if a split happens not to contain the highest class
val_pred_x = dfVal.drop(columns = 'quality')
val_pred_y = dfVal['quality']
val_data = (val_pred_x.to_numpy(), to_categorical(val_pred_y, num_classes = n_classes))
test_pred = dfTest.drop(columns = 'quality')

In [6]:
predictors.shape

(1107, 9)
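One detail worth checking in the encoding step above: when `num_classes` is not passed, `to_categorical` infers the width of the one-hot matrix from the largest code it sees (max + 1). A split that happens to lack the highest quality code would then be encoded with fewer columns than the network's 6 outputs. The dependency-free sketch below illustrates the idea with a hypothetical `one_hot` helper and made-up toy codes; it is not the Keras implementation.

```python
def one_hot(codes, num_classes):
    """Return one row per code, with a single 1.0 at the code's index."""
    rows = []
    for code in codes:
        if not 0 <= code < num_classes:
            raise ValueError(f"code {code} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[code] = 1.0
        rows.append(row)
    return rows


# Toy codes (assumption): the validation split lacks the highest class, 5
toy_val_codes = [2, 3, 4]

# Width that would be inferred from the data alone
inferred_width = max(toy_val_codes) + 1

# Width obtained when the number of classes is fixed explicitly
encoded_val = one_hot(toy_val_codes, num_classes=6)

print(inferred_width, len(encoded_val[0]))  # 5 6
```

Fixing the number of classes keeps the training, validation and test targets the same shape regardless of which codes each split contains.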

https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

In [48]:
from tensorflow.random import set_seed 
from keras.layers import LeakyReLU, BatchNormalization
from keras.callbacks import EarlyStopping

callback_earlyStop = EarlyStopping(monitor = 'val_loss', patience = 5)

set_seed(2)
# Set up the model
model = Sequential()

# Hidden layers: ReLU Dense layers, with batch normalization after the first three
model.add(Dense(100, input_shape=(n_cols,), activation = 'relu'))
model.add(BatchNormalization())
model.add(Dense(90, activation = 'relu'))
model.add(BatchNormalization())
model.add(Dense(60, activation = 'relu'))
model.add(BatchNormalization())
model.add(Dense(30, activation = 'relu'))

# Add the output layer
model.add(Dense(6, activation = 'softmax'))

# Compile the model
model.compile(optimizer = 'adam', 
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Fit the model
model.fit(predictors, target, validation_data = val_data,  
          epochs = 10000,
          batch_size = 120,
          callbacks = [callback_earlyStop])

# Predicted class probabilities for the test set, one column per quality code
predictions = model.predict(test_pred)

# Most likely quality code for each test sample
predicted_class = np.argmax(predictions, axis = 1)

# Probability assigned to quality code 1 for each test sample
predicted_prob_true = predictions[:,1]

Epoch 1/10000
...
Epoch 41/10000
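The log ends at epoch 41 rather than 10000 because of the `EarlyStopping` callback, which monitors `val_loss` with `patience = 5`. The sketch below is a simplified, plain-Python version of a patience rule, assuming "improvement" means a strictly lower validation loss; the function name `epochs_until_stop` and the toy losses are hypothetical, and this is not the Keras source.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which a patience rule would stop training."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The rule never triggered: training runs for every epoch given
    return len(val_losses)


# Toy validation losses (assumption): the best value is reached at epoch 3
toy_losses = [1.00, 0.90, 0.85, 0.86, 0.87, 0.85, 0.88, 0.90, 0.80]
stopped_at = epochs_until_stop(toy_losses, patience=5)
print(stopped_at)  # 8
```

Under this rule, a run that stops at epoch 41 with a patience of 5 last improved at epoch 36. Unless `restore_best_weights=True` is passed to `EarlyStopping`, the model keeps the weights from the final epoch, not the best one.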


The first deep learning attempt reaches an accuracy of `54%`, well below the 63% of the simpler Generalized Linear Model. Maybe the data is more linear than expected (https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222)? Maybe another architecture would suit this problem better?  
Either way, a good exercise is to understand what is going wrong with the network, so let's apply some diagnostic methods to the results.  
Since neural networks are black boxes, it is harder to see what is happening inside them.  
Let's start with precision and recall, together with the confusion matrix. These show which classes the model is getting wrong.  
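As a minimal sketch of that diagnosis, the plain-Python code below builds a confusion matrix and per-class precision and recall from predicted probabilities. The helper names and the toy data are hypothetical; in this notebook the inputs would presumably be the rows of `predictions` and the true quality codes of the test set.

```python
def confusion_matrix(true_codes, predicted_codes, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_code, predicted_code in zip(true_codes, predicted_codes):
        matrix[true_code][predicted_code] += 1
    return matrix


def precision_recall(matrix):
    """Return one (precision, recall) pair per class."""
    num_classes = len(matrix)
    scores = []
    for k in range(num_classes):
        true_positives = matrix[k][k]
        predicted_as_k = sum(matrix[row][k] for row in range(num_classes))
        actually_k = sum(matrix[k])
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        scores.append((precision, recall))
    return scores


# Toy probabilities for 4 samples and 3 classes (assumption, not real output)
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.6, 0.3],
             [0.2, 0.5, 0.3],
             [0.1, 0.2, 0.7]]
toy_true = [0, 1, 2, 2]

# Predicted class = index of the largest probability in each row
toy_pred = [max(range(len(row)), key=row.__getitem__) for row in toy_probs]

toy_matrix = confusion_matrix(toy_true, toy_pred, num_classes=3)
toy_scores = precision_recall(toy_matrix)
print(toy_matrix)  # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(toy_scores)  # [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
```

A class with low recall is one the model rarely predicts when it should. A class with low precision is one the model predicts too often. Classes that are never predicted get a precision of 0.0 here by convention.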


In [14]:
predictions

array([[4.4834005e-06, 6.3279282e-10, 4.3329425e-08, ..., 3.5714686e-01,
        5.2631233e-02, 1.8169214e-04],
       [5.0403793e-07, 1.3289708e-11, 2.0113839e-09, ..., 3.0752793e-01,
        3.3972379e-02, 4.7690737e-05],
       [4.4714748e-03, 4.0964858e-04, 1.2691598e-03, ..., 3.3101794e-01,
        1.3937207e-01, 1.9592891e-02],
       ...,
       [1.6748980e-03, 5.5920071e-05, 2.6740556e-04, ..., 3.9610863e-01,
        1.1850980e-01, 1.0648130e-02],
       [1.6470120e-09, 5.2513941e-16, 5.7090042e-13, ..., 2.2778614e-01,
        8.2790209e-03, 1.0019511e-06],
       [1.5414407e-04, 3.8650828e-07, 7.0067526e-06, ..., 4.5771411e-01,
        8.9703560e-02, 1.8022447e-03]], dtype=float32)