# Learning From Data
The focus of this project is to compare deep neural networks with traditional machine learning methods. For the latter, we decided to use Logistic Regression, Naive Bayes and Random Forests.<br>
The project only required comparing the methods on one dataset; however, we used two in order to better demonstrate the differences:<br>
* First is the MNIST dataset (downloaded through `tensorflow.keras.datasets`). We chose this because of its large number of samples and because it's composed of images.
* Second is the [Heart dataset](https://www.kaggle.com/johnsmith88/heart-disease-dataset). We chose this to contrast with the data type of MNIST and because of its smaller sample size.

In [83]:
import os 
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from tensorflow.keras.datasets import mnist
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense

This function will be used to print the evaluation metrics of our models.

In [58]:
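def stats(y_true, y_pred, binary=False):
    # Print the confusion matrix and accuracy of the predictions.
    # Use the y_true argument rather than the global y_test.
    print(confusion_matrix(y_true, y_pred))
    print('Accuracy :', round(accuracy_score(y_true, y_pred), 2))
    if binary:
        # These metrics use scikit-learn's binary default (positive label = 1).
        print('Precision :', round(precision_score(y_true, y_pred), 2))
        print('Recall :', round(recall_score(y_true, y_pred), 2))
        print('F1 :', round(f1_score(y_true, y_pred), 2))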
    

## The MNIST dataset
Fetch the data, which already comes separated in train and test.

In [9]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


### Getting to know the data

In [10]:
print(x_train.shape)
print(x_test.shape)

(60000, 28, 28)
(10000, 28, 28)


Each sample is a 28x28-pixel image. Since we are using a "regular" (fully connected) neural network rather than a convolutional one, as well as traditional ML methods, the input needs to be reshaped. <br>
Here we flatten the images so that, instead of a 28x28 matrix, each sample becomes a vector of 784 pixels.

In [17]:
x_train = x_train.reshape(60000,-1)
print(x_train.shape)

x_test = x_test.reshape(10000,-1)
print(x_test.shape)

(60000, 784)
(10000, 784)
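As a small illustration of what the flattening does, the sketch below applies the same `reshape` call to two made-up 28x28 arrays (not MNIST images) and shows where a given pixel ends up in the flat vector. The array contents are an assumption chosen only to make the positions easy to follow.

```python
import numpy as np

# Two made-up "images" whose pixel values are simply their running index.
images = np.arange(2 * 28 * 28).reshape(2, 28, 28)

# Same operation as in the notebook: keep the sample axis, merge the rest.
flat = images.reshape(len(images), -1)
print(flat.shape)

# Rows are laid out one after another, so pixel (row=1, col=0)
# of an image lands at position 1 * 28 + 0 = 28 of its flat vector.
print(images[0, 1, 0] == flat[0, 28])
```

Note that flattening discards the explicit 2D neighbourhood of each pixel; the models only see 784 independent input features.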


### Pre-processing
We wanted to use the full image in this task, so we won't apply many transformations. Our only pre-processing step for this dataset is therefore to normalize it, scaling every feature to the [0, 1] range.

In [19]:
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
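The sketch below shows, on a tiny made-up array, what `MinMaxScaler` computes: each column is rescaled with the minimum and maximum seen during `fit`. The numbers are assumptions for illustration only. It also shows why the scaler is fitted on the training data and merely applied to the test data: test values outside the training range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: three training rows, one test row, two features.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler().fit(train)

# Manual version of the same transformation, using training statistics only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_train = (train - col_min) / (col_max - col_min)
manual_test = (test - col_min) / (col_max - col_min)

print(np.allclose(manual_train, scaler.transform(train)))
print(scaler.transform(test))  # second feature exceeds 1: it is above the training max
```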

### Creating the model
We tried many configurations for our network and the following was the best we found. 

In [20]:
model = Sequential()

model.add(Dense(256, activation='relu', input_shape=(784,)))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])

### Fit parameters
The EarlyStopping callback stops the fit function once its monitored value stops improving. We used it so we would not have to worry about choosing the number of epochs.<br>
We monitored the loss value (the training loss, since no validation data is passed to fit) with a patience of 2. This means that if the loss value doesn't improve for two epochs in a row, the fit function stops and the best weights are restored.<br>
The epochs parameter is set to 50 so the EarlyStopping has "room to breathe".<br>
The batch_size parameter is set to 32 because it's a good balance between speed and memory.
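The patience rule described above can be sketched in plain Python. This is a simplified model of the behaviour, not the Keras implementation (which has further options such as `min_delta`); the function name `simulate_early_stopping` and the loss values are hypothetical.

```python
def simulate_early_stopping(losses, patience):
    """Return (epoch index at which training stops, index of the best epoch)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch

# Made-up loss curve: it improves for three epochs, then gets worse twice.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]
stopped_at, best_epoch = simulate_early_stopping(losses, patience=2)
print(stopped_at, best_epoch)
```

In this made-up run the last value (0.30) is never reached, because training stops as soon as two epochs in a row fail to improve; with `restore_best_weights=True` the weights from the best epoch would be the ones kept.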

In [21]:
es = EarlyStopping(monitor='loss', patience=2, restore_best_weights=True)
callbacks = [es]
model.fit(x_train,y_train,epochs=50, batch_size=32, callbacks=callbacks)

Train on 60000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50


<tensorflow.python.keras.callbacks.History at 0x7f3f7f3d3ed0>

### Evaluating the model

In [24]:
y_pred = model.predict(x_test)
y_pred = np.argmax(y_pred, axis=1)
stats(y_test,y_pred)

[[ 972    0    1    0    0    0    2    1    3    1]
 [   0 1113    2    5    2    1    5    2    5    0]
 [   1    0  982   28    1    0    1    8   10    1]
 [   0    0    0  982    0   19    0    4    4    1]
 [   1    0    2    1  953    1    5    2    2   15]
 [   2    0    0    2    1  882    2    0    0    3]
 [   3    1    1    0    1    2  950    0    0    0]
 [   2    0    5    4    0    1    0 1010    2    4]
 [   2    0    0    3    1    3    1    4  956    4]
 [   2    1    0    2    4    4    2    2    3  989]]
Accuracy :  0.98


This seems to be a good model, with only about 2% error on the test data. 
Let's see how it compares to other classifiers. 

## Other Machine Learning methods

In [25]:
lr = LogisticRegression(max_iter=1000, random_state=7)
lr.fit(x_train,y_train)
y_pred = lr.predict(x_test)

stats(y_test,y_pred)

[[ 955    0    2    4    1   10    4    3    1    0]
 [   0 1110    5    2    0    2    3    2   11    0]
 [   6    9  930   14   10    3   12   10   34    4]
 [   4    1   16  925    1   23    2   10   19    9]
 [   1    3    6    3  921    0    7    5    6   30]
 [   9    2    3   35   10  777   15    6   31    4]
 [   8    3    8    2    6   16  912    2    1    0]
 [   1    7   23    7    6    1    0  947    4   32]
 [   9   11    6   22    7   29   13   10  855   12]
 [   9    8    1    9   21    7    0   21    9  924]]
Accuracy :  0.93


In [26]:
nb = GaussianNB()
nb.fit(x_train,y_train)
y_pred = nb.predict(x_test)
stats(y_test,y_pred)

[[ 864    0    3    6    2    5   28    1   44   27]
 [   0 1078    2    0    0    0    9    0   40    6]
 [  79   24  265   90    5    2  264    4  279   20]
 [  32   35    6  347    2    3   50    8  423  104]
 [  20    2    5    4  170    7   53    7  218  496]
 [  68   23    1   19    3   43   34    2  599  100]
 [  12   12    3    1    1    7  892    0   29    1]
 [   0   12    2   10    5    1    5  273   41  679]
 [  12   68    3    7    3   11   11    4  658  197]
 [   5    7    3    6    1    0    1   13   19  954]]
Accuracy :  0.55


In [82]:
rf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=7)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
stats(y_test,y_pred)

  




## Classifiers, UNITE!
Lastly, out of curiosity, we used the Voting Classifier, which combines the predictions of all the classifiers it contains and picks the majority vote. With this dataset, the voting parameter is set to "hard", so it uses each classifier's predicted label instead of its predicted probabilities.
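To make the difference between the two voting modes concrete, here is a minimal NumPy sketch for a single sample. The probabilities are made up for illustration and do not come from our classifiers; the sketch only mimics the idea behind "hard" and "soft" voting, it is not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0 and 1).
probas = np.array([
    [0.45, 0.55],
    [0.40, 0.60],
    [0.95, 0.05],
])

# Hard voting: every classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_label = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_label = mean_proba.argmax()

print(hard_label, soft_label)
```

Here two of the three classifiers lean slightly towards class 1, so hard voting picks class 1, while the one very confident classifier pulls the averaged probabilities towards class 0 under soft voting.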

In [34]:
lr = LogisticRegression(max_iter=1000, random_state=7)
nb = GaussianNB()
rf = RandomForestClassifier(n_estimators=20, random_state=7)

vc = VotingClassifier(estimators=[('lr', lr), ('rf', rf), ('nb', nb)], voting='hard', n_jobs=-1)
vc.fit(x_train,y_train)
y_pred = vc.predict(x_test)
stats(y_test,y_pred)

KeyboardInterrupt: 

Although the idea of the Voting Classifier is interesting, it didn't improve the results as we expected; we had hoped for results similar to the DNN's.

## The Heart dataset
First, we read the csv.

In [47]:
df = pd.read_csv('heart.csv')
df

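| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |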


### A brief analysis

In [48]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


Although the data seems balanced overall, some attributes are not, such as the fbs column, whose non-zero values all lie above the 75th percentile. We will deal with this after splitting the data.
<br><br>
Let's divide the data into attributes and target and then into train and test. <br>
We set the random_state to a static value so the results are reproducible.

In [49]:
x = df.iloc[:,:-1]
y = df.iloc[:,-1:]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)
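As a quick sanity check of the split, the sketch below applies the same `test_size` and `random_state` to a stand-in array of 1025 row indices (an assumption replacing the real DataFrame) and confirms that repeating the call yields the identical partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1025 rows of the dataset: just their row indices.
rows = np.arange(1025)

train_a, test_a = train_test_split(rows, test_size=0.3, random_state=7)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=7)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))  # same seed, same split
```

The resulting sizes match the 717 training samples reported by `fit` further down and the 308 test samples that the confusion matrices add up to.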

### Pre-processing
First, we normalize the data.

In [50]:
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Second, we select the 10 best features, using the training data only.

In [51]:
sb = SelectKBest(k=10)
x_train = sb.fit_transform(x_train,y_train.values.ravel())
x_test = sb.transform(x_test)

The removed features are:

In [54]:
for i,b in enumerate(sb.get_support()):
    if not b:
        print(df.columns[i])

trestbps
chol
fbs


### Creating the model
This time we reduced the size of our net, both in the number of hidden layers and in the neurons per layer, since the dataset is also smaller.

In [55]:
model = Sequential()

model.add(Dense(10, activation='relu', input_shape=(10,)))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(2, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
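To put the size reduction in numbers, the sketch below counts the trainable parameters (weights plus biases) of a stack of Dense layers, using the layer sizes of the two networks defined in this notebook. The helper `dense_param_count` is a hypothetical name introduced here, not part of Keras; `model.summary()` should report the same totals.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of consecutive fully connected layers."""
    total = 0
    for units in layer_sizes:
        total += n_inputs * units + units  # weight matrix + bias vector
        n_inputs = units  # this layer's output feeds the next layer
    return total

# Layer sizes taken from the two Sequential models in this notebook.
mnist_params = dense_param_count(784, [256, 128, 64, 64, 64, 32, 10])
heart_params = dense_param_count(10, [10, 8, 4, 2])

print(mnist_params, heart_params)
```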

### Fit parameters
The fit parameters are the same as the ones used in the MNIST data, for the same reasons.

In [89]:
es = EarlyStopping(monitor='loss', patience=2, restore_best_weights=True)
callbacks = [es]
model.fit(x_train,y_train,epochs=50, batch_size=32, callbacks=callbacks)

Train on 717 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


<tensorflow.python.keras.callbacks.History at 0x7f3f67bce1d0>

### Evaluating the model
Unlike with the MNIST evaluation, here the problem is binary, so we can use more metrics that help us understand the model's behaviour better.
Besides accuracy, we also use:
* Precision : correctly classified positive cases out of all __predicted__ positive cases;
* Recall    : correctly classified positive cases out of all __real__ positive cases;
* F1-Score  : the harmonic mean of precision and recall.
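The definitions above can be checked by hand against a confusion matrix. The sketch below uses the matrix that our network produces on the Heart test set (shown in the next cell's output) and relies on scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix of the DNN on the Heart test set: [[TN, FP], [FN, TP]].
cm = np.array([[120, 21],
               [21, 146]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of all predicted positives, how many are real
recall = tp / (tp + fn)     # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Because this matrix happens to have the same number of false positives and false negatives, precision, recall and F1 coincide.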

In [60]:
y_pred = model.predict(x_test)
y_pred = np.argmax(y_pred, axis=1)
stats(y_test, y_pred, binary=True)

[[120  21]
 [ 21 146]]
Accuracy : 0.86
Precision : 0.87
Recall : 0.87
F1 : 0.87


Even though we feed much less data to the net, it still performs reasonably well, having good accuracy and F1. 
<br>
## Other Machine Learning methods

In [63]:
lr = LogisticRegression(random_state=7)
lr.fit(x_train,y_train.values.ravel())
y_pred = lr.predict(x_test)
stats(y_test, y_pred, binary=True)

[[117  24]
 [ 27 140]]
Accuracy : 0.83
Precision : 0.85
Recall : 0.84
F1 : 0.85


In [65]:
nb = GaussianNB()
nb.fit(x_train,y_train.values.ravel())
y_pred = nb.predict(x_test)
stats(y_test, y_pred, binary=True)

[[112  29]
 [ 27 140]]
Accuracy : 0.82
Precision : 0.83
Recall : 0.84
F1 : 0.83


In [81]:
rf = RandomForestClassifier(n_estimators=20, random_state=7)
rf.fit(x_train,y_train.values.ravel())
y_pred = rf.predict(x_test)
stats(y_test, y_pred, binary=True)

[[141   0]
 [  3 164]]
Accuracy : 0.99
Precision : 1.0
Recall : 0.98
F1 : 0.99


## (and again) Classifiers, UNITE!
This time we set the voting parameter to "soft", since it gave us the best results.

In [79]:
clf1 = LogisticRegression(random_state=7)
clf2 = GaussianNB()
clf3 = RandomForestClassifier(n_estimators=20, random_state=7)

vc = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('rf', clf3)],voting='soft', n_jobs=-1)
vc.fit(x_train,y_train.values.ravel())
y_pred = vc.predict(x_test)
stats(y_test, y_pred, binary=True)

[[125  16]
 [ 16 151]]
Accuracy : 0.9
Precision : 0.9
Recall : 0.9
F1 : 0.9


(and again) It doesn't improve the results as we expected: the Random Forest alone (0.99 accuracy) outperforms the ensemble (0.90). This time, however, the voting classifier is better than our DNN (0.86), but not by much.