# Case Study 14 
## Chris Irwin
### 04/19/2018

<center>Experimenting with Neural Network Design using Higgs Boson Data</center>

## Introduction
On 4 July 2012, the ATLAS and CMS experiments at CERN's Large Hadron Collider announced they had each observed a new particle. This particle is consistent with the Higgs boson predicted by the Standard Model. According to CERN, the European Organization for Nuclear Research, the Standard Model explains how the basic building blocks of matter interact.<sup>1</sup> The Higgs boson, as proposed within the Standard Model, is the simplest manifestation of the Brout-Englert-Higgs mechanism. Other types of Higgs bosons are predicted by other theories that go beyond the Standard Model.<sup>2</sup>

In October of 2013 the Nobel prize in physics was awarded jointly to François Englert and Peter Higgs "for the theoretical discovery of a mechanism that contributes to our understanding of the origin of mass of subatomic particles, and which recently was confirmed through the discovery of the predicted fundamental particle, by the ATLAS and CMS experiments at CERN's Large Hadron Collider." <sup>2</sup> This discovery is important because scientists believe that the Higgs boson is the particle that gives all matter its mass.<sup>3</sup>

In the following case study, we use neural networks and a range of experimental settings to work with data created from the Higgs boson experiment. Using the Keras and Tensorflow packages, we show the benefit of trying multiple combinations of layers and activations in an attempt to find a model that validates the existence of the boson particle. To test the model's overall ability we use three different values. The first is the ROC score, the area under the Receiver Operating Characteristic curve; this calculation is commonly used to show the performance of a binary classifier.<sup>9</sup> The second is a loss value, calculated with the Keras built-in loss function binary_crossentropy. This loss function calculates the cross-entropy between two distributions, where the true distribution is the true label and the given distribution is the predicted value of the current model.<sup>10</sup> The final value is accuracy. This calculation is built into the Keras package and differs from the ROC calculation because the accuracy function works on batches rather than on the entire data set at once.
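To make the three measures concrete, the sketch below computes each of them by hand on a tiny set of made-up labels and predicted probabilities. It is an illustration only: the function names, the toy values and the clipping constant `eps` are assumptions for this example, and the case study itself relies on `roc_auc_score` from scikit-learn and on the Keras built-ins.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumed value
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of records whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == (y == 1))
    return hits / len(y_true)

def roc_auc(y_true, y_prob):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and predicted probabilities (not HIGGS data).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("ROC AUC :", roc_auc(y_true, y_prob))
print("accuracy:", accuracy(y_true, y_prob))
print("loss    :", binary_crossentropy(y_true, y_prob))
```

Note that the ROC score depends only on how the records are ranked, while accuracy depends on the 0.5 cutoff and the loss depends on how confident each probability is, which is why the three values can move somewhat independently in the experiments that follow.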

## Methods 
### Tensorflow Package
Tensorflow is an open-source library developed by Google to help with the creation of neural networks.<sup>4</sup> The package can leverage both the computer's CPU and its GPU to speed up the process of training and testing.

### Keras Package
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.<sup>5</sup> The Keras package has four guiding principles: user friendliness, modularity, easy extensibility and the ability to work with Python. These guiding principles have produced an easy-to-use package that makes creating deep learning neural networks straightforward and gives data scientists the ability to tune different parameters for enhanced performance.

## Building Neural Networks
The first step in building neural networks is to add the Keras and Tensorflow packages to your project. The benefits of adding both packages are explained above in greater detail. The next step is to import a number of sub-packages which help with creating and developing a neural network.

* Sequential - Allows for individual layers to be added to a neural network.
* Activation - Applies an activation function, chosen by name, to the output of a layer.
* Dense - Implements a fully connected layer: it applies an activation to the product of the input and a kernel, which is the weights matrix created by the layer.
* SGD - The optimizer for stochastic gradient descent.
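As a rough picture of what a Dense layer followed by an Activation computes for a single record, the sketch below multiplies the inputs by a kernel, adds a bias and applies sigmoid. The helper `dense_forward` and all of the numbers are hypothetical and only meant to illustrate the operation; Keras performs the same kind of calculation on whole batches at once.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, kernel, bias, activation):
    """Compute activation(inputs . kernel + bias) for one record.

    kernel[i][j] is the weight from input i to output node j.
    """
    outputs = []
    for j in range(len(bias)):
        z = bias[j] + sum(x * kernel[i][j] for i, x in enumerate(inputs))
        outputs.append(activation(z))
    return outputs

# Hypothetical record with 3 features feeding a layer of 2 nodes.
x_row = [0.5, -1.0, 2.0]
kernel = [[0.1, -0.2],
          [0.0, 0.3],
          [0.05, 0.1]]
bias = [0.0, 0.0]

hidden = dense_forward(x_row, kernel, bias, sigmoid)
print(hidden)
```

Stacking layers with Sequential amounts to feeding the `hidden` list of one layer in as the `inputs` of the next.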

In [2]:
import keras
import tensorflow as tf
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.optimizers import RMSprop
from keras.optimizers import Adagrad
from keras.optimizers import Nadam
from sklearn.metrics import roc_auc_score


Using Theano backend.


After adding the packages to the case study, the next step is to import the data into pandas data frames and then reshape the data to the expected format. Reshaping the data is a very important step, as the Keras and Tensorflow packages are extremely particular when it comes to the shape of the data.

In [3]:
data=pd.read_csv("./HIGGS.csv",nrows=5000000,header=None)
test_data=pd.read_csv("./HIGGS.csv",nrows=200000,header=None,skiprows=6000000)

In [4]:
y=np.array(data.loc[:,0])
x=np.array(data.loc[:,1:])
x_test=test_data.loc[:,1:]
y_test=test_data.loc[:,0]

This case study takes a look at a number of scenarios, combinations and import sizes in an attempt to increase the model's ability to correctly predict outcomes. The first combination we try is a neural net with 5 layers, using the Uniform initializer and the sigmoid activation function on all layers.

In [48]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)



After setting up the layers of the neural network, the next step is to train the model and evaluate it on the test data to see what the outcomes are. The fit function included with the Keras package uses Epochs. According to the Keras documentation, an Epoch is an arbitrary cutoff, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation.<sup>6</sup> We also use batch sizes of 5,000 when training and evaluating the model.
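The relationship between Epochs and batches can be sketched with a small index generator: each Epoch walks over the data once, one batch at a time. The function `iterate_minibatches` and the toy sizes are hypothetical and stand in for what the fit function does internally.

```python
def iterate_minibatches(n_records, batch_size):
    """Yield (start, stop) index pairs that cover the data exactly once."""
    for start in range(0, n_records, batch_size):
        yield start, min(start + batch_size, n_records)

# Hypothetical toy sizes: 10 records, batches of 4, 2 epochs.
n_records, batch_size, epochs = 10, 4, 2

for epoch in range(1, epochs + 1):
    batches = list(iterate_minibatches(n_records, batch_size))
    # A real training loop would apply one optimizer step per batch here.
    print(f"Epoch {epoch}/{epochs}: {batches}")
```

The last batch is allowed to be smaller than the others, so every record is still seen once per Epoch.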

In [49]:
model.fit(x, y, epochs=5, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.5610751371472434
[0.6909302428364754, 0.5333549976348877]


Overall, the neural net above is set up with five layers, a batch size of 5,000 and 5 epochs. It produced a model with an overall accuracy of 53.3%, an ROC score of .5610 and a loss value of .6909 when evaluated on the test set created earlier in the analysis. This is a very poorly performing model. To improve our score, the next steps are to alter the layers and see how the outcome changes. First, in the following neural net, we increase the number of Epochs to see if there is any additional information gained from the additional runs.

In [50]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.5730254816583903
[0.6909645915031433, 0.5333549976348877]


After looking at the results, we see that adding Epochs did not increase the overall accuracy of the model, but we do see an increase in the ROC score. Given this increase, we continue to use the higher number of Epochs going forward to see what additional information we can glean. The next change we try is to remove a single layer to see if that increases the overall model accuracy.
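The accuracy figure is identical in the 5-Epoch and 15-Epoch runs, which may be a sign that the network is assigning every test record to the same class. The outputs above do not confirm this, so it is only a guess. One way to check, sketched here with hypothetical toy data and helper names, is to compare the accuracy against a majority-class baseline and to count how many predictions fall on each side of the 0.5 cutoff.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def predicted_class_counts(probabilities, threshold=0.5):
    """Count how many records fall on each side of the decision threshold."""
    return Counter(int(p >= threshold) for p in probabilities)

# Hypothetical toy data: 11 of 20 labels are positive, and the model's
# outputs all sit just above 0.5, so every record is predicted positive.
toy_labels = [1] * 11 + [0] * 9
toy_probs = [0.51 + 0.001 * i for i in range(20)]

print(majority_baseline_accuracy(toy_labels))
print(predicted_class_counts(toy_probs))
```

In the notebook, the same two helpers could be applied to `y_test` and to the flattened output of `model.predict(x_test)`. If the model accuracy equals the baseline and only one class appears in the counts, the network has not yet learned a useful decision boundary.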

In [51]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.7654891441877527
[0.5766516178846359, 0.6966700002551078]


After removing the layer with a node size of fifty, we see a considerable increase in the model's overall accuracy of sixteen percentage points, a .192 increase in ROC score and a decrease of approximately .11 in the loss value. With this in mind, in the next model we decrease the node size per layer from 200 to 150 to see the overall effect on the model.

In [52]:
model = Sequential()
model.add(Dense(150, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('sigmoid'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.7655810422045217
[0.5763749212026597, 0.6988150030374527]


Looking at the results from the model above, we find that decreasing the node size of each layer produces a similar overall accuracy, ROC score and loss value when applied to the test data set. The interesting part of this section of the analysis is that similar results were achieved using a lower node count, while the smaller node sizes ran in considerably less time, averaging approximately 18-20 seconds less per run.

## Alter Activation Methods

The next step in this case study is to examine the use of different activations. Starting from the model above with 3 layers of 200 nodes each, we apply multiple activations in an attempt to raise the score. Keras gives access to a number of different activations; we focus specifically on the relu, tanh and linear activations. The other two models were subjected to the same process of altering the activations, with code and results available in the appendix located at the end of this case study.
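For reference, the four activations compared in this case study can be written out as plain functions, together with the derivatives that scale the gradient passed back through each layer during training. This is a minimal sketch with hypothetical helper names, not the Keras implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def linear(z):
    return z

# Derivatives, which scale the gradient passed back through each layer.
def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={math.tanh(z):+.4f}  "
          f"relu={relu(z):.1f}  linear={linear(z):+.1f}  "
          f"d_sigmoid={d_sigmoid(z):.4f}  d_tanh={d_tanh(z):.4f}  d_relu={d_relu(z):.1f}")
```

The sigmoid derivative never exceeds 0.25 and shrinks quickly away from zero, whereas relu passes the gradient through unchanged for positive inputs. That difference is one commonly cited reason why relu networks tend to train faster, although the runs in this case study do not isolate the cause.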

### relu Activation

In [53]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.852819234661032
[0.472449616342783, 0.7687800034880639]


The results above are very compelling. By replacing the sigmoid activation on the hidden layers with relu, we found a large increase in the ROC score (from .766 to .853), a seven percentage point increase in accuracy and a .10 decrease in the loss value. An additional benefit of the relu function seems to be a decrease in run time of almost 30 seconds per Epoch.

Continuing the attempt to improve the score of the model, we next swap the relu activation for the tanh activation to see if there is any increase in accuracy or decrease in run time.

### tanh

In [54]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('tanh'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8301749923582097
[0.5036467663943768, 0.7492049977183342]


The results of tanh are still an improvement over the sigmoid activations, with an increase in the ROC score (.830 versus .766), an increase of approximately 5 percentage points in accuracy and a decrease in the loss value of approximately .07. When compared to the relu result, tanh shows a glaring increase in run time, with each run lasting approximately 55 seconds on average, along with a decrease in accuracy and an increase in loss.

Next we look at the linear activation to see how it performs and whether there is any increase in overall model precision.

### linear

In [55]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('linear'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
#rms = RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=1e-6)
#ada = Adagrad(lr=0.01, epsilon=None, decay=0.0)
#nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.6849953880342189
[0.6369215965270996, 0.6425300002098083]


The final activation we attempted produced the lowest results of the three activations we tried. Even when compared to the three-layer model with sigmoid activation, the model's accuracy decreased by approximately five percentage points and the loss value rose by approximately .06. Interestingly, the run times per Epoch improved considerably and were the best out of all the activations used.
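One likely explanation for the weak linear result is that stacked layers with a linear activation collapse into a single linear map, so the extra layers add no modelling power and the network behaves much like a logistic regression on the raw inputs. The sketch below shows the collapse for two small layers; the weights are hypothetical and biases are left out to keep it short.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

# Hypothetical weights for two stacked layers with linear activation.
w1 = [[1.0, 2.0],
      [0.0, -1.0],
      [3.0, 1.0]]        # maps 2 inputs -> 3 hidden nodes
w2 = [[2.0, 0.0, 1.0]]   # maps 3 hidden nodes -> 1 output

x_row = [0.5, -2.0]

two_layers = matvec(w2, matvec(w1, x_row))      # apply the layers one after another
one_layer = matvec(matmul(w2, w1), x_row)       # apply the single combined matrix
print(two_layers, one_layer)
```

Both routes give the same output, which is why a non-linear activation such as relu or tanh is needed between layers for depth to help.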

## Altering Batch Sizes

After finding that the relu activation gives us the highest accuracy, the next step is to see whether increasing or decreasing the batch size improves accuracy further. We run two scenarios: one with batches of 25,000 records and one with batches of 75.

### Batch of 25,000

In [56]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=25000)
score = model.evaluate(x_test, y_test, batch_size=25000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8284343653780737
[0.517031840980053, 0.7381500005722046]


### Batches of 75

In [57]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=75)
score = model.evaluate(x_test, y_test, batch_size=75)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8448842197665147
[0.48791625937446953, 0.7626800020560622]


After running two different batch sizes, we can see that increasing the batch size to 25,000 records decreased the model's accuracy by approximately 3 percentage points and increased the loss value by approximately .04. Decreasing the batch size to 75 slightly decreased the accuracy, by .6 percentage points, and slightly increased the loss value. Overall the performance was very similar between batches of 75 and 5,000, but the run time is dramatically different, with the batches of 75 taking on average 2 minutes per Epoch, an increase of 150%.
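Part of the run-time difference follows from simple arithmetic: a smaller batch means more weight updates per Epoch. The sketch below counts the updates for the three batch sizes tried here, assuming one optimizer step per batch; the helper name is hypothetical.

```python
import math

def updates_per_epoch(n_records, batch_size):
    """Number of mini-batches, and so optimizer steps, in one pass over the data."""
    return math.ceil(n_records / batch_size)

n_train = 5_000_000  # rows read into the training frame
for batch_size in (75, 5_000, 25_000):
    print(batch_size, updates_per_epoch(n_train, batch_size))
```

With 5 million training rows, batches of 75 require 66,667 updates per Epoch, compared with 1,000 for batches of 5,000 and 200 for batches of 25,000. The larger batches also give the optimizer far fewer steps in 15 Epochs, which may explain some of the accuracy lost at 25,000.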

## Altering Kernel Initializers

As explained earlier in this case study, a number of different variables can be altered to increase the accuracy and effectiveness of the model. These settings exist because of the different types of data that can be run through a neural net; whether the data is image related, scientific results or sports statistics, different data requires different settings to get the best results. According to the Keras documentation, initializations define the way to set the initial random weights of Keras layers.<sup>7</sup>

In the following part of the case study we run the previously defined model, with 3 layers of 200 nodes and batch sizes of 5,000, with three different initializer values. Previously in this case study we have used a constant kernel initializer of uniform and have achieved very good overall results. Now we run the same model and compare the results after changing the kernel initializer to orthogonal, VarianceScaling or truncated_normal.
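To give a feel for what an initializer does, the sketch below draws starting weights in two of the ways compared in this section. The ranges used here (uniform on [-0.05, 0.05], and a normal distribution with standard deviation 0.05 that redraws anything beyond two standard deviations) are assumptions based on our understanding of the Keras defaults and should be checked against the documentation; the function names are hypothetical.

```python
import random

def random_uniform(n, low=-0.05, high=0.05, rng=random):
    """Draw n weights uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]

def truncated_normal(n, mean=0.0, stddev=0.05, rng=random):
    """Draw n weights from a normal distribution, redrawing any value
    that falls more than two standard deviations from the mean."""
    weights = []
    while len(weights) < n:
        w = rng.gauss(mean, stddev)
        if abs(w - mean) <= 2.0 * stddev:
            weights.append(w)
    return weights

rng = random.Random(0)  # fixed seed so the sketch is repeatable
u = random_uniform(1000, rng=rng)
t = truncated_normal(1000, rng=rng)
print(min(u), max(u))
print(min(t), max(t))
```

Both schemes start every weight close to zero, which is consistent with the small differences between initializers seen in the results that follow.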

### Orthogonal

In [58]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='orthogonal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='orthogonal'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='orthogonal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='orthogonal')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8525290446002298
[0.473652059584856, 0.7681900024414062]


After updating the kernel initializer value to Orthogonal, we see results that are very similar to those from the run with the Uniform kernel initializer. All four categories (ROC score, accuracy, loss value and time per Epoch) show little difference between the two initializer values.

### VarianceScaling

In [59]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='VarianceScaling'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='VarianceScaling'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='VarianceScaling'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='VarianceScaling')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8500474277993356
[0.4765773430466652, 0.7669149965047837]


The next step of the analysis was to change the initializer from Orthogonal to VarianceScaling. Once again, we see very little difference between the results of Uniform, Orthogonal and VarianceScaling. Overall we see a slight decrease in ROC score, a decrease in accuracy and a slight increase in loss value.


### Truncated_Normal

In [60]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='truncated_normal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8544244084440862
[0.4704068087041378, 0.7702399984002113]


The final initializer value we attempted is truncated_normal. This value gave us our first lowering of the loss value, by approximately .002 to .006 when compared to the three previously used initializer values. Additionally, we saw an increase in the model's accuracy of a fraction of a percentage point and an increase in ROC score. Moving forward to the next step of our case study, we use an initialization value of truncated_normal.


## Altering Neural Net Optimizers 
Finally, after altering layer sizes and counts, changing activation types, testing different batch sizes and trying multiple kernel initializers, we attempt to improve the accuracy of the model by trying different optimizers. The choice of optimizer is an important one, as it commonly affects the overall performance of the model. Below we build the model using three specific optimizers: RMSProp, Nadam and Adagrad.

* RMSProp - An unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.<sup>8</sup>
* Nadam - Combines Adam, which is itself similar to RMSProp with momentum, with the Nesterov accelerated gradient.<sup>8</sup>
* Adagrad - An algorithm for gradient-based optimization that adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.<sup>8</sup>
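The difference between two of these optimizers can be sketched for a single parameter. The update rules below are the textbook forms of Adagrad and RMSProp, written as hypothetical helper functions and run on a toy loss; the value of `eps` is an assumption, and Nadam is left out to keep the example short.

```python
import math

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-7):
    """Adagrad: divide the step by the root of the running SUM of squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (math.sqrt(accum) + eps)
    return theta, accum

def rmsprop_step(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """RMSProp: divide the step by the root of a decaying AVERAGE of squared gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (math.sqrt(avg_sq) + eps)
    return theta, avg_sq

def minimise(step_fn, steps=50, **kwargs):
    """Run an optimizer on the toy loss f(theta) = theta**2, starting from theta = 1."""
    theta, state = 1.0, 0.0
    for _ in range(steps):
        grad = 2.0 * theta          # derivative of theta**2
        theta, state = step_fn(theta, grad, state, **kwargs)
    return theta

print(minimise(adagrad_step, lr=0.01))
print(minimise(rmsprop_step, lr=0.001))
```

Because Adagrad keeps adding to its sum, its effective step size only ever shrinks, while the decaying average in RMSProp lets the step size recover. The learning rates used here match the ones passed to the Keras optimizers in the runs below.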

### RMSProp

In [61]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='truncated_normal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('relu'))  # note: this run uses relu on the output layer; the other runs use sigmoid

rms = RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=1e-6)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=rms)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8335705549727903
[0.5254958048462868, 0.7487200036644935]


Once we altered the optimizer to RMSProp, we found a decrease in ROC score of approximately .02, a decrease in accuracy of approximately .02, an increase in the loss value of approximately .05 and an increase in the amount of time each Epoch took to run. Note that the code for this run also applies a relu activation, rather than sigmoid, to the output layer, so the comparison with the earlier runs is not strictly like for like.

Next we tested the Adagrad optimizer to see what effect it has on the model's performance.

### Adagrad

In [62]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='truncated_normal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

ada = Adagrad(lr=0.01, epsilon=None, decay=0.0)
#nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=ada)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8415409605136148
[0.4891367875039577, 0.7584949985146523]


After altering the optimizer to Adagrad and rerunning the experiment we saw a slight increase in ROC score and accuracy from the new optimizer when compared to the RMSProp run. In comparing the new optimizer to our current best test we found that there is still a decrease in ROC Score and accuracy with an increase in loss. 

Finally, we will run the experiment one last time to see if the Nadam optimizer has any positive effect on the overall model.

### Nadam

In [63]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='truncated_normal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nad)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8593213225090737
[0.4634594313800335, 0.775160001218319]


The Nadam optimizer produced the best results of any model so far. It performed the best of the three optimizers, with an overall ROC score of .859 and an accuracy of .775. When compared to the highest accuracy previously achieved in this case study, the Nadam model increased the accuracy by approximately .005.

After numerous attempts at maximizing the score of our model we have found that having the following settings has produced a very successful model.

* Record Count – 5 million
* Batch Size – 5000
* Layers – 3
* Node Size - 200
* Epochs - 15
* Kernel Initializer – truncated_normal
* Activation – Relu
* Optimizer – Nadam

Earlier in the case study we lowered the node size to see if it would improve the model. That experiment produced only a slight overall change in the model's accuracy and ROC score. In a final attempt to maximize the score, we increase the node size to see if the additional capacity enhances the ability of the model. Below we run two additional analyses: the first increases the node size to 350 and the second increases it to 750.


### Nadam with Nodes of 350

In [5]:
model = Sequential()
model.add(Dense(350, input_dim=x.shape[1], kernel_initializer='truncated_normal'))  # input_dim = number of feature columns in x
model.add(Activation('relu'))
model.add(Dense(350, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(350, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nad)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8616081784806315
[0.4603735379874706, 0.7772049993276596]


Upon completion of the model with node sizes of 350, we found an increase in accuracy of approximately .002, an increase in ROC score of approximately .002 and a slight decrease in the loss value of .003. By increasing the node size by 150, or 75% over the previous model, we saw an increase in time per Epoch of approximately 60 seconds, or 200%, on average. Given the improvement in the model's accuracy measures, we increased the node size to 750 to see if the model would continue to improve.

### Increasing Node Size to 750

In [7]:
model = Sequential()
model.add(Dense(750, input_dim=x.shape[1], kernel_initializer='truncated_normal')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('relu'))
model.add(Dense(750, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(750, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nad)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8597218605430856
[0.46676433756947516, 0.7760750010609627]


The final model, with node sizes of 750, shows a decrease in both accuracy and ROC score relative to the 350-node model, along with a slightly higher loss value. The largest difference was the time per epoch: the average epoch lasted 5.5 minutes, approximately a 300% increase in run time compared to the model with node sizes of 350. Given the increase in time and resources and the decrease in every scoring measure, we chose to stop the experiment and keep the model that gave us the highest results.
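
The comparison behind this decision can be checked directly from the printed scores. The snippet below is a small sketch that copies the ROC, loss and accuracy values from the two cells above; the dictionary layout and variable names are illustrative only.

```python
# Values copied from the printed output of the 350- and 750-node cells above.
results = {
    350: {"roc": 0.8616081784806315, "loss": 0.4603735379874706, "accuracy": 0.7772049993276596},
    750: {"roc": 0.8597218605430856, "loss": 0.46676433756947516, "accuracy": 0.7760750010609627},
}

# Change in each measure when moving from 350 to 750 nodes.
deltas = {
    name: results[750][name] - results[350][name]
    for name in ("roc", "loss", "accuracy")
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")

# Pick the node size with the highest ROC score.
best_nodes = max(results, key=lambda nodes: results[nodes]["roc"])
print("best node size by ROC:", best_nodes)
```

ROC and accuracy both fall and the loss rises when moving to 750 nodes, so the 350-node model is the one selected.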


## Conclusion

The case study above varied a number of settings in an attempt to build a neural network for the Higgs boson data set. The first setting we looked at was the number of layers: our original model had five layers and our second attempt had only four. The decrease in layers produced a significant gain in both accuracy and ROC score. The four-layer model had a ROC score of .7654 against .5730 for the original five-layer model, an increase of approximately .192. Likewise, the accuracy of the four-layer model was .6966 against .5333 for the five-layer model, an increase of approximately .1633. Through multiple iterations we found that, with the original initialization kernel and optimizer, the extra layer added noise, and the result was a less accurate model with a larger loss value. To further test whether the four-layer model was the most efficient, we tested a three-layer model (the code and results are available in the appendix) using the original settings with node sizes of 200 and the final settings with node sizes of 200 and 350. The three-layer model performed better with the original settings, but once the additional settings were altered it ended up performing slightly worse. With this in mind we chose to move forward with a four-layer neural net with node sizes of 200.

After deciding on a four-layer model with node sizes of 200, we tried multiple activation functions: relu, tanh and linear. The relu activation performed best, with a ROC score of .8528 and an accuracy of approximately .7688. These values were a dramatic improvement over the original sigmoid activation: approximately .09 in the model’s ROC score and .07 in its accuracy.

Next, we varied the initialization kernel used for each of the layers. For this case study we used the uniform, truncated_normal, VarianceScaling and Orthogonal initialization kernels. After multiple iterations we found that truncated_normal gave better results than the other three, with a ROC score of .8544. Compared to the four-layer relu model with a uniform initializer, this is an increase of approximately .002 in ROC score and approximately .0014 in accuracy.

Finally, we tested multiple optimizers: Stochastic Gradient Descent (SGD), RMSProp, Adagrad and Nadam. Both the RMSProp and Adagrad models produced ROC and accuracy scores slightly below those of the SGD model. The Nadam optimizer produced the best model, with a final ROC score of .8544 and an accuracy score of .7702.

In conclusion, this case study took an extensive look at neural networks and how different settings can affect a model’s overall performance. After multiple iterations we found that the best model was a four-layer model with relu activations for the first three layers and sigmoid for the final layer, node sizes of 350, truncated_normal initialization kernels and a Nadam optimizer. This final model had a ROC score of .8616, an accuracy score of .7772 and a loss value of .4604. Overall, the largest increase in the model’s accuracy came from changing the number of layers, with the second largest attributed to the change in activation.
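
To make the relative size of each improvement concrete, the sketch below lines up the ROC scores quoted in this conclusion in the order the settings were changed and computes the gain at each step. The stage labels and variable names are illustrative, and the scores are the rounded values from the text rather than re-run results.

```python
# ROC scores as quoted in the conclusion, in the order the settings were changed.
stages = [
    ("five layers, sigmoid (original)", 0.5730),
    ("four layers", 0.7654),
    ("relu activation", 0.8528),
    ("truncated_normal initializer", 0.8544),
    ("Nadam optimizer", 0.8544),
    ("node size 350", 0.8616),
]

# Gain contributed by each change relative to the stage before it.
gains = [
    (name, round(score - previous_score, 4))
    for (_, previous_score), (name, score) in zip(stages, stages[1:])
]

# List the changes from largest to smallest ROC gain.
ranked = sorted(gains, key=lambda item: item[1], reverse=True)
for name, gain in ranked:
    print(f"{name}: +{gain:.4f}")
```

Ranked this way, the change in layer count contributes the largest gain and the change in activation the second largest, which matches the summary above.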


## References 
1. https://home.cern/about/physics/standard-model
2. https://home.cern/topics/higgs-boson
3. https://www.cnn.com/2011/12/13/world/europe/higgs-boson-q-and-a/index.html
4. https://en.wikipedia.org/wiki/TensorFlow
5. https://keras.io/
6. https://keras.io/getting-started/faq/#what-does-sample-batch-epoch-mean
7. https://keras.io/initializers/
8. http://ruder.io/optimizing-gradient-descent/index.html
9. http://www.dataschool.io/roc-curves-and-auc-explained
10. https://en.wikipedia.org/wiki/Cross-entropy

## Appendix

### Model 1 with relu

In [64]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.85499839948284
[0.4692159779369831, 0.7706750005483627]


### Model 1 with tanh

In [65]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('tanh'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8333002811939059
[0.4996121749281883, 0.7516699999570846]


### Model 1 with linear

In [66]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('linear'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(50, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.6847885730110183
[0.6370486095547676, 0.6418449997901916]


### Model 3 with relu

In [67]:
model = Sequential()
model.add(Dense(150, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('relu'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8512386947040244
[0.4747511692345142, 0.7681350007653236]


### Model 3 with tanh

In [68]:
model = Sequential()
model.add(Dense(150, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('tanh'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('tanh'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8294619663828303
[0.504516588151455, 0.7482000023126603]


### Model 3 with linear

In [69]:
model = Sequential()
model.add(Dense(150, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('linear'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(150, kernel_initializer='uniform'))
model.add(Activation('linear'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.6851573061059237
[0.6369462415575982, 0.642044997215271]


### Model with 3-layers and Node size of 200

In [9]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='uniform')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('sigmoid'))
model.add(Dense(200, kernel_initializer='uniform'))
model.add(Activation('sigmoid'))
model.add(Dense(1, kernel_initializer='uniform')) 
model.add(Activation('sigmoid'))


sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=sgd)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.7795043947435436
[0.5631455138325692, 0.7060449972748757]


### 3-layer model with Node Size 200 and final settings

In [10]:
model = Sequential()
model.add(Dense(200, input_dim=x.shape[1], kernel_initializer='truncated_normal')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('relu'))
model.add(Dense(200, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nad)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8503651488793429
[0.47695332765579224, 0.7665249973535537]


### 3-layer model with Node Size 350 and final settings

In [11]:
model = Sequential()
model.add(Dense(350, input_dim=x.shape[1], kernel_initializer='truncated_normal')) # input_dim is the number of feature columns in x (15 here)
model.add(Activation('relu'))
model.add(Dense(350, kernel_initializer='truncated_normal'))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer='truncated_normal')) 
model.add(Activation('sigmoid'))

nad = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer=nad)

model.fit(x, y, epochs=15, batch_size=5000)
score = model.evaluate(x_test, y_test, batch_size=5000)
print(roc_auc_score(y_test,model.predict(x_test)))
print(score)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
0.8555201526011412
[0.4690599337220192, 0.7713600009679794]
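
Most of the cells in this appendix differ only in the number of hidden layers, the node size, the activation and the initialization kernel. As a possible way to cut the repetition, the sketch below wraps that pattern in two hypothetical helpers, `layer_spec` and `build_model`, which are not part of the original notebook. It assumes the same Keras version and the same `Sequential`, `Dense` and `Activation` calls used in the cells above, and it only covers models whose hidden layers all share one width.

```python
def layer_spec(n_hidden, nodes, activation):
    """Return (units, activation) pairs: n_hidden hidden layers plus a sigmoid output."""
    return [(nodes, activation)] * n_hidden + [(1, "sigmoid")]


def build_model(input_dim, n_hidden, nodes, activation="relu",
                initializer="truncated_normal"):
    """Assemble an uncompiled Sequential model from layer_spec (hypothetical helper)."""
    # Imported lazily so layer_spec can be used without Keras installed.
    from keras.models import Sequential
    from keras.layers import Dense, Activation

    model = Sequential()
    for i, (units, act) in enumerate(layer_spec(n_hidden, nodes, activation)):
        if i == 0:
            # Only the first layer needs to be told the input width.
            model.add(Dense(units, input_dim=input_dim, kernel_initializer=initializer))
        else:
            model.add(Dense(units, kernel_initializer=initializer))
        model.add(Activation(act))
    return model


print(layer_spec(2, 350, "relu"))
# [(350, 'relu'), (350, 'relu'), (1, 'sigmoid')]
```

With these helpers, the last cell above would start from `model = build_model(x.shape[1], 2, 350)` and then compile and fit exactly as before.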
