In this simple notebook we use a fully connected neural network to solve a previously seen problem in classification: the particle physics ID problem.

It accompanies Chapter 8 of the book.

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

In [None]:
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 150)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams['figure.dpi'] = 300

TensorFlow is a widely used library for developing Deep Learning models. It is an open-source platform developed by Google, and it can be used from several programming languages, e.g. C++, Java, Python, and many others.

Keras is a high-level API (Application Programming Interface) that is built on top of TensorFlow (or Theano, another Deep Learning library). It is Python-specific, and we can think of it as the equivalent of the sklearn library for neural networks. It is less general and less customizable than TensorFlow, but it is very user-friendly and comparatively easier to use. We will use Keras with the TensorFlow back-end.

In [None]:
import tensorflow as tf

In [None]:
tf.__version__

In [None]:
import keras

from keras.models import Sequential #the model is built adding layers one after the other

from keras.layers import Dense #fully connected layers: every output talks to every input

from keras.layers import Dropout #for regularization

We begin with the 4top vs ttbar problem, and we use the configuration where we added the features "number of leptons", "number of jets", etc. For reference, the optimal SVM achieved 94-95% accuracy. Note that those numbers had not been run through <b>nested</b> cross validation, so they might be slightly optimistic.

In [None]:
X = pd.read_csv('../data/Features_lim_2.csv')

In [None]:
y = np.genfromtxt('../data/Labels_lim_2.txt')

In [None]:
X.values.shape

There is no "built-in" cross validation (or nested cross validation) process, so we would need to build it ourselves. For now, we can build three sets: train, validation (for parameter optimization), and test (for final evaluation). We should ideally build this as a cross-validation structure.
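As a rough sketch of what a cross-validated version of the split could look like, the snippet below uses scikit-learn's `KFold` on a small synthetic array. The arrays `X_demo` and `y_demo` and the choice of 5 folds are assumptions made for illustration, not the notebook's data. In a real run, the network would be rebuilt and refit inside the loop for every fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-in for the real features/labels (illustration only)
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(50, 4))
y_demo = rng.integers(0, 2, size=50)

kf = KFold(n_splits=5, shuffle=True, random_state=10)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    X_tr, X_va = X_demo[train_idx], X_demo[val_idx]
    y_tr, y_va = y_demo[train_idx], y_demo[val_idx]
    # A fresh model would be built, fit on (X_tr, y_tr) and scored on (X_va, y_va) here
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```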

In [None]:
#Always shuffle first

X,y = shuffle(X,y, random_state = 10)

In [None]:
X_train = X.values[:3000,:]
y_train = y[:3000]

In [None]:
X_val = X.values[3000:4000,:]
y_val = y[3000:4000]

In [None]:
X_test = X.values[4000:,:]
y_test = y[4000:]

In [None]:
X_train.shape, X_val.shape, X_test.shape

### Building the network

Let's think about the model architecture.

Our input layer has 24 neurons. 

Our output layer has one neuron (the output is the probability that the object belongs to the positive class). We could also set it up as two neurons (and have softmax as the final non-linearity), but this is redundant in a binary classification problem.
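To see why the two-neuron softmax set-up is redundant, note that a softmax over two logits gives the same positive-class probability as a sigmoid applied to their difference. The short NumPy check below illustrates this; the logit values are arbitrary numbers chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Arbitrary logits for (negative class, positive class), one row per object
logits = np.array([[0.3, 1.2], [2.0, -1.0], [0.0, 0.0]])

p_softmax = softmax(logits)[:, 1]                 # two-neuron output
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])  # one-neuron equivalent

print(np.allclose(p_softmax, p_sigmoid))
```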

We will add two hidden layers. Here I'm making their sizes = 20 (I should optimize this hyperparameter!). We can also reserve the possibility of adding a dropout layer after each one. The dropout fraction should also be optimized through CV.
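For a sense of scale, we can count the trainable parameters of this architecture by hand: each fully connected layer has (inputs × outputs) weights plus one bias per output. The sketch below does the arithmetic for the 24-20-20-1 layout described above; `model.summary()` in Keras should report the same total once the network is built. Dropout layers do not add any parameters.

```python
# Layer sizes: 24 input features, two hidden layers of 20 neurons, 1 output
layer_sizes = [24, 20, 20, 1]

n_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_layer = n_in * n_out + n_out  # weights + biases
    print(f"{n_in:>2} -> {n_out:>2}: {n_layer} parameters")
    n_params += n_layer

print("Total trainable parameters:", n_params)
```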

Other decisions that we have to make are: which nonlinearities we use (for now: ReLU for hidden layers, sigmoid for the final one), which optimizer we use (Adam), which starting learning rate we adopt (here 0.001, but again this should be decided through CV), the number of epochs (e.g. 100; we can plot quantities of interest to check that we have enough), the batch size for the gradient descent step (here 200, but can explore!) and the loss function. The latter is the binary cross entropy, which is the standard choice for classification problems where we output a probability. It rewards "confidence" in a correct prediction (high probability). 

The commands below can be used to explore possible choices.

In [None]:
dir(keras.optimizers)

In [None]:
dir(keras.losses)

A standard choice for a case like ours, where the labels are 0/1 but we can predict a probability, is the binary cross-entropy or log loss:

$L = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \cdot \log(p(y_i)) + (1-y_i) \cdot \log (1 - p(y_i)) \right]$

Here $p(y_i)$ is the predicted probability that object $i$ belongs to the positive class. The loss penalizes positive examples that are assigned a low predicted probability, and negative examples that are assigned a high predicted probability.
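As a quick numerical illustration of the formula above, here is a minimal NumPy version of the binary cross-entropy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the toy labels and probabilities are assumptions made for this example; Keras has its own implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # Clip probabilities so that log(0) never occurs
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 1, 0, 0])

confident_right = np.array([0.9, 0.8, 0.1, 0.2])
confident_wrong = np.array([0.1, 0.2, 0.9, 0.8])
uninformative = np.full(4, 0.5)

for name, p in [("confident, right", confident_right),
                ("uninformative", uninformative),
                ("confident, wrong", confident_wrong)]:
    print(f"{name}: {binary_cross_entropy(y_true, p):.3f}")
```

The loss is smallest for confident correct predictions, equals log(2) when every probability is 0.5, and grows quickly for confident wrong predictions.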

In [None]:
dir(keras.activations)

### This is how we build a fully connected neural network in keras.

In [None]:
model = Sequential()

# Add the first hidden layer (20 neurons); input_shape specifies the number of original features

model.add(Dense(20, activation='relu', input_shape=(24,)))

# Add a second hidden layer and specify its size

model.add(Dense(20, activation='relu'))

# Add an output layer 

model.add(Dense(1, activation='sigmoid'))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics = ['accuracy']) 

The "metrics" keyword serves to specify additional quantities we would like to monitor during training. The value of the loss is hard to interpret by itself, so we'll keep an eye on the accuracy.
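For a binary problem with a sigmoid output, the accuracy is obtained by thresholding the predicted probability (0.5 is the usual default threshold) and counting the fraction of correctly predicted labels. A minimal NumPy sketch, with made-up probabilities and labels:

```python
import numpy as np

# Made-up predicted probabilities and true labels (illustration only)
p_pred = np.array([0.92, 0.40, 0.55, 0.08, 0.61])
y_true = np.array([1, 1, 0, 0, 1])

y_pred = (p_pred > 0.5).astype(int)   # threshold the probabilities
accuracy = np.mean(y_pred == y_true)  # fraction of correct predictions

print(y_pred, accuracy)
```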

### Ready to fit?

I hope so! Note also the additional hyperparameters "epochs" (the number of full forward-and-backward passes through the training set) and "batch_size" (how many training examples are used at each weight-update step).
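To make these two hyperparameters concrete: with the 3000 training objects and the batch size of 200 used here, each epoch consists of 15 weight updates, so 100 epochs amount to 1500 updates in total. The arithmetic is sketched below.

```python
import math

n_train = 3000     # size of the training set in this notebook
batch_size = 200
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be smaller
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```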

In [None]:
mynet = model.fit(X_train, y_train, validation_data= (X_val, y_val), epochs = 100,  batch_size=200)

This doesn't look so good.

In [None]:
plt.hist(model.predict(X_test), alpha = 0.5, label = 'pred')
plt.hist(y_test, alpha = 0.5, label = 'true')
plt.legend();

It's also helpful to plot the training and validation losses throughout the epochs.

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)

#plt.show()

#plt.savefig('FirstNN.png', dpi= 300)

### Learning Check-in
    
Based on the graphs above, how would you say this classifier is doing? Does it suffer from high variance or high bias?

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```
The train and validation scores are close, so it's a high bias, not high variance problem. This is confirmed by the fact that the scores are really poor: around 70% accuracy, compared to the > 90% we obtained with SVMs.
```

</p>
</details>

### When something goes wrong, our first step should always be going back to the fundamentals of data exploration/setup.

In [None]:
X.describe()

### Yep, we forgot scaling!

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

### Learning Check-in
    
Apply the scaler above to the correct sample.

<br>

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>

As usual, we only use the training set to derive the scaling! We need to run:

```python
scaler.fit(X_train)
```

</p>
</details>

In [None]:
# Run the code from the learning check-in to proceed!



We can now use the scaler that had been fit to transform the relevant data sets.
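As a reminder of what the transformation does, the sketch below reproduces standardization with plain NumPy on synthetic data: the mean and standard deviation of each feature are computed on the training rows only, and then applied to every set. The arrays and the split sizes are assumptions for illustration; `StandardScaler` performs the same computation (using the population standard deviation) behind `fit` and `transform`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features with very different scales (illustration only)
X_demo = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(500, 3))
X_tr, X_va = X_demo[:400], X_demo[400:]

# "fit": statistics come from the training set only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)   # ddof=0, i.e. population standard deviation

# "transform": the same statistics are applied to every set
Xst_tr = (X_tr - mu) / sigma
Xst_va = (X_va - mu) / sigma

print(Xst_tr.mean(axis=0))  # zero up to rounding error
print(Xst_va.mean(axis=0))  # close to, but not exactly, zero
```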

In [None]:
Xst = scaler.transform(X.values) # X is a DataFrame; the scaler was fit on a NumPy array

In [None]:
Xst.mean(axis=0) #Per-feature means. Note: not exactly zero on the whole data set!

In [None]:
Xst_train = scaler.transform(X_train)
Xst_val = scaler.transform(X_val)
Xst_test = scaler.transform(X_test)

In [None]:
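# Note: fit() resumes from the current weights; re-run the model-building cell first for a fresh start
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=200)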

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)
#plt.show()

#plt.savefig('ScaledNN.png', dpi= 300)

### Learning Check-in
    
What is your assessment of the above classifier?

<br>

<details><summary><b>Click here for the answer!</b></summary>
<p>
    
```
The performance is now comparable to what we had obtained with SVMs. There are hints of high variance/overfitting, as shown by the gap between train and validation scores; it is hard to know how significant the gap is without a cross-validated approach. We can also see that the validation loss is increasing; this indicates that some regularization technique, such as early stopping and/or a Dropout layer, could help here.
```

</p>
</details>
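To illustrate the early-stopping idea, the sketch below applies a simple patience rule to a made-up validation-loss curve: training stops once the loss has failed to improve for `patience` consecutive epochs, and the best epoch is remembered. The helper `early_stopping_epoch` and the numbers are hypothetical; in Keras, comparable behaviour is available through the `EarlyStopping` callback (with arguments such as `monitor`, `patience` and `restore_best_weights`) passed to `model.fit`.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation loss: decreases, then creeps back up (overfitting)
val_loss = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.42, 0.45]

best, stop = early_stopping_epoch(val_loss, patience=3)
print(f"best epoch: {best}, training stopped at epoch: {stop}")
```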

In [None]:
model = Sequential()

# Add the first hidden layer (20 neurons); input_shape specifies the number of original features

model.add(Dense(20, activation='relu', input_shape=(24,)))

model.add(Dropout(0.2)) #This is the dropout fraction

# Add a second hidden layer and specify its size

model.add(Dense(20, activation='relu'))

model.add(Dropout(0.2)) #This is the dropout fraction

# Add an output layer 

model.add(Dense(1, activation='sigmoid'))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics = ['accuracy']) 

#The metrics keyword here is for other possible quantities we would like to monitor
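For intuition about what a `Dropout(0.2)` layer does during training, here is a minimal NumPy sketch of (inverted) dropout: each activation is set to zero with probability 0.2, and the survivors are rescaled by 1/(1 - 0.2) so that the expected sum is unchanged. At prediction time the layer does nothing. The function `dropout` and the toy activations are assumptions for illustration, not Keras internals.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    if not training:
        return activations                        # identity at prediction time
    keep = rng.random(activations.shape) >= rate  # True with probability 1 - rate
    return activations * keep / (1.0 - rate)      # rescale the survivors

rng = np.random.default_rng(0)
a = np.ones((1000, 20))                           # toy activations of a 20-neuron layer

a_train = dropout(a, rate=0.2, rng=rng)
a_test = dropout(a, rate=0.2, rng=rng, training=False)

print("fraction dropped:", np.mean(a_train == 0))  # close to 0.2
print("mean activation:", a_train.mean())          # close to 1
```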

In [None]:
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=200)

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)

#plt.savefig('RegularizedNN.png', dpi= 300)
#plt.show()

In [None]:
# Final evaluation of the model (note this is done on the test set, so that if we do parameter optimization in the validation fold, this will be outside).

scores = model.evaluate(Xst_test, y_test, verbose=1)

print("Accuracy: %.2f%%" % (scores[1]*100)) #"scores" contains the test loss and the accuracy, which we are monitoring

In [None]:
scores