# Discover the Higgs with Deep Neural Networks
# Chapter 2: Create and Train a Neural Network

In this chapter you will train your very first neural network. First, the data preparation from the first chapter is repeated so that the training can start.

In [None]:
# Necessary imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import seed
import os

# Import some common functions created for this notebook
import common

# Random state
random_state = 21
_ = np.random.RandomState(random_state)

## Data Preparation

### Load the Data

In [None]:
# Define the input samples
sample_list_signal = ['ggH125_ZZ4lep', 'VBFH125_ZZ4lep', 'WH125_ZZ4lep', 'ZH125_ZZ4lep']
sample_list_background = ['llll', 'Zee', 'Zmumu', 'ttbar_lep']

In [None]:
sample_path = 'input'
# Read all the samples
no_selection_data_frames = {}
for sample in sample_list_signal + sample_list_background:
    no_selection_data_frames[sample] = pd.read_csv(os.path.join(sample_path, sample + '.csv'))

### Event Pre-Selection

Import the pre-selection functions saved during the first chapter. If the modules are not found, solve and run the notebook of the first chapter first.

In [None]:
from functions.selection_lepton_charge import selection_lepton_charge
from functions.selection_lepton_type import selection_lepton_type

In [None]:
# Create a copy of the original data frame to investigate later
data_frames = no_selection_data_frames.copy()

# Apply the chosen selection criteria
for sample in sample_list_signal + sample_list_background:
    # Selection on lepton type
    type_selection = np.vectorize(selection_lepton_type)(
        data_frames[sample].lep1_pdgId,
        data_frames[sample].lep2_pdgId,
        data_frames[sample].lep3_pdgId,
        data_frames[sample].lep4_pdgId)
    data_frames[sample] = data_frames[sample][type_selection]

    # Selection on lepton charge
    charge_selection = np.vectorize(selection_lepton_charge)(
        data_frames[sample].lep1_charge,
        data_frames[sample].lep2_charge,
        data_frames[sample].lep3_charge,
        data_frames[sample].lep4_charge)
    data_frames[sample] = data_frames[sample][charge_selection]

### Get Test and Training Data

During training, the parameters of the neural network are adjusted to the training data. To investigate how the neural network performs on completely unseen data, we train on only a subset of the given data: 60% of the data is used for training and 40% for later testing. The split is done by the `split_data_frames` function of the `common` module. In addition to splitting, this function rescales the event weights in the given data frames. So after splitting into training and test data, both sets contain fewer simulated events than the original data frames but keep the same total prediction.
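
The idea of splitting and rescaling can be illustrated with a minimal sketch. The helper `split_and_rescale` and the toy weights below are hypothetical and only meant as an illustration. The actual `common.split_data_frames` may be implemented differently, for example by scaling with the inverse split fraction instead of the exact sum.

```python
import random


def split_and_rescale(weights, train_fraction, seed=21):
    """Hypothetical helper: split event weights and rescale each part to the full sum."""
    indices = list(range(len(weights)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(weights))
    total = sum(weights)
    parts = []
    for part_indices in (indices[:n_train], indices[n_train:]):
        part = [weights[i] for i in part_indices]
        # Assumption: scale so that the subset predicts the same total yield
        scale = total / sum(part)
        parts.append([w * scale for w in part])
    return parts


# Made-up event weights for ten simulated events
event_weights = [0.5, 1.0, 1.5, 2.0, 0.25, 0.75, 1.25, 1.75, 1.0, 1.0]
train_weights, test_weights = split_and_rescale(event_weights, 0.6)

print('Simulated events:', len(event_weights), len(train_weights), len(test_weights))
print('Predicted events:', sum(event_weights), round(sum(train_weights), 6), round(sum(test_weights), 6))
```

Each subset contains fewer simulated events, but the sum of its weights matches the sum of the full set.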

In [None]:
train_data_frames, test_data_frames = common.split_data_frames(data_frames, 0.6)

In [None]:
# Print the number of simulated events and the total prediction for the full, train, and test datasets
for name in data_frames.keys():
    print(name)
    full_set = data_frames[name]
    train_set = train_data_frames[name]
    test_set = test_data_frames[name]
    full_set_pred = full_set['totalWeight'].sum()
    train_set_pred = train_set['totalWeight'].sum()
    test_set_pred = test_set['totalWeight'].sum()
    
    
    print(f'Number of simulated events:\tfull:{len(full_set)}\ttrain:{len(train_set)}\ttest:{len(test_set)}')
    print(f'Number of predicted events:\tfull:{round(full_set_pred, 3)}\ttrain:{round(train_set_pred, 3)}\ttest:{round(test_set_pred, 3)}')

## Create the Neural Network

Let's start with a very simple training on the transverse momenta of the leptons. To speed up the training, not all processes are considered.

In [None]:
# The training input variables
training_variables = ['lep1_pt', 'lep2_pt', 'lep3_pt', 'lep4_pt']

Extract the training variables, the event weights, and the classification of the events. The classification is 0 for background processes and 1 for Higgs events. 

In [None]:
values, weights, classification = common.get_dnn_input(train_data_frames, training_variables, sample_list_signal, sample_list_background)

In [None]:
# Show values of 6 random events
random_idx = [1841, 11852, 15297, 263217, 278357, 331697]
# Training variables
print('Training values:')
print(values[random_idx])
# Event weights
print('Event weights:')
print(weights[random_idx])
# Classification
print('Classification:')
print(classification[random_idx])

TensorFlow is used to create and train neural networks. For this purpose, the required module is imported and the existing training values and classifications are converted into a TensorFlow dataset.

In [None]:
# Import the tensorflow module to create a neural network
import tensorflow as tf
from tensorflow.data import Dataset

In [None]:
# Convert the data to tensorflow datasets
train_data = Dataset.from_tensor_slices((values, classification))
train_data = train_data.shuffle(len(values), seed=random_state)
# Set the batch size
train_data = train_data.batch(128)

Let's create a neural network consisting of several layers.

A full list of layer types can be found here: https://www.tensorflow.org/api_docs/python/tf/keras/layers <br>
Some important layer types are:
- Normalization: shifts and scales the training input to have a mean of 0 and a standard deviation of 1
- Dense: a densely connected NN layer
- Dropout: randomly sets inputs to 0 during training, which can reduce overtraining
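
As a rough illustration of what the normalization does to a single input variable, here is a minimal sketch in plain Python. The function `standardize` and the transverse momentum values are made up for this example. The Keras layer does the equivalent for every input variable using the statistics it learns in `adapt`.

```python
import statistics


def standardize(column):
    """Shift and scale a list of values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]


# Made-up transverse momenta of one lepton in four events
lep_pt = [25.0, 40.0, 55.0, 80.0]
scaled = standardize(lep_pt)

print([round(value, 3) for value in scaled])
print('mean:', round(statistics.fmean(scaled), 6), 'std:', round(statistics.pstdev(scaled), 6))
```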

The neurons of each layer are activated by the so-called activation function. <br>
A full list of provided activation functions can be found here: https://www.tensorflow.org/api_docs/python/tf/keras/activations <br>
Some examples are:
- Linear: $\mathrm{linear}(x) = x$
- ReLU: $\mathrm{relu}(x) = \max(0, x)$
- Exponential: $\mathrm{exponential}(x) = e^x$
- Sigmoid: $\mathrm{sigmoid}(x) = 1 / (1 + e^{-x})$
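
The four functions listed above can be written down directly in plain Python. This is only a sketch for single numbers to show their behavior. TensorFlow provides its own implementations that work on whole tensors.

```python
import math


def linear(x):
    return x


def relu(x):
    return max(0.0, x)


def exponential(x):
    return math.exp(x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Compare the activation functions for a negative, a zero, and a positive input
for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), relu(x), round(exponential(x), 3), round(sigmoid(x), 3))
```

Note that only the sigmoid maps every input to a value between 0 and 1.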

The activation $a_i^{(k)}$ of the i-th node in the k-th layer is computed from the activations $a_j^{(k-1)}$ of the previous layer: each of them is multiplied by a trainable weight $w_{j; i}^{(k)}$, the products are summed, and a bias $b_i^{(k)}$ is added. The activation function $f_{activ}$ is then applied to this sum, which gives the activation of the current node.<br>
$a_i^{(k)} = f_{activ}\left(\sum_j w_{j; i}^{(k)} \cdot a_j^{(k-1)} + b_i^{(k)}\right)$
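
The formula can be spelled out for a tiny layer with two inputs and three nodes. The function `dense_forward` and all numbers below are made up for illustration. A Keras `Dense` layer performs the same computation with matrix operations.

```python
def relu(x):
    return max(0.0, x)


def dense_forward(previous_activations, weights, biases, activation):
    """Compute the activations of one layer.

    weights[j][i] connects node j of the previous layer to node i of this layer.
    """
    outputs = []
    for i, bias in enumerate(biases):
        weighted_sum = sum(w_j[i] * a_j for w_j, a_j in zip(weights, previous_activations))
        outputs.append(activation(weighted_sum + bias))
    return outputs


a_prev = [1.0, 2.0]              # activations of the previous layer
w = [[0.5, -1.0, 0.0],           # weights from previous node 0
     [0.25, 0.5, -1.0]]          # weights from previous node 1
b = [0.0, 0.5, 1.0]              # one bias per node of the current layer

a = dense_forward(a_prev, w, b, relu)
print(a)  # [1.0, 0.5, 0.0]
```

The third node receives a negative sum of -1.0, so the relu function sets its activation to 0.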

Let's create the following model:
<div>
<img src='figures/simple_model.png' width='900'/>
</div>

- The input variables are shifted and scaled to have a mean of 0 and a standard deviation of 1. This normalization improves the convergence of the training process.
- Two dense layers with 60 neurons each are used. The neurons are activated with the relu function.
- To classify background and signal, the last layer should consist of only one neuron with an activation between 0 and 1. An activation between 0 and 1 is given by the sigmoid function.

<font color='blue'>
Task:

How many parameters do you expect for this model?
</font>

Create a normalization layer and adapt it to the training values:

In [None]:
# Normalization layer
normalization_layer = tf.keras.layers.Normalization(name='Input_normalization')
normalization_layer.adapt(values)

Create a list of model layers and build the model:

In [None]:
# Create model
model_layers = [
    normalization_layer,
    tf.keras.layers.Dense(60, activation='relu', name='Hidden_layer_1'),
    tf.keras.layers.Dense(60, activation='relu', name='Hidden_layer_2'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='Output_layer'),
]
model = tf.keras.models.Sequential(model_layers, name='simple_model')

Before we start the training of the model, let's check the shape and the number of trainable parameters.

In [None]:
# Display the model's architecture
model.summary()
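
The number of trainable parameters reported by the summary can be cross-checked by hand. The following sketch uses a hypothetical helper `count_dense_parameters`. It only counts the weights and biases of the dense layers and assumes four input variables. The statistics of the normalization layer are not trained and are not counted here.

```python
def count_dense_parameters(layer_sizes):
    """Sum the weights and biases of consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # n_in * n_out weights plus one bias per node
        total += n_in * n_out + n_out
    return total


# 4 input variables, two hidden layers with 60 nodes, 1 output node
print(count_dense_parameters([4, 60, 60, 1]))  # 4021
```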

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, show_layer_names=True, show_layer_activations=True)

## Train the Neural Network

For the training we have to choose an optimization criterion, the loss function. For supervised learning, the loss function returns a value for the disagreement between the model prediction and the true value. The loss value is high for bad agreement and low for good agreement. Thus, training the model means minimizing the loss function.
The separation of signal and background is a binary classification problem, and thus the binary cross-entropy is used to calculate the loss.
The binary cross-entropy is given by:<br>
$H = -\frac{1}{N} \sum_{i=1}^N \left( y_i^{true} \log(y_i^{predict}) + (1 - y_i^{true}) \log(1 - y_i^{predict}) \right)$

As you can see, the formula distinguishes signal events with $y_i^{true} = 1$ from background events with $y_i^{true} = 0$. In the following figure you can see the cross-entropy for a signal and a background event for different prediction values. The closer the prediction value is to the true classification, the smaller the cross-entropy is. This is exactly the behavior we need for our loss function, and thus training the neural network means minimizing the mean cross-entropy of all training events.
<div>
<img src='figures/binary_cross_entropy.png' width='700'/>
</div>
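
The behavior of the binary cross-entropy can be checked with a few numbers. The following is a minimal sketch in plain Python with made-up predictions. Clipping the predictions with a small `eps` is an assumption made here to avoid taking the logarithm of 0.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy of predictions y_pred for true labels y_true."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]                   # signal, background, signal, background
good_prediction = [0.9, 0.1, 0.8, 0.2]  # close to the true classification
bad_prediction = [0.4, 0.6, 0.3, 0.7]   # far from the true classification

print('good:', binary_cross_entropy(y_true, good_prediction))
print('bad: ', binary_cross_entropy(y_true, bad_prediction))
```

The predictions that are closer to the true classification give the smaller loss.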

If you are still wondering why we use logarithms and why it is called entropy, you are encouraged to read up on information theory, where the cross-entropy has its origin in the definition of entropy. The entropy of the true distribution is never larger than the cross-entropy between the true distribution and an estimate of it, and the two are equal only if the estimate agrees with the true distribution. Therefore, minimizing the cross-entropy moves the estimate closer to the true distribution.
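
This relation between entropy and cross-entropy can be tested numerically for a simple distribution with two outcomes. The distributions below are made up, and the sketch is only an illustration, not a proof.

```python
import math


def entropy(p):
    """Entropy of a discrete distribution p."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """Cross-entropy of the estimate q with respect to the true distribution p."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p_true = [0.7, 0.3]
for q_first in (0.5, 0.6, 0.7, 0.9):
    q = [q_first, 1.0 - q_first]
    print(q_first, round(cross_entropy(p_true, q), 4), round(entropy(p_true), 4))
```

The cross-entropy is smallest when the estimate equals the true distribution, and there it coincides with the entropy.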

In [None]:
# Loss function
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)

The loss function itself does not look complicated, but keep in mind that in our example it depends on several thousand parameters of the neural network. Finding the parameter combination that minimizes the loss is therefore a quite complex problem. The Adam optimizer we will use for the training is a stochastic gradient descent method.

When we converted the training data into a TensorFlow dataset, we chose a batch size with `train_data = train_data.batch(128)`. Batches are subsets of the given data. During training, the optimizer computes the gradient of the loss and updates the parameters once for each batch.
After the parameters have been updated for all batches one after the other, the process starts again from the beginning. A complete run over all training data is called an epoch. After several epochs, the parameters have been adjusted several times for all training data and hopefully a good combination of neural network parameters has been found.
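
The interplay of batches, parameter updates, and epochs can be sketched with a toy model. The example below is not the Adam optimizer but plain mini-batch gradient descent on a model with only one weight and one bias. The toy data, the learning rate, and the batch size are all made up for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_bce(w, b, xs, ys, eps=1e-12):
    """Mean binary cross-entropy of the model sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


rng = random.Random(21)
# Toy data: background (0) centered at -1, signal (1) centered at +1
xs = [rng.gauss(-1.0, 1.0) for _ in range(200)] + [rng.gauss(1.0, 1.0) for _ in range(200)]
ys = [0] * 200 + [1] * 200

w, b = 0.0, 0.0
learning_rate, batch_size, n_epochs = 0.1, 32, 5
loss_per_epoch = [mean_bce(w, b, xs, ys)]
n_updates = 0

for epoch in range(n_epochs):
    order = list(range(len(xs)))
    rng.shuffle(order)  # new random order of the events in every epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Gradient of the mean cross-entropy of this batch
        grad_w = sum((sigmoid(w * xs[i] + b) - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum(sigmoid(w * xs[i] + b) - ys[i] for i in batch) / len(batch)
        # One parameter update per batch
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        n_updates += 1
    loss_per_epoch.append(mean_bce(w, b, xs, ys))

print('parameter updates:', n_updates)
print('loss before training and after each epoch:', [round(loss, 4) for loss in loss_per_epoch])
```

With 400 events and a batch size of 32 there are 13 batches per epoch, the last one containing only 16 events, so 5 epochs correspond to 65 parameter updates.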

In [None]:
# Optimizer
adam_optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.0002, beta_1=0.9)

Configure the model for the training:

In [None]:
# Compilation
model.compile(optimizer=adam_optimizer, loss=loss_fn, metrics=['binary_accuracy'])

Let's train the model for 5 epochs and store the training history in a variable.

In [None]:
history = model.fit(train_data, epochs=5)

Let's visualize the training progress of the model.

In [None]:
# Plot the training history
fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(history.history['loss'])
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
_ = plt.show()

<font color='blue'>
Task:

Do you think the training was finished after 5 epochs?
</font>

## Save and Load the Neural Network

As you have seen, training a model takes some time. Retraining a model each time you need it is therefore definitely not the way to go. Instead, one should save the model after the training and load it again when it is needed.

Saving and loading a model is very straightforward:<br>
Use `model.save('path/to/location')` to save a model and `model = tf.keras.models.load_model('path/to/location')` to load the model.

Let's save the model you just created:

In [None]:
model.save('models/chapter2_model')

Now load the model and print its summary:

In [None]:
saved_model = tf.keras.models.load_model('models/chapter2_model')

saved_model.summary()