# Discover the Higgs with Deep Neural Networks
# Chapter 4: Validation Data and Early Stopping

In this chapter you will create a validation dataset to judge the training progress. Early stopping will be used to automatically stop the training at the right point.

In [None]:
# Necessary imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

# Import the tensorflow module to create a neural network
import tensorflow as tf
from tensorflow.data import Dataset

# Import some common functions created for this notebook
import common

# Random state
random_state = 21
np.random.seed(random_state)
tf.random.set_seed(random_state)

## Data Preparation

### Load the Data

In [None]:
# Define the input samples
sample_list_signal = ['ggH125_ZZ4lep', 'VBFH125_ZZ4lep', 'WH125_ZZ4lep', 'ZH125_ZZ4lep']
sample_list_background = ['llll', 'Zee', 'Zmumu', 'ttbar_lep']

In [None]:
sample_path = 'input'
# Read all the samples
no_selection_data_frames = {}
for sample in sample_list_signal + sample_list_background:
    no_selection_data_frames[sample] = pd.read_csv(os.path.join(sample_path, sample + '.csv'))

### Event Pre-Selection

Import the pre-selection functions saved during the first chapter. If the modules are not found, solve and execute the notebook of the first chapter first.

In [None]:
from functions.selection_lepton_charge import selection_lepton_charge
from functions.selection_lepton_type import selection_lepton_type

In [None]:
# Create a copy of the original data frame to investigate later
data_frames = no_selection_data_frames.copy()

# Apply the chosen selection criteria
for sample in sample_list_signal + sample_list_background:
    # Selection on lepton type
    type_selection = np.vectorize(selection_lepton_type)(
        data_frames[sample].lep1_pdgId,
        data_frames[sample].lep2_pdgId,
        data_frames[sample].lep3_pdgId,
        data_frames[sample].lep4_pdgId)
    data_frames[sample] = data_frames[sample][type_selection]

    # Selection on lepton charge
    charge_selection = np.vectorize(selection_lepton_charge)(
        data_frames[sample].lep1_charge,
        data_frames[sample].lep2_charge,
        data_frames[sample].lep3_charge,
        data_frames[sample].lep4_charge)
    data_frames[sample] = data_frames[sample][charge_selection]

### Get Training and Test Data

In [None]:
# Split data to keep 40% for testing
train_data_frames, test_data_frames = common.split_data_frames(data_frames, 0.6)

## Validation Data

How long do we have to train? The right amount of training is very important for the final performance of the network. If the training is too short, the model parameters are poorly adapted to the underlying concepts and the model performs badly. This is called undertraining. If the training is too long, the model starts to learn the training data by heart. This overtraining leads to very good performance on the training data but bad performance on unseen data.

Thus, to test the performance of the model, it has to be applied to unseen data. After each epoch the model is applied to validation data that is not used for training. If the classification of the validation data has improved in the current epoch, the model performance is still improving. If the performance on the validation data no longer improves, the training can be stopped.

<div>
<img src='figures/over_and_under_training.png' width='700'/>
</div>

In [None]:
# The training input variables
training_variables = ['lep1_pt', 'lep2_pt', 'lep3_pt', 'lep4_pt']

In [None]:
# Import function to split data into train and test data
from sklearn.model_selection import train_test_split

In [None]:
# Extract the values and classification
values, _, classification = common.get_dnn_input(train_data_frames, training_variables, sample_list_signal, sample_list_background)

Again, use 2/3 of the training data for the actual training and 1/3 to validate the model performance.

In [None]:
# Split into train and validation data
train_values, val_values, train_classification, val_classification = train_test_split(values, classification, test_size=1/3, random_state=random_state)
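As a quick cross-check of the resulting fractions, the sketch below multiplies out the two splits. It assumes that the `0.6` passed to `common.split_data_frames` is the fraction kept for training, as the comment in that cell suggests. The variable names are only illustrative, and the real splits are only approximately this size because events are assigned at random.

```python
from fractions import Fraction

# Assumption: 0.6 is the fraction kept for training in the first split
train_plus_val = Fraction(6, 10)
test_fraction = 1 - train_plus_val

# Second split: 1/3 of the remaining events is used for validation
val_fraction = train_plus_val * Fraction(1, 3)
train_fraction = train_plus_val * Fraction(2, 3)

print(f'train = {float(train_fraction):.2f}, '
      f'validation = {float(val_fraction):.2f}, '
      f'test = {float(test_fraction):.2f}')
```

Under this assumption, roughly as many events are used for the actual training as are held back for testing.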

## Create the Neural Network

<font color='blue'>
Task:

Let's follow the same strategy as before:
- Create tensorflow datasets for training and validation data with 128 events per batch
- Recreate and adapt the normalization layer
- Recreate the tensorflow model with 2 hidden layers and 60 nodes per layer
</font>

In [None]:
# Convert the data to tensorflow datasets
train_data = Dataset.from_tensor_slices((train_values, train_classification))
train_data = train_data.shuffle(len(train_data), seed=random_state)
train_data = train_data.batch(128)
val_data = Dataset.from_tensor_slices((val_values, val_classification))
val_data = val_data.shuffle(len(val_data), seed=random_state)
val_data = val_data.batch(128)
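To get a feeling for what batching with 128 events means, the following minimal sketch computes the batch sizes for a given number of events. The helper `batch_sizes` is hypothetical and not part of `common` or tensorflow. It mirrors the default behavior of `Dataset.batch`, which keeps a smaller final batch if the number of events is not a multiple of the batch size.

```python
def batch_sizes(n_events, batch_size=128):
    """Return the size of each batch when n_events are grouped into batches."""
    n_full, remainder = divmod(n_events, batch_size)
    sizes = [batch_size] * n_full
    # The last batch holds the leftover events, if there are any
    if remainder:
        sizes.append(remainder)
    return sizes

# Hypothetical example with 300 events
example_sizes = batch_sizes(300)
print(example_sizes)
```

The model parameters are updated once per batch, so the number of batches is also the number of update steps per epoch.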

In [None]:
# Normalization layer
normalization_layer = tf.keras.layers.Normalization()
normalization_layer.adapt(train_values)
# Create a simple NN
model_layers = [
    normalization_layer,
    tf.keras.layers.Dense(60, activation='relu'),
    tf.keras.layers.Dense(60, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
]
model = tf.keras.models.Sequential(model_layers)
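The effect of adapting the normalization layer can be illustrated with plain numpy. This is a sketch of the idea, not the Keras implementation: the layer learns the mean and variance of each input variable from the data passed to `adapt` and then shifts and scales its inputs accordingly. The toy values below are made up for illustration.

```python
import numpy as np

# Hypothetical toy input: 5 events with 2 variables on very different scales
toy_values = np.array([[10.0, 1000.0],
                       [20.0, 3000.0],
                       [30.0, 2000.0],
                       [40.0, 5000.0],
                       [50.0, 4000.0]])

# "Adapt": learn the mean and variance of each variable (column)
mean = toy_values.mean(axis=0)
variance = toy_values.var(axis=0)

# Apply: shift to zero mean and scale to unit variance
normalized = (toy_values - mean) / np.sqrt(variance)

print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```

Note that the layer above is adapted on `train_values` only, so the validation data is normalized with the statistics of the training data.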

## Train the Neural Network

Let's choose the same loss function and optimizer as before and compile the model.

In [None]:
# Loss function
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
# Optimizer
adam_optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.0002, beta_1=0.9)

In [None]:
# Compile model
model.compile(optimizer=adam_optimizer, loss=loss_fn, metrics=['binary_accuracy'])
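As a reminder of what the loss measures: for a true label $y$ and a predicted signal probability $p$, the binary cross-entropy of one event is $-[y \log p + (1 - y) \log(1 - p)]$, and the loss is the average over all events. The numpy sketch below implements this formula. The function name and the clipping value `epsilon` are assumptions made here to avoid $\log(0)$ and are not taken from the notebook.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the predictions to avoid taking the logarithm of zero
    y_pred = np.clip(np.asarray(y_pred, dtype=float), epsilon, 1 - epsilon)
    per_event = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_event))

# One signal and one background event, classified well and badly
loss_good = binary_crossentropy([1, 0], [0.9, 0.1])
loss_bad = binary_crossentropy([1, 0], [0.1, 0.9])
print(loss_good, loss_bad)
```

Confident but wrong predictions are penalized much more strongly than confident and correct ones are rewarded.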

So how do we stop the training once the performance no longer improves?

The answer is early stopping. With early stopping you choose a value to be monitored, in our case the loss on the validation data `val_loss`. Since the measured model performance can fluctuate from epoch to epoch, it is recommended to set a patience, i.e. a number of epochs without improvement after which the training is stopped. If we set `patience=5`, the training is stopped if the `val_loss` has not improved for 5 epochs. Since the model performance has potentially decreased during these 5 epochs, set `restore_best_weights=True` to restore the model state with the best performance. The requested number of epochs can be set to a very high number to ensure that the training is only stopped once no further improvement is observed.

In [None]:
# Early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
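The patience logic described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the Keras implementation. The function `early_stopping_epoch` and the loss values are hypothetical, and epochs are counted from zero.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            # No improvement: stop once the patience is used up
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The patience was never used up, so all epochs were trained
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value is reached in epoch 2
toy_val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.555, 0.58, 0.59, 0.60, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=5)
print(f'stopped after epoch {stop_epoch}, best epoch was {best_epoch}')
```

In this toy example the training stops before the last value in the list is ever reached. This is the price of a finite patience, while restoring the best weights corresponds to going back to the model state of `best_epoch`.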

In [None]:
# Train model with early stopping for the validation data performance
history = model.fit(train_data, validation_data=val_data, callbacks=[early_stopping], epochs=1000)

<font color='blue'>
Task:

Plot the training loss (<code>history.history['loss']</code>) and validation loss (<code>history.history['val_loss']</code>) of the training history. Describe what behavior you can observe for each performance trend.
</font>

In [None]:
# Plot the training history
fig, ax = plt.subplots(figsize=(7, 6))
# Training loss
ax.plot(history.history['loss'], label='training')
# Validation loss
ax.plot(history.history['val_loss'], label='validation')
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
ax.legend()
_ = plt.show()

## Apply and Evaluate the Neural Network on Training and Validation Data

The model already provides an evaluation function. When a tensorflow dataset is provided, it returns the loss and the accuracy on this dataset.

In [None]:
# Evaluate model on training data
model_train_evaluation = model.evaluate(train_data)

print(f'train loss = {round(model_train_evaluation[0], 5)}\ttrain accuracy = {round(model_train_evaluation[1], 5)}')

<font color='blue'>
Task:

Evaluate the model on validation data and compare the results to the evaluation on training data.
</font>

In [None]:
# Evaluate model on validation data
model_val_evaluation = model.evaluate(val_data)

print(f'val loss = {round(model_val_evaluation[0], 5)}\tval accuracy = {round(model_val_evaluation[1], 5)}')

Now let's apply the model to the training and validation data and plot the classification.

In [None]:
# Apply the model for training and validation values
train_prediction = model.predict(train_values)
val_prediction = model.predict(val_values)

In [None]:
# Plot the model output
common.plot_dnn_output(train_prediction, train_classification, val_prediction, val_classification)
_ = plt.show()

As you can see, the classification by the model on training and validation data is very consistent. This is great :)<br>
A significant difference between the training and validation classification would be a clear sign of overtraining.

## Prediction on Test Data

<font color='blue'>
Task:

Use <code>common.apply_dnn_model(...)</code> to apply the model for all samples in <code>test_data_frames</code> and plot the classification.
</font>

In [None]:
# Apply the model
data_frames_apply_dnn = common.apply_dnn_model(model, test_data_frames, training_variables)

In [None]:
model_prediction = {'variable': 'model_prediction',
                    'binning': np.linspace(0, 1, 50),
                    'xlabel': 'prediction'}
common.plot_hist(model_prediction, data_frames_apply_dnn, show_data=False)
plt.show()

<font color='blue'>
Task:

The $llll$ events are mostly classified as background and the Higgs events tend to the signal classification. However, the classification of the other backgrounds seems mostly random and even a bit shifted towards the signal classification. What could be the reason for this?
</font>

<font color='green'>
Answer:

The number of training events for Zee, Zmumu and ttbar_lep is much smaller than for the other samples. Thus, they are hardly considered in the training.
    
In the first chapter we observed the following numbers of events in the full dataset:
- ggH125_ZZ4lep: 161451
- VBFH125_ZZ4lep: 186870
- WH125_ZZ4lep: 9772
- ZH125_ZZ4lep: 11947
- $llll$: 523957
- Zee: 243
- Zmumu: 257
- ttbar_lep: 334
</font>
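The imbalance becomes more obvious when the event counts listed above are converted into fractions of the full dataset. The sketch below only uses the numbers from the list. Keep in mind that these are counts before the pre-selection and the data split, so the fractions in the actual training data may differ.

```python
# Number of events per sample in the full dataset (from the first chapter)
event_counts = {
    'ggH125_ZZ4lep': 161451,
    'VBFH125_ZZ4lep': 186870,
    'WH125_ZZ4lep': 9772,
    'ZH125_ZZ4lep': 11947,
    'llll': 523957,
    'Zee': 243,
    'Zmumu': 257,
    'ttbar_lep': 334,
}

total_events = sum(event_counts.values())
for sample, count in event_counts.items():
    print(f'{sample:>15}: {100 * count / total_events:7.3f} %')

# Combined fraction of the three small background samples
small_backgrounds = ['Zee', 'Zmumu', 'ttbar_lep']
small_fraction = sum(event_counts[sample] for sample in small_backgrounds) / total_events
print(f'Zee + Zmumu + ttbar_lep: {100 * small_fraction:.3f} %')
```

Together, the three small background samples make up less than 0.1% of all events, so misclassifying them changes the loss only marginally.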

## Save and Load a Model

Save the model for a later comparison.

In [None]:
model.save('models/chapter4_model')