# Discover the Higgs with Deep Neural Networks
# Chapter 8: Your Own Model

Now it is your turn to create your own model to hunt for the Higgs boson. To begin, we will look at a baseline model. This baseline model already does a good job at the classification, and it is up to you to create a better one.

In [None]:
# Necessary imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import itertools
from numpy.random import seed
import os

# Import the tensorflow module to create a neural network
import tensorflow as tf
from tensorflow.data import Dataset

# Import function to split data into train and test data
from sklearn.model_selection import train_test_split

# Import the kFold module for cross-validation
from sklearn.model_selection import KFold

# Import some common functions created for this notebook
import common

# Random state: seed NumPy and TensorFlow so that results are reproducible
random_state = 21
seed(random_state)
tf.random.set_seed(random_state)

## Data Preparation

### Load the Data

In [None]:
# Define the input samples
sample_list_signal = ['ggH125_ZZ4lep', 'VBFH125_ZZ4lep', 'WH125_ZZ4lep', 'ZH125_ZZ4lep']
sample_list_background = ['llll', 'Zee', 'Zmumu', 'ttbar_lep']

In [None]:
sample_path = 'input'
# Read all the samples
no_selection_data_frames = {}
for sample in sample_list_signal + sample_list_background:
    no_selection_data_frames[sample] = pd.read_csv(os.path.join(sample_path, sample + '.csv'))

### Event Pre-Selection

Import the pre-selection functions saved during the first chapter. If the modules are not found, first solve and execute the notebook of the first chapter.

In [None]:
from functions.selection_lepton_charge import selection_lepton_charge
from functions.selection_lepton_type import selection_lepton_type

In [None]:
# Create a copy of the original data frame to investigate later
data_frames = no_selection_data_frames.copy()

# Apply the chosen selection criteria
for sample in sample_list_signal + sample_list_background:
    # Selection on lepton type
    type_selection = np.vectorize(selection_lepton_type)(
        data_frames[sample].lep1_pdgId,
        data_frames[sample].lep2_pdgId,
        data_frames[sample].lep3_pdgId,
        data_frames[sample].lep4_pdgId)
    data_frames[sample] = data_frames[sample][type_selection]

    # Selection on lepton charge
    charge_selection = np.vectorize(selection_lepton_charge)(
        data_frames[sample].lep1_charge,
        data_frames[sample].lep2_charge,
        data_frames[sample].lep3_charge,
        data_frames[sample].lep4_charge)
    data_frames[sample] = data_frames[sample][charge_selection]

### Get Training and Test Data

In [None]:
# Split data to keep 40% for testing
train_data_frames, test_data_frames = common.split_data_frames(data_frames, 0.6)

Import the reweighting function to train with event weights. If the module is not found, first solve and execute the notebook of chapter 5.

In [None]:
from functions.reweight_weights import reweight_weights

## Baseline Model

Let's load the baseline models. The baseline model was trained with cross-validation, using the same split as in chapter 6.

In the cross-validation, the baseline model reached the following validation losses, one per fold:<br>
`[0.25824, 0.27069, 0.23760]`

Thus the validation loss of the baseline model is $0.256 \pm 0.014$.
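
The quoted value can be reproduced from the three fold losses. The short sketch below assumes that the uncertainty is the population standard deviation, which is what `np.std` computes by default; the sample standard deviation (`ddof=1`) would come out somewhat larger.

```python
import numpy as np

# Validation losses of the three baseline cross-validation folds (from the text above)
baseline_val_losses = np.array([0.25824, 0.27069, 0.23760])

# np.std uses ddof=0 by default, i.e. the population standard deviation
baseline_mean = baseline_val_losses.mean()
baseline_std = baseline_val_losses.std()

print(f'{baseline_mean:.3f} +- {baseline_std:.3f}')  # prints 0.256 +- 0.014
```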

In [None]:
baseline_models = []
for idx in range(3):
    print(f'baseline_models/model_crossval{idx}')
    model = tf.keras.models.load_model(f'baseline_models/model_crossval{idx}')
    baseline_models.append(model)

## Create the Neural Network

<font color='blue'>
Task:

1. Use reweighted event weights for your training
2. Choose a setup for your model
3. Train your model with early stopping
4. Plot the training history and binary classification on training and validation data
5. Check your validation loss <br>
    If the validation loss is not lower than that of the baseline model by at least two standard deviations of the baseline validation loss, go back to step 2.
6. Validate your results with cross-validation and calculate the mean validation loss and its standard deviation <br>
    If the loss is not significantly better than for the baseline model, go back to step 2.
7. Save your models and training plots
8. Plot the training history of the cross-validation models

Document what you can observe for your own model.
</font>

Train your model on all <u><b>low level</b></u> variables.

In [None]:
# The training input variables
training_variables = ['lep1_pt', 'lep2_pt', 'lep3_pt', 'lep4_pt']
training_variables += ['lep1_e', 'lep2_e', 'lep3_e', 'lep4_e']
training_variables += ['lep1_charge', 'lep2_charge', 'lep3_charge', 'lep4_charge']
training_variables += ['lep1_pdgId', 'lep2_pdgId', 'lep3_pdgId', 'lep4_pdgId']
training_variables += ['lep1_phi', 'lep2_phi', 'lep3_phi', 'lep4_phi']
training_variables += ['lep1_eta', 'lep2_eta', 'lep3_eta', 'lep4_eta']

In [None]:
# Extract the values, weights, and classification of the data
values, weights, classification = common.get_dnn_input(train_data_frames, training_variables, sample_list_signal, sample_list_background)

In [None]:
# Split into train and validation data
train_values, val_values, train_classification, val_classification = 
train_weights, val_weights = 
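
One possible way to fill in the split is sketched below on synthetic arrays. The validation fraction of one third and the reuse of `random_state = 21` are assumptions; take the split you used in chapter 6. The point of the sketch is that calling `train_test_split` twice with the same `random_state` selects the same rows, so values, classification and weights stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the arrays returned by common.get_dnn_input
rng = np.random.default_rng(21)
values = rng.normal(size=(90, 4))
classification = rng.integers(0, 2, size=90)
weights = values[:, 0].copy()  # tied to the values so the alignment can be checked

random_state = 21
val_fraction = 1 / 3  # assumption, not taken from the notebook

# Splitting values and classification in one call keeps their rows aligned
train_values, val_values, train_classification, val_classification = train_test_split(
    values, classification, test_size=val_fraction, random_state=random_state)

# The same random_state and array length reproduce the same row selection
train_weights, val_weights = train_test_split(
    weights, test_size=val_fraction, random_state=random_state)
```

Passing all three arrays to a single `train_test_split` call would work as well and removes the need to repeat the seed.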

If you want, you can experiment with the batch size.

In [None]:
# Get reweighted weights


# Convert the data to tensorflow datasets
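
The signature and exact behaviour of `reweight_weights` from chapter 5 are not shown in this notebook. As an illustration of what a reweighting step can do, the sketch below uses a hypothetical stand-in, `balance_class_weights`, which rescales the event weights so that signal and background contribute the same total weight. It is an assumption for illustration, not the chapter 5 implementation.

```python
import numpy as np

def balance_class_weights(weights, classification):
    """Hypothetical stand-in: give signal and background the same total weight."""
    weights = np.asarray(weights, dtype=float)
    is_signal = np.asarray(classification) > 0.5
    balanced = np.empty_like(weights)
    target = len(weights) / 2  # each class receives half of the total weight
    balanced[is_signal] = weights[is_signal] * target / weights[is_signal].sum()
    balanced[~is_signal] = weights[~is_signal] * target / weights[~is_signal].sum()
    return balanced

# Toy example: lightly weighted signal events among heavier background events
weights = np.array([0.01, 0.02, 0.03, 1.0, 2.0, 3.0, 4.0, 10.0])
classification = np.array([1, 1, 1, 0, 0, 0, 0, 0])
balanced = balance_class_weights(weights, classification)
```

After the rescaling the mean weight is one, so the size of the loss stays comparable between trainings with different event counts.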


<font color='blue'>
Task:

Try different neural network shapes with different numbers of hidden layers and nodes per layer. The number of nodes can also differ from layer to layer.
    
Note down all the networks you have trained and their corresponding validation loss.
</font>

In [None]:
# Normalization layer

# Create a simple NN
model_layers = [
]
model = tf.keras.models.Sequential(model_layers)

## Train the Neural Network

You can change the learning rate of the optimizer to improve your training.

In [None]:
# Loss function
loss_fn = 
# Optimizer
adam_optimizer = 

In [None]:
# Compile model now with the weighted metric


In [None]:
# Early stopping
early_stopping = 

In [None]:
# Train model
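
The steps from the dataset conversion to the training can be sketched end to end on synthetic data. Everything specific here is an assumption chosen for illustration: the toy inputs, the batch size of 64, the hidden layers `[32, 16]`, the learning rate of `1e-3`, the patience of 5 and the limit of 20 epochs. None of these are recommended settings for the exercise.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 24 input variables, binary labels, unit event weights
rng = np.random.default_rng(21)
n_variables = 24
train_values = rng.normal(size=(512, n_variables)).astype('float32')
train_classification = (train_values[:, 0] > 0).astype('float32')
train_weights = np.ones(512, dtype='float32')
val_values = rng.normal(size=(128, n_variables)).astype('float32')
val_classification = (val_values[:, 0] > 0).astype('float32')
val_weights = np.ones(128, dtype='float32')

# Datasets yield (inputs, labels, sample weights) in batches
batch_size = 64
train_data = tf.data.Dataset.from_tensor_slices(
    (train_values, train_classification, train_weights)).shuffle(512, seed=21).batch(batch_size)
val_data = tf.data.Dataset.from_tensor_slices(
    (val_values, val_classification, val_weights)).batch(batch_size)

# Normalization layer adapted to the training inputs only
normalization_layer = tf.keras.layers.Normalization()
normalization_layer.adapt(train_values)

# Hidden layer sizes are placeholders to vary from trial to trial
hidden_units = [32, 16]
model_layers = [normalization_layer]
model_layers += [tf.keras.layers.Dense(units, activation='relu') for units in hidden_units]
model_layers += [tf.keras.layers.Dense(1, activation='sigmoid')]
model = tf.keras.models.Sequential(model_layers)

# Binary cross-entropy loss, Adam optimizer, accuracy weighted by the event weights
loss_fn = tf.keras.losses.BinaryCrossentropy()
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=adam_optimizer, loss=loss_fn, weighted_metrics=['binary_accuracy'])

# Stop when the validation loss stops improving and keep the best weights
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(train_data, validation_data=val_data,
                    callbacks=[early_stopping], epochs=20, verbose=0)

# With this compile setup, evaluate returns [loss, weighted binary accuracy]
val_loss, val_accuracy = model.evaluate(val_data, verbose=0)
```

Changing `hidden_units` is enough to try a different network shape, which makes it easy to keep a list of the shapes you trained together with their validation losses.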


In [None]:
# Plot the training history
fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(history.history['loss'], label='training')
ax.plot(history.history['val_loss'], label='validation')
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
ax.legend()
_ = plt.show()

## Apply and Evaluate the Neural Network on Training and Validation Data

In [None]:
# Apply the model for training and validation values
train_prediction = 
val_prediction = 
# Plot the model output
common.plot_dnn_output(train_prediction, train_classification, val_prediction, val_classification)
_ = plt.show()

In [None]:
# Evaluate the model on training and validation data


<font color='blue'>
Is the validation loss lower than that of the baseline model by at least two standard deviations?
If so, save your model and continue with the cross-validation to prove that your model setup is better.
</font>
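
The criterion can be written as a small check. The sketch below uses the baseline numbers quoted earlier in this chapter; the helper name `beats_baseline` is hypothetical.

```python
# Baseline cross-validation result quoted earlier in this chapter
baseline_mean = 0.256
baseline_std = 0.014

def beats_baseline(val_loss, n_sigma=2.0):
    """Return True if val_loss lies more than n_sigma baseline standard deviations below the baseline mean."""
    return val_loss < baseline_mean - n_sigma * baseline_std

# Largest validation loss that still counts as an improvement (about 0.228)
threshold = baseline_mean - 2 * baseline_std
```

For example, `beats_baseline(0.22)` returns `True`, while `beats_baseline(0.24)` returns `False`.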

In [None]:
# Save your model
model.save('models/chapter8_own_model')

## Cross-Validation

In [None]:
# Define the K-fold Cross Validator
kfold = 
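
A minimal sketch of the cross validator is shown below. Three folds match the three validation losses quoted for the baseline; shuffling with a fixed `random_state` is an assumption, so use the setup from chapter 6 if it differs.

```python
import numpy as np
from sklearn.model_selection import KFold

# Three folds, shuffled with a fixed seed (the shuffling is an assumption)
kfold = KFold(n_splits=3, shuffle=True, random_state=21)

values = np.arange(12).reshape(6, 2)  # six toy events with two variables each
fold_sizes = []
for train_indices, val_indices in kfold.split(values):
    # Every event appears in exactly one of the two index sets
    assert set(train_indices).isdisjoint(val_indices)
    fold_sizes.append((len(train_indices), len(val_indices)))

print(fold_sizes)  # prints [(4, 2), (4, 2), (4, 2)]
```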

In [None]:
# Store the models and their training history
kfold_history = []
kfold_model = []
# Store the evaluation on training and validation data
kfold_train_eval_loss = []
kfold_train_eval_acc = []
kfold_val_eval_loss = []
kfold_val_eval_acc = []
split_idx = 1
for train_indices, val_indices in kfold.split(values):
    print(f'Use fold {split_idx}')
    split_idx += 1
    # Get train and validation data 
    train_values = values[train_indices]
    train_classification = classification[train_indices]
    train_weights = weights[train_indices]
    val_values = values[val_indices]
    val_classification = classification[val_indices]
    val_weights = weights[val_indices]
    # Get reweighted weights
    train_weights_reweighted = 
    val_weights_reweighted = 
    # Get train and validation datasets


    # Normalization layer

    # Create a simple NN
    model_layers = [
    ]
    model = tf.keras.models.Sequential(model_layers)
    # Compile model


    # Train model
    history = 

    # Append to list
    kfold_history.append(history)
    kfold_model.append(model)

    # Evaluate model on training and validation data
    model_train_evaluation = 
    model_val_evaluation = 
    kfold_train_eval_loss.append(model_train_evaluation[0])
    kfold_train_eval_acc.append(model_train_evaluation[1])
    kfold_val_eval_loss.append(model_val_evaluation[0])
    kfold_val_eval_acc.append(model_val_evaluation[1])

In [None]:
# Plot the training history
fig, ax = plt.subplots(figsize=(7, 6))

ax.set_xlabel('epoch')
ax.set_ylabel('loss')
ax.legend()
_ = plt.show()
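
One way to draw the histories of all folds into a single plot is sketched below. The loss curves are made-up numbers standing in for `history.history` of each entry in `kfold_history`.

```python
import matplotlib.pyplot as plt

# Toy loss curves standing in for history.history of each fold (values are made up)
fold_histories = [
    {'loss': [0.60, 0.45, 0.38], 'val_loss': [0.62, 0.48, 0.42]},
    {'loss': [0.58, 0.44, 0.37], 'val_loss': [0.61, 0.47, 0.41]},
    {'loss': [0.61, 0.46, 0.39], 'val_loss': [0.63, 0.49, 0.43]},
]

fig, ax = plt.subplots(figsize=(7, 6))
for fold_idx, fold_history in enumerate(fold_histories):
    # One colour per fold: solid line for training, dashed line for validation
    color = f'C{fold_idx}'
    ax.plot(fold_history['loss'], color=color,
            label=f'training fold {fold_idx + 1}')
    ax.plot(fold_history['val_loss'], color=color, linestyle='--',
            label=f'validation fold {fold_idx + 1}')
ax.set_xlabel('epoch')
ax.set_ylabel('loss')
ax.legend()
```

With real trainings, loop over `kfold_history` and read the curves from `history.history['loss']` and `history.history['val_loss']`.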

In [None]:
val_loss_mean = 
val_loss_std = 
print(f'The val loss of the model is {round(val_loss_mean, 3)} +- {round(val_loss_std, 3)}')

<font color='blue'>
Now perform a t-test between your validation losses and those of the baseline model. If the p-value is below 5% and your mean validation loss is the lower one, the improvement over the baseline is statistically significant and you can continue with the Higgs search.
</font>
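
The sketch below runs the test on made-up losses for a hypothetical improved model. The default `ttest_ind` is two-sided and only asks whether the means differ, so the direction has to be checked separately. Alternatively, the `alternative='less'` option, available in recent SciPy versions, tests directly whether the first sample has the lower mean.

```python
from scipy import stats

baseline_val_losses = [0.25824, 0.27069, 0.23760]
# Made-up losses of a hypothetical improved model, for illustration only
own_val_losses = [0.215, 0.221, 0.209]

# Two-sided test: do the two mean validation losses differ at all?
t_stat, p_two_sided = stats.ttest_ind(own_val_losses, baseline_val_losses)

# One-sided test: is the mean loss of the own model lower than the baseline mean?
_, p_one_sided = stats.ttest_ind(own_val_losses, baseline_val_losses, alternative='less')

print(f't = {t_stat:.2f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}')
```

A negative `t_stat` means that the first sample, here the own model, has the lower mean loss.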

In [None]:
from scipy import stats

In [None]:
# The validation losses of the baseline model
baseline_val_values = [0.25824, 0.27069, 0.23760]

# Perform a two-sided t-test between the baseline and your cross-validation losses
t_stat, p_value = stats.ttest_ind(baseline_val_values, kfold_val_eval_loss)

print(f'The p-value for the hypothesis that both models perform equally well is {p_value}')

In [None]:
# Save all cross-validation models
for idx, model in enumerate(kfold_model):
    # Save the model
    model.save(f'models/chapter8_own_model_crossval{idx}')

## Higgs Search

<font color='blue'>
Task:

Load the model you created before the cross-validation. Apply the model to the test data to get a prediction for unseen data. Compare this prediction with the ones from the previous chapters.
    
Calculate the significances you obtain with this model for different cut values and compare them to the results in chapter 7.
</font>

In [None]:
own_model = tf.keras.models.load_model('models/chapter8_own_model')

In [None]:
# Apply the model on test data
data_frames_apply_dnn = 

In [None]:
model_prediction = {'variable': 'model_prediction',
                    'binning': np.linspace(0., 1, 20),
                    'xlabel': 'prediction'}
common.plot_hist(model_prediction, data_frames_apply_dnn, show_data=False)
plt.show()

In [None]:
# Extract the values, weights, and classification of the test dataset
test_values, test_weights, test_classification = common.get_dnn_input(test_data_frames, training_variables, sample_list_signal, sample_list_background)

In [None]:
# Split the data in signal and background
test_signal_values = 
test_signal_weights = 
test_bkg_values = 
test_bkg_weights = 

In [None]:
from functions.get_significances import get_significances

In [None]:
# Load the baseline model for comparison
model_baseline = tf.keras.models.load_model('baseline_models/model_crossval0')

In [None]:
# Calculate the significances
model_baseline_cut_values, model_baseline_significances = 
own_model_cut_values, own_model_significances = 
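
The signature of `get_significances` and the significance definition from chapter 7 are not shown in this notebook. The sketch below therefore uses a hypothetical stand-in, `scan_significances`, with the simple approximation $s / \sqrt{b}$, where $s$ and $b$ are the summed weights of signal and background events above the cut. It also shows how boolean masks split the events into signal and background. All numbers are made up.

```python
import numpy as np

def scan_significances(signal_predictions, signal_weights,
                       bkg_predictions, bkg_weights, cut_values):
    """Hypothetical stand-in: s / sqrt(b) for events with a prediction above each cut."""
    significances = []
    for cut in cut_values:
        s = signal_weights[signal_predictions > cut].sum()
        b = bkg_weights[bkg_predictions > cut].sum()
        # Without background above the cut the ratio is undefined
        significances.append(s / np.sqrt(b) if b > 0 else np.nan)
    return np.array(significances)

# Toy predictions, labels and event weights
test_prediction = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
test_classification = np.array([1, 1, 1, 0, 0, 0])
test_weights = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])

# Boolean masks split the events into signal and background
is_signal = test_classification > 0.5
cut_values = np.array([0.0, 0.5])
significances = scan_significances(
    test_prediction[is_signal], test_weights[is_signal],
    test_prediction[~is_signal], test_weights[~is_signal], cut_values)
```

In this toy case the cut at 0.5 removes most of the background and raises the significance from about 0.87 to 1.0, which is the behaviour to look for when comparing the baseline and your own model.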

In [None]:
# Plot the significances of baseline and own model
