# Analysis of the Higgs dataset, the UCI way

In this notebook I reproduce the neural network described in the article:

[Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014)](https://www.nature.com/articles/ncomms5308.pdf)

which studies how well an ML algorithm can discriminate signal from background (S/B) in the search for new particles (heavy Higgs bosons).

## Notebook structure

The notebook is organized as follows:

1. Download the data and prepare the training and test sets;
2. Define the classification model (in `tensorflow.keras`);
3. Train the model;
4. Evaluate the model on a test set;
5. Graphical niceties to display the results.


This notebook uses some general-purpose functions for downloading data and building GIFs, developed for the SpQR-Flow project.

In [None]:
# Clone the GitHub repository containing SpQR-Flow

! git clone https://github.com/MarcoRiggirello/SpQR-Flow.git

fatal: destination path 'SpQR-Flow' already exists and is not an empty directory.


In [None]:
# to import local modules

import sys
import os

py_file_location = "/content/SpQR-Flow/SpQR-Flow"
sys.path.append(os.path.abspath(py_file_location))

In [None]:
# Load some libraries
import time

import numpy as np
import pandas as pd
import seaborn as sns
import imageio
from PIL import Image

from download import download_file

## Downloading data

To download the dataset we use the function `download_file` from the `download.py` module: it checks whether the dataset already exists locally and, if not, downloads it from the internet.
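
The check-then-download behaviour described above can be sketched with the standard library alone. This is only an illustration: the name `download_if_missing` and its boolean return value are hypothetical, and the actual `download_file` in SpQR-Flow may be implemented differently.

```python
import os
import urllib.request


def download_if_missing(url, file_path):
    """Download `url` to `file_path` unless the file already exists.

    Hypothetical helper, not the SpQR-Flow implementation.
    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(file_path):
        return False
    # Create the destination directory if it does not exist yet
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    urllib.request.urlretrieve(url, file_path)
    return True
```

Returning a flag makes it easy to log whether the (large) file was fetched again or reused from a previous run.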

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
file_path = os.path.join('..', 'model_test', 'download', 'HIGGS.csv.gz')
download_file(url, file_path)

## Loading the dataset

In [None]:
t0 = time.time()

column_labels = ['label','lepton pT', 'lepton eta', 'lepton phi',
                 'missing energy magnitude', 'missing energy phi',
                 'jet 1 pt', 'jet 1 eta', 'jet 1 phi',
                 'jet 1 b-tag', 'jet 2 pt', 'jet 2 eta',
                 'jet 2 phi', 'jet 2 b-tag', 'jet 3 pt',
                 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag',
                 'jet 4 pt', 'jet 4 eta', 'jet 4 phi',
                 'jet 4 b-tag', 'm_jj', 'm_jjj','m_lv',
                 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
data = pd.read_csv(file_path,
                   header=None,  # HIGGS.csv ships without a header row
                   names=column_labels,
                   nrows=8192*512)
t1 = time.time()
print(f'time to read {len(data)} events = {t1 - t0:.0f} s')

## Slicing data: train, test

We slice the dataframe to obtain the training and test samples.
The labels appear to be already shuffled, so simple slicing is sufficient to obtain a representative sample.

The full dataset has 11 million entries, of which the cell above reads the first 8192 × 512 (about 4.2M). Following the article, the head of the loaded data is used for training (about 2M events) and the tail for testing (about 130k events).
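
As a quick sanity check of the numbers involved, the sketch below recomputes the sizes used in this notebook and verifies that a head/tail split of the loaded rows cannot overlap. The constants mirror the notebook cells; the names `N_ROWS` and `split_bounds` are hypothetical and introduced only for this illustration.

```python
BATCH_SIZE = 8192
N_ROWS = BATCH_SIZE * 512        # rows read by pd.read_csv (nrows)
DATASET_SIZE = BATCH_SIZE * 256  # training events, taken with .head()
TEST_SIZE = BATCH_SIZE * 16      # test events, taken with .tail()


def split_bounds(n_rows, n_train, n_test):
    """Return the half-open row ranges selected by head(n_train) and tail(n_test)."""
    if n_train + n_test > n_rows:
        raise ValueError("train and test slices would overlap")
    return (0, n_train), (n_rows - n_test, n_rows)


train_range, test_range = split_bounds(N_ROWS, DATASET_SIZE, TEST_SIZE)
print(f"train rows: {train_range}, test rows: {test_range}")
# train rows: (0, 2097152), test rows: (4063232, 4194304)
```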

In [None]:
BATCH_SIZE = 8192
DATASET_SIZE = BATCH_SIZE * 256
TEST_SIZE = BATCH_SIZE * 16
SEED = 42

data_train = data[column_labels[1:22]].head(DATASET_SIZE).astype('float32')
data_test = data[column_labels[1:22]].tail(TEST_SIZE).astype('float32')
y_data = data[column_labels[0]].head(DATASET_SIZE).astype('int32')
y_test = data[column_labels[0]].tail(TEST_SIZE).astype('int32')

# Batching the data is not needed with keras
#labels_batched = [y_data.sample(BATCH_SIZE, random_state=(SEED+i)) 
#                for i in range(int(DATASET_SIZE/BATCH_SIZE))]
#features_batched = [data_train.sample(BATCH_SIZE, random_state=(SEED+i)) 
#                for i in range(int(DATASET_SIZE/BATCH_SIZE))]

## Building the model

The model to train is a deep feed-forward neural network with 5 hidden layers of 300 nodes each. The learning rate is 0.05 and the weight decay coefficient is $1 \times 10^{-5}$.
Dropout is used as a regularization technique. Other techniques were not used in the article because of their computational cost, hence we won't use them either.
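
To get a feeling for the size of this architecture, the number of trainable parameters can be counted by hand. The sketch below assumes 21 input features (the number of columns selected in the slicing cell) and a single output node; the function name `count_dense_params` is hypothetical.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the width of every layer, input and output included.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# 21 inputs, 5 hidden layers of 300 nodes, 1 output node
sizes = [21] + [300] * 5 + [1]
print(count_dense_params(sizes))  # 368101
```

Dropout layers have no trainable parameters, so they do not change this count.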

In [None]:
import tensorflow as tf
from tensorflow.keras import Input, layers, Model 
from tensorflow.keras.optimizers import Adam

NODES = 300
LR = 5e-2

inputs = Input(shape=(21,))
h_layer = layers.Dense(NODES, activation='relu')(inputs)
h_layer = layers.Dense(NODES, activation='relu')(h_layer)
h_layer = layers.Dropout(0.2)(h_layer)
h_layer = layers.Dense(NODES, activation='relu')(h_layer)
h_layer = layers.Dense(NODES, activation='relu')(h_layer)
h_layer = layers.Dense(NODES, activation='relu')(h_layer)
outputs = layers.Dense(1, activation='sigmoid')(h_layer)

model = Model(inputs=inputs, outputs=outputs, name='uci')

# Compile the model: optimizer, loss and metrics
model.compile(optimizer=Adam(learning_rate=LR),
              loss='binary_crossentropy',
              metrics=['accuracy'])

## Training

In this section we train the model.

In [None]:
model.fit(data_train, y_data,
          batch_size=BATCH_SIZE,
          epochs=100,
          verbose=2)
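
With these settings, each epoch consists of a fixed number of gradient updates. A minimal sketch of the arithmetic, assuming Keras processes `ceil(DATASET_SIZE / BATCH_SIZE)` batches per epoch when given in-memory arrays (the name `EPOCHS` is introduced here only for illustration):

```python
import math

BATCH_SIZE = 8192
DATASET_SIZE = BATCH_SIZE * 256
EPOCHS = 100  # value passed to model.fit

steps_per_epoch = math.ceil(DATASET_SIZE / BATCH_SIZE)
total_updates = steps_per_epoch * EPOCHS
print(f"{steps_per_epoch} steps per epoch, {total_updates} updates in total")
# 256 steps per epoch, 25600 updates in total
```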