# Higgs Challenge Example using Neural Networks
In this part we continue to work with the data from the **[Higgs Boson ML Challenge][1]** on Kaggle and present solutions using neural networks (NN). 

It is based on [HiggsChallenge-NN.ipynb from the LMU course][2].



[1]: https://www.kaggle.com/c/Higgs-boson
[2]: https://github.com/fuenfundachtzig/LMU_DA_ML/blob/master/HiggsChallenge-NN.ipynb
[3]: NN_Activation.ipynb

## Neural Networks to discover the Higgs

Let's now apply a NN to the Higgs Challenge data. We will start with scikit-learn and then try **[Keras](https://keras.io/)**.

### Load the data and preprocessing

In [None]:
# the usual setup: 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path
import urllib.request  # the submodule must be imported explicitly for urlretrieve

In [None]:
path = Path("atlas-higgs-challenge-2014-v2.csv.gz")

def prepare_data(path):
    if path.exists():
        return
    url = "http://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz"
    path_prev_tutorial = Path("../05-validation-and-metrics") / path
    if path_prev_tutorial.exists():
        path.symlink_to(path_prev_tutorial)
    if not path.exists():
        urllib.request.urlretrieve(url, path)

prepare_data(path)

df = pd.read_csv(path)

In [None]:
n_sig_tot = df["Weight"][df.Label == "s"].sum()
n_bkg_tot = df["Weight"][df.Label == "b"].sum()
# comment this out if you want to run on the full dataset
df = df.sample(frac=0.3)

In [None]:
# map y values to integers
df['Label'] = df['Label'].map({'b':0, 's':1})

In [None]:
# let's create separate arrays
X = df.loc[:,'DER_mass_MMC':'PRI_jet_all_pt']
columns = list(X.columns)
X = X.to_numpy()
y = df['Label'].to_numpy()
weight = df['Weight'].to_numpy()

In [None]:
#now split into testing and training samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, weight_train, weight_test = train_test_split(
    X, y, weight, test_size=0.33, random_state=42)

# Neural networks (**M**ulti**L**ayer **P**erceptrons - MLP) in sklearn

Let's first look at the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(verbose=True, early_stopping=True, max_iter=40)

In [None]:
mlp.get_params()

In [None]:
%%time
mlp.fit(X_train, y_train)

In [None]:
mlp.score(X_test, y_test)

We will again use the [approximate median significance][1] (AMS) from the Kaggle competition to judge how good a solution is. Note that if you evaluate on only part of the data set (e.g. after splitting into training and testing samples), you have to reweight the events so that the subsample yield matches the total yield, which we do below.

[1]: AMS.ipynb
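
To make the reweighting concrete, here is a minimal sketch of the AMS as defined for the Kaggle challenge (with regularisation term $b_\mathrm{reg}=10$) together with the rescaling of subsample yields. The names `ams_sketch` and `selected_yields` and the toy numbers are made up for illustration; the `ams` helper in `mltools` may be implemented differently.

```python
import numpy as np

def ams_sketch(s, b, b_reg=10.0):
    """AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

def selected_yields(y_true, y_prob, weight, sig_total, bkg_total, pcut):
    """Weighted yields passing y_prob > pcut, rescaled to the full-sample totals."""
    is_sig = y_true == 1
    # scale factors so that the subsample weight sums match the full-sample yields
    sig_scale = sig_total / weight[is_sig].sum()
    bkg_scale = bkg_total / weight[~is_sig].sum()
    passed = y_prob > pcut
    s = sig_scale * weight[is_sig & passed].sum()
    b = bkg_scale * weight[~is_sig & passed].sum()
    return s, b

# toy subsample: 2 signal and 3 background events (illustrative numbers only)
y_toy = np.array([1, 1, 0, 0, 0])
p_toy = np.array([0.9, 0.4, 0.8, 0.2, 0.1])
w_toy = np.array([1.0, 1.0, 2.0, 2.0, 2.0])

s_toy, b_toy = selected_yields(y_toy, p_toy, w_toy,
                               sig_total=20.0, bkg_total=600.0, pcut=0.5)
ams_toy = ams_sketch(s_toy, b_toy)
print(f"s = {s_toy:.1f}, b = {b_toy:.1f}, AMS = {ams_toy:.3f}")
```

Scanning `pcut` over a grid of values and keeping the maximum AMS is, in spirit, what the scan further below does.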

In [None]:
# load function to compute approximate median significance (AMS)
from mltools import ams

In [None]:
# Determine probability scores
y_train_prob = mlp.predict_proba(X_train)[:, 1]
y_test_prob = mlp.predict_proba(X_test)[:, 1]

In [None]:
# add the probability to the original data frame
df['Prob']=mlp.predict_proba(X)[:, 1]

In [None]:
from mltools import plot_proba
plot_proba(df, mlp, X )

In [None]:
# calculate the total weights (yields)
#sigall  = weight[y==1].sum()
#backall = weight[y==0].sum()
# need to use numbers for full sample
sigall,backall = n_sig_tot, n_bkg_tot

In [None]:
from mltools import ams_scan
label='Train'
pcutv,amsv = ams_scan(y_train, y_train_prob, weight_train, sigall, backall)

# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
label='Test'
pcutv,amsv = ams_scan(y_test, y_test_prob, weight_test, sigall, backall)

# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();

How did we do? Worse than the BDT from 
[higgs_challenge.ipynb](../05-validation-and-metrics/higgs_challenge.ipynb)

![Comparison with submissions](figures/tr150908_davidRousseau_TMVAFuture_HiggsML.001.png)

## Rescaling
Neural networks are quite sensitive to feature scaling, so let's try to scale the features.

Before scaling, we set the missing values (encoded as -999 in the data set) to 0.
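
As a small illustration of what the scaler used below does with its default settings: `RobustScaler` subtracts the median and divides by the interquartile range (25th to 75th percentile), so a single outlier barely affects the scale of the bulk. The toy array is made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy feature with one large outlier (illustrative values only)
x_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler_toy = RobustScaler().fit(x_toy)
scaled_toy = scaler_toy.transform(x_toy)

# the same transformation written out by hand
median = np.median(x_toy, axis=0)
iqr = np.percentile(x_toy, 75, axis=0) - np.percentile(x_toy, 25, axis=0)
manual_toy = (x_toy - median) / iqr

print(scaled_toy.ravel())
print(np.allclose(scaled_toy, manual_toy))
```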

In [None]:
X_train[X_train == -999] = 0
X_test[X_test == -999] = 0


In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
plt.hist(X_train[:, columns.index("DER_mass_MMC")], bins=100);

In [None]:
plt.hist(X_train_scaled[:, columns.index("DER_mass_MMC")], bins=100);

Train a new network using the rescaled features:

In [None]:
mlp_scaled = MLPClassifier(verbose=True, early_stopping=True, max_iter=40)
mlp_scaled.fit(X_train_scaled, y_train)

In [None]:
mlp_scaled.score(X_test_scaled, y_test)

In [None]:
mlp_scaled.get_params()

In [None]:
# Determine probability scores
y_train_prob_scaled = mlp_scaled.predict_proba(X_train_scaled)[:, 1]
y_test_prob_scaled = mlp_scaled.predict_proba(X_test_scaled)[:, 1]

In [None]:
from mltools import ams_scan
label='Train'
pcutv,amsv = ams_scan(y_train, y_train_prob_scaled, weight_train, sigall, backall)
# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
label='Test'
pcutv,amsv = ams_scan(y_test, y_test_prob_scaled, weight_test, sigall, backall)
# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();

We improved quite a bit by using the same classifier but with rescaled data!

<div class="alert alert-block alert-success">
    <h2>Exercise 1</h2>
    Check documentation of the MLPClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) and vary the structure of the network (number of hidden layers, number of neurons)
</div>
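
As a starting point for the exercise, the network structure is controlled by the `hidden_layer_sizes` parameter: one entry per hidden layer, giving the number of neurons in that layer. The sketch below uses a small synthetic data set from `make_classification` instead of the Higgs data, and the layer sizes are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# small synthetic binary classification problem (not the Higgs data)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# two hidden layers with 50 and 20 neurons
mlp_toy = MLPClassifier(hidden_layer_sizes=(50, 20), max_iter=50, random_state=0)
mlp_toy.fit(X_toy, y_toy)

# one weight matrix per connection between consecutive layers
layer_shapes = [w.shape for w in mlp_toy.coefs_]
print(layer_shapes)
```

With so few iterations scikit-learn may warn that the optimisation has not converged; that is fine for inspecting the structure.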

# Neural networks with Keras
Scikit-learn has simple NNs, but if you want to build deep NNs or train on GPUs, you probably want to use something like [Keras](https://keras.io/getting_started/) instead.

Let's use Keras to create a simple NN similar to the one sklearn gave us.

In [None]:
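import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# seed both NumPy and TensorFlow for reproducibility
# (Keras draws the initial weights from TensorFlow's random generator)
np.random.seed(1337)
tf.random.set_seed(1337)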

In [None]:
from tensorflow.keras import regularizers

model = Sequential([
    Dense(units=100, activation="relu", input_shape=X_train.shape[1:], kernel_regularizer=regularizers.l2(0.0001)),
    Dense(units=1, activation="sigmoid")
])

* `Dense`: "Just your regular densely-connected NN layer."
  * implements the operation: output = activation(dot(input, kernel) + bias)
    * kernel is a weights matrix created by the layer
    * bias is a bias vector created by the layer (only applicable if `use_bias` is True)
  * `units`: dimensionality of the output array (note: we do not need to specify the size of the input array, except...)
  * `input_shape`: expected shape of the input arrays (...only needed for the first layer)
  * `activation`: element-wise activation function
  * `kernel_regularizer`: regularizer function applied to the kernel weights matrix (see [regularizers][2])
  
  
[1]: https://keras.io/constraints/
[2]: https://keras.io/api/layers/regularizers/
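
The operation of a `Dense` layer can be sketched in a few lines of NumPy. This is only an illustration of the formula above, not how Keras implements it; the 30 input features are an assumption about the number of columns selected earlier, and the random weights are placeholders.

```python
import numpy as np

def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)

def dense_forward(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias)"""
    return activation(inputs @ kernel + bias)

rng = np.random.default_rng(0)
n_features, n_units = 30, 100   # assumed input size, 100 hidden units as above

x_batch = rng.normal(size=(4, n_features))                  # 4 toy events
kernel = rng.normal(scale=0.1, size=(n_features, n_units))  # weights matrix
bias = np.zeros(n_units)                                    # bias vector

hidden = dense_forward(x_batch, kernel, bias, relu)
n_params = kernel.size + bias.size  # trainable parameters of this layer

print(hidden.shape, n_params)
```

The parameter count (weights plus biases) is the kind of number to compare against the output of `model.summary()`.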

In [None]:
model.summary()

In [None]:
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

* `optimizer`: name of optimizer or optimizer instance. See [optimizers][1].
  * _Adam_: an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments ([paper][2], a short [summary][4])
* `loss`: name of the objective function, or the objective function itself. See [losses][3].
  * _binary crossentropy_: 
    $$H_p(q) = -\frac{1}{N}\sum_{i=1}^N [{y_i} \log(\hat{y}_i)+(1-y_i) \log(1-\hat{y}_i)]$$
    * a measure of dissimilarity, used here to define the loss function that should be minimized: 
    
        "The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p."
        
        (The minimum number of bits to encode an independent event that occurs with probability $y_i$ is $-\log_2(y_i)$.)
    * here the true labels are $y_i=1$ for the positive class and $y_i=0$ for the negative class
    * the estimated probabilities are $\hat y_{i}$
    * the sum runs over all $N$ samples
* `metrics`: list of metrics to be evaluated by the model during training and testing (typically accuracy)

[1]: https://keras.io/optimizers/
[2]: https://arxiv.org/abs/1412.6980v8
[3]: https://keras.io/losses/
[4]: https://medium.com/@nishantnikhil/adam-optimizer-notes-ddac4fd7218
[5]: https://datascience.stackexchange.com/questions/9302/the-cross-entropy-error-function-in-neural-networks
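
To get a feeling for the numbers, the binary cross-entropy formula above can be evaluated directly with NumPy. This is a minimal sketch with made-up labels and predictions; the clipping constant `eps` is an assumption added here to avoid $\log(0)$.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all samples, using the natural logarithm."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # keep the logarithms finite
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true_toy = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.9, 0.1])   # mostly right and confident
unsure = np.array([0.5, 0.5, 0.5, 0.5])      # no information at all

loss_confident = binary_crossentropy(y_true_toy, confident)
loss_unsure = binary_crossentropy(y_true_toy, unsure)
print(f"confident: {loss_confident:.4f}, unsure: {loss_unsure:.4f}")
```

A classifier that always outputs 0.5 has a loss of $\ln 2 \approx 0.693$, so a useful training run should end well below that value.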

In [None]:
history = model.fit(X_train_scaled, y_train, epochs=10, batch_size=200, validation_split=0.1)

* `batch_size`: number of samples per gradient update
* `epochs`: number of epochs to train the model. An epoch is an iteration over the entire training dataset provided. 

[Further discussion...](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/)
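
The relation between batch size, epochs and the number of gradient updates is simple arithmetic. The sketch below assumes that the validation fraction is removed from the training data before batching; the sample size is a made-up number, not the size of our training set.

```python
import math

def updates_per_epoch(n_samples, batch_size, validation_split=0.0):
    """Number of gradient updates in one pass over the training data."""
    n_train = int(n_samples * (1.0 - validation_split))  # samples left for training
    return math.ceil(n_train / batch_size)               # last batch may be smaller

n_toy = 1000  # hypothetical number of samples passed to fit
steps = updates_per_epoch(n_toy, batch_size=200, validation_split=0.1)
total_updates = 10 * steps  # 10 epochs
print(steps, total_updates)
```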

In [None]:
history.history.keys()

In [None]:
# visualize training history returned by model.fit

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

The `.predict` method returns the network output, here an array of shape `(n_samples, 1)` with the predicted signal probabilities:

In [None]:
y_train_prob_keras = model.predict(X_train_scaled)[:, 0]

Alternatively, we can treat Keras models like functions. Note that this returns TensorFlow tensors, which you might want to convert to NumPy arrays. When the data fits into memory, this is often the fastest option.

In [None]:
# convert to numpy and select the single output column to get 1D arrays
y_train_prob_keras = model(X_train_scaled).numpy()[:, 0]
y_test_prob_keras = model(X_test_scaled).numpy()[:, 0]

In [None]:
from mltools import ams_scan
label='Train'
pcutv,amsv = ams_scan(y_train, y_train_prob_keras, weight_train, sigall, backall)
# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
label='Test'
pcutv,amsv = ams_scan(y_test, y_test_prob_keras, weight_test, sigall, backall)
# calculate size and pcut of ams maximum
pcutmax,amsmax = pcutv[np.argmax(amsv)] , amsv.max()
print(f"{label} Maximum AMS {amsmax:.3f} for pcut {pcutmax:.3f}")
plt.plot(pcutv,amsv,label=label)
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();


<div class="alert alert-block alert-success">
    <h2>Exercise 2</h2>
    We only built a NN with a single hidden layer in Keras. However, you can easily change the structure of the network. Try adding an extra hidden layer and changing the number of neurons.
</div>    
<div class="alert alert-block alert-success">
    <h3>Further variations:</h3>
    - Vary the activation.
    - Vary the regularization. May have to do this as the structure changes.
    - Try using derived variables only or primary variables only.

</div>