# Apply methods
In this notebook we show how to load the dataset saved in the previous notebook
and apply the built-in methods to it. Finally, we summarize their performance in
a table.

## Introduction to built-in methods
Currently `hml` has three types of methods: cuts, trees, and (neural) networks.

With `hml.methods.cuts.CutAndCount`, we apply a series of cuts to the input data
to keep as many signal events and as few background events as possible. Each
cut removes background events but inevitably also some signal events.
The goal is usually to find a set of cuts that maximizes the significance.
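
The trade-off can be illustrated with a minimal pure-Python sketch. It does not use `hml`. The toy events, the helper names, and the significance definition S / sqrt(S + B) are assumptions made for illustration; `CutAndCount` may define and optimize these quantities differently.

```python
import math

# Hypothetical toy events: (feature value, label), label 1 = signal, 0 = background.
events = [
    (91.0, 1), (89.5, 1), (92.3, 1), (60.0, 1),
    (45.0, 0), (88.0, 0), (120.0, 0), (30.0, 0), (150.0, 0), (75.0, 0),
]

def count_after_cut(events, low, high):
    """Count signal and background events with low < value < high."""
    n_sig = sum(1 for value, label in events if low < value < high and label == 1)
    n_bkg = sum(1 for value, label in events if low < value < high and label == 0)
    return n_sig, n_bkg

def significance(n_sig, n_bkg):
    """Assumed definition: S / sqrt(S + B); 0.0 for an empty selection."""
    total = n_sig + n_bkg
    return n_sig / math.sqrt(total) if total > 0 else 0.0

# No cut at all versus a window cut around the bulk of the signal.
no_cut = count_after_cut(events, float("-inf"), float("inf"))
window = count_after_cut(events, 80.0, 100.0)

print("no cut:", no_cut, significance(*no_cut))
print("window:", window, significance(*window))
```

In this toy sample the window cut discards one signal event, but it removes five of the six background events, so the significance increases.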

The boosted decision tree is one of the tree methods. It is a machine learning
method commonly used in high energy physics. The name "boosted" comes from the
idea of combining weak classifiers into a strong one.
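
To make the idea of combining weak classifiers concrete, here is a minimal sketch of AdaBoost with one-threshold decision stumps on a hypothetical one-dimensional data set. AdaBoost is used here only because it is short to write down; it is not necessarily the boosting algorithm behind `BoostedDecisionTree`, and all names in the block are illustrative.

```python
import math

# Hypothetical toy data: no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
thresholds = [x + 0.5 for x in range(9)]

def stump_predict(x, threshold, sign):
    """Weak classifier: predict `sign` below the threshold and `-sign` above it."""
    return sign if x < threshold else -sign

def best_stump(weights):
    """Return (weighted error, threshold, sign) of the stump with the lowest error."""
    best = None
    for threshold in thresholds:
        for sign in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, sign) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost(n_rounds):
    """Fit n_rounds stumps; assumes every stump has a weighted error in (0, 1)."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = best_stump(weights)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # Raise the weight of misclassified points and lower the others.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score > 0 else -1

def count_errors(ensemble):
    return sum(ensemble_predict(ensemble, x) != y for x, y in zip(xs, ys))

single_errors = count_errors(adaboost(1))
boosted_errors = count_errors(adaboost(3))
print("errors with 1 stump:", single_errors)
print("errors with 3 stumps:", boosted_errors)
```

On this data the best single stump still misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly.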

The network type currently contains only a simple fully connected neural network
named `ToyMLP`.

## Load the dataset
First we load the dataset from the previous notebook.

In [1]:
from hml.datasets import Dataset
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

To load the dataset, we use the class method `load` of the `Dataset` class. The
method takes the dataset directory as input and returns a `Dataset` object.

In [2]:
# Load the dataset saved in the previous notebook
dataset = Dataset.load("./data/z_vs_qcd")

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

# Convert the labels to categorical
y_train = to_categorical(y_train, dtype="int32")
y_test = to_categorical(y_test, dtype="int32")

# Show the shape of the training and testing sets
print("> Train and test shapes:")
print("x_train shape:", x_train.shape, "y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape, "y_test shape:", y_test.shape)
print(f"> target names: {dataset.target_names}")

> Train and test shapes:
x_train shape: (12917, 4) y_train shape: (12917, 2)
x_test shape: (3230, 4) y_test shape: (3230, 2)
> target names: ['pp2jj', 'pp2zz']

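The printed shapes can be checked by hand. The sketch below uses plain Python, without `scikit-learn` or `keras`. It assumes that `train_test_split` rounds the test fraction up and gives the remaining events to the training set, and that the one-hot encoding maps class index `i` to a row with a single 1 in column `i`. The `one_hot` helper is a hypothetical stand-in for `to_categorical`.

```python
import math

# Total number of events, taken from the shapes printed by the cell above.
n_samples = 12917 + 3230
test_size = 0.2

# Assumed rounding rule: the test set is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

def one_hot(labels, num_classes):
    """Hypothetical stand-in for to_categorical: one row per label."""
    return [[1 if i == label else 0 for i in range(num_classes)] for label in labels]

print("n_train:", n_train, "n_test:", n_test)
print(one_hot([0, 1, 1], num_classes=2))
```

With two classes each label becomes a row of length 2, which is why `y_train` and `y_test` have a second dimension of 2.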

## Apply methods
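We start with a boosted decision tree, which originally comes from the
`scikit-learn` package. The `compile` method takes a loss function, an
optimizer, and a list of metrics as input. Here we pass only the metrics and set
`n_estimators=10`; everything else keeps its default value, as in
`scikit-learn`. The cut-and-count method and `ToyMLP` follow the same `compile`
and `fit` pattern.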

In [3]:
from hml.methods import CutAndCount, BoostedDecisionTree, ToyMLP
from keras.losses import CategoricalCrossentropy
from keras.metrics import CategoricalAccuracy
from hml.metrics import MaxSignificance, RejectionAtEfficiency

In [4]:
method1 = BoostedDecisionTree(n_estimators=10)
method1.compile(
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ]
)

print("> Training model:")
history = method1.fit(x_train, y_train)

> Training model:
Iter 1/10 - loss: 1.2097 - acc: 0.8795 - max_sig: 209.1248 - r50: 793.0361
Iter 2/10 - loss: 1.0733 - acc: 0.9162 - max_sig: 270.3814 - r50: 185.3986
Iter 3/10 - loss: 0.9599 - acc: 0.9328 - max_sig: 327.6703 - r50: 669.1434
Iter 4/10 - loss: 0.8636 - acc: 0.9407 - max_sig: 379.0039 - r50: 324.4439
Iter 5/10 - loss: 0.7811 - acc: 0.9460 - max_sig: 424.5269 - r50: 540.7352
Iter 6/10 - loss: 0.7090 - acc: 0.9502 - max_sig: 463.4917 - r50: 396.5432
Iter 7/10 - loss: 0.6471 - acc: 0.9528 - max_sig: 500.4489 - r50: 393.4188
Iter 8/10 - loss: 0.5926 - acc: 0.9551 - max_sig: 634.4835 - r50: 588.6602
Iter 9/10 - loss: 0.5450 - acc: 0.9571 - max_sig: 789.3623 - r50: 462.1482
Iter 10/10 - loss: 0.5021 - acc: 0.9586 - max_sig: 799.2162 - r50: 580.2921


In [5]:
method2 = CutAndCount()
method2.compile(
    loss=CategoricalCrossentropy(),
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ],
)
print("> Training model:")
history = method2.fit(x_train, y_train)

> Training model:
Cut 1/4 - loss: 1.9366 - acc: 0.8798 - max_sig: 113.1778 - r50: 8.2616
Cut 2/4 - loss: 2.1924 - acc: 0.8719 - max_sig: 173.7675 - r50: 15.8622
Cut 3/4 - loss: 3.8445 - acc: 0.8351 - max_sig: 209.4424 - r50: 23.7669
Cut 4/4 - loss: 4.3686 - acc: 0.8086 - max_sig: 237.2822 - r50: 31.6540


In [6]:
method3 = ToyMLP(input_shape=(x_train.shape[1],))
method3.compile(
    loss=CategoricalCrossentropy(),
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ],
)

print("> Training model:")
history = method3.fit(x_train, y_train, epochs=10, batch_size=256, verbose=2)

> Training model:
Epoch 1/10
51/51 - 6s - loss: 0.9719 - acc: 0.8862 - max_sig: 186.6020 - r50: 31.5840 - 6s/epoch - 117ms/step
Epoch 2/10
51/51 - 1s - loss: 0.8845 - acc: 0.8881 - max_sig: 204.8537 - r50: 38.1710 - 1s/epoch - 23ms/step
Epoch 3/10
51/51 - 1s - loss: 0.7423 - acc: 0.8981 - max_sig: 209.3404 - r50: 44.6123 - 1s/epoch - 22ms/step
Epoch 4/10
51/51 - 1s - loss: 0.6906 - acc: 0.9013 - max_sig: 219.8567 - r50: 52.8738 - 1s/epoch - 24ms/step
Epoch 5/10
51/51 - 1s - loss: 0.6771 - acc: 0.9052 - max_sig: 225.4868 - r50: 54.4882 - 1s/epoch - 24ms/step
Epoch 6/10
51/51 - 1s - loss: 0.6907 - acc: 0.8941 - max_sig: 217.4594 - r50: 41.9881 - 1s/epoch - 24ms/step
Epoch 7/10
51/51 - 1s - loss: 0.5566 - acc: 0.9212 - max_sig: 222.2641 - r50: 76.7521 - 1s/epoch - 25ms/step
Epoch 8/10
51/51 - 1s - loss: 0.5627 - acc: 0.9083 - max_sig: 227.2444 - r50: 70.6728 - 1s/epoch - 22ms/step
Epoch 9/10
51/51 - 1s - loss: 0.5455 - acc: 0.9077 - max_sig: 235.5335 - r50: 43.2605 - 1s/epoch - 24ms/step
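
The `51/51` in the log is the number of batches per epoch. As a small check, the sketch below reproduces it from the training set size and `batch_size=256`. It also reproduces the `101/101` printed later during evaluation, assuming that `ToyMLP` forwards its arguments to Keras unchanged and that the evaluation therefore uses the Keras default batch size of 32. The helper name is illustrative.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Batches needed to cover n_samples once; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

# 12917 training events with batch_size=256, as passed to fit.
train_steps = steps_per_epoch(12917, 256)

# 3230 test events with an assumed default batch size of 32.
eval_steps = steps_per_epoch(3230, 32)

print("training steps per epoch:", train_steps)
print("evaluation steps:", eval_steps)
```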


## Compare the performance
Finally we compare the performance of the three methods. We use the `evaluate`
method to compute the loss and the metrics specified in the `compile` method.
It returns a dictionary of the loss and metrics.

Here we use the `tabulate` function from the `tabulate` package to summarize the
performance in a table.

In [7]:
from tabulate import tabulate

In [8]:
results1 = method1.evaluate(x_test, y_test)
results2 = method2.evaluate(x_test, y_test)
results3 = method3.evaluate(x_test, y_test, verbose=2)
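# Collect the per-method results into one dictionary of columns
results = {"name": [method1.name, method2.name, method3.name]}
for k in results1.keys():
    # The values are lists, so `+` concatenates them into one column
    results[k] = results1[k] + results2[k] + results3[k]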

print("> Results:")
print(tabulate(results, headers="keys", floatfmt=".4f"))

loss: 0.2611 - acc: 0.9586 - max_sig: 601.7032 - r50: 647.3771
loss: 4.4163 - acc: 0.8037 - max_sig: 243.9667 - r50: 33.6241
101/101 - 4s - loss: 0.5475 - acc: 0.9350 - max_sig: 111.5401 - r50: 444.2333 - 4s/epoch - 44ms/step
> Results:
name                     loss     acc    max_sig       r50
---------------------  ------  ------  ---------  --------
boosted_decision_tree  0.2611  0.9586   601.7032  647.3771
cut_and_count          4.4163  0.8037   243.9667   33.6241
toy_mlp                0.5475  0.9350   111.5401  444.2333
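
The merging step relies on each result dictionary mapping a metric name to a list, so that `+` concatenates the entries into one column per metric. The following sketch shows the same pattern without `hml` or `tabulate`. The dictionaries are filled by hand with the loss and accuracy values printed above, and `merge_results` is a hypothetical helper, not part of `hml`.

```python
# Hand-written stand-ins for the dictionaries returned by evaluate (assumed shape:
# metric name -> single-element list), using values from the printed results.
results1 = {"loss": [0.2611], "acc": [0.9586]}
results2 = {"loss": [4.4163], "acc": [0.8037]}
results3 = {"loss": [0.5475], "acc": [0.9350]}

def merge_results(names, per_method):
    """Build one column per metric by concatenating the per-method lists."""
    table = {"name": list(names)}
    for key in per_method[0]:
        table[key] = [value for result in per_method for value in result[key]]
    return table

table = merge_results(
    ["boosted_decision_tree", "cut_and_count", "toy_mlp"],
    [results1, results2, results3],
)

# Print one row per method, similar to what tabulate does with headers="keys".
print("  ".join(table.keys()))
for row in zip(*table.values()):
    print("  ".join(str(cell) for cell in row))
```

A dictionary of equally long columns is the form that `tabulate` accepts together with `headers="keys"`, which is how the notebook produces the table.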
