# Apply methods
In this notebook we show how to load the dataset saved in the previous notebook
and apply the built-in methods to it. Finally, we summarize their performance in
a table.

## Introduction to built-in methods
Currently `hml` has three types of methods: cuts, trees, and (neural) networks.

With `hml.methods.cuts.CutAndCount`, we apply a series of cuts to the input data
to select as many signal events and as few background events as possible. Each
cut removes background events but inevitably removes some signal events as well.
The goal is usually to find a set of cuts that maximizes the significance.
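
The trade-off described above can be sketched with NumPy alone. This is a minimal illustration and not the `CutAndCount` implementation: the toy observable, the cut window, and the `s / sqrt(b)` significance estimate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one observable per event (e.g. a mass in GeV).
signal = rng.normal(loc=91.0, scale=10.0, size=1_000)
background = rng.exponential(scale=60.0, size=10_000)


def significance(n_sig, n_bkg):
    """Simple s / sqrt(b) estimate (assumed here, not taken from hml)."""
    return n_sig / np.sqrt(n_bkg) if n_bkg > 0 else 0.0


# Before any cut, every event is kept.
z_before = significance(len(signal), len(background))

# Apply the window cut low < x < high to both samples and count survivors.
low, high = 70.0, 110.0
n_sig = int(np.sum((signal > low) & (signal < high)))
n_bkg = int(np.sum((background > low) & (background < high)))
z_after = significance(n_sig, n_bkg)

print(f"before cut: s={len(signal)}, b={len(background)}, Z={z_before:.2f}")
print(f"after cut:  s={n_sig}, b={n_bkg}, Z={z_after:.2f}")
```

The cut removes events from both samples, but because it removes a much larger fraction of the background, the significance goes up.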

The boosted decision tree is one of the tree methods. It is a machine learning
method commonly used in high energy physics. The name "boosted" comes from the
idea of combining weak classifiers into a strong one.
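
To make the idea of boosting concrete, here is a minimal sketch of AdaBoost with decision stumps (single cuts) written in NumPy. AdaBoost is one classic boosting scheme chosen for illustration; it is not necessarily the algorithm behind `BoostedDecisionTree`, and all function names below are hypothetical.

```python
import numpy as np


def stump_predict(x, feature, threshold, sign):
    """Weak classifier: a single cut on one feature, returning -1 or +1."""
    return np.where(x[:, feature] > threshold, sign, -sign)


def fit_stump(x, y, weights):
    """Scan all features and thresholds for the lowest weighted error."""
    best = None
    for feature in range(x.shape[1]):
        for threshold in np.unique(x[:, feature]):
            for sign in (1, -1):
                pred = stump_predict(x, feature, threshold, sign)
                error = weights[pred != y].sum()
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best


def fit_adaboost(x, y, n_rounds):
    """Train `n_rounds` stumps, reweighting misclassified events each round."""
    weights = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, sign = fit_stump(x, y, weights)
        error = np.clip(error, 1e-10, 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - error) / error)  # voting weight of this stump
        pred = stump_predict(x, feature, threshold, sign)
        # Misclassified events (y * pred == -1) get a larger weight next round.
        weights = weights * np.exp(-alpha * y * pred)
        weights /= weights.sum()
        ensemble.append((alpha, feature, threshold, sign))
    return ensemble


def predict(x, ensemble):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return np.where(score > 0, 1, -1)


# Toy data with a diagonal class boundary, which no single cut can describe.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)

ensemble = fit_adaboost(x, y, n_rounds=25)
acc_single = np.mean(predict(x, ensemble[:1]) == y)
acc_boosted = np.mean(predict(x, ensemble) == y)
print(f"one stump: {acc_single:.3f}, 25 stumps: {acc_boosted:.3f}")
```

Each stump alone is a weak classifier, but the weighted vote of many stumps approximates the diagonal boundary with a staircase of cuts.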

The network category currently contains only a simple fully connected neural
network named `ToyMLP`.

## Load the dataset
First we load the dataset from the previous notebook.

In [3]:
from hml.datasets import Dataset
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

To load the dataset, we use the class method `load` of the `Dataset` class. The
method takes the dataset directory as input and returns a `Dataset` object.

In [9]:
# Load the dataset saved in the previous notebook
dataset = Dataset.load("./data/z_vs_qcd")

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

# Convert the labels to categorical
y_train = to_categorical(y_train, dtype="int32")
y_test = to_categorical(y_test, dtype="int32")

# Show the shape of the training and testing sets
print("> Train and test shapes:")
print("x_train shape:", x_train.shape, "y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape, "y_test shape:", y_test.shape)
print(f"> target names: {dataset.target_names}")

> Train and test shapes:
x_train shape: (12943, 4) y_train shape: (12943, 2)
x_test shape: (3236, 4) y_test shape: (3236, 2)
> target names: ['pp2jj', 'pp2zz']
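
The label arrays have two columns because `to_categorical` one-hot encodes the integer class labels. The following NumPy sketch reproduces that encoding on a few hypothetical labels, without depending on Keras:

```python
import numpy as np

# Hypothetical integer class labels for four events.
labels = np.array([0, 1, 1, 0])
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="int32")[labels]
print(one_hot)
print(one_hot.shape)

# The integer labels can be recovered with argmax along the class axis.
recovered = one_hot.argmax(axis=1)
print(recovered)
```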


## Apply methods
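We start with a boosted decision tree, which originally comes from the
`scikit-learn` package. The `compile` method takes a loss function, an
optimizer, and a list of metrics as input; here we pass only the metrics. Apart
from `n_estimators=10`, we keep the default parameters, as in `scikit-learn`.
The cut-and-count method and the toy MLP are then set up in the same way.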

In [10]:
from hml.methods import CutAndCount, BoostedDecisionTree, ToyMLP
from keras.losses import CategoricalCrossentropy
from keras.metrics import CategoricalAccuracy
from hml.metrics import MaxSignificance, RejectionAtEfficiency

In [11]:
method1 = BoostedDecisionTree(n_estimators=10)
method1.compile(
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ]
)

print("> Training model:")
history = method1.fit(x_train, y_train)

> Training model:
Iter 1/10 - loss: 1.2112 - acc: 0.8960 - max_sig: 188.1243 - r50: 187.9428
Iter 2/10 - loss: 1.0756 - acc: 0.9256 - max_sig: 430.6300 - r50: 158.7088
Iter 3/10 - loss: 0.9628 - acc: 0.9376 - max_sig: 440.3328 - r50: 824.0009
Iter 4/10 - loss: 0.8664 - acc: 0.9435 - max_sig: 443.3213 - r50: 391.3265
Iter 5/10 - loss: 0.7844 - acc: 0.9492 - max_sig: 442.3824 - r50: 595.1335
Iter 6/10 - loss: 0.7132 - acc: 0.9531 - max_sig: 640.1672 - r50: 556.4827
Iter 7/10 - loss: 0.6516 - acc: 0.9554 - max_sig: 642.4824 - r50: 510.1214
Iter 8/10 - loss: 0.5973 - acc: 0.9577 - max_sig: 651.5129 - r50: 714.1466
Iter 9/10 - loss: 0.5500 - acc: 0.9594 - max_sig: 640.4816 - r50: 573.8807
Iter 10/10 - loss: 0.5070 - acc: 0.9610 - max_sig: 646.3892 - r50: 811.5255
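
The `r50` column comes from `RejectionAtEfficiency`. Its exact definition is not given in this notebook; a common convention, assumed in the sketch below, is the background rejection (one over the background efficiency) at the score threshold that keeps 50% of the signal. The function name and the toy scores are hypothetical.

```python
import numpy as np


def rejection_at_efficiency(y_true, scores, efficiency=0.5):
    """Background rejection at a fixed signal efficiency.

    Assumed definition for illustration; hml's metric may differ in detail.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    sig_scores = scores[y_true == 1]
    bkg_scores = scores[y_true == 0]
    # Threshold that keeps the requested fraction of signal events.
    threshold = np.quantile(sig_scores, 1.0 - efficiency)
    bkg_eff = np.mean(bkg_scores >= threshold)
    return np.inf if bkg_eff == 0 else 1.0 / bkg_eff


# Toy example: 4 signal events followed by 8 background events.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.75, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]

r50 = rejection_at_efficiency(y_true, scores, efficiency=0.5)
print(r50)  # 1 of 8 background events passes the threshold, so r50 = 8.0
```

Under this definition, a larger value means fewer background events survive at the same signal efficiency.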


In [12]:
method2 = CutAndCount()
method2.compile(
    loss=CategoricalCrossentropy(),
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ],
)
print("> Training model:")
history = method2.fit(x_train, y_train)

> Training model:
Cut 1/4 - loss: 1.9140 - acc: 0.8812 - max_sig: 114.3669 - r50: 8.5635
Cut 2/4 - loss: 2.1843 - acc: 0.8729 - max_sig: 175.8987 - r50: 16.6093
Cut 3/4 - loss: 3.7870 - acc: 0.8369 - max_sig: 212.5988 - r50: 24.9139
Cut 4/4 - loss: 4.3486 - acc: 0.8102 - max_sig: 240.9536 - r50: 33.2185


In [13]:
method3 = ToyMLP(input_shape=(x_train.shape[1],))
method3.compile(
    loss=CategoricalCrossentropy(),
    metrics=[
        CategoricalAccuracy(name="acc"),
        MaxSignificance(name="max_sig"),
        RejectionAtEfficiency(name="r50"),
    ],
)

print("> Training model:")
history = method3.fit(x_train, y_train, epochs=10, batch_size=256, verbose=2)

> Training model:
Epoch 1/10
51/51 - 6s - loss: 1.2039 - acc: 0.8699 - max_sig: 151.3282 - r50: 15.6967 - 6s/epoch - 117ms/step
Epoch 2/10
51/51 - 1s - loss: 0.8593 - acc: 0.8857 - max_sig: 221.7106 - r50: 32.6118 - 1s/epoch - 23ms/step
Epoch 3/10
51/51 - 1s - loss: 0.8083 - acc: 0.8883 - max_sig: 216.4992 - r50: 33.3737 - 1s/epoch - 23ms/step
Epoch 4/10
51/51 - 1s - loss: 0.8024 - acc: 0.8968 - max_sig: 216.4713 - r50: 37.9892 - 1s/epoch - 24ms/step
Epoch 5/10
51/51 - 1s - loss: 0.7807 - acc: 0.8884 - max_sig: 214.3831 - r50: 44.6373 - 1s/epoch - 23ms/step
Epoch 6/10
51/51 - 1s - loss: 0.6410 - acc: 0.9088 - max_sig: 218.3920 - r50: 51.3810 - 1s/epoch - 24ms/step
Epoch 7/10
51/51 - 1s - loss: 0.6488 - acc: 0.9068 - max_sig: 222.0505 - r50: 57.5965 - 1s/epoch - 23ms/step
Epoch 8/10
51/51 - 1s - loss: 0.6380 - acc: 0.9013 - max_sig: 222.8664 - r50: 51.0140 - 1s/epoch - 22ms/step
Epoch 9/10
51/51 - 1s - loss: 0.6653 - acc: 0.9111 - max_sig: 219.5327 - r50: 61.5685 - 1s/epoch - 22ms/step


## Compare the performance
Finally, we compare the performance of the three methods on the test set. We
use the `evaluate` method to compute the loss and the metrics specified in the
`compile` method. It returns a dictionary of the loss and metrics.

Here we use the `tabulate` function from the `tabulate` package to summarize the
performance in a table.

In [14]:
from tabulate import tabulate

In [17]:
results1 = method1.evaluate(x_test, y_test)
results2 = method2.evaluate(x_test, y_test)
results3 = method3.evaluate(x_test, y_test, verbose=2)
results = {}

results["name"] = [method1.name, method2.name, method3.name]
for k in results1.keys():
    # Each value is a list, so `+` concatenates them into one column per metric
    results[k] = results1[k] + results2[k] + results3[k]

print("> Results:")
print(tabulate(results, headers="keys", floatfmt=".4f"))

loss: 0.2529 - acc: 0.9618 - max_sig: 616.9097 - r50: 835.0746
loss: 4.3782 - acc: 0.7973 - max_sig: 260.2736 - r50: 39.5114
102/102 - 2s - loss: 3.5197 - acc: 0.4425 - max_sig: 86.3768 - r50: 1.5445 - 2s/epoch - 22ms/step
> Results:
name                     loss     acc    max_sig       r50
---------------------  ------  ------  ---------  --------
boosted_decision_tree  0.2529  0.9618   616.9097  835.0746
cut_and_count          4.3782  0.7973   260.2736   39.5114
toy_mlp                3.5197  0.4425    86.3768    1.5445
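
The merging step relies on `+` concatenating lists. The sketch below replays it with stand-in dictionaries filled with the numbers printed above; the one-element-list layout of the `evaluate` results is an assumption inferred from the code, and the table is printed with the standard library only rather than with `tabulate`.

```python
# Hypothetical stand-ins for the dictionaries returned by `evaluate`.
results1 = {"loss": [0.2529], "acc": [0.9618], "max_sig": [616.9097], "r50": [835.0746]}
results2 = {"loss": [4.3782], "acc": [0.7973], "max_sig": [260.2736], "r50": [39.5114]}
results3 = {"loss": [3.5197], "acc": [0.4425], "max_sig": [86.3768], "r50": [1.5445]}
names = ["boosted_decision_tree", "cut_and_count", "toy_mlp"]

# Build one column per key; list concatenation stacks the three methods.
table = {"name": names}
for key in results1:
    table[key] = results1[key] + results2[key] + results3[key]

# Print the header, then one aligned row per method.
print("  ".join(f"{h:>21}" for h in table))
for row in zip(*table.values()):
    print("  ".join(f"{v:>21}" if isinstance(v, str) else f"{v:>21.4f}" for v in row))
```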
