# Exercise

Below is the dataset from the Higgs Kaggle challenge that [you have already seen last week][1].

* Fit a boosted decision tree (BDT) classifier using [HistGradientBoostingClassifier][2] from scikit-learn.
* The default setting fits at most 100 trees (`max_iter=100`). Can you get better performance by fitting more trees? Also have a look at `model.train_score_` and `model.validation_score_`. These are only filled when early stopping is active, which is the default for more than 10000 training samples.
* What are the most important features? Can you get similar accuracy by fitting only those?

Tips:
* [HistGradientBoostingClassifier][2] does not provide feature importances; use [GradientBoostingClassifier][3] for that.
* This one is a bit slower, but you can speed up the training with the `subsample` option.
* With `subsample < 1` you can also plot `model.oob_scores_` as a replacement for the validation score.
* Alternatively, you can use `staged_predict_proba` as in [this tutorial][4].
* [GradientBoostingClassifier][3] uses `max_depth=3` by default, while [HistGradientBoostingClassifier][2] grows larger trees with `max_leaf_nodes=31`. You may want to adjust this.
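
The tips above can be sketched on synthetic data as follows. This is only a minimal illustration, not the solution for the Higgs dataset: `X_toy`, `y_toy`, `gbc` and the settings `n_estimators=100` and `subsample=0.5` are assumptions chosen for speed. It ranks features by `feature_importances_` and tracks the validation loss per boosting stage with `staged_predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Higgs dataset)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

gbc = GradientBoostingClassifier(
    n_estimators=100,
    subsample=0.5,  # each tree is fit on a random half of the samples
    max_depth=3,    # the default, shown explicitly
    random_state=0,
)
gbc.fit(X_fit, y_fit)

# Impurity-based importances, normalized to sum to 1
ranking = np.argsort(gbc.feature_importances_)[::-1]
print("features ranked by importance:", ranking)

# Validation loss after each boosting stage
val_loss = np.array(
    [log_loss(y_val, proba) for proba in gbc.staged_predict_proba(X_val)]
)
print("stage with lowest validation loss:", int(np.argmin(val_loss)) + 1)
```

Refitting on only the top-ranked columns, for example `X_fit[:, ranking[:4]]`, is one way to test how much accuracy is lost with fewer features.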

[1]: https://gitlab.physik.uni-muenchen.de/damlpartphys24/05-validation-and-metrics/-/blob/main/higgs_challenge.ipynb
[2]: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html
[3]: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
[4]: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html

In [None]:
from pathlib import Path
import urllib.request

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

## Prepare data

In [None]:
path = Path("atlas-higgs-challenge-2014-v2.csv.gz")

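def prepare_data(path):
    """Make sure the dataset exists at `path`: reuse last week's copy or download it."""
    if path.exists():
        return
    url = "http://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz"
    # Reuse the file from the previous tutorial if it is available
    path_prev_tutorial = Path("../05-overfitting-validation-metrics") / path
    if path_prev_tutorial.exists():
        path.symlink_to(path_prev_tutorial)
    # Otherwise download the file
    if not path.exists():
        urllib.request.urlretrieve(url, path)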

prepare_data(path)

df = pd.read_csv(path)

In [None]:
feature_names = [col for col in df.columns if col.startswith("DER") or col.startswith("PRI")]
feature_names

In [None]:
df

In [None]:
X = df[feature_names]
y = df['Label']
weight = df['Weight']

In [None]:
(
    X_train,
    X_test,
    y_train,
    y_test,
    weight_train,
    weight_test,
) = train_test_split(
    X.to_numpy(),
    (y == "s").to_numpy(),
    weight.to_numpy(),
    test_size=0.33,
    random_state=42
)

# Per-class scale factors that equalize the weighted sums of signal and background
class_weight = np.array([
    len(y_train) / weight_train[y_train==0].sum(),
    len(y_train) / weight_train[y_train==1].sum(),
])

# Normalize to an average weight of 1;
# pass weight_for_fit as sample_weight when fitting
weight_for_fit = weight_train * class_weight[y_train.astype(int)]
weight_for_fit /= weight_for_fit.mean()
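
To see what this reweighting does, here is a small self-contained check on made-up numbers. The arrays `y_toy` and `w_toy` are assumptions for illustration, not challenge data. After applying the same recipe, the weighted sums of signal and background should agree and the mean weight should be 1.

```python
import numpy as np

# Toy stand-ins (assumption: made-up numbers, not the challenge weights)
y_toy = np.array([0, 0, 0, 0, 1, 1], dtype=bool)    # True = signal
w_toy = np.array([2.0, 3.0, 1.0, 4.0, 0.01, 0.03])  # event weights

# Same recipe as above: one scale factor per class, then normalize the mean to 1
class_scale = np.array([
    len(y_toy) / w_toy[y_toy == 0].sum(),
    len(y_toy) / w_toy[y_toy == 1].sum(),
])
w_fit = w_toy * class_scale[y_toy.astype(int)]
w_fit /= w_fit.mean()

print("background sum:", w_fit[~y_toy].sum())
print("signal sum:", w_fit[y_toy].sum())
print("mean weight:", w_fit.mean())
```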

## Train models

(your task)

## Evaluate performance

In [None]:
def ams(s, b):
    """
    Approximate median significance (AMS), as defined in the Higgs Kaggle challenge.

    The number 10, added to the background yield, is a regularization term that reduces the variance of the AMS.
    """
    return np.sqrt(2 * ((s + b + 10) * np.log(1 + s / (b + 10)) - s))

sumw = df.groupby("Label").Weight.sum()
nsig_tot = sumw["s"]
nbkg_tot = sumw["b"]

In [None]:
model.score(X_test, y_test, sample_weight=weight_test)

In [None]:
p_test = model.predict_proba(X_test)[:, 1]

In [None]:
roc_auc_score(y_test, p_test, sample_weight=weight_test)

In [None]:
fpr, tpr, thr = roc_curve(y_test, p_test, sample_weight=weight_test)

In [None]:
ams(tpr * nsig_tot, fpr * nbkg_tot).max()
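
The maximum AMS can be tied back to a cut on the classifier output, since `roc_curve` returns one threshold per point. The sketch below illustrates this with synthetic scores and made-up total yields. Here `y_toy`, `p_toy`, `nsig_toy` and `nbkg_toy` are hypothetical stand-ins, not the values from the dataset.

```python
import numpy as np
from sklearn.metrics import roc_curve

def ams(s, b):
    """Approximate median significance with the regularization term 10."""
    return np.sqrt(2 * ((s + b + 10) * np.log(1 + s / (b + 10)) - s))

# Synthetic stand-ins (assumptions): random scores and made-up total yields
rng = np.random.default_rng(0)
y_toy = rng.random(5000) < 0.3
p_toy = np.clip(rng.normal(0.35 + 0.3 * y_toy, 0.15), 0, 1)
nsig_toy, nbkg_toy = 700.0, 400000.0  # hypothetical expected yields

fpr, tpr, thr = roc_curve(y_toy, p_toy)
ams_values = ams(tpr * nsig_toy, fpr * nbkg_toy)

# Index of the ROC point with the largest AMS and the matching score threshold
best = int(np.argmax(ams_values))
print(f"max AMS = {ams_values[best]:.3f} at score threshold {thr[best]:.3f}")
print(f"signal efficiency = {tpr[best]:.3f}, background efficiency = {fpr[best]:.3f}")
```

Plotting `ams_values` against `thr` shows how sharply the significance depends on the chosen cut.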