- Dataset
- Analysis strategy
- Feature engineering (only for DNNs)
- Scaling of the Data
- BDT Model
- Neural Network Structure
- Evaluation of the classification
- Preliminary results
- Results
The dataset is divided into the following subsets:
• Training set: KaggleSet = t, 250,000 events.
• Validation set: KaggleSet = b, 100,000 events.
• Test set: KaggleSet = v, 450,000 events.
• Unused: KaggleSet = u, 18,000 events.
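A minimal sketch of this split with pandas, assuming the full dataset release that carries a KaggleSet column (the file name is illustrative):

    import pandas as pd

    # Illustrative file name for the full dataset release.
    df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv")

    train = df[df["KaggleSet"] == "t"]   # 250,000 events
    valid = df[df["KaggleSet"] == "b"]   # 100,000 events
    test = df[df["KaggleSet"] == "v"]    # 450,000 events
    unused = df[df["KaggleSet"] == "u"]  # 18,000 events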
Information about the variables:
• 13 derived and 17 primitive variables.
• Derived variables are calculated from the primitive variables.
• Several variables are highly correlated.
Signal and background events are labelled and weighted.
The analysis is performed by the analysis.py script.
The dataset is split into subsets according to the number of jets (useful only for the DNNs):
• Events with 0 jets (100,000 events).
• Events with 1 jet (78,000 events).
• Events with ≥ 2 jets (72,000 events).
Features that are meaningless for a given subset are dropped (see the sketch after this list):
• Drop 13 variables for the 0-jet subset.
• Drop 8 variables for the 1-jet subset.
• Keep all variables for the ≥ 2 jets subset.
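A minimal sketch of the split and of the feature dropping, assuming the PRI_jet_num column and the dataset convention that undefined values are flagged with -999 (the exact lists of dropped variables used in analysis.py may differ):

    subsets = {0: train[train["PRI_jet_num"] == 0],
               1: train[train["PRI_jet_num"] == 1],
               2: train[train["PRI_jet_num"] >= 2]}

    # Drop the columns that are undefined (all -999) within each subset.
    cleaned = {n: sub.loc[:, ~(sub == -999.0).all(axis=0)]
               for n, sub in subsets.items()}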
First, a classification is performed on the whole dataset (without considering subsets) using Gradient Boosted Decision Trees (BDT). Then, three Deep Neural Networks are trained, one for each jet-multiplicity subset.
The distributions of some of the angular variables are uniform, so by themselves they cannot be used to discriminate between signal and background. The idea is therefore to build new features from relative angles:
• Delta_phi_tau_lep
• Delta_phi_met_lep
• Delta_phi_jet_jet
• Delta_eta_tau_lep
These new features have different distributions for signal and background. Finally, all absolute phi variables are dropped: PRI_tau_phi, PRI_lep_phi, PRI_jet_leading_phi, PRI_jet_subleading_phi.
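A minimal sketch of the feature construction with numpy and pandas; the wrapping of the angle difference into [-π, π] is an assumed convention, not necessarily the one used in analysis.py:

    import numpy as np

    def delta_phi(phi1, phi2):
        # Wrap the difference of two azimuthal angles into [-pi, pi].
        return np.mod(phi1 - phi2 + np.pi, 2.0 * np.pi) - np.pi

    df["Delta_phi_tau_lep"] = np.abs(delta_phi(df["PRI_tau_phi"], df["PRI_lep_phi"]))
    df["Delta_phi_met_lep"] = np.abs(delta_phi(df["PRI_met_phi"], df["PRI_lep_phi"]))
    df["Delta_phi_jet_jet"] = np.abs(delta_phi(df["PRI_jet_leading_phi"],
                                               df["PRI_jet_subleading_phi"]))
    df["Delta_eta_tau_lep"] = np.abs(df["PRI_tau_eta"] - df["PRI_lep_eta"])

    # Drop the absolute phi variables, which carry no discriminating power.
    df = df.drop(columns=["PRI_tau_phi", "PRI_lep_phi",
                          "PRI_jet_leading_phi", "PRI_jet_subleading_phi"])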
An example of the distributions of the new variables is shown below; a clear discrimination between signal and background is visible:
Before training the DNNs and the BDT, the input data were scaled with standard scaling (zero mean, unit variance per feature).
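A minimal sketch with sklearn's StandardScaler, assuming feature matrices X_train, X_valid, X_test (names are illustrative); the scaler is fitted on the training set only, to avoid information leakage:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()  # zero mean, unit variance per feature
    X_train_s = scaler.fit_transform(X_train)  # fit on the training set only
    X_valid_s = scaler.transform(X_valid)      # reuse the training statistics
    X_test_s = scaler.transform(X_test)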
The HistGradientBoostingClassifier class from the sklearn library is used. This BDT model is trained on the whole dataset; it runs much faster than the DNNs while remaining comparably accurate.
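A minimal sketch of the BDT training, with an illustrative hyperparameter and hypothetical names y_train, w_train for the labels and event weights:

    from sklearn.ensemble import HistGradientBoostingClassifier

    bdt = HistGradientBoostingClassifier(max_iter=200)  # illustrative setting
    bdt.fit(X_train_s, y_train, sample_weight=w_train)
    p_bdt = bdt.predict_proba(X_valid_s)[:, 1]  # signal probability per event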
Three DNNs are used, one for each jet-multiplicity subset, implemented with the keras library. They are structured as follows (a sketch is given after the list):
• 5 or 6 hidden layers.
• ReLU and ELU activation functions.
• Adam and Adagrad optimizers.
• Dropout and L1 regularization.
• Loss: binary crossentropy.
• Metric: accuracy.
• 2D softmax output: (0, 1) for a perfect signal event and (1, 0) for a perfect background event.
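A minimal keras sketch of one of these networks, using 5 ReLU hidden layers with dropout; the layer width (128) and dropout rate (0.2) are illustrative assumptions:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_dnn(n_features):
        model = keras.Sequential()
        model.add(keras.Input(shape=(n_features,)))
        for _ in range(5):  # 5 hidden layers (6 in some configurations)
            model.add(layers.Dense(128, activation="relu"))
            model.add(layers.Dropout(0.2))
        # 2D softmax head: (0, 1) = signal, (1, 0) = background.
        model.add(layers.Dense(2, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model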
The model accuracy plots for each subset (by number of jets) are shown below:
Plot for the 0-jet classification:
Plot for the 1-jet classification:
Plot for the ≥ 2 jets classification:
The models are combined by applying logistic regression to the outputs of both the DNNs and the BDT, using the LogisticRegression class from the sklearn library.
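A minimal sketch of the combination, with hypothetical names p_dnn and p_bdt for the two models' signal probabilities on the validation set; in practice the combiner would be fitted on one sample and evaluated on another:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stack the two signal probabilities as meta-features.
    X_meta = np.column_stack([p_dnn, p_bdt])
    combiner = LogisticRegression()
    combiner.fit(X_meta, y_valid)
    p_combined = combiner.predict_proba(X_meta)[:, 1]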
The classification is evaluated with a metric called AMS (Approximate Median Significance; see PDF_dataset.pdf). Finally, the outputs of all the classifiers are combined with the logistic regression method and the AMS of the combined classifier is computed.
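For reference, the AMS definition from the challenge documentation as a small helper, where s and b are the weighted signal and background sums passing the cut and b_reg = 10 is the regularization term (variable names in the usage example are illustrative):

    import numpy as np

    def ams(s, b, b_reg=10.0):
        # Approximate Median Significance.
        return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

    # Example: weighted sums above a probability cut.
    sel = p_combined > 0.88
    s = w_valid[sel & (y_valid == 1)].sum()
    b = w_valid[sel & (y_valid == 0)].sum()
    score = ams(s, b)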
This section presents the preliminary results. The numerical results (AMS scores) are:
- Best AMS score of the DNNs: 3.529 at a cut parameter of 0.83.
- Best AMS score of the BDT: 3.578 at a cut parameter of 0.83.
- Best AMS score of the logistic-regression combination: 3.652 at a cut parameter of 0.88.
The graphical results are shown here:
- Comparison between the combined, DNN, and BDT AMS curves:
- Unweighted signal-background discrimination distribution for the validation set:
- Weighted signal-background discrimination distribution for the validation set:
NOTE: in the latter case a strange result is visible in the last bin. I investigated this and interpret it as a small amount of overtraining in that part of the distribution.
The final result plots are produced with the plots.py script and are shown below:
Final weighted distribution: