# Lab 5 Machine Learning and Data Classification

## Table of Contents
1. **[Data: Signal and Background](#Data)**  
   1.1. [What is Data?](#what_is_data)  
   1.2. [Visually Comparing Signal and Background](#compar)  
2. **[Classifier Training](#class_train)**  
   2.1 [TMVA](#TMVA)  
   2.2 [Decision Trees and Boosting](#dtrees_and_boosting)  
3. **[Classifier Testing](#class_test)**  
   3.1 [Overtraining Check with Cross Validation](#overtrain_check)  
   3.2 [ROC Curve and Score](#ROC)  


# 1. Data: Signal and Background <a name="Data" />  
## 1.1 What is Data? <a name="what_is_data" />  
An analysis is interested in what we consider to be signal, because the signal is what tells us about the physics. Background noise is unavoidable, however, so one objective of an analysis is to reduce it as much as possible. To do so, we compare what signal and background look like and use those differences to classify our data, which is some unknown combination of the two. This can be done by simple visual comparison of distributions, or with a classifier such as a neural network, which compares signal and background on a multivariate level. A classifier has the added benefit of looking at many variables of a single event at once and doing so across millions of events. Before it can classify the data, though, it must be trained to know what both signal and background look like.

## 1.2 Visually Comparing Signal and Background <a name="compar" />  
For this exercise we will be working with three ROOT files: a background-only file, a signal-and-background file, and a signal-only file. Let's visually compare the background and signal before using a classifier. For this example, we will be working with a generated dataset of particle events with mass, eta, and phi information.

We need to enable JavaScript visualization as some of the interactive parts of TMVA are supported in the notebook through JSROOT.

In [1]:
%jsroot on

In [2]:
// Grab the three decay trees
TFile * bkgFile = TFile::Open("../../Datasets/Gen/Lab5_root/example/bkg.root");
TTree * bkgTree = (TTree *) bkgFile->Get("DecayTree");

TFile * bkgSigFile = TFile::Open("../../Datasets/Gen/Lab5_root/example/bkg_and_sig.root");
TTree * bkgSigTree = (TTree *) bkgSigFile->Get("DecayTree");

TFile * sigFile = TFile::Open("../../Datasets/Gen/Lab5_root/example/sig.root");
TTree * sigTree = (TTree *) sigFile->Get("DecayTree");

// Create TCanvas and plot the mass, phi, and eta from each TTree
TCanvas canvas("canvas", "Comparing Sig and Bkg", 900, 900);
canvas.Divide(3,3);

// Draw each distribution
// background
canvas.cd(1);
bkgTree->Draw("mass");
canvas.cd(2);
bkgTree->Draw("eta");
canvas.cd(3);
bkgTree->Draw("phi");

// signal
canvas.cd(4);
sigTree->Draw("mass");
canvas.cd(5);
sigTree->Draw("eta");
canvas.cd(6);
sigTree->Draw("phi");

// background and signal
canvas.cd(7);
bkgSigTree->Draw("mass");
canvas.cd(8);
bkgSigTree->Draw("eta");
canvas.cd(9);
bkgSigTree->Draw("phi");

canvas.Draw();

Visually, we can already see that the mass distributions of signal and background differ: a Gaussian and an exponential, respectively. Additionally, there seems to be a correlation between mass and eta that differs between signal and background. The phi distributions show no clear difference between signal and background. Since the distributions overlap, we cannot simply cut at a certain mass or eta: although this would eliminate background, it would also eliminate signal. We must find a cut that removes as much background as possible while preserving our signal. This is where the classifier becomes useful. Though we can see the correlation between mass and eta by eye, a classifier can consider both distributions simultaneously when evaluating whether an event is signal or background. For example, a low-mass event with low eta is probably background, while an event with a mass of 172 MeV and an eta of 2 may be difficult to assign to either signal or background. The classifier gives its determination as -1 or 1, or in some cases 0 or 1. The output can lie on a continuum between those values or be strictly boolean; which one you get depends entirely on the classifier and the decision function called from it.
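
To make the trade-off of a single cut concrete, here is a minimal sketch in plain C++ (no ROOT needed). The function names `passFraction`, `signalEfficiency`, and `backgroundRejection` are hypothetical helpers written for this illustration, and any values you feed in are toy numbers rather than the lab datasets.

```cpp
#include <cstddef>
#include <vector>

// Fraction of values that pass the cut `value > threshold`.
// Returns 0 for an empty sample.
double passFraction(const std::vector<double>& values, double threshold) {
    if (values.empty()) return 0.0;
    std::size_t passed = 0;
    for (double v : values) {
        if (v > threshold) ++passed;
    }
    return static_cast<double>(passed) / values.size();
}

// Signal efficiency: fraction of signal events kept by the cut.
double signalEfficiency(const std::vector<double>& signal, double threshold) {
    return passFraction(signal, threshold);
}

// Background rejection: fraction of background events removed by the cut.
double backgroundRejection(const std::vector<double>& background, double threshold) {
    return 1.0 - passFraction(background, threshold);
}
```

With toy signal masses {150, 170, 175, 180} and toy background masses {20, 60, 120, 172}, a cut at 160 keeps 75% of the signal and rejects 75% of the background. Moving the cut improves one number only at the expense of the other, which is the limitation a multivariate classifier tries to get around.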

# 2. Classifier Training <a name="class_train" /> 

We can think of classification via supervised learning as requiring two phases:
  1. Training Phase
  2. Application Phase
  
The training phase is where our model learns an internal representation of our data so that it can predict the classification label. Because our training data has a label assigned to each example, this is a supervised learning problem. In binary classification, the labels are either 0 and 1 or -1 and 1, representing a negative (background) class and a positive (signal) class.

Some parameters, called hyperparameters, are not learned by the learning algorithm and must instead be chosen by us. Choosing them is often called the secondary learning problem in machine learning. In order to validate the results of our chosen hyperparameter configuration, we need to split the training data into two parts. The first part, called the training sample, is used by our learning algorithm to learn the parameters that best predict our classes. The second part, called the validation sample, is used to benchmark our model for our selection of hyperparameters. Usually a grid search is performed to find the best hyperparameters by trying equally spaced values of each one in every combination. This can be very time intensive and is usually the longest part of training.
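
A grid search can be sketched in a few lines of plain C++. Everything here is an assumption made for illustration: `linspace`, `GridPoint`, and `gridSearch` are hypothetical names, and the `validationScore` callback stands in for "train with these two hyperparameters and return the score on the validation sample".

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// n equally spaced values from lo to hi inclusive.
std::vector<double> linspace(double lo, double hi, std::size_t n) {
    std::vector<double> grid;
    if (n < 2) { grid.push_back(lo); return grid; }
    const double step = (hi - lo) / static_cast<double>(n - 1);
    for (std::size_t i = 0; i < n; ++i) grid.push_back(lo + step * i);
    return grid;
}

// One tested combination of two hyperparameters and its validation score.
struct GridPoint { double a; double b; double score; };

// Try every combination of the two hyperparameters and keep the one
// with the highest validation score.
GridPoint gridSearch(const std::vector<double>& aValues,
                     const std::vector<double>& bValues,
                     const std::function<double(double, double)>& validationScore) {
    GridPoint best{0.0, 0.0, 0.0};
    bool first = true;
    for (double a : aValues) {
        for (double b : bValues) {
            const double s = validationScore(a, b);
            if (first || s > best.score) {
                best = {a, b, s};
                first = false;
            }
        }
    }
    return best;
}
```

The nested loops show why this step is slow: the number of trainings is the product of the grid sizes, so ten values for each of three hyperparameters already means a thousand trainings.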

The application phase is when the model is applied to data, after the parameters have been learned and the secondary learning problem has been dealt with. The data fed into the model at this stage is called the test sample.

## 2.1 TMVA <a name="TMVA" /> 

TMVA stands for Toolkit for Multivariate Analysis. It is part of the ROOT library, and we will use it for our classifier training. We will be using a Boosted Decision Tree (BDT) for our classification. The theory behind decision tree learning will be given in class (and more here in the future). Boosting just means we will use an ensemble of small decision trees and then learn a function over these small decision trees to predict our classes. This helps keep the decision trees from overfitting.

First we need to load the library.

In [3]:
TMVA::Tools::Instance(); // This loads the library.

Now we will create the `Factory` object, which relays your data to the MVA models and stores the parameters needed to rebuild each model in the directory `dataset/weights/`.

In [4]:
// Create a ROOT output file where TMVA will store ntuples, histograms, etc.
TFile * outputFile = TFile::Open("classificationOutput.root", "RECREATE");

// Create the factory object. Later you can choose the methods
// whose performance you'd like to investigate. The factory is
// the only TMVA object you have to interact with
TMVA::Factory factory("BDTSignalBkgClassification", outputFile,
                      "!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification");

The first argument to the `Factory` constructor is the base name of all the weight files in the directory `dataset/weights/`. You can think of this as the name of the job being undertaken by the MVA. The second argument is the output file for the training results. The third argument is the option string, where options are separated by ":". A full list of options for the `Factory` class is given [here](http://tmva.sourceforge.net/old_site/optionRef.html#Factory).

Now we need to start loading the data with the `DataLoader` class. First we will create a `DataLoader` object and define the input variables.

In [5]:
TMVA::DataLoader dataLoader("dataset");

dataLoader.AddVariable("mass");
dataLoader.AddVariable("eta");
dataLoader.AddVariable("phi");

Now we need to add our signal and background to the `DataLoader` and allow for some arbitrary cuts to be made.

In [6]:
TCut myCutSig, myCutBkg; // No cuts are being made, uses same convention as TTree->Draw("expression");


dataLoader.AddSignalTree(sigTree, 1.0);   //signal weight  = 1
dataLoader.AddBackgroundTree(bkgTree, 1.0);   //background weight = 1

dataLoader.PrepareTrainingAndTestTree(myCutSig, myCutBkg,
                                   "nTrain_Signal=60000:nTrain_Background=60000:SplitMode=Random:NormMode=NumEvents:!V" );
// If using a defined test sample you would define test sample size like so:
// dataLoader.PrepareTrainingAndTestTree(mycut,
//         "NSigTrain=3000:NBkgTrain=3000:NSigTest=3000:NBkgTest=3000:SplitMode=Random:!V");

DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree sig_tree of type Signal with 80000 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree bkg_tree of type Background with 80000 events


The validation sample will be generated from the rest of our training sample. You usually need less data to validate than to train. However, you do want enough data to validate on.

### 2.1.1 K-Fold Cross-Validation

We will now perform k-fold cross-validation on our training and validation sample in order to solve the secondary learning problem. This is a fairly simple process and just requires a bunch of trial and error.

First we will create a `CrossValidation` object from our `DataLoader`. We need to create it after calling the `PrepareTrainingAndTestTree()` method so that k-fold cross-validation can be performed on our training and validation samples.

In HEP, we sometimes call the validation sample the test sample and the test sample the application sample. To tell the two nomenclatures apart, look for a mention of either a validation sample or an application sample, since each of those terms appears in only one of them. By the same logic, the `PrepareTrainingAndTestTree()` method above could just as well be called "`PrepareTrainingAndValidationTree()`".

In [7]:
UInt_t numFolds = 2;
TString analysisType = "Classification";
TString splitExpr = "";
TString cvOptions = Form("!V"
                            ":!Silent"
                            ":ModelPersistence"
                            ":AnalysisType=%s"
                            ":NumFolds=%i"
                            ":SplitExpr=%s",
                            analysisType.Data(), numFolds, splitExpr.Data());

In [8]:
TMVA::CrossValidation crossValidation("TMVACrossValidation", &dataLoader, outputFile, cvOptions);

The `NumFolds` argument is the "k" in k-fold cross-validation. The idea of k-fold cross-validation is to split your data into k folds, where k-1 folds are used for training and 1 fold is used for validation. You then do this k times so that each fold is the validation sample one time. Then you average your results and that is the performance of your model for the given hyperparameters.

The benefit of this is that every single data point is used for both training and validation, so you are minimizing the amount of bias introduced in choosing your hyperparameters. A typical value of k is 10, but it can be any reasonable value.
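
The fold bookkeeping can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `Fold`, `makeFolds`, and `meanScore` are hypothetical names, and assigning event `i` to fold `i % k` is a simplification chosen for the sketch, not a description of how TMVA splits events internally.

```cpp
#include <cstddef>
#include <vector>

struct Fold {
    std::vector<std::size_t> train;       // event indices used for training
    std::vector<std::size_t> validation;  // event indices held out for validation
};

// Build k folds. Event i is held out in fold (i % k), so every event
// is in the validation sample exactly once and in the training sample k-1 times.
std::vector<Fold> makeFolds(std::size_t nEvents, std::size_t k) {
    std::vector<Fold> folds(k);
    for (std::size_t f = 0; f < k; ++f) {
        for (std::size_t i = 0; i < nEvents; ++i) {
            if (i % k == f) folds[f].validation.push_back(i);
            else            folds[f].train.push_back(i);
        }
    }
    return folds;
}

// Average of the per-fold scores: the figure of merit for one
// choice of hyperparameters.
double meanScore(const std::vector<double>& scores) {
    if (scores.empty()) return 0.0;
    double sum = 0.0;
    for (double s : scores) sum += s;
    return sum / scores.size();
}
```

For 10 events and k = 5, each fold validates on 2 events and trains on the other 8. You would train once per fold, collect one score per fold, and compare `meanScore` between hyperparameter choices.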

`AnalysisType` should always be "Classification" for our purposes.  
`SplitExpr` tells how to split the data into the k-folds. This is dangerous because you want to make sure you do not bias how you are splitting the data. Usually a `%int([NumFolds])` is added at the end of the expression to try and avoid this. Keep empty unless you know what you are doing.

Now we need to "book" a particular MVA model. In this case, we will be using a BDT and a multilayer perceptron (MLP). You can book as many MVA techniques as you like and then train them all seamlessly, which is why we are adding the MLP. BDTs are usually the preferred analysis technique, because the BDT classification algorithm can be analyzed in more detail than the MLP. This does not mean the MLP is bad; we simply do not always understand everything it is deciding to do. That being said, there is currently a shift away from BDTs, as more advanced MVA techniques have come about and have shown performance that BDTs cannot match.

In [9]:
crossValidation.BookMethod(TMVA::Types::kBDT, "BDT",
                           "!V:NTrees=500:MinNodeSize=2.5%:MaxDepth=5:BoostType=AdaBoost:AdaBoostBeta=0.1:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
crossValidation.BookMethod(TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

This is where you will edit your hyperparameters. The hyperparameters for the BDT are:
  * `NTrees`: The number of trees to be used in the boosted decision tree. Remember the BDT model learns a function over an ensemble of small decision trees to prevent overfitting.
  * `MinNodeSize`: The minimum number of training events that may be left in a leaf of the decision tree after splitting on some characteristic, given here as a percentage of the training sample (`2.5%`). This constraint also prevents the decision trees from overfitting.
  * `MaxDepth`: This is the maximum depth that a decision tree can grow to by performing splits on the data. This also helps prevent the decision tree from overfitting.
  * `BoostType`: This is something you probably won't be changing. `AdaBoost` means we are using _Adaptive Boosting_. Another available boosting type is _Gradient Boosting_.
  * `AdaBoostBeta`: The learning rate of the AdaBoost learning algorithm.
  * `UseBaggedBoost`: Use only a random subsample of all events for growing the decision trees in each boost iteration.
  * `BaggedSampleFraction`: The fraction of the BaggedSample with respect to the training sample, i.e., 0.5 means the bagged sample will be 50% of the size of the training sample.
  * `SeparationType`: The function used to determine which variable to split the data on (and, for continuous variables, where). Some popular ones are chi-squared, [Information Gain](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), and the [Gini coefficient/index](https://en.wikipedia.org/wiki/Gini_coefficient).
  * `nCuts`: "Number of grid points in variable range used in finding optimal cut in node splitting." For a continuous variable, this many equally spaced candidate cut positions are placed across the variable's range, and only those positions are tested when looking for the best split. This keeps training fast, which matters here since our variables are continuous and we have a very large number of events.
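
To illustrate how `SeparationType` and `nCuts` work together, here is a minimal plain C++ sketch of one node split. It assumes the Gini index is p(1 - p), with p the signal purity of a node, and scans `nCuts` equally spaced positions for the largest decrease in the size-weighted index. The names `Sample`, `gini`, and `bestCut` are hypothetical, and this is a generic illustration rather than TMVA's actual implementation.

```cpp
#include <cstddef>
#include <vector>

struct Sample { double x; bool isSignal; };

// Gini index of a node with purity p = nSig / (nSig + nBkg): p * (1 - p).
// It is 0 for a pure node and largest (0.25) for a 50/50 mix.
double gini(double nSig, double nBkg) {
    const double n = nSig + nBkg;
    if (n == 0.0) return 0.0;
    const double p = nSig / n;
    return p * (1.0 - p);
}

// Scan nCuts equally spaced cut positions strictly inside (xMin, xMax) and
// return the position with the largest decrease in size-weighted Gini index.
double bestCut(const std::vector<Sample>& node, double xMin, double xMax, int nCuts) {
    double nSig = 0.0, nBkg = 0.0;
    for (const Sample& s : node) {
        if (s.isSignal) nSig += 1.0; else nBkg += 1.0;
    }
    const double parent = (nSig + nBkg) * gini(nSig, nBkg);

    double bestGain = -1.0;
    double bestX = xMin;
    for (int c = 1; c <= nCuts; ++c) {
        const double cut = xMin + (xMax - xMin) * c / (nCuts + 1);
        double sigLeft = 0.0, bkgLeft = 0.0;
        for (const Sample& s : node) {
            if (s.x < cut) {
                if (s.isSignal) sigLeft += 1.0; else bkgLeft += 1.0;
            }
        }
        const double sigRight = nSig - sigLeft;
        const double bkgRight = nBkg - bkgLeft;
        const double gain = parent
                          - (sigLeft + bkgLeft) * gini(sigLeft, bkgLeft)
                          - (sigRight + bkgRight) * gini(sigRight, bkgRight);
        if (gain > bestGain) { bestGain = gain; bestX = cut; }
    }
    return bestX;
}
```

A larger `nCuts` tests more positions and can find a slightly better split, at the cost of more work per node.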
  
As you can see, decision tree learning suffers from the problem of overfitting. Decision trees are very expressive models and, as such, will often overfit to the training data they are given. Boosting and these hyperparameter constraints help prevent overfitting from occurring and lead to a very effective and understandable classifier.
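
The role of the AdaBoost learning rate can be sketched with one reweighting step in plain C++. The sketch assumes the common AdaBoost form in which events misclassified by the latest tree have their weights multiplied by α = ((1 - err)/err)^β, where err is the weighted misclassification rate and β plays the role of `AdaBoostBeta`. The function name `adaBoostStep` is hypothetical, and this should be read as an illustration of the idea, not as TMVA's exact code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One AdaBoost reweighting step.
//   weights       : current event weights (updated in place)
//   misclassified : true where the latest tree got the event wrong
//   beta          : learning rate
// Returns the boost weight alpha that was applied.
double adaBoostStep(std::vector<double>& weights,
                    const std::vector<bool>& misclassified,
                    double beta) {
    if (weights.size() != misclassified.size()) return 1.0;

    double total = 0.0, wrong = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        total += weights[i];
        if (misclassified[i]) wrong += weights[i];
    }
    // Degenerate cases (nothing or everything misclassified): leave weights alone.
    if (total <= 0.0 || wrong <= 0.0 || wrong >= total) return 1.0;

    const double err = wrong / total;
    const double alpha = std::pow((1.0 - err) / err, beta);

    // Boost the misclassified events so the next tree concentrates on them.
    double newTotal = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (misclassified[i]) weights[i] *= alpha;
        newTotal += weights[i];
    }
    // Renormalise so the sum of weights is unchanged.
    for (double& w : weights) w *= total / newTotal;
    return alpha;
}
```

With four equally weighted events and one misclassified, err = 0.25, so β = 1 gives α = 3, while β = 0.1 gives α = 3^0.1 ≈ 1.12. A small β therefore changes the weights only gently from one tree to the next, which is why it acts as a learning rate.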

The hyperparameters for the MLP are as follows:
  * `NeuronType`: The activation function being used. `tanh` is a very popular option. Other options may include `ReLU`, `sigmoid`, `Leaky ReLU`, and `sgn` functions.
  * `VarTransform`: List of variable transformations performed before training. Each variable transformation can be applied to either the signal, background, or all classes by adding a "\_Signal", "\_Background", or "\_AllClasses" tag after the variable transformation. For example: "P\_Signal". By default, the variable transformation is applied to all classes.
    *  D - Decorrelation
    *  P - Principal Component Analysis (PCA) transformation
    *  G - Gaussianisation
    *  N - Normalisation
  * `NCycles`: The number of passes over the training sample to be performed. Each pass is also called an __epoch__. Perceptrons are considered online learning algorithms, which means they only learn as they make mistakes. That is why we need to pass over the training sample more than once: a single pass does not guarantee that the perceptron has converged and learned to predict the class label correctly. As training goes on, fewer mistakes should be made each epoch.
  * `HiddenLayers`: The layout of the layers between your input layer and your output layer, given as a comma-separated list of neuron counts with one entry per hidden layer, where `N` stands for the number of input variables. `N+5` therefore means a single hidden layer with N+5 neurons. The more hidden layers, the "deeper" your neural network.
  * `TestRate`: Every x-th epoch, overfitting is tested, where x is the `TestRate`.
  * `UseRegulator`: Whether a regulator is used to prevent overfitting. It is switched off here with `!UseRegulator`.
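
Two of these options are easy to picture in code. The plain C++ sketch below assumes that the `N` transformation maps each variable linearly onto [-1, 1] using the minimum and maximum seen in training, and that a `tanh` neuron outputs tanh(bias + Σ wᵢxᵢ). The names `normalise` and `tanhNeuron` are hypothetical and are not TMVA functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly map x from [xMin, xMax] onto [-1, 1].
// xMin and xMax are assumed to come from the training sample.
double normalise(double x, double xMin, double xMax) {
    if (xMax == xMin) return 0.0;
    return 2.0 * (x - xMin) / (xMax - xMin) - 1.0;
}

// Output of a single tanh neuron: tanh(bias + sum_i weights[i] * inputs[i]).
double tanhNeuron(const std::vector<double>& inputs,
                  const std::vector<double>& weights,
                  double bias) {
    double sum = bias;
    for (std::size_t i = 0; i < inputs.size() && i < weights.size(); ++i) {
        sum += weights[i] * inputs[i];
    }
    return std::tanh(sum);
}
```

Because the range is fixed by the training sample, a later event that falls outside that range maps to a value slightly outside [-1, 1]. With our three input variables, `HiddenLayers=N+5` corresponds to one hidden layer of eight such neurons.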
  
The full list of hyperparameters can be found by inspecting the generated classes in the `dataset/weights/` folder.

Now we will perform our k-fold cross-validation and observe our results.

In [10]:
// Run cross-validation
crossValidation.Evaluate();

                         : Evaluate method: BDT
<HEADER> Factory                  : Booking method: BDT_fold1
                         : 
<HEADER> BDT_fold1                : #events: (reweighted) sig: 30000 bkg: 30000
                         : #events: (unweighted) sig: 30060 bkg: 29940
                         : Training 500 Decision Trees ... patience please
                         : Elapsed time for training with 60000 events: 11.4 sec         
<HEADER> BDT_fold1                : [dataset] : Evaluation of BDT_fold1 on training sample (60000 events)
                         : Elapsed time for evaluation of 60000 events: 3.43 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDT_fold1.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDT_fold1.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDT_fold1 for Classification pe

                         : Elapsed time for training with 60000 events: 27.5 sec         
<HEADER> MLP_fold2                : [dataset] : Evaluation of MLP_fold2 on training sample (60000 events)
                         : Elapsed time for evaluation of 60000 events: 0.0903 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_MLP_fold2.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_MLP_fold2.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: MLP_fold2 for Classification performance
                         : 
<HEADER> MLP_fold2                : [dataset] : Evaluation of MLP_fold2 on testing sample (60000 events)
                         : Elapsed time for evaluation of 60000 events: 0.0876 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: MLP_fold2
     

<HEADER> Dataset:dataset          : Created tree 'TestTree' with 120000 events
                         : 
<HEADER> Dataset:dataset          : Created tree 'TrainTree' with 120000 events
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
                         : Evaluation done.


In [11]:
// Print the per-fold results for each booked method
std::vector<TMVA::CrossValidationResult> results = crossValidation.GetResults();
for (size_t i = 0; i < results.size(); ++i) {
    results[i].Print();
}

<HEADER> CrossValidation          :  ==== Results ====
                         : Fold  0 ROC-Int : 0.9991
                         : Fold  1 ROC-Int : 0.9994
                         : ------------------------
                         : Average ROC-Int : 0.9993
                         : Std-Dev ROC-Int : 0.0002
<HEADER> CrossValidation          :  ==== Results ====
                         : Fold  0 ROC-Int : 0.9992
                         : Fold  1 ROC-Int : 0.9994
                         : ------------------------
                         : Average ROC-Int : 0.9993
                         : Std-Dev ROC-Int : 0.0002


The first results are for the BDT since we booked the BDT first. ROC-Int is the integral of our ROC curve and should be as close to 1 as possible. We will go over what the ROC actually means later when we plot it.

### 2.1.2 Training

Now that we have determined our hyperparameters from above, we will actually perform the training with our `Factory` class. We need to also book the MVA techniques we will be using in our `Factory` class.

In [7]:
//Boosted Decision Trees (BDT)
factory.BookMethod(&dataLoader,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=200:MinNodeSize=2.5%:MaxDepth=5:BoostType=AdaBoost:AdaBoostBeta=0.1:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

//Multi-Layer Perceptron (Neural Network)
factory.BookMethod(&dataLoader, TMVA::Types::kMLP, "MLP",
                   "!H:!V:NeuronType=tanh:VarTransform=N:NCycles=100:HiddenLayers=N+5:TestRate=5:!UseRegulator" );

Factory                  : Booking method: BDT
                         : 
DataSetFactory           : [dataset] : Number of events in input trees
                         : 
                         : 
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 60000
                         : Signal     -- testing events             : 20000
                         : Signal     -- training and testing events: 80000
                         : Background -- training events            : 60000
                         : Background -- testing events             : 20000
                         : Background -- training and testing events: 80000
                         : 
DataSetInfo              : Correlation matrix (Signal):
                         : --------------------------------
                      

These use the same hyperparameters as the k-fold cross-validation, except that `NTrees` for the BDT is 200 here instead of 500. Look above if you forgot what some of them meant.

...And now, we train!

In [8]:
factory.TrainAllMethods();

Factory                  : Train all methods
Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'mass' <---> Output : variable 'mass'
                         : Input : variable 'eta' <---> Output : variable 'eta'
                         : Input : variable 'phi' <---> Output : variable 'phi'
Factory                  : [dataset] : Create Transformation "D" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'mass' <---> Output : variable 'mass'
                         : Input : variable 'eta' <---> Output : variable 'eta'
                         : Input : variable 'phi' <---> Output : variable 'phi'
Factory                  : [dataset] : Create Transformation "P" with events from all c


                         : Elapsed time for training with 120000 events: 9.28 sec         
BDT                      : [dataset] : Evaluation of BDT on training sample (120000 events)



                         : Elapsed time for evaluation of 120000 events: 1.99 sec       
                         : Creating xml weight file: dataset/weights/BDTSignalBkgClassification_BDT.weights.xml
                         : Creating standalone class: dataset/weights/BDTSignalBkgClassification_BDT.class.C
                         : classificationOutput.root:/dataset/Method_BDT/BDT
Factory                  : Training finished
                         : 
Factory                  : Train method: MLP for Classification
                         : 
TFHandler_MLP            : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     mass:   -0.76227    0.14037   [    -1.0000     1.0000 ]
                         :      eta:    0.18587    0.36358   [    -1.0000     1.0000 ]
                         :      phi: 0.00053083    0.57589   [    -1.0000     


                         : Elapsed time for training with 120000 events: 48 sec         
MLP                      : [dataset] : Evaluation of MLP on training sample (120000 events)



                         : Elapsed time for evaluation of 120000 events: 0.174 sec       
                         : Creating xml weight file: dataset/weights/BDTSignalBkgClassification_MLP.weights.xml
                         : Creating standalone class: dataset/weights/BDTSignalBkgClassification_MLP.class.C
                         : Write special histos to file: classificationOutput.root:/dataset/Method_MLP/MLP
Factory                  : Training finished
                         : 
                         : Ranking input variables (method specific)...
BDT                      : Ranking result (top variable is best ranked)
                         : --------------------------------------
                         : Rank : Variable  : Variable Importance
                         : --------------------------------------
                         :    1 : eta       : 6.554e-01
                         :    2 : mass      : 2.939e-01
                         :    3 :

This will take some time. The verbose output is left on so that you can see everything that is happening. Next we need to test our models on the test sample and then evaluate their performance.

In [9]:
factory.TestAllMethods();
factory.EvaluateAllMethods();

Factory                  : Test all methods
Factory                  : Test method: BDT for Classification performance
                         : 
BDT                      : [dataset] : Evaluation of BDT on testing sample (40000 events)



                         : Elapsed time for evaluation of 40000 events: 0.525 sec       
Factory                  : Test method: MLP for Classification performance
                         : 
MLP                      : [dataset] : Evaluation of MLP on testing sample (40000 events)
                         : Elapsed time for evaluation of 40000 events: 0.059 sec       
Factory                  : Evaluate all methods
Factory                  : Evaluate classifier: BDT
                         : 
BDT                      : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 



TFHandler_BDT            : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     mass:     136.10     80.462   [  0.0020855     1221.2 ]
                         :      eta:     1.5011     1.6988   [    -3.7054     5.0489 ]
                         :      phi:   0.021955     1.8116   [    -3.1416     3.1411 ]
                         : -----------------------------------------------------------
Factory                  : Evaluate classifier: MLP
                         : 
TFHandler_MLP            : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     mass:   -0.76216    0.14062   [    -1.0000     1.1343 ]
                         :      eta:    0.18567    0.36478   [   -0.93234    0.94750 ]
                         :      phi:  0.0069582   

Now we will plot the [Receiver Operating Characteristic (ROC) curves](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).

In [10]:
auto c1 = factory.GetROCCurve(&dataLoader);
c1->Draw();

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The TPR is the number of times you correctly predicted the positive class divided by the total number of positive examples in the data. The FPR is the number of times you labeled an example positive when in reality it was negative, divided by the total number of negative examples. Here is a nice picture to help keep them straight:

![alt-text](https://www.gilliganondata.com/wp-content/uploads/2009/08/TypeI_TypeII1.JPG "")

We then vary the threshold on the classification output above which an event is considered signal. For example, if our classification output $\in [0, 1]$ and we say that signal corresponds to a classification output > 0.99, a good classifier should give a very small FPR but still a high TPR. A perfect classifier will have all signal at 1 and all background at 0. So as we lower the threshold, the TPR stays at 1 and the FPR stays at 0 until signal corresponds to a classification output $\geq$ 0. At that point everything is labeled signal, so our TPR is 1 and our FPR is 1. This means that if we plotted the TPR as a function of the FPR, it would look like a step function: it rises straight to 1 at FPR = 0 and stays there. If we integrate the area under this curve, we get an area of 1.

This is the idea behind a ROC curve. Basically we are seeing how well separated our signal classification is from our background classification. The closer the integral of this curve is to 1, the better a job our classifier has done at separating signal from background.
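
The threshold sweep and the integral can be sketched in plain C++. The names `Scored`, `RocPoint`, `rocCurve`, and `rocIntegral` are hypothetical, and the sketch uses the TPR versus FPR convention described above with a simple trapezoidal integral. It is meant as an illustration of the idea, not as what `GetROCCurve()` does internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Scored   { double output; bool isSignal; };
struct RocPoint { double fpr; double tpr; };

// Sweep the threshold from high to low. At each distinct output value,
// every event with output >= threshold is labeled signal.
std::vector<RocPoint> rocCurve(std::vector<Scored> events) {
    std::sort(events.begin(), events.end(),
              [](const Scored& a, const Scored& b) { return a.output > b.output; });

    double nSig = 0.0, nBkg = 0.0;
    for (const Scored& e : events) {
        if (e.isSignal) nSig += 1.0; else nBkg += 1.0;
    }

    std::vector<RocPoint> curve;
    curve.push_back({0.0, 0.0});  // threshold above every output: nothing labeled signal
    if (nSig == 0.0 || nBkg == 0.0) return curve;

    double truePos = 0.0, falsePos = 0.0;
    for (std::size_t i = 0; i < events.size(); ++i) {
        if (events[i].isSignal) truePos += 1.0; else falsePos += 1.0;
        // Emit a point only after all events sharing this output are counted.
        if (i + 1 == events.size() || events[i + 1].output != events[i].output) {
            curve.push_back({falsePos / nBkg, truePos / nSig});
        }
    }
    return curve;
}

// Trapezoidal integral of TPR over FPR (the area under the ROC curve).
double rocIntegral(const std::vector<RocPoint>& curve) {
    double area = 0.0;
    for (std::size_t i = 1; i < curve.size(); ++i) {
        area += 0.5 * (curve[i].tpr + curve[i - 1].tpr)
                    * (curve[i].fpr - curve[i - 1].fpr);
    }
    return area;
}
```

A perfect classifier (all signal at 1, all background at 0) gives the step-shaped curve and an integral of 1, while a classifier that assigns the same output to every event gives the diagonal and an integral of 0.5.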

Lastly, we need to close the Factory's output file.

In [11]:
outputFile->Close();

### 2.1.3 Application Phase

Now we need to apply this to our "Background and Signal" `TTree`. We will do this by reading in the weights stored in the `dataset/weights/` directory. First we need to make a `Reader` object.

In [3]:
TMVA::Reader *reader = new TMVA::Reader();

Then we add the same variables that we used before:

In [4]:
Float_t mass, eta, phi;

reader->AddVariable("mass", &mass);
reader->AddVariable("eta", &eta);
reader->AddVariable("phi", &phi);

Now we need to book the MVA methods by parsing the XML weight files.

In [5]:
reader->BookMVA("BDT method", "./dataset/weights/BDTSignalBkgClassification_BDT.weights.xml");
reader->BookMVA("MLP method", "./dataset/weights/BDTSignalBkgClassification_MLP.weights.xml");

                         : Booking "BDT method" of type "BDT" from ./dataset/weights/BDTSignalBkgClassification_BDT.weights.xml.
                         : Reading weight file: ./dataset/weights/BDTSignalBkgClassification_BDT.weights.xml
<HEADER> DataSetInfo              : [Default] : Added class "Signal"
<HEADER> DataSetInfo              : [Default] : Added class "Background"
                         : Booked classifier "BDT" of type: "BDT"
                         : Booking "MLP method" of type "MLP" from ./dataset/weights/BDTSignalBkgClassification_MLP.weights.xml.
                         : Reading weight file: ./dataset/weights/BDTSignalBkgClassification_MLP.weights.xml
<HEADER> MLP                      : Building Network. 
                         : Initializing weights
                         : Booked classifier "MLP" of type: "MLP"


Now we need to perform the event loop over all events in the test (application) sample. For each event we evaluate both classifiers and store their outputs in histograms.

In [6]:
// Histograms for the classifier outputs
TH1F * h_BDTOutput = new TH1F("BDT_Output", "BDT_Output", 100, -2.0, 2.0);
TH1F * h_MLPOutput = new TH1F("MLP_Output", "MLP_Output", 100, -2.0, 2.0);

// Buffers that the TTree fills on each GetEntry call
Float_t userMass, userEta, userPhi;
bkgSigTree->SetBranchAddress( "mass", &userMass );
bkgSigTree->SetBranchAddress( "eta", &userEta );
bkgSigTree->SetBranchAddress( "phi", &userPhi );

for (Long64_t i = 0; i < bkgSigTree->GetEntries(); ++i) {
    bkgSigTree->GetEntry(i);

    // Copy the event into the variables registered with the Reader
    mass = userMass;
    eta = userEta;
    phi = userPhi;

    // Evaluate both classifiers for this event
    h_BDTOutput->Fill(reader->EvaluateMVA("BDT method"));
    h_MLPOutput->Fill(reader->EvaluateMVA("MLP method"));
}

The mass is: 66.399162
The eta is: -0.347728
The phi is: -0.876314
The BDT output classification is: -1.000000
The MLP output classification is: 0.000001
The mass is: 41.942574
The eta is: -0.199628
The phi is: -1.777052
The BDT output classification is: -1.000000
The MLP output classification is: 0.000001
The mass is: 197.210663
The eta is: 0.971860
The phi is: -0.436946
The BDT output classification is: -0.642880
The MLP output classification is: 0.015159
[...]
The mass is: 192.618393
The eta is: 3.048027
The phi is: 2.283213
The BDT output classification is: 0.839282
The MLP output classification is: 0.999214
The mass is: 156.379425
The eta is: 2.267769
The phi is: -1.002767
The BDT output classification is: 0.468404
The MLP output classification is: 0.987428
The mass is: 197.875244
The eta is: 3.408914
The phi is: -0.182907
The BDT output classification is: 0.931501
The MLP output classification is: 0.999360
[...]

Now let's plot these histograms and see how many signal events we had.

In [10]:
// Canvas to draw on (define it only once per session)
TCanvas c_output("c_output", "Classification Results", 900, 900);
h_BDTOutput->Draw("hist");
c_output.SaveAs("BDTOutput.png");
c_output.Draw();

Info in <TCanvas::Print>: png file BDTOutput.png has been created


In [11]:
h_MLPOutput->Draw("hist");
c_output.Draw();

To determine how many signal events we had, we need to pick a threshold on the classifier output. A common choice is the midpoint of the output range. The chosen threshold is sometimes called a working point. A high working point means the threshold is close to the upper limit of the classifier output (low FPR, but a lower TPR because some signal is rejected as well). A low working point means the threshold is close to the lower limit of the classifier output (high TPR, but a relatively high FPR as well).
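As a rough illustration of how the choice of working point changes the count, the sketch below tallies how many classifier outputs lie above a given threshold. It is plain C++ operating on a vector of outputs rather than on the histograms; `countSignalCandidates` and `midpointThreshold` are hypothetical helper names, and the output range passed to `midpointThreshold` is an assumption you would set for your own classifier.

```cpp
#include <cstddef>
#include <vector>

// Midpoint of an assumed classifier output range [lo, hi].
double midpointThreshold(double lo, double hi) {
    return 0.5 * (lo + hi);
}

// Number of events whose classifier output lies strictly above the working point.
std::size_t countSignalCandidates(const std::vector<double>& outputs,
                                  double workingPoint) {
    std::size_t nSelected = 0;
    for (double value : outputs) {
        if (value > workingPoint) ++nSelected;
    }
    return nSelected;
}
```

Taking six BDT outputs from the event loop as an example (-1.000000, -1.000000, -0.642880, 0.839282, 0.468404, 0.931501) and assuming the BDT output spans [-1, 1], the midpoint working point of 0 selects 3 events. Raising the working point to 0.5 selects only 2, while lowering it to -0.8 selects 4.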

### Exercise

Try modifying the event loop to make separate plots of mass, phi, and eta for the events you believe are signal and for the events you believe are background.