# Measuring Performance

You have learned about the efficiency of the CMS-recommended top tagging algorithm. But how did we decide on those variables and those specific cuts? To compare the performance of different algorithms, we use the "Receiver Operating Characteristic" curve (ROC curve). The ROC curve shows background efficiency as a function of signal efficiency. A better-performing algorithm has a higher signal efficiency for the same background efficiency, or equivalently a lower background efficiency for the same signal efficiency. ROC curves can be plotted in several different ways; some use background rejection (1 - background efficiency) instead of background efficiency.
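
To make the definition concrete, here is a minimal Python 3 sketch that builds ROC points by scanning a one-sided cut over a toy discriminant. The Gaussian toy distributions, the function name `roc_points`, and the threshold grid are illustrative assumptions; this is not CMS data and not the code used later in this exercise.

```python
import random

def roc_points(signal, background, thresholds):
    """Return one (signal_eff, background_eff) pair per cut 'value > threshold'."""
    points = []
    for t in thresholds:
        sig_eff = sum(1 for x in signal if x > t) / len(signal)
        bkg_eff = sum(1 for x in background if x > t) / len(background)
        points.append((sig_eff, bkg_eff))
    return points

# Toy discriminant values (assumption, not CMS data): signal sits higher than background.
rng = random.Random(42)
signal = [rng.gauss(1.0, 1.0) for _ in range(5000)]
background = [rng.gauss(-1.0, 1.0) for _ in range(5000)]

thresholds = [-4.0 + 0.5 * i for i in range(17)]  # -4.0, -3.5, ..., 4.0
roc = roc_points(signal, background, thresholds)

for t, (sig_eff, bkg_eff) in zip(thresholds, roc):
    # Background rejection is simply 1 - background efficiency.
    print(f"cut > {t:+.1f}: signal eff = {sig_eff:.3f}, "
          f"background eff = {bkg_eff:.3f}, rejection = {1 - bkg_eff:.3f}")
```

Each threshold gives one point on the curve. Tightening the cut lowers both efficiencies, and a useful discriminant is one where the background efficiency falls faster than the signal efficiency.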

To study this, you can produce ROOT files for various signal and background samples. Use the following commands to run over a QCD sample and an RS KK gluon sample:

In [None]:
### RUN THIS CELL ONLY IF YOU ARE USING SWAN 
import os

##### REMEMBER TO MANUALLY COPY THE PROXY TO YOUR CERNBOX FOLDER AND TO MODIFY THE NEXT LINE
os.environ['X509_USER_PROXY'] = '/eos/home-X/Y/tmp/x509up_u0000'
# Warn if no proxy file exists at the configured path
if not os.path.isfile(os.environ['X509_USER_PROXY']):
    print("Proxy file not found at os.environ['X509_USER_PROXY']: %s" % os.environ['X509_USER_PROXY'])
os.environ['X509_CERT_DIR'] = '/cvmfs/cms.cern.ch/grid/etc/grid-security/certificates'
os.environ['X509_VOMS_DIR'] = '/cvmfs/cms.cern.ch/grid/etc/grid-security/vomsdir'

In [1]:
%%bash
python $CMSSW_BASE/src/Analysis/JMEDAS/scripts/jmedas_make_histograms.py --files=$CMSSW_BASE/src/Analysis/JMEDAS/data/MiniAODs/RunIIFall17MiniAODv2/QCD_Pt_300to470.txt --outname=$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/qcd.root --maxevents=2000 --maxjets=6
python $CMSSW_BASE/src/Analysis/JMEDAS/scripts/jmedas_make_histograms.py --files=$CMSSW_BASE/src/Analysis/JMEDAS/data/MiniAODs/RunIIFall17MiniAODv2/rsgluon_ttbar_3000GeV.txt --outname=$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/rsgluon_ttbar_3TeV.root --maxevents=2000 --maxjets=6

Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/70000/00389785-4C42-E811-A376-0025905C5438.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/70000/006D085A-C341-E811-A61C-0025904C637E.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/70000/02D51001-5E42-E811-ABAC-002590D9D8AA.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/70000/0418D8CE-5B42-E811-A2AF-0025905C2C86.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/70000/04476833-E341-E81

In this exercise you will make a simple ROC curve and examine the performance of different variables. If you look in one of the ROOT files produced above, you will see a TTree named varTree. This tree contains several variables we can use to measure performance:
```
ak8pt
ak8eta
ak8phi
ak8PUPPIpt
ak8PUPPIeta
ak8PUPPIphi
ak8mass
ak8csv
ak8CHSSDmass
ak8PUPPImass
ak8PUPPISDmass
ak8tau32
ak8tau21
ak8SD_sub0_mass
ak8SD_sub1_mass
ak8SD_sub0_csv
ak8SD_sub1_csv
ak8_N2_beta1
ak8_N2_beta2
ak8_N3_beta1
ak8_N3_beta2
npv
```

First, we will plot a ROC curve for a single variable: the (ungroomed) AK8 jet mass. There is a simple script that scans over this variable's distribution, choosing sets of cuts. For each set of cuts, the signal efficiency and background efficiency are calculated, and the resulting point is added to the ROC curve.
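
As a rough illustration of scanning sets of cuts on one variable, the Python 3 sketch below assumes that a "set of cuts" is a mass window (a lower and an upper bound). That interpretation, the toy mass distributions, and the helper names `window_efficiency`, `scan_mass_windows` and `best_window` are all assumptions made for this example; the provided script may choose its cuts differently.

```python
import random

def window_efficiency(values, low, high):
    """Fraction of values passing low < value < high."""
    return sum(1 for v in values if low < v < high) / len(values)

def scan_mass_windows(signal, background, edges):
    """For every window, record (signal_eff, background_eff, low, high)."""
    results = []
    for i, low in enumerate(edges):
        for high in edges[i + 1:]:
            results.append((window_efficiency(signal, low, high),
                            window_efficiency(background, low, high),
                            low, high))
    return results

def best_window(results, min_signal_eff):
    """Window with the lowest background efficiency among those
    reaching at least min_signal_eff, or None if no window does."""
    passing = [r for r in results if r[0] >= min_signal_eff]
    return min(passing, key=lambda r: r[1]) if passing else None

# Toy jet masses in GeV (assumption, not CMS data):
# signal peaks near the top mass, background falls exponentially.
rng = random.Random(1)
signal_mass = [rng.gauss(173.0, 20.0) for _ in range(2000)]
background_mass = [rng.expovariate(1.0 / 60.0) for _ in range(2000)]

edges = [20.0 * i for i in range(16)]  # 0, 20, ..., 300 GeV
results = scan_mass_windows(signal_mass, background_mass, edges)

for target in (0.3, 0.5, 0.7, 0.9):
    sig_eff, bkg_eff, low, high = best_window(results, target)
    print(f"signal eff >= {target:.1f}: window {low:.0f}-{high:.0f} GeV, "
          f"signal eff = {sig_eff:.3f}, background eff = {bkg_eff:.3f}")
```

Keeping only the best window for each target signal efficiency traces out the lower edge of all scanned points, which is the curve you would compare between variables.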

To run the script, pass the list of variables to use as its argument. For this first run, simply do:

In [2]:
from Analysis.JMEDAS.computeROC import computeROC
c1 = computeROC(["ak8mass"])

You can check whether adding more variables improves performance. When given two variables, the method (TMVA) scans sets of cuts applied simultaneously to both distributions to produce the ROC curve.

In [None]:
c1 = computeROC(["ak8mass","ak8tau32"])

Does adding the additional variable improve the performance? In all cases?
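
One way to quantify the answer is to compare the background efficiency at a fixed signal efficiency. The Python 3 sketch below does this on toy data with a brute-force grid of rectangular cuts. The toy (mass, tau32) distributions, the cut grids, and the helper names are assumptions for illustration only; this is not the TMVA optimisation that `computeROC` uses.

```python
import random

def efficiency(events, mass_cut, tau32_cut):
    """Fraction of (mass, tau32) events with mass > mass_cut and tau32 < tau32_cut."""
    passed = sum(1 for m, t in events if m > mass_cut and t < tau32_cut)
    return passed / len(events)

def min_background_eff(signal, background, mass_cuts, tau32_cuts, target):
    """Lowest background efficiency over all cut pairs with signal eff >= target."""
    best = None
    for mc in mass_cuts:
        for tc in tau32_cuts:
            if efficiency(signal, mc, tc) >= target:
                bkg_eff = efficiency(background, mc, tc)
                if best is None or bkg_eff < best:
                    best = bkg_eff
    return best

# Toy (mass, tau32) pairs (assumption, not CMS data):
# signal has higher mass and lower tau32 than background.
rng = random.Random(7)
signal = [(rng.gauss(170.0, 25.0), rng.gauss(0.5, 0.12)) for _ in range(1000)]
background = [(rng.gauss(90.0, 50.0), rng.gauss(0.75, 0.12)) for _ in range(1000)]

# -inf / +inf mean "no cut", so the mass-only scan is a special case of the 2D scan.
mass_cuts = [float("-inf")] + [20.0 * i for i in range(1, 13)]
tau32_cuts = [0.3 + 0.05 * i for i in range(13)] + [float("inf")]
no_tau32_cut = [float("inf")]

comparison = {}
for target in (0.3, 0.5, 0.7):
    mass_only = min_background_eff(signal, background, mass_cuts, no_tau32_cut, target)
    both = min_background_eff(signal, background, mass_cuts, tau32_cuts, target)
    comparison[target] = (mass_only, both)
    print(f"signal eff >= {target:.1f}: background eff = {mass_only:.4f} (mass only), "
          f"{both:.4f} (mass + tau32)")
```

Because the two-variable grid includes the "no cut" option, it can never do worse than the mass-only scan in this sketch. How much it gains depends on how much independent information the second variable carries, which is what you should check with the real samples.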

What about other variables: does the PUPPI or soft-drop mass give an improvement? You can add more variables to the varTree by modifying jmedas_make_histograms.py, but you will then need to re-make the ROOT files as instructed above.

Try comparing different sets of variables to see how much you can improve the top-tagging performance! 