# The sample guessing game
In this game, we analyze 6 files, `sample[0-5].root`, each of which is a small 500k-event sample of a CMS NanoAOD dataset.  All generator-level branches have been removed.  Your task is to figure out which file came from which dataset.  To make it a bit easier, here are the 6 possible datasets:

   * `DY2JetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8`, producing $Z(\to \ell\ell)+2j$ events via QCD processes;
   * `EWKZ2Jets_ZToLL_M-50_TuneCP5_PSweights_13TeV-madgraph-pythia8`, producing $Z(\to \ell\ell)+2j$ events via EW processes (i.e. primarily vector boson fusion);
   * `GluGluHToWWTo2L2Nu_M125_13TeV_powheg2_JHUGenV714_pythia8`, producing Higgs boson events through gluon fusion, where the Higgs boson decays via $H\to WW^{*}\to 2\ell2\nu$;
   * `GluGluHToZZTo2L2Q_M125_13TeV_powheg2_JHUGenV7011_pythia8`, producing Higgs boson events through gluon fusion, where the Higgs boson decays via $H\to ZZ^{*}\to 2\ell2q$;
   * `TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8`, producing di-top events, where both top quarks must decay leptonically; and
   * `WWTo2L2Nu_NNPDF31_TuneCP5_13TeV-powheg-pythia8`, producing WW diboson events, where both W bosons decay leptonically $W\to \ell\nu$.
   
In all cases, the lepton can be any flavor, $\ell\in\{e,\mu,\tau\}$.  All 6 datasets were produced with the `RunIIAutumn18NanoAODv4-Nano14Dec2018_102X_upgrade2018_realistic_v16-v1` conditions tag.

If you are playing this game as part of the columnar analysis HATS, please use the existing conda environment as set up [here](https://github.com/jpivarski/2019-05-28-lpchats-numpy-uproot-awkward#2019-05-28-lpchats-numpy-uproot-awkward), and additionally run
`pip install fnal-column-analysis-tools`

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import uproot
import uproot_methods
import awkward

from fnal_column_analysis_tools import hist
from fnal_column_analysis_tools.hist import plot

In [None]:
# This downloads 4.3 GB of data. If you prefer to work with remote files, skip this step.
# Make sure to run "kinit username@FNAL.GOV" first.
# Download speed on wifi was about 5 MB/s, and the download took 14 minutes in total.
!mkdir -p data
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample0.root data/
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample1.root data/
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample2.root data/
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample3.root data/
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample4.root data/
!scp cmslpc-sl6.fnal.gov:/eos/uscms/store/user/ncsmith/samplegame/sample5.root data/

In [None]:
# Alternative download method, if you have your grid certificate installed properly
# See https://gist.github.com/nsmith-/0e56f30ff386254b9fcc7164647deba7 for an installation guide
!mkdir -p data
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample0.root data/
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample1.root data/
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample2.root data/
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample3.root data/
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample4.root data/
!xrdcp root://cmseos.fnal.gov//store/user/ncsmith/samplegame/sample5.root data/

In [None]:
import os

if os.path.exists('data/sample0.root'):
    prefix = 'data/'
else:
    prefix = 'root://cmseos.fnal.gov//store/user/ncsmith/samplegame/'

samplefiles = [uproot.open(prefix+"sample%d.root" % i) for i in range(6)]
samples = [f['Events'] for f in samplefiles]

In [None]:
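# Here's a nice helper function to make the jagged tables for objects.
# `prefix` selects the collection (e.g. 'Electron_'); if has_p4 is True, a
# Lorentz-vector column 'p4' is built from the pt, eta, phi and mass branches.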
def nanoArray(tree, prefix, column_names, has_p4=True):
    columns = {}
    if has_p4:
        columns['p4'] = uproot_methods.TLorentzVectorArray.from_ptetaphim(
            tree[prefix+'pt'].array(),
            tree[prefix+'eta'].array(),
            tree[prefix+'phi'].array(),
            tree[prefix+'mass'].array(),
        )
    columns.update({k: tree[prefix+k].array() for k in column_names})
    return awkward.JaggedArray.zip(**columns)
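
To make the `p4` column less of a black box, here is a minimal sketch of the standard collider-coordinate conversion from $(p_T, \eta, \phi, m)$ to Cartesian components for a single particle. The function name `ptetaphim_to_cartesian` is hypothetical and is not part of `uproot_methods`; it only illustrates the kinematics that a $(p_T, \eta, \phi, m)$ four-vector encodes, not how the library implements it.

```python
import math

def ptetaphim_to_cartesian(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (px, py, pz, E) for one particle.

    Hypothetical helper for illustration only; not a uproot_methods function.
    """
    px = pt * math.cos(phi)    # transverse momentum projected on x
    py = pt * math.sin(phi)    # transverse momentum projected on y
    pz = pt * math.sinh(eta)   # longitudinal momentum from pseudorapidity
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, energy

# A massless particle with pt = 40 GeV at eta = 0, phi = 0
px, py, pz, energy = ptetaphim_to_cartesian(40.0, 0.0, 0.0, 0.0)
```

In the notebook itself, these quantities are available directly from the `p4` column (for example `electrons0['p4'].pt`, as used in the histogram cell).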

In [None]:
# Here's a list of all the available branches in each file
samples[0].show()

In [None]:
# I managed to figure things out with these variables
# but one could definitely use different/additional variables
electrons0 = nanoArray(samples[0], 'Electron_', ['charge', ])
muons0 = nanoArray(samples[0], 'Muon_', ['charge', ])
jets0 = nanoArray(samples[0], 'Jet_', ['btagCSVV2', 'muonIdx1', 'electronIdx1'])
met0 = uproot_methods.TVector2Array.from_polar(
    samples[0]['MET_pt'].array().flatten(),
    samples[0]['MET_phi'].array().flatten(),
)

In [None]:
plt.hist(electrons0['p4'].pt.flatten(), bins=np.linspace(0,200,200));
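
A single-lepton $p_T$ spectrum is only a starting point. One possible next variable, offered here as a suggestion and not as the method used to solve the game, is the invariant mass of a lepton pair. The sketch below uses the massless approximation $m^2 = 2\,p_{T,1}\,p_{T,2}\,(\cosh\Delta\eta - \cos\Delta\phi)$ on toy values in plain Python; the function name `dilepton_mass` and the input numbers are assumptions made for illustration, and nothing here reads the sample files.

```python
import math

def dilepton_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two leptons, neglecting the lepton masses.

    Hypothetical helper for illustration; inputs are pt in GeV,
    pseudorapidity, and azimuthal angle in radians.
    """
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

# Toy example: two back-to-back leptons in the transverse plane
mass = dilepton_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
```

Applied to same-flavor, opposite-charge pairs in each file, a distribution of this quantity may help separate samples that contain a resonant lepton pair from those that do not.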