- Statistics in Theory - a lecture by Bob Cousins
- Statistical methods in LHC data analysis - Luca Lista
- ROOT Users Guides
- ROOT Tutorials, 2015, https://twiki.cern.ch/twiki/bin/view/Main/ROOTRioTutorial
- RooFit reference slides - by Wouter Verkerke, one of RooFit authors
- RooFit tutorials - a set of working macros that showcase all major features of RooFit
- HistFactory
- RooStats Manual - concise, contains clear summary of statistics concepts and definitions
- RooStats tutorial - by Kyle Cranmer, one of the RooStats developers
- RooStats tutorials - a set of working macros that showcase all major features of RooStats
You need ROOT (v6.22 or newer) and Python installed for the tutorials here. If you don't have them available, either on your local computer or on a Linux server, follow the instructions here to install the ROOT docker container.
The Jupyter setup is optional; if you are interested, you can find some instructions here to set it up, or use the docker setup in the instructions above.
Hands-on 1: Basic fitting
Fit example with ROOT
, with the following objectives:
- Fit a histogram with a linear chi-square fit, and compare the results with by-hand calculations
- Explore different fitting options
- Compare chi-square and likelihood fits
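The by-hand comparison in the first objective can be sketched in pure Python: a weighted least-squares straight-line fit via the normal equations, which you can then check against ROOT's `hist.Fit("pol1")`. The bin contents below are hypothetical, not the tutorial's actual data.

```python
import math

# Toy "histogram": bin centers x_i and observed counts y_i (hypothetical numbers)
xs = [0.5, 1.5, 2.5, 3.5, 4.5]
ys = [12.0, 18.0, 24.0, 31.0, 36.0]
errs = [math.sqrt(y) for y in ys]  # Poisson errors, as in ROOT's chi-square fit

# Weighted least squares for y = p0 + p1*x via the normal equations
w = [1.0 / e**2 for e in errs]
S   = sum(w)
Sx  = sum(wi * x for wi, x in zip(w, xs))
Sy  = sum(wi * y for wi, y in zip(w, ys))
Sxx = sum(wi * x * x for wi, x in zip(w, xs))
Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
delta = S * Sxx - Sx**2
p1 = (S * Sxy - Sx * Sy) / delta      # slope
p0 = (Sxx * Sy - Sx * Sxy) / delta    # intercept

chi2 = sum(wi * (y - p0 - p1 * x)**2 for wi, x, y in zip(w, xs, ys))
print(f"p0 = {p0:.3f}, p1 = {p1:.3f}, chi2/ndf = {chi2:.2f}/{len(xs) - 2}")
```

Filling a `TH1` with the same contents and calling `Fit("pol1")` should reproduce `p0` and `p1` to numerical precision.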
Hands-on 2: Fitting with RooFit
Fit examples with RooFit
, a composite p.d.f. with signal and background components
pdf = f_bkg * bkg(x,a0,a1) + (1-f_bkg) * (f_sig1 * sig1(x,m,s1) + (1-f_sig1) * sig2(x,m,s2))
with the following objectives:
- Construct a simple fit in RooFit and plot the NLL
- Compare binned and unbinned fit results
- Compare un-extended and extended likelihood fits
Hands-on 3: Advanced fitting with RooFit
A bit advanced fit examples with RooFit
, a composite p.d.f. with signal and background components, extended
pdf = n_bkg * bkg(x,a0,a1) + n_sig * (f_sig1 * sig1(x,m,s1) + (1-f_sig1) * sig2(x,m,s2))
or using a signal strength
pdf = n_bkg * bkg(x,a0,a1) + mu * n_sig * (f_sig1 * sig1(x,m,s1) + (1-f_sig1) * sig2(x,m,s2))
with the following objectives:
- Compare plain likelihood fit and profile likelihood fit
- Fit with nuisance parameters with constraints
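How a constrained nuisance parameter enters the likelihood, and what profiling does to it, can be illustrated with a pure-Python counting-experiment sketch. All numbers below are hypothetical, and the grid scan is only a crude stand-in for MINUIT's minimisation.

```python
import math

def nll(mu, b, n_obs=55, b0=50.0, sigma_b=5.0, s=10.0):
    """Poisson counting NLL (up to a constant) with a Gaussian-constrained
    background nuisance parameter b. All numbers are hypothetical."""
    exp_n = mu * s + b                            # expected event count
    poisson = exp_n - n_obs * math.log(exp_n)     # -log Poisson, dropping constants
    constraint = 0.5 * ((b - b0) / sigma_b) ** 2  # auxiliary-measurement constraint
    return poisson + constraint

def profiled_nll(mu):
    # Profile out b with a grid scan: minimise the NLL over b at fixed mu
    return min(nll(mu, 30.0 + 0.1 * i) for i in range(400))

# Plain likelihood (b fixed at its nominal value) vs profile likelihood
for mu in (0.0, 0.5, 1.0):
    print(f"mu={mu}: plain NLL = {nll(mu, 50.0):.3f}, "
          f"profiled NLL = {profiled_nll(mu):.3f}")
```

By construction the profiled NLL is never larger than the plain NLL at the same mu, which is why profile-likelihood intervals widen to account for the nuisance-parameter uncertainty.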
Fit the Higgs peak in ATLAS H4l open "data": http://opendata.cern.ch/record/3823, MC: gg->H->ZZ->4l
with mH = 125 GeV, from the 2016 ATLAS open data release.
The (Monte Carlo) data is a TTree with lepton four-vector information available. Reconstruct the invariant mass of the four-lepton final state. An example code to process the TTree can be found here.
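The invariant-mass reconstruction amounts to summing the four lepton four-vectors and taking the Minkowski norm. A self-contained sketch (the lepton kinematics below are hypothetical placeholders for values read from the TTree; leptons are treated as massless):

```python
import math

def p4(pt, eta, phi, m=0.0):
    """Four-vector (E, px, py, pz) from pt, eta, phi, mass (GeV)."""
    px, py = pt * math.cos(phi), pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    E = math.sqrt(px**2 + py**2 + pz**2 + m**2)
    return (E, px, py, pz)

def inv_mass(vectors):
    """Invariant mass of the summed four-vectors."""
    E, px, py, pz = (sum(v[i] for v in vectors) for i in range(4))
    return math.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

# Hypothetical lepton kinematics (GeV) for one event; real values come from the TTree
leptons = [p4(50.0, 0.5, 0.1), p4(30.0, -0.2, 2.0),
           p4(25.0, 1.1, -1.5), p4(20.0, 0.3, 3.0)]
m4l = inv_mass(leptons)
print(f"m4l = {m4l:.1f} GeV")
```

In practice you would do the same with ROOT's `TLorentzVector` (or `Math::PtEtaPhiMVector`) while looping over the TTree.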
Tips and requirements:
- Construct an S+B model: S: signal, Gaussian; B: background, polynomial
- Restrict to the mass range of 110 GeV to 160 GeV
- Step 1: Fit the mass peak, compare binned and unbinned fit results, using 20 and 500 events, respectively (four fits in total)
- Step 2: fix the mass peak and fit the signal and background yields
When the Higgs boson was discovered in 2012, there were about 20 events in the H4l channel within the mass range of 110 GeV to 160 GeV, see Fig.2 of the ATLAS Higgs discovery paper. With LHC Run-2 data, there are about 500 events in this mass range, see Fig. 5 of https://arxiv.org/abs/2004.03447.
Hands-on 4: HistFactory example
Create a workspace using HistFactory.
RooWorkspace is a persistable container for RooFit projects. It can contain and own variables, p.d.f.s, functions and datasets. The entire RooWorkspace can be saved into a ROOT TFile and organises the consistent streaming of its contents without duplication.
HistFactory
is a package that creates a RooFit probability density function from ROOT histograms of expected distributions and histograms that represent the +/- 1 sigma variations from systematic effects. The resulting probability density function, or a RooWorkspace
, can then be used with any of the statistical tools provided within RooStats, such as the profile likelihood ratio, Feldman-Cousins, etc.
In this tutorial, the model is basically the same as in hepstat_tutorial_roofit_extended.py, i.e., a composite p.d.f. with signal and background components
pdf = n_bkg * bkg(x,a0,a1) + mu * n_sig * (f_sig1 * sig1(x,m,s1) + (1-f_sig1) * sig2(x,m,s2))
and our goals are the following:
- Create a workspace using Workspace Factory
- Example operations at the workspace level
Hands-on 5: Build a workspace using histograms
In the above example (HistFactory example), a workspace is built using parametrized functions. In practice, non-parametric PDFs, for example from Monte Carlo simulations, are used more often. In this example, we build a workspace using histograms, and we also show how to include systematic uncertainties in the likelihood model. Our objectives are:
- Create a workspace using histograms
- Include systematic uncertainties
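The core of a histogram-based systematic (a HistoSys-style variation) is interpolating, bin by bin, between the nominal template and the +/- 1 sigma templates as a function of a nuisance parameter alpha. A pure-Python sketch of the piecewise-linear interpolation, with hypothetical bin contents:

```python
# Hypothetical bin contents for a 4-bin template (nominal and +/- 1 sigma variations)
nominal = [100.0, 80.0, 60.0, 40.0]
up      = [110.0, 85.0, 66.0, 42.0]   # +1 sigma systematic variation
down    = [ 92.0, 76.0, 55.0, 39.0]   # -1 sigma systematic variation

def interpolate(alpha):
    """Piecewise-linear vertical interpolation, per bin, in the nuisance
    parameter alpha (alpha = 0 is nominal, +/-1 reproduces the variations)."""
    out = []
    for n, u, d in zip(nominal, up, down):
        if alpha >= 0:
            out.append(n + alpha * (u - n))
        else:
            out.append(n + alpha * (n - d))
    return out

print("alpha=+1:", interpolate(1.0))   # recovers the up template
print("alpha=-1:", interpolate(-1.0))  # recovers the down template
print("alpha=0.5:", interpolate(0.5))
```

In the likelihood, alpha is then constrained by a unit Gaussian, so pulling the template away from nominal costs NLL. HistFactory offers several interpolation schemes; this linear one is the simplest.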
Hands-on 6: Null hypothesis significance test
RooStats
example: compute the p0 and significance (Hypothesis Test)
The signal is a simple Gaussian and the background is a smoothly falling spectrum. To estimate the significance,
we need to perform a hypothesis test. We want to disprove the null model, i.e. the background-only model, against the alternate model,
the background plus signal. In RooStats, we do this by defining two ModelConfig objects, one for the null model
(the background-only model in this case) and one for the alternate model (signal plus background).
Objectives of this tutorial are the following:
- Compute the null hypo significance using the Asymptotic calculator
- Compute the significance by hand using the asymptotic formula
- Compute the significance using frequentist method
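The "by hand" computation in the second objective reduces, for a single-bin counting experiment, to the standard asymptotic formula Z = sqrt(2*(n*ln(n/b) - (n - b))) from Cowan, Cranmer, Gross and Vitells (arXiv:1007.1727). A sketch with hypothetical counts:

```python
import math

def z_asymptotic(n_obs, b):
    """Asymptotic discovery significance for a counting experiment,
    Z = sqrt(2*(n*ln(n/b) - (n - b))), valid for n_obs > b."""
    return math.sqrt(2.0 * (n_obs * math.log(n_obs / b) - (n_obs - b)))

# Hypothetical counts: 25 observed over an expected background of 10
print(f"Z = {z_asymptotic(25, 10.0):.2f} sigma")

# Naive s/sqrt(b) estimate for comparison (overestimates Z at small b)
print(f"s/sqrt(b) = {(25 - 10) / math.sqrt(10):.2f}")
```

Comparing this against the AsymptoticCalculator output is a useful cross-check that the RooStats machinery is configured as intended.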
Hands-on 7: CLs upper limits
Use the StandardHypoTestInvDemo tutorial macro to perform an inverted hypothesis test for computing an interval (one-sided upper limits). This macro will perform a scan of the p-values for computing the upper limit. Both asymptotic and frequentist methods will be shown.
Objectives of this tutorial are the following:
- Create the HypoTestInverter class and configure it
- Compute the CLs upper limits using the asymptotic formula
- Compute the CLs upper limits using the frequentist method (time consuming)
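What the HypoTestInverter scan actually evaluates at each point is the CLs ratio, CLs = CLs+b / CLb = p_s+b / (1 - p_b); the upper limit is the mu value where CLs crosses 0.05. A pure-Python sketch with hypothetical p-values (a real scan gets these from the asymptotic formulae or toys):

```python
def cls(p_sb, p_b):
    """CLs = p_{s+b} / (1 - p_b); exclude at 95% CL when CLs < 0.05."""
    return p_sb / (1.0 - p_b)

# Hypothetical p-values from a scan over the signal strength mu
scan = {0.5: (0.20, 0.50), 1.0: (0.08, 0.50), 1.5: (0.03, 0.50), 2.0: (0.01, 0.50)}
for mu, (p_sb, p_b) in scan.items():
    tag = "excluded" if cls(p_sb, p_b) < 0.05 else "allowed"
    print(f"mu={mu}: CLs={cls(p_sb, p_b):.3f} ({tag})")
```

Dividing by (1 - p_b) is the point of CLs: when the experiment has no sensitivity (p_s+b close to 1 - p_b), the ratio stays near 1 and prevents excluding signals one cannot actually probe.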
Use the workspaces created from Build a workspace using histograms:
- Plot the p0 scan as a function of the signal mass
- Plot the CLs upper limits as a function of the signal mass
Consider a typical physics problem at the LHC: searching for a heavy resonance decaying to a pair of top quarks, with both top quarks decaying semileptonically, i.e., pp > Z' > tt, t > Wb, W > lv.
The signal is expected to appear as a resonance in the invariant mass of the top-quark pair. The dominant background is SM tt
production, which is expected to fall smoothly in the high-mass tail. This would be an easy task if one could reconstruct the invariant mass of the top-quark pair. But how? The final state is llbb + missing energy
due to the neutrinos. It is not possible to fully reconstruct the invariant mass of the top-quark pair (mtt). Moreover, without such a powerful discriminant as mtt,
it can be challenging to separate the signal from the dominant background.
This tutorial aims to solve two problems:
- Use machine-learning-based regression to reconstruct, or infer, the mass mtt, based on the kinematic information of the experimental observables.
- Use machine-learning-based classification to separate the signal from the background.
The following experimental observables are considered:
- 4-momenta (pT, η, ϕ, m) of the 2 leptons (4+4)
- 4-momenta (pT, η, ϕ, m) of the jets, at least 2 and up to 3 jets (4+4+4)
- missing ET and the phi angle (2)
Before you start this tutorial, follow the instructions here to set up the PyTorch docker environment.
This ANN-based machine learning tutorial aims to solve the same problem as for the TMVA tutorial above. It would be interesting to compare the performance of BDT vs DNN.
- Regression (the first part)
- Classification (the second part)