<hr style="height: 1px;">
<i>This notebook was authored by the 8.S50x Course Team, Copyright 2022 MIT All Rights Reserved.</i>
<hr style="height: 1px;">
<br>

<h1>Project 2 - Part I: Measuring Properties of the W Boson from LHC Data</h1>


<a name='section_2_0'></a>
<hr style="height: 1px;">


## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.0 Overview</h2>


<h3>Navigation</h3>

<table style="width:100%">
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_2_1">PROJ2.1 Introduction and Data Exploration</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#problems_2_1">PROJ2.1 Checkpoints</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_2_2">PROJ2.2 Event Selection and Background Mitigation</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#problems_2_2">PROJ2.2 Checkpoints</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_2_3">PROJ2.3 Beginning to Look for the W Signal in the Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#problems_2_3">PROJ2.3 Checkpoints</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_2_4">PROJ2.4 Refining our Selection to Look for the W Signal in the Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#problems_2_4">PROJ2.4 Checkpoints</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_2_5">PROJ2.5 Fit for W Peak</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#problems_2_5">PROJ2.5 Checkpoints</a></td>
    </tr>
</table>

<h3>Learning Objectives</h3>

In this lab we will investigate W bosons produced in the LHC's 8 TeV proton-proton collisions. These samples were produced some years ago, in a fun experiment that opened up the option of performing low-mass resonance searches at the LHC. The studies done then have led to a wealth of results from the two general-purpose LHC experiments, ATLAS and CMS.

Specifically, in this part of the Project, we will explore the following objectives:

- Downloading and understanding the data
- Learning about event selection and background mitigation
- Examining simulated data
- Fitting the W peak to find the W boson mass



<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L09/slides_L09_09.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: PROJ2.0-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L09/slides_L09_09.html', width=970, height=550)

<h3>Data</h3>

>description: Boosted Single Jet dataset at 8TeV<br>
>source: https://zenodo.org/record/8035318 <br>
>attribution: Philip Harris (CMS Collaboration), DOI:10.5281/zenodo.8035318 

In [None]:
#>>>RUN: PROJ2.0-runcell00

# NOTE: these files are too large to include in the original repository,
# so you must download them using the options below
#
# Ways to download:
#     1. Copy/paste the link (replace =0 with =1 to download automatically)
#     2. Use the wget commands below (works in Colab, but you may need to install wget if using locally)
#
# Location of files:
#     Move the files to the directory 'data'
#
# Using wget: (works in Colab)
#     Upon downloading, the code below will move them to the appropriate directory

#3GB Data Set: data1
!wget https://www.dropbox.com/s/bcyab2lljie72aj/data.tgz?dl=0
!mv data.tgz?dl=0 data.tgz #rename
!tar -xvf data.tgz #extract the data
!rm data.tgz #clean the downloaded file

#130MB Data Set: data2
!wget https://www.dropbox.com/s/p756oa4mfw17lfw/data.zip?dl=0
!mv data.zip?dl=0 data.zip #rename
!unzip data.zip #extract the data
!rm data.zip #clean the downloaded file

<h3>Importing Libraries</h3>

Before beginning, run the cell below to import the relevant libraries for this notebook.

In [None]:
#>>>RUN: PROJ2.0-runcell01

# pre-requisites: install now if you have not already done so
# uproot High energy physics python file format: https://masonproffitt.github.io/uproot-tutorial/aio.html
!pip install uproot
!pip install lmfit
!pip install mplhep

In [None]:
#>>>RUN: PROJ2.0-runcell02

import uproot
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import os,sys

#!pip install lmfit #install lmfit if you have not done this already
import lmfit as lm

#!pip install mplhep #install mplhep if you have not done this already
# plotting style for High Energy physics 
import mplhep as hep
plt.style.use(hep.style.CMS)

<h3>Setting Default Figure Parameters</h3>

The following code cell sets default values for figure parameters.

In [None]:
#>>>RUN: PROJ2.0-runcell03

#set plot resolution
%config InlineBackend.figure_format = 'retina'

#set default figure parameters
plt.rcParams['figure.figsize'] = (6,6)

medium_size = 12
large_size = 15

plt.rc('font', size=medium_size)          # default text sizes
plt.rc('xtick', labelsize=medium_size)    # xtick labels
plt.rc('ytick', labelsize=medium_size)    # ytick labels
plt.rc('legend', fontsize=medium_size)    # legend
plt.rc('axes', titlesize=large_size)      # axes title
plt.rc('axes', labelsize=large_size)      # x and y labels
plt.rc('figure', titlesize=large_size)    # figure title


<a name='section_2_1'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.1 Introduction and Data Exploration</h2>    

| [Top](#section_2_0) | [Previous Section](#section_2_0) | [Checkpoints](#problems_2_1) | [Next Section](#section_2_2) |


<h3>W to QQ</h3>

Let's first consider the process that we would like to look for: the production of W bosons in proton collisions. Here is a Feynman diagram of the process.

<!--<img src="images/Wqq.png" width="300"/>-->
<p align="center">
<img alt="Wqq" src="https://raw.githubusercontent.com/mitx-8s50/images/main/PROJ2/Wqq.png" width="300"/>
</p>

>source: https://arxiv.org/abs/1807.07454<br>
>attribution: arXiv:1807.07454, Marat Freytsis, Philip Harris, Andreas Hinzmann, Ian Moult, Nhan Tran, Caterina Vernieri

The left part of the diagram represents the production of the W boson via some initial quark interaction (quarks and anti-quarks are present when two protons collide). At the right you have a gluon (bottom) that is produced in association with the W boson (top). At the top right, the W boson is decaying. It can decay to many things. The full list of W boson decays is <a href="http://pdg.lbl.gov/2012/listings/rpp2012-list-w-boson.pdf" target="_blank">here</a>, in the W branching ratios section. The quark label generically means that the W boson decays to two quarks. In the reference document this is equivalent to a decay to hadrons. 

Both quarks and gluons produce showers of particles that we refer to as *jets*. A jet is a collection of particles coming from an original quark or gluon. I will not explain the details of a jet; instead, I will point you to summer school lectures I gave on this <a href="https://indico.fnal.gov/event/11505/session/30/?slotId=0#20160820" target="_blank">here</a> (start from slide 36). 

<h3>Lorentz boost</h3>
    
This problem becomes interesting when the W boson has a high energy, or in other words when it is boosted. In this case, the decay products of the W are restricted to within a cone. A simple special relativity calculation <font color='red'> that you should do </font> shows that the maximum angular separation $\Delta \theta$ between the two decay products can be described by:

$$
\begin{eqnarray}
\Delta \theta & < & \frac{2m}{p}.
\end{eqnarray}
$$

where $m$ is the mass and $p$ is the momentum of the decaying resonance, in this case the W boson. Thus, by taking $p$ to be sufficiently high, the angle $\Delta \theta$ becomes small enough that we can resolve the two quark decays as one single jet cone instead of two separate jets. We will still take a large cone with $\Delta \theta_{max} = 0.8$. 
This means that our final state will look like this in the detector:

<!--<img src="images/wjet.png" width="300"/>-->
<p align="center">
<img alt="W jet" src="https://raw.githubusercontent.com/mitx-8s50/images/main/PROJ2/wjet.png" width="300"/>
</p>
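
To get a feeling for the numbers, we can evaluate the bound above directly. This is only a minimal sketch and not part of the analysis code: the function names `max_opening_angle` and `min_momentum_for_cone` are hypothetical, and we assume an approximate W mass of 80.4 GeV.

```python
W_MASS_GEV = 80.4  # approximate W boson mass in GeV (assumed input)

def max_opening_angle(mass, momentum):
    """Evaluate the bound 2m/p on the angular separation (in radians)."""
    return 2.0 * mass / momentum

def min_momentum_for_cone(mass, cone_size):
    """Momentum above which the bound 2m/p drops below the cone size."""
    return 2.0 * mass / cone_size

# Bound for a W boson with p = 400 GeV
print(max_opening_angle(W_MASS_GEV, 400.0))
# Momentum at which the bound equals the 0.8 cone used in the text
print(min_momentum_for_cone(W_MASS_GEV, 0.8))
```

For a W boson with $p = 400$ GeV the bound is about 0.4, well inside the 0.8 cone, and the bound drops below 0.8 once $p$ exceeds roughly 200 GeV.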

**Note:** At the LHC, in place of momentum we often use a variable called transverse momentum, $\vec{p}_{T}$. This is the projection of $\vec{p}$ onto the plane perpendicular to the beam axis. This plane is particularly well understood since, by transverse momentum conservation, every collision must satisfy $\sum_{i} \vec{p}_{T}^{\,i}=0$, where the sum runs over all particles produced in the collision. For this analysis, you can effectively interchange $p$ and $p_{T}$.

An elementary introduction to the detector geometry can be found <a href="https://www.lhc-closer.es/taking_a_closer_look_at_lhc/0.momentum" target="_blank">here</a>.

<h3>Backgrounds</h3>

Once we have a cone sufficient to detect both decay products we need to find a jet with two quarks. To find a jet with two quarks we need to remove our background processes, which are events that "look like" the interaction we want to find, from our events. Our main background process consists of the diagrams below:

<!--<img src="images/dijet.png" width="600"/>-->
<p align="center">
<img alt="Dijet background" src="https://raw.githubusercontent.com/mitx-8s50/images/main/PROJ2/dijet.png" width="600"/>
</p>


where quarks and gluons are produced by the strong force and manifest in the detector as jets. We call this background *multijet* or *QCD background*, where QCD stands for Quantum Chromodynamics. Other sub-dominant background processes are the production of top quark pairs $t\bar{t}$, referred to as the *top quark background*, or the production of a pair of W or Z bosons, which we refer to as the *diboson background*.

<h3>Other processes</h3>

There are additional processes that produce a jet and a gluon in the final state, which we might also be interested in looking for. These are the production of Z bosons and Higgs bosons H. They are also resonances that may decay into a pair of quarks and can get reconstructed in one single jet cone - *but what is interesting is that they most often decay into a pair of b-quarks*. 

Besides its decay, one can use the mass of the resonance to distinguish between these signatures: the Z boson has a mass of $\sim 90~$GeV, the Higgs boson has a mass of $\sim 125~$GeV, while the W boson has a mass of $\sim 80.4~$GeV.

As a summary, our main background consists of either a quark or a gluon and our signal is a boson with 2 quarks inside. **So the challenge is to *construct an identification algorithm of a jet that looks like it originated from two quarks*.**

<h3> Loading data & Auxiliary functions</h3>

Before we start, let's define collider coordinates centered around the collision point. We tend to write our momentum 4-vector as $\vec{p}=(p_{T},\eta,\phi,m)$ in place of $\vec{p}=(p,\theta,\phi,m)$. You can read more <a href="https://www.lhc-closer.es/taking_a_closer_look_at_lhc/0.momentum" target="_blank">here</a>.
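
As a small illustration of these coordinates, here is a sketch of how $(p_{T},\eta,\phi,m)$ maps back onto Cartesian components. It assumes the standard definition of pseudorapidity, $\eta = -\ln\tan(\theta/2)$, with $\theta$ measured from the beam axis; the function names are hypothetical and are not used elsewhere in this notebook.

```python
import math

def pseudorapidity(theta):
    """Pseudorapidity from the polar angle theta (radians from the beam axis)."""
    return -math.log(math.tan(theta / 2.0))

def to_cartesian(pt, eta, phi, mass):
    """Convert (pT, eta, phi, m) to (E, px, py, pz), with the beam along z."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz

# A particle emitted at 90 degrees to the beam has eta close to 0
print(pseudorapidity(math.pi / 2.0))
# Toy 4-vector: pT = 3, eta = 0, phi = 0, m = 4 (arbitrary units)
print(to_cartesian(3.0, 0.0, 0.0, 4.0))
```

Note that $p_{z} = p_{T}\sinh\eta$, so a particle with $\eta = 0$ moves entirely in the transverse plane.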

Now is a good time to look at the data. Let's take a look at the different samples we have. You should have run the first cell in Section 0 to download the data, but you can also download from these links:
- <a href="https://www.dropbox.com/s/bcyab2lljie72aj/data.tgz?dl=0" target="_blank">3 GB file</a>
- <a href="https://www.dropbox.com/s/p756oa4mfw17lfw/data.zip?dl=0" target="_blank">130 MB file</a>

These files should be in a directory called `data`.


Here are the different datasets:

* **Data**: *data/JetHT_s.root*. The 8 TeV JetHT dataset. This means that the data passed an online selection (trigger) that required the event to have jets. More on triggers below.

* **W(qq) simulation**: Here we have different options for a simulated qq=>W=>qq dataset
    * *data/WQQ_s.root*: 8 TeV collision energy (low number of events)
    * *data/skimh/WQQ_sh.root*: 13 TeV collision energy (high number of events but different collision energy). Can use these to train NNs and make nice plots.
    * *data/WQQ_8TeV_Jan11_r.root*: 8 TeV collision energy (newer dataset with not so high number of events)

* **Z(qq) simulation**: Again, we have different options for a simulated qq=>Z=>qq dataset
    * *data/ZQQ_s.root*: 8 TeV collision energy (low number of events)
    * *data/skimh/ZQQ_sh.root*: 13 TeV collision energy (high number of events but higher collision energy). Can use these to train NNs and make nice plots.
    * *data/ZQQ_8TeV_Jan11_r.root*: 8 TeV collision energy (newer dataset with not so high number of events)

* **H(bb) simulation**: *data/ggH.root*: This is a small simulated gg=>H=>bb dataset at 8 TeV collision energy. (we might need this in the future).

* **Multijet production or QCD background simulation**: *data/QCD_s.root*. This is our main background, and also our worst-modeled one. We just call these backgrounds QCD because they are produced via Quantum Chromodynamics.

* **Top quark pair production simulation**: *data/TT.root*. This is a background sample with top quark decays.

* **Diboson simulation**: *data/WW.root,data/WZ.root,data/ZZ.root*. These are three rarer double W, W+Z and Z+Z diboson samples where we have two bosons instead of one.
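
Before opening the files, it can be handy to check that the download worked. The snippet below is a minimal sketch under the assumption that the files sit in a directory called `data`, as described above; the helper `missing_files` and the list `EXPECTED_FILES` are hypothetical names, and the newer W/Z samples are left out of the list, so add them if you want them checked too.

```python
import os

# File names taken from the dataset list above (relative to the data directory)
EXPECTED_FILES = [
    "JetHT_s.root", "WQQ_s.root", "ZQQ_s.root",
    "skimh/WQQ_sh.root", "skimh/ZQQ_sh.root",
    "QCD_s.root", "ggH.root", "TT.root",
    "WW.root", "WZ.root", "ZZ.root",
]

def missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected files that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = missing_files()
print("All files found" if not missing else "Missing: %s" % missing)
```

If anything is reported as missing, re-run the download cell in the Overview section or fetch the files from the links above.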

In [None]:
#>>>RUN: PROJ2.1-runcell01

# Now let's open the data. 

# DATA
#-------------------------------------------------------------------------------
# Our data sample is the JetHT dataset. 
# This means the events passed at least one trigger that requires jets (discussed below).
data   = uproot.open("data/JetHT_s.root")["Tree"]


# SIMULATION
#-------------------------------------------------------------------------------
# In addition to the above, we have Monte Carlo simulation of many processes.
# Some of these processes are well modelled in simulation and some of them are not.

# the qq=>W=>qq and qq=>Z=>qq processes at 8TeV collision energy
wqq    = uproot.open("data/WQQ_s.root")["Tree"] 
zqq    = uproot.open("data/ZQQ_s.root")["Tree"]

# To train NNs and make nice plots we will use larger samples produced at a different collision energy
# qq=>W=>qq and qq=>Z=>qq at 13TeV collision energy
wqq13  = uproot.open("data/skimh/WQQ_sh.root")["Tree"]
zqq13  = uproot.open("data/skimh/ZQQ_sh.root")["Tree"]

# qq=>W=>qq and qq=>Z=>qq at 8TeV collision energy
wqq_n  = uproot.open("data/WQQ_8TeV_Jan11_r.root")["Tree"]
zqq_n  = uproot.open("data/ZQQ_8TeV_Jan11_r.root")["Tree"]

# Now we have our main background, which is also our worst-modeled one.
# This is our di-jet quark and gluon background. 
# We just call these backgrounds QCD because they are produced with Quantum Chromodynamics.
qcd    = uproot.open("data/QCD_s.root")["Tree"]

# Now we have the Higgs boson sample (we might need this in the future)
ggh    = uproot.open("data/ggH.root")["Tree"]

# And top-quark pair production background. 
tt     = uproot.open("data/TT.root")["Tree"]

# Finally we have the rarer double W, W+Z and Z+Z diboson samples where we have two bosons instead of one
ww     = uproot.open("data/WW.root")["Tree"]
wz     = uproot.open("data/WZ.root")["Tree"]
zz     = uproot.open("data/ZZ.root")["Tree"]

dataDict = {'qcd': qcd,
            'tt': tt,
            'data': data,
            'wqq': wqq,
            'zqq': zqq,
            'wqq13': wqq13,
            'zqq13': zqq13,
            'wqq_n': wqq_n,
            'zqq_n': zqq_n,
            'ww': ww,
            'zz': zz,
            'wz': wz,
            'ggh': ggh
            }
from collections import OrderedDict 

order_of_keys = ['data','qcd','tt','ww','zz','wz','wqq','wqq13','wqq_n','zqq','zqq13','zqq_n','ggh']
list_of_tuples = [(key, dataDict[key]) for key in order_of_keys]
OrdDataDict = OrderedDict(list_of_tuples)

<h3>Exploring data</h3>

Now, let's explore the data. There are a lot of different variables, but most of them will not be used for this study. 
However, you should feel free to explore the different variables. For completeness, here is a table of all of the different variables: 

| variable | description | 
|------|------|
| run      | LHC run  period           |
| lumi     | LHC run period sub section | 
| event    | LHC collision id |
| trigger  | Bitmask of triggers that have been passed |  
| hltmatch | ??? (unused I think) | 
| puweight | Weight to match the beam intensity (so-called pileup) |
| npu      | For simulation the number of simulated pileup collisions | 
| npuPlusOne      | "" |
| npuMinusOne     | "" |
| nvtx            | Number of reconstructed vertices (a proxy for the total number of collisions) | 
| metFiltersWord  | Bitmask of whether event had anomalous detector features | 
| scale1fb        | The expected number of events per 1/fb of data | 
| rho             | energy density | 
| metRaw          | Raw Missing Transverse Energy (this is a proxy of the direction of invisible particles in the transverse plane) | 
| metRawPhi       | Raw Missing Transverse Energy direction in transverse plane | 
| met             | Corrected metRaw | 
| metphi          | Corrected metRawPhi | 
| tkmet           | charged metRaw | 
| tkmetphi        | charged metRawPhi | 
| mvamet          | ML  corrected metRaw |  
| mvametphi       | ML corrected metRawPhi | 
| puppet          | PUPPI corrected metRaw |
| puppetphi       | PUPPI corrected metRawPhi | 
| mt              | relativistic mass of (met+leading jet) in the transverse plane | 
| rawmt           | relativistic mass of (metRaw+leading jet) in the transverse plane | 
| tkmt            | relativistic mass of (tkmet+leading jet) in the transverse plane | 
| mvamt           | relativistic mass of (mvamet+leading jet) in the transverse plane | 
| puppetmt        | relativistic mass of (puppet+leading jet) in the transverse plane | 
| metSig          | probabilistic measure of how far the missing transverse energy is from 0, in sigma | 
| mvaMetSig       | probabilistic measure of how far mvamet is from 0, in sigma | 
| njets           | Number of jets with pt > 30 GeV | 
| nbtags          | Number of b-jets with pt > 30 GeV | 
| nfwd            | Number of jets with pt > 30 GeV and abs(eta) > 2.5 | 
| mindphi         | minimum direction in transverse plane of all jets and met | 
| j0_pt           | leading small jet pt | 
| j0_eta          | leading small jet $\eta$ | 
| j0_phi          | leading small jet $\phi$ | 
| j1_pt           | sub leading small jet pt | 
| j1_eta          | sub leading small jet $\eta$ | 
| j1_phi          | sub leading small jet $\phi$ | 
| j2_pt           | third highest leading small jet pt | 
| j2_eta          | third highest leading small jet $\eta$ | 
| j2_phi          | third leading small jet $\phi$ | 
| j0_mass         | leading jet mass | 
| j0_csv          | leading jet ML b-quark likelihood discriminator |
| j0_qgid         | leading jet quark vs gluon discriminator | 
| j0_chf          | leading jet charged particle fraction  |
| j0_nhf          | leading jet neutral (non-photon) particle fraction  |
| j0_emf          | leading jet photon particle fraction  |
| j0_dphi         | delta $\phi$ w. respect to the sub-leading jet | 
| j1_mass         | subleading jet mass | 
| j1_csv          | subleading jet ML b-quark likelihood discriminator |
| j1_qgid         | subleading jet quark vs gluon discriminator | 
| j1_chf          | subleading jet charged particle fraction  |
| j1_nhf          | subleading jet neutral (non-photon) particle fraction  |
| j1_emf          | subleading jet photon particle fraction  |
| j1_dphi         | delta $\phi$ w. respect to the leading jet  | 
| j2_mass         | third highest jet mass | 
| j2_csv          | third highest jet ML b-quark likelihood discriminator |
| j2_qgid         | third highest jet quark vs gluon discriminator | 
| j2_chf          | third highest jet charged particle fraction  |
| j2_nhf          | third highest jet neutral (non-photon) particle fraction  |
| j2_emf          | third highest jet photon particle fraction  |
| j2_dphi         | delta $\phi$ w. respect to the closest jet  | 
| dj0_pt          | Lead and sub leading jets combined 4-vector pt  | 
| dj0_mass        | Lead and sub leading jets combined 4-vector mass | 
| dj0_phi         | Lead and sub leading jets combined 4-vector $\phi$ | 
| dj0_y           | Lead and sub leading jets combined 4-vector rapidity | 
| dj0_qgid        | Lead and sub leading jets combined quark-gluon discriminator | 
| dj0_csv         | Lead and sub leading jets combined b-quark discriminator | 
| dj0_jdphi       | Lead and sub leading jets difference in transverse plane | 
| nvjet           | number of fat jets | 
| vjet0_pt        | fat jet pt | 
| vjet0_eta       | fat jet $\eta$ | 
| vjet0_phi       | fat jet $\phi$ | 
| vjet0_mass      | fat jet mass | 
| vjet0_csv       | fat jet b-tag probability | 
| vjet0_flavor    | fat jet flavor id (if simulation) | 
| vjet0_t1        | fat jet $\tau_{1}$ | 
| vjet0_t2        | fat jet $\tau_{2}$ | 
| vjet0_t3        | fat jet $\tau_{3}$ | 
| vjet0_msd0      | fat jet soft drop mass $\beta=0$ |
| vjet0_msd1      | fat jet soft drop mass $\beta=1$ |
| vjet0_mprune    | fat jet pruned mass |
| vjet0_mtrim     | fat jet trimmed mass |
| vjet0_pullAngle | fat jet color flow variable between quarks |
| vjet0_sj1_csv   | fat jet highest momentum subjet b-tag ML discriminator | 
| vjet0_sj2_csv   | fat jet subleading momentum subjet b-tag ML discriminator | 
| vjet0_sj1_qgid  | fat jet highest momentum subjet quark gluon likelihood | 
| vjet0_sj2_qgid  | fat jet subleading momentum subjet quark gluon likelihood | 
| vjet0_sj1_q     | fat jet highest momentum subjet charge  | 
| vjet0_sj2_q     | fat jet subleading momentum subjet charge  | 
| vjet0_sj1_z     | fat jet highest momentum subjet energy relative to fat jet  | 
| vjet0_sj2_z     | fat jet subleading momentum subjet energy relative to fat jet  | 
| vjet0_iso15     | fat jet isolation with 1.5 cone | 
| vjet0_c2b0      | fat jet  $C_{2}^{\beta=0}$ correlation function for two likelihood |  
| vjet0_c2b0P2    | fat jet $C_{2}^{\beta=0.2}$ correlation function for two likelihood |  
| vjet0_c2b0P5    | fat jet $C_{2}^{\beta=0.5}$ correlation function for two likelihood |  
| vjet0_c2b1P0    | fat jet $C_{2}^{\beta=1.0}$ correlation function for two likelihood |  
| vjet0_c2b2P0    | fat jet $C_{2}^{\beta=2.0}$ correlation function for two likelihood |  
| vjet0_qjet      | fat jet quantum jet volatility |  
| vjet0_trig      | fat jet trigger matched | 
| vjet0_genm      | fat jet simulated mass (if matched to a jet) |
| vjet0_genV      | ??? |
| nmuons          | number of muons | 
| mu0_pt          | Leading Muon  pt | 
| mu0_eta         | Leading Muon $\eta$ | 
| mu0_phi         | Leading Muon $\phi$ | 
| dm0_pt          | dimuon combined 4-vector pt | 
| dm0_mass        | dimuon combined 4-vector relativistic mass | 
| dm0_phi         | dimuon combined 4-vector $\phi$ | 
| dm0_y           | dimuon combined 4-vector rapidity | 
| nelectrons      | number of electrons | 
| e0_pt           | leading electron pt | 
| e0_eta          | leading electron $\eta$ |  
| e0_phi          | leading electron $\phi$ |
| ntaus           | number of hadronic $\tau_{h}$ | 
| tau0_pt         | leading $\tau_{h}$ $p_{T}$ |
| tau0_eta        | leading $\tau_{h}$ $\eta$ |
| tau0_phi        | leading $\tau_{h}$ $\phi$ |
| nphotons        | number of additional photons | 
| pho0_pt         | leading photon $p_{T}$ |
| pho0_eta        | leading photon $\eta$ |
| pho0_phi        | leading photon $\phi$ |

Not all of these variables are needed. In this selection we will focus on what we call "fat jets", whose variables are labeled with the prefix `vjet0_`. Fat jets are large-cone jets that are reconstructed with a large radius $\Delta\theta$ to ensure that both quarks are in the cone. To isolate a single collision we also apply the PUPPI algorithm (you can ask me about this). You should focus on the `vjet0_` variables for this project. 
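
Since the fat jet variables share a common prefix, one simple way to pick them out of a long list of branch names is to filter on that prefix. This is just a sketch: the helper `select_branches` is a hypothetical name, and `example_keys` is a short made-up subset of the table above rather than the full list of keys.

```python
def select_branches(branch_names, prefix="vjet0_"):
    """Keep only the branch names that start with the given prefix."""
    return [name for name in branch_names if name.startswith(prefix)]

# Short made-up subset of the variables listed in the table
example_keys = ["run", "lumi", "j0_pt", "vjet0_pt", "vjet0_mass", "vjet0_msd0", "mu0_pt"]
print(select_branches(example_keys))
```

With the trees loaded, the same helper could be applied to the output of `wqq.keys()` to list the fat jet variables of the W sample.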

In [None]:
#>>>RUN: PROJ2.1-runcell02

#these are the datasets that we are working with
#keys: ['data','qcd','tt','ww','zz','wz','wqq','wqq13','wqq_n','zqq','zqq13','zqq_n','ggh']

# You can view all of these variables within each dataset using the `keys` option
print('wqq keys')
print(wqq.keys())
print()
print('data keys')
print(data.keys())
#note: the bdt variables at the end of `data` can be ignored,
#this is an old deep learning training that we will not use for this study


<h3>Weights of the simulated data</h3>

Before we look at the simulated data, we need to understand how we weight our simulation. 

The weights can be written out as 
$$
\begin{eqnarray}
w_{tot} & = & \text{total data} \times \frac{\sigma}{N_{\rm{events}}} \times w_{PU}
\end{eqnarray}
$$

We apply three weights:

- The total data is defined as the total amount of data in our sample. To compute this we quote our data in units of $fb^{-1}$. This is an inverse "femtobarn", where a barn is $10^{-28}m^{2}$, a unit of area (that rumour has it Enrico Fermi claimed was as big as a barn). Note that the total luminosity collected for the 8 TeV data was 18.3 $fb^{-1}$. This translates to 18300 $pb^{-1}$ (inverse picobarns).

- Our next weight is the cross section $\sigma$ divided by the number of generated events. $\sigma$ is the interaction cross section of the process. We save this ratio in units of $fb$ so that it cancels with our total data "luminosity". In our data files this weight is saved as `scale1fb`. *Note: In most of our data files this variable is saved in units of $pb$ so we need to multiply by an extra factor of 1000. This extra factor of 1000 does not apply for the new wqq_n and zqq_n samples.*

- Lastly, we apply a pileup weight, $w_{PU}$, to match the simulated beam intensity to the one in data. Pileup stands for the additional interactions between protons when two proton bunches collide at the LHC. The number of these additional interactions in simulation does not exactly match the data, so the simulated events are reweighted. We account for this with the variable `puweight` saved in our data files.

Therefore we expect a list of weights such as the following:
```weights=[1000*18300,"puweight","scale1fb"]```
where the first element of the list is a *fixed scaling number* and the *last two are variable weights* saved in our files.
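
To make the arithmetic concrete, here is a toy version of the weight calculation. It is only a sketch: the `scale1fb` and `puweight` values are made up for illustration and do not come from the data files, and `event_weights` is a hypothetical helper, not the function used in the rest of this notebook.

```python
import numpy as np

def event_weights(lumi_scale, scale1fb, puweight):
    """Per-event weight: fixed scaling number times the two per-event weights."""
    return lumi_scale * np.asarray(scale1fb) * np.asarray(puweight)

lumi_scale = 1000 * 18300                  # fixed scaling number from the text
scale1fb = np.array([1e-6, 2e-6, 1e-6])    # made-up per-event values
puweight = np.array([1.0, 0.5, 1.2])       # made-up per-event values

w = event_weights(lumi_scale, scale1fb, puweight)
print(w)        # one weight per simulated event
print(w.sum())  # expected yield: the sum of the weights
```

The sum of the weights over the selected events is the number of events we expect from that process in the data.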

In [None]:
#>>>RUN: PROJ2.1-runcell03

# these are the standard weights
weights=[1000*18300,"puweight","scale1fb"]

def get_weights(weights,mask,key):
    # the first element of the list is the scaling weight
    weight = weights[0]
    # this needs to be divided by 1000 if the sample is wqq_n or zqq_n
    if key=='wqq_n' or key=='zqq_n': 
        print('divide weight by 1000.')
        weight /= 1000.
    if key=='ggh': weight /= 1000. #maybe ggh too?
    # now let's loop over the following weights
    for i in range(1,len(weights)):
        weight *= OrdDataDict[key].arrays(weights[i], library="np")[weights[i]][mask]
    return weight

# For our samples with different collision energy (13 TeV) we need to perform a little hack on the cross section weight
# so we normalize them to the number of events of the 8 TeV collision energy samples after a simple mask

#This computes the integral of weighted events assuming a basic mask (see below details of this basic selection)
def integral(iData,iWeights,iKey):
    def selection(iData):
        trigger = (iData.arrays('trigger', library="np")["trigger"].flatten() > 0) # trigger selection
        jetpt   = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() > 400) # require jet pT above certain threshold
        allcuts = np.logical_and.reduce([trigger,jetpt]) # apply both masks at the same time
        return allcuts
    mask_sel=selection(iData)
    # get weights and take the integral and return it
    weight = get_weights(iWeights,mask_sel,iKey)
    return np.sum(weight)

def scale(iData8TeV,iData13TeV,iWeights,iKey8TeV,iKey13TeV):
    int_8TeV  = integral(iData8TeV,iWeights,iKey8TeV)
    int_13TeV = integral(iData13TeV,iWeights,iKey13TeV)
    print("Scale %s:"%iKey13TeV,'ratio: ',int_8TeV/int_13TeV,' 8 TeV integral: ',int_8TeV,' 13 TeV integral: ',int_13TeV)
    return int_8TeV/int_13TeV

# we define this extra scaling number as:
wscale=scale(wqq,wqq13,weights,'wqq','wqq13')
zscale=scale(zqq,zqq13,weights,'zqq','zqq13')

#w_nscale=scale(wqq,wqq_n,[18300,"puweight","scale1fb"],'wqq','wqq_n')
#z_nscale=scale(zqq,zqq_n,[18300,"puweight","scale1fb"],'zqq','zqq_n')

# Note: you could apply this weight function as follows
# qcd: get_weights(weights,qcd_mask,'qcd')
# wqq13: get_weights(weights,w_mask,'wqq13')*wscale

Finally, we will define a quick plotting function, which will be used in the next section. See the comments for how it works, but it should be pretty straightforward. 
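
One detail worth understanding before plotting is how Poisson uncertainties on the data points behave when a histogram is normalized. The sketch below shows the idea on its own, using a hypothetical helper `density_with_errors`: the uncertainty on the raw count in each bin is $\sqrt{n}$, and it is divided by the same factor (total count times bin width) as the count itself. It assumes at least one entry falls inside the histogram range.

```python
import numpy as np

def density_with_errors(values, bins, lo, hi):
    """Normalized histogram with Poisson uncertainties scaled consistently."""
    counts, edges = np.histogram(values, bins=bins, range=(lo, hi))
    widths = np.diff(edges)
    norm = counts.sum() * widths          # normalization applied to each bin
    density = counts / norm
    errors = np.sqrt(counts) / norm       # sqrt(n) scaled by the same factor
    centers = 0.5 * (edges[1:] + edges[:-1])
    return centers, density, errors

# Toy values: two entries in the first bin, one in each of the others
centers, density, errors = density_with_errors([0.5, 0.5, 1.5, 2.5], 3, 0.0, 3.0)
print(centers, density, errors)
```

The returned arrays can be passed directly to `plt.errorbar(centers, density, yerr=errors)`.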

In [None]:
#>>>RUN: PROJ2.1-runcell04

# define some labels and colors
labels = {'qcd': 'QCD',
          'wqq': 'W',
          'zqq': 'Z',
          'wqq13': 'W (13 to 8 TeV)',
          'zqq13': 'Z (13 to 8 TeV)',
          'wqq_n': 'W new',
          'zqq_n': 'Z new',
          'tt': 'tt',
          'ggh': 'H',
          'zz': 'ZZ',
          'ww': 'WW',
          'wz': 'WZ',
          'data': 'Data',
         }
colors = {'qcd': 'orange',
          'wqq': 'royalblue',
          'zqq': 'r',
          'wqq13': 'cornflowerblue',
          'zqq13': 'salmon',
          'wqq_n': 'lightsteelblue',
          'zqq_n': 'lightcoral',
          'tt': 'green',
          'ggh': 'cyan',
          'zz': 'purple',
          'ww': 'brown',
          'wz': 'crimson',
          'data': 'black',
         }

# build a plot to compare/stack histograms
def histErr(iVar,iLabel,iBins,iMin,iMax,iSims,iMasks,iData=None,iMaskData=None,
            iLabels=None,iColors=None,
            iDensity=True,iStack=False,iWeights=None):
    
    fig, ax = plt.subplots(1,1,figsize=(6,6),dpi=80)

    # first plot the simulated data - build arrays
    if isinstance(iSims,dict): # if iSims is a dict
        simhists = [x.arrays(iVar, library="np")[iVar][iMasks[key]] for key,x in iSims.items()] 
    else: # if it's a list
        simhists = [iSims[i].arrays(iVar, library="np")[iVar][iMasks[i]] for i in range(0,len(iSims))]
        
    # define labels
    plot_labels = iLabels
    if iLabels is None:
        plot_labels = [labels[lk] for lk in list(iSims.keys())] #labels
    plot_colors = iColors
    if iColors is None:
        plot_colors = [colors[lk] for lk in list(iSims.keys())] # colors
    
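    # build the histogram weights (one array of weights per simulated sample)
    hist_weights = None
    if iWeights:
        hist_weights = []
        for key in iSims.keys():
            w = get_weights(weights,iMasks[key],key)
            # rescale the 13 TeV samples to the yield of the 8 TeV samples
            if 'wqq13' in key:
                w = w*wscale
            if 'zqq13' in key:
                w = w*zscale
            hist_weights.append(w)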
        
    htype = 'bar'
    if not iStack: htype='step'
        
    _,bins,_ = plt.hist(simhists,
                        color=plot_colors, label=plot_labels, weights=hist_weights,
                        range=(iMin,iMax), bins=iBins, alpha=.6, histtype=htype, 
                        density=iDensity,stacked=iStack)
    
    # now include the data points (if any)
    if iData is not None:
        data = iData.arrays(iVar, library="np")[iVar][iMaskData]
        counts, binEdges = np.histogram(data,bins=iBins,range=(iMin,iMax),density=iDensity)
        yerr = np.sqrt(counts) # let's apply Poisson uncertainties
        if iDensity: yerr /= np.sqrt(sum(iMaskData)*(binEdges[1]-binEdges[0]))
        binCenters = (binEdges[1:]+binEdges[:-1])*.5
        plt.errorbar(binCenters, counts, yerr=yerr,fmt="o",c="k",label="Data", ms=3)
    
    #if iDensity:
    #   plt.ylim(0,0.015)
    
    #plt.legend(prop={'size': 10})
    plt.legend(loc=1)
    plt.xlabel(iLabel)
    if iDensity: plt.ylabel("Normalized Counts") 
    else: plt.ylabel("Counts")
    plt.show()

<a name='problems_2_1'></a>     

| [Top](#section_2_0) | [Restart Section](#section_2_1) | [Next Section](#section_2_2) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.1.1</span>

Let's consider the objectives of this project. What are we doing and why are we looking for jets? Select all options below that are relevant to understanding the objectives of this project:

A) The goal of this project is to find signatures of the Higgs boson.\
B) The goal of this project is to find W and/or Z bosons that decay into quarks.\
C) Quarks leave showers of particles that we reconstruct as jets.\
D) When the momentum of a W or Z boson is high enough, quarks will decay into a single jet cone.\
E) Studying jets allows us to probe the strong interaction and investigate the properties of quarks and gluons.\
F) The study of jets helps us to search for new physics phenomena, such as the production of exotic particles or particles beyond the Standard Model.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.1.2</span>

Do all the data sets have the same keys? For instance, is there a difference between the keys in the `data` file compared to the simulation files? Explore the data (optionally complete the code below to return the difference between lists of keys).

Choose the correct option:

A) The data sets have the same keys.\
B) The `data` files contain the same keys as the simulation files, but also some additional information.\
C) The simulation files contain the same keys as the `data` file, but also some additional information.\
D) The data sets all contain different keys and, therefore, different types of information.

In [None]:
#>>>PROBLEM: PROJ2.1.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

#these are the datasets that we are working with
#keys: ['data','qcd','tt','ww','zz','wz','wqq','wqq13','wqq_n','zqq','zqq13','zqq_n','ggh']

#find difference between lists:
def diff_lists(list1, list2):
    return #YOUR CODE HERE

print(diff_lists(data.keys(),zqq.keys()))

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.1.3</span>

Let's familiarize ourselves with the data a little more and practice extracting some features. Use `np.mean()` to find the average number of jets (`njets`) in the `wqq` dataset. Also, find the average number of b-tags (`nbtags`) detected in the `ggh` dataset.

Report your answer as a list of two numbers with precision 1e-2: `[avg njets in wqq, avg nbtags in ggh]`


In [None]:
#>>>PROBLEM: PROJ2.1.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

print('Avg number of jets in wqq:', #YOUR CODE HERE)
print('Avg number of b-tags in ggh:', #YOUR CODE HERE)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.1.4</span>

Let's look more closely at what the `get_weights` and `integral` functions are doing.

It is important to understand that weights are how we estimate the expected number of events. For the weights that we use in this sample, we have 3 numbers:

- the total luminosity of the data `18300*1000` in units of $\mathrm{fb}^{-1}$
- the weight to adjust for the beam intensity known as the pileup weight, `puweight`
- the production cross section of the sample in units of fb, `scale1fb`

Note that the cross section can differ from sample to sample because of the way the samples were produced. To select events, we apply a mask: a boolean array that implements a cut by keeping only the events for which some variable in the dataset satisfies a given condition.
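As a rough illustration of how the three numbers above combine, here is a minimal sketch using made-up toy arrays (the values of `puweight`, `scale1fb`, and `jet_pt` below are hypothetical and are not read from the real samples). It multiplies the luminosity by the per-event weights of the events that pass a mask and sums the result to get an expected yield.

```python
import numpy as np

# Hypothetical toy inputs, NOT taken from the real samples
lumi = 18300 * 1000                            # luminosity factor used in this notebook
puweight = np.array([0.9, 1.1, 1.0, 1.05])     # toy per-event pileup weights
scale1fb = np.array([2e-6, 2e-6, 2e-6, 2e-6])  # toy per-event cross-section weights
jet_pt = np.array([350.0, 420.0, 510.0, 390.0])

# A mask is just a boolean array: True for the events we keep
mask = jet_pt > 400

# Per-event weight = luminosity * pileup weight * cross-section weight
event_weights = lumi * puweight[mask] * scale1fb[mask]

# The expected number of events is the sum of the weights of the selected events
expected_yield = event_weights.sum()
print("Selected events:", mask.sum(), " expected yield:", expected_yield)
```

The same idea applies to the real samples, where the per-event arrays come from the ROOT trees rather than from hand-written lists.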

Given the information above, show that by using the `get_weights` command, we can get the same value as the `integral` function once we have applied the right mask. Complete the code below, then enter your answer as a list of two numbers with precision 1e-2: `[sum of weights, integral of weighted events]`


In [None]:
#>>>PROBLEM: PROJ2.1.4
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

#build a mask to select everything for a sample
sample='qcd'
test_mask      = (dataDict[sample].arrays('trigger', library="np")["trigger"].flatten() >= 0)
test_jet       = (dataDict[sample].arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() > 400) # require jet pT above certain threshold
test_comb      = np.logical_and.reduce([test_mask,test_jet]) # apply both masks at the same time

#The function get_weights() returns the per-event weights, after the masks are applied
#print(get_weights(weights,test_comb,sample))

print('Sum of weights:', #YOUR CODE HERE)
print('Integrated result:', #YOUR CODE HERE) #hint: use the integral() function

<a name='section_2_2'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.2 Event Selection and Background Mitigation</h2>    

| [Top](#section_2_0) | [Previous Section](#section_2_1) | [Checkpoints](#problems_2_2) | [Next Section](#section_2_3) |


<h3>Event Selection</h3>

Let's talk about how to perform the selection of events. 

The dataflow at the LHC is complicated, but we can simplify it to the following diagram. Here I show it for ATLAS, but for CMS it's basically the same. 

<!--<img src="images/atlas-data-flow.png" width="600"/>-->
<p align="center">
<img alt="atlas-data-flow" src="https://raw.githubusercontent.com/mitx-8s50/images/main/PROJ2/atlas-data-flow.png" width="600"/>
</p>

>source: https://rreece.github.io/research/<br>
>attribution: (c) Ryan Reece

We will work with the right part of the plot (*ntuples*). The samples that we are using are called *ntuples*, since they have *n* variables. 

The top part of this diagram tells us how the data initially comes from the detector. We usually call it "trigger and DAQ" (DAQ = Data Acquisition). The trigger looks at the features of the event **in a fast way** to see if the event is interesting. If the event is interesting, we keep it. If it is not, we throw it away. 

*Triggers* can be quite complicated because they have to process a lot of data really fast. The first layer of the trigger takes in data at a rate of 20 MHz (with 13 TeV collisions this increased to 40 MHz). This translates to about 50 terabytes/s, which is the largest amount of data flowing through any single system. To process the data quickly, we use specialized hardware, *FPGAs* (field-programmable gate arrays), to decide whether an event is interesting. Because we can only take a cursory look at the data, we sometimes make a mistake. This means that the trigger decision can differ from what we would conclude using the final reconstructed parameters. To understand how the trigger works let's plot some data.

In our dataset, we have saved different trigger selections that require at least one *fat jet* with *different energy or transverse momentum*. So let's explore these trigger selections.

In [None]:
#>>>RUN: PROJ2.2-runcell01

# Let's select some data (note that trigger can only be >= 0, so the first mask below keeps every event)

# First let's build masks on our data - these will be boolean arrays
alldata      = (dataDict['data'].arrays('trigger', library="np")["trigger"].flatten() >= -1000000)
triggerdata1 = (dataDict['data'].arrays('trigger', library="np")["trigger"].flatten() % 2 > 0) #let's require the lowest trigger jet pT > 320
triggerdata2 = (dataDict['data'].arrays('trigger', library="np")["trigger"].flatten() % 4 > 1) #let's require one of our standard triggers (jet pT > 370 )

# Now let's make a plot of the fat jet pt  
# normalized
histErr('vjet0_pt','Fat jet $p_T$ [GeV]',50,300,1e3,
        [dataDict['data'],dataDict['data'],dataDict['data']],
        [alldata,triggerdata1,triggerdata2],
        iLabels=['all','$p_T$>320','$p_T$>370'],
        iColors=['black','red','blue'],
        iDensity=True,iStack=False,iWeights=None)

# and without density
histErr('vjet0_pt','Fat jet $p_T$ [GeV]',50,300,1e3,
        [dataDict['data'],dataDict['data'],dataDict['data']],
        [alldata,triggerdata1,triggerdata2],
        iLabels=['all','$p_T$>320','$p_T$>370'],
        iColors=['black','red','blue'],
        iDensity=False,iStack=False,iWeights=None)

#So you can see as you cut tighter, you get far fewer jets, but the data will be cleaner (I suggest triggerdata1)

<h3>Mitigating background: Looking at Jet Substructure</h3>

Now, we want to know how to separate our two-prong signal jets from one-prong background jets. There are some variables in our data that can be used to distinguish these. To keep this simple, we are just going to go over the most basic ones. That way you can get a feel for how to identify two-prong and one-prong jets. Your challenge will be to explore how to do this. 


<h4>Groomed Mass</h4>

Jet grooming is a very powerful tool to improve the resolution of the mass of a jet. The idea is just like how you would groom a bush. The strategy is to take a jet and remove the gluons radiated off of the quarks. This spurious, soft (small energy), and wide-angle radiation can effectively broaden the mass of a jet. 

Grooming is done by iterating down through the jet and removing clusters of quarks and gluons that have low energy and are far away from the central axis of the quark/gluon. Practically speaking, this removes radiation away from the original quark and gluon direction. The details of how this works have deep physical meaning, which I will not go through here. What you should take away is that this is an iterative algorithm that is approximate, not perfect, but helps. 

There are many grooming algorithms. The main ones that we use are trimming, pruning, filtering, and soft drop (with various beta parameters). Typically at the LHC we use soft drop with $\beta=0$. Let's look at how it affects our background (QCD) and our W to quarks signal. 
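To make the soft drop idea a little more concrete, here is a minimal sketch of the condition that soft drop tests at each declustering step: the two branches are kept if $\min(p_{T1},p_{T2})/(p_{T1}+p_{T2}) > z_{\mathrm{cut}}\,(\Delta R_{12}/R_0)^{\beta}$, otherwise the softer branch is dropped and the procedure continues on the harder one. The function name `passes_soft_drop` and the default values `z_cut=0.1` and `r0=0.8` are assumptions made for this illustration; they are not taken from this notebook or from the samples.

```python
def passes_soft_drop(pt1, pt2, delta_r, z_cut=0.1, beta=0.0, r0=0.8):
    """Return True if a two-branch splitting satisfies the soft drop condition.

    pt1, pt2 : transverse momenta of the two branches
    delta_r  : angular separation between the two branches
    z_cut, beta, r0 : soft drop parameters (illustrative defaults, assumed here)
    """
    # Momentum fraction carried by the softer branch
    z = min(pt1, pt2) / (pt1 + pt2)
    # With beta = 0 the angular factor is 1, so the test is simply z > z_cut
    return z > z_cut * (delta_r / r0) ** beta

# A balanced splitting (e.g. the two quarks from a W decay) is kept ...
print(passes_soft_drop(200.0, 180.0, 0.4))
# ... while a very asymmetric splitting (soft radiation) fails and would be groomed away
print(passes_soft_drop(380.0, 10.0, 0.6))
```

With $\beta=0$ the decision depends only on how the momentum is shared, which is why soft, wide-angle radiation is removed regardless of its angle.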

<h3>Basic Selection</h3>

Let's go ahead and define a basic selection of events, and plot the core variables we just discussed. 

In [None]:
#>>>RUN: PROJ2.2-runcell02

# First let's define a quick selection (a simple pT cut of 400 GeV and a 320 GeV trigger)
def selection(iData):
    #lets apply a trigger selection
    trigger = (iData.arrays('trigger', library="np")["trigger"].flatten() > 0)
    #Now lets require the jet pt to be above a threshold
    jetpt   = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() > 400)
    standard_trig = (iData.arrays('trigger', library="np")["trigger"].flatten() % 2 > 0) #lets require one of our standard triggers (jet pT > 320 )
    # standard_trig = (iData.arrays('trigger', library="np")["trigger"].flatten() % 4 > 1) #lets require one of our standard triggers (jet pT > 370 )
    allcuts = np.logical_and.reduce([trigger,jetpt])
    return allcuts

#print(wqq.arrays())
# Let's look at all the data files (except the 8 TeV W and Z samples - let's work with the 13 TeV ones)
myDataDict = OrdDataDict.copy()
del myDataDict['wqq_n']
del myDataDict['zqq_n']
del myDataDict['data']

# Get masks for the selection defined above (both for simulated datasets and data)
masks = {}
for key in myDataDict: masks[key] = selection(myDataDict[key])
maskData = selection(dataDict['data'])

# Now let's plot the mass and the groomed mass (msd0) for the QCD background
fig, ax = plt.subplots(1,1,figsize=(6,6),dpi=80)
plt.title("QCD Background")
plt.hist(qcd.arrays('vjet0_mass', library="np")["vjet0_mass"][masks['qcd']],weights=get_weights(weights,masks['qcd'],'qcd'),
         bins=50,range=(0,300), color='salmon',label="mass", alpha=.6)
plt.hist(qcd.arrays('vjet0_msd0', library="np")["vjet0_msd0"][masks['qcd']], weights=get_weights(weights,masks['qcd'],'qcd'),
         bins=50,range=(0,300), color='red',label="groomed mass", alpha=.6)
plt.legend()
plt.xlabel("QCD Jet mass [GeV]")
plt.ylabel("Counts")
plt.show()

# Let's look at the W/Z samples now (8 TeV collision energy)
fig, ax = plt.subplots(1,1,figsize=(6,6),dpi=80)
plt.title("8 TeV Collision Energy")
plt.hist(myDataDict['wqq'].arrays('vjet0_mass', library="np")["vjet0_mass"][masks['wqq']],weights=get_weights(weights,masks['wqq'],'wqq'),
         bins=50,range=(0,300), color='salmon',label="W mass", alpha=.6)
plt.hist(myDataDict['wqq'].arrays('vjet0_msd0', library="np")["vjet0_msd0"][masks['wqq']], weights=get_weights(weights,masks['wqq'],'wqq'),
         bins=50,range=(0,300), color='red',label="W groomed mass", alpha=.6)
plt.hist(myDataDict['zqq'].arrays('vjet0_mass', library="np")["vjet0_mass"][masks['zqq']],weights=get_weights(weights,masks['zqq'],'zqq'),
         bins=50,range=(0,300), color='pink',label="Z mass", alpha=.6)
plt.hist(myDataDict['zqq'].arrays('vjet0_msd0', library="np")["vjet0_msd0"][masks['zqq']], weights=get_weights(weights,masks['zqq'],'zqq'),
         bins=50,range=(0,300), color='hotpink',label="Z groomed mass", alpha=.6)
plt.legend()
plt.xlabel("Signal Jet mass [GeV]")
plt.ylabel("Counts")
plt.show()

# Let's look at the W/Z samples now (13 TeV collision energy)
# Note that in the weights we need to multiply by wscale (W sample) or zscale (Z sample)
fig, ax = plt.subplots(1,1,figsize=(6,6),dpi=80)
plt.title("13 TeV Collision Energy")
plt.hist(myDataDict['wqq13'].arrays('vjet0_mass', library="np")["vjet0_mass"][masks['wqq13']],weights=get_weights(weights,masks['wqq13'],'wqq13')*wscale,
         bins=50,range=(0,300), color='salmon',label="W mass", alpha=.6)
plt.hist(myDataDict['wqq13'].arrays('vjet0_msd0', library="np")["vjet0_msd0"][masks['wqq13']], weights=get_weights(weights,masks['wqq13'],'wqq13')*wscale,
         bins=50,range=(0,300), color='red',label="W groomed mass", alpha=.6)
plt.hist(myDataDict['zqq13'].arrays('vjet0_mass', library="np")["vjet0_mass"][masks['zqq13']],weights=get_weights(weights,masks['zqq13'],'zqq13')*zscale,
         bins=50,range=(0,300), color='pink',label="Z mass", alpha=.6)
plt.hist(myDataDict['zqq13'].arrays('vjet0_msd0', library="np")["vjet0_msd0"][masks['zqq13']], weights=get_weights(weights,masks['zqq13'],'zqq13')*zscale,
         bins=50,range=(0,300), color='hotpink',label="Z groomed mass", alpha=.6)
plt.legend()
plt.xlabel("Signal Jet mass [GeV]")
plt.ylabel("Counts")
plt.show()

What you observe is that, after grooming, the mass of our QCD background shifts to much lower values, while the mass peak for the W boson becomes narrower and moves closer to the W boson mass (80.4 GeV). This is a great way to reduce the background and improve the sensitivity to the signal. 

<h4>N-Subjettiness</h4>

Now let's look at another class of variables: the N-subjettiness variables. These variables were developed at MIT by Prof. Thaler and a UROP. The original paper is <a href="https://arxiv.org/abs/1011.2268" target="_blank">here</a>. Each of these variables quantifies how likely it is that a jet contains a certain number of prongs, that is, a certain number of sub-jets in the shower. 

We write these variables as $\tau_{i}$, with $\tau_{1}$ describing a one-pronged jet, $\tau_{2}$ a two-pronged jet, and so on. A small $\tau_{N}$ means that the jet is well described by $N$ (or fewer) sub-jets. To test these variables, we use ratios to compare the hypothesis of $N$ prongs against that of $M$ prongs. To look for W and Z bosons, we compare two prongs against one. Hence, we consider the variable $\tau_{2}/\tau_{1}$. 
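For intuition, the definition from the N-subjettiness paper, $\tau_N = \frac{1}{d_0}\sum_k p_{T,k}\,\min(\Delta R_{1,k},\dots,\Delta R_{N,k})$ with $d_0 = \sum_k p_{T,k} R_0$, can be sketched in a few lines. The helper names `delta_r` and `tau_n`, the jet radius `r0=0.8`, and the hand-picked constituents and axes below are all assumptions for this toy example; in the notebook the values `vjet0_t1` and `vjet0_t2` are already stored in the samples.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane, with phi wrapped to [-pi, pi)."""
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

def tau_n(pt, eta, phi, axes, r0=0.8):
    """N-subjettiness for N = len(axes) candidate sub-jet axes given as (eta, phi) pairs."""
    pt, eta, phi = (np.asarray(a, dtype=float) for a in (pt, eta, phi))
    # Distance from every constituent to every axis; keep the closest axis per constituent
    dists = np.stack([delta_r(eta, phi, a_eta, a_phi) for a_eta, a_phi in axes])
    return np.sum(pt * dists.min(axis=0)) / (np.sum(pt) * r0)

# Toy two-prong jet: constituents clustered around eta = +0.2 and eta = -0.2
pt  = [50.0, 50.0, 50.0, 50.0]
eta = [0.21, 0.19, -0.21, -0.19]
phi = [0.0, 0.0, 0.0, 0.0]

tau1 = tau_n(pt, eta, phi, axes=[(0.0, 0.0)])               # one axis at the jet center
tau2 = tau_n(pt, eta, phi, axes=[(0.2, 0.0), (-0.2, 0.0)])  # one axis per prong
print("tau1 =", tau1, " tau2 =", tau2, " tau2/tau1 =", tau2 / tau1)
```

For this toy jet, $\tau_2$ is much smaller than $\tau_1$, which is the behavior expected for a genuinely two-pronged jet.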

Let's now look at how this variable behaves between our signal simulation and our background. 

In [None]:
#>>>RUN: PROJ2.2-runcell03

# Compute the t21 ratio
# let's use the same selection set above
# note that here we are going to use our 13 TeV signal samples
#print(len(masks["qcd"]))

fig, ax = plt.subplots(1,1,figsize=(6,6),dpi=80)
qcdt21 = (qcd.arrays('vjet0_t2', library="np")["vjet0_t2"][masks['qcd']]/
          qcd.arrays('vjet0_t1', library="np")["vjet0_t1"][masks['qcd']])
wt21 = (wqq13.arrays('vjet0_t2', library="np")["vjet0_t2"][masks['wqq13']]/
          wqq13.arrays('vjet0_t1', library="np")["vjet0_t1"][masks['wqq13']])

plt.hist(qcdt21, weights=get_weights(weights,masks['qcd'],'qcd'),
         bins=50, color='red',label="QCD", alpha=.6, density=True)
plt.hist(wt21, weights=get_weights(weights,masks['wqq13'],'wqq13')*wscale,
         bins=50, color='black',label="W", alpha=.6, density=True)
plt.legend()
plt.xlabel(r"$\tau_{21}$")
plt.ylabel("Normalized Counts")
plt.show()

What you can see is that our two-pronged signal tends to have a lower $\tau_{2}/\tau_{1}$ than the QCD background, so at low values the relative contribution of background is small. By requiring $\tau_{2}/\tau_{1} < X$ for a suitable threshold $X$, we can enrich two-pronged signal jets relative to the QCD background. 
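One way to think about the choice of $X$ is in terms of efficiencies: the fraction of signal and of background that survives the cut. The sketch below uses made-up Gaussian toy distributions (not the real samples, and unweighted for simplicity) and a hypothetical helper `cut_efficiency` to show how these fractions change with the threshold.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy tau2/tau1 distributions (hypothetical shapes, NOT the real samples)
sig_t21 = np.clip(rng.normal(0.35, 0.10, 10000), 0.0, 1.0)  # "two-prong" signal
bkg_t21 = np.clip(rng.normal(0.60, 0.12, 10000), 0.0, 1.0)  # "one-prong" background

def cut_efficiency(values, threshold):
    """Fraction of entries that pass the cut values < threshold."""
    return np.mean(values < threshold)

# Scan a few thresholds and compare signal and background efficiencies
for x in (0.30, 0.45, 0.60):
    eff_s = cut_efficiency(sig_t21, x)
    eff_b = cut_efficiency(bkg_t21, x)
    print(f"X = {x:.2f}: signal eff = {eff_s:.2f}, background eff = {eff_b:.2f}")
```

A tighter cut removes more background but also more signal, so the threshold is a trade-off. With the real samples, the per-event weights would need to be included when computing these fractions.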

<a name='problems_2_2'></a>     

| [Top](#section_2_0) | [Restart Section](#section_2_2) | [Next Section](#section_2_3) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.2.1</span>

Let's understand what the trigger is doing in code cell `PROJ2.2-runcell01`. We have defined two critical triggers that we care about. The first is whether an event has a transverse momentum pT > 320 GeV, and the second is whether an event has a transverse momentum of pT > 370 GeV.

To characterize the trigger, the first bit is 1 if pT > 320 GeV and 0 if it's not. The second bit is 1 if pT > 370 GeV and 0 otherwise. We can write the bit value as: `trigger = 2*(pT > 370) + (pT > 320)`. Consider the following possible scenarios:

- if we have an event with pT < 320, the value of trigger is 0
- if we have an event with pT > 320 but less than 370, the value of the trigger is 1
- if we have an event with pT > 370, the value of trigger is 3

These are the only possible values of trigger. So, we can define the criteria for selecting events with pT > 320 GeV as `trigger % 2 > 0` (i.e., trigger mod 2 = 1).
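The encoding described above can be checked with a few toy values. This is a minimal sketch: the helper `encode_trigger` and the three example $p_T$ values are hypothetical and only mimic the formula `trigger = 2*(pT > 370) + (pT > 320)`. It also shows that the modulo test for the first bit is equivalent to a bitwise AND with 1.

```python
import numpy as np

def encode_trigger(pt):
    """Pack the two trigger decisions into one integer, following the formula in the text."""
    pt = np.asarray(pt)
    return 2 * (pt > 370).astype(int) + (pt > 320).astype(int)

# Toy jets: below both thresholds, between them, and above both
toy_pt = np.array([250.0, 340.0, 500.0])
trigger = encode_trigger(toy_pt)
print("trigger values:", trigger)

# Two equivalent ways of reading the first (pT > 320) bit
pass_320_mod = trigger % 2 > 0
pass_320_bit = (trigger & 1) > 0
print("pass pT > 320:", pass_320_mod)
```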

What is the criterion for selecting events with pT > 370? Complete the code below.

In [None]:
#>>>PROBLEM: PROJ2.2.1
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

#this function is defined for you
def pass_320(trigger):
    #return 1 for events with pT > 320
    if trigger % 2 > 0:
        return 1
    else:
        return 0
    
#this function you must complete
def pass_370(trigger):
    #return 1 for events with pT > 370
    if #YOUR CODE HERE:
        return 1
    else:
        return 0

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.2.2</span>

In this project, we are using simulated data at both 8 TeV and 13 TeV. It turns out that our simulated data at 8 TeV provides the most accurate predictions; however, the 13 TeV distributions appear much smoother in the plots because they have more events. Smooth shapes, particularly of invariant quantities like mass, make it easier to plot and interpolate.

How can we effectively use both simulation data sets in our analysis?

A) We can use the shape of the 13 TeV distributions, but scale the normalization to the 8 TeV distributions. This is an approximation, but it gets the best features of both.\
B) We can't use the 13 TeV simulation at all, just the 8 TeV simulation for 8 TeV data.\
C) We can separately analyze the 8 TeV and 13 TeV datasets and compare the obtained results.


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.2.3</span>

Below what value of `t2/t1` are W events dominant? Enter your answer as a number with precision 1e-1.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.2.4</span>

Which of the following statements describes what will happen if we change our cut by increasing the `t2/t1` threshold? Select all that apply.

A) We will get more W events compared to background.\
B) We will be able to better distinguish W signal from background.\
C) We will no longer be able to distinguish W signal from background.\
D) Nothing because the signal is independent of `t2/t1`.


<a name='section_2_3'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.3 Beginning to Look for the W Signal in the Data</h2>    

| [Top](#section_2_0) | [Previous Section](#section_2_2) | [Checkpoints](#problems_2_3) | [Next Section](#section_2_4) |


<h3>Overview</h3>

Now, let's try to find the $W\rightarrow qq$ and $Z\rightarrow qq$ peak in the **data.** This is a difficult problem and you will have to use the above ideas plus a few others. To give you a hint, you should read <a href="https://arxiv.org/abs/1603.00027" target="_blank">this paper</a>. Also, you should consider all of the other physics papers based on this strategy. That includes our <a href="https://arxiv.org/abs/1705.10532" target="_blank">original paper</a> and two follow-up papers <a href="https://arxiv.org/abs/1710.00159" target="_blank">here</a> and  <a href="https://arxiv.org/abs/1909.04114" target="_blank">here</a>. These later papers use more technology developed along the same lines, but the original paper should have all you need to get a resonance. 

To put it all together, we want to make a data vs. simulation plot. For this, we will take all of our simulations and add them together. Let's make a simple plotting example. We will plot this in two ways. First, we will show what the normalized distributions look like, so we can compare the shapes. Second, we will make the stacked histogram plot, so we can see how the data compares to our prediction. 


In [None]:
#>>>RUN: PROJ2.3-runcell01

try:
    del myDataDict['wqq'] #let's omit 8 TeV samples from here
    del myDataDict['zqq']
except KeyError:
    print('samples already deleted')
    
# let's compare shapes
histErr('vjet0_msd0','Fat jet $m_{SD}$ [GeV]',50,40,200,
        myDataDict,masks,
        dataDict['data'],maskData,
        iDensity=True,iStack=False,iWeights=True)

# Let's do a stacked plot  of all simulation and data
histErr('vjet0_msd0','Fat jet $m_{SD}$ [GeV]',50,40,200,
        myDataDict,masks,
        dataDict['data'],maskData,
        iDensity=False,iStack=True,iWeights=True)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.3.1</span>

When we look at the above distributions, we see that the W, Z and other channels yield resonant bumps at various masses. However, in the bottom plot, we don't see these bumps in the data or MC simulation. Why do we not see them? 

A) The bumps are not visible in the bottom plot because the detector effects smear out the resonant structures.\
B) The bumps are not observed in the data or MC simulation in the bottom plot due to limitations in the modeling of certain physical processes.\
C) The bumps seen in the W, Z, and other channels might be due to statistical fluctuations or specific experimental conditions, which are not replicated in the bottom plot.\
D) The bumps are there in the plot on the bottom, but the QCD background is just so much larger than the W, Z samples and others that we just can't see them.

<h3>Starting Our Selection Campaign</h3>

Let's start fresh by redefining our samples, our selection, and all the tools that we have used above. This is just a refresher, so that we can begin to perform our selection below.

In [None]:
#>>>RUN: PROJ2.3-runcell02

#Load the data, if you have not done so in Section 1

wqq    = uproot.open("data/WQQ_s.root")["Tree"]
zqq    = uproot.open("data/ZQQ_s.root")["Tree"]
wqq13  = uproot.open("data/skimh/WQQ_sh.root")["Tree"]
zqq13  = uproot.open("data/skimh/ZQQ_sh.root")["Tree"]
wqq_n  = uproot.open("data/WQQ_8TeV_Jan11_r.root")["Tree"]
zqq_n  = uproot.open("data/ZQQ_8TeV_Jan11_r.root")["Tree"]
qcd    = uproot.open("data/QCD_s.root")["Tree"]
tt     = uproot.open("data/TT.root")["Tree"]
ww     = uproot.open("data/WW.root")["Tree"]
wz     = uproot.open("data/WZ.root")["Tree"]
zz     = uproot.open("data/ZZ.root")["Tree"]
ggh    = uproot.open("data/ggH.root")["Tree"]
data   = uproot.open("data/JetHT_s.root")["Tree"]

After loading the data, you are provided with some simple helper functions that have already been used in earlier sections (perhaps slightly differently). These are used for pre-selection (standard cuts that physicists usually apply before making measurements) and for computing the scaling factor of datasets.

In [None]:
#>>>RUN: PROJ2.3-runcell03

def selection(iData):
    '''
    Standard pre-selection
    '''
    #lets apply a trigger selection
    trigger = (iData.arrays('trigger', library="np")["trigger"].flatten() > 0)

    #Now lets require the jet pt to be above a threshold (400 GeV)
    jetpt   = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() > 400)

    #Lets apply both jetpt and trigger at the same time
    #standard_trig = (iData.arrays('trigger', library="np")["trigger"].flatten() % 4 > 1) #lets require one of our standard triggers (jet pT > 370 )
    allcuts = np.logical_and.reduce([trigger,jetpt])

    return allcuts
    
def get_weights(iData,weights,sel):
    
    weight = weights[0]
    
    for i in range(1,len(weights)):
        weight *= iData.arrays(weights[i],library="np")[weights[i]][sel]
        
    return weight

def integral(iData,iWeights):
    '''
    This computes the integral of weighted events 
    assuming a selection given by the function selection (see above)
    '''
    
    #perform a selection on the data
    mask_sel=selection(iData)
    
    #now iterate over the weights; note the weights are in the format of [number,variable name 1, variable name 2,...]
    weight  =iWeights[0]
    
    for i0 in range(1,len(iWeights)):
        weightarr = iData.arrays(iWeights[i0], library="np")[iWeights[i0]][mask_sel].flatten()
        
        #multiply the weights
        weight    = weight*weightarr
    
    #now take the integral and return it
    return np.sum(weight)


def scale(iData8TeV,iData13TeV,iWeights):
    '''
    This computes the integral of two selections for two datasets labelled 8TeV and 13TeV,
    but really can be 1 and 2. Then it returns the ratio of the integrals
    '''
    
    int_8TeV  = integral(iData8TeV,iWeights)
    int_13TeV = integral(iData13TeV,iWeights)
    
    return int_8TeV/int_13TeV

<h3>Find the W Peak</h3>

Now, let's define a new function to produce a plot similar to the one shown earlier (`PROJ2.3-runcell01`). We will plot the data and the simulation as a function of jet mass and of $\tau_2$, first without any additional cuts. 

In [None]:
#>>>RUN: PROJ2.3-runcell04

def plotDataSim(iVar, iSelection, iVarName, iRange):
    
    #Lets Look at the mass
    weights = [1000*18300, "puweight", "scale1fb"]
    mrange = iRange #range for mass histogram [GeV]
    bins=40            #bins for mass histogram
    density = False     #to plot the histograms as a density (integral=1)

    qcdsel      = iSelection(qcd)
    wsel        = iSelection(wqq13)
    zsel        = iSelection(zqq13)
    datasel     = iSelection(data)
    ttsel       = iSelection(tt)
    wwsel       = iSelection(ww)
    wzsel       = iSelection(wz)
    zzsel       = iSelection(zz)
    gghsel      = iSelection(ggh)

    wscale=scale(wqq,wqq13,weights)
    zscale=scale(zqq,zqq13,weights)

    # Getting the masses of selected events
    dataW = data.arrays(iVar, library="np") [iVar][datasel]
    qcdW  = qcd.arrays(iVar, library="np")  [iVar][qcdsel]
    wW    = wqq13.arrays(iVar, library="np")[iVar][wsel]
    zW    = zqq13.arrays(iVar, library="np")[iVar][zsel]
    zzW   = zz   .arrays(iVar, library="np")[iVar][zzsel]
    wzW   = wz   .arrays(iVar, library="np")[iVar][wzsel]
    wwW   = ww   .arrays(iVar, library="np")[iVar][wwsel]
    ttW   = tt   .arrays(iVar, library="np")[iVar][ttsel]
    gghW  = ggh  .arrays(iVar, library="np")[iVar][gghsel]

    #Define the weights for the histograms
    hist_weights = [get_weights(qcd,weights,qcdsel),
                    get_weights(wqq13,weights,wsel)*wscale,
                    get_weights(zqq13,weights,zsel)*zscale,
                    get_weights(zz,weights,zzsel),
                    get_weights(wz,weights,wzsel),
                    get_weights(ww,weights,wwsel),
                    get_weights(tt,weights,ttsel),
                   ]

    #Stack the simulated samples, using the same binning as for the data
    plt.hist([qcdW,wW, zW, zzW, wzW, wwW, ttW],
             color=["royalblue","r", "orange","g", "b", "purple", "cyan",], 
             label=["QCD", "W", "Z", "ZZ", "WZ", "WW", "tt",], 
             weights=hist_weights,
             range=mrange, bins=bins, alpha=.6, density=density,stacked=True)

    #Histogram the data and assign Poisson uncertainties (sqrt(N) per bin)
    counts, binEdges = np.histogram(dataW, bins=bins, range=mrange)
    yerr = np.sqrt(counts)
    if density: #normalize the counts and uncertainties to unit area
        norm = np.sum(counts)*np.diff(binEdges)
        counts, yerr = counts/norm, yerr/norm
    binCenters = (binEdges[1:]+binEdges[:-1])*.5
    plt.errorbar(binCenters, counts, yerr=yerr,fmt="o",c="k",label="data")
    plt.legend()
    plt.xlabel(iVarName)
    plt.ylabel("Counts")
    plt.show()

plotDataSim("vjet0_msd0", selection, "Jet Mass", [40,200])
plotDataSim("vjet0_t2", selection, r"$\tau_2$", [0,0.5]) 
#Add some code here to compare variables

<h3>The first cut</h3>

Now we want to plot the jet mass, first making a cut on the value of $\tau_2/\tau_1$ that best discriminates background from signal. Refer to the plot from code cell `PROJ2.2-runcell03` and the results from `Checkpoint 2.2.3`.

**Complete the code below by entering a value for the `t21` threshold. Then examine the plot.**

In [None]:
#>>>RUN: PROJ2.3-runcell06

#Define tau2/tau1
def t21_func(itau1,itau2):
    return itau2/itau1


def selectionW_firstcut(iData):
    '''
    This is the specific selection for selecting out events with W signal for our analysis
    '''
    
    #Pre-selection criteria
    trigger = (iData.arrays('trigger', library="np")["trigger"].flatten() >= 0)
    jetpt   = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() >= 400)
    
    #Select the jets to compute tau2/tau1
    jett2   = (iData.arrays('vjet0_t2', library="np")["vjet0_t2"].flatten())
    jett1   = (iData.arrays('vjet0_t1', library="np")["vjet0_t1"].flatten())
        
    t21 = t21_func(jett1,jett2)
                                
    #And then perform the cut
    #Hint: You could determine the threshold of the cut by plotting the distribution of 
    #t21 values for W and background and then determine a ballpark threshold
    #where you think the W signal would be best selected.
    #Or, more simply, you could look at the given plot and determine the appropriate threshold.
    
    t21cut   = t21 < #YOUR CODE HERE (enter t21 threshold)
    
    allcuts = np.logical_and.reduce([trigger, jetpt, t21cut])
    
    return allcuts

plotDataSim("vjet0_msd0", selectionW_firstcut, "Jet Mass",[40,200])

Looking at this plot, we can see that the W peak is not at all obvious to find! This is why we need to employ additional techniques in order to clearly identify the W peak, which is what you'll have the chance to do in the next section!

<a name='problems_2_3'></a>   

| [Top](#section_2_0) | [Restart Section](#section_2_3) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.3.2</span>

In the above plot, there is the simulation prediction (colored histograms), and then the data (black points). You can see the data has a shape that looks like two bumps on top of each other. However, the bumps are merged and peak roughly at the same spot. Why can we not just use the simulation to extract the W and Z bumps? Select all that apply:

A) We can! The best way to analyze the properties of a signal is to see where it matches simulation exactly.\
B) We cannot because our simulations use assumptions that would bias our measurement. This is effectively a kind of circular analysis.\
C) While the simulation provides a useful reference, it is essential to account for potential discrepancies between the simulation and data due to uncertainties in the theoretical models, calibration of the detectors, or unknown physics phenomena. 


<a name='section_2_4'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.4 Refining our Selection to Look for the W Signal in the Data</h2>   

| [Top](#section_2_0) | [Previous Section](#section_2_3) | [Checkpoints](#problems_2_4) | [Next Section](#section_2_5) |


<h3>Overview of W Signal</h3>

Now, let's do the lab. **Your first challenge is to make a mass plot and fit the W signal.**

As a hint, your plots should look similar to the ones below (they do not have to match exactly). The left plot shows the soft-drop (groomed) mass ($m_{SD}$) distribution for different processes in the Monte Carlo simulation along with the real data. The right plot shows the fit:

<!--<img src="images/S50_WFit.png" width='900'>-->
<p align="center">
<img alt="Fit for W Peak" src="https://raw.githubusercontent.com/mitx-8s50/images/main/PROJ2/S50_WFit.png" width="900"/>
</p>


<h3>Objective 1: Develop a procedure to select data and make a W boson mass plot</h3>

Since finding the W peak is hard, we need another variable, $\rho$, which is a scaling variable for QCD jets. It combines the jet mass and $p_T$, giving us an additional handle in our selection and helping us refine the W peak. The parameter $\rho$ is defined in <a href="https://arxiv.org/pdf/1603.00027.pdf" target="_blank">this paper.</a>

Your first goal is to find how $\rho$ is defined in the paper, and then figure out the best selection based on a combination of $\rho$ and $\tau_2/\tau_1$. The final cut is based on the *DT* (Decorrelated Tagger) score:

$$(\tau_2/\tau_1)_{dt} = \tau_2/\tau_1 - (\text{your correlation})\times\rho$$

Here `(your correlation)` is the slope of the linear relation between $\tau_2/\tau_1$ and $\rho$ (called the correlation coefficient in this notebook).

To figure out the correlation, let's plot $\tau_2/\tau_1$ and $\rho$ in the data first!
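A note on terminology: a straight-line fit returns a slope, which is related to but not identical to the Pearson correlation coefficient. The sketch below illustrates the relation on synthetic data. The arrays `rho_toy` and `t21_toy` and all numerical values are made up for illustration and are not taken from the notebook's samples.

```python
import numpy as np

# Toy data (hypothetical values, not the real samples)
rng = np.random.default_rng(0)
rho_toy = rng.uniform(-6.0, -2.0, size=10000)
t21_toy = 0.9 + 0.08 * rho_toy + rng.normal(0.0, 0.08, size=rho_toy.size)

# Slope of the least-squares line t21 = a*rho + b
slope, intercept = np.polyfit(rho_toy, t21_toy, 1)

# Pearson correlation coefficient (dimensionless, between -1 and 1)
pearson_r = np.corrcoef(rho_toy, t21_toy)[0, 1]

# The two are related through the ratio of the standard deviations
slope_from_r = pearson_r * np.std(t21_toy) / np.std(rho_toy)

print(f"slope = {slope:.4f}, Pearson r = {pearson_r:.4f}")
print(f"r * std(t21)/std(rho) = {slope_from_r:.4f}")
```

The slope carries the units of $\tau_2/\tau_1$ per unit of $\rho$, which is what the subtraction in the formula for $(\tau_2/\tau_1)_{dt}$ requires.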

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.4.1</span>

Complete the code below to plot $\tau_2/\tau_1$ vs. $\rho$. Specifically, the answer-checker will check the function `rho_func`; you should then use your result within the function `plot_taus_and_rho` to create a plot.

In [None]:
#>>>PROBLEM: PROJ2.4.1
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def rho_func(imass,ipt,mu=1.):
    return #YOUR CODE HERE
    

def plot_taus_and_rho(iData):
    
    jetptnocut = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten())
    jetmass = (iData.arrays('vjet0_msd0', library="np")["vjet0_msd0"].flatten())
    jett2   = (iData.arrays('vjet0_t2', library="np")["vjet0_t2"].flatten())
    jett1   = (iData.arrays('vjet0_t1', library="np")["vjet0_t1"].flatten())
    
    #Define rho according to the paper (eqn. 3.2)
    #Really this is rho_prime as defined in the paper, with mu=1 in these units
    rho = #YOUR CODE HERE
    
    #Define tau2/tau1
    t21 = t21_func(jett1,jett2)
    
    plt.hist2d(rho, t21, bins = 40)
    
    plt.xlabel(r"$\rho$")
    plt.ylabel(r"$\tau_2/\tau_1$")
    plt.show()
    
plot_taus_and_rho(qcd)

Great! Now we can fit a line to the 2D histogram to determine the correlation! The fitting code is given below. It applies a threshold to the 2D histogram so that only the most populated bins enter the fit. Your task is to play around with the threshold to find the best fit!
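To get a feel for how the threshold affects the fit, here is a minimal sketch on synthetic data. The toy slope, the noise level, and the helper `slope_above_threshold` are assumptions made for illustration, and `np.polyfit` stands in for the `curve_fit` call used in the notebook.

```python
import numpy as np

# Toy data with a known linear trend (hypothetical values)
rng = np.random.default_rng(0)
true_slope = 0.08
x = rng.uniform(-6.0, -2.0, size=200000)
y = 0.9 + true_slope * x + rng.normal(0.0, 0.08, size=x.size)

# 2D density histogram; H[i, j] is indexed as (x bin, y bin)
H, xedges, yedges = np.histogram2d(x, y, bins=40, density=True)
xc = 0.5 * (xedges[:-1] + xedges[1:])
yc = 0.5 * (yedges[:-1] + yedges[1:])

def slope_above_threshold(threshold):
    """Fit a line through the centers of all bins with density above threshold."""
    ix, iy = np.nonzero(H > threshold)
    slope, _ = np.polyfit(xc[ix], yc[iy], 1)
    return slope, ix.size

# Scan a few thresholds, expressed as a fraction of the highest bin
for frac in (0.1, 0.5, 0.8):
    slope, n_used = slope_above_threshold(frac * H.max())
    print(f"threshold = {frac:.1f} * max: {n_used} bins used, slope = {slope:.4f}")

slope_half, n_half = slope_above_threshold(0.5 * H.max())
```

A low threshold lets sparsely populated bins pull on the line, while a very high threshold leaves few points to fit. Note that every selected bin enters with equal weight, regardless of how many events it contains.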

In [None]:
#>>>RUN: PROJ2.4-runcell02

def fit_correlation(iData):
    
    jetptnocut = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten())
    jetmass = (iData.arrays('vjet0_msd0', library="np")["vjet0_msd0"].flatten())
    jett2   = (iData.arrays('vjet0_t2', library="np")["vjet0_t2"].flatten())
    jett1   = (iData.arrays('vjet0_t1', library="np")["vjet0_t1"].flatten())
    
    #Define rho according to the paper
    rho = rho_func(jetmass,jetptnocut)
    
    #Define tau2/tau1
    t21 = t21_func(jett1,jett2)
    
    plt.hist2d(rho, t21, bins = 40)
    plt.xlabel(r"$\rho$")
    plt.ylabel(r"$\tau_2/\tau_1$")
    
    #Fit the line
    #Produce 2D histogram
    H,xedges,yedges = np.histogram2d(rho,t21, bins=40,density = True)
    
    bin_centers_x = (xedges[:-1]+xedges[1:])/2.0
    bin_centers_y = (yedges[:-1]+yedges[1:])/2.0
    
    #Find the indices of the bins whose density is above the threshold
    non_zero_idx = np.argwhere(H > 0.6) #You can play around with this!
    x_idx = non_zero_idx[:,0]
    y_idx = non_zero_idx[:,1]
    
    x_coord = [bin_centers_x[x_idx[i]] for i in range(0,len(x_idx))]
    y_coord = [bin_centers_y[y_idx[i]] for i in range(0,len(y_idx))]
    
    #Fit a linear model on the points plotted
    def func(x, a, b):
        return a * x + b
    plt.scatter(x_coord, y_coord)
    
    popt, pcov = curve_fit(func, x_coord, y_coord)
    plt.plot(bin_centers_x, func(bin_centers_x, *popt), 'r-',
             label='fit: a=%5.3f, b=%5.3f' % tuple(popt))
    
    #Show the fit result
    legend = plt.legend()
    plt.setp(legend.get_texts(), color='w')
    plt.show()

from scipy.optimize import curve_fit
fit_correlation(qcd)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.4.2</span>

Now use the correlation from the plot to define $(\tau_2/\tau_1)_{dt}$ and plot it against $\rho$ to verify that we have successfully decorrelated the tagger. If you do it correctly, the decorrelated scores should no longer depend on $\rho$ (the distribution in the histogram should be flat in $\rho$, so the fitted line should be close to horizontal)!

Specifically, complete the function `t21ddt_func`, which should decorrelate the `t21` value as a function of `rho`. Set the default value of `iMcorr` based on your fit above.
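As a quick numerical cross-check of the decorrelation, one can compare the linear correlation before and after the subtraction. The sketch below does this on synthetic data; all names and values are hypothetical, and the slope is taken from an unbinned least-squares fit rather than from the thresholded histogram fit used in this notebook.

```python
import numpy as np

# Toy data with a linear trend (hypothetical values)
rng = np.random.default_rng(1)
rho_toy = rng.uniform(-6.0, -2.0, size=20000)
t21_toy = 0.9 + 0.08 * rho_toy + rng.normal(0.0, 0.08, size=rho_toy.size)

# Slope of the linear trend, then subtract slope * rho
slope, _ = np.polyfit(rho_toy, t21_toy, 1)
t21_dt_toy = t21_toy - slope * rho_toy

# Linear correlation before and after the subtraction
r_before = np.corrcoef(rho_toy, t21_toy)[0, 1]
r_after = np.corrcoef(rho_toy, t21_dt_toy)[0, 1]

# Residual slope of the decorrelated score
slope_after, _ = np.polyfit(rho_toy, t21_dt_toy, 1)

print(f"r before = {r_before:.3f}, r after = {r_after:.2e}")
print(f"residual slope = {slope_after:.2e}")
```

With the binned, thresholded fit the residual slope will generally be small but not exactly zero, so a value close to zero in the second fit is the thing to look for.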

In [None]:
#>>>PROBLEM: PROJ2.4.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def t21ddt_func(it21,irho,iMcorr=#YOUR CODE HERE):
    #iMcorr is the correlation coefficient
    return #YOUR CODE HERE


def plot_tausdt_and_rho(iData):
    
    jetptnocut = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten())
    jetmass = (iData.arrays('vjet0_msd0', library="np")["vjet0_msd0"].flatten())
    jett2   = (iData.arrays('vjet0_t2', library="np")["vjet0_t2"].flatten())
    jett1   = (iData.arrays('vjet0_t1', library="np")["vjet0_t1"].flatten())
    
    #Define rho according to the paper
    rho = rho_func(jetmass,jetptnocut)
    
    #Define tau2/tau1
    t21 = t21_func(jett1,jett2)
    
    #decorrelated tagger score
    t21ddt = t21ddt_func(t21,rho)
    
    plt.hist2d(rho, t21ddt, bins = 40)
    
    plt.xlabel(r"$\rho$")
    plt.ylabel(r"$(\tau_2/\tau_1)_{dt}$")
                
    #Fit the line
    #Produce 2D histogram
    H,xedges,yedges = np.histogram2d(rho,t21ddt, bins=40,density = True)
    
    bin_centers_x = (xedges[:-1]+xedges[1:])/2.0
    bin_centers_y = (yedges[:-1]+yedges[1:])/2.0
    
    #Find the indices of the bins whose density is above the threshold
    non_zero_idx = np.argwhere(H > 0.6) #You can play around with this!
    x_idx = non_zero_idx[:,0]
    y_idx = non_zero_idx[:,1]
    x_coord = [bin_centers_x[x_idx[i]] for i in range(0,len(x_idx))]
    y_coord = [bin_centers_y[y_idx[i]] for i in range(0,len(y_idx))]
    
    #Fit a linear model on the points plotted
    def func(x, a, b):
        return a * x + b
    plt.scatter(x_coord, y_coord)
    
    popt, pcov = curve_fit(func, x_coord, y_coord)
    plt.plot(bin_centers_x, func(bin_centers_x, *popt), 'r-',
             label='fit: a=%5.3f, b=%5.3f' % tuple(popt))
    
    #Show the fit result
    legend = plt.legend()
    plt.setp(legend.get_texts(), color='w')
    plt.show()
    
plot_tausdt_and_rho(qcd)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.4.3</span>

Now that you have worked out the decorrelation, determine the best cut on the decorrelated tagger score (refer to the plot from code cell `PROJ2.2-runcell03` and the results from `Checkpoint 2.2.3`) and use it in your selection function!

Complete the function `get_t21cut_W`, which should return values of `t21ddt` below the threshold that you define.
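The selection functions in this notebook combine several per-event boolean masks with `np.logical_and.reduce`. The sketch below shows that pattern on a handful of made-up events; the arrays and the threshold `toy_thresh` are hypothetical and are not a suggested cut value.

```python
import numpy as np

# Toy per-event arrays (hypothetical values for illustration only)
toy_trigger = np.array([1, 0, 1, 1, -1])
toy_pt = np.array([350.0, 420.0, 500.0, 610.0, 450.0])
toy_score = np.array([0.30, 0.55, 0.40, 0.70, 0.42])
toy_thresh = 0.45  # hypothetical threshold, not a recommended value

# One boolean mask per requirement
pass_trigger = toy_trigger >= 0
pass_pt = toy_pt >= 400
pass_score = toy_score < toy_thresh

# An event is kept only if it passes every requirement
all_cuts = np.logical_and.reduce([pass_trigger, pass_pt, pass_score])

n_pass = int(all_cuts.sum())
efficiency = n_pass / all_cuts.size
selected_pt = toy_pt[all_cuts]  # a mask can index any per-event array

print(all_cuts, n_pass, efficiency, selected_pt)
```

Because the masks all have one entry per event, the combined mask can be applied to any branch of the same sample, which is how the mass arrays are selected later in this section.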

In [None]:
#>>>PROBLEM: PROJ2.4.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def get_t21cut_W(t21ddt,t21_thresh=#YOUR CODE HERE):
    #return values of t21ddt that occur below the threshold
    #based on previous analysis of t21 vs. rho
    ### YOUR CODE HERE ### 
    return

    
def selectionW(iData):
    '''
    This is the specific selection for selecting out events with W signal for our analysis
    '''
    
    #Pre-selection criteria
    trigger = (iData.arrays('trigger', library="np")["trigger"].flatten() >= 0)
    jetpt   = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten() >= 400)
    jetptnocut = (iData.arrays('vjet0_pt', library="np")["vjet0_pt"].flatten())
    jetmass = (iData.arrays('vjet0_msd0', library="np")["vjet0_msd0"].flatten())
    jett2   = (iData.arrays('vjet0_t2', library="np")["vjet0_t2"].flatten())
    jett1   = (iData.arrays('vjet0_t1', library="np")["vjet0_t1"].flatten())
    
    
    #Define the parameters rho, tau2/tau1
    rho = rho_func(jetmass,jetptnocut)
    t21 = t21_func(jett1,jett2)
    
    #Define the decorrelated tagger
    Mcorr = #YOUR CODE HERE
    t21ddt = t21ddt_func(t21,rho,Mcorr) 
                            
    #And then perform the cut
    #Hint: You could determine the threshold of the cut by plotting the distribution of 
    #t21ddt scores for W and background and then determine a ball park threshold
    #where you think the W signal would be best selected
    t21cut   = ### YOUR CODE HERE ### 
    
    allcuts = np.logical_and.reduce([trigger, jetpt, t21cut])
    
    return allcuts

Now that we have our selection function, let's try to make the mass plot!

In [None]:
#>>>RUN: PROJ2.4-runcell05

#Let's look at the mass
weights = [1000*18300, "puweight", "scale1fb"]
mrange = (45,200)  #range for mass histogram [GeV]
bins=40            #bins for mass histogram
density = True     #to plot the histograms as a density (integral=1)

qcdsel      = selectionW(qcd)
wsel        = selectionW(wqq13)
zsel        = selectionW(zqq13)
datasel     = selectionW(data)
ttsel       = selectionW(tt)
wwsel       = selectionW(ww)
wzsel       = selectionW(wz)
zzsel       = selectionW(zz)
gghsel      = selectionW(ggh)
wscale=scale(wqq,wqq13,weights)
zscale=scale(zqq,zqq13,weights)

dataW = data.arrays('vjet0_msd0', library="np")["vjet0_msd0"][datasel]
qcdW = qcd.arrays('vjet0_msd0', library="np")["vjet0_msd0"][qcdsel]
wW = wqq13.arrays('vjet0_msd0', library="np")["vjet0_msd0"][wsel]
zW = zqq13.arrays('vjet0_msd0', library="np")["vjet0_msd0"][zsel]
zzW = zz.arrays('vjet0_msd0', library="np")["vjet0_msd0"][zzsel]
wzW = wz.arrays('vjet0_msd0', library="np")["vjet0_msd0"][wzsel]
wwW = ww.arrays('vjet0_msd0', library="np")["vjet0_msd0"][wwsel]
ttW = tt.arrays('vjet0_msd0', library="np")["vjet0_msd0"][ttsel]
gghW = ggh.arrays('vjet0_msd0', library="np")["vjet0_msd0"][gghsel]

hist_weights = [get_weights(qcd,weights,qcdsel),
                get_weights(wqq13,weights,wsel)*wscale,
                get_weights(zqq13,weights,zsel)*zscale,
                get_weights(zz,weights,zzsel),
                get_weights(wz,weights,wzsel),
                get_weights(ww,weights,wwsel),
                get_weights(tt,weights,ttsel),
               ]

plt.hist([qcdW,wW, zW, zzW, wzW, wwW, ttW], 
         color=["royalblue","r", "orange","g", "b", "purple", "cyan",], 
         label=["QCD", "W", "Z", "ZZ", "WZ", "WW", "tt",], 
         weights=hist_weights,
         range=mrange, bins=bins, alpha=.6, density=density,stacked=True)  #same binning as the data histogram

counts, bins = np.histogram(dataW, bins=bins, range=mrange, density=density)
yerr = np.sqrt(counts) / np.sqrt(len(dataW)*np.diff(bins))
binCenters = (bins[1:]+bins[:-1])*.5
plt.errorbar(binCenters, counts, yerr=yerr,fmt="o",c="k",label="data")
plt.legend()
plt.xlabel(r"Mass [GeV]")
plt.ylabel("Normalized Counts")
plt.show()

Remember to compare your mass plot with the one shown at the start of this section.
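The error bars on the normalized data points deserve a short explanation. For a density histogram, each bin content is the raw count divided by (number of entries in range × bin width), so the Poisson error $\sqrt{n}$ is scaled by the same factor. The sketch below checks, on toy data with hypothetical names and values, that this agrees with the expression used in the plotting cell when every entry lies inside the histogram range.

```python
import numpy as np

# Toy masses, all inside the histogram range (hypothetical values)
rng = np.random.default_rng(1)
toy_mass = rng.uniform(50.0, 190.0, size=5000)
toy_range = (45, 200)
n_bins = 40

raw_counts, edges = np.histogram(toy_mass, bins=n_bins, range=toy_range)
density, _ = np.histogram(toy_mass, bins=n_bins, range=toy_range, density=True)
widths = np.diff(edges)
n_in_range = raw_counts.sum()

# Poisson error on the raw counts, scaled like the bin contents
err_from_counts = np.sqrt(raw_counts) / (n_in_range * widths)

# Expression in the style of the plotting cell, starting from the density
err_notebook_style = np.sqrt(density) / np.sqrt(len(toy_mass) * widths)

print(np.allclose(err_from_counts, err_notebook_style))
```

If some entries fall outside the range, the array length overcounts the normalization and the two expressions differ slightly; using the in-range total is then the safer choice.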

<a name='section_2_5'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">PROJ2.5 Fit for W Peak</h2>   

| [Top](#section_2_0) | [Previous Section](#section_2_4) | [Checkpoints](#problems_2_5) | [Next Section](#section_2_6) |


<h3>Objective 2: Fit the W mass peak with the appropriate function</h3>

Now you will perform the fit on the W signal. It involves a few steps:
    
<h4>1. Defining a model</h4>

First you need to define a fit model of your own. In this case we will use some functions (Gaussian, exponential) in conjunction with a 5th order polynomial (six coefficients, `p0` through `p5`). You can read more about how the order of the polynomial is determined here: https://en.wikipedia.org/wiki/Chow_test. The concepts were also covered in previous Lessons, if you want to review them. Adding a Chow test will likely allow you to improve the measurement by lowering the polynomial order. While it's not needed here, we strongly encourage this investigation.

In the extended project you will have the chance to determine the order of the polynomial for the Z fit. You may find that the Z fit does not need a 5th order polynomial. The main reason for this is that we have much more data in the W sample than in the Z sample.
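The Chow test linked above belongs to the family of F-tests. A closely related and common way to choose a polynomial order is the F-statistic for nested models: fit order $n$ and order $n+1$, and ask whether the drop in the residual sum of squares is larger than expected from noise alone. The sketch below is written under that assumption and may differ from the procedure used in the earlier Lessons; the helper names and toy data are hypothetical.

```python
import numpy as np

def rss_polyfit(x, y, order):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, order)
    resid = y - np.polyval(coeffs, x)
    return float(np.sum(resid**2))

def f_statistic(x, y, order_lo, order_hi):
    """F-statistic comparing a lower-order fit to a higher-order (nested) fit."""
    n = len(x)
    p_lo, p_hi = order_lo + 1, order_hi + 1  # number of coefficients
    rss_lo = rss_polyfit(x, y, order_lo)
    rss_hi = rss_polyfit(x, y, order_hi)
    return ((rss_lo - rss_hi) / (p_hi - p_lo)) / (rss_hi / (n - p_hi))

# Toy data generated from a quadratic (hypothetical values)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 5.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

f_12 = f_statistic(x, y, 1, 2)  # is the quadratic term needed?
f_23 = f_statistic(x, y, 2, 3)  # is the cubic term needed?

print(f"F(1 -> 2) = {f_12:.1f}, F(2 -> 3) = {f_23:.2f}")
```

A large F value indicates that the extra term is justified by the data; a value of order one suggests the extra term is mostly fitting noise. Converting F to a p-value requires the F distribution with $(p_{hi}-p_{lo},\, n-p_{hi})$ degrees of freedom.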

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.5.1</span>

Define a fit function `fitW()` that combines a Gaussian of the form $a\exp\left(-(x-\mu)^2/(2\sigma^2)\right)$ with a 5th order polynomial. The Gaussian parameters `a`, `mu`, and `sigma` and the coefficients of the polynomial are all left as fit parameters. 

In [None]:
#>>>PROBLEM: PROJ2.5.1
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def fitW(x, p0, p1, p2, p3, p4, p5, a, mu, sigma):
    #Our model is a gaussian on top of 5th order polynomial.
    
    #Define the polynomial
    poly  = #YOUR CODE HERE
    
    #Define the gaussian
    gauss = #YOUR CODE HERE
    
    #Stick them together
    y =  poly + a*gauss
    
    return y

<h4>2. Performing a fit</h4>

After defining your model, you need to get the data histogram and perform the fit:

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.5.2</span>

Edit the fit model `p` to set the initial conditions for the fit parameters. Run the fit and see how it looks!

What is the reduced chi-squared value? Is it good? Report your answer as a number with precision 1e-1.
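As a reminder of what this number means, the reduced chi-squared is the sum of squared, uncertainty-normalized residuals divided by the number of degrees of freedom (bins minus fit parameters). The sketch below computes it by hand for a tiny made-up histogram; the function `reduced_chi2` and all values are hypothetical. The fit report printed by `lmfit` lists the same quantity as the reduced chi-square.

```python
import numpy as np

def reduced_chi2(counts, model, n_params):
    """Chi-squared per degree of freedom with Poisson uncertainties sqrt(counts)."""
    sigma = np.sqrt(counts)  # assumes no empty bins
    chi2 = np.sum(((counts - model) / sigma) ** 2)
    ndof = counts.size - n_params
    return chi2 / ndof

# Toy bin contents and model predictions (hypothetical values)
toy_counts = np.array([100.0, 121.0, 144.0, 81.0])
toy_model = np.array([110.0, 110.0, 132.0, 90.0])

# Each bin is exactly one sigma away from the model, so chi2 = 4;
# with 2 fitted parameters there are 4 - 2 = 2 degrees of freedom
red_chi2 = reduced_chi2(toy_counts, toy_model, n_params=2)
print(red_chi2)
```

A value near 1 means the model describes the data within the quoted uncertainties. Values much larger than 1 point to a poor model or underestimated uncertainties, and values much smaller than 1 suggest overestimated uncertainties or an overly flexible model.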

In [None]:
#>>>PROBLEM: PROJ2.5.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

# Now we get the data histogram so we can fit it
bins = 50
mrange=[40,140]
counts, bins = np.histogram(dataW,bins=bins,range=mrange,density=False)

w = 1/ #YOUR CODE HERE (Poisson uncertainty of each bin)
binCenters = (bins[1:]+bins[:-1])*.5
x,y = binCenters.astype("float32"), counts.astype("float32")

#Perform the fit 
model = lm.Model(fitW)
     
#Set initial conditions for your fit
#You could experiment with zeros or your intuition first.
#For a better fit, consider adding restrictions (e.g. bounds) on the fit parameters.
p = model.make_params(#YOUR CODE HERE) 


result_W = model.fit(data=y,
                   params=p,
                   x=x,
                   weights=w)

#Plot the result
plt.figure()
result_W.plot()
plt.xlabel("mass [GeV]",position=(0.92,0.1))
plt.ylabel("Entries/bin",position=(0.1,0.84))

#Print the fit summary
print(result_W.fit_report())
result_W.redchi  #reduced chi-squared (chi-squared per degree of freedom)

<h4>3. Extracting the Mass</h4>

Remember to compare your fit plot with the one shown at the start of Section 2.4! Now we need to extract the W mass and the uncertainty on the measurement!

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Checkpoint 2.5.3</span>

Finally, get the mass and standard error from your fit results (corresponding to the specific fit function defined in `Checkpoint 2.5.1` and fit in `Checkpoint 2.5.2`). This should be the parameter `mu` of the Gaussian. Report the mass with precision 1e-1.
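The standard error quoted for a fit parameter is usually the square root of the corresponding diagonal element of the fit covariance matrix. The sketch below illustrates this with a straight-line fit in NumPy on toy data; it is a stand-in for the idea and not the `lmfit` call needed for this checkpoint, and all names and values are hypothetical.

```python
import numpy as np

# Toy straight-line data (hypothetical values)
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
true_slope, true_intercept = 2.0, 3.0
y = true_intercept + true_slope * x + rng.normal(0.0, 0.5, size=x.size)

# Least-squares fit that also returns the covariance matrix of the coefficients
coeffs, cov = np.polyfit(x, y, 1, cov=True)

# Standard errors are the square roots of the diagonal elements
std_errors = np.sqrt(np.diag(cov))
slope, slope_err = coeffs[0], std_errors[0]

print(f"slope = {slope:.3f} +/- {slope_err:.3f}")
```

The off-diagonal elements of the covariance matrix describe how the parameters are correlated; they matter when several fitted parameters are combined into one derived quantity.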

In [None]:
#>>>PROBLEM: PROJ2.5.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

mW = #YOUR CODE HERE
mWerr = #YOUR CODE HERE

print(mW, "+/-", mWerr)