In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize
from functools import partial
%matplotlib inline

plt.rcParams['font.size'] = 14

# Searching for the Higgs boson

### Introduction

The overarching goal of this project is to search for the Higgs boson. The Higgs boson was discovered in July 2012 by the ATLAS and CMS experiments. It was the last particle of the Standard Model to be discovered experimentally. The Standard Model is a theory describing how fundamental particles interact through the weak, electromagnetic, and strong forces. This figure from [Quanta magazine](https://www.quantamagazine.org/a-new-map-of-the-standard-model-of-particle-physics-20201022/) depicts the Standard Model particles:

<img src="figures/StandardModel.png" width="400"/>


To discover the Higgs boson, protons were collided at center-of-mass energies of 7 and 8 TeV. When protons collide at such high energies, they "break up" into their constituent particles (called quarks and gluons). These constituent particles interact in a variety of ways according to the Standard Model and can produce other particles. The ATLAS and CMS experiments are huge detectors that measure these outgoing particles.

Recall that in Week 7 of this course we searched for dark matter in a simplified data set. In spirit, this search is similar to the dark matter search, but there are several key differences:

1. **Detectors:** The detectors at CERN that were used to discover the Higgs boson are the size of a six-story building (about 50 m long and 25 m tall). They surround the collision point and are built to fully image the particles produced when two protons collide at very nearly the speed of light. Sophisticated trigger and data-analysis techniques filter the data to find events that might have contained a Higgs boson.
2. **What to look for:** 
- The Higgs boson is unstable and decays to other particles too quickly (the predicted lifetime is $\approx10^{-22}$ seconds) to be directly "imaged". Physicists therefore look for events containing particles that are consistent with a Higgs boson decay. The Higgs can decay to a number of different particles, but we will look at decays to two photons.
- In the dark matter search in Week 7, we used the **S2** variable (the size of the electron charge signal) to distinguish signal from background.  For the Higgs search we will instead be looking at the **mass** of the parent particle which decayed into two photons.  
- For background events, that mass follows a falling distribution, which we will model as a linear function. For signal events, that mass follows a peaked distribution centered at the mass of the Higgs particle ($m_{H} = 125.1 \frac{\rm GeV}{c^2}$). This means that in the Higgs search we will be cutting on both sides of the signal. In other words, the signal region is $|m - m_{H}| < w$, where $w$ is the half-width of the cut window.

3. **Signal and background yield:** In this search, we have much more data, even at the last step of the data analysis.  Both the rate of signal events and background events are much higher than for the dark matter search.
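The window cut $|m - m_{H}| < w$ described in item 2 can be sketched as a boolean mask over the 1 GeV bins. The half-width `w = 3.0` below is an arbitrary illustration, not a recommended cut, and the grid simply mirrors the 90 to 160 GeV range used later in this notebook.

```python
import numpy as np

m_H = 125.1                               # Higgs mass quoted above, GeV/c^2
w = 3.0                                   # hypothetical half-width, GeV/c^2
masses = np.linspace(90.0, 160.0, 71)     # grid points in 1 GeV/c^2 steps

in_window = np.abs(masses - m_H) < w      # True for grid points inside the window
selected = masses[in_window]
print(selected)
```

Because the comparison is strict and the grid is coarse, changing `w` by a fraction of a GeV may not change which bins are selected.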


You can try two different ways of doing the search.

1. **"Cut and count":**  
- The general idea is to define what we call a signal region (a region enriched in signal events) and a background region (a region enriched in background events), and to show that there are more events in the signal region than in the background region.
- This would be straightforward if our background rate was constant as a function of mass. Then you can simply pick a window around where you expect a peak to be as the signal region, and another window far away from the peak, of the same width, as the background region. 
- In our case, since the background is not constant, you can consider picking two background regions, one on either side of the signal, and averaging them. Or you can come up with a different way.

<img src="figures/higgs_cut_and_count_2.png" width="400"/>
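To see why averaging two background regions can work for a sloped background, suppose (as an illustrative assumption) that the background is exactly linear, $b(m) = b_0 + s\,(m - m_H)$, and that the two background windows are centered at $m_H - d$ and $m_H + d$ with the same half-width $w$ as the signal window. The symbols $b_0$, $s$, and $d$ are introduced here only for this sketch. The integral of a linear function over a window equals the window width times the value at its center, so

$$
\frac{1}{2}\left[\int_{m_H-d-w}^{m_H-d+w} b(m)\,dm + \int_{m_H+d-w}^{m_H+d+w} b(m)\,dm\right]
= \frac{1}{2}\Big[2w\,(b_0 - s\,d) + 2w\,(b_0 + s\,d)\Big]
= 2w\,b_0
= \int_{m_H-w}^{m_H+w} b(m)\,dm .
$$

The slope terms cancel only when the two background windows sit symmetrically about the signal window; with asymmetric placement the average is biased by the slope.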


2. **"Fitting":** 
- Pick a function to model your signal, and a separate one to model your background.
- Fit your models to the data, and use the results to extract the size of the signal.

<img src="figures/Higgs_Mass_fit.png" width="400"/>


For comparison, here is a figure from the Higgs discovery paper from the ATLAS experiment:

<img src="figures/2012Higgsplot.png" width="400"/>


### Potential goals for this project:

1. Optimize the width of the cut window for the best significance for discovering the Higgs boson. To do this, define a metric that quantifies your expected significance given the expected numbers of signal and background events, and explain why you chose this metric.

2. Apply the "cut and count" analysis to the real data and calculate the significance. 

3. Plot the increase in significance that you expect as a function of time using the "cut and count" analysis. In other words, how would your ability to discover the Higgs scale with more data? Come up with a formula that describes the increase versus time.

4. Apply the "fitting" analysis to the real data and come up with a way to quantify the significance of the results.

5. Compare the results of the "fitting" analysis with the "cut and count" analysis. 

Try to complete all of these goals (or similar goals if you have your own ideas). We have written much of the code for you, so be sure to explain each plot and number that you make thoroughly.

# Project details


## Load the data

This is **24 months** of simulated data.

We have histogrammed the Higgs mass data in units of $1~{\rm GeV}/{c^2}$, as was done in the plots above. We will look at all the data between $90$ and $160~{\rm GeV}/{c^2}$.

In [None]:
data = np.loadtxt('../data/Higgs.txt')
mass_grid = np.linspace(90., 160., 71) # for reference
masses = data[:,0]
nevts = data[:,1]
errors = np.sqrt(nevts)

fig, ax = plt.subplots(figsize=(8, 5))

ax.errorbar(masses, nevts, yerr=errors, fmt='.')

ax.set_xlabel(r"Mass [GeV/$c^2$]")
ax.set_ylabel(r"Counts [per GeV/$c^2$]")

fig.tight_layout()
plt.show()

## Useful variables

First, let's make our lives easy by defining some variables: an initial guess for the Higgs mass and an initial guess for the width of the mass peak. **Define these variables based on the plots above.**

(N.B. If you look up the theoretical width of the Higgs peak due to quantum-mechanical effects, it's about $4~{\rm MeV}/{c^2}$, which is much narrower than the width we see. The width of the peak is entirely dominated by detector effects.) 

In [None]:
Higgs_Mass  = None # In Units of GeV / c**2
Higgs_Width = None # In Units of GeV / c**2

## Signal and background models

Here are some initial models that you can use. You may want to optimize these models. Note that the models are expressed in units of events per GeV per month.

`Gauss`: returns the value of a Gaussian
* Function arguments
    * x: x-value(s) to evaluate at 
    * nsig: normalization factor of the Gaussian (its total area, i.e. the number of signal events)
    * mu: center
    * sigma: width (standard deviation)

`poly1`: tells you the value of a linear function
* Function arguments
    * x: x-value(s) to evaluate at 
    * ref_mass: x offset
    * offset: y offset
    * slope: slope of function
    
`passed_cuts`: tells you how many signal and background events from the idealized data pass your cuts for a cut window of a given width, using models for how much signal and background we expect.
* Function arguments
    * cut_width: half of the width of your signal/background region 
    * masses: the array of mass points in data
    * model_sig: the values of your signal model
    * model_bkg: the values of your background model
    * Higgs_Mass: the center of your signal region
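As a quick check of what the Gaussian normalization means, the sketch below evaluates a Gaussian of the same form as `Gauss` on a 1 GeV grid and sums it over the bins. The function name `gauss` and the parameter values are illustrative assumptions, not the values you should use in the project.

```python
import numpy as np
import scipy.stats as stats

def gauss(x, nsig, mu, sigma):
    # Same form as the notebook's Gauss: nsig times a unit-area normal pdf
    return nsig * stats.norm(loc=mu, scale=sigma).pdf(x)

mass_grid = np.linspace(90.0, 160.0, 71)      # 1 GeV/c^2 spacing
nsig, mu, sigma = 20.0, 125.0, 2.0            # illustrative values only

per_bin = gauss(mass_grid, nsig, mu, sigma)   # events per GeV at each grid point
total = per_bin.sum() * 1.0                   # multiply by the 1 GeV bin width
print(f"summed over bins: {total:.3f} (nsig = {nsig})")
```

Since the bins are 1 GeV wide and the peak sits well inside the mass range, the sum over bins comes out very close to `nsig`, which is why `nsig` can be read as the total number of signal events.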

In [None]:
def Gauss(x, nsig, mu, sigma):
    return nsig*stats.norm(loc=mu, scale=sigma).pdf(x)

def poly1(x, ref_mass, offset, slope):
    return offset + (x-ref_mass)*slope

def passed_cuts(cut_width, masses, model_sig, model_bkg, Higgs_Mass):
    mask = np.abs(masses - Higgs_Mass) < cut_width
    n_sig = np.sum(model_sig[mask])
    n_bkg = np.sum(model_bkg[mask])
    return n_sig, n_bkg

In [None]:
ref_mass                    = Higgs_Mass
nsig_per_month              = 20.
nbkg_per_gev_per_month      = 40.
bkg_slope_per_gev_per_month = -0.2
model_bkg = poly1(mass_grid, ref_mass, nbkg_per_gev_per_month, bkg_slope_per_gev_per_month)
model_sig = Gauss(mass_grid, nsig_per_month, Higgs_Mass, Higgs_Width)

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

ax.scatter(mass_grid, model_sig, label="Signal", marker='.')
ax.scatter(mass_grid, model_bkg, label="Background", marker='.')
ax.scatter(mass_grid, model_sig+model_bkg, label="Combined", marker='.')

ax.set_xlabel(r"Mass [GeV/$c^2$]")
ax.set_ylabel(r"Counts [per GeV/$c^2$ / month]")

ax.legend(fontsize=10)
fig.tight_layout()

plt.show()

## Useful functions to optimize cuts and set expectations

You can use these three functions to:

1. `plot_nexp_passed_cuts`: plots how many signal and background events from your model pass your cuts, as a function of the cut width.
* Function arguments:
    * masses: the array of mass points in data
    * model_sig: the values of your signal model
    * model_bkg: the values of your background model

2. `find_sig2noise`: makes a plot of the significance of the signal as a function of the cut width.
* Function arguments:
    * masses: the array of mass points in data
    * model_sig: the values of your signal model
    * model_bkg: the values of your background model

3. `sig2noise_v_time`: shows you how the significance will increase with time, assuming you keep taking data.
* Function arguments:
    * masses: the array of mass points in data
    * model_sig: the values of your signal model
    * model_bkg: the values of your background model
    
**Note:** You will have to define your own `significance` function. It should depend on the expected numbers of signal and background events and should represent your confidence in discovering the Higgs boson if it exists. Hint: think in terms of sigmas.
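As one possible starting point (an assumption, not the required answer), a common rule of thumb for a counting experiment is $Z \approx s/\sqrt{b}$: the expected signal divided by the Poisson fluctuation of the background. The sketch below uses a hypothetical function name `simple_significance` and made-up yields, and it also shows how this particular metric behaves when both yields are scaled by the number of months.

```python
import numpy as np

def simple_significance(nsig, nbkg):
    # Hypothetical metric: signal over the Poisson fluctuation of the background.
    # Only sensible when nbkg is large enough for sqrt(nbkg) to describe the spread.
    return nsig / np.sqrt(nbkg)

s_month, b_month = 15.0, 240.0            # made-up yields per month in some window
months = np.array([1, 4, 16])
z = simple_significance(months * s_month, months * b_month)
print(z / z[0])                            # significance relative to one month
```

With this choice, quadrupling the data doubles the significance. Other metrics are possible, and you should justify whichever one you pick.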

In [None]:
def significance(nsig, nbkg):
    significance = None # Define your significance function here
    return significance

def plot_nexp_passed_cuts(masses, model_sig, model_bkg, show=False):
    sig_cts = np.zeros(26)
    bkg_cts = np.zeros(26)
    widths = np.linspace(0, 25, 26)
    for i, width in enumerate(widths):
        sig_cts[i], bkg_cts[i] = passed_cuts(width, masses, model_sig, model_bkg, Higgs_Mass)
        
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(widths, sig_cts, label="Signal")
    ax.plot(widths, bkg_cts, label="Background")
    ax.set_yscale('log')
    ax.set_xlabel(r"Cut Width [GeV / $c^2$]")
    ax.set_ylabel(r"Events passing cut [per month]")
    ax.legend(fontsize=10)
    fig.tight_layout()
    if show:
        plt.show()
    
def find_sig2noise(masses, model_sig, model_bkg, plot=True):
    sig_cts = np.zeros(26)
    bkg_cts = np.zeros(26)
    widths = np.linspace(0, 25, 26)
    for i, width in enumerate(widths):
        if i == 0:
            continue
        sig_cts[i], bkg_cts[i] = passed_cuts(width, masses, model_sig, model_bkg, Higgs_Mass)
    sig2noise = np.zeros(26)
    sig2noise[1:] = significance(sig_cts[1:],bkg_cts[1:])
    if plot:
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.plot(widths, sig2noise)
        ax.set_xlabel(r"Cut Width [GeV / $c^2$]")
        ax.set_ylabel(r"Significance for one month")
        fig.tight_layout()
        plt.show()
    return sig2noise

def sig2noise_v_time(mass_grid, model_sig, model_bkg, plot=True):
    max_s2n = np.zeros(24)
    best_cut = np.zeros(24)
    n_months_array = np.arange(24)
    for n_months in n_months_array:
        if n_months == 0:
            continue
        s2n = find_sig2noise(mass_grid, n_months*model_sig, n_months*model_bkg, plot=False)
        max_s2n[n_months] = np.max(s2n)
        best_cut[n_months] = np.argmax(s2n)
    if plot:
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.scatter(n_months_array, max_s2n)
        ax.set_xlabel(r"Time [months]")
        ax.set_ylabel(r"Significance for N months")
        fig.tight_layout()
        plt.show()

    return max_s2n

## Useful functions for the cut and count analysis

You can use these two functions:

`extract_sig_from_data`: tells you how many events are in your signal region in the "real" data. Since it is real data, you don't know whether they are signal or background.

* Function arguments:
    * cut_width: half of the width of your signal/background region 
    * masses: the array of mass points in data
    * nevts: the number of events observed at each mass in data
    * Higgs_Mass: the center of your signal region


`estimate_bkg_from_data`: tells you how many events are in your background region in the "real" data, along with the Poisson error.

* Function arguments:
    * cut_width: half of the width of your signal/background region 
    * masses: the array of mass points in data
    * nevts: the number of events observed at each mass in data
    * low_center: the center of your "lower" background region 
    * high_center: the center of your "higher" background region
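The sketch below walks through the same counting logic on a synthetic, noise-free spectrum so that the arithmetic is easy to follow. Everything in it (the spectrum parameters, the half-width, the region centers at 105 and 145) is made up for illustration, and it reimplements the masks inline rather than calling the notebook's functions.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, noise-free spectrum: linear background plus a Gaussian peak (numbers made up)
masses = np.linspace(90.0, 160.0, 71)
bkg = 960.0 - 4.8 * (masses - 125.0)
sig = 480.0 * stats.norm(loc=125.0, scale=2.0).pdf(masses)
nevts = bkg + sig

half_width = 5.0                                  # hypothetical cut half-width
in_sig = np.abs(masses - 125.0) < half_width      # signal region
in_lo = np.abs(masses - 105.0) < half_width       # lower background region
in_hi = np.abs(masses - 145.0) < half_width       # upper background region

n_obs = nevts[in_sig].sum()                       # everything in the signal region
b_est = 0.5 * (nevts[in_lo].sum() + nevts[in_hi].sum())  # averaged background estimate
excess = n_obs - b_est                            # candidate signal yield
print(f"observed {n_obs:.1f}, background estimate {b_est:.1f}, excess {excess:.1f}")
```

Here the two background regions are placed symmetrically about the peak, so the slope cancels in the average. The excess is somewhat smaller than the 480 injected signal events because the window does not contain the full Gaussian.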

In [None]:
def extract_sig_from_data(cut_width, masses, nevts, Higgs_Mass):
    mask = np.abs(masses - Higgs_Mass) < cut_width
    return np.sum(nevts[mask])

def estimate_bkg_from_data(cut_width, masses, nevts, low_center, high_center):
    mask_bkg_lo = np.abs(masses-low_center) < cut_width
    mask_bkg_hi = np.abs(masses-high_center) < cut_width
    mask_bkg = np.bitwise_or(mask_bkg_lo, mask_bkg_hi)
    bkg_estimate = 0.5 * np.sum(nevts[mask_bkg])
    return (bkg_estimate, np.sqrt(bkg_estimate))

In [None]:
# Example of how to use estimate_bkg_from_data
bkg_yld, bkg_err = estimate_bkg_from_data(10, masses, nevts, 105, 145)
print(f"In background region, I found {bkg_yld} +/- {bkg_err} events")

## Useful functions for fitting the data

The important one here is `fitAndPlotResult`, which will take your data, fit your model, and plot the results. It returns the fitted number of signal events and the error on that fit.

* Function arguments:
    * masses: the array of mass points
    * nevts: the number of events observed at each mass
    * ref_mass: the reference mass for the background model
    * init_pars: guesses for the initial parameters of the model. Should be in the format of a list: [p0,p1,p2,p3]

The four model parameters that are fitted for are: 

1. The total number of signal events.
2. The number of background events in the bin at the reference mass.
3. The slope of the background model, in events per bin.
4. The width of the signal peak.

The other parameters will be fixed.
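For comparison, here is a minimal sketch of the same kind of fit using `scipy.optimize.curve_fit` instead of the `optimize.minimize` approach used in this notebook. It is a swapped-in technique shown under stated assumptions: the data are a noise-free expectation generated from made-up parameters, the peak position `M_H` and reference mass `REF` are held fixed, and the Gaussian is written out with NumPy.

```python
import numpy as np
from scipy.optimize import curve_fit

M_H = 125.0          # peak position, held fixed in the fit
REF = 125.0          # reference mass for the linear background

def model(x, nsig, offset, slope, width):
    # Unit-area Gaussian scaled by nsig, plus a linear background
    w = np.abs(width)
    peak = nsig * np.exp(-0.5 * ((x - M_H) / w) ** 2) / (np.sqrt(2.0 * np.pi) * w)
    return peak + offset + slope * (x - REF)

masses = np.linspace(90.0, 160.0, 71)
truth = (480.0, 960.0, -4.8, 2.0)                 # made-up parameters
nevts = model(masses, *truth)                     # noise-free expectation, not real data

# Weight each bin by its Poisson error; absolute_sigma=True makes pcov a covariance
popt, pcov = curve_fit(model, masses, nevts, p0=[300.0, 900.0, -4.0, 2.5],
                       sigma=np.sqrt(nevts), absolute_sigma=True)
nsig_fit, nsig_err = popt[0], np.sqrt(pcov[0, 0])
print(f"nsig = {nsig_fit:.1f} +/- {nsig_err:.1f}")
```

The ratio `nsig_fit / nsig_err` is one way to think about the significance of the fitted signal, though it is not the only one.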

In [None]:
def model_func(x, ref_mass, nsig, offset, slope, width):
    return Gauss(x, nsig, Higgs_Mass, width) + poly1(x, ref_mass, offset, slope)

def generic_chi2(params, data_vals, model, x, ref_mass):
    model_vals = model(x, ref_mass, *params)
    return np.sum(((data_vals - model_vals)**2)/data_vals)

def cost_func(data_vals, model, x, ref_mass):
    return partial(generic_chi2, data_vals=data_vals, model=model, x=x, ref_mass=ref_mass)

def fitAndPlotResult(masses, nevts, ref_mass, init_pars):
    our_cost_func = cost_func(nevts, model_func, masses, ref_mass=ref_mass)
    result = optimize.minimize(our_cost_func, x0=np.array(init_pars))
    fit_pars = result['x']
    cov = 2.0 * result['hess_inv']  # for a chi2 cost, covariance = 2 * inverse Hessian (BFGS estimate, approximate)
    model_fit = model_func(masses, ref_mass, *fit_pars)
    background_fit = poly1(masses, ref_mass, fit_pars[1], fit_pars[2])

    print("Best Fit ---------")
    print(f"Normalization factor for gaussian : {fit_pars[0]:0.1f} [Events]")
    print(f"        Fitted width for gaussian : {fit_pars[3]:0.4f} GeV/c^2")
    print(f"      Linear offset at Higgs mass : {fit_pars[1]:0.2f} [Events / GeV/c^2]")
    print(f"                     Linear slope : {fit_pars[2]:0.2f} [Events / GeV/c^2 / GeV/c^2]")

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.errorbar(masses, nevts, yerr=np.sqrt(nevts), fmt='.', label="data")
    ax.plot(masses, background_fit, label="background model")
    ax.plot(masses, model_fit, label="full model")
    ax.set_xlabel(r"mass $[\frac{\rm GeV}{c^2}]$")
    ax.set_ylabel(r"Events $[{\rm per }\frac{\rm GeV}{c^2}]$")
    ax.legend(fontsize=10)
    fig.tight_layout()
    plt.show()

    return (fit_pars[0], np.sqrt(cov[0,0]))