# Higgs Vignette

In [1]:
import numpy as np
import pandas as pd
import gzip
from millipede import BernoulliLikelihoodVariableSelector

## A toy Higgs analysis

We consider a [publicly available dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS#) of simulated Higgs boson production in particle collisions at a collider like the LHC.
The Higgs boson is a fundamental particle that forms a cornerstone of the Standard Model of particle physics.

From our point of view this is just a logistic regression dataset with 28 features. Let's see if we can figure out which features are most informative for distinguishing background events from true Higgs production events. In other words, let's do feature selection. We proceed as follows.
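
To make the "logistic regression with feature selection" framing concrete, here is a minimal sketch of the kind of model involved. It assumes the standard formulation in which each feature has a binary inclusion indicator `gamma` that switches its coefficient on or off. The function name and all numerical values are hypothetical and chosen only for illustration. This is not the library's internal code.

```python
import numpy as np

def signal_probability(X, beta, gamma, intercept):
    """P(y = 1 | x) under a logistic model restricted to the included features."""
    # Excluded features (gamma == 0) contribute nothing to the logit.
    logits = intercept + X @ (gamma * beta)
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical toy values: 3 events, 4 features, only features 0 and 2 included.
X = np.array([[0.5, -1.0, 0.2, 0.0],
              [-0.3, 0.4, -0.8, 1.0],
              [0.0, 0.0, 0.0, 0.0]])
beta = np.array([2.0, 5.0, -1.0, 3.0])
gamma = np.array([1, 0, 1, 0])

probs = signal_probability(X, beta, gamma, intercept=0.0)
print(probs)
```

Feature selection then amounts to inferring which entries of `gamma` are likely to be 1 given the data.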

## First we load our dataset

Our dataset has 28 continuous-valued covariates (features) and one binary-valued response or outcome variable. The response is in the final column ('signal_event'), with 0 corresponding to background events and 1 corresponding to signal events (i.e. those with a Higgs boson). We've randomly subsampled the (large) dataset to have only 500 data points.

In [2]:
dataframe = pd.read_csv(gzip.GzipFile("./higgs.csv.gz", "rb"), index_col=0)
dataframe.head(5)

| | lepton_pT | lepton_eta | lepton_phi | missing_energy_magnitude | missing_energy_phi | jet1_pt | jet1_eta | jet1_phi | jet1_btag | jet2_pt | ... | jet4_phi | jet4_btag | m_jj | m_jjj | m_lv | m_jlv | m_bb | m_wbb | m_wwbb | signal_event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.97938 | 0.7984 | -0.419629 | -0.920256 | -0.326625 | -0.924768 | -0.031182 | -0.797841 | -1.0 | -0.890909 | ... | 0.077073 | 1.0 | -0.937439 | -0.891028 | -0.68126 | -0.85659 | -0.901631 | -0.905623 | -0.908678 | 0.0 |
| 1 | -0.750671 | 0.1504 | 0.670497 | -0.957327 | -0.001578 | -0.901614 | -0.336001 | 0.309158 | 1.0 | -0.841725 | ... | 0.184998 | -1.0 | -0.674253 | -0.661044 | -0.682098 | -0.831254 | -0.802116 | -0.672779 | -0.704905 | 1.0 |
| 2 | -0.731734 | 0.5164 | -0.085668 | -0.937382 | -0.672873 | -0.86047 | 0.138569 | -0.933781 | 1.0 | -0.88945 | ... | -0.245166 | -1.0 | -0.945057 | -0.91048 | -0.681382 | -0.838173 | -0.775481 | -0.859885 | -0.871721 | 0.0 |
| 3 | -0.517963 | 0.152 | -0.746267 | -0.947095 | -0.869415 | -0.887704 | -0.038853 | 0.214605 | 1.0 | -0.71504 | ... | 0.640253 | -1.0 | -0.965931 | -0.935671 | -0.682011 | -0.762698 | -0.871967 | -0.868329 | -0.891515 | 1.0 |
| 4 | -0.84246 | -0.4184 | 0.432682 | -0.834378 | -0.446632 | -0.829899 | -0.451059 | -0.709019 | 1.0 | -0.881156 | ... | 0.794339 | -1.0 | -0.908299 | -0.853577 | -0.608219 | -0.829275 | -0.82844 | -0.761596 | -0.767367 | 0.0 |


In [3]:
dataframe['signal_event'].head(5)

0    0.0
1    1.0
2    0.0
3    1.0
4    0.0
Name: signal_event, dtype: float64
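
Before fitting anything it can help to confirm the number of events, the number of features, and the balance between signal and background. The sketch below does this on a small hypothetical stand-in frame (`toy`, with made-up values and only two features) so that it runs without the data file. The same calls should apply to `dataframe`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe: two features plus the response.
toy = pd.DataFrame({
    "m_bb": [-0.90, -0.80, -0.78, -0.87, -0.83],
    "jet1_pt": [-0.92, -0.90, -0.86, -0.89, -0.83],
    "signal_event": [0.0, 1.0, 0.0, 1.0, 0.0],
})

n_events, n_columns = toy.shape
n_features = n_columns - 1  # every column except the response

# Count background (0.0) and signal (1.0) events.
class_counts = toy["signal_event"].value_counts().sort_index()
signal_fraction = toy["signal_event"].mean()

print(n_events, n_features)
print(class_counts.to_dict())
print(signal_fraction)
```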

## Next we create a VariableSelector object appropriate for our binary-valued responses

In [4]:
# Note that by default we make use of all features/covariates.
# If we want to exclude any features from consideration we should
# drop them from the dataframe using pandas.DataFrame.drop
selector = BernoulliLikelihoodVariableSelector(dataframe,      # pass in the data
                                               'signal_event', # indicate the column of responses
                                               S=1,            # specify the expected number of covariates included a priori
                                               )
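
The comment above mentions excluding features with `pandas.DataFrame.drop`, and `S` sets the expected number of covariates included a priori. The sketch below illustrates both on a hypothetical toy frame. The relation `h = S / P` is an assumption: it holds if every covariate shares the same prior inclusion probability `h`, and the library's exact prior should be checked in its documentation.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror a few of the real ones.
toy = pd.DataFrame({
    "lepton_phi": [0.1, -0.4, 0.7],
    "jet1_phi": [-0.8, 0.3, -0.9],
    "m_bb": [-0.9, -0.8, -0.78],
    "signal_event": [0.0, 1.0, 0.0],
})

# Exclude features by dropping their columns before building the selector.
reduced = toy.drop(columns=["lepton_phi", "jet1_phi"])
print(list(reduced.columns))

# Assumption: a common prior inclusion probability h per covariate, so S = h * P.
S = 1.0   # expected number of included covariates a priori
P = 28    # number of candidate covariates in the full dataset
h = S / P
print(round(h, 4))
```

Under this assumption, `S=1` with 28 features corresponds to a prior inclusion probability of roughly 3.6% per feature, which expresses a strong prior preference for sparse models.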

## Finally we run the MCMC algorithm to compute posterior inclusion probabilities (PIPs) and other posterior quantities of interest

In [5]:
# this should take less than 15 seconds to run on a reasonably fast laptop
selector.run(T=5000, T_burnin=1000, verbose=False, seed=0)
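
As a rough illustration of what a posterior inclusion probability is, the sketch below estimates PIPs as a weighted average of sampled inclusion indicators. This is a generic estimator written under stated assumptions, not the library's implementation, which may use a different and more efficient estimator. The function `weighted_pip` and the simulated samples are hypothetical.

```python
import numpy as np

def weighted_pip(gamma_samples, weights):
    """Weighted fraction of samples in which each feature is included."""
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * gamma_samples).sum(axis=0) / weights.sum()

rng = np.random.default_rng(0)
T, P = 1000, 4

# Hypothetical inclusion indicators: feature j is included at rate true_rates[j].
true_rates = np.array([0.9, 0.5, 0.1, 0.0])
gamma_samples = rng.random((T, P)) < true_rates

# Hypothetical positive importance weights, one per sample.
weights = rng.uniform(0.5, 1.5, size=T)

pip = weighted_pip(gamma_samples, weights)
print(pip)
```

With equal weights this reduces to the plain fraction of samples in which each feature appears.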

## The results are available in the selector.summary DataFrame

- It appears that m_bb is the most important feature. This is not surprising, since this quantity reconstructs the mass of the Higgs boson. In particular the Higgs mass appears as a distinct peak on top of a smooth continuum of m_bb values from the background events.
- Note that the intercept term does not have a corresponding PIP, since it is always included in the model by assumption.

In [6]:
selector.summary.sort_values(by=['PIP'], ascending=False)

| | PIP | Coefficient | Coefficient StdDev | Conditional Coefficient | Conditional Coefficient StdDev |
|---|---|---|---|---|---|
| m_bb | 0.94854 | -5.908611 | 1.929967 | -6.240037 | 1.365868 |
| m_wwbb | 0.904235 | -8.698346 | 4.107494 | -9.630761 | 3.11451 |
| m_wbb | 0.819471 | 8.352952 | 4.236343 | 10.130489 | 1.938654 |
| jet2_pt | 0.629456 | 2.2021 | 1.756765 | 3.466697 | 0.688888 |
| jet1_pt | 0.380623 | 1.350613 | 1.70777 | 3.379943 | 0.66298 |
| jet4_pt | 0.214269 | 0.767959 | 1.387868 | 2.977446 | 0.943046 |
| m_jjj | 0.115638 | 0.655927 | 2.068083 | 6.229692 | 2.428537 |
| missing_energy_magnitude | 0.025588 | -0.052868 | 0.345914 | -2.0437 | 0.746257 |
| m_jlv | 0.019716 | 0.076237 | 0.516262 | 2.758054 | 1.498525 |
| m_jj | 0.010686 | -0.040688 | 0.511469 | -2.621528 | 3.176341 |
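
One simple way to turn PIPs into a selected feature set is to keep every feature whose PIP exceeds a cutoff. The sketch below rebuilds the top rows of the PIP column by hand so that it runs on its own. The 0.5 threshold is a hypothetical choice, not a recommendation from the library.

```python
import pandas as pd

# PIP values copied from the first six rows of the summary above.
summary = pd.DataFrame(
    {"PIP": [0.94854, 0.904235, 0.819471, 0.629456, 0.380623, 0.214269]},
    index=["m_bb", "m_wwbb", "m_wbb", "jet2_pt", "jet1_pt", "jet4_pt"],
)

threshold = 0.5  # hypothetical cutoff
selected = summary.index[summary["PIP"] > threshold].tolist()
print(selected)  # ['m_bb', 'm_wwbb', 'm_wbb', 'jet2_pt']
```

The same filter could be applied directly to `selector.summary`.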


Some additional stats about the MCMC run are available in `selector.stats`:

In [7]:
selector.stats

{'Weight quantiles': '5/10/20/50/90/95:  3.59e-03  6.13e-02  2.78e-01  3.25e+00  3.75e+00  3.85e+00',
 'Weight moments': 'mean/std/min/max:  2.38e+00  1.50e+00  7.29e-07  4.33e+00',
 'Elapsed MCMC time': '6.8 seconds',
 'Mean iteration time': '1.134 ms',
 'Number of retained samples': 5000,
 'Number of burn-in samples': 1000,
 'Adapted xi value': '3.092',
 'Polya-Gamma MH stats': 'Mean acc. prob.: 0.838  Accepted/Attempted: 1081/1291'}
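
Several of these entries are formatted strings, so a little parsing is needed to use them numerically. The sketch below copies a few entries from the output above into a plain dict and cross-checks the elapsed time and the Metropolis-Hastings acceptance ratio. The string formats are assumed to match that output and may differ between library versions.

```python
import re

# Entries copied from the selector.stats output shown above.
stats = {
    "Elapsed MCMC time": "6.8 seconds",
    "Mean iteration time": "1.134 ms",
    "Number of retained samples": 5000,
    "Number of burn-in samples": 1000,
    "Polya-Gamma MH stats": "Mean acc. prob.: 0.838  Accepted/Attempted: 1081/1291",
}

# Total iterations times mean iteration time should roughly match the elapsed time.
total_iterations = stats["Number of retained samples"] + stats["Number of burn-in samples"]
mean_ms = float(stats["Mean iteration time"].split()[0])
estimated_seconds = total_iterations * mean_ms / 1000.0

# Extract the accepted and attempted counts for the Polya-Gamma MH step.
match = re.search(r"Accepted/Attempted:\s*(\d+)/(\d+)", stats["Polya-Gamma MH stats"])
accepted, attempted = int(match.group(1)), int(match.group(2))
acceptance_rate = accepted / attempted

print(total_iterations, round(estimated_seconds, 3), round(acceptance_rate, 3))
```

Here 6000 iterations at 1.134 ms each give about 6.8 seconds, consistent with the reported elapsed time.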