# Lab 5: LHC Particle Physics Data Analysis

**Author:** [YOUR NAME]

**Course:** Physics 434 - Data Analysis Lab  
**Objective:** Explore LHC particle physics data and optimize discovery significance through event selection

In this lab, you will work with realistic particle physics data from the Large Hadron Collider (LHC). The goal is to separate a Higgs boson signal from the QCD background using jet substructure analysis and cut-based optimization.

## Dataset Information

Two pT (transverse momentum) ranges are provided:

### Low pT (250-500 GeV/c): `Sample_pt_250_500`
- **Training samples:**
  - `higgs_100000_pt_250_500.pkl` (Expected yields: N_higgs = 100)
  - `qcd_100000_pt_250_500.pkl` (Expected yields: N_qcd = 20,000)
- **Pseudo-experiments:**
  - `data_highLumi_pt_250_500.h5`
  - `data_lowLumi_pt_250_500.h5`

### High pT (1000-1200 GeV/c): `Sample_pt_1000_1200`
- **Training samples:**
  - `higgs_100000_pt_1000_1200.pkl` (Expected yields: N_higgs = 50)
  - `qcd_100000_pt_1000_1200.pkl` (Expected yields: N_qcd = 2,000)
- **Pseudo-experiments:**
  - `data_highLumi_pt_1000_1200.h5`
  - `data_lowLumi_pt_1000_1200.h5`

## Physics Background

### What is a Jet?
A jet is a collimated spray of particles traveling in roughly the same direction. It originates from a hard (high-energy) quark or gluon through bremsstrahlung and fragmentation.

### Jet Substructure
At high transverse momentum, the decay products of heavy particles (W, Z, Higgs, top quark) become collimated. Standard jet identification fails because all decay products end up in a single jet. **Jet substructure variables**, computed from the 4-momenta of the jet constituents, help distinguish boosted heavy particles with two-pronged decays from QCD jets.

### Key Variables:
- **mass**: Jet invariant mass (GeV)
- **d2**: Jet substructure variable for two-pronged discrimination
- **η (eta)**: Pseudorapidity (geometric quantity related to polar angle)
- **φ (phi)**: Azimuthal angle around the beam
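
As a small illustration of how pseudorapidity relates to the polar angle, the sketch below uses the standard definition $\eta = -\ln\tan(\theta/2)$, where $\theta$ is measured from the beam axis. The function name `pseudorapidity` is chosen here for illustration and is not part of the lab data or any library.

```python
import math


def pseudorapidity(theta):
    """Return eta = -ln(tan(theta / 2)) for a polar angle theta in radians (0 < theta < pi)."""
    return -math.log(math.tan(theta / 2.0))


# Perpendicular to the beam (theta = 90 degrees) gives eta close to 0;
# directions closer to the beam axis give larger |eta|.
for degrees in (90, 45, 10):
    print(degrees, round(pseudorapidity(math.radians(degrees)), 3))
```

Because η depends only on the angle, it can be read as a purely geometric coordinate, which is how the variable list above describes it.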

**Extended Reading:** 
- ATLAS detector: https://arxiv.org/pdf/1709.04533.pdf
- Jet substructure: https://arxiv.org/abs/1201.0008

## Setup and Data Loading

In [None]:
# TODO: Import required libraries
# import pickle
# import numpy as np
# import matplotlib.pyplot as plt
# import pandas as pd

In [None]:
# TODO: Load the QCD background data
# Use pickle to load: Sample_pt_250_500/qcd_100000_pt_250_500.pkl


In [None]:
# TODO: Load the Higgs signal data
# Use pickle to load: Sample_pt_250_500/higgs_100000_pt_250_500.pkl


In [None]:
# TODO: Explore the data structure
# Print the keys available in the data dictionary
# Print the first few values of 'mass' and 'd2'
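
A minimal sketch of the load-and-inspect pattern is shown below. It assumes each `.pkl` file deserializes to a dictionary-like object (a `dict` or a pandas `DataFrame`) with entries such as `'mass'` and `'d2'`. So that it runs without the course files, it first writes a small stand-in file; `toy_sample.pkl` and its contents are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_sample(path):
    """Load one pickled sample and return the deserialized object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in file with made-up values (not the real lab data).
rng = np.random.default_rng(0)
fake = {"mass": rng.normal(125.0, 10.0, 5), "d2": rng.uniform(0.0, 5.0, 5)}
tmp_path = Path(tempfile.mkdtemp()) / "toy_sample.pkl"
with open(tmp_path, "wb") as f:
    pickle.dump(fake, f)

sample = load_sample(tmp_path)
print(list(sample.keys()))                    # available features
print(sample["mass"][:3], sample["d2"][:3])   # first few values
```

For the real data, the same `load_sample` call would take a path such as `Sample_pt_250_500/qcd_100000_pt_250_500.pkl`. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.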

# Task 1: Visualization (3 points)

Explore the low pT dataset (`Sample_pt_250_500`). Make representative plots of each feature to understand the data structure and characteristics.

## Part (a): Individual Feature Distributions

Create histograms for each available feature in both signal and background datasets.

In [None]:
# TODO: Plot mass distributions for signal and background
# Create overlaid histograms with proper labels and legends

In [None]:
# TODO: Plot d2 distributions for signal and background

In [None]:
# TODO: Plot other available features (pt, eta, phi, etc.)
# Create a multi-panel figure showing all features

## Part (b): Summary Statistics

Calculate and display basic statistics for key features.
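
One possible way to collect these statistics is sketched below. The helper `summarize` and the toy arrays are illustrative assumptions. The choice of the sample standard deviation (`ddof=1`) is also an assumption; use `ddof=0` if the population form is wanted.

```python
import numpy as np


def summarize(values):
    """Return mean, standard deviation, min, and max of a 1D array as a dict."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),  # sample standard deviation
        "min": values.min(),
        "max": values.max(),
    }


# Toy stand-ins for the 'mass' and 'd2' columns (made-up values).
rng = np.random.default_rng(0)
toy = {"mass": rng.normal(125.0, 10.0, 1000), "d2": rng.gamma(3.0, 0.5, 1000)}

for name, values in toy.items():
    stats = summarize(values)
    print(name, {key: round(float(val), 3) for key, val in stats.items()})
```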

In [None]:
# TODO: Calculate mean, std, min, max for mass and d2

# Task 2: Data Exploration (3 points)

Study correlations between mass and d2 jet substructure variable to build a cut-based analysis.
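
Alongside the scatter plots, a single number such as the Pearson correlation coefficient can summarize how strongly two features move together. The sketch below is one way to compute it with `np.corrcoef`; the helper name `pearson_r` and the toy arrays are assumptions for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two 1D arrays."""
    return np.corrcoef(x, y)[0, 1]


# Toy example: d2 built to decrease slightly with mass, plus noise (made-up data).
rng = np.random.default_rng(0)
toy_mass = rng.normal(125.0, 10.0, 2000)
toy_d2 = 5.0 - 0.02 * toy_mass + rng.normal(0.0, 0.5, 2000)

print(round(float(pearson_r(toy_mass, toy_d2)), 3))
```

A coefficient near zero only rules out a linear relationship, so the 2D scatter plots remain the main diagnostic.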

## Part (a): Distribution Comparison and 2D Scatter Plots

Create mass and d2 distributions, and 2D scatter plots for signal (`'Higgs Signal'`) and background (`'QCD Background'`).

In [None]:
# TODO: Create two plots showing mass distributions (signal vs background)
# Describe the shape and discrimination power

In [None]:
# TODO: Create two plots showing d2 distributions (signal vs background)
# Describe the shape and discrimination power

In [None]:
# TODO: Create 2D scatter plot of mass vs d2 for signal

In [None]:
# TODO: Create 2D scatter plot of mass vs d2 for background

**Question:** Describe the discrimination power of mass and d2 for separating signal from background.

**Your Answer:**

## Part (b): Weighted Distributions

Re-weight signal (N_signal = 100) and background (N_background = 20,000) to match expected yields.
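
The re-weighting can be sketched as a single per-event weight, `N_expected / N_generated`, so that the weights of a full sample sum to its expected yield. The sketch assumes each training file holds 100,000 events, as the filenames suggest; with the real data, `len()` of the loaded sample is the safer source for that number.

```python
def event_weight(n_expected, n_generated):
    """Per-event weight that scales a simulated sample to its expected yield."""
    return n_expected / n_generated


N_GENERATED = 100_000  # assumption taken from the filenames

w_signal = event_weight(100, N_GENERATED)         # 100 / 100000
w_background = event_weight(20_000, N_GENERATED)  # 20000 / 100000

print(w_signal, w_background)
```

Each background event therefore counts 200 times more than each signal event, which is why the unweighted overlays can look far more favorable than the weighted ones.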

In [None]:
# TODO: Calculate weights for signal and background
# N_signal_expected = 100, N_background_expected = 20000


In [None]:
# TODO: Create weighted histograms with signal stacked on background for mass
# Use plt.hist with weights parameter and stacked=True
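
To make the effect of the weights concrete, the sketch below computes the bin contents that a weighted, stacked histogram would display, using `np.histogram` on toy data. The mass shapes are made-up stand-ins, not the lab samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n_generated = 100_000

# Toy mass shapes (hypothetical): a peak for signal, a falling spectrum for background.
sig_mass = rng.normal(125.0, 10.0, n_generated)
bkg_mass = 50.0 + rng.exponential(60.0, n_generated)

# One weight per event, so each sample sums to its expected yield.
w_sig = np.full(n_generated, 100 / n_generated)
w_bkg = np.full(n_generated, 20_000 / n_generated)

bins = np.linspace(0.0, 300.0, 61)
sig_counts, _ = np.histogram(sig_mass, bins=bins, weights=w_sig)
bkg_counts, _ = np.histogram(bkg_mass, bins=bins, weights=w_bkg)

# In a stacked plot, the signal sits on top of the background in every bin.
stacked_top = bkg_counts + sig_counts

print(round(float(sig_counts.sum()), 1), round(float(bkg_counts.sum()), 1))
```

The same arrays can be drawn with `plt.hist([bkg_mass, sig_mass], bins=bins, weights=[w_bkg, w_sig], stacked=True, label=[...])`, where listing the background first places the signal on top.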

In [None]:
# TODO: Create weighted histograms with signal stacked on background for d2

**Question:** How visible is the signal over the background in the weighted distributions?

**Your Answer:**

## Part (c): Mass Window Selection

Apply a mass cut of [120, 130] GeV to enhance signal visibility.
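
A mass window is usually applied as a boolean mask that can then index every other feature array. The sketch below shows one way to build it. The helper name `mass_window_mask` is illustrative, and treating both edges as inclusive is an assumption.

```python
import numpy as np


def mass_window_mask(mass, low=120.0, high=130.0):
    """Return a boolean array that is True where low <= mass <= high (GeV)."""
    mass = np.asarray(mass)
    return (mass >= low) & (mass <= high)


# Toy values (made up) to show the mask selecting matching entries of another feature.
toy_mass = np.array([119.9, 120.0, 125.0, 130.0, 130.1])
toy_d2 = np.array([3.1, 1.2, 0.9, 2.4, 4.0])

mask = mass_window_mask(toy_mass)
print(mask)
print(toy_d2[mask])  # d2 values of the events inside the window
```

The expected yield after the cut is the per-event weight times `mask.sum()`, computed separately for signal and background.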

In [None]:
# TODO: Apply mass cut [120, 130] GeV
# Create boolean masks for signal and background


In [None]:
# TODO: Plot weighted mass distributions after the cut
# Does the mass distribution look as expected?

In [None]:
# TODO: Plot weighted d2 distributions after the mass cut
# How does this compare to d2 without any cuts?

**Question:** How does the d2 plot after the mass cut compare to the one without cuts?

**Your Answer:**

# Task 3: Significance Optimization (4 points)

Scan over d2 values to find the optimal cut that maximizes discovery significance.

**Significance Formula:** $\text{Significance} = \frac{N_{\text{signal}}}{\sqrt{N_{\text{background}}}}$ (in units of σ)

The goal is to reject as much background as possible while retaining as much signal as possible, which increases the significance.
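
As a reference point, the formula can be evaluated on the expected yields from the dataset description before any cuts are applied. The sketch below does this arithmetic; the function name `significance` is chosen here for illustration.

```python
import math


def significance(n_signal, n_background):
    """Return N_signal / sqrt(N_background), in units of sigma."""
    return n_signal / math.sqrt(n_background)


low_pt = significance(100, 20_000)   # 250-500 GeV/c sample
high_pt = significance(50, 2_000)    # 1000-1200 GeV/c sample

print(f"low pT: {low_pt:.2f} sigma, high pT: {high_pt:.2f} sigma")
```

Without cuts this gives roughly 0.71σ for the low pT sample and 1.12σ for the high pT sample. Because the background enters through a square root, a cut helps whenever the fraction of signal it keeps exceeds the square root of the fraction of background it keeps.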

## Part (a): Understanding d2 Cuts

Inspect the d2 distribution after the mass cut to determine the appropriate cut direction.

**Questions to consider:**
1. How does the d2 distribution change after applying the mass cut?
2. For a d2 cut value of 4, should you keep events below or above 4?
3. What range should be used to scan d2 cut values?

**Your Answers:**

## Part (b): d2 Cut Scan

Scan d2 cut values and calculate significance for each cut.
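
The scan can be sketched as a loop over candidate cut values that recomputes the weighted yields and the significance each time. The function below is a minimal sketch on toy data, not the reference solution. The name `scan_d2`, the `keep_below` switch (included so either cut direction can be tried), and the toy d2 shapes and weights are all assumptions.

```python
import numpy as np


def scan_d2(sig_d2, bkg_d2, w_sig, w_bkg, cuts, keep_below=True):
    """Return S / sqrt(B) for each cut value; NaN where no background survives."""
    sig_d2 = np.asarray(sig_d2)
    bkg_d2 = np.asarray(bkg_d2)
    result = np.full(len(cuts), np.nan)
    for i, cut in enumerate(cuts):
        if keep_below:
            sig_pass, bkg_pass = sig_d2 < cut, bkg_d2 < cut
        else:
            sig_pass, bkg_pass = sig_d2 >= cut, bkg_d2 >= cut
        n_sig = w_sig * np.count_nonzero(sig_pass)  # expected signal yield
        n_bkg = w_bkg * np.count_nonzero(bkg_pass)  # expected background yield
        if n_bkg > 0:
            result[i] = n_sig / np.sqrt(n_bkg)
    return result


# Toy d2 values standing in for events that already passed the mass cut (made up).
rng = np.random.default_rng(7)
toy_sig_d2 = rng.gamma(3.0, 0.5, 5000)
toy_bkg_d2 = rng.gamma(5.0, 0.8, 5000)

cuts = np.linspace(0.1, 10.0, 100)
scan = scan_d2(toy_sig_d2, toy_bkg_d2, 100 / 5000, 20_000 / 5000, cuts)

best = int(np.nanargmax(scan))
print(f"best cut: {cuts[best]:.2f}, significance: {scan[best]:.2f}")
```

When only a handful of background events survive, S/√B can spike from statistical noise alone, so it is worth checking how many raw events sit behind the best point before trusting it.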

In [None]:
# TODO: Define d2 scan range (e.g., 0 to 10 with fine steps)

# TODO: For each d2 cut value:
#   1. Count signal and background events passing both mass and d2 cuts
#   2. Apply weights to get expected yields
#   3. Calculate significance = N_signal / sqrt(N_background)


In [None]:
# TODO: Plot significance vs d2 cut value
# Mark the point where significance reaches 3σ

## Part (c): Final Results with Optimal Cuts

Apply both mass and optimal d2 cuts to visualize the final result.

In [None]:
# TODO: Identify optimal d2 cut value (e.g., for 3σ significance)

# TODO: Apply both mass [120, 130] and optimal d2 cuts

# TODO: Calculate final significance

In [None]:
# TODO: Plot final mass distribution with both cuts applied
# Show signal stacked on background with optimal cuts

**Question:** What do you observe in the final mass distribution? What is the final significance? How much improvement did you achieve?

**Your Answer:**

# Bonus: High pT and Multi-Feature Optimization (3 points)

Optimize discovery significance for high pT data (`Sample_pt_1000_1200`) using at least 3 features.

In [None]:
# TODO: Load high pT data

In [None]:
# TODO: Explore available features

In [None]:
# TODO: Implement multi-feature optimization through for loops
# Consider combinations of mass, d2, and other jet substructure variables

# for ... :
    # for ... :
        # for ... :
            # if significance > best_significance:
            #     best_significance = significance
            #     best_cuts = {'mass_cut': mass_cut, 'd2_cut': d2_cut, 't21_cut': t21_cut}
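
The nested-loop skeleton above can also be written with `itertools.product`, which walks the same grid of cut combinations. The sketch below runs on toy data. The feature name `t21`, the cut directions, the mass window centered at 125 GeV, the grid values, and the `min_background` guard are all assumptions for illustration, and the features actually available in the high pT files may differ.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n = 2000  # toy events per sample

# Toy features (made-up shapes, not the lab data).
sig = {"mass": rng.normal(125.0, 8.0, n), "d2": rng.gamma(2.0, 0.6, n),
       "t21": rng.beta(2.0, 5.0, n)}
bkg = {"mass": 40.0 + rng.exponential(60.0, n), "d2": rng.gamma(4.0, 0.8, n),
       "t21": rng.beta(5.0, 2.0, n)}
w_sig, w_bkg = 50 / n, 2000 / n  # scale to the high pT expected yields


def passes(sample, half_width, d2_cut, t21_cut):
    """Boolean mask for events inside the mass window and below both upper cuts."""
    return (
        (np.abs(sample["mass"] - 125.0) <= half_width)
        & (sample["d2"] < d2_cut)
        & (sample["t21"] < t21_cut)
    )


half_widths = [5.0, 10.0, 20.0, 30.0]
d2_cuts = [1.0, 2.0, 3.0, 6.0]
t21_cuts = [0.3, 0.5, 0.7, 1.0]
min_background = 1.0  # skip cuts that leave less than one expected background event

best_significance, best_cuts = 0.0, None
for half_width, d2_cut, t21_cut in itertools.product(half_widths, d2_cuts, t21_cuts):
    n_sig = w_sig * np.count_nonzero(passes(sig, half_width, d2_cut, t21_cut))
    n_bkg = w_bkg * np.count_nonzero(passes(bkg, half_width, d2_cut, t21_cut))
    if n_bkg < min_background:
        continue
    significance = n_sig / np.sqrt(n_bkg)
    if significance > best_significance:
        best_significance = significance
        best_cuts = {"mass_half_width": half_width, "d2_cut": d2_cut, "t21_cut": t21_cut}

print(best_cuts, round(float(best_significance), 2))
```

The grid grows multiplicatively with each added feature (4 × 4 × 4 = 64 combinations here), so coarse grids followed by a finer scan around the best point keep the run time manageable.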

In [None]:
# TODO: Plot final mass distribution with optimal cuts applied
# Show signal stacked on background with optimal cuts