# Project Summary: Higgs → ττ Classification

This notebook gives a simple written summary of the work done in the other notebooks:
- Exploratory Data Analysis
- Binary Classification (signal vs background)
- Multiclass Classification (Z vs ggH vs VBF)
- Main MRes assignment context

Goal: Measure the Higgs signal strengths for gluon fusion (μ_ggH) and vector boson fusion (μ_VBF) with relative precision close to the CMS reference values (≈9% for μ_ggH and ≈18% for μ_VBF). To do this, XGBoost classifier outputs are used to define the bins of a likelihood fit.


## Methods (Brief)

1. Load data for three decay channels: et (electron–tau), mt (muon–tau), tt (tau–tau).
2. Keep processes separate: Z (background), ggH and VBF (signal).
3. Clean missing values: entries stored as −9999 are replaced with the feature median. No complex feature engineering was required.
4. Train:
   - Binary models (merged channels, with channel indicator columns) for a simple baseline.
   - Multiclass XGBoost models separately per channel for Z vs ggH vs VBF.
5. Turn multiclass probabilities into a 2D grid of bins:
   - x: P(ggH)/(P(ggH)+P(VBF)) distinguishes production mode.
   - y: P(ggH)+P(VBF) distinguishes signal vs background.
6. Scale each process histogram to expected yields (Z × 8.4, ggH × 0.034, VBF × 0.011).
7. Combine the channel histograms and fit a binned Poisson likelihood to extract μ_ggH and μ_VBF.
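
The cleaning in step 3 can be sketched as below. This is a minimal illustration, not the notebook code: the function name `impute_median` is made up here, and it assumes the median is taken per feature column over the non-missing entries only.

```python
import numpy as np

MISSING = -9999.0  # sentinel value named in step 3


def impute_median(features, missing=MISSING):
    """Replace sentinel entries in each column with that column's median of valid entries."""
    data = np.asarray(features).astype(float)  # astype returns a copy, so the input is untouched
    mask = data == missing
    # Hide the sentinels as NaN so that nanmedian ignores them.
    medians = np.nanmedian(np.where(mask, np.nan, data), axis=0)
    # Write each column's median into that column's masked positions.
    data[mask] = np.broadcast_to(medians, data.shape)[mask]
    return data


# Toy feature matrix (made up for illustration): 4 events, 2 features.
X = np.array([[1.0, 10.0],
              [-9999.0, 20.0],
              [3.0, -9999.0],
              [5.0, 40.0]])
X_clean = impute_median(X)
print(X_clean)
```

A column in which every entry is missing would come out as NaN with this sketch, so such a feature would need separate handling.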


## Binary Classification (Baseline)

Purpose: Quick check that ML can separate signal (ggH+VBF) from Z background.

Highlights:
- XGBoost achieved the highest accuracy and AUC of the models tested (the others were a Random Forest and a simple neural network).
- Provides clean signal vs background ordering for 1D histograms.
- Limitation: Cannot tell ggH from VBF, so not ideal for separate μ measurements.


## Multiclass Classification (Main Approach)

We train one XGBoost model per channel (et, mt, tt) with three classes: Z, ggH, VBF.

Why this is better than the binary baseline:
- Preserves differences between channels.
- Separates ggH and VBF so we can measure μ_ggH and μ_VBF individually.

Performance:
- Accuracy is good, and the confusion matrices show that both signal classes are clearly separated from Z.
- Most confusion is between ggH and VBF, which is expected because their final states are similar.
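
To show how such a confusion matrix is read, here is a small sketch that builds a row-normalised matrix from predicted class probabilities. The class order (Z, ggH, VBF), the helper name `confusion_matrix`, and the toy probabilities are all assumptions for illustration, not results from the notebooks.

```python
import numpy as np

CLASSES = ["Z", "ggH", "VBF"]  # assumed column order of the probability array


def confusion_matrix(y_true, proba, n_classes=3):
    """Row-normalised confusion matrix: rows are true classes, columns are predicted classes."""
    y_pred = np.argmax(proba, axis=1)  # predicted class = highest probability
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_true, y_pred), 1)  # accumulate one count per event
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)  # avoid dividing by zero


# Toy labels and probabilities (made up), two events per class.
y_true = np.array([0, 0, 1, 1, 2, 2])
proba = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.40, 0.50],  # ggH event predicted as VBF
    [0.10, 0.30, 0.60],
    [0.05, 0.50, 0.45],  # VBF event predicted as ggH
])
cm = confusion_matrix(y_true, proba)
print(cm)
```

In this toy case the Z row is fully on the diagonal while the ggH and VBF rows share their off-diagonal entries, which is the pattern described above.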

## Probability Binning & Scaling

We turn the multiclass probabilities into 2D bins:
- Horizontal (production): P(ggH) / (P(ggH)+P(VBF))
- Vertical (signal‑likeness): P(ggH)+P(VBF)
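
The two axes can be computed and binned as in the sketch below. The grid size (`n_x`, `n_y`), the uniform bin edges, and the column order (Z, ggH, VBF) are assumptions of this example; the notebooks may use a different grid.

```python
import numpy as np


def probability_plane(proba, eps=1e-12):
    """Map rows of (P(Z), P(ggH), P(VBF)) to the (x, y) coordinates used for binning."""
    p_ggh, p_vbf = proba[:, 1], proba[:, 2]
    y = p_ggh + p_vbf               # vertical axis: signal-likeness
    x = p_ggh / np.maximum(y, eps)  # horizontal axis: production mode; eps guards y == 0
    return x, y


def bin_2d(proba, n_x=4, n_y=5):
    """Unweighted event counts in an n_x by n_y grid over the unit square."""
    x, y = probability_plane(proba)
    edges_x = np.linspace(0.0, 1.0, n_x + 1)
    edges_y = np.linspace(0.0, 1.0, n_y + 1)
    counts, _, _ = np.histogram2d(x, y, bins=[edges_x, edges_y])
    return counts


# Toy probabilities (made up): one Z-like, one ggH-like, one VBF-like event.
proba = np.array([
    [0.90, 0.06, 0.04],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
])
counts = bin_2d(proba)
print(counts)
```

The Z-like event lands at low y, while the ggH-like and VBF-like events land at high y but at opposite ends of x.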

Each process histogram is then scaled to expected yields (physics cross‑sections) so the likelihood fit reflects real rates rather than raw counts.
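
As a small sketch of the scaling step, using the factors quoted in the Methods section (Z × 8.4, ggH × 0.034, VBF × 0.011). The raw counts and the 2×2 grid are made up for illustration, and `scale_to_expected` is a hypothetical helper name.

```python
import numpy as np

# Scale factors quoted in the Methods section.
SCALE = {"Z": 8.4, "ggH": 0.034, "VBF": 0.011}


def scale_to_expected(raw_hists):
    """Multiply each process's raw-count histogram by its scale factor."""
    return {proc: SCALE[proc] * np.asarray(hist, dtype=float)
            for proc, hist in raw_hists.items()}


# Toy raw counts per process in a 2x2 grid (made up for illustration).
raw = {
    "Z":   np.array([[1000, 200], [50, 10]]),
    "ggH": np.array([[100, 400], [900, 600]]),
    "VBF": np.array([[50, 100], [300, 1500]]),
}
expected = scale_to_expected(raw)
total = sum(expected.values())  # expected total count per bin
print(total)
```

After scaling, the Z histogram dominates most bins even where the raw signal counts were larger, which is why the fit has to work with scaled yields.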

## Likelihood Fit (Simple View)

We fit Poisson counts in all bins to extract:
- μ_ggH (gluon fusion signal strength)
- μ_VBF (vector boson fusion signal strength)
- Optionally, a single combined μ for ggH and VBF together

Precision is quoted as the relative uncertainty, (uncertainty on μ / fitted μ) × 100%. Targets: ≤ 9% for μ_ggH and ≤ 18% for μ_VBF.
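
A minimal sketch of such a fit is shown below, assuming the expected count in each bin is Z + μ_ggH × ggH + μ_VBF × VBF and that uncertainties are taken from the inverse Hessian of the negative log-likelihood. The Newton minimiser, the function name `fit_signal_strengths`, and the toy yields are choices made for this example only; the notebooks may use a different minimiser or interval method.

```python
import numpy as np


def fit_signal_strengths(observed, bkg, ggh, vbf, start=(0.8, 1.2), n_iter=50, tol=1e-10):
    """Minimise the Poisson negative log-likelihood in (mu_ggH, mu_VBF) by Newton's method.

    Expected counts per bin: nu = bkg + mu_ggH * ggh + mu_VBF * vbf.
    Returns the fitted values and their uncertainties from the inverse Hessian.
    """
    n = np.ravel(observed).astype(float)
    b = np.ravel(bkg).astype(float)
    s = np.stack([np.ravel(ggh), np.ravel(vbf)]).astype(float)  # shape (2, n_bins)
    mu = np.array(start, dtype=float)
    for _ in range(n_iter):
        nu = b + mu @ s
        grad = s @ (1.0 - n / nu)       # first derivatives of the NLL
        hess = (s * (n / nu**2)) @ s.T  # second derivatives of the NLL
        step = np.linalg.solve(hess, grad)
        mu = mu - step
        if np.max(np.abs(step)) < tol:
            break
    nu = b + mu @ s
    hess = (s * (n / nu**2)) @ s.T
    sigma = np.sqrt(np.diag(np.linalg.inv(hess)))
    return mu, sigma


# Toy scaled yields in four bins (made up; not project results).
bkg = np.array([100.0, 50.0, 10.0, 5.0])
ggh = np.array([1.0, 3.0, 6.0, 2.0])
vbf = np.array([0.2, 0.5, 1.0, 4.0])
observed = bkg + ggh + vbf  # pseudo-data equal to the expectation at mu = 1

mu, sigma = fit_signal_strengths(observed, bkg, ggh, vbf)
precision = 100.0 * sigma / mu  # relative precision in percent
print(mu, sigma, precision)
```

Because the pseudo-data are set equal to the expectation at μ = 1, the fit should return both signal strengths at 1, and the printed precision reflects only these toy yields.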

## Conclusion & Simple Next Steps

The multiclass, per‑channel XGBoost approach provides clearer separation and enables measuring μ_ggH and μ_VBF with useful precision. The probability‑based 2D binning is a practical improvement over a single binary score.

Simple next steps:
- Try a few more x/y bin combinations.
- Tune XGBoost depth and learning rate.
- Calibrate the classifier probabilities (e.g. with isotonic regression) if needed.

Overall, the notebooks show a complete path: clean data → train models → build smart bins → fit μ values. This matches the project goal in a straightforward way.