# The Experiment
This experiment uses custom modules in the *backend* folder, which have been specifically developed for this third-year lab. This notebook works directly with the **13 TeV 2025 ATLAS Open Data**. 

Please run the cell below to install the required packages. You will need to do this **each time you start the server**.

In [None]:
!pip install atlasopenmagic
!pip install pyarrow==20.0.0
from atlasopenmagic import install_from_environment
install_from_environment(environment_file="../backend/environment.yml")

Next, run the cell below to import the required modules and functions for the experiment. Repeat this step **every time you restart the kernel**.

In [None]:
import os
import re
import awkward as ak 
import vector
import time
import datetime
from zoneinfo import ZoneInfo
import uproot
import glob
import numpy as np
import hist
from hist import Hist
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator # for minor ticks
import sys
sys.path.append('../')
from backend import analysis_uproot, DIDS_DICT, VALID_SKIMS, plot_stacked_hist, plot_histograms, analysis_parquet, plot_errorbars

## Accessing Data Samples
Data samples stored in the *backend* folder are selected with two inputs to the `analysis_uproot` function: a `skim` **and** a *string code*. The available values of each are listed below:

#### `Skim` (Final-State Filters)
* `'2to4lep'` - Events with two to four leptons, each with at least 7 GeV of transverse momentum $p_T$
* `'exactly4lep'` - Events with exactly four leptons, each with at least 7 GeV of transverse momentum $p_T$
* `'GamGam'` - Events with at least two photons, each with at least 25 GeV of $p_T$

#### String Codes
* `'Data'` - Real data
* `'Zee'` - Simulated $Z \rightarrow e^+e^-$ events
* `'Zmumu'` - Simulated $Z \rightarrow \mu^+\mu^-$ events
* `'Hyy'` - Simulated $H \rightarrow \gamma \gamma$ events

> **Note**: The string code for real data is always `'Data'`, regardless of the `skim` used.

To combine multiple datasets, combine the string codes using `'+'`. For example, if you would like to combine the $Z \rightarrow \mu^+\mu^-$ and $Z \rightarrow e^+e^-$ datasets, use the string code `'Zee+Zmumu'`.
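
As an illustration of how a combined string code breaks down into its parts, here is a minimal sketch. The helper `split_string_code` and the small `KNOWN_CODES` set are hypothetical stand-ins written for this example; they are not part of the *backend* module, and the way `analysis_uproot` parses the code internally may differ.

```python
# Hypothetical stand-in for the available string codes (illustration only)
KNOWN_CODES = {'Data', 'Zee', 'Zmumu', 'Hyy'}


def split_string_code(string_code):
    """Split a combined string code such as 'Zee+Zmumu' into its parts."""
    parts = [part.strip() for part in string_code.split('+')]
    # Report every part that is not a recognised string code
    unknown = [part for part in parts if part not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"Unknown string code(s): {unknown}")
    return parts


print(split_string_code('Zee+Zmumu'))  # ['Zee', 'Zmumu']
```

Checking each part separately gives a clear error message when one dataset name in a long combination is misspelled.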

All available `skim` values can be viewed with `VALID_SKIMS`, and all string codes with `DIDS_DICT.keys()`.

In [None]:
VALID_SKIMS

In [None]:
DIDS_DICT.keys()

To use `analysis_uproot`, the following arguments must be provided. In the calls below they are passed positionally in the order `skim`, `string_code_dict`, `luminosity`, `fraction`, `read_variables`, `save_variables`.
- `string_code_dict` (*dict*) : Maps dataset labels to string codes
  - **Keys**: Labels used in the output dictionary of Awkward Arrays
    - The key for real data must include `'Data'`. When this is detected, `analysis_uproot` skips the weight calculation for that sample to reduce memory usage.
  - **Values**: The corresponding string codes
  - **Example**:
    ```
    string_code_dict = {
        'Data' : 'Data',
        'Signal Zee' : 'Zee',
        'Background Wenu+Wmunu' : 'Wenu+Wmunu'
    }
    ```
- `skim` (*str*) : Final-state filter, e.g. `'2to4lep'`
- `luminosity` (*float*) : Integrated luminosity in fb⁻¹, e.g. `36.6`
- `fraction` (*float*) : Fraction of each dataset to read, e.g. `0.1` for 10%
- `read_variables` (*list of str*) : Names of ROOT branches to read
- `save_variables` (*list of str*) : Variables to keep in memory or write to disk

Optional arguments:
- `cut_function` (*callable*) : Function that receives the data as its single argument and returns the processed data. Can be used to:
    - Apply selection cuts
    - Compute derived variables using the data in `read_variables`, e.g. invariant mass
- `local_files` (*bool*) : Set to `False` to stream the sample files, `True` to access the local files
- `sample_path` (*str*) : Path to the sample files, either for accessing existing files or downloading new ones (default: `'../backend/datasets'`)
- `write_parquet` (*bool*) : Set to `True` to write output to Parquet files
- `output_directory` (*str or None*) :
   - Directory where files will be saved
   - If `None`, a unique folder in `output/` will be created using the current date and time
- `write_txt` (*bool*) : Set to `True` to write a summary log of the run
- `txt_filename` (*str or None*) : Filename for the summary log. If not provided, a unique file will be created in the `txt/` folder.
- `return_output` (*bool*)
  - `True` (default) : Returns the output dictionary of Awkward Arrays
  - `False` : Nothing is returned (saves memory when only writing to disk)
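
To show why `luminosity` matters for simulated samples, the sketch below normalises Monte Carlo events to a given integrated luminosity. It is an assumption based on the usual ATLAS Open Data convention (cross-section in pb, weights divided by the sample's sum of weights), not a description of what `analysis_uproot` does internally. The function `total_weight`, the `sum_of_weights` input and the toy numbers are all made up for this illustration.

```python
import numpy as np


def total_weight(luminosity_fb, xsec_pb, filteff, kfac, mc_weight,
                 scale_factors, sum_of_weights):
    """Per-event weight that scales a simulated sample to the data luminosity.

    Assumes the cross-section is in pb and the luminosity in fb^-1.
    """
    lumi_pb = luminosity_fb * 1000.0  # 1 fb^-1 = 1000 pb^-1
    # Multiply all per-event scale factors together (rows = different factors)
    combined_sf = np.prod(scale_factors, axis=0)
    return (lumi_pb * xsec_pb * filteff * kfac
            * mc_weight * combined_sf / sum_of_weights)


# Toy inputs: three events and two kinds of scale factor
mc_weight = np.array([1.0, -1.0, 1.0])
scale_factors = np.array([[1.0, 0.9, 1.1],
                          [1.0, 1.0, 0.5]])
weights = total_weight(36.6, 2.0, 1.0, 1.0, mc_weight, scale_factors,
                       sum_of_weights=1000.0)
print(weights)
```

Real data needs no such weight, which is consistent with the weight calculation being skipped for samples labelled `'Data'`.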

In [None]:
# Investigate weights
string_code_dict = {
#    'Data' : 'Data',
#    'Signal Zee' : 'Zee',
    'Signal Zmumu' : 'Zmumu',
#    'Background Wenu+Wmunu' : 'Wenu+Wmunu',
#    'Background VV4l+H4l' : 'VV4l+H4l'
}
skim = '2to4lep'
luminosity = 36.6
fraction = 0.01
read_variables = ['lep_n', 'lep_pt', 'lep_type',  'lep_isLooseID',
 'lep_isMediumID',
 'lep_isTightID',
 'lep_isLooseIso',
 'lep_isTightIso',
 'filteff', 'kfac', 'xsec', 'mcWeight', 'ScaleFactor_PILEUP',  'ScaleFactor_ELE', 'ScaleFactor_MUON', 'ScaleFactor_LepTRIGGER']

save_variables = read_variables

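def pt_30_cut(data):
    # Keep events with exactly two leptons
    data = data[data['lep_n'] == 2]

    # Require pT > 15 GeV for the first lepton and pT > 30 GeV for the second
    pt_cut = (data['lep_pt'][:, 0] > 15) & (data['lep_pt'][:, 1] > 30)
    data = data[pt_cut]
    # Keep muon pairs: lep_type is 13 for muons, so the sum is 26
    type_cut = (data['lep_type'][:, 0] + data['lep_type'][:, 1] == 26)
    data = data[type_cut]

    return data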

pt30_data = analysis_uproot(skim, string_code_dict, luminosity, fraction, 
                            read_variables, save_variables, cut_function=pt_30_cut)
pt30_data

In [None]:
# Investigate ordering of pt vector
string_code_dict = {
#    'Data' : 'Data',
#    'Signal Zee' : 'Zee',
#    'Signal Zmumu' : 'Zmumu',
    'Signal H4l' : 'H4l',
#    'Background Wenu+Wmunu' : 'Wenu+Wmunu',
#    'Background VV4l+H4l' : 'VV4l+H4l'
    'Background VV4l' : 'VV4l'
}
#skim = '2to4lep'
skim = 'exactly4lep' 
luminosity = 36.6
fraction = 0.3
read_variables = ['lep_n', 'lep_pt', 'lep_type']

save_variables = read_variables

def pt_30_cut(data):
    data = data[data['lep_n'] == 4]
 
    data['sum_lep_type'] = data['lep_type'][:, 0] + data['lep_type'][:, 1] + data['lep_type'][:, 2] + data['lep_type'][:, 3] 
    # Drop events in which all four leptons share a flavour:
    # a lep_type sum of 44 means four electrons, 52 means four muons
    data = data[(data['sum_lep_type']!=44) & (data['sum_lep_type']!=52)]

    data['pt_diff_1'] = data['lep_pt'][:, 0] - data['lep_pt'][:, 1]
    data['pt_diff_2'] = data['lep_pt'][:, 1] - data['lep_pt'][:, 2]
    data['pt_diff_3'] = data['lep_pt'][:, 2] - data['lep_pt'][:, 3]

    return data
    
pt30_data = analysis_uproot(skim, string_code_dict, luminosity, fraction, 
                            read_variables, save_variables, cut_function=pt_30_cut)
pt30_data
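
The cuts on the sum of `lep_type` rely on electrons and muons carrying the values 11 and 13. The short sketch below enumerates which flavour combination each possible sum corresponds to. The helper `flavour_sums` is hypothetical and written only for this illustration.

```python
from itertools import combinations_with_replacement

# Values stored in lep_type: 11 for electrons, 13 for muons
LEPTON_TYPES = {11: 'e', 13: 'mu'}


def flavour_sums(n_leptons):
    """Map each possible lep_type sum to the flavour combination behind it."""
    sums = {}
    for combo in combinations_with_replacement(sorted(LEPTON_TYPES), n_leptons):
        label = ' '.join(LEPTON_TYPES[lep_type] for lep_type in combo)
        sums[sum(combo)] = label
    return sums


print(flavour_sums(2))  # {22: 'e e', 24: 'e mu', 26: 'mu mu'}
for total, label in flavour_sums(4).items():
    print(total, label)
```

Because each sum maps to exactly one combination, a single comparison on the sum is enough to select a flavour content. The sum carries no charge information, so opposite-sign requirements need `lep_charge` as well.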

In [None]:
# run preselection to make the default '2to4lep' parquet files
string_code_dict = {
#    '2to4lep' : 'Data',
#    'H4l' : 'H4l',
#    'VV4l' : 'VV4l',
#    'Zee' : 'Zee',
#    'Zmumu' : 'Zmumu',
    'm10_40_Zee' : 'm10_40_Zee', 
    'm10_40_Zmumu' : 'm10_40_Zmumu', 
    'm10_40_Ztautau' : 'm10_40_Ztautau',
#    'ttbar' : 'ttbar',
#    'Wenu' : 'Wenu',
#    'Wmunu' : 'Wmunu',
#    'Ztautau' : 'Ztautau',
#    'Wtaunu' : 'Wtaunu',
#    'ttV' : 'ttV',
#    'VVV' : 'VVV',
    'VV3l' : 'VV3l',
    'VBF_Zll' : 'VBF_Zll',
}

skim = '2to4lep'
luminosity = 36.6
fraction = 1

read_variables = ['lep_n',
 'lep_pt',
 'lep_eta',
 'lep_phi',
 'lep_e',
 'lep_ptvarcone30',
 'lep_topoetcone20',
 'lep_type',
 'lep_charge',
 'lep_isLooseID',
 'lep_isMediumID',
 'lep_isTightID',
 'lep_isLooseIso',
 'lep_isTightIso',
 'trigE',
 'trigM',
 'lep_isTrigMatched']

save_variables = read_variables

def preselection_2to4lep_cut(data):
    # Preselection cuts for writing out skim = '2to4lep' parquet files

    # Require that a single-lepton trigger fired
    data = data[data['trigM'] | data['trigE']]

    # In two-lepton events, require that one of the two leptons has
    # pt > 27 GeV and is trigger matched; events with more than two
    # leptons pass this step unchanged
    lead_ok = (data['lep_pt'][:, 0] > 27) & data['lep_isTrigMatched'][:, 0]
    sublead_ok = (data['lep_pt'][:, 1] > 27) & data['lep_isTrigMatched'][:, 1]
    data = data[lead_ok | sublead_ok | (data['lep_n'] > 2)]
    return data

analysis_uproot(skim, string_code_dict, luminosity, fraction,
                read_variables, save_variables, local_files=True,
                cut_function=preselection_2to4lep_cut,
                output_directory='../backend/parquet2',
                write_parquet=True, return_output=False)


In [None]:
string_code_dict = {
#    'Data' : 'Data',
#    'Signal Zee' : 'Zee',
    'Signal Zmumu' : 'Zmumu',
#    'Background Wenu+Wmunu' : 'Wenu+Wmunu',
#    'Background VV4l+H4l' : 'VV4l+H4l'
}
skim = '2to4lep'
luminosity = 36.6
fraction = 0.005
read_variables = ['lep_n', 'lep_pt',
                 'lep_eta', 'lep_phi', 'lep_e']
save_variables = read_variables

def pt_30_cut(data):
    data = data[data['lep_n'] == 2]
    
    pt_cut = (data['lep_pt'][:, 0] > 30) & (data['lep_pt'][:, 1] > 30)
    data = data[pt_cut]
    # Define four momentum
    four_momentum = vector.zip({
        'pt': data['lep_pt'],
        'eta' : data['lep_eta'],
        'phi' : data['lep_phi'],
        'E' : data['lep_e']
    })
    # Add the 4-momentum of the two leptons in each event and get the 
    # invariant mass using .M
    data['mass'] = (four_momentum[:, 0] + four_momentum[:, 1]).M
    return data

pt30_data = analysis_uproot(skim, string_code_dict, luminosity, fraction, 
                            read_variables, save_variables, cut_function=pt_30_cut)
pt30_data
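
For reference, the invariant mass returned by `.M` can be written out by hand: $m = \sqrt{(E_1 + E_2)^2 - |\vec{p}_1 + \vec{p}_2|^2}$ in natural units, with $p_x = p_T \cos\phi$, $p_y = p_T \sin\phi$ and $p_z = p_T \sinh\eta$. The sketch below is a NumPy-only illustration on toy values, not the `vector` implementation. The function name `invariant_mass` and the input numbers are made up for the example.

```python
import numpy as np


def invariant_mass(pt, eta, phi, energy):
    """Invariant mass of the lepton pair in each event.

    Every input has shape (n_events, 2): one column per lepton.
    """
    # Cartesian momentum components of each lepton
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    # Sum the two four-momenta component by component
    e_tot = energy.sum(axis=1)
    p_squared = px.sum(axis=1)**2 + py.sum(axis=1)**2 + pz.sum(axis=1)**2
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(e_tot**2 - p_squared, 0.0))


# Toy event: two massless leptons back to back in the transverse plane
pt = np.array([[45.6, 45.6]])
eta = np.array([[0.0, 0.0]])
phi = np.array([[0.0, np.pi]])
energy = np.array([[45.6, 45.6]])
mass = invariant_mass(pt, eta, phi, energy)
print(mass)
```

In this toy event the momenta cancel, so the mass equals the summed energy of the two leptons.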

To write the data to disk, set `write_parquet=True`. When saving output to Parquet files, setting `return_output=False` is recommended to reduce memory usage.

In [None]:
def pt_20_cut(data):
    data = data[data['lep_n'] == 2]

    pt_cut = (data['lep_pt'][:, 0] > 20) & (data['lep_pt'][:, 1] > 20)
    data = data[pt_cut]
    
    # Define four momentum
    four_momentum = vector.zip({
        'pt': data['lep_pt'],
        'eta' : data['lep_eta'],
        'phi' : data['lep_phi'],
        'E' : data['lep_e']
    })
    # Add the 4-momentum of the two leptons in each event and get the 
    # invariant mass using .M
    data['mass'] = (four_momentum[:, 0] + four_momentum[:, 1]).M
    return data

output_dir = 'output/pt20cut'
pt20_data = analysis_uproot(skim, string_code_dict, luminosity, fraction, 
                            read_variables, save_variables, cut_function=pt_20_cut,
                            write_txt=False, txt_filename=None, local_files=True,
                            write_parquet=True, output_directory=output_dir,
                            return_output=False)

## Reading Parquet Files
You can read the files saved by `analysis_uproot` using the `analysis_parquet` function. Specify the directory you want to read from using the `read_directory` keyword argument. If `subdirectory_names` is not provided, the function will attempt to read all subfolders.  
_**Tips**_: For large datasets, restart the kernel beforehand to clear memory, and avoid reading too many variables at once.

In [None]:
read_directory = 'output/pt20cut'
read_variables = ['lep_pt', 'mass']
pt20_data = analysis_parquet(read_variables, 
                             read_directory=read_directory, 
                             subdirectory_names=None,
                             fraction=1, cut_function=None, 
                             write_parquet=False,
                             output_directory=None, return_output=True)
pt20_data

The same function can be used to apply selection cuts and write the filtered data to disk. Note that the keys of the returned dictionary are constructed by combining the subdirectory names with the `fraction`, separated by `' x'`.
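
A minimal sketch of that naming scheme is shown below, using a temporary directory in place of a real output folder. The helper `build_keys` is hypothetical, and both the one-subfolder-per-label layout and the exact formatting of the fraction are assumptions; the keys produced by `analysis_parquet` may differ in detail.

```python
import tempfile
from pathlib import Path


def build_keys(read_directory, fraction):
    """Build one key per subfolder: '<subfolder name> x<fraction>'."""
    subdirs = sorted(path.name for path in Path(read_directory).iterdir()
                     if path.is_dir())
    return [f"{name} x{fraction}" for name in subdirs]


# Create a throwaway directory with one subfolder per dataset label
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Signal Zmumu', 'Data'):
        (Path(tmp) / name).mkdir()
    keys = build_keys(tmp, fraction=1)

print(keys)  # ['Data x1', 'Signal Zmumu x1']
```

Keeping the fraction in the key makes it visible in plot legends how much of each dataset was read.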

## Plotting Stacked Histograms
The dictionary returned by `analysis_uproot` can be passed directly to `plot_stacked_hist` to plot stacked histograms.

In [None]:
# Variable to plot on the x-axis
plot_variable = 'pt_diff_3'

# Define plot appearance
color_list = ['b', 'r']
xmin, xmax = -100, 100 # Define histogram bin range and x-axis limits 
num_bins = 200 # Number of histogram bins
x_label = 'pt[2] - pt[3] [GeV]' # x-axis label 

# Plot the histogram
fig, hists = plot_stacked_hist(pt30_data, plot_variable, color_list,
                               num_bins, xmin, xmax, x_label,
                               show_text=False, save_fig=True)

In [None]:
# Variable to plot on the x-axis
plot_variable = 'mass'

# Define plot appearance
color_list = ['k', 'yellow', 'purple', 'g', 'm']
xmin, xmax = 0, 120 # Define histogram bin range and x-axis limits 
num_bins = 500 # Number of histogram bins
x_label = 'mass [GeV]' # x-axis label 

# Plot the histogram
fig, hists = plot_stacked_hist(pt30_data, plot_variable, color_list,
                               num_bins, xmin, xmax, x_label,
                               show_text=False, save_fig=False)
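
To make the idea of a stacked histogram concrete, the sketch below computes the heights of each layer with NumPy alone. It illustrates the concept and is not how `plot_stacked_hist` is implemented. The function `stacked_counts` and the toy samples are made up for the example.

```python
import numpy as np


def stacked_counts(samples, bin_edges):
    """Return the top edge of each layer in a stacked, weighted histogram.

    samples maps a label to a (values, weights) pair; layers are stacked
    in the order in which the labels appear.
    """
    running_total = np.zeros(len(bin_edges) - 1)
    tops = {}
    for label, (values, weights) in samples.items():
        counts, _ = np.histogram(values, bins=bin_edges, weights=weights)
        running_total = running_total + counts  # add this layer on top
        tops[label] = running_total.copy()
    return tops


bin_edges = np.linspace(0, 120, 4)  # three bins: 0-40, 40-80, 80-120 GeV
samples = {
    'Background': (np.array([10.0, 50.0, 90.0]), np.array([0.5, 0.5, 0.5])),
    'Signal': (np.array([85.0, 95.0, 100.0]), np.array([2.0, 2.0, 2.0])),
}
tops = stacked_counts(samples, bin_edges)
for label, heights in tops.items():
    print(label, heights)
```

The last layer's top edge is the total prediction in each bin, which is the quantity to compare against the data points.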

If you would like to plot many variables at once, you can use the `plot_histograms` function.

In [None]:
# Plot many variables at once 
plot_variables = ['lep_type[0]','lep_type[1]','lep_type[2]','lep_type[3]']
xmin_xmax_list = (10.5, 13.5) # Bin range for all variables
# Define plot appearance
color_list = ['b', 'r']
num_bins_list = 3
x_label_list = plot_variables

figure_list, hists_list = plot_histograms(pt30_data,
                                          plot_variables,
                                          color_list,
                                          xmin_xmax_list,
                                          num_bins_list,
                                          x_label_list, logy=False, residual_plot=False, save_fig=True)

In [None]:
# Plot many variables at once 
plot_variables = ['lep_isLooseID[0]', 'lep_isMediumID[0]', 'lep_isTightID[0]',
                  'lep_isLooseIso[0]', 'lep_isTightIso[0]',
                  'totalWeight', 'mcWeight', 'filteff', 'kfac', 'xsec',
                  'ScaleFactor_PILEUP', 'ScaleFactor_ELE', 'ScaleFactor_MUON',
                  'ScaleFactor_LepTRIGGER']
# Bin range for each variable, in the same order as plot_variables
xmin_xmax_list = [(-0.05, 1.5), (-0.05, 1.5), (-0.05, 1.5), (-0.05, 1.5), (-0.05, 1.5),
                  (-20, 20), (-500000000, 500000000), (0, 1), (1, 2), (2000, 2500),
                  (0, 5), (0, 5), (0, 5), (0, 5)]
# Define plot appearance
color_list = ['b']
num_bins_list = 200
x_label_list = plot_variables  # Use the variable names as x-axis labels

figure_list, hists_list = plot_histograms(pt30_data,
                                          plot_variables,
                                          color_list,
                                          xmin_xmax_list,
                                          num_bins_list,
                                          x_label_list, logy=False, residual_plot=False, save_fig=True)

In [None]:
# You can use the Figure object returned by the function to save the figure as an image
figure_list[0].savefig('test.png', dpi=500)

In [None]:
hists_list[1]

