In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.integrate import quad
import tensorflow as tf
from sklearn.utils import shuffle
import sys, os
from iminuit import Minuit
from scipy.stats import chi2
from scipy.special import erfinv
from sklearn.preprocessing import StandardScaler, PowerTransformer, OneHotEncoder
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from collections.abc import Sequence
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.metrics import accuracy_score

# **¡¡¡ README !!!**

In this project the CSV files containing the signal and background events are not included.

This notebook assumes that the directory stored in the variable ```DATAFILE_DIR``` contains the four CSV files.
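
Before running the rest of the notebook, it can help to confirm that the data directory really holds the four files. The following is a minimal sketch, not part of the original analysis: the helper `find_missing_files` is hypothetical, and the file names are assumed to be the dataset names used later in the notebook with a `.csv` extension.

```python
import os
import tempfile

# File names assumed from the dataset keys used later in this notebook
EXPECTED_FILES = ["Top.csv", "Diboson.csv", "ggH1000.csv", "Zjets.csv"]

def find_missing_files(data_directory, expected_files=EXPECTED_FILES):
    """Return the expected CSV files that are absent from data_directory (hypothetical helper)."""
    present = set(os.listdir(data_directory))
    return [name for name in expected_files if name not in present]

# Demonstration on a temporary directory holding only two of the four files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["Top.csv", "Zjets.csv"]:
        open(os.path.join(tmp_dir, name), "w").close()
    missing = find_missing_files(tmp_dir)

print(missing)  # ['Diboson.csv', 'ggH1000.csv']
```

In the notebook itself the same helper could be called on ```DATAFILE_DIR``` once that constant has been defined.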

# **Defining notebook constants**

Here we define the global constants that are used throughout the notebook and that we do not plan to change.

In [None]:
DATAFILE_DIR = "/home/giorgio/data/report_4" + "/"
# DATAFILE_DIR = "/home/s1835083/Desktop/report_4_data/"
KEYS = {
    "features" : ["lep1_pt", "lep2_pt", "fatjet_pt", "fatjet_eta", "fatjet_D2", "Zll_mass", "Zll_pt", "MET",],
    "features_plus_mass" : ["lep1_pt", "lep2_pt", "fatjet_pt", "fatjet_eta", "fatjet_D2", "Zll_mass", "Zll_pt", "MET", "reco_zv_mass"],
    "targets"  : ["isSignal"],
    "features_and_targets" : ["lep1_pt", "lep2_pt", "fatjet_pt", "fatjet_eta", "fatjet_D2", "Zll_mass", "Zll_pt", "MET", "isSignal"],
    "weights"  : ["FullEventWeight"],
} 
SEED = 42
N_BINS = 75
SUBRANGE = (0.7e6, 1.5e6)

# **Dataloading**

Before any analysis, we first need to load the signal and background datasets from the CSV files.

To do this, we create a function called ```load_data``` which returns a dictionary containing one dataframe per CSV file in the ```DATAFILE_DIR``` directory. The key of each dictionary entry is the name of the corresponding CSV file without its extension. Hence the keys of the dictionary will be:

- Top
- Diboson
- ggH1000
- Zjets

The dictionary structure is used so that we can easily iterate over it to apply functions to all four dataframes.

In [None]:
# Load in data from csv
print(os.listdir(DATAFILE_DIR))

def load_data(data_directory):
    """
    Function which will load in all 4 csv files as pandas dataframes.
    """

    # Define dictionary which will contain each pandas dataframe
    dataset_dictionary = {}

    # Load in the CSV files by iterating over all files in the directory
    for file_name in os.listdir(data_directory):
        # Use the file name without its extension as the dictionary key
        key = os.path.splitext(file_name)[0]

        # Build the path from the function argument rather than the global constant
        file_path = os.path.join(data_directory, file_name)
        dataset_dictionary[key] = pd.read_csv(file_path, sep=",", header=0, comment="#", index_col=0)
    return dataset_dictionary

# Load in the datasets
dataset_dictionary = load_data(DATAFILE_DIR)


Now that we have loaded our datasets, we will do some basic exploration. First we remove every row containing a NaN from each of the dataframes. After this, we print a summary of each dataset to get a basic overview of what it contains.

In [None]:
def remove_nans(dataset_dictionary):
    """
    Remove all NaNs in a dictionary containing multiple pandas dataframes
    """    

    # iterate over all datasets
    for key in dataset_dictionary:
        # Remove entries with nans
        dataset_dictionary[key].dropna(inplace=True)

def print_summaries(dataset_dictionary):
    """
    Print a summary of each dataframe in a dictionary
    """

    # iterate over all datasets
    for key in dataset_dictionary:
        # Print the dataframe summary (info() prints directly and returns None)
        print(f"##### {key} #####")
        dataset_dictionary[key].info()
        print("\n")


# Remove Nans for all dataframes
remove_nans(dataset_dictionary)

# Print summary of all datasets
print_summaries(dataset_dictionary)


To further analyse each of the datasets, we print the first 6 rows of each dataframe to get a better understanding of what they contain, and of the units and datatypes used to represent them.

In [None]:
dataset_dictionary["Top"].head(6)

In [None]:
dataset_dictionary["Zjets"].head(6)

In [None]:
dataset_dictionary["Diboson"].head(6)

In [None]:
dataset_dictionary["ggH1000"].head(6)

Now that we have done our basic preprocessing and a cursory analysis of the entries in each of the dataframes, we want to look in more depth at the distributions in each dataset. Given that we will have to apply cuts to some of the parameters, it is in our interest to visualise the distributions so that we can tell where our initial cuts should be placed.

In order to do this, we write a function which plots, for each dataset in our dictionary, the distribution of each kinematic parameter that we have labelled as a feature in the global constants. This makes it simple to replot the distributions every time we apply cuts, so that we can see the effect of each set of cuts.

After defining the plotting function, we plot the distributions before any cuts.

The parameters that are plotted every time we call the function are:

- lep1_pt
- lep2_pt
- fatjet_pt
- fatjet_eta
- fatjet_D2
- Zll_mass
- Zll_pt
- MET
- reco_zv_mass

It should be noted that the plotted distributions include the full event weights, hence we get the correct scaling of the signal events compared to the background events.

In [None]:
def plot_kinematic_parameters(dataset_dictionary):
    """
    Function which will plot the kinematic parameters which will be used as inputs for the neural network
    """

    # Define plotting parameters for each dataset
    colors = ["darkblue", "maroon", "darkgreen", "violet"]
    x_labels = ["Momentum (MeV/c)", "Momentum (MeV/c)", "Momentum (MeV/c)", "Pseudorapidity", r"$D_2$", r"Mass (MeV/$c^2$)", "Momentum (MeV/c)", "Energy (MeV)", r"Mass (MeV/$c^2$)"]

    fig, ax = plt.subplots(3, 3, figsize=(10,10))

    # Iterate over every axis in the subplot to plot a specific kinematic parameter
    for idx, (parameter_key, axis) in enumerate(zip(KEYS["features_plus_mass"], ax.flatten())):

        # Iterate over the datasets, weighting each entry by its full event weight
        for color, (dataset_key, dataset) in zip(colors, dataset_dictionary.items()):
            axis.hist(dataset[parameter_key], bins=50, histtype="step", log=True, label=dataset_key, color=color, weights=dataset[KEYS["weights"][0]])

        # Format the axis once all datasets have been drawn
        axis.legend()
        axis.set(
            title=parameter_key,
            ylabel="Weighted events",
            xlabel=x_labels[idx]
        )
        axis.ticklabel_format(axis='x', style='sci', scilimits=(0,0))

    plt.tight_layout()

# Plot the kinematic parameters
print("KINEMATIC PARAMETERS: NO CUTS\n")
plot_kinematic_parameters(dataset_dictionary)

Now we define the cuts which we will apply to each kinematic parameter in each dataset.

We define each set of cutoffs in its own dictionary in order to facilitate applying the cuts. To apply a new set of cuts, all we have to do is define a new dictionary and pass it to the function which applies the cuts.

Each dictionary contains the lower and upper bound of the cut that we wish to make on each parameter. By including a lower bound, we can make our cuts harsher if we wish.

There are three sets of cutoffs which have been analysed:

- Standard cuts: These cuts have been found to have efficiencies of:
    - Signal efficiency: 77.1%
    - Background efficiency: 11.2%

- Harsh cuts: These are a set of harsher cuts aimed at reducing the background efficiency while limiting the impact on the signal efficiency:
    - Signal efficiency: 47.5%
    - Background efficiency: 3.0%

- Loose cuts: These are a set of looser cuts aimed at increasing the signal efficiency, with no regard for the resulting background efficiency:
    - Signal efficiency: 99.6%
    - Background efficiency: 96.6%

Harsh cuts increase the visibility of the signal, but at the cost of **reducing the number of signal entries by 14239**.

In [None]:
# Define a set of harsh cuts to improve signal visibility
harsh_cut_dictionary =  {
    # Define the kinematic parameter to be cut
    "lep1_pt":  {
        # Define bounds of cut
        "minimum"   : 1.2e5,
        "maximum"   : 6e5,
    },
    "lep2_pt":  {
        "minimum"   : 5e4,
        "maximum"   : 3.5e5,
    },
    "fatjet_pt":  {
        "minimum"   : 2e5,
        "maximum"   : 7.8e5,
    },
    "fatjet_D2":    {
        "minimum"   : 0.0,
        "maximum"   : 1.2, 
    },
    "Zll_mass":    {
        "minimum"   : 6e4,  
        "maximum"   : 1.3e5,
    },
    "Zll_pt":    {
        "minimum"   : 2.2e5,  
        "maximum"   : 6.8e5,
    },
    "MET":    {
        "minimum"   : 1e2,  
        "maximum"   : 2.5e5,
    },
}

# Define a set of loose cuts
loose_cut_dictionary =  {
    # Define the kinematic parameter to be cut
    "lep1_pt":  {
        # Define bounds of cut
        "minimum"   : 0,
        "maximum"   : 7.5e5,
    },
    "lep2_pt":  {
        "minimum"   : 0,
        "maximum"   : 4e5,
    },
    "fatjet_pt":  {
        "minimum"   : 0,
        "maximum"   : 1.2e6,
    },
    "fatjet_D2":    {
        "minimum"   : 0,
        "maximum"   : 1.4e1, 
    },
    "Zll_mass":    {
        "minimum"   : 0,  
        "maximum"   : 5e5,
    },
    "Zll_pt":    {
        "minimum"   : 0,  
        "maximum"   : 1.25e6,
    },
    "MET":    {
        "minimum"   : 0,  
        "maximum"   : 4.4e5,
    },
}

# Define a set of standard cuts
standard_cut_dictionary =  {
    # Define the kinematic parameter to be cut
    "lep1_pt":  {
        # Define bounds of cut
        "minimum"   : 9e4,
        "maximum"   : 5e5,
    },
    "lep2_pt":  {
        "minimum"   : 2.5e4,
        "maximum"   : 6.5e5,
    },
    "fatjet_pt":  {
        "minimum"   : 2.5e5,
        "maximum"   : 7.5e5,
    },
    "fatjet_D2":    {
        "minimum"   : 1e-2,
        "maximum"   : 2.5, 
    },
    "Zll_mass":    {
        "minimum"   : 7.5e4,  
        "maximum"   : 1.05e5,
    },
    "Zll_pt":    {
        "minimum"   : 2.5e5,  
        "maximum"   : 8e5,
    },
    "MET":    {
        "minimum"   : 50,  
        "maximum"   : 1e6,
    },
}

Now that we have created the dictionaries which contain our cuts, we must write a function which is capable of applying the cuts to each of our datasets.

To do this, the function loops over each dataset in the dataset dictionary and uses the ```query``` method to keep only the entries that lie within the bounds defined by the cuts for each kinematic parameter.

The function then returns two dictionaries: a copy of the original dataset dictionary prior to applying the cuts, and a dataset dictionary where each of the dataframes has had the kinematic cuts applied.
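
To make the selection step concrete, here is a minimal sketch of a window cut with ```DataFrame.query``` on a toy dataframe. The values in `toy_events` are made up for illustration, and `toy_cuts` simply mirrors the layout of the cut dictionaries above. Note that both bounds are strict inequalities, so entries exactly on a bound are removed.

```python
import pandas as pd

# Toy dataframe with made-up values (MeV), for illustration only
toy_events = pd.DataFrame({
    "lep1_pt": [8.0e4, 1.5e5, 3.0e5, 7.0e5],
    "Zll_mass": [9.1e4, 9.0e4, 2.0e5, 9.2e4],
})

# Toy cut dictionary with the same layout as the cut dictionaries above
toy_cuts = {
    "lep1_pt": {"minimum": 1.2e5, "maximum": 6e5},
    "Zll_mass": {"minimum": 6e4, "maximum": 1.3e5},
}

# Build one query string that combines every window with a logical AND
query_string = " & ".join(
    f"{name} > {bounds['minimum']} & {name} < {bounds['maximum']}"
    for name, bounds in toy_cuts.items()
)
selected = toy_events.query(query_string)

print(query_string)
print(selected)  # only the row with index 1 passes both windows
```

Chaining one ```query``` call per parameter, as the notebook function does, selects the same rows as this single combined query.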

In [None]:
def apply_cuts(dataset_dictionary, cut_dictionary):
    """
    Make copies of the dataset dictionary and apply the cuts passed to the function.
    Return the untouched copy together with the copy that has the cuts applied.
    """
    # Make deep copies of the dataset dataframes so that the input dictionary is never modified
    dictionary_dataset_precut = {key: dataframe.copy(deep=True) for key, dataframe in dataset_dictionary.items()}
    dictionary_dataset_postcut = {key: dataframe.copy(deep=True) for key, dataframe in dataset_dictionary.items()}

    # Apply the cuts
    for cut_key, bounds in cut_dictionary.items():
        # Keep entries strictly inside the (minimum, maximum) window
        query_string = f"{cut_key} > {bounds['minimum']} & {cut_key} < {bounds['maximum']}"
        for dataset_key in dictionary_dataset_postcut:
            dictionary_dataset_postcut[dataset_key] = dictionary_dataset_postcut[dataset_key].query(query_string)

    return dictionary_dataset_precut, dictionary_dataset_postcut


Now that we have defined the function which applies our kinematic cuts, we apply all three sets of cuts (harsh, standard and loose) to the dataset dictionary.

In [None]:
# Apply the harsh cuts to the dataframe
dataset_dictionary_pre_harsh_cut, dataset_dictionary_post_harsh_cut = apply_cuts(dataset_dictionary, harsh_cut_dictionary)

# Apply the standard cuts on the dataframe
dataset_dictionary_pre_standard_cut, dataset_dictionary_post_standard_cut = apply_cuts(dataset_dictionary, standard_cut_dictionary)

# Apply the loose cuts on the dataframe
dataset_dictionary_pre_loose_cut, dataset_dictionary_post_loose_cut = apply_cuts(dataset_dictionary, loose_cut_dictionary)

We will now visualise the kinematic parameter distributions after each of the three sets of cuts. This allows us to see how the cuts affected the distributions, and to spot which parameters are best suited to removing background events, namely those whose distribution has a different mode for background entries and signal entries.

This will help us tweak the cuts to improve their performance in increasing signal efficiency or reducing background efficiency.

Below are the distribution plots for all three kinematic cuts.

In [None]:
# Plot the datasets with a standard cut
print("KINEMATIC PARAMETERS: STANDARD CUTS\n")
plot_kinematic_parameters(dataset_dictionary_post_standard_cut)

In [None]:
# Plot the datasets with a harsh cut
print("KINEMATIC PARAMETERS: HARSH CUTS\n")
plot_kinematic_parameters(dataset_dictionary_post_harsh_cut)

In [None]:
# Plot the datasets with a loose cut
print("KINEMATIC PARAMETERS: LOOSE CUTS\n")
plot_kinematic_parameters(dataset_dictionary_post_loose_cut)

Something we notice from these distributions is that, for most parameters, the modes of the signal and background distributions coincide. However, for the transverse momentum distributions (```lep1_pt```, ```lep2_pt```, ```fatjet_pt``` and ```Zll_pt```), the mode of the ggH1000 distribution is shifted towards higher momenta.

Hence for these parameters, we can use cuts which remove a majority of the background entries without majorly affecting the signal distributions. These are the kinematic parameters of interest when tweaking the cuts.

Given that we still have all four datasets in individual dataframes within the dictionary, we now want to merge them into one large dataframe with mixed signal and background entries.

To do this, we create a function which concatenates all dataframes within a dataset dictionary.

In [None]:
def merge_datasets(dataset_dictionary):
    """
    Function to merge all the datasets in a dictionary 
    """
    
    # Merge the datasets
    merged_dataset = pd.concat(
        list(dataset_dictionary.values()),
        ignore_index = True
    )

    # Shuffle the merged dataset
    merged_dataset = shuffle(merged_dataset, random_state=SEED)

    # Reset indexes of merged dataset
    merged_dataset = merged_dataset.reset_index(drop=True)

    return merged_dataset

Now that we have our function to merge dataset dictionaries, we merge all the dataset dictionaries we have, including those from before and after the application of our kinematic cuts.

In [None]:
# Merge the pre and post cuts for standard cut
merge_dataset_pre_standard_cut = merge_datasets(dataset_dictionary_pre_standard_cut)
merge_dataset_post_standard_cut = merge_datasets(dataset_dictionary_post_standard_cut)

# Merge the pre and post cuts for harsh cut
merge_dataset_pre_harsh_cut = merge_datasets(dataset_dictionary_pre_harsh_cut)
merge_dataset_post_harsh_cut = merge_datasets(dataset_dictionary_post_harsh_cut)

# Merge the pre and post cuts for loose cut
merge_dataset_pre_loose_cut = merge_datasets(dataset_dictionary_pre_loose_cut)
merge_dataset_post_loose_cut = merge_datasets(dataset_dictionary_post_loose_cut)

To visualise the mass distributions for each set of kinematic cuts, we write a function which plots a stacked histogram of the entries in the reconstructed mass distribution. The two stacked distributions are the signal entries and the background entries.

Additionally, the fully weighted distribution of the combined signal and background mass distribution is plotted so that we can accurately gauge how noticeable the signal is when using the correct weighting between background events and signal events.

In [None]:
def plot_merged_datasets(merged_post, range=(0.7e6, 1.5e6), plot_title=""):
    """
    Plot the merged datasets to visualise events and the effects of the kinematic cuts applied to the entire merged dataset.
    """

    # Plot the weighted distributions
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))

    # List of events 
    events = [merged_post.query("isSignal==0")["reco_zv_mass"], merged_post.query("isSignal==1")["reco_zv_mass"]]
    labels = ["Background", "Signal"]

    # Left panel: stacked raw entries (unweighted). Right panel: total distribution weighted by FullEventWeight
    ax[0].hist(events, histtype="step", log=True, bins=75, label=labels, stacked=True, range=range)
    ax[1].hist(merged_post["reco_zv_mass"], log=False, bins=75, range=range, histtype="step", weights=merged_post["FullEventWeight"])
    
    ax[0].set_title("Entry distribution for Reco Mass", fontsize=14)
    ax[1].set_title("Event distribution for Reco Mass", fontsize=14)
    ax[0].set_xlabel("Reco Mass (MeV)", fontsize=12)
    ax[1].set_xlabel("Reco Mass (MeV)", fontsize=12)
    ax[0].set_ylabel("Number of Entries (Log)", fontsize=12)
    ax[1].set_ylabel("Number of Events", fontsize=12)
    ax[0].legend()
    fig.tight_layout()
    fig.suptitle(f"Dataset w/ {plot_title}", y=1.05, fontsize=14)

In addition to plotting the mass distributions for each set of kinematic cuts, we also want to quantify the efficiencies for signal and background entries before and after applying the cuts. The plots show the range (0.7e6, 1.5e6) MeV so that we can easily spot the signal peak.

These efficiencies help us understand just how harsh each set of cuts is, and can help us tweak the cuts if we realise we have lost signal efficiency without a reduction in background efficiency.

Lastly, we also develop a function which integrates the weighted mass distribution for both signal and background events so that we can quantify how many events remain after the kinematic cuts.

All of these functions help us judge the quality of the cuts and allow us to select a set of cuts for further analysis.
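
The distinction between raw entries and weighted events matters here: an efficiency based on row counts treats every entry equally, while the expected event yield is the sum of the event weights. The following is a minimal sketch on a toy dataframe with made-up weights; it is an illustration of the two quantities, not the notebook's own implementation.

```python
import pandas as pd

# Toy merged dataset with made-up weights, for illustration only
toy_merged = pd.DataFrame({
    "isSignal": [1, 1, 0, 0, 0],
    "FullEventWeight": [0.5, 0.25, 2.0, 1.0, 3.0],
})

# Raw entries per class: how many rows are present
entries = toy_merged.groupby("isSignal").size()

# Weighted yield per class: the sum of the event weights
yields = toy_merged.groupby("isSignal")["FullEventWeight"].sum()

print(entries)
print(yields)
```

Here the background has 3 entries but a weighted yield of 6.0, while the signal has 2 entries and a weighted yield of 0.75.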

In [None]:
def calculate_merged_efficiencies(merged_pre, merged_post):
    """
    Function will print the efficiencies on the background and signal entries after applying kinematic cuts
    """
    # Define parameters for function
    isSignal_values = [0, 1]
    isSignal_labels = ["Background", "Signal"]

    # Iterate over signal and background
    for value, label in zip(isSignal_values, isSignal_labels):
        
        # Find number of entries pre and post application of cuts
        precut_entries = merged_pre.query(f"isSignal == {value}").shape[0] 
        postcut_entries = merged_post.query(f"isSignal == {value}").shape[0] 

        # Calculate efficiency
        efficiency = postcut_entries/precut_entries

        # Print the data computed in the function
        print(f"{label}:   Efficiency: {efficiency:.1%} ##### Precut Length: {precut_entries} ##### Postcut Length: {postcut_entries}")

def integrate_dataset_events(merged_dataset):
    """
    Perform an integral on the signal and background distributions and print the result for each distribution
    """

    # Get the reco_mass and weights for all signal entries
    reco_mass_signal = merged_dataset.query("isSignal == 1")["reco_zv_mass"]
    event_weights_signal = merged_dataset.query("isSignal == 1")["FullEventWeight"]

    # Get the reco_mass and weights for all background entries
    reco_mass_background = merged_dataset.query("isSignal == 0")["reco_zv_mass"]
    event_weights_background = merged_dataset.query("isSignal == 0")["FullEventWeight"]

    # Compute the integral for the signal
    # Note: this is the area under the histogram (bin width times weighted count);
    # the event yield itself is the sum of the weights
    values_signal, bins_signal = np.histogram(reco_mass_signal, weights=event_weights_signal)
    integrated_area_signal = sum(np.diff(bins_signal)*values_signal)

    # Compute the integral for the background, using the event weights as for the signal
    values_background, bins_background = np.histogram(reco_mass_background, weights=event_weights_background)
    integrated_area_background = sum(np.diff(bins_background)*values_background)

    print(f"Number of events in signal distribution: {integrated_area_signal:.2e}")
    print(f"Number of events in background distribution: {integrated_area_background:.2e}")

Now we will plot a pairplot for all the kinematic parameters. As this plot is designed to show the correlations between the kinematic parameters, and the correlations should not be affected by the cuts, we only compute the pairplot for one of the datasets after the cuts.

We use the dataset with the harsh cuts, as we are satisfied that these cuts have removed any major outliers.

Due to the size of the datasets, creating the pairplot can take several minutes, hence the ```plot_pairplot``` flag allows the notebook to skip the plot.
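
One possible way to shorten the plotting time, offered here only as a suggestion and not as something the original analysis does, is to draw the pairplot from a random subsample. The helper `subsample_for_pairplot` and the limit of 5000 rows are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def subsample_for_pairplot(dataframe, max_rows=5000, seed=42):
    """Return at most max_rows randomly chosen rows (hypothetical helper)."""
    if len(dataframe) <= max_rows:
        return dataframe
    return dataframe.sample(n=max_rows, random_state=seed)

# Toy dataframe standing in for the merged dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "lep1_pt": rng.normal(size=20000),
    "isSignal": rng.integers(0, 2, size=20000),
})

subsample = subsample_for_pairplot(toy, max_rows=5000)
print(len(toy), len(subsample))  # 20000 5000
```

The subsample could then be passed to ```sns.pairplot``` in place of the full merged dataframe; rare signal entries may be under-represented in a small subsample, so the limit should be chosen with that in mind.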

In [None]:
plot_pairplot = False

if plot_pairplot:
    sns.pairplot(
        merge_dataset_post_harsh_cut[KEYS["features_and_targets"]],
        hue = KEYS["targets"][0],
        plot_kws=dict(marker="+", linewidth=1, alpha=0.5),
        corner = False,
    )

Now we go through each of the merged datasets after the application of the kinematic cuts, compute the efficiencies and the event integrals, and plot each of the mass distributions to visualise the visibility of the signal peak.

We do this for all sets of kinematic cuts (standard, harsh and loose) and discuss them at the end of this section.

In [None]:
# Print efficiencies and number of entries pre and post standard cuts for signal and background events
print("EFFICIENCIES AND PURITIES FOR DATASET WITH STANDARD CUT\n")
calculate_merged_efficiencies(merge_dataset_pre_standard_cut, merge_dataset_post_standard_cut)

print("\n")

# Print the number of events in the signal and background distribution
integrate_dataset_events(merge_dataset_post_standard_cut)

print("\n")

# Plot the distribution of events (weighted) pre and post standard cuts
plot_merged_datasets(merge_dataset_post_standard_cut, plot_title="Standard cuts")

In [None]:
# Print efficiencies and number of entries pre and post harsh cuts for signal and background events
print("EFFICIENCIES AND PURITIES FOR DATASET WITH HARSH CUT\n")
calculate_merged_efficiencies(merge_dataset_pre_harsh_cut, merge_dataset_post_harsh_cut)

print("\n")

# Print the number of events in the signal and background distribution
integrate_dataset_events(merge_dataset_post_harsh_cut)

print("\n")

# Plot the distribution of events (weighted) pre and post harsh cuts
plot_merged_datasets(merge_dataset_post_harsh_cut, plot_title="Harsh cuts")

In [None]:
# Print efficiencies and number of entries pre and post loose cuts for signal and background events
print("EFFICIENCIES AND PURITIES FOR DATASET WITH LOOSE CUT\n")
calculate_merged_efficiencies(merge_dataset_pre_loose_cut, merge_dataset_post_loose_cut)

print("\n")

# Print the number of events in the signal and background distribution
integrate_dataset_events(merge_dataset_post_loose_cut)

print("\n")

# Plot the distribution of events (weighted) pre and post loose cuts
plot_merged_datasets(merge_dataset_post_loose_cut, plot_title="Loose cuts")

### Discussion

From the efficiencies quoted previously, we see that the harsh cuts do the best job of reducing the number of background entries, to 3.0% of what they were before the cuts were applied. However, this comes with the downside of reducing the signal efficiency to 47.5%.

The harsh cuts did significantly increase the visibility of the signal peak though, with the standard cuts giving the second most visible signal peak and the loose cuts not showing the signal peak at all.

Another thing the cuts affect is the shape of the background distribution. From the loose cuts, we can tell that the background shape is exponential. However, as the cuts get harsher, the distribution becomes more linear. This distortion is the downside of reducing the background efficiency too much. Although the harsh cuts enhance the signal peak, they also distort the shape of the background, hence our hypothesis test may not be as accurate as we initially expect.

Nevertheless, we will use the harsh cuts for our hypothesis testing: even though the background shape is distorted, they are the cuts that best increase the visibility of the signal peak in the weighted distribution.

# **Hypothesis testing**

### Data processing and distribution visualisation

Now that we are satisfied with our processing of the distribution data, we can start to build the code which we will use to fit our null and alternative hypotheses to the reconstructed mass distribution.

We start by writing a function which takes our merged distribution after the cuts and returns individual distributions containing only the signal and only the background entries. This is done so that we can fit functions to the signal and background distributions separately.

This not only helps us find out which functions best describe the signal and background distributions, but also helps us determine initial parameter values when fitting the combined distribution.

In [None]:
def split_dataset_signal_background(dataset, subrange=(0.7e6, 1.5e6), classification_score_threshold=None):
    """
    This function will split a merged dataframe into two dataframes containing Signal and Background entries respectively. 
    The function also restricts the dataset to a reco_zv_mass subrange if `subrange` is a sequence of length 2, and
    removes signal entries scoring below `classification_score_threshold` if a threshold is provided.
    """

    # Apply subrange to dataset if subrange is a sequence with two limits
    if isinstance(subrange, Sequence) and len(subrange)==2:
        signal_background_dataset = dataset.query(f"reco_zv_mass > {subrange[0]} & reco_zv_mass < {subrange[1]}")
    else:
        signal_background_dataset = dataset

    # If a classification score threshold is provided, remove all signal entries below the threshold
    if classification_score_threshold is not None:
        # Create a dataset with all background entries and all signal entries at or above the threshold
        signal_background_dataset = pd.concat([ signal_background_dataset.query(f"isSignal==1 & nn_classification_score>={classification_score_threshold}"), signal_background_dataset.query("isSignal==0") ])    

    # Create individual datasets for signal 
    signal_dataset = signal_background_dataset.query("isSignal==1")

    # Create individual datasets for background 
    background_dataset = signal_background_dataset.query("isSignal==0")

    return signal_background_dataset, signal_dataset, background_dataset

Now we define a function which computes the histogram of the reconstructed mass distribution and returns the bin edges and the counts in each bin.

Additionally, the function computes the center of each bin so that we can plot the histogram as a scatter plot.

Lastly, the function assigns to each entry in the dataset the index of the histogram bin it falls into in the mass spectrum distribution.
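
One subtlety when combining ```np.histogram``` with ```np.digitize``` is the treatment of the largest value: ```np.histogram``` counts it in the last bin, whereas ```np.digitize``` places it past the last edge. The toy example below, with made-up mass values, illustrates the difference and one way of reconciling the two with ```np.clip```.

```python
import numpy as np

# Toy values, the last one is the maximum of the sample
masses = np.array([0.70, 0.85, 1.05, 1.50])

# Four equal-width bins spanning the data, as np.histogram does by default
counts, edges = np.histogram(masses, bins=4)

# np.digitize returns 1-based indices, so subtract one to get 0-based bin indices
raw_indices = np.digitize(masses, bins=edges) - 1

# Clipping maps the maximum back into the last bin, matching np.histogram
clipped_indices = np.clip(raw_indices, 0, len(edges) - 2)

print(counts)           # [2 1 0 1]
print(raw_indices)      # [0 0 1 4]
print(clipped_indices)  # [0 0 1 3]
```

Without the clipping step, the entry with the largest mass would carry an index that no bin has, and would be silently dropped from any per-bin sum.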

In [None]:
def compute_dataset_histograms(dataset, n_bins=60):
    """
    Compute the histogram bins and counts for a dataset. This function will also add a column to the dataset with the index 
    of the histogram bin to which each entry corresponds
    """

    # Work on a copy so that the column assignment below does not act on a slice of another dataframe
    dataset = dataset.copy()

    # Compute the counts and the bins for the dataset according to the event weights
    dataset_counts, dataset_bins = np.histogram(dataset["reco_zv_mass"], bins=n_bins, weights=dataset["FullEventWeight"])

    # Compute the center point of the bins of the dataset histogram
    dataset_bins_center = 0.5 * (dataset_bins[:-1] + dataset_bins[1:])

    # Compute the bin index of each entry; clip so that the maximum value falls in the last bin, as in np.histogram
    dataset_bin_indices = np.clip(np.digitize(dataset["reco_zv_mass"], bins=dataset_bins) - 1, 0, len(dataset_bins) - 2)
    
    # Add a column with the bin index to which each entry corresponds
    dataset["bin_index"] = dataset_bin_indices

    return dataset, dataset_counts, dataset_bins, dataset_bins_center

Given that our distributions are weighted according to the process which created each entry, we need to compute the weighted number of events and the weighted errors per bin in order to later compute the chi-squared of a fit.

Hence we write a function which takes a dataframe of a specific distribution and computes the weighted observed events in addition to the squared weighted errors.
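
For weighted entries, the number of events in a bin is the sum of the weights and its variance is the sum of the squared weights. The sketch below shows these two sums on made-up values using ```np.bincount```, followed by the chi-squared against a hypothetical model prediction `n_expected`. It is an illustration of the quantities involved, not the notebook's own implementation.

```python
import numpy as np

# Toy entries: bin index and event weight for each entry (made-up values)
bin_index = np.array([0, 0, 1, 2, 2, 2])
weights = np.array([0.5, 1.5, 2.0, 1.0, 1.0, 2.0])
n_bins = 3

# Weighted number of events per bin: sum of weights (2, 2 and 4 here)
n_observed = np.bincount(bin_index, weights=weights, minlength=n_bins)

# Variance per bin: sum of squared weights (2.5, 4 and 6 here)
sigma_squared = np.bincount(bin_index, weights=weights**2, minlength=n_bins)

# Chi-squared of a hypothetical model prediction against the weighted counts
n_expected = np.array([2.0, 2.5, 3.5])
chi_squared = np.sum((n_observed - n_expected) ** 2 / sigma_squared)

print(n_observed)
print(sigma_squared)
print(chi_squared)
```

Bins with no entries have zero variance, so in practice they need to be masked out before dividing.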

In [None]:
def find_dataset_binned_events(dataset, dataset_bins_center):
    """
    Find the weighted number of events and the squared weighted errors per bin of the dataset. 
    """
    # Create empty arrays to contain the weighted events and squared errors per bin
    n_observed = np.zeros([dataset_bins_center.size])
    sigma_squared = np.zeros([dataset_bins_center.size])

    # Iterate over all bins in the dataset
    for i in range(dataset_bins_center.size):
        # Select the weights of the entries in this bin
        bin_weights = dataset.query(f"bin_index == {i}")["FullEventWeight"]

        # Compute and assign the weighted number of observed events (sum of weights)
        n_observed[i] = np.sum(bin_weights)

        # Compute and assign the squared weighted error (sum of squared weights)
        sigma_squared[i] = np.sum(bin_weights**2)
    
    return n_observed, sigma_squared

Now that we have defined all our functions to process our datasets prior to curve fitting, we can apply them in order to obtain all relevant parameters for the signal, background and signal+background reconstructed mass distributions.

In [None]:
# Split the dataset into signal and background entries
signal_background_dataset, signal_dataset, background_dataset = split_dataset_signal_background(merge_dataset_post_harsh_cut)

# Compute the histogram bins and counts for each dataset
signal_background_dataset, signal_background_counts, _, signal_background_bins_center = compute_dataset_histograms(signal_background_dataset, n_bins=N_BINS)
signal_dataset, signal_counts, _, signal_bins_center = compute_dataset_histograms(signal_dataset, N_BINS)
background_dataset, background_counts, _, background_bins_center = compute_dataset_histograms(background_dataset, N_BINS)

# Compute the weighted number of events and errors for each histogram bin
n_observed_signal_background, sigma_squared_signal_background = find_dataset_binned_events(signal_background_dataset, signal_background_bins_center)
n_observed_signal, sigma_squared_signal = find_dataset_binned_events(signal_dataset, signal_bins_center)
n_observed_background, sigma_squared_background = find_dataset_binned_events(background_dataset, background_bins_center)

In order to visualise the mass distribution before fitting, we plot the weighted signal, background and signal+background distributions. This helps us visualise the peak, and also shows the shapes of the signal and background distributions separately.

We also notice that the number of bins plays a critical role in the visibility of the peak. We selected 75 bins as it gives a good balance between making the distribution granular enough to show the signal peak and keeping enough events in each bin, since sparsely populated bins could negatively affect the fit.
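
The trade-off can be illustrated with a toy sample: the more bins we use, the fewer entries fall in each one and the larger the relative statistical error per bin becomes. The sketch below uses a made-up uniform sample with unit weights, and the helper `mean_relative_error` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy unit-weight sample standing in for the reconstructed mass (MeV)
masses = rng.uniform(0.7e6, 1.5e6, size=3000)
weights = np.ones_like(masses)

def mean_relative_error(values, weights, n_bins):
    """Mean of sqrt(sum w^2) / sum w over the populated bins (hypothetical helper)."""
    counts, edges = np.histogram(values, bins=n_bins, weights=weights)
    variances, _ = np.histogram(values, bins=edges, weights=weights**2)
    populated = counts > 0
    return np.mean(np.sqrt(variances[populated]) / counts[populated])

# Compare a coarse, an intermediate and a fine binning
errors = {n_bins: mean_relative_error(masses, weights, n_bins) for n_bins in (25, 75, 225)}
for n_bins, error in errors.items():
    print(n_bins, error)
```

The relative error grows as the binning gets finer, which is why the number of bins cannot be increased indefinitely just to sharpen the peak.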

In [None]:
# Plot the counts and bin centres to verify that they represent the histogram correctly
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
titles = ["Total Mass distribution", "Signal mass distribution", "Background mass distribution"]

for idx, (dataset, bins_center, counts) in enumerate(zip([signal_background_dataset, signal_dataset, background_dataset], [signal_background_bins_center, signal_bins_center, background_bins_center], [signal_background_counts, signal_counts, background_counts])):
    ax[idx].scatter(bins_center, counts, c="r", marker="x")
    ax[idx].hist(dataset["reco_zv_mass"], weights=dataset["FullEventWeight"], bins=N_BINS, alpha=0.5, color="blue")
    ax[idx].set_title(titles[idx], fontsize=14)
    ax[idx].set_xlabel("Reco Mass (MeV)", fontsize=12)
    ax[idx].set_ylabel("Number of Events", fontsize=12)

fig.tight_layout()

Now we define the functions which we are going to fit.

Initially, we describe the signal with a Gaussian function and the background with a fourth order polynomial.

We also define a signal+background function combining the Gaussian and the fourth order polynomial, which we will treat as our alternative hypothesis.
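One practical point when fitting polynomials to masses expressed in MeV: with $x \sim 10^6$, the term $x^4$ is of order $10^{24}$, so the polynomial coefficients differ by many orders of magnitude and a minimiser can struggle. A common remedy, shown here only as a sketch and not used in the fits of this notebook, is to map the mass range onto $[-1, 1]$ before evaluating the polynomial. The helper names `rescale_mass` and `fourth_order_poly_scaled` are assumptions for illustration, and the default range is taken from `SUBRANGE`.

```python
import numpy as np

def rescale_mass(x, low=0.7e6, high=1.5e6):
    """Map masses in [low, high] linearly onto [-1, 1]."""
    return (2.0 * np.asarray(x, dtype=float) - (high + low)) / (high - low)

def fourth_order_poly_scaled(x, a, b, c, d, e):
    """Fourth order polynomial evaluated on the rescaled mass."""
    t = rescale_mass(x)
    return a + b*t + c*t**2 + d*t**3 + e*t**4

x = np.linspace(0.7e6, 1.5e6, 5)
print(rescale_mass(x))               # evenly spaced values between -1 and 1
print((x**4).max())                  # of order 1e24 in raw units
print((rescale_mass(x)**4).max())    # of order 1 after rescaling
```

With this parametrisation all five coefficients are of comparable size, at the cost of no longer being directly interpretable in units of MeV.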

In [None]:
def gaussian(x, mu, sigma, norm):
    """
    Gaussian function to describe signal distribution
    """
    return (norm * np.exp(-0.5 * ((x - mu) / sigma) ** 2))

def fourth_order_poly(x, a, b, c, d, e):
    """
    Fourth order polynomial to describe background distribution
    """
    return a + b*x + c*x**2 + d*x**3 + e*x**4

def second_order_poly(x, a, b, c,):
    """
    second order polynomial to describe background distribution
    """
    return a + b*x + c*x**2 

def first_order_poly(x, a, b,):
    """
    first order polynomial to describe background distribution
    """
    return a + b*x 

def gaussian_plus_fourth_poly(x, a, b, c, d, e, mu, sigma, norm):
    """
    Function combining a gaussian with a fourth order polynomial to represent our alternative hypothesis of the overall reconstructed mass distribution
    """
    return gaussian(x, mu, sigma, norm,) + fourth_order_poly(x, a, b, c, d, e)

Now we define the different chi squared functions which we will be using throughout our minimisation processes.

Here we will define a chi squared function for:

- A pure signal distribution described by a Gaussian function

- A pure background distribution described by a fourth order polynomial

- A signal + background distribution described by a fourth order polynomial (this will act as our null hypothesis)

- A signal + background distribution described by a Gaussian + fourth order polynomial (this will act as our alternate hypothesis)

For comparison, we also define chi squared functions for a second order polynomial signal model and a first order polynomial background model.

The chi squared functions are modified to take into account the fact that our events carry different weights: the squared residual in each bin is divided by the sum of squared event weights in that bin.
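All of these functions share the same structure, so they could also be generated from a single factory. The sketch below is an alternative formulation, not the approach used in this notebook: `make_chi_squared` is a hypothetical helper that closes over the binned data instead of reading global variables, and it skips bins with zero variance so that empty bins do not cause a division by zero.

```python
import numpy as np

def make_chi_squared(model, x, n_observed, sigma_squared):
    """Build a weighted chi squared function for a given model and binned data."""
    x, n_observed, sigma_squared = map(np.asarray, (x, n_observed, sigma_squared))
    mask = sigma_squared > 0  # ignore empty bins, whose variance is zero

    def chi_squared(*params):
        residual = n_observed[mask] - model(x[mask], *params)
        return np.sum(residual**2 / sigma_squared[mask])

    return chi_squared

# Toy usage with a straight line model and made-up bins (third bin is empty)
def line(x, a, b):
    return a + b * x

chi_squared_line = make_chi_squared(line, [0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [1.0, 1.0, 0.0])
print(chi_squared_line(1.0, 2.0))  # perfect description of the two non-empty bins
print(chi_squared_line(0.0, 2.0))  # each non-empty bin is off by one unit
```

Because the returned function takes `*params`, a minimiser that infers parameter names from the function signature would need the names supplied explicitly (iminuit's `Minuit` accepts a `name` argument for this purpose).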

In [None]:
def chi_squared_signal_gaussian(mu, sigma, norm):
    """
    Modified weighted chi squared function for a signal distribution described by gaussian
    """
    numerator = (n_observed_signal - gaussian(signal_bins_center, mu, sigma, norm) )**2
    return np.sum( numerator/sigma_squared_signal )

def chi_squared_signal_second_poly( a, b, c, ):
    """
    Modified weighted chi squared function for a signal distribution described by second order polynomial 
    """
    numerator = (n_observed_signal - second_order_poly(signal_bins_center, a, b, c, ) )**2
    return np.sum( numerator/sigma_squared_signal )

def chi_squared_background_fourth_poly(a, b, c, d, e):
    """
    Modified weighted chi squared function for a background distribution described by fourth order polynomial 
    """
    numerator = (n_observed_background - fourth_order_poly(background_bins_center, a, b, c, d, e) )**2
    return np.sum( numerator/sigma_squared_background )

def chi_squared_background_first_poly(a, b,):
    """
    Modified weighted chi squared function for a background distribution described by first order polynomial 
    """
    numerator = (n_observed_background - first_order_poly(background_bins_center, a, b, ) )**2
    return np.sum( numerator/sigma_squared_background )

def chi_squared_alternative_gaussian_plus_fourth_poly( a, b, c, d, e, mu, sigma, norm,):
    """
    Modified weighted chi squared function for a signal + background distribution described by a gaussian + fourth order polynomial 
    """
    numerator = (n_observed_signal_background - gaussian_plus_fourth_poly(signal_background_bins_center, a, b, c, d, e, mu, sigma, norm) )**2
    return np.sum( numerator/sigma_squared_signal_background )

def chi_squared_null_fourth_poly( a, b, c, d, e,):
    """
    Modified weighted chi squared function for a signal + background distribution described by fourth order polynomial 
    """
    numerator = (n_observed_signal_background - fourth_order_poly(signal_background_bins_center, a, b, c, d, e) )**2
    return np.sum( numerator/sigma_squared_signal_background )

###  Fitting the signal distribution

We will now fit our Gaussian function to the signal mass distribution and validate that it is indeed a good description of the signal.

We will also plot the distribution to validate the fit.

For comparison, we first fit the signal with a second order polynomial and then with a Gaussian.

In [None]:
# Define minimiser to minimise the signal function
signal_fit_fourth_poly_results = Minuit(
    chi_squared_signal_second_poly,
    a=2e6,
    b=7e6,
    c=6e6, 
)

# Minimise the signal function
signal_fit_fourth_poly_results.migrad()
signal_fit_fourth_poly_results.hesse()

# Print minimised parameters
print(signal_fit_fourth_poly_results.params)

# Print the minimised chi squared
print('Final chisq for second order polynomial signal: ', signal_fit_fourth_poly_results.fval)

# Plot the minimised signal function
plt.scatter(signal_bins_center, signal_counts, c="r", marker="x", label="Observed events")
plt.hist(signal_dataset["reco_zv_mass"], weights=signal_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue",)
plt.plot(signal_bins_center, second_order_poly(signal_bins_center, *signal_fit_fourth_poly_results.values), color="green", label="quadratic fit")
plt.title("Quadratic fit on signal mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

In [None]:
# Define minimiser to minimise the signal function
signal_fit_results = Minuit(
    chi_squared_signal_gaussian,
    mu=1e6,
    sigma=1e6, 
    norm=10,
)

# Minimise the signal function
signal_fit_results.migrad()
signal_fit_results.hesse()

# Print minimised parameters
print(signal_fit_results.params)

# Print the minimised chi squared
print('Final chisq for gaussian signal: ', signal_fit_results.fval)

# Plot the minimised signal function
plt.scatter(signal_bins_center, signal_counts, c="r", marker="x", label="Observed events")
plt.hist(signal_dataset["reco_zv_mass"], weights=signal_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_bins_center, gaussian(signal_bins_center, *signal_fit_results.values), color="green", label="Gaussian fit")
plt.title("Gaussian fit on signal mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

Of the two fits, the Gaussian gives the better description of the signal, with a chi squared of 1805. From the plot, we are relatively happy with the fit and will continue to use it as the model for our signal in the alternate hypothesis.

### Fitting the background

We will now fit our first and fourth order polynomial functions to the background mass distribution and check which of them gives a good description of the background.

We will also plot the distribution to validate each fit.

In [None]:
# Define minimiser to minimise the background function
background_fit_results = Minuit(
    chi_squared_background_first_poly,
    a=2e3,
    b=7e6,
)

# Minimise the background function
background_fit_results.migrad()
background_fit_results.hesse()

# Print minimised parameters
print(background_fit_results.params)

# Print the minimised chi squared
print('Final chisq for first order polynomial background: ', background_fit_results.fval)

# Plot the minimised background function
plt.scatter(background_bins_center, background_counts, c="r", marker="x", label="Observed events")
plt.hist(background_dataset["reco_zv_mass"], weights=background_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(background_bins_center, first_order_poly(background_bins_center, *background_fit_results.values), label="linear fit", color="green")
plt.title("1st order polynomial fit on background mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

In [None]:
# Define minimiser to minimise the background function
background_fit_results = Minuit(
    chi_squared_background_fourth_poly,
    a=2e6,
    b=7e6,
    c=-6e6, 
    d=2e6,
    e=-1e6
)

# Minimise the background function
background_fit_results.migrad()
background_fit_results.hesse()

# Print minimised parameters
print(background_fit_results.params)

# Print the minimised chi squared
print('Final chisq for fourth order polynomial background: ', background_fit_results.fval)

# Plot the minimised background function
plt.scatter(background_bins_center, background_counts, c="r", marker="x", label="Observed events")
plt.hist(background_dataset["reco_zv_mass"], weights=background_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(background_bins_center, fourth_order_poly(background_bins_center, *background_fit_results.values), label="Fourth order poly fit", color="green")
plt.title("4th order polynomial fit  on background mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

From our fits, the fourth order polynomial gives the better description of the background, with a chi squared of 129, compared to a worse chi squared for the first order polynomial. From the plot, we are relatively happy with the fit and will continue to use it as the model for our background in both the alternate hypothesis and the null hypothesis.

### Signal + Background fit on merged dataset (alternate hypothesis)

Now that we have found functions which we are happy with for describing the signal and background distributions individually, we can merge them together to obtain a fit for our alternate hypothesis.

To improve the fit, we use the parameters estimated from the signal and background fits as our initial guesses for this fit. In addition, we fix the mean and standard deviation of the signal distribution to their signal-only fit values, so that only the normalisation of the Gaussian is left free.

As before, we plot the total distribution overlaid with the fitted alternate hypothesis to validate the fit.

In [None]:
# Define minimiser to minimise the alternative hypothesis
signal_background_fit_alternative_results = Minuit(
    chi_squared_alternative_gaussian_plus_fourth_poly,
    a=background_fit_results.values[0],
    b=background_fit_results.values[1],
    c=background_fit_results.values[2], 
    d=background_fit_results.values[3],
    e=background_fit_results.values[4],
    mu=signal_fit_results.values[0],
    sigma=signal_fit_results.values[1],
    norm=signal_fit_results.values[2],
)

# Fix the mean and std of the signal
signal_background_fit_alternative_results.fixed["mu"] = True 
signal_background_fit_alternative_results.fixed["sigma"] = True

# Minimise the alternative hypothesis
signal_background_fit_alternative_results.migrad()
signal_background_fit_alternative_results.hesse()

# Print minimised parameters
print(signal_background_fit_alternative_results.params)

# Print the minimised chi squared
print('Final chisq for alternative hypothesis: ', signal_background_fit_alternative_results.fval)

# Plot the minimised alternative hypothesis 
plt.scatter(signal_background_bins_center, signal_background_counts, c="r", marker="x", label="Observed events")
plt.hist(signal_background_dataset["reco_zv_mass"], weights=signal_background_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center, gaussian_plus_fourth_poly(signal_background_bins_center, *signal_background_fit_alternative_results.values), color="green", label="Alternate hypothesis")
plt.title("Alternate hypothesis fit on total mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

From the fit, we can tell that the alternate hypothesis does a good job of describing the distribution, and the signal peak in particular. We get an overall chi squared of 82.8 for the alternate hypothesis.

### Background fit on Merged dataset

Now that we have obtained our alternate hypothesis fit, we can move on to obtaining our null hypothesis fit.

To improve the fit, we use the parameters estimated from the background fit as our initial guesses for this fit.

As before, we plot the total distribution overlaid with the fitted null hypothesis to validate the fit.

In [None]:
# Define minimiser to minimise the null hypothesis
signal_background_null_fit_results = Minuit(
    chi_squared_null_fourth_poly,
    a=background_fit_results.values[0],
    b=background_fit_results.values[1],
    c=background_fit_results.values[2], 
    d=background_fit_results.values[3],
    e=background_fit_results.values[4],
)

# Minimise the null hypothesis
signal_background_null_fit_results.migrad()
signal_background_null_fit_results.hesse()

# Print minimised parameters
print(signal_background_null_fit_results.params)

# Print the minimised chi squared
print('Final chisq for the null hypothesis: ', signal_background_null_fit_results.fval)

# Plot the minimised null hypothesis 
plt.scatter(signal_background_bins_center, signal_background_counts, c="r", marker="x", label="Observed events")
plt.hist(signal_background_dataset["reco_zv_mass"], weights=signal_background_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center, fourth_order_poly(signal_background_bins_center, *signal_background_null_fit_results.values), label="Null hypothesis", color="green")
plt.title("Null hypothesis fit on total mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

As is to be expected, the null hypothesis fit is worse, as it does not take the signal peak into account. This is reflected in a chi squared of 118.3, which is larger than the 82.8 obtained with the alternate hypothesis.
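A raw chi squared is easier to interpret once it is divided by the number of degrees of freedom (bins minus free parameters); values near 1 indicate a fit compatible with the quoted uncertainties. The sketch below applies this to the chi squared values quoted in this section. It assumes that all `N_BINS = 75` bins enter each fit, which is not checked here, and the helper name `reduced_chi_squared` is hypothetical.

```python
def reduced_chi_squared(chi_squared, n_bins, n_free_parameters):
    """Chi squared per degree of freedom."""
    ndof = n_bins - n_free_parameters
    if ndof <= 0:
        raise ValueError("Need more bins than free parameters")
    return chi_squared / ndof

N_BINS = 75  # value of the notebook constant

# (label, chi squared quoted in the text, number of free parameters)
fits = [
    ("Gaussian signal",             1805.0, 3),  # mu, sigma, norm
    ("4th order poly background",    129.0, 5),  # a, b, c, d, e
    ("Alternate (mu, sigma fixed)",   82.8, 6),  # a, b, c, d, e, norm
    ("Null (4th order poly)",        118.3, 5),  # a, b, c, d, e
]

for label, chi_sq, n_par in fits:
    print(f"{label:30s} chi2/ndof = {reduced_chi_squared(chi_sq, N_BINS, n_par):.2f}")
```

Under these assumptions the fits to the background and merged distributions come out much closer to 1 than the signal-only Gaussian fit, which suggests that the Gaussian is only an approximate description of the signal shape.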

Below we plot the null and alternate hypotheses overlaid on top of each other, where it is more noticeable that the alternate fit describes the overall signal + background distribution better than the null hypothesis.

In [None]:
# Plot the results
plt.scatter(signal_background_bins_center, signal_background_counts, c="r", marker="x")
plt.hist(signal_background_dataset["reco_zv_mass"], weights=signal_background_dataset["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center, fourth_order_poly(signal_background_bins_center, *signal_background_null_fit_results.values), label="Null hypothesis", color="darkgreen")
plt.plot(signal_background_bins_center, gaussian_plus_fourth_poly(signal_background_bins_center, *signal_background_fit_alternative_results.values), label="Alternate hypothesis", color="black")
plt.title("Total mass distribution with multiple hypotheses fits", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

### Parametrising the significance of the alternate hypothesis

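Now that we have obtained a null and an alternate hypothesis fit, we can apply Wilks' theorem to quantify the significance of the alternate fit. Since the mean and width of the Gaussian are fixed, the alternate hypothesis has one more free parameter (the normalisation) than the null, so the difference in chi squared is compared to a chi squared distribution with one degree of freedom.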

In [None]:
# Compute the delta chi squared between the null and alternative hypothesis
delt_chi = signal_background_null_fit_results.fval - signal_background_fit_alternative_results.fval

# Define the difference in the number of degrees of freedom between the two fits
dof = 1

# Compute the p value and the significance of the difference between the hypotheses
p_value = chi2.sf(delt_chi, dof)
z_score = np.sqrt(2)*erfinv(1-p_value)

print(f"The p value of the alternate hypothesis is {p_value:.3e}")
print(f"The statistical significance of the signal deviation is {z_score:.3f}")

We find that the significance of the alternate fit is almost 6 sigma, which is above the 5 sigma threshold for accepting the alternate hypothesis. Hence, we are happy to accept that there is indeed an additional contribution to the mass distribution, attributable to Higgs decay, which can be modelled as a Gaussian centred around 1e6 MeV on top of the expected standard background.
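As a cross-check of this result, for a single extra degree of freedom the two-sided significance reduces to $Z = \sqrt{\Delta\chi^2}$. The sketch below verifies this with the Python standard library only, using the chi squared values quoted earlier in this section (118.3 and 82.8). It works with the tail probability `p / 2` directly rather than with `1 - p`, which avoids the loss of precision that `1 - p_value` suffers when the p value is very small. The function name `significance_one_dof` is hypothetical.

```python
import math
from statistics import NormalDist

def significance_one_dof(delta_chi_squared):
    """Two-sided p value and Z score for a chi squared difference with 1 degree of freedom."""
    # Survival function of a chi squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(delta_chi_squared / 2.0))
    # Convert to a two-sided Gaussian significance via the lower tail
    z_score = -NormalDist().inv_cdf(p_value / 2.0)
    return p_value, z_score

delta = 118.3 - 82.8
p_value, z_score = significance_one_dof(delta)
print(f"delta chi2 = {delta:.1f}, p = {p_value:.3e}, Z = {z_score:.3f}")
print(f"sqrt(delta chi2) = {math.sqrt(delta):.3f}")
```

Both printed significances should agree, which confirms that the quoted value of almost 6 sigma follows directly from the difference between the two chi squared values.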

# **Development of a NN Classifier**

Now that we have confirmed that there is indeed an additional signal in our reconstructed mass distribution (which was to be expected, given that we put it there), we move on to developing an ML algorithm which can classify whether an entry from our dataframes is a signal entry or a background entry.

We can go about it in multiple ways; however, we will initially stick with a feed forward neural network.

Below we define some global hyperparameters which will be used when training all the networks.

In [None]:
# Define some hyperparameters
EPOCHS = 50
BATCH_SIZE = 2048
N_SIGNAL_SIZE = 22000

Firstly, to train a NN classifier, we need to process the datasets which will be fed into the network. We therefore create a function which takes as input a dataframe containing all the signal and background entries, and returns training and validation features and targets representing a 50/50 admixture of signal and background events.

This is done to allow the network to learn the features of both signal and background entries effectively. If we did not impose this 50/50 admixture condition, background entries would be over-represented and the network would mostly get the opportunity to learn to classify background entries.

The function first creates a dataframe containing a 50/50 admixture. It then extracts the features and targets which will be used to train the network. The features are then passed through a scaler so that the distribution of their values is better suited to network training.
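The sketch below illustrates the same idea on synthetic arrays using NumPy only. It differs from the function defined next in two ways, both of which are assumptions for illustration rather than the notebook's method: the entries of each class are drawn at random instead of taking the first `admixture_size` rows, and the scaling statistics are computed on the training part only, so that no information from the validation entries enters the preprocessing. A plain standardisation stands in for the `PowerTransformer`.

```python
import numpy as np

def balanced_indices(labels, n_per_class, rng):
    """Pick n_per_class random indices from each of the two classes, then shuffle."""
    signal_idx = rng.choice(np.flatnonzero(labels == 1), size=n_per_class, replace=False)
    background_idx = rng.choice(np.flatnonzero(labels == 0), size=n_per_class, replace=False)
    return rng.permutation(np.concatenate([signal_idx, background_idx]))

rng = np.random.default_rng(42)

# Synthetic, imbalanced toy dataset: 200 signal and 5000 background entries
labels = np.concatenate([np.ones(200, dtype=int), np.zeros(5000, dtype=int)])
features = rng.normal(loc=labels[:, None] * 1.0, scale=1.0, size=(labels.size, 3))

idx = balanced_indices(labels, n_per_class=200, rng=rng)
x, y = features[idx], labels[idx]

# 80/20 split, then standardise with statistics from the training part only
n_train = int(0.8 * len(idx))
x_train, x_val = x[:n_train], x[n_train:]
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train, x_val = (x_train - mean) / std, (x_val - mean) / std

print("signal fraction in admixture:", y.mean())
```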

In [None]:
def create_ml_training_datasets(dataset, admixture_size=10000, scaler=PowerTransformer(), split_dataset=True, validation_size=0.2, feature_keys=KEYS["features"]):
    """
    Apply preprocessing steps to dataset and create a training and validation dataset for training an ml classification algorithm
    """
    # Create a dataset with a 50/50 admixture of signal and background entries
    dataset_admixture = pd.concat([dataset.query("isSignal==1").iloc[:admixture_size], dataset.query("isSignal==0").iloc[:admixture_size]])

    # Find the targets and features from the dataset
    features, targets = dataset_admixture[feature_keys], dataset_admixture[KEYS["targets"]]

    # Scale the features for improved NN training
    features = scaler.fit_transform(features)

    # # Make a one hot encoded version of the targets
    # targets_onehot = np.squeeze( tf.one_hot(targets.to_numpy(), depth=2), 1 )

    # Create a train/validation split if requested
    if split_dataset:
        # Create a validation and training dataset
        features_train, features_val, targets_train, targets_val = train_test_split(features, targets, test_size=validation_size, random_state=SEED)
        
        # # Make a one hot encoded version of the targets
        # targets_val_onehot = np.squeeze( tf.one_hot(targets_val.to_numpy(), depth=2), 1 )
        # targets_train_onehot = np.squeeze( tf.one_hot(targets_train.to_numpy(), depth=2), 1 )

        return features_train, targets_train, features_val, targets_val
    
    return features, targets

Using the function defined above, we create training and validation datasets from our datasets before and after applying the kinematic cuts.

The idea is to train a network on both datasets and see which one trains the model better. This will help us identify which dataset we should use moving forward to train the networks.

In [None]:
# Create training and validation datasets for the datasets with no cuts and with harsh kinematic cuts
no_cut_features_train, no_cut_targets_train, no_cut_features_val, no_cut_targets_val = create_ml_training_datasets(merge_dataset_pre_harsh_cut, admixture_size=N_SIGNAL_SIZE)
harsh_cut_features_train, harsh_cut_targets_train, harsh_cut_features_val, harsh_cut_targets_val = create_ml_training_datasets(merge_dataset_post_harsh_cut, admixture_size=N_SIGNAL_SIZE)

Before training our networks, we define some functions which will help us evaluate them. Each function produces a specific output. They are:

- Plot the loss and accuracy throughout the training of the network. This allows us to visually inspect the performance of the network.

- Print the final validation accuracy of the network. This works in conjunction with the accuracy plot to help us determine which model is better.

- Plot a confusion matrix of the validation dataset using the predictions made by a network. This can help us delve into what specifically the network is misclassifying.

- Plot a ROC curve, which gives us an indication of the performance of the network as a classification algorithm.

- Plot the distribution of classification scores output by the model for both signal and background entries. This tells us whether there is a range of classification scores where the classification is ambiguous because the distributions for signal and background entries overlap.
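Note that a confusion matrix is built from hard class labels, so continuous classification scores need to be thresholded first. The sketch below shows that step with NumPy on made-up scores; the 0.5 threshold and the helper name `confusion_from_scores` are assumptions for illustration, and the layout follows the scikit-learn convention (rows are true labels, columns are predicted labels).

```python
import numpy as np

def confusion_from_scores(scores, targets, threshold=0.5):
    """2x2 confusion matrix (rows: truth, columns: prediction) from classification scores."""
    predictions = (np.asarray(scores).ravel() >= threshold).astype(int)
    truth = np.asarray(targets).ravel().astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    # Count each (truth, prediction) pair
    np.add.at(matrix, (truth, predictions), 1)
    return matrix

# Hypothetical network outputs and truth labels
scores  = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
targets = np.array([0,   0,   1,    1,   1,   0])

print(confusion_from_scores(scores, targets))
```

Varying `threshold` trades signal efficiency against background rejection, which is the same trade-off that the ROC curve summarises over all thresholds.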

In [None]:
def plot_training_metrics(history_dict):
    """
    Function to plot the training and validation loss of the network
    """

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

    # Plot the loss 
    ax1.plot(history_dict["loss"], color="maroon", label="Training")
    ax1.plot(history_dict["val_loss"], color="darkblue", label="Validation")
    ax1.legend()
    ax1.set(
        xlabel = "Epochs",
        ylabel = "BCE loss",
        title  = "Loss during training",
    )

    # Plot the  accuracy
    ax2.plot(history_dict["accuracy"], color="maroon", label="Training")
    ax2.plot(history_dict["val_accuracy"], color="darkblue", label="Validation")
    ax2.legend()
    ax2.set(
        xlabel = "Epochs",
        ylabel = "Accuracy",
        title  = "Accuracy during training",
    )

    fig.tight_layout()

def print_final_accuracy(history_dict):
    """
    Print the final accuracy of the network
    """

    # Find index of largest val accuracy
    # max_idx = np.argmax(np.array(history_dict["val_accuracy"]))

    print(f"The final training accuracy is {history_dict['accuracy'][-1]:.2%}")
    print(f"The final validation accuracy is {history_dict['val_accuracy'][-1]:.2%}")

def plot_confusion_matrix(classification_scores, targets, labels=["Background", "Signal"]):
    """
    Compute and plot the confusion matrix from the class labels predicted by the ml classification algorithm (scores must be thresholded beforehand)
    """

    conf = confusion_matrix(targets, classification_scores,)
    df = pd.DataFrame(conf, columns=labels, index=labels)
    perc = df.copy()
    cols=perc.columns.values
    perc[cols]=perc[cols].div(perc[cols].sum(axis=1), axis=0).multiply(100)

    annot=df.round(0).astype(str) + "\n" + perc.round(1).astype(str) + "%"

    x = sns.heatmap(df, annot=annot, fmt='', cmap="Blues",  annot_kws={"fontsize":16}, linewidth=1, cbar=False)
    x.set_xticklabels(x.get_xmajorticklabels(), fontsize = 14)
    x.set_yticklabels(x.get_ymajorticklabels(), fontsize = 14)
    plt.xlabel("Predicted label", fontsize=16, )
    plt.ylabel("Truth label", fontsize=16, )
    plt.tight_layout()
    plt.show()

def plot_roc_curve(classification_socre_train, classification_score_val, targets_train, targets_val):
    """
    Plot the ROC curves of the ml classification algorithm using its training and validation classification scores
    """
    # Compute the false positive and true positive rates for the ROC curves
    fpr_val, tpr_val, _     = roc_curve(targets_val, classification_score_val,)
    fpr_train, tpr_train, _ = roc_curve(targets_train, classification_socre_train,)

    # Compute the area under the curve (AUC) for each ROC curve
    auc_val     = roc_auc_score(targets_val, classification_score_val,)
    auc_train   = roc_auc_score(targets_train, classification_socre_train,)

    # Plot the ROC curve
    plt.plot(fpr_val, tpr_val, color="maroon", label=f"Validation (AUC = {auc_val:.2f})")
    plt.plot(fpr_train, tpr_train, color="darkblue", label=f"Training (AUC = {auc_train:.2f})")
    plt.plot(np.linspace(0, 1, 100), np.linspace(0, 1, 100), color="black", label="Random (AUC = 0.5)", linestyle="--")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.title("Improved classifier ROC curve", fontsize=14, y=1.03)
    plt.legend()

def plot_classification_score_distribution(classification_scores, targets):
    """
    Plot the distribution of classification scores for the signal and background entries
    """
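    # Flatten the inputs so that DataFrames, Series and (N, 1) arrays are all handled
    targets, classification_scores = np.asarray(targets).ravel(), np.asarray(classification_scores).ravel()
    background_mask, signal_mask = (targets == 0), (targets == 1)
    plt.hist(classification_scores[background_mask], bins=100, label="Background", color="maroon", alpha=0.5,)
    plt.hist(classification_scores[signal_mask], bins=100, label="Signal", color="darkblue", alpha=0.5, )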
    plt.xlabel("Classification score")
    plt.ylabel("Number of entries")
    plt.title("Classification score distribution")
    plt.legend()

Now we create a function which takes all the necessary parameters and returns a compiled neural network. By giving it a different list for the ```features``` parameter, we can modify the number of hidden layers and the number of nodes each layer contains. This lets us rapidly create new NN objects to test different layouts.

In [None]:
def create_nn_classifier(num_inputs, num_outputs, features, activation, loss="binary_crossentropy", learning_rate=5e-4, metrics=["accuracy"], initializer="normal", final_layer="sigmoid"):
    """
    Function that will create a new compiled neural network 
    """

    # Define initializer
    initializer = tf.keras.initializers.GlorotNormal(seed=SEED) if initializer == "xavier" else "normal"

    # Define initial layer of the network
    network = tf.keras.models.Sequential([
        tf.keras.layers.Dense(features[0], activation=activation, input_dim=num_inputs, kernel_initializer=initializer),
        tf.keras.layers.Dropout(0.2),
    ])

    # Add the hidden layers of the neural network
    for feature in features[1:]:
        network.add(
            tf.keras.layers.Dense(feature, activation=activation, kernel_initializer=initializer)
        )
        network.add(
            tf.keras.layers.Dropout(0.2)
        )
    
    # Add output layer of neural network
    network.add(
        tf.keras.layers.Dense(num_outputs, activation=final_layer, kernel_initializer=initializer,)
    )

    # Define optimizer object
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

    # Compile and return the network
    network.compile(optimizer=optimizer, loss=loss, metrics=metrics,)

    return network

Below we create two identical compiled neural networks, which will be used to test which of the two datasets (pre or post kinematic cuts) provides the better trained network.

We will use a standard upscale from the number of inputs to 512 nodes, followed by subsequent hidden layers downscaling the latent vector until we provide a single output scaled from 0 to 1, which serves as our classification score. A score of 0 represents a perfect classification as a background event and a score of 1 a perfect classification as a signal event.

To train the network, we will use the standard binary cross entropy loss function.

The activation functions used after every hidden layer will be tested between ReLU, GELU and SELU. We will check which one results in the better classification accuracy.

Lastly, we will test the kernel initialiser used by the network, choosing between a normal distribution and Xavier (Glorot) initialisation.

Below there is also a (commented-out) network which outputs normalised probabilities for both the background and signal classes using one hot encoded vectors. While uncommon, it is worth a try.

Not all of the networks tested are trained in this notebook at the same time; this is to cut down on the execution time of the notebook. However, the results of training the networks which are omitted will be mentioned later on.

In [None]:
# Define hyperparameters for improved network
improved_network_hyperparameters = {"features": [512, 256, 128, 64, 32, 16, 8, 4]}

# # Create the classifier object
# classifier_network_no_cut = create_nn_classifier(
#     num_inputs          = len(KEYS["features"]),
#     num_outputs         = 2,
#     features            = improved_network_hyperparameters["features"],
#     activation          = "selu",
#     initializer         = "normal",
#     final_layer         = "softmax",
#     loss = "categorical_crossentropy"
# )

# Create the classifier object
classifier_network_no_cut = create_nn_classifier(
    num_inputs          = len(KEYS["features"]),
    num_outputs         = len(KEYS["targets"]),
    features            = improved_network_hyperparameters["features"],
    activation          = "selu",
    initializer         = "normal",
)

classifier_network_harsh_cut = create_nn_classifier(
    num_inputs          = len(KEYS["features"]),
    num_outputs         = len(KEYS["targets"]),
    features            = improved_network_hyperparameters["features"],
    activation          = "selu",
    initializer         = "normal",
)

### Train on no cut dataset

First we will train the network on the dataset without kinematic cuts. The loss and accuracy throughout the training are also plotted below.

In [None]:
history_no_cut = classifier_network_no_cut.fit(
    no_cut_features_train,
    no_cut_targets_train,
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (no_cut_features_val, no_cut_targets_val),
    verbose=2
)

In [None]:
# print the final accuracy of the model 
print_final_accuracy(history_no_cut.history)

print("\n")

# plot the loss and accuracy during the training of the model
plot_training_metrics(history_dict=history_no_cut.history)

Now we try with the dataset using harsh kinematic cuts and compare the two networks.


In [None]:
history_harsh_cut = classifier_network_harsh_cut.fit(
    harsh_cut_features_train,
    harsh_cut_targets_train,
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (harsh_cut_features_val, harsh_cut_targets_val),
    verbose = 2
)

In [None]:
# print the final accuracy of the model 
print_final_accuracy(history_harsh_cut.history)

print("\n")

# plot the loss and accuracy during the training of the model
plot_training_metrics(history_dict=history_harsh_cut.history)

From these results, we see that there is no real overtraining (in fact the validation metrics are slightly better than the training ones). We also see that the model trained on the dataset with harsh cuts performs marginally better than the one trained on the dataset with no cuts, hence we will use the harshly cut dataset for the training of subsequent models.

The reason for this small improvement is likely that the network does not have to learn outlier background events which lie far outside the mass range of the signal events. That said, given that these do not make up a majority of the events, the increase is not significant.
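The statement about overtraining can be made quantitative by comparing the last-epoch training and validation metrics. The sketch below does this for a Keras-style history dictionary; the numbers in `example_history` are invented purely to show the call, and the function name is hypothetical. A validation accuracy slightly above the training accuracy is plausible with these networks, since the `Dropout` layers are active during training but switched off during validation.

```python
def generalisation_gap(history_dict, metric="accuracy"):
    """Training minus validation value of a metric at the final epoch."""
    return history_dict[metric][-1] - history_dict[f"val_{metric}"][-1]

# Invented values, for illustration only
example_history = {
    "accuracy":     [0.80, 0.86, 0.88],
    "val_accuracy": [0.82, 0.87, 0.89],
    "loss":         [0.45, 0.33, 0.29],
    "val_loss":     [0.42, 0.31, 0.28],
}

gap = generalisation_gap(example_history)
print(f"accuracy gap (train - val): {gap:+.3f}")  # negative: validation is better
```

A large positive accuracy gap (or a validation loss that rises while the training loss keeps falling) would be the usual sign of overtraining.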

### Finding better architectures for NN classifier

Now that we have identified which dataset better trains the network, we can move on to improving the architecture of the network by finding a number of hidden layers and nodes which improves its performance.

Here we will also test other changes such as different learning rates, different kernel initialisers and different activation functions for the hidden layers.

Similar to the procedure used above, we will train a specific network and then plot the training metrics to evaluate its performance.
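
To get a feeling for how much larger the second architecture is, we can count the trainable parameters of the two layer configurations tested in this section. This is only a sketch: it assumes that `create_nn_classifier` builds a plain fully connected network with one bias per node, 8 input features and 1 output, and `count_dense_parameters` is a hypothetical helper.

```python
def count_dense_parameters(num_inputs, hidden_widths, num_outputs):
    """Count the weights and biases of a fully connected network."""
    layer_sizes = [num_inputs, *hidden_widths, num_outputs]
    # Each layer has (n_in weights + 1 bias) per output node
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

deep_widths = [1024, 512, 256, 128, 64, 32, 16, 8, 4]
deeper_widths = [2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4]

n_deep = count_dense_parameters(8, deep_widths, 1)
n_deeper = count_dense_parameters(8, deeper_widths, 1)

print(f"deep:   {n_deep:,} parameters")
print(f"deeper: {n_deeper:,} parameters")
```

Under these assumptions the deeper network has roughly four times as many parameters, almost all of them in the connection between the 2048-node and 1024-node layers.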

In [None]:
# Create the classifier object
classifier_network_harsh_cut_deep = create_nn_classifier(
    num_inputs          = len(KEYS["features"]),
    num_outputs         = len(KEYS["targets"]),
    features            = [1024, 512, 256, 128, 64, 32, 16, 8, 4],
    activation          = "selu",
    initializer         = "normal",
)
classifier_network_harsh_cut_deeper = create_nn_classifier(
    num_inputs          = len(KEYS["features"]),
    num_outputs         = len(KEYS["targets"]),
    features            = [2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4],
    activation          = "selu",
    initializer         = "normal",
)

In [None]:
history_harsh_cut_deep = classifier_network_harsh_cut_deep.fit(
    harsh_cut_features_train,
    harsh_cut_targets_train,
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (harsh_cut_features_val, harsh_cut_targets_val),
    verbose = 2
)

In [None]:
# print the final accuracy of the model 
print_final_accuracy(history_harsh_cut_deep.history)

print("\n")

# Plot the loss and accuracy recorded while training the model
plot_training_metrics(history_dict=history_harsh_cut_deep.history)

In [None]:
history_harsh_cut_deeper = classifier_network_harsh_cut_deeper.fit(
    harsh_cut_features_train,
    harsh_cut_targets_train,
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (harsh_cut_features_val, harsh_cut_targets_val),
    verbose = 2
)

In [None]:
# Print the final accuracy of the model
print_final_accuracy(history_harsh_cut_deeper.history)

print("\n")

# Plot the loss and accuracy recorded while training the model
plot_training_metrics(history_dict=history_harsh_cut_deeper.history)

From the above, we can tell that the deeper network with 10 hidden layers performs the best, with an accuracy of 90.5%. This marginal increase is likely due to the additional hidden layer allowing the network to build further abstractions of the input features, which in turn improve its prediction for a given entry.

Additional configurations that were tested but omitted were:

- Using the GELU activation function on the deeper NN $\rightarrow$ 89% validation accuracy

- Using the ReLU activation function on the deeper NN $\rightarrow$ 87% validation accuracy

- Using 1e3 learning rate on the deeper NN $\rightarrow$ 90.1% validation accuracy

- Using Xavier kernel initialisation on the deeper NN $\rightarrow$ 90.0% validation accuracy

From all the networks tested, it was found that the 10 hidden layer feed-forward network with the SELU activation function, trained on the dataset with harsh cuts applied, performed the best when classifying signal and background entries.

Using the best performing network, we compute the classification scores and the final predictions of the NN on the training and validation datasets so that we can make additional plots to evaluate the network.
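
The step from classification scores to predictions is a simple threshold. The toy example below sketches it with made-up scores. `scores_to_predictions` and `accuracy` are hypothetical helpers, and the column-vector shape of the scores mimics what a single-output network typically returns.

```python
import numpy as np

def scores_to_predictions(scores, threshold=0.5):
    """Label an entry as signal (1) if its score reaches the threshold, else background (0)."""
    return (np.asarray(scores).ravel() >= threshold).astype(int)

def accuracy(predictions, targets):
    """Fraction of entries whose predicted label matches the true label."""
    return float(np.mean(np.asarray(predictions).ravel() == np.asarray(targets).ravel()))

# Made-up scores with shape (N, 1) and their true labels
toy_scores = np.array([[0.91], [0.12], [0.55], [0.40], [0.50]])
toy_targets = np.array([1, 0, 0, 1, 1])

toy_predictions = scores_to_predictions(toy_scores)
print(toy_predictions)
print(accuracy(toy_predictions, toy_targets))
```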

In [None]:
# Get predictions on validation dataset
harsh_cut_deeper_classification_scores_val = classifier_network_harsh_cut_deeper.predict(harsh_cut_features_val)
harsh_cut_deeper_predictions_val = (harsh_cut_deeper_classification_scores_val >= 0.5).astype(int)

# Get predictions on training dataset
harsh_cut_deeper_classification_scores_train = classifier_network_harsh_cut_deeper.predict(harsh_cut_features_train)
harsh_cut_deeper_predictions_train = (harsh_cut_deeper_classification_scores_train >= 0.5).astype(int)

### We can confirm that the deeper network trained on the dataset with harsh kinematic cuts is the best.

Using the functions created previously, we can make all relevant plots to evaluate the performance of the network.

We will first make all relevant plots and then comment on them at the end.

In [None]:
plot_classification_score_distribution(harsh_cut_deeper_classification_scores_val, harsh_cut_targets_val.to_numpy())

In [None]:
plot_roc_curve(harsh_cut_deeper_classification_scores_train, harsh_cut_deeper_classification_scores_val, harsh_cut_targets_train, harsh_cut_targets_val)

In [None]:
plot_confusion_matrix(harsh_cut_deeper_predictions_val.ravel(), harsh_cut_targets_val.to_numpy().ravel())

From these plots, we can tell that the classification algorithm which was developed performs very well: it classifies entries with an accuracy of ~90%. From the classification score distribution, we see that the two distributions are skewed towards their ideal classification scores, with minimal overlap between them.

This is supported by the almost perfect ROC curve, with an area under the curve of 0.96 for the validation dataset.
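
The area under the ROC curve can be read as the probability that a randomly chosen signal entry receives a higher score than a randomly chosen background entry. The sketch below computes it that way on made-up scores. `roc_auc_from_scores` is a hypothetical helper that compares every signal/background pair, so it is only suitable for small samples; for the full datasets `roc_auc_score` from scikit-learn, which is imported at the top of the notebook, is the practical choice.

```python
import numpy as np

def roc_auc_from_scores(scores, targets):
    """AUC as the fraction of signal/background pairs ranked correctly (ties count half)."""
    scores = np.asarray(scores, dtype=float).ravel()
    targets = np.asarray(targets).ravel().astype(bool)
    signal_scores = scores[targets]
    background_scores = scores[~targets]
    # Compare every signal score with every background score
    differences = signal_scores[:, None] - background_scores[None, :]
    wins = np.sum(differences > 0) + 0.5 * np.sum(differences == 0)
    return float(wins / (signal_scores.size * background_scores.size))

# Made-up scores: one signal entry (0.35) is ranked below one background entry (0.6)
toy_scores = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])
toy_targets = np.array([1, 1, 1, 0, 0, 0])

print(roc_auc_from_scores(toy_scores, toy_targets))
```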

From the confusion matrix, we can tell that the network has a slightly harder time classifying background events; however, the difference is marginal.
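
The per-class numbers read off a confusion matrix are simply the accuracies computed separately on the true signal and true background entries. A minimal sketch with made-up labels follows; `per_class_accuracy` is a hypothetical helper and does not reproduce `plot_confusion_matrix`.

```python
import numpy as np

def per_class_accuracy(predictions, targets):
    """Return (signal accuracy, background accuracy) for binary labels 1 and 0."""
    predictions = np.asarray(predictions).ravel()
    targets = np.asarray(targets).ravel()
    signal_mask = targets == 1
    background_mask = targets == 0
    signal_accuracy = np.mean(predictions[signal_mask] == 1)
    background_accuracy = np.mean(predictions[background_mask] == 0)
    return float(signal_accuracy), float(background_accuracy)

# Made-up labels: 3 of 4 signal and 2 of 4 background entries are classified correctly
toy_targets = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_predictions = np.array([1, 1, 1, 0, 0, 0, 1, 1])

signal_accuracy, background_accuracy = per_class_accuracy(toy_predictions, toy_targets)
print(f"signal accuracy: {signal_accuracy:.2f}, background accuracy: {background_accuracy:.2f}")
```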

### Now let's try a BDT

Given that we only need a classification algorithm, we are not limited to neural networks; we can also use a gradient boosted decision tree (BDT).

We can use HalvingGridSearchCV to scan through the hyperparameters to find the optimal set and compare the resulting BDT to the neural network.

We will define a number of numpy arrays with hyperparameter values for the grid search to scan through, and the search will return the best combination. We can then create a BDT with the optimal parameters to compute the classification scores.

Similar to our best NN, we will use the datasets with the harsh kinematic cuts to train the BDT.
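
The idea behind a halving search is to evaluate all candidates cheaply, keep only the best fraction, and repeat with more resources until one candidate remains. The sketch below is a simplified illustration of that idea and not scikit-learn's implementation. `successive_halving`, the toy grid and the toy score function are all hypothetical, and the reduction factor of 3 is chosen only for illustration.

```python
import math

def successive_halving(candidates, score_fn, min_resources, max_resources, factor=3):
    """Simplified successive halving: keep the best 1/factor of the candidates each round."""
    survivors = list(candidates)
    resources = min_resources
    while len(survivors) > 1:
        # Rank the surviving candidates using the current amount of resources
        ranked = sorted(survivors, key=lambda c: score_fn(c, resources), reverse=True)
        survivors = ranked[:max(1, math.ceil(len(ranked) / factor))]
        # Give the remaining candidates more resources in the next round
        resources = min(resources * factor, max_resources)
    return survivors[0]

# Hypothetical grid of learning rates
toy_learning_rates = [0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040]

def toy_score(learning_rate, n_samples):
    # A real score would train on n_samples entries and cross-validate;
    # this toy version just peaks at a learning rate of 0.020
    return -abs(learning_rate - 0.020)

best = successive_halving(toy_learning_rates, toy_score, min_resources=100, max_resources=10_000)
print(best)
```

The benefit over an exhaustive grid search is that most candidates are discarded after being trained on only a small number of samples.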

In [None]:
# # Grid Search parameters
# param_grid = {
#     "n_estimators": [100],
#     "learning_rate": np.linspace(1e-2, 4e-2, 10),
#     "max_depth":    np.arange(10, 30),
#     "min_samples_leaf": np.arange(30, 50,),
# }

# # Number of threads to use to speed up the grid search
# n_jobs = 20

# # Set the numpy pseudo RNG to our set seed 
# np.random.seed(SEED)

# # Initialise our BDT object with base parameters which we will feed into our grid search
# bdt_classifier = GradientBoostingClassifier(random_state=SEED)

# # Initialise our grid search object which will scan through our hyperparameter space
# grid_search = HalvingGridSearchCV(estimator=bdt_classifier, param_grid=param_grid, n_jobs=n_jobs, verbose=1,)

# grid_search.fit(harsh_cut_features_train, harsh_cut_targets_train.to_numpy().ravel())
# print("Best estimator:")
# print(grid_search.best_estimator_)
# print(grid_search.best_score_)

The search found that the ideal hyperparameters for the BDT are:

- Learning rate: 2.33e-2

- Max depth: 11

- Minimum samples per leaf: 45

Hence we will create a BDT with these hyperparameters and compute the validation accuracy.

In [None]:
# Initialise the BDT with the optimal hyperparameters
optimal_bdt = GradientBoostingClassifier(
    learning_rate=2.33e-2,
    max_depth=11,
    min_samples_leaf=45,
    random_state=SEED
)

# Train the BDT on the training dataset
optimal_bdt.fit(harsh_cut_features_train, harsh_cut_targets_train.to_numpy().ravel(),)

# Use the trained BDT to compute the classification scores (probability of the signal class)
harsh_cut_bdt_classification_scores_val = optimal_bdt.predict_proba(harsh_cut_features_val)[:, 1]
harsh_cut_bdt_predictions_val = (harsh_cut_bdt_classification_scores_val >= 0.5).astype(int)

# Compute the validation accuracy of the optimal BDT from the thresholded predictions
bdt_val_accuracy = accuracy_score(harsh_cut_targets_val.to_numpy().ravel(), harsh_cut_bdt_predictions_val)

# Print the validation accuracy
print(f"The validation accuracy of the optimal BDT is {bdt_val_accuracy:.1%}")

In [None]:
plot_confusion_matrix(harsh_cut_bdt_predictions_val.ravel(), harsh_cut_targets_val.to_numpy().ravel())

While able to compete with the neural networks, the accuracy of the BDT is just shy of that of the best performing neural network. From the confusion matrix, we see that it has very similar performance to the neural networks.

Ultimately, since it is not able to outperform the NN, we will not investigate BDTs further in this report.

# **Use classification network to improve fitting**

Now that we have developed a NN classification algorithm, we can use it to attempt to improve the hypothesis testing by adding the restriction that only signal entries with a classification score above 0.5 are included in the signal mass distribution.

Below we take the kinematic features used as the input for the neural network from the dataset after the harsh cuts have been applied and use the NN to obtain classification scores. We then add those classification scores to the dataset's dataframe.
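
The effect of such a restriction can be summarised by the weighted signal efficiency, i.e. the fraction of the signal event weight that survives the score requirement. The sketch below illustrates this on made-up numbers. `apply_signal_score_cut` is a hypothetical helper written under the assumption that the requirement is applied to signal entries only, as described above; it is not the notebook's `split_dataset_signal_background`.

```python
import numpy as np

def apply_signal_score_cut(is_signal, scores, weights, threshold=0.5):
    """Return the keep-mask and the weighted signal efficiency of a score cut on signal entries."""
    is_signal = np.asarray(is_signal).astype(bool)
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Background entries are always kept; signal entries must reach the score threshold
    keep = ~is_signal | (scores >= threshold)
    signal_efficiency = weights[is_signal & keep].sum() / weights[is_signal].sum()
    return keep, float(signal_efficiency)

# Made-up entries: three signal and two background, with per-event weights
toy_is_signal = np.array([1, 1, 1, 0, 0])
toy_scores = np.array([0.9, 0.3, 0.7, 0.8, 0.1])
toy_weights = np.array([1.0, 2.0, 1.0, 5.0, 5.0])

keep, efficiency = apply_signal_score_cut(toy_is_signal, toy_scores, toy_weights)
print(keep)
print(f"weighted signal efficiency: {efficiency:.2f}")
```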

In [None]:
# Make the network input array using all entries in the dataset with harsh kinematic cuts applied
harsh_cut_all_features = merge_dataset_post_harsh_cut[KEYS["features"]]

# Use network to get classification scores
harsh_cut_all_classification_scores = classifier_network_harsh_cut_deeper.predict(PowerTransformer().fit_transform(harsh_cut_all_features), batch_size=8192)

# Create array with all the predictions using the classification scores
harsh_cut_all_predictions = (harsh_cut_all_classification_scores >= 0.5).astype(int)

# Add the predictions and classification scores to the merged dataframe (flattened to 1D columns)
merge_dataset_post_harsh_cut["nn_prediction"] = harsh_cut_all_predictions.ravel()
merge_dataset_post_harsh_cut["nn_classification_score"] = harsh_cut_all_classification_scores.ravel()

merge_dataset_post_harsh_cut.head()

After doing this, we can follow the same procedure which was undertaken in the Hypothesis testing section to fit the alternate and null hypotheses to the new cleaned-up signal+background distribution. The two fits are then used in conjunction with Wilks' theorem to compute the updated significance.
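
For reference, the quantities involved can be written compactly. This is a summary of the chi-squared functions defined in the code cells of this section, where $n_i$ and $\sigma_i^2$ are the weighted event count and its variance in bin $i$, $m_i$ is the bin centre, and $f$ is either the fourth-order polynomial (null hypothesis) or the polynomial plus a Gaussian (alternate hypothesis):

$$
\chi^2(\theta) = \sum_i \frac{\left(n_i - f(m_i;\theta)\right)^2}{\sigma_i^2},
\qquad
\Delta\chi^2 = \chi^2_{\text{null}} - \chi^2_{\text{alt}}
$$

By Wilks' theorem, if the null hypothesis is true, $\Delta\chi^2$ asymptotically follows a $\chi^2$ distribution whose number of degrees of freedom equals the difference in the number of free parameters between the two fits. Here that difference is one, since the mean and width of the Gaussian are fixed and only its normalisation is free.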

Given that the procedure is almost identical to the procedure used previously and all the functions are reused, we will not add any additional commentary to this section until we obtain the updated significance and can comment on it.

In [None]:
# Apply kinematic cuts on the dataset
for cut_key in harsh_cut_dictionary.keys():
    # query_string = f"{cut_key} < {cut_dictionary[cut_key][dataset_key]}"
    query_string = f"{cut_key} > {harsh_cut_dictionary[cut_key]['minimum']}" + " & " + f"{cut_key} < {harsh_cut_dictionary[cut_key]['maximum']}"
    merge_dataset_pre_harsh_cut = merge_dataset_pre_harsh_cut.query(query_string)

# Create datasets with signal and background entries in addition to applying a threshold on the classification score
signal_background_dataset_cleaned, signal_dataset_cleaned, background_dataset_cleaned = split_dataset_signal_background(merge_dataset_post_harsh_cut, classification_score_threshold=0.5)

# Compute the histogram bins and counts for each dataset
signal_background_dataset_cleaned, signal_background_counts_cleaned, _, signal_background_bins_center_cleaned = compute_dataset_histograms(signal_background_dataset_cleaned, n_bins=N_BINS)
signal_dataset_cleaned, signal_counts_cleaned, _, signal_bins_center_cleaned = compute_dataset_histograms(signal_dataset_cleaned, n_bins=N_BINS)
background_dataset_cleaned, background_counts_cleaned, _, background_bins_center_cleaned = compute_dataset_histograms(background_dataset_cleaned, n_bins=N_BINS)

# Compute the weighted number of events and errors for each histogram bin
n_observed_signal_background_cleaned, sigma_squared_signal_background_cleaned = find_dataset_binned_events(signal_background_dataset_cleaned, signal_background_bins_center_cleaned)
n_observed_signal_cleaned, sigma_squared_signal_cleaned = find_dataset_binned_events(signal_dataset_cleaned, signal_bins_center_cleaned)
n_observed_background_cleaned, sigma_squared_background_cleaned = find_dataset_binned_events(background_dataset_cleaned, background_bins_center_cleaned,)

In [None]:
# Plot the counts and bin centres to verify that they represent the histogram correctly
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
titles = ["Total Mass distribution", "Signal mass distribution", "Background mass distribution"]

for idx, (dataset, bins_center, counts) in enumerate(zip([signal_background_dataset_cleaned, signal_dataset_cleaned, background_dataset_cleaned], [signal_background_bins_center_cleaned, signal_bins_center_cleaned, background_bins_center_cleaned], [signal_background_counts_cleaned, signal_counts_cleaned, background_counts_cleaned])):
    ax[idx].scatter(bins_center, counts, c="r", marker="x", label="Post NN")
    ax[idx].hist(dataset["reco_zv_mass"], weights=dataset["FullEventWeight"], bins=N_BINS, alpha=0.5, color="blue", label="Post NN")
    ax[idx].set_xlabel("Reco Mass (MeV)", fontsize=12)
    ax[idx].set_ylabel("Number of Events", fontsize=12)

for idx, (dataset, bins_center, counts) in enumerate(zip([signal_background_dataset, signal_dataset, background_dataset], [signal_background_bins_center, signal_bins_center, background_bins_center], [signal_background_counts, signal_counts, background_counts])):
    ax[idx].scatter(bins_center, counts, c="r", marker="o", label="Pre NN")
    ax[idx].hist(dataset["reco_zv_mass"], weights=dataset["FullEventWeight"], bins=N_BINS, alpha=0.5, color="green", label="Pre NN")
    ax[idx].set_title(titles[idx], fontsize=14)
    ax[idx].set_xlabel("Reco Mass (MeV)", fontsize=12)
    ax[idx].set_ylabel("Number of Events", fontsize=12)
    ax[idx].legend()

fig.tight_layout()

In [None]:
def chi_squared_alternative_gaussian_plus_fourth_poly_cleaned( a, b, c, d, e, mu, sigma, norm,):
    numerator = (n_observed_signal_background_cleaned - gaussian_plus_fourth_poly(signal_background_bins_center_cleaned, a, b, c, d, e, mu, sigma, norm) )**2
    return np.sum( numerator/sigma_squared_signal_background_cleaned )

def chi_squared_null_fourth_poly_cleaned( a, b, c, d, e,):
    numerator = (n_observed_signal_background_cleaned - fourth_order_poly(signal_background_bins_center_cleaned, a, b, c, d, e) )**2
    return np.sum( numerator/sigma_squared_signal_background_cleaned )

In [None]:
# Define minimiser to minimise the alternative hypothesis
signal_background_cleaned_fit_alternate_results = Minuit(
    chi_squared_alternative_gaussian_plus_fourth_poly_cleaned,
    a=background_fit_results.values[0],
    b=background_fit_results.values[1],
    c=background_fit_results.values[2], 
    d=background_fit_results.values[3],
    e=background_fit_results.values[4],
    mu=signal_fit_results.values[0],
    sigma=signal_fit_results.values[1],
    norm=signal_fit_results.values[2],
)

# Fix the mean and std of the signal
signal_background_cleaned_fit_alternate_results.fixed["mu"] = True 
signal_background_cleaned_fit_alternate_results.fixed["sigma"] = True

# Minimise the alternative hypothesis
signal_background_cleaned_fit_alternate_results.migrad()
signal_background_cleaned_fit_alternate_results.hesse()

# Print minimised parameters
print(signal_background_cleaned_fit_alternate_results.params)

# Print the minimised chi squared
print('Final chisq for alternative hypothesis on cleaned dataset: ', signal_background_cleaned_fit_alternate_results.fval)

# Plot the minimised alternative hypothesis 
plt.scatter(signal_background_bins_center_cleaned, signal_background_counts_cleaned, c="r", marker="x", label="Observed events")
plt.hist(signal_background_dataset_cleaned["reco_zv_mass"], weights=signal_background_dataset_cleaned["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center_cleaned, gaussian_plus_fourth_poly(signal_background_bins_center_cleaned, *signal_background_cleaned_fit_alternate_results.values), color="green", label="Alternate hypothesis")
plt.title("Alternative hypothesis fit on total mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

In [None]:
# Define minimiser to minimise the null hypothesis
signal_background_cleaned_null_fit_results = Minuit(
    chi_squared_null_fourth_poly_cleaned,
    a=background_fit_results.values[0],
    b=background_fit_results.values[1],
    c=background_fit_results.values[2], 
    d=background_fit_results.values[3],
    e=background_fit_results.values[4],
)

# Minimise the null hypothesis
signal_background_cleaned_null_fit_results.migrad()
signal_background_cleaned_null_fit_results.hesse()

# Print minimised parameters
print(signal_background_cleaned_null_fit_results.params)

# Print the minimised chi squared
print('Final chisq for the null hypothesis: ', signal_background_cleaned_null_fit_results.fval)

# Plot the results
plt.scatter(signal_background_bins_center_cleaned, signal_background_counts_cleaned, c="r", marker="x", label="Observed events")
plt.hist(signal_background_dataset_cleaned["reco_zv_mass"], weights=signal_background_dataset_cleaned["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center_cleaned, fourth_order_poly(signal_background_bins_center_cleaned, *signal_background_cleaned_null_fit_results.values), color="green", label="Null hypothesis")
plt.title("Null hypothesis fit on total mass distribution", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

In [None]:
# Plot the results
plt.scatter(signal_background_bins_center_cleaned, signal_background_counts_cleaned, c="r", marker="x")
plt.hist(signal_background_dataset_cleaned["reco_zv_mass"], weights=signal_background_dataset_cleaned["FullEventWeight"], bins=N_BINS, alpha=0.3, color="blue")
plt.plot(signal_background_bins_center_cleaned, fourth_order_poly(signal_background_bins_center_cleaned, *signal_background_cleaned_null_fit_results.values), label="Null hypothesis", color="darkgreen")
plt.plot(signal_background_bins_center_cleaned, gaussian_plus_fourth_poly(signal_background_bins_center_cleaned, *signal_background_cleaned_fit_alternate_results.values), label="Alternate hypothesis", color="black")
plt.title("Total mass distribution with hypotheses fits", fontsize=14)
plt.xlabel("Reco Mass (MeV)", fontsize=12)
plt.ylabel("Number of Events", fontsize=12)
plt.legend()

In [None]:
# Compute the delta chi squared between the null and alternative hypothesis
delt_chi = signal_background_cleaned_null_fit_results.fval - signal_background_cleaned_fit_alternate_results.fval

# Define the difference in the number of degrees of freedom between the two fits
dof = 1

# Compute the p-value and the significance of the difference between the hypotheses
p_value = chi2.sf(delt_chi, dof)
z_score = np.sqrt(2)*erfinv(1-p_value)

print(f"The p value of the alternate hypothesis is {p_value:.3e}")
print(f"The statistical significance of the signal deviation is {z_score:.3f}")
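
As a cross-check of the conversion used in the cell above: for one degree of freedom, the two-sided Gaussian significance obtained from the p-value equals $\sqrt{\Delta\chi^2}$. The sketch below verifies this on made-up $\Delta\chi^2$ values using `norm.isf` from `scipy.stats`, which is mathematically equivalent to the `erfinv` expression but avoids forming `1 - p_value`, a quantity that can round to exactly 1 when the p-value is very small. `significance_from_delta_chi2` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import chi2, norm

def significance_from_delta_chi2(delta_chi2, dof=1):
    """Return the p-value and two-sided Gaussian significance of a chi-squared difference."""
    p_value = chi2.sf(delta_chi2, dof)
    # Two-sided conversion: half of the p-value sits in each Gaussian tail
    z_score = norm.isf(p_value / 2.0)
    return p_value, z_score

# Made-up delta chi-squared values, not results from this notebook
for toy_delta_chi2 in (4.0, 9.0, 25.0):
    p_value, z_score = significance_from_delta_chi2(toy_delta_chi2)
    print(f"delta chi2 = {toy_delta_chi2:5.1f} -> p = {p_value:.3e}, "
          f"Z = {z_score:.3f}, sqrt(delta chi2) = {np.sqrt(toy_delta_chi2):.3f}")
```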

From this updated significance, we immediately see that using the NN to clean up the signal has made the significance smaller. This is contrary to what one might expect; however, it makes sense.

Given that the harsh cuts already significantly reduced the number of signal events in the mass distribution, adding another filter which further reduces the total efficiency prior to hypothesis testing only serves to reduce the statistics of the signal distribution. This in turn makes the signal peak slightly less visible, hence reducing the significance of the alternate hypothesis.

Were we to have cuts with a greater signal efficiency, or more signal entries to begin with, then the harsh cuts plus the NN classifier could perhaps yield some improvement in the significance of the alternate hypothesis. However, given the current circumstances of this analysis, this is not possible.

This reduction in the visibility of the signal peak is made more evident in the plot which overlays the distributions before and after application of the NN. There are fewer events in the signal distribution, while the background remains unchanged given that we never applied the NN to background events, hence reducing the visibility of the signal peak.
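
The argument can be illustrated with the simple counting approximation $Z \approx s/\sqrt{b}$, which is not the $\Delta\chi^2$ significance computed in this notebook but captures the same trend. A selection with signal efficiency $\epsilon_s$ and background efficiency $\epsilon_b$ rescales it by $\epsilon_s/\sqrt{\epsilon_b}$. The sketch below uses made-up efficiencies and a hypothetical helper `significance_scaling`.

```python
import math

def significance_scaling(signal_efficiency, background_efficiency=1.0):
    """Factor by which the approximate s/sqrt(b) changes after a selection."""
    return signal_efficiency / math.sqrt(background_efficiency)

# Signal-only cut (background untouched): the approximate significance can only decrease
print(significance_scaling(0.9))

# A cut that also removed 75% of the background would instead increase it
print(significance_scaling(0.9, 0.25))
```

In this approximation a classifier cut only helps if it rejects background strongly enough to compensate for the signal it removes.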

# **Improving the NN classifier**


In this section, we retrain the NN with `reco_zv_mass` included in the training features to see how it affects the network. Given that the procedure is identical to that employed in the previous NN section, no additional commentary is provided until the end, where the results of the section are discussed.

A basic overview of the steps is: we create a new dataset identical to that used to train the last network but with the reconstructed mass included, we create and compile a NN, we train it, and then we make all relevant plots to evaluate the network's performance.

In [None]:
# Create a new features key list including reco_zv_mass (a new list, so KEYS["features"] is left unchanged)
feature_keys_improved = KEYS["features"] + ["reco_zv_mass"]
print(feature_keys_improved)

# Create a new dataset including the reco zv mass
harsh_cut_features_train_improved, harsh_cut_targets_train_improved, harsh_cut_features_val_improved, harsh_cut_targets_val_improved = create_ml_training_datasets(
    merge_dataset_post_harsh_cut,
    admixture_size=N_SIGNAL_SIZE,
    feature_keys = feature_keys_improved
)

# Create NN object for the new improved dataset
classifier_network_harsh_cut_deeper_improved = create_nn_classifier(
    num_inputs          = len(feature_keys_improved),
    num_outputs         = len(KEYS["targets"]),
    features            = [2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4],
    activation          = "selu",
    initializer         = "normal",
)

# Fit the neural network to the improved dataset
history_harsh_cut_deeper_improved = classifier_network_harsh_cut_deeper_improved.fit(
    harsh_cut_features_train_improved,
    harsh_cut_targets_train_improved,
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (harsh_cut_features_val_improved, harsh_cut_targets_val_improved),
    verbose = 2
)

In [None]:
# Print the metrics of the network throughout training
print_final_accuracy(history_harsh_cut_deeper_improved.history)

print("\n")

# Plot the loss and accuracy recorded while training the model
plot_training_metrics(history_dict=history_harsh_cut_deeper_improved.history)

In [None]:
# Get predictions on validation dataset
harsh_cut_deeper_classification_scores_val_improved = classifier_network_harsh_cut_deeper_improved.predict(harsh_cut_features_val_improved)
harsh_cut_deeper_predictions_val_improved = (harsh_cut_deeper_classification_scores_val_improved >= 0.5).astype(int) 

# Get predictions on training dataset
harsh_cut_deeper_classification_scores_train_improved = classifier_network_harsh_cut_deeper_improved.predict(harsh_cut_features_train_improved)
harsh_cut_deeper_predictions_train_improved = (harsh_cut_deeper_classification_scores_train_improved >= 0.5).astype(int) 

In [None]:
plot_classification_score_distribution(harsh_cut_deeper_classification_scores_val_improved, harsh_cut_targets_val_improved.to_numpy())

In [None]:
plot_roc_curve(harsh_cut_deeper_classification_scores_train_improved, harsh_cut_deeper_classification_scores_val_improved, harsh_cut_targets_train_improved, harsh_cut_targets_val_improved)

In [None]:
plot_confusion_matrix(harsh_cut_deeper_predictions_val_improved.ravel(), harsh_cut_targets_val_improved.to_numpy().ravel())

By including the reconstructed mass as a feature in the NN, we find that the validation accuracy improved slightly to 90.7%. While the increase is marginal, it is better than the NN trained without the reconstructed mass.

The exact effects are broken down in the confusion matrix. The improved NN improved the classification of signal entries by 3%. This is likely because, with the reconstructed mass, it is easier for the network to identify which entries are signal: the signal only shows up in a very specific mass range, hence any entry with a reconstructed mass outside of this range can automatically be rejected as signal by the network.

Paradoxically, while the signal accuracy of the network increased, the background accuracy decreased slightly, by 2%. It is unknown whether this is due to the stochastic nature of neural network training causing a worse performance on this seed with the addition of the new feature, in which case changing the seed might yield an improvement in background accuracy. Either way, the increase in overall accuracy is marginal and the efficiency of the network is still not perfect, so applying it to the signal+background reconstructed mass distribution would still remove signal entries and hence most likely still reduce the significance of the alternate hypothesis.