This notebook uses the `Data_generator` class, imported from `DD_data_extractor_git`, to generate synthetic data for a physics experiment. It then removes outliers from the data and stores the cleaned result in a pickle file.

## Data Generation

`Data_generator` is the main part of this script. It is initialized with two parameters: the number of events (`numevents`) and a boolean (`normalize`) indicating whether the generated data should be normalized. The synthetic data represent physics events involving multiple particles, where each particle is characterized by properties such as `eta`, `mass`, `phi`, `pt`, `charge`, and `genPartFlav`.

Upon initialization, `Data_generator` configures a series of functions and input variables, which are then utilized in the `generate_fake_data` method to generate synthetic data. 

The `generate_fake_data` method starts by creating a dictionary, `data`, with keys for each variable and empty lists as values. It then generates random data for these variables using relevant statistical distributions which are based on the physical properties being simulated. 
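
The pattern described above (a dictionary of empty lists, filled by sampling each variable from a distribution) can be sketched as follows. This is only an illustration, not the actual `generate_fake_data` implementation: the function name `generate_fake_events`, the choice of distributions, and every numeric parameter are assumptions made for the example.

```python
import math
import random


def generate_fake_events(num_events, seed=0):
    """Hypothetical stand-in for generate_fake_data: one list per variable."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    variables = ["eta", "mass", "phi", "pt", "charge", "genPartFlav"]
    data = {name: [] for name in variables}  # keys with empty lists as values

    for _ in range(num_events):
        # All distributions and parameters below are assumed for illustration.
        data["eta"].append(rng.gauss(0.0, 1.5))
        data["mass"].append(rng.uniform(0.0, 2.0))
        data["phi"].append(rng.uniform(-math.pi, math.pi))  # azimuthal angle
        data["pt"].append(rng.expovariate(1.0 / 30.0))      # falling spectrum, mean 30
        data["charge"].append(rng.choice([-1, 1]))
        data["genPartFlav"].append(rng.randint(0, 5))       # integer flavour label
    return data


fake = generate_fake_events(1000)
print({name: len(values) for name, values in fake.items()})
```

Filling plain lists and converting them to arrays afterwards keeps the per-variable sampling logic independent of the final storage format.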

After the data has been generated, `Data_generator` optionally renames and reorders the dictionary keys into a more standard format before returning the data.

## Data Cleaning

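After the data has been generated, outliers are removed with the `remove_outliers` function defined in this notebook. The function reads percentile limits from a YAML file (`GenerateDataClean.yaml`): for each feature it computes the values at the lower and upper percentiles (0.03 and 99.7 unless the file overrides them) and flags every event outside that range. Features listed under `do_not_cut` are not used to flag outliers. An event flagged by any feature is removed from all features, so every array keeps the same length.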
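
To make the role of the limits file concrete, here is a minimal sketch of how the parsed YAML might look and how the percentiles for one feature are resolved. The key names (`do_not_cut`, `cuts`, `lower_percentile`, `upper_percentile`) and the defaults are the ones read by `remove_outliers`; the helper `resolve_percentiles`, the feature entries, and their values are hypothetical, since the contents of `GenerateDataClean.yaml` are not shown here.

```python
# Assumed shape of the dictionary returned by yaml.safe_load; values are illustrative.
example_limits = {
    "do_not_cut": ["event", "genWeight"],
    "cuts": {
        "pt_1": {"lower_percentile": 0.0, "upper_percentile": 99.0},
        "eta_1": {"upper_percentile": 99.9},  # lower bound falls back to the default
    },
}

DEFAULT_LOWER, DEFAULT_UPPER = 0.03, 99.7  # defaults used by remove_outliers


def resolve_percentiles(feature_name, limits):
    """Return (lower, upper) percentiles for a feature, or None if it is exempt."""
    if feature_name in limits.get("do_not_cut", []):
        return None
    feature_limits = limits.get("cuts", {}).get(feature_name) or {}
    return (
        feature_limits.get("lower_percentile", DEFAULT_LOWER),
        feature_limits.get("upper_percentile", DEFAULT_UPPER),
    )


for name in ["event", "pt_1", "eta_1", "mass_1"]:
    print(name, resolve_percentiles(name, example_limits))
```

A feature that appears in neither list, such as `mass_1` in this example, is still cut at the default percentiles.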

## Data Storage

Once the outliers have been removed, the cleaned dictionary is written with `pickle.dump` to the `saved_files/fake_data` folder under the parent of the current working directory, for later use.
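
For later use, the stored dictionary can be read back with `pickle.load`. The round trip below is a small self-contained sketch: the toy dictionary and the file name `example.pkl` are placeholders, and a temporary directory is used so that nothing is written to the project folders.

```python
import os
import pickle
import tempfile

# Toy stand-in for the cleaned data dictionary (values are made up).
cleaned = {"pt_1": [12.5, 40.1], "eta_1": [0.3, -1.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")  # hypothetical file name

    # Pickle files must be opened in binary mode for both writing and reading.
    with open(path, "wb") as f:
        pickle.dump(cleaned, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == cleaned)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.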



In [8]:
import pickle
import numpy as np
from copy import deepcopy
import sys
sys.path.append('../utils/')
from DD_data_extractor_git import Data_generator, remove_all_outliers
import matplotlib.pyplot as plt
import pandas as pd
import yaml
import os

In [9]:
num_events = 100000
Data_generator1 = Data_generator(num_events, normalize=True)
data_dict=Data_generator1.getData()

In [10]:
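def read_yaml_file(filename):
    """Load a YAML file and return its parsed contents."""
    with open(filename, 'r') as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            # Report the error and re-raise; returning here would leave the result undefined
            print(exc)
            raise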


limits = read_yaml_file('GenerateDataClean.yaml')


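def remove_outliers(data, limits):
    """Drop every event in which any feature lies outside its percentile range.

    Modifies `data` in place and returns it.
    """
    outlier_mask = np.zeros(data[next(iter(data))].shape, dtype=bool)  # initially no outliers

    do_not_cut = limits.get('do_not_cut', [])  # features not used to flag outliers
    cuts = limits.get('cuts', {})  # per-feature percentile overrides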

    for feature_name in data.keys():
        if feature_name in do_not_cut:
            continue  # do not change the outlier mask

        # Fall back to the default percentiles when no cut is configured for this feature
        feature_limits = cuts.get(feature_name) or {}
        lower_limit = feature_limits.get('lower_percentile', 0.03)
        upper_limit = feature_limits.get('upper_percentile', 99.7)

        lower_value, upper_value = np.percentile(data[feature_name], [lower_limit, upper_limit])
        feature_outlier_mask = (data[feature_name] < lower_value) | (data[feature_name] > upper_value)
        outlier_mask |= feature_outlier_mask  # update the outlier mask

    # remove outliers from all features
    for feature_name in data.keys():
        data[feature_name] = data[feature_name][~outlier_mask]

    return data



# Work on a deep copy, since remove_outliers filters the dictionary in place
data_dict_removed_outliers2 = deepcopy(data_dict)
data_dict_removed_outliers2 = remove_outliers(data_dict_removed_outliers2, limits)


In [15]:
base_path = os.path.dirname(os.getcwd())
folder = "fake_data"

full_folder_path = os.path.join(base_path,"saved_files", folder)

os.makedirs(full_folder_path, exist_ok=True)

filename = "data_dict_removed_outliersAug4.pkl"

full_file_path = os.path.join(full_folder_path, filename)

with open(full_file_path, 'wb') as f:
    pickle.dump(data_dict_removed_outliers2, f)


print(data_dict_removed_outliers2.keys())
print(data_dict_removed_outliers2['mass_12'].shape)

dict_keys(['event', 'genWeight', 'MET_phi', '1_phi', '1_genPartFlav', '2_phi', '2_genPartFlav', '3_phi', '3_genPartFlav', 'charge_1', 'charge_2', 'charge_3', 'pt_1', 'pt_2', 'pt_3', 'pt_MET', 'eta_1', 'eta_2', 'eta_3', 'mass_1', 'mass_2', 'mass_3', 'deltaphi_12', 'deltaphi_13', 'deltaphi_23', 'deltaphi_1MET', 'deltaphi_2MET', 'deltaphi_3MET', 'deltaphi_1(23)', 'deltaphi_2(13)', 'deltaphi_3(12)', 'deltaphi_MET(12)', 'deltaphi_MET(13)', 'deltaphi_MET(23)', 'deltaphi_1(2MET)', 'deltaphi_1(3MET)', 'deltaphi_2(1MET)', 'deltaphi_2(3MET)', 'deltaphi_3(1MET)', 'deltaphi_3(2MET)', 'deltaeta_12', 'deltaeta_13', 'deltaeta_23', 'deltaeta_1(23)', 'deltaeta_2(13)', 'deltaeta_3(12)', 'deltaR_12', 'deltaR_13', 'deltaR_23', 'deltaR_1(23)', 'deltaR_2(13)', 'deltaR_3(12)', 'pt_123', 'mt_12', 'mt_13', 'mt_23', 'mt_1MET', 'mt_2MET', 'mt_3MET', 'mt_1(23)', 'mt_2(13)', 'mt_3(12)', 'mt_MET(12)', 'mt_MET(13)', 'mt_MET(23)', 'mt_1(2MET)', 'mt_1(3MET)', 'mt_2(1MET)', 'mt_2(3MET)', 'mt_3(1MET)', 'mt_3(2MET)', 'ma