### Dataset license

This study uses the "[ATLAS Top Tagging Open Data Set](https://opendata.cern.ch/record/15013)" from the European Organization for Nuclear Research (CERN), which is released under the Creative Commons CC0 waiver and can therefore be used openly and freely, without limitations.

### Data structure

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
import h5py

The data are split into two orthogonal sets, train and test, stored in the HDF5 file format and containing 42 million and 2.5 million jets, respectively.
Because of limits on computational power and computation time, this study uses only the test set, test.h5.

HDF5 is a file format that organizes and stores structured data hierarchically. An HDF5 file can contain groups (folder-like containers), datasets (typed, multidimensional arrays), and attributes (metadata attached to groups or datasets).
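
To make these three terms concrete, here is a minimal sketch that builds a small in-memory HDF5 file with h5py and walks its hierarchy. The file name `toy.h5`, the group name `jets`, the dataset names, and the attribute are made up for illustration and do not come from the ATLAS file.

```python
import h5py
import numpy as np

found = {}

# driver='core' with backing_store=False keeps the file in memory only,
# so nothing is written to disk. All names below are illustrative.
with h5py.File('toy.h5', 'w', driver='core', backing_store=False) as f:
    # Attribute: a small piece of metadata attached to the root group
    f.attrs['description'] = 'toy file'

    # Group: a container, similar to a folder
    grp = f.create_group('jets')

    # Datasets: typed arrays, here one 1-D and one 2-D
    grp.create_dataset('pt', data=np.array([1.0, 2.0, 3.0], dtype=np.float32))
    grp.create_dataset('clus_E', data=np.zeros((3, 200), dtype=np.float32))

    # Visit every object below the root and record only the datasets
    def record(name, obj):
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))

    f.visititems(record)

for name, (shape, dtype) in found.items():
    print(name, shape, dtype)
```

Unlike this toy file, the listing for test.h5 is produced by iterating over the root group only, which is enough when all datasets sit at the top level.
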
First, let's examine the structure of the file.

In [2]:
# Path to the file
file_path = 'test.h5'

# Open the .h5 file for reading
with h5py.File(file_path, 'r') as f:
    print('HDF5 File Structure:')
    for name, dataset in f.items():  # Loop through all items in the root group
        dataset_name = f'Dataset: {name.ljust(14)}'
        shape_str = f'| Shape: {dataset.shape}'
        datatype_str = f'| Datatype: {dataset.dtype}'
        print(f"{dataset_name} {shape_str.ljust(23)} {datatype_str}")  # Print name, shape, and datatype of the dataset

HDF5 File Structure:
Dataset: fjet_C2        | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_D2        | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_ECF1      | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_ECF2      | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_ECF3      | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_L2        | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_L3        | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Qw        | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Split12   | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Split23   | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Tau1_wta  | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Tau2_wta  | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Tau3_wta  | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_Tau4_wta  | Shape: (2484117,)     | Datatype: float32
Dataset: fjet_ThrustMaj | S

The file structure consists of 25 datasets, 21 of which have 1 feature each, and 4 have 200 features each. 
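
As a quick sanity check on these counts, the number of columns a flat table would need can be worked out by hand. The variable names below are illustrative; only the counts 21, 4, and 200 come from the text above.

```python
# Counts taken from the description of the file structure
n_scalar_datasets = 21                  # datasets with 1 feature each
n_constituent_datasets = 4              # datasets with 200 features each
features_per_constituent_dataset = 200

# Each scalar dataset becomes one column; each 200-feature dataset becomes 200 columns
total_columns = (
    n_scalar_datasets
    + n_constituent_datasets * features_per_constituent_dataset
)
print(total_columns)  # 821
```
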

### Converting to CSV

To make the data easier to process and analyze, let's extract these datasets separately, concatenate them into a single table, and save the result to a CSV file.

In [3]:
# List of datasets to extract
datasets_one_feature = [
    '/fjet_C2', '/fjet_D2', '/fjet_ECF1', '/fjet_ECF2', '/fjet_ECF3',
    '/fjet_L2', '/fjet_L3', '/fjet_Qw', '/fjet_Split12', '/fjet_Split23',
    '/fjet_Tau1_wta', '/fjet_Tau2_wta', '/fjet_Tau3_wta', '/fjet_Tau4_wta',
    '/fjet_ThrustMaj', '/fjet_eta', '/fjet_m', '/fjet_phi', '/fjet_pt',
    '/labels', '/weights'
]
datasets_200_features = [
    '/fjet_clus_E', '/fjet_clus_eta',
    '/fjet_clus_phi', '/fjet_clus_pt'
]

# Create an empty DataFrame
df_1 = pd.DataFrame()

# Open the .h5 file for reading
with h5py.File(file_path, 'r') as f:
    
    # Extracting and concatenating datasets with 1 feature
    
    # Iterate over the specified datasets
    for dataset_name in datasets_one_feature:
        # If the dataset exists, add it to the DataFrame
        if dataset_name in f:
            dataset = f[dataset_name]
            df_1[dataset_name[1:]] = dataset[:]
    
    # Extracting and concatenating datasets with 200 features
    
    datasets = []
    
    # Iterate over the datasets and append their values to the list
    for dataset_name in datasets_200_features:
        dataset_200 = f[dataset_name]
        datasets.append(dataset_200[:])

    # Concatenate the datasets along the second axis (axis=1) using NumPy
    concatenated_data = np.concatenate(datasets, axis=1)
    
    # Generate column names
    column_names = []
    for dataset_name in datasets_200_features:
        for i in range(200):
            column_names.append(f'{dataset_name[1:]}_{i}')

    # Convert to pandas DataFrame with column names
    df_200 = pd.DataFrame(concatenated_data, columns=column_names)
    
# Concatenate df_1 and df_200 along axis=1 using NumPy
# np.concatenate is used instead of pd.concat because the latter crashed the system without producing any result
result_array = np.concatenate([df_1.to_numpy(), df_200.to_numpy()], axis=1)

# Convert the concatenated array to a DataFrame
result_df = pd.DataFrame(result_array, columns=list(df_1.columns) + list(df_200.columns))

# Save the DataFrame to a CSV file
result_df.to_csv('result.csv', index=False)
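
Holding the full table in memory is what makes this step expensive. One possible alternative, sketched here on toy data rather than on test.h5, is to write the CSV in row chunks so that only one slice is in memory at a time. The helper name `write_csv_in_chunks`, the chunk size, and the toy arrays are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_csv_in_chunks(sources, path, chunk_rows=100_000):
    """Write equally long array-likes to one CSV file, chunk_rows rows at a time.

    sources maps a column prefix to a 1-D or 2-D array-like that supports
    len() and slicing (NumPy arrays in this sketch).
    """
    n_rows = len(next(iter(sources.values())))
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        frames = []
        for name, data in sources.items():
            block = np.asarray(data[start:stop])
            if block.ndim == 1:
                # One column, named after the source
                frames.append(pd.DataFrame({name: block}))
            else:
                # One column per feature: name_0, name_1, ...
                cols = [f'{name}_{i}' for i in range(block.shape[1])]
                frames.append(pd.DataFrame(block, columns=cols))
        chunk = pd.concat(frames, axis=1)
        # The first chunk creates the file and writes the header; later chunks append
        first = start == 0
        chunk.to_csv(path, mode='w' if first else 'a', header=first, index=False)


# Toy data standing in for one 1-feature and one multi-feature dataset
rng = np.random.default_rng(0)
toy = {
    'fjet_pt': rng.random(10, dtype=np.float32),
    'fjet_clus_E': rng.random((10, 3), dtype=np.float32),
}
out_path = os.path.join(tempfile.mkdtemp(), 'toy_result.csv')
write_csv_in_chunks(toy, out_path, chunk_rows=4)

check = pd.read_csv(out_path)
print(check.shape)
```

Because an open h5py dataset supports `len()` and slicing in the same way as a NumPy array, the same helper could in principle be given the datasets directly inside the `with h5py.File(...)` block. That variant has not been run against the full file here.
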

The dataset has been transformed and saved to the file result.csv. Let's verify the resulting DataFrame.

In [4]:
result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2484117 entries, 0 to 2484116
Columns: 821 entries, fjet_C2 to fjet_clus_pt_199
dtypes: float64(821)
memory usage: 15.2 GB
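
The reported memory usage follows directly from rows × columns × 8 bytes per float64 value (pandas reports sizes in units of 1024³ bytes). The datasets shown in the file listing are float32, so the float64 result suggests an upcast during concatenation. A likely cause, stated here as an assumption because the listing is truncated, is that at least one column has a different dtype, which makes NumPy promote the whole array. The sketch below reproduces the arithmetic and demonstrates the promotion on toy arrays; `size_gib` is a hypothetical helper.

```python
import numpy as np

# Row and column counts taken from the info() output above
rows, cols = 2_484_117, 821


def size_gib(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols array in units of 1024**3 bytes."""
    return n_rows * n_cols * bytes_per_value / 2**30


print(f'float64: {size_gib(rows, cols, 8):.1f}')  # 15.2
print(f'float32: {size_gib(rows, cols, 4):.1f}')  # 7.6

# Toy demonstration of dtype promotion (the int64 column is an assumption,
# standing in for any non-float32 dataset)
floats = np.ones((3, 2), dtype=np.float32)
ints = np.ones((3, 1), dtype=np.int64)
mixed = np.concatenate([floats, ints], axis=1)
print(mixed.dtype)  # float64

# Casting back to float32 halves the footprint
compact = mixed.astype(np.float32)
print(mixed.nbytes, compact.nbytes)  # 72 36
```

Whether float32 precision is acceptable for every column is a judgment call for the analysis. The sketch only shows where the factor of two comes from.
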
