# Create datasets

This notebook shows how to convert events stored in ROOT branches into different representations
and collect the resulting samples into a dataset.

## Introduction to representations

A data representation is a data structure for storing event data. Three representations are commonly
used in high energy physics: Set, Image, and Graph.

Set is a 1D representation of an event: a set of observables computed from its physics objects.
Observables are variables designed to reflect the properties of events. For example, the invariant
mass of a dijet system indicates the intermediate particle, and N-subjettiness indicates the number
of subjets in a jet. Representing events as a Set is common in phenomenological studies.
As high-level, low-dimensional features, observables are usually designed with physics insight and
expertise. Set is usually used with traditional methods such as cut-based analyses and
decision trees.
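
To make the idea of observables concrete, here is a minimal NumPy sketch that computes a few of them by hand. It does not use HML: the helper names (`four_momentum`, `invariant_mass`, `delta_r`) and the jet values are made up for illustration, and jets are assumed to be given as (pt, eta, phi, mass).

```python
import numpy as np


def four_momentum(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to the four-momentum (E, px, py, pz)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return np.array([energy, px, py, pz])


def invariant_mass(p4):
    """Invariant mass of a four-momentum, clipped at zero against rounding."""
    energy, px, py, pz = p4
    m2 = energy**2 - px**2 - py**2 - pz**2
    return np.sqrt(max(m2, 0.0))


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in the eta-phi plane, with the phi difference wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)


# Two made-up jets: (pt [GeV], eta, phi, mass [GeV])
jet1 = (120.0, 0.5, 0.3, 10.0)
jet2 = (95.0, -0.7, -2.8, 8.0)

# The dijet system is the sum of the two four-momenta.
dijet = four_momentum(*jet1) + four_momentum(*jet2)

# One event becomes one flat vector of observables.
set_values = np.array(
    [
        jet1[0],
        jet2[0],
        delta_r(jet1[1], jet1[2], jet2[1], jet2[2]),
        invariant_mass(dijet),
    ],
    dtype=np.float32,
)
```

Each event ends up as a single fixed-length row, which is why a Set fits tabular methods so naturally.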

Image is a 2D (grayscale) or 3D (RGB) representation. A detector works like a giant camera that
takes pictures of the final stable particles. These particles naturally form a 2D
image whose pixel intensity is the energy deposit of the particles or another particle-related
feature, and whose width and height span the $\eta-\phi$ plane (pseudorapidity and azimuthal angle).
An Image reflects the spatial characteristics of an event, which makes it suitable for convolutional
neural networks (CNNs). It carries more complete information than a Set, but it is usually a sparse
matrix, which is not storage efficient.
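
As a rough illustration of the pixelation step, the sketch below bins a handful of made-up particles into an $\eta-\phi$ grid with `np.histogram2d`, using transverse momentum as the pixel intensity. The particle values, the 32x32 grid, and the $|\eta| < 2.5$ range are assumptions chosen for the example, not HML defaults.

```python
import numpy as np

# Made-up particles: columns are (pt [GeV], eta, phi)
particles = np.array(
    [
        [50.0, 0.10, 0.25],
        [30.0, 0.12, 0.28],
        [20.0, -1.50, 2.00],
        [5.0, 2.00, -3.00],
    ]
)

n_bins = 32
eta_range = (-2.5, 2.5)
phi_range = (-np.pi, np.pi)

# Sum the pt of all particles that fall into the same (eta, phi) pixel.
image, eta_edges, phi_edges = np.histogram2d(
    particles[:, 1],
    particles[:, 2],
    bins=n_bins,
    range=[eta_range, phi_range],
    weights=particles[:, 0],
)

# Fraction of pixels that are filled; small values mean a sparse image.
occupancy = np.count_nonzero(image) / image.size
```

Here the first two particles land in the same pixel, so only 3 of the 1024 pixels are filled, which shows how sparse such images tend to be.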

Graph is a 2D representation. A graph consists of nodes and edges. Here nodes are particles and
edges are the relationships between particles (e.g. the distance between particles in the
$\eta-\phi$ plane). As a low-level, high-dimensional representation, a Graph contains all the
details of an event, which makes it the most suitable for fully exploring the potential of new
physics and machine learning techniques. Graph is usually used with graph neural networks (GNNs).
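
One possible way to build such a graph is sketched below with plain NumPy: particles are the nodes, and two particles are connected when their distance in the $\eta-\phi$ plane is below a radius. The particle values and the radius of 0.4 are assumptions for illustration; other edge definitions (fully connected, k-nearest neighbours) are equally valid.

```python
import numpy as np

# Made-up particles used as node features: columns are (pt [GeV], eta, phi)
nodes = np.array(
    [
        [50.0, 0.10, 0.25],
        [30.0, 0.12, 0.28],
        [20.0, -1.50, 2.00],
        [5.0, 2.00, -3.00],
    ]
)

eta, phi = nodes[:, 1], nodes[:, 2]

# Pairwise differences; the phi difference is wrapped to [-pi, pi).
deta = eta[:, None] - eta[None, :]
dphi = (phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi
distance = np.hypot(deta, dphi)

# Connect pairs closer than `radius`, excluding self-loops.
radius = 0.4
adjacency = (distance < radius) & ~np.eye(len(nodes), dtype=bool)

# Edge list as (sender, receiver) index pairs, with the distance as edge feature.
senders, receivers = np.nonzero(adjacency)
edges = np.stack([senders, receivers], axis=1)
edge_features = distance[senders, receivers]
```

Unlike an Image, nothing is binned away here: every particle keeps its exact coordinates, at the cost of a variable-size structure per event.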

## Get events from ROOT files

First, we need to read events from the output `.root` files. Let's import the necessary modules from HML.

In [1]:
import numpy as np
from hml.generators import MG5Run
from hml.representations import Set
from hml.observables import Pt, M, DeltaR
from hml.datasets import Dataset

Welcome to JupyROOT 6.24/02


`MG5Run` provides information about a run, such as its cross section and events. It takes the
run directory and retrieves all useful information from it.

In [2]:
sig_run = MG5Run("./data/pp2zz/Events/run_01/")
bkg_run = MG5Run("./data/pp2jj/Events/run_01/")

Then we need to define a representation. Here we use `Set` as an example. It requires a list of
observables. You can find all supported observables in the `hml.observables` module.

In [3]:
representation = Set([Pt("Jet1"), Pt("Jet2"), DeltaR("Jet1", "Jet2"), M("FatJet1")])

The two core parts of a dataset are data and target: data holds the values of the defined
representation, i.e. the features of each event; target is the integer label of each event. We
create data and target as two empty lists and fill them by looping over all events.

To obtain the representation values, call the `.from_event` method, and HML will do the rest to
return the values in the proper format. In the event loop, you can add custom cuts to select events
in the same way as in `PyROOT`.

In [4]:
data, target = [], []

for event in sig_run.events:
    if event.Jet_size >= 2 and event.FatJet_size >= 1:
        representation.from_event(event)
        data.append(representation.values)
        target.append(1)

for event in bkg_run.events:
    if event.Jet_size >= 2 and event.FatJet_size >= 1:
        representation.from_event(event)
        data.append(representation.values)
        target.append(0)

data = np.array(data, dtype=np.float32)
target = np.array(target, dtype=np.int32)
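
Before building the dataset, a quick check of the arrays can catch problems early. The sketch below is not part of HML; it uses small stand-in arrays with the same layout as `data` and `target` above (one row of four features per event, one integer label per event) so that it runs without any ROOT files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in arrays: 10 events with 4 features each, 6 signal and 4 background.
data = rng.normal(size=(10, 4)).astype(np.float32)
target = np.array([1] * 6 + [0] * 4, dtype=np.int32)

# Every event needs exactly one label.
same_length = data.ndim == 2 and data.shape[0] == target.shape[0]

# Non-finite values would hint at an observable that could not be computed.
all_finite = bool(np.isfinite(data).all())

# Number of events per class, indexed by label (0 = background, 1 = signal).
class_counts = np.bincount(target, minlength=2)
```

A strongly unbalanced `class_counts` is worth knowing about before training, since it depends on how many events of each run pass the cuts.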

After filling data and target, we can complete the dataset with the necessary information.

In [5]:
dataset = Dataset(
    data,
    target,
    feature_names=representation.names,
    target_names=["pp2jj", "pp2zz"],
    description="This is a demo dataset for HML. Classify Z jets and QCD jets.",
    dir_path="./data/z_vs_qcd",
)

Finally, call the `save` method to save the dataset to a directory, where the descriptive
information is stored as metadata in a YAML file and data and target are stored in an npz file. To
learn more about `Dataset`, check out the [API documentation](../../api-reference/hml.datasets/#hml.datasets.Dataset).

In [6]:
dataset.save(exist_ok=True)
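
For readers unfamiliar with the npz format, the sketch below shows a plain NumPy round trip of two arrays. It only illustrates the format: the file name `example.npz` and the keys `data` and `target` are hypothetical and may differ from what `Dataset.save` actually writes.

```python
import tempfile
from pathlib import Path

import numpy as np

# Tiny stand-in arrays with the same dtypes as in this notebook.
data = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
target = np.array([1, 0], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.npz"  # hypothetical file name

    # An npz file is an archive holding several named arrays.
    np.savez(path, data=data, target=target)

    # Arrays are read back by name; dtypes and shapes are preserved.
    with np.load(path) as archive:
        loaded_data = archive["data"]
        loaded_target = archive["target"]
```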

In the next notebook, we will see how to load a local dataset and use it to train a machine learning
model.