# Create datasets

This notebook shows how to prepare data in different representations and create a dataset.

## Introduction to representations

Before applying methods to analyze events, we need to represent them in a proper format.

Set is a 1D representation of an event: it contains a set of observables computed from the event's
physics objects. Observables are variables designed to reflect the properties of events. For example,
the invariant mass of a dijet system indicates the mass of the intermediate particle, and
N-subjettiness indicates the number of subjets in a jet. This representation is straightforward and
is used in a large number of analyses.
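
To make the dijet example above concrete, here is a minimal sketch of an invariant-mass calculation in plain Python. It does not use HML; the helper names `four_momentum` and `invariant_mass` and the jet values are hypothetical, chosen only for illustration. Each jet is assumed to be given as `(pt, eta, phi, mass)`.

```python
import math


def four_momentum(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian components (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(*particles):
    """Invariant mass of the summed four-momenta of the given particles."""
    total = [0.0, 0.0, 0.0, 0.0]
    for particle in particles:
        for i, component in enumerate(four_momentum(*particle)):
            total[i] += component
    energy, px, py, pz = total
    mass_squared = energy**2 - px**2 - py**2 - pz**2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mass_squared, 0.0))


# Two hypothetical massless jets, back to back in the transverse plane.
jet1 = (45.0, 0.0, 0.0, 0.0)
jet2 = (45.0, 0.0, math.pi, 0.0)
dijet_mass = invariant_mass(jet1, jet2)
```

For two massless back-to-back jets at `eta = 0`, the momenta cancel and the dijet mass equals the sum of the two transverse momenta.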

! Need description of Image and Graph.

## Get events from ROOT files

First, we need to read events from the output `.root` file. Let's import the necessary modules from HML.

In [1]:
import numpy as np
from hml.generators import MG5Run
from hml.representations import Set
from hml.observables import Pt, M, DeltaR
from hml.datasets import Dataset

Welcome to JupyROOT 6.24/02


`MG5Run` provides information about a run, such as its cross section and events. It takes the run
directory as its argument and retrieves the information from there.

In [2]:
sig_run = MG5Run("./data/pp2zj/Events/run_01/")
bkg_run = MG5Run("./data/qcd/Events/run_01/")

Then we need to define a representation. Here we use `Set` as an example. It requires a list of
observables. You can find all supported observables in the `hml.observables` module.

In [3]:
representation = Set([Pt("Jet1"), Pt("Jet2"), DeltaR("Jet1", "Jet2"), M("FatJet1")])
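
As a rough illustration of what an angular-separation observable such as `DeltaR` measures, the sketch below uses the conventional definition, the square root of the squared differences in pseudorapidity and azimuthal angle. This is plain Python, not HML code; whether HML computes the quantity in exactly this way is an assumption, and the function names and jet values are hypothetical.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into the interval [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation in the (eta, phi) plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


# Two hypothetical jets on either side of the phi = +/-pi boundary.
# Without wrapping, the phi difference would be 6.0 instead of about -0.28.
dphi = delta_phi(3.0, -3.0)
dr = delta_r(0.5, 3.0, -0.5, -3.0)
```

The wrapping step matters because azimuthal angles are periodic: two jets at `phi = 3.0` and `phi = -3.0` are close to each other, not far apart.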

The two core parts of a dataset are data and target: data holds the values of the defined
representation (the features) for each event; target is the integer label of each event. We create
data and target as two empty lists and fill them by looping over all events.

To obtain the representation values, call the `.from_event` method; HML then retrieves the values in
the proper format. Inside the event loop, you can add custom cuts to select events in the same way as
in `pyROOT`.

In [4]:
data, target = [], []

for event in sig_run.events:
    if event.Jet_size >= 2 and event.FatJet_size >= 1:
        representation.from_event(event)
        data.append(representation.values)
        target.append(1)

for event in bkg_run.events:
    if event.Jet_size >= 2 and event.FatJet_size >= 1:
        representation.from_event(event)
        data.append(representation.values)
        target.append(0)

data = np.array(data, dtype=np.float32)
target = np.array(target, dtype=np.int32)


In [5]:
data.shape

(16289, 4)

In [6]:
np.unique(target, return_counts=True)

(array([0, 1], dtype=int32), array([7433, 8856]))
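
As a quick sanity check, the per-class counts should add up to the number of rows in `data`, and their ratio gives the class balance. The sketch below uses plain Python with the counts copied from the output above; the variable names are only illustrative.

```python
# Counts copied from the output above: label 0 (background), label 1 (signal).
counts = {0: 7433, 1: 8856}

# The total should match the first dimension of data.shape, i.e. 16289.
n_events = sum(counts.values())

# Fraction of events in each class, useful for judging class imbalance.
fractions = {label: n / n_events for label, n in counts.items()}
```

Here the two classes are roughly balanced, with slightly more signal than background events passing the selection.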

After filling data and target, we can build the dataset together with its descriptive information.

In [7]:
dataset = Dataset(
    data,
    target,
    feature_names=representation.names,
    target_names=["pp2zj", "qcd"],
    description="This is a demo dataset for HML. Classify pp2zj and qcd events.",
    dataset_dir="./data/z_vs_qcd",
)

In [8]:
dataset.save()

Finally, call the `save` method to write the dataset to a directory: the descriptive information is
stored as metadata in a YAML file, and data and target are stored in an `.npz` file.
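
To give a feel for the `.npz` part of this storage, here is a minimal sketch that writes two small arrays with NumPy and reads them back. It is not HML's implementation: the file name `dataset.npz`, the keys `data` and `target`, and the array values are assumptions made for illustration, and the YAML metadata is left out.

```python
import tempfile
from pathlib import Path

import numpy as np

# Tiny made-up arrays standing in for the real data and target.
data = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
target = np.array([1, 0], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.npz"  # hypothetical file name
    # Each keyword argument becomes a named array inside the archive.
    np.savez(path, data=data, target=target)

    # Read the arrays back; the archive is closed when the block exits.
    with np.load(path) as archive:
        loaded_data = archive["data"]
        loaded_target = archive["target"]
```

An `.npz` file keeps the arrays together under their names and preserves their dtypes, so the values come back exactly as they were written.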