
# Histograms

Histograms provide a neat way of visualizing the distribution of data. Moreover, binning reduces the data size significantly: instead of storing each data point, that is N floats, we only need to store the bin contents, whose number is much smaller and independent of N.

For large data sets, histograms are often the way to go.
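
To make the size reduction concrete, here is a minimal sketch using plain NumPy. The sample size, seed, and the choice of 50 bins are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
data = rng.normal(0.0, 1.0, size=1_000_000)

# Bin into 50 bins; the default range spans min..max, so no sample is dropped
counts, edges = np.histogram(data, bins=50)

print(f"raw data:   {data.size} values, {data.nbytes} bytes")
print(f"bin counts: {counts.size} values, {counts.nbytes} bytes")
```

The number of stored values stays at 50 no matter how many samples are binned.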

While matplotlib provides a way to plot histograms, and numpy a way to bin data, they offer only basic functionality and do not help with other aspects such as managing the binning, axis labeling, etc.
Furthermore, matplotlib's `hist` function expects the raw data and bins it itself; it is not designed to plot an **already binned histogram**. With large amounts of data, it is often necessary to bin the data before plotting it.
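
One reason binning first matters for large data: bin counts can be accumulated chunk by chunk, so the full data set never has to be held for plotting. The following is a minimal NumPy sketch of that idea. The in-memory array split into ten chunks is a stand-in for data that would really be read piecewise, and the edges are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(0.0, 1.0, size=1_000_000)  # stand-in for a large data set
edges = np.linspace(-5.0, 5.0, 51)           # fixed edges shared by all chunks

total = np.zeros(edges.size - 1, dtype=np.int64)
for chunk in np.array_split(data, 10):
    chunk_counts, _ = np.histogram(chunk, bins=edges)
    total += chunk_counts  # only the small count array is kept between chunks

# Cross-check: binning everything at once gives the same counts
full_counts, _ = np.histogram(data, bins=edges)
print(np.array_equal(total, full_counts))
```

This only works because every chunk uses the same, predefined bin edges.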

## hist

`hist` is a library for creating and manipulating histograms. It is highly performant and offers a lot of functionality around histograms.

In [None]:
import hist
import mplhep
import numpy as np

In [None]:
nsamples = 10_000_000
array1 = np.random.normal(0, 1, nsamples)
array2 = np.random.normal(1, 3, nsamples)
array12 = np.stack([array1, array2], axis=-1)

In [None]:
bins, edges = np.histogramdd(array12, bins=6)

In [None]:
%%timeit
bins, edges = np.histogramdd(array12, bins=6)

In [None]:
bins

We cannot easily plot them: there are no uncertainty bars, no labels, the binning is not clear, etc.

In [None]:
# Compose axes however you like; this is a 2D histogram
axis1 = hist.axis.Regular(6, -5, 5, name='x')
axis2 = hist.axis.Regular(6, -15, 20, name='y')
h = hist.Hist(axis1, axis2)

In [None]:
# Filling can be done with arrays, one per dimension
h.fill(x=array1, y=array2)

In [None]:
%%timeit  # to time it, we put everything in one cell
# Filling can be done with arrays, one per dimension
axistmp1 = hist.axis.Regular(6, -5, 5, name='x')
axistmp2 = hist.axis.Regular(6, -15, 20, name='y')
h = hist.Hist(axistmp1, axistmp2)
h.fill(x=array1, y=array2)

In [None]:
# NumPy array view into histogram counts, no overflow bins
counts = h.counts()
variances = h.variances()  # errors
print(f"counts = {counts}, variances = {variances}")

In [None]:
# Let's plot it
h.plot2d()

## Axes

A central part of a histogram are its axes: they define the binning and other traits of the axis.

A Hist can have multiple axes of different types.

All axes are described [here](https://hist.readthedocs.io/en/latest/user-guide/axes.html#axes).



The most important types are:


### Regular

This is an axis with lower, upper limits, **regularly** split into n bins.

```
axis_reg = hist.axis.Regular(nbins, lower, upper, name=name)
```
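
To illustrate what "regularly split" means, the sketch below reproduces the edges of such an axis with NumPy and computes which bin a value falls into. The numbers and the helper `regular_bin_index` are hypothetical and only serve the illustration; they are not part of `hist`.

```python
import numpy as np

nbins, lower, upper = 6, -5.0, 5.0  # illustrative values
edges = np.linspace(lower, upper, nbins + 1)  # nbins bins need nbins + 1 edges


def regular_bin_index(x):
    """Hypothetical helper: index of the bin containing x, for lower <= x < upper."""
    return int(np.floor((x - lower) / (upper - lower) * nbins))


idx = regular_bin_index(0.5)
print(idx, edges[idx], edges[idx + 1])
```

Because all bins have the same width, the bin index follows from simple arithmetic and no search over the edges is needed.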

### Variable

A variable axis allows setting the bin edges arbitrarily using an array-like object.
```
axis_var = hist.axis.Variable([0, 0.5, 3.1, 3.4], name="eta")
```
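
As a rough NumPy analogue of variable binning, explicit edges can be passed to `np.histogram`. The sample values below are made up for illustration; only the edges are taken from the axis above.

```python
import numpy as np

edges = np.array([0, 0.5, 3.1, 3.4])  # same edges as the "eta" axis above
values = np.array([0.1, 0.2, 1.0, 3.2, 3.3, 3.39])  # made-up sample values

counts, _ = np.histogram(values, bins=edges)
widths = np.diff(edges)  # the bins now have different widths

print(counts)
print(widths)
```

With unequal widths, raw counts in different bins are no longer directly comparable, which is one reason to look at densities.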

## Axis Name

An axis has a name, which can be used as the identifier when working with the histogram (instead of using plain integer indexes), and optionally a label, which can be used for plotting.

In [None]:
axisreg = hist.axis.Regular(bins=50, start=-10, stop=10, name="length", label="Length [cm]")

To create a histogram, we can pass one or multiple axes to `hist.Hist`:

In [None]:
data_h = hist.Hist(axisreg)

In [None]:
data_h.fill(length=array1)

In [None]:
# only filling the first 1000 entries to see the uncertainty
data_h2 = hist.Hist(axisreg).fill(array1[:1000])  # we can also chain the commands

### Plotting with mplhep

Since matplotlib's `hist` function is designed for raw data rather than already binned histograms, we can use the plotting methods of `hist`. Another option with more features is the `mplhep` package, which (like all plotting shown here) is a high-level interface to matplotlib.
In short, mplhep and hist work seamlessly together:

In [None]:
mplhep.histplot(data_h2, histtype="errorbar")  # we clearly see the uncertainty
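
Why the uncertainty is visible with only 1000 entries can be sketched with NumPy, assuming simple Poisson uncertainties of sqrt(N) per bin (a plotting library may use a more refined interval). The seed and the helper `peak_relative_error` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
edges = np.linspace(-10.0, 10.0, 51)  # 50 bins from -10 to 10


def peak_relative_error(n_samples):
    """Hypothetical helper: sqrt(N) / N in the most populated bin."""
    counts, _ = np.histogram(rng.normal(0.0, 1.0, size=n_samples), bins=edges)
    peak = counts.max()
    return np.sqrt(peak) / peak


rel_small = peak_relative_error(1_000)
rel_large = peak_relative_error(1_000_000)
print(f"relative uncertainty: {rel_small:.3f} (1e3 entries), {rel_large:.4f} (1e6 entries)")
```

The relative uncertainty falls like 1/sqrt(N), so the error bars become hard to see once the bins are well populated.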

### Plotting with hist


As we've seen already, `hist` itself also provides plotting functionality:

In [None]:
data_h.plot1d()

## Multiple dimensions

Histograms can be multidimensional. The histogram `h` that we filled above already has two axes, so we can plot it in 2D.

In [None]:
mplhep.hist2dplot(h)

## Access Bins

`hist` allows you to access the bins of your Hist in various ways. Besides the normal access by index, you can use locations (supported by boost-histogram), complex numbers, and dictionaries to access the bins.

In [None]:
# Access by bin number
h[3, 2]

## Getting Density

If you want to get the density of an existing histogram, `.density()` does this and returns the density array without overflow and underflow bins.

A histogram is a count, so it's an **integral over a density**. To obtain the density, one can divide by the area of the bin; this gives the "average density" in a bin.

In [None]:
h.density()
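
The "average density" can also be written out by hand. The sketch below uses NumPy only and assumes the density is normalized to the entries inside the axis range; the sample size and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.normal(0.0, 1.0, size=10_000)
y = rng.normal(1.0, 3.0, size=10_000)

hist_range = [(-5, 5), (-15, 20)]
counts, xedges, yedges = np.histogram2d(x, y, bins=(6, 6), range=hist_range)

# Area of every 2D bin: outer product of the widths along x and y
areas = np.outer(np.diff(xedges), np.diff(yedges))

# Average density per bin, normalized so that it integrates to one
density = counts / (counts.sum() * areas)

print((density * areas).sum())
```

Multiplying the density by the bin areas and summing recovers one, which is a quick sanity check for any density histogram.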

## Projecting axes

We can also project onto a certain axis

In [None]:
hx = h.project("x")

In [None]:
hx.plot1d()  # hx is now a 1D histogram
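
Conceptually, projecting onto one axis sums the bin contents over all other axes. A minimal NumPy sketch of this idea follows; it uses made-up uniform data that lies entirely inside the axis ranges and therefore ignores underflow and overflow bins.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
x = rng.uniform(-5.0, 5.0, size=10_000)
y = rng.uniform(-15.0, 20.0, size=10_000)
xedges = np.linspace(-5.0, 5.0, 7)
yedges = np.linspace(-15.0, 20.0, 7)

counts2d, _, _ = np.histogram2d(x, y, bins=(xedges, yedges))

# Projection onto x: sum over the y axis (axis=1)
projected_x = counts2d.sum(axis=1)

# Same result as binning x alone
counts1d, _ = np.histogram(x, bins=xedges)
print(np.array_equal(projected_x, counts1d))
```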

## Accessing everything relevant

Hist is transparent and lets us access many of its internals:

In [None]:
h.axes

In [None]:
h.axes['x']

In [None]:
h.axes['x'].edges

In [None]:
h.axes['x'].centers  # bin centers

In [None]:
h.axes['x'].widths  # bin widths

## Arithmetics

We can do math with histograms! They can be added to or multiplied with each other or with scalars.

For example, dividing two histograms gives their ratio, and multiplying by a scalar rescales every bin:

In [None]:
ratio_large = hx * 10  # multiply every bin content by a scalar
ratio_large.plot1d()

## Weights

Weights are an essential part of HEP histograms and hist fully supports weights. We can simply give an array of weights when filling the histogram.

We first need to specify the storage type to be of type `Weight` in order to make sure we keep track of the weights.

In [None]:
weight = np.random.normal(1., 0.1, size=nsamples)
storage = hist.storage.Weight()
h2d_weighted = hist.Hist(axis1, axis2, storage=storage).fill(x=array1, y=array2, weight=weight) # using names

In [None]:
h2d_weighted.plot2d()

In [None]:
h2d_weighted.variances()
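
What "keeping track of the weights" means can be sketched with NumPy, assuming the common convention that a bin stores the sum of weights as its value and the sum of squared weights as its variance. The data, seed, and edges below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
x = rng.normal(0.0, 1.0, size=10_000)
w = rng.normal(1.0, 0.1, size=10_000)
edges = np.linspace(-5.0, 5.0, 7)

# Bin value: sum of the weights falling into each bin
sum_w, _ = np.histogram(x, bins=edges, weights=w)

# Bin variance under the assumed convention: sum of the squared weights
sum_w2, _ = np.histogram(x, bins=edges, weights=w**2)

errors = np.sqrt(sum_w2)
print(sum_w)
print(errors)
```

For unit weights the sum of squared weights equals the plain count, so this reduces to the usual sqrt(N) uncertainty.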