# histbook

Numpy-native histogramming

Numpy, Pandas, Spark, and just about every statistical toolbox have a "histogram" command, but HEP physicists would find them underwhelming:

   * they usually go directly from data in memory to a plot, with no interface for accumulating data from disk, accessing bin contents programmatically, or even setting bin ranges by hand
   * some don't support weights, particularly negative weights, and I've never seen one that makes profile plots

Could use ROOT, but `TH1` doesn't have a `Fill` method for whole arrays. (Surely that's doable...)

Besides, I've been wanting to improve histogram interfaces for a while.   `:)`

histbook's interface is familiar (to HEP), except that we fill a whole array instead of one value:

In [None]:
import numpy
from histbook import *
h = Hist(bin("x", 10, -5, 5))

In [None]:
h.fill(x=numpy.random.normal(-1.5, 1, 1000000))        # fill with some data

In [None]:
h.fill(x=numpy.random.normal(1.5, 1, 1000000))         # fill with more data
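
To make "fill with more data" concrete, here is a minimal sketch in plain Numpy of what a re-fillable histogram amounts to. It is not histbook's implementation; the names `refill_edges` and `refill_counts` are illustrative only. With the bin edges fixed up front, counts from separate batches simply add.

```python
import numpy

# Bin edges fixed by hand: 10 bins from -5 to 5, like bin("x", 10, -5, 5).
refill_edges = numpy.linspace(-5, 5, 11)
refill_counts = numpy.zeros(10, dtype=numpy.int64)

rng = numpy.random.RandomState(12345)
for mean in (-1.5, 1.5):
    batch = rng.normal(mean, 1, 100000)
    # numpy.histogram drops values outside the edges; a fuller histogram
    # would typically also keep underflow and overflow counts.
    batch_counts, _ = numpy.histogram(batch, bins=refill_edges)
    refill_counts += batch_counts      # accumulate instead of starting over
```

Because only the counts are kept between batches, the batches could just as well come from disk one piece at a time.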

Histograms can be plotted in or out of Jupyter, or viewed as tables of numbers.

In [None]:
from vega import VegaLite as canvas                    # for inline plots in Jupyter
# import vegascope; canvas = vegascope.LocalCanvas()   # if you don't want to use Jupyter (traditional TCanvas in browser)

In [None]:
h.step().to(canvas)
# h.pandas()

<p style="margin-bottom: -20px; padding-bottom: 0px">Plays well with the scientific Python ecosystem:</p>

   * Numpy for large data input
   * Plots in or out of Jupyter (both through [Vega-Lite](https://vega.github.io/vega-lite/)), as well as ROOT histograms through PyROOT
   * Pandas for tabular access

<p style="margin-bottom: -20px; padding-bottom: 0px">Fits HEP expectations:</p>

   * re-fillable, "hadd"-able
   * weights with sumw2, HEP-style error handling
   * profile plots
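
As a reminder of what "weights with sumw2" means, the following plain-Numpy sketch (an illustration, not histbook code) keeps two arrays per histogram: the sum of weights and the sum of squared weights in each bin. The square root of the latter is the conventional uncertainty on the bin content, and it stays well defined when some weights are negative. All variable names are illustrative.

```python
import numpy

edges = numpy.linspace(-5, 5, 11)
x = numpy.array([-4.5, -4.5, 0.5, 0.5, 0.5, 4.5])
w = numpy.array([1.0, -0.5, 2.0, 2.0, 1.0, -1.0])        # negative weights are allowed

sumw, _ = numpy.histogram(x, bins=edges, weights=w)       # sum of weights per bin
sumw2, _ = numpy.histogram(x, bins=edges, weights=w**2)   # sum of squared weights per bin
errors = numpy.sqrt(sumw2)                                # uncertainty on each bin's content
```

Both arrays are plain sums, so two such histograms with the same binning can be combined by adding `sumw` to `sumw` and `sumw2` to `sumw2`.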

<p style="margin-bottom: -20px; padding-bottom: 0px">But also some improvements:</p>

   * all histograms are n-dimensional
   * plotting is a matter of slicing and projecting onto graphical facets (x, y, color, trellis)
   * quantities to plot are reordered to fill in an optimal way, minimizing passes over data
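
The "slicing and projecting" idea can be pictured with plain Numpy. This is a sketch using `numpy.histogramdd` on made-up data, not a description of how histbook stores its bins: an n-dimensional histogram is an n-dimensional array of counts, projecting sums over the axes you don't want to see, and slicing restricts an axis to a range of bins first.

```python
import numpy

rng = numpy.random.RandomState(12345)
data = rng.normal(0, 1, (10000, 3))      # three columns stand in for three quantities

# One 3-dimensional histogram: 10 bins per axis, each from -5 to 5.
counts3d, edges3d = numpy.histogramdd(data, bins=(10, 10, 10), range=[(-5, 5)] * 3)

# Project onto axis 1 by summing over the other two axes.
projection = counts3d.sum(axis=(0, 2))

# Slice, then project: keep only bin 5 of axis 0 (values in [0, 1)).
sliced = counts3d[5:6, :, :].sum(axis=(0, 2))
```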

N-dimensional histograms:

In [None]:
h = Hist(bin("Jet_pt", 100, 0, 100), bin("Jet_eta", 100, -5, 5), bin("Jet_phi", 100, -numpy.pi, numpy.pi))

In [None]:
import uproot
tree = uproot.open("~/NanoAOD-DYJetsToLL.root")["Events"]
h.fill(**tree.arrays(["Jet_pt", "Jet_eta", "Jet_phi"]))

In [None]:
h.step("Jet_eta").to(canvas)

In [None]:
h.overlay("Jet_pt").step("Jet_eta").to(canvas)

In [None]:
h.select("15 <= Jet_pt < 20 and -0.1 <= Jet_eta < 0.1 and Jet_phi < -pi*0.94").pandas()

These axis labels are not just strings; they're parsed as algebraic expressions.

In [None]:
h = Hist(bin("a**2", 100, 0, 100), bin("a**3", 100, 0, 100), bin("a*b", 100, 0, 100))
h.fill(a=numpy.random.normal(0, 1, 1000000), b=numpy.random.normal(0, 1, 1000000))

In [None]:
h.fields
# h.axis
# h._showgoals()    # internal debugging method
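
To see why three expression axes need only two input fields, here is a small sketch built on Python's standard `ast` module. It is not histbook's parser, and the helpers `required_fields` and `evaluate` are hypothetical names invented for this illustration.

```python
import ast
import numpy

def required_fields(expression):
    """Return the sorted variable names that appear in an expression string."""
    parsed = ast.parse(expression, mode="eval")
    return sorted({node.id for node in ast.walk(parsed) if isinstance(node, ast.Name)})

def evaluate(expression, arrays):
    """Evaluate an expression string with names bound to Numpy arrays."""
    code = compile(ast.parse(expression, mode="eval"), "<expression>", "eval")
    # eval is used only for illustration; builtins are hidden from the expression.
    return eval(code, {"__builtins__": {}}, dict(arrays))

expressions = ["a**2", "a**3", "a*b"]
fields = sorted(set().union(*[required_fields(e) for e in expressions]))

arrays = {"a": numpy.array([1.0, 2.0, 3.0]), "b": numpy.array([10.0, 20.0, 30.0])}
values = {e: evaluate(e, arrays) for e in expressions}
```

Here `fields` comes out as `["a", "b"]`: the expressions determine which arrays have to be supplied when filling.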

An n-dimensional distribution can be binned in a variety of ways in the same histogram:

   * **bin(expr, numbins, low, high)**: regular binning from **low** to **high** in **numbins** bins
   * **intbin(expr, min, max)**: one bin per integer from **min** to **max** (inclusive)
   * **split(expr, edges)**: irregular binning, with a bin boundary at each of the **edges**
   * **cut(expr)**: boolean expression (2 bins) for post-fill selection or for tables/plots of efficiency
   * **groupby(expr)**: categorical data (strings)
   * **groupbin(expr, binwidth, origin=0)**: sparse binning in steps of **binwidth**, aligned to **origin**
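
Sparse binning is the least familiar item in this list, so here is a rough plain-Numpy sketch of the idea. The function `sparse_bins` is hypothetical, and keying each bin by its low edge is an assumption made for this example, not something taken from histbook. Bins have a fixed width, but only the occupied ones are stored, so a far-away outlier costs one entry rather than thousands of empty bins.

```python
import numpy

def sparse_bins(x, binwidth, origin=0):
    """Count values in fixed-width bins, storing only the occupied bins."""
    index = numpy.floor((numpy.asarray(x) - origin) / binwidth).astype(numpy.int64)
    occupied, counts = numpy.unique(index, return_counts=True)
    # Key each bin by its low edge: origin + index * binwidth.
    return {origin + int(i) * binwidth: int(c) for i, c in zip(occupied, counts)}

table = sparse_bins([0.5, 1.5, 1.7, 1000.2, -3.1], binwidth=1.0)
```

Five values spread over a range of about 1000 end up in just four stored bins.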

And as many dependent variables as you want can be added as profile axes:

   * **profile(expr)**: collect the mean and error-in-mean of **expr** in each bin
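
For readers who haven't met profile plots, this plain-Numpy sketch shows what has to be accumulated per bin: the number of entries, the sum of `y`, and the sum of `y**2`. The function name `profile_sketch` is hypothetical, and the error convention used here (spread divided by the square root of the number of entries) is one common choice, assumed for the example rather than taken from histbook.

```python
import numpy

def profile_sketch(x, y, numbins, low, high):
    """Mean and error-in-mean of y in regular bins of x (out-of-range x dropped)."""
    x, y = numpy.asarray(x, dtype=float), numpy.asarray(y, dtype=float)
    inside = (x >= low) & (x < high)
    index = ((x[inside] - low) / (high - low) * numbins).astype(numpy.int64)
    index = numpy.minimum(index, numbins - 1)          # guard against rounding at the top edge
    n = numpy.bincount(index, minlength=numbins)
    sumy = numpy.bincount(index, weights=y[inside], minlength=numbins)
    sumy2 = numpy.bincount(index, weights=y[inside] ** 2, minlength=numbins)
    with numpy.errstate(divide="ignore", invalid="ignore"):   # empty bins give nan
        mean = sumy / n
        variance = sumy2 / n - mean ** 2
        error = numpy.sqrt(numpy.maximum(variance, 0) / n)    # standard error of the mean
    return mean, error

mean, error = profile_sketch([0.1, 0.2, 0.3, 1.5], [1.0, 2.0, 3.0, 5.0], numbins=2, low=0, high=2)
```

Like counts, all three accumulated arrays are sums, which is what lets a profile be re-filled and added to other profiles.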

If a set of histograms is collected into a `Book`, they are all filled with a single `.fill` call, which avoids repeated reading and repeated calculation.

In [None]:
h = Book()
h["vs pt"] = Hist(bin("Jet_pt", 100, 0, 100), profile("Jet_chHEF"), profile("Jet_chEmEF"), profile("Jet_neHEF"))
h["vs eta"] = Hist(bin("Jet_eta", 100, -5, 5), profile("Jet_chHEF"), profile("Jet_chEmEF"), profile("Jet_neHEF"))
h["vs phi"] = Hist(bin("Jet_phi", 100, -numpy.pi, numpy.pi), profile("Jet_chHEF"), profile("Jet_chEmEF"), profile("Jet_neHEF"))

h.fill(**tree.arrays(["Jet_pt", "Jet_eta", "Jet_phi", "Jet_area", "Jet_qgl", "Jet_chHEF", "Jet_chEmEF", "Jet_neHEF"]))

In [None]:
h["vs eta"].marker(profile="Jet_neHEF").to(canvas)
# h._showgoals()

→ Write analysis scripts that look like many `TTree::Draw` commands, yet only pass over the data once.
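
The single-pass pattern itself can be sketched without histbook. In the plain-Numpy example below, the generator `chunks`, the dictionary `booked`, and the field names are all made up for illustration. Several histograms are declared first, and then each chunk of data is read once and handed to every histogram that needs it.

```python
import numpy

def chunks(num_chunks=5, size=1000, seed=12345):
    """Stand-in for reading a file in pieces: yields dicts of arrays."""
    rng = numpy.random.RandomState(seed)
    for _ in range(num_chunks):
        yield {"pt": rng.exponential(20, size), "eta": rng.uniform(-5, 5, size)}

# Several histograms, each declared as (field name, bin edges).
booked = {
    "vs pt": ("pt", numpy.linspace(0, 100, 101)),
    "vs eta": ("eta", numpy.linspace(-5, 5, 101)),
}
booked_counts = {name: numpy.zeros(len(e) - 1, dtype=numpy.int64)
                 for name, (_, e) in booked.items()}

chunks_read = 0
for chunk in chunks():                       # each chunk is read exactly once
    chunks_read += 1
    for name, (field, e) in booked.items():  # and fills every booked histogram
        booked_counts[name] += numpy.histogram(chunk[field], bins=e)[0]
```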

Separately created and filled histograms can be added (like `hadd`) or grouped (which creates a new categorical axis).

In [None]:
muonpt = Hist(bin("pt", 100, 0, 100), fill=tree.array("Muon_pt"), weight=10)
jetpt = Hist(bin("pt", 100, 0, 100), fill=tree.array("Jet_pt"), weight=1)

In [None]:
h = Hist.group(muon=muonpt, jet=jetpt)

In [None]:
h.pandas()
# h.beside("source").step("pt").to(canvas)
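
The difference between adding and grouping can be shown with plain Numpy arrays. This is an illustration with made-up data and names, not histbook's internals. Adding merges bin contents, so the origin of each entry is lost; grouping keeps the histograms side by side under a new categorical label.

```python
import numpy

pt_edges = numpy.linspace(0, 100, 101)
rng = numpy.random.RandomState(12345)

# Two separately filled histograms with identical binning.
first, _ = numpy.histogram(rng.exponential(20, 1000), bins=pt_edges)
second, _ = numpy.histogram(rng.exponential(30, 1000), bins=pt_edges)

# Adding: identical binning, so bin contents add element by element.
added = first + second

# Grouping: keep both, distinguished by a new categorical key.
grouped = {"muon": first, "jet": second}
total_by_source = {source: int(c.sum()) for source, c in grouped.items()}
```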

Of course you can make ROOT histograms, too!

In [None]:
import ROOT
tcanvas = ROOT.TCanvas()

cache = {}                                  # cache gives histograms a place to live
h.project("pt").root(cache=cache).Draw()    # to create and draw them in one line
tcanvas.Draw()