coffea.hist to hist migration guide #705

nsmith- · 2022-08-02T16:22:34Z

Maintainer

As of coffea v0.7.16 one now receives a warning on import of coffea.hist:

FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.

This discussion is meant to collect tips for migrating from coffea.hist to hist, and I've started by noting changes that were needed in the original coffea histogram tutorial. Feel free to add replies with additional tips and I'll incorporate them here!

Construction

A quick guide to the name changes in the constructor:

| `coffea.hist.` | `hist.` | Notes |
| --- | --- | --- |
| `Hist(...)` | `Hist(...)` | The constructor name is the same. The label for the accumulator must be a keyword argument, e.g. `Hist(..., name="Counts")`. A name or label can be specified and later accessed through the corresponding attribute (e.g. `h.name`). 1D plots (e.g. `h.plot1d()`) do not seem to use the name as a y-axis label. |
| `Cat("name", "label")` | `axis.StrCategory([], growth=True, name="name", label="label")` | One can optionally specify a fixed set of categories and leave the `growth` argument at its default of `False`. The name and label appear at the end as (optional) keyword arguments. |
| `Bin("name", "label", 10, 0.0, 1.0)` | `axis.Regular(10, 0.0, 1.0, name="name", label="label")` | For regular-binned axes, the argument order is unchanged except that the name and label become (optional) keyword arguments. |
| `Bin("name", "label", [0.0, 1.0, 3.0])` | `axis.Variable([0.0, 1.0, 3.0], name="name", label="label")` | For variable-binned axes. Again, name and label are optional keyword arguments. |
| `h1.compatible(h2)` | `h1.axes == h2.axes` | Useful to test whether the axis definitions of two histograms are compatible, i.e. they do not need to be rebinned before being added together. |

⚠️ In coffea, weight was a protected keyword and was not usable as an axis name. In hist, both weight and sample are protected (the latter is used e.g. for Mean() accumulator types)

⚠️ In hist, all axes are dense, whereas in coffea the Cat axes acted as sparse axes. If a histogram has several categorical axes and the number of value combinations that are actually filled is much smaller than the full outer product of axis values, the storage requirements for hist may be significantly larger than for coffea. It is therefore advisable not to use e.g. dataset as an axis in hist, but rather to store a dictionary mapping each dataset name to its own copy of a hist object. Alternatively, there are Stack objects.

Tip: Some space savings can be found by turning off flow bins in axes where they are not needed. For example, a neural network output might be bound between [0, 1] and so an appropriate binning might be hist.axis.Regular(20, 0, 1, flow=False). The savings multiply as the number of dimensions grows.
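
To put rough numbers on that tip, the following sketch counts storage cells only (not bytes) for dense regular axes. The helper `dense_cells` is a hypothetical name introduced for illustration:

```python
from math import prod


def dense_cells(bins_per_axis, flow=True):
    """Count the storage cells of a dense histogram with regular axes."""
    extra = 2 if flow else 0  # one underflow plus one overflow bin per axis
    return prod(n + extra for n in bins_per_axis)


with_flow = dense_cells([20, 20, 20], flow=True)
without_flow = dense_cells([20, 20, 20], flow=False)
print(with_flow, without_flow, with_flow / without_flow)
```

For three 20-bin axes this gives 10648 cells with flow bins and 8000 without, about 33% more, and the ratio grows with every additional axis.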

Hist introduces a new "quick-construct" utility based on the method chaining idiom. For a histogram defined analogously to the one in the coffea histograms tutorial notebook:

histo = hist.Hist(
    hist.axis.StrCategory([], growth=True, name="samp", label="sample name"),
    hist.axis.Regular(20, -10, 10, name="x", label="x value"),
    hist.axis.Regular(20, -10, 10, name="y", label="y value"),
    hist.axis.Regular(20, -10, 10, name="z", label="z value"),
    storage="weight",
    name="Counts"
)

we could also construct it with:

histo = (
    hist.Hist.new
    .StrCat([], growth=True, name="samp", label="sample name")
    .Reg(20, -10, 10, name="x", label="x value")
    .Reg(20, -10, 10, name="y", label="y value")
    .Reg(20, -10, 10, name="z", label="z value")
    .Weight(name="Counts")
)

Of course, the character-count savings are more pronounced for quick checks where axis names and labels are omitted.

Filling

Hist supports filling with identical parameters to coffea. The example in coffea histograms would be unchanged:

histo.fill(samp="sample 1", x=xyz[:,0], y=xyz[:,1], z=xyz[:,2])

In addition, hist can also fill histograms by positional arguments, i.e. if you specify the arguments in the same order as in the constructor, you can fill without naming the axes:

histo.fill("sample 1", xyz[:,0], xyz[:,1], xyz[:,2])

⚠️ In coffea, all histograms are automatically promoted to tracking the sum of weights squared (for uncertainty) on the first fill with a weight= argument. For hist, one has to opt in explicitly during construction by using the weight storage type (as opposed to the default double).

Transformation

Below are some possible transformations that can be done on the above-defined histogram histo. Many of the methods have direct analogs. Some are a bit more opaque as they utilize the Unified Histogram Indexing (UHI) syntax.

| coffea method | hist method | Notes |
| --- | --- | --- |
| `histo.sum("z")` | `histo[{"z": slice(0, 20, sum)}]` | Coffea by default sums over non-overflow bins on an axis. To exclude overflow in hist, we have to specify exactly the first and last bin locations. `len` is a convenient shorthand for the last+1 bin index (`20` in this case; `hist.overflow-1` also works). One can also use positional arguments in the slicing, e.g. `histo[:, :, :, 0:len:sum]` or `histo[..., 0:20:sum]`. As an alternative to using `slice()` directly, one can define `s = hist.tag.Slicer()` and then use the usual slice syntax on the `s` object: `histo[{"z": s[0:len:sum]}]`. |
| `histo.sum("z", overflow='over')` | `histo[{"z": slice(0, hist.overflow, sum)}]` | Sum all regular bins and overflow, but not underflow. |
| `histo.sum("z", overflow='all')` | `histo[{"z": sum}]` | Sum all regular bins and over/underflow. ⚠️ For hist there is no separate nanflow bin, and `float("NaN")` entries will end up in the overflow bin. |
| `histo.sum("z", overflow='allnan')` | `histo[{"z": sum}]` | Sum all regular bins and over/under/nanflow. |
| `histo[:, 0:, 4:, 0:]` | `histo[:, hist.loc(0.0):, 4.0j:, 0.0j:]` | Slices in coffea always refer to the bin edge values themselves. For hist, one can specify either the bin index as an integer or the bin edge location using the `hist.loc()` tag. A shorthand for `hist.loc` is to specify the bin edge as a pure-imaginary complex number using the Python syntax `4.0j`, where `j` is the imaginary unit. |
| `histo.integrate("y", slice(0, 10))` | `histo[{"y": slice(0.0j, 10.0j, sum)}]` | Integrate y bins from 0 to +10. |
| `histo.project("y", "z", overflow="allnan")` | `histo.project("y", "z")` | Same syntax, though the overflow behavior cannot be specified in hist and by default is the same as coffea.hist's `allnan` option. The ordering of arguments lets you permute the axis positions in hist, and axes can also be specified by integer index instead of name. |
| `histo.rebin("z", hist.Bin("znew", "rebinned z value", [-10, -6, 6, 10]))` | 🚧 no simple analog | See scikit-hep/hist#345 for the feature request. |
| `histo.rebin("z", 2)` | `histo[..., ::hist.rebin(2)]` | Rebin the z axis by downsampling, merging every two bins together. |
| `histo.scale(3.)` | `histo *= 3.0` | Scale in-place. For hist, one can also create a scaled copy with `histo*3.0`. |
| `histo.identifiers('samp')` | `histo.axes["samp"]` | The string representation includes the bin names, and iterating over the axis returns the names, i.e. `list(histo.axes["samp"]) == ['sample 1', 'sample 2']`. |
| `histo.identifiers('x')` | `histo.axes["x"].edges` | The edges can be paired with e.g. `for lo, hi in zip(edges[:-1], edges[1:])`. The centers are also available. |
| `histo.group(...)` | 🚧 no analog implemented | See scikit-hep/hist#211 (comment) for a workaround. |
| `histo.values(overflow: str)` | `histo.values(flow: bool)` | In both cases you can retrieve overflow entries if you opt in, but in `hist` you can only choose either none or all of the overflow bins. Instead of a dictionary mapping identifiers on the sparse axes to the dense axes, in `hist` you get a rectangular numpy array, where all the axes are dense. You can use the slicing syntax to reduce the number of axes beforehand. Note that in `hist` there is also a `histo.view()` that provides read-write support, so you can use it to update entries. |

Scaling a categorical axis in-place (e.g. for dataset luminosity normalization) changes from

scales = {
    'sample 1': 1.2,
    'sample 2': 0.2,
}
histo.scale(scales, axis='sample')

to

scales = {
    'sample 1': 1.2,
    'sample 2': 0.2,
}
for i, name in enumerate(histo.axes["samp"]):
    histo.view(flow=True)[i] *= scales[name]

where histo.view(flow=True) provides a read-write view into the histogram's storage with numpy semantics, similar to coffea histo.values(). Note that if the samp axis was not the first one, the numpy-style slice histo.view()[i] would have to change.

Saving

As was the case for coffea, hist histograms can be pickled. In addition, hist histograms are much more compatible with uproot4 writing. For example one can straightaway write any TH* if it has between 1 and 3 axes (TH1, TH2, TH3). So our above histo could be projected by sample and written as a TH3:

import uproot

with uproot.recreate("output.root") as fout:
    for s in histo.axes["samp"]:
        fout[f"histo/{s}"] = histo[{"samp": s}]
    
with uproot.open("output.root") as fin:
    print(fin.items())

which prints [('histo;1', <ReadOnlyDirectory '/histo' at 0x00011ff9c2e0>), ('histo/sample 1;1', <TH3D (version 4) at 0x00011ff233d0>), ('histo/sample 2;1', <TH3D (version 4) at 0x00011ff9df10>)]

Plotting

The mplhep package helps interface matplotlib with hist in a similar way as it does for coffea.hist. One can use mplhep either directly or via convenience functions like plot1d, etc. Some of them are compared below:

| coffea method | hist method | Notes |
| --- | --- | --- |
| `hist.plot1d(histo.sum("x", "y"), overlay='sample');` | `histo[{"x": sum, "y": sum}].plot1d(overlay="samp");` | Plot the `z` distribution per sample without stacking or fill. |
| `hist.plot1d(histo.sum("x", "y"), overlay='sample', stack=True);` | `histo[{"x": sum, "y": sum}].plot1d(overlay="samp", histtype="fill", stack=True);` | Same, but with stacking and fill. |
| `hist.plot2d(histo.sum('x', 'sample'), xaxis='y');` | `histo[{"x": sum, "samp": sum}].plot2d();` | Plot 2D with y as the horizontal axis. To flip the axis orientation, use `project()` with the appropriate index permutation, e.g. `histo[{"x": sum, "samp": sum}].project(1, 0).plot2d();` would put the z axis on the horizontal. |
| `hist.plotgrid(...)` | 🚧 no analog | This can be done by hand as well. |
| `hist.plotratio(...)` | See plot_ratio example | The various confidence intervals available in coffea.hist have been ported to hist.intervals. |

Styling

Many of the styling examples carry over directly, since they modify matplotlib attributes. It is worth noting that mplhep now has stylesheets for all four major LHC experiments.

Use within processors

To accommodate the fact that hist.Hist does not (and will not) subclass AccumulatorABC, a change to the accumulator semantics was introduced in coffea v0.7.2 to allow a more flexible definition. A full processor modernization guide will be put in a separate discussion, but it boils down to replacing any defaultdict_accumulator({"myhist": coffea.hist.Hist(...)}) objects with simple dictionaries of hist objects {"myhist": hist.Hist(...)}. As for the other types of accumulators, anything that natively supports __add__ (e.g. floats, ints, etc.) is natively supported now. For example, the following is now sufficient to track sum of weights in the return of a ProcessorABC.process method:

return {
    "sumw": {events.metadata.dataset: np.sum(events.genWeight)},
    "myhist": histo,
}

As a consequence, ProcessorABC.accumulator and any use of AccumulatorABC.identity() is deprecated. To see more details of the new accumulator semantics, check out https://coffeateam.github.io/coffea/notebooks/accumulators.html#Accumulator-semantics
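
To illustrate what "anything that supports `__add__`" implies for nested dictionaries, here is a pure-Python sketch of the idea. The `merge` function is hypothetical and written only for this example; it is not coffea's implementation, and coffea performs the merging of per-chunk outputs itself:

```python
def merge(a, b):
    """Illustrative only: add two outputs, recursing into dictionaries."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = merge(out[key], value) if key in out else value
        return out
    # Anything supporting + (floats, ints, histogram objects, ...) is added.
    return a + b


# Made-up outputs from two chunks of events.
chunk1 = {"sumw": {"sample 1": 10.0}, "nevents": 100}
chunk2 = {"sumw": {"sample 1": 5.0, "sample 2": 2.0}, "nevents": 50}
merged = merge(chunk1, chunk2)
print(merged)
```

Keys present in only one output are carried over unchanged, while shared keys are combined by addition.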

andreypz · 2022-10-28T12:06:50Z

And how about hist.group() method?

1 reply

nsmith- Jan 5, 2023
Maintainer Author

I found an existing issue for it in scikit-hep/hist#211 and added a proposed workaround.
I'll copy it here as well:

import hist

def group(h, oldname, newname, grouping):
    # grouping maps each new category name to a list of old category names
    hnew = hist.Hist(
        hist.axis.StrCategory(grouping, name=newname),
        *(ax for ax in h.axes if ax.name != oldname),
        storage=h.storage_type(),
    )
    for i, indices in enumerate(grouping.values()):
        # select the old categories in this group, sum over them, and copy the result
        hnew.view(flow=True)[i] = h[{oldname: indices}][{oldname: sum}].view(flow=True)

    return hnew

jquetzalcoatl · 2024-06-04T02:59:52Z

What happened to hist.Cat.identifiers()?
Tried hist.axis.StrCategory.identifiers() but no luck!

1 reply

nsmith- Jun 4, 2024
Maintainer Author

Do you mean how to enumerate the names of the categories in the case of a categorical axis? In general for hist histograms you can use list(histo.axes["axisname"]) to get a list of bin names
