# Binned data

Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.

This is conceptually similar to a multi-dimensional [AwkwardArray](https://awkward-array.org/doc/main/).
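The distinction can be sketched in plain NumPy, independent of Scipp. The event values, bin edges, and variable names below are made up for illustration, and the Python list of arrays only mimics the idea of binned data; it is not how Scipp stores it.

```python
import numpy as np

# Hypothetical event table: one position and one weight per event
position = np.array([0.1, 0.4, 0.5, 1.2, 1.7, 2.3, 2.9])
weight = np.array([1.0, 2.0, 1.0, 3.0, 1.0, 2.0, 2.0])
edges = np.array([0.0, 1.0, 2.0, 3.0])  # 3 bins need 4 bin edges

# Bin index of each event (0, 1 or 2)
index = np.digitize(position, edges) - 1

# "Binned" data: every bin keeps the list of events that fell into it
binned = [weight[index == i] for i in range(len(edges) - 1)]

# "Histogrammed" data: one dense value per bin, the sum over its events
hist = np.array([events.sum() for events in binned])
print(hist)
```

The binned form can always be reduced to the histogram, but not the other way around, since the histogram no longer knows about individual events.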

![img](../images/binned_drawing.svg)

# Taxi-Bins!

Binned data is best illustrated with a data-analysis example.
For this, we will use one of the NYC taxi datasets.

<img src="https://vaex.readthedocs.io/en/latest/_images/datasets_2_1.png" /> <img src="https://cdn-images-1.medium.com/v2/resize:fit:2680/1*fqrY2h4uLD3eKEvJ6hlI2g.png" width="600" />

(Source: https://vaex.readthedocs.io/en/latest/datasets.html. Dataset from 2015, obtained as an HDF5 file from the Vaex docs
and subsequently cleaned of outliers.)

In [None]:
%matplotlib widget

import scipp as sc
import numpy as np
from utils.helper import scatter

In [None]:
da = sc.io.load_hdf5("../data/nyc_taxi_data_2015_small.h5")
da

In [None]:
sc.table(da)

In [None]:
n = 5000
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)

## Binning the data records

Working with binned data is most efficient when keeping the number of bins relatively low.

Binning is essentially like overlaying a grid of bin edges onto our data:

In [None]:
ax = scatter(x, y)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")

In [None]:
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned

In [None]:
binned.hist().plot(aspect="equal", norm="log")
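What the grid overlay amounts to can be sketched with NumPy alone. This is an illustration of the idea, not Scipp's implementation; the random positions stand in for the taxi data, and the edges are assumed to span the full range of the data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical drop-off positions (not the taxi data)
lon = rng.uniform(-74.05, -73.75, size=1000)
lat = rng.uniform(40.60, 40.90, size=1000)

nbins = 8
# 8 bins per dimension need 9 bin edges, spanning the range of the data
lon_edges = np.linspace(lon.min(), lon.max(), nbins + 1)
lat_edges = np.linspace(lat.min(), lat.max(), nbins + 1)

# Grid cell of each record; clip so that the maximum lands in the last bin
ilon = np.clip(np.searchsorted(lon_edges, lon, side="right") - 1, 0, nbins - 1)
ilat = np.clip(np.searchsorted(lat_edges, lat, side="right") - 1, 0, nbins - 1)

# Counting the records per cell gives the 2D histogram
counts = np.zeros((nbins, nbins), dtype=int)
np.add.at(counts, (ilat, ilon), 1)
```

Here `counts[i, j]` is the number of records in latitude bin `i` and longitude bin `j`, the same quantity a 2D histogram of the positions would hold.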

### Selecting/slicing bins

Binning groups the records into bins but keeps the underlying table; **no information is lost, the records are simply reordered**.
The bins can then be used for slicing the data, providing extremely efficient data selection and filtering.
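Why reordering makes selection cheap can be shown with a small NumPy sketch. It assumes a layout where records of the same bin are stored contiguously and each bin is described by a begin/end offset; the column values are hypothetical and the sketch is not taken from Scipp's source.

```python
import numpy as np

# Hypothetical table with two columns; `cell` is each record's bin index
cell = np.array([2, 0, 1, 2, 0, 2])
fare = np.array([9.5, 12.0, 7.0, 52.0, 5.5, 30.0])

# Reorder the table so that records of the same bin are contiguous
order = np.argsort(cell, kind="stable")
cell_sorted = cell[order]
fare_sorted = fare[order]

# Start offset of every bin in the reordered table (plus the end)
offsets = np.searchsorted(cell_sorted, np.arange(4))

# Selecting bin 2 is a cheap slice, no scan over the full table
begin, end = offsets[2], offsets[3]
print(fare_sorted[begin:end])
```

All six records are still present after the reordering; only their position in the table has changed.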

For example, we select one bin in Manhattan by slicing both `dropoff_longitude` and `dropoff_latitude` dimensions:

In [None]:
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh

In [None]:
manh.hist(dropoff_latitude=10, dropoff_longitude=10).plot(norm="log", aspect="equal")

We select another bin, which contains JFK airport:

In [None]:
jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

![jfk](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/JFK_airport_terminal_map.png/640px-JFK_airport_terminal_map.png)

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)

### Binning into a new dimension

Data that has already been binned can be binned further into new dimensions, because the underlying records from the original table are still available.

In [None]:
manh

In the following, we look at the trip distances inside the Manhattan and JFK bins we have selected above.

In [None]:
# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist

In [None]:
manh_dist.hist().plot()

In [None]:
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()
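The reason this works can be sketched in NumPy: a selected tile still holds complete records, so any other column can serve as a new binning axis. The record values below are invented for illustration, and the dictionary of arrays is only a stand-in for the table.

```python
import numpy as np

# Hypothetical records of one map tile: the full rows are still available
tile = {
    "trip_distance": np.array([0.5, 1.2, 3.4, 0.8, 9.9, 2.1]),
    "fare_amount": np.array([5.0, 7.5, 14.0, 6.0, 35.0, 10.0]),
}

# Bin the same records along a column that was not used for the map tiles
nbins = 5
dist = tile["trip_distance"]
edges = np.linspace(dist.min(), dist.max(), nbins + 1)
index = np.clip(np.searchsorted(edges, dist, side="right") - 1, 0, nbins - 1)

# Each distance bin holds complete records, so other columns come along
fares_per_bin = [tile["fare_amount"][index == i] for i in range(nbins)]
counts = np.array([len(f) for f in fares_per_bin])
```

Had the tile only stored a count per cell, the `trip_distance` values would be gone and this second binning step would not be possible.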

### Other operations on bins: what is the mean fare amount as a function of distance?

In addition to summing/histogramming, bins can be used for other reduction operations: `min()`, `max()`, and `mean()`.
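As a rough NumPy picture of what a per-bin reduction does, consider a list of arrays standing in for the bin contents. The fares and the grouping are hypothetical, and this is a conceptual sketch rather than Scipp's API.

```python
import numpy as np

# Hypothetical fares, already grouped into 3 bins (e.g. by trip distance)
bins = [
    np.array([4.5, 6.0, 5.5]),
    np.array([12.0, 9.0]),
    np.array([52.0, 30.0, 41.0, 25.0]),
]

# A reduction turns the list of values in each bin into a single number
bin_min = np.array([b.min() for b in bins])
bin_max = np.array([b.max() for b in bins])
bin_mean = np.array([b.mean() for b in bins])
bin_sum = np.array([b.sum() for b in bins])  # this is the histogram
```

Every reduction returns one value per bin, so the result is a dense array with the same shape as the bin grid.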

To illustrate this, we will now inspect a new variable in our Manhattan data which is the fare amount (in dollars).

We start from our result from the previous section, where the Manhattan data has been binned into 100 `'trip_distance'` bins.

In [None]:
manh_dist

We use the `.bins` property to access the underlying coordinate values of the points that lie inside our selected map area.
We can then look at the properties of those coordinates.

For example, to get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do

In [None]:
manh_dist.bins.coords["fare_amount"].min(), manh.bins.coords["fare_amount"].max()

In [None]:
da.coords["fare_amount"].max()

These values are somewhat strange, indicative of bad data in the table.

To proceed further in our analysis, we shall restrict our fare range from \$0 to \$200.

We first want to visually inspect the fare amount as a function of trip distance.

In [None]:
# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace("fare_amount", 0, 200, nbins + 1, unit="dollar")

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")

Some things we can say about the data:

- there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you'll have to pay
- for a given trip distance, clients usually pay above the diagonal line, but very rarely below
- there appears to be a magic fare amount (~\$52) that will take you anywhere from 0 to 60 miles (we will come back to this later)

Our goal is now to try and compute some average fare amount as a function of distance.

We again use the `.bins` property to get to the `'fare_amount'` coordinate, showing it is made up of 100 bins in the `'trip_distance'` dimension:

In [None]:
manh_dist.bins.coords["fare_amount"]

In [None]:
mean_fare = manh_dist.bins.coords["fare_amount"].bins.mean()
mean_fare

This is *almost* what we were after, except that it contains only the values, without the bin coordinate.
We need to combine this with the coordinate of the `'trip_distance'` bins:

In [None]:
# Remember to add the coordinate for the `trip_distance` bins back
mean_fare = sc.DataArray(
    data=mean_fare, coords={"trip_distance": manh_dist.coords["trip_distance"]}
)
mean_fare.plot()

In [None]:
mean_fare
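The relation between the per-bin values and the bin-edge coordinate can be illustrated with plain arrays. The numbers are invented; the point is only that N values are described by N + 1 edges, and that the values alone do not say which distance range they belong to.

```python
import numpy as np

# Hypothetical result of a per-bin mean: 4 values, no axis information
mean_values = np.array([6.0, 11.5, 19.0, 33.0])

# The bin edges that produced those values: 4 bins need 5 edges
distance_edges = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

# Pair every value with the range it describes, and compute bin centres
centers = 0.5 * (distance_edges[:-1] + distance_edges[1:])
left, right = distance_edges[:-1], distance_edges[1:]
for a, b, value in zip(left, right, mean_values):
    print(f"{a:.0f}-{b:.0f} miles: mean fare {value:.2f}")
```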

### Filtering out the magic \$52 fare

We would like to clean up our `fare_amount` vs `trip_distance` relation by filtering out all trips that have a fare amount of \$52.

One way to do this would be to use NumPy masking or fancy indexing to filter out all \$52 fares in the original data table.
But this can be quite costly, both in CPU time and in memory, as the list of indices to store could be large.

An alternative way is to once again use bins.

We make 3 bins in the `'fare_amount'` dimension, where the middle bin is very narrow, centered around \$52.

https://www.nytimes.com/2022/11/17/nyregion/taxi-fare-hike-nyc.html

In [None]:
# Make 3 bins = 4 bin edges
fare_bins = sc.array(dims=["fare_amount"], values=[0, 51.75, 52.25, 200], unit="dollar")
manh_dist_fare = manh_dist.bin(fare_amount=fare_bins)
manh_dist_fare

Once we have this, we leave the middle bin out by indexing with a step of 2,
concatenate the first and last `'fare_amount'` bins into a single bin using `concat()`,
and finally compute the mean fare as we did above.

In [None]:
# Access the fare_amount coord, select the first & last bin,
# concatenate them, and compute the mean as above
mean_fare_filtered = (
    manh_dist_fare.bins.coords["fare_amount"]["fare_amount", ::2]  # coord, bins 0 & 2
    .bins.concat("fare_amount")  # merge the two bins into one
    .bins.mean()  # mean fare per trip_distance bin
)
mean_fare_filtered

In [None]:
# Remember to add the coordinate for the `trip_distance` bins back
mean_fare_filtered = sc.DataArray(
    data=mean_fare_filtered, coords={"trip_distance": manh_dist.coords["trip_distance"]}
)

# Plot both results
import plopp as pp

pp.plot({"unfiltered": mean_fare, "filtered": mean_fare_filtered})

We can now see that the \$52 fares were introducing significant skew in our result.
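The logic of the three-bin trick can be replayed on a handful of made-up trips in NumPy. This sketch only mirrors the steps (classify fares into three bins, drop the narrow middle one, average the rest); for brevity it uses a boolean mask, so it does not reproduce the cost advantage of working on bins. The helper `mean_per_bin` is hypothetical.

```python
import numpy as np

# Hypothetical trips: distance-bin index and fare of each trip
dist_bin = np.array([0, 0, 1, 1, 1, 2, 2])
fare = np.array([6.0, 52.0, 14.0, 52.0, 16.0, 52.0, 60.0])
n_dist = 3

# 3 fare bins from 4 edges; the narrow middle bin catches the 52-dollar fares
fare_edges = np.array([0.0, 51.75, 52.25, 200.0])
fare_bin = np.digitize(fare, fare_edges) - 1

# Keep the first and last fare bin (indices 0 and 2), drop the middle one
keep = fare_bin != 1


def mean_per_bin(index, values, nbins):
    """Mean of `values` within each bin given by `index` (hypothetical helper)."""
    total = np.bincount(index, weights=values, minlength=nbins)
    count = np.bincount(index, minlength=nbins)
    return total / count


unfiltered = mean_per_bin(dist_bin, fare, n_dist)
filtered = mean_per_bin(dist_bin[keep], fare[keep], n_dist)
```

In this toy data the flat fares pull the mean up in the short-distance bins and down in the long-distance bin, which is the kind of skew the filtering removes.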

In [None]:
manh.hist(dropoff_latitude=400, dropoff_longitude=400).plot(norm="log", aspect="equal")

:::{important} What about the other airport?

- Using the `binned` data array, extract the tile for LaGuardia Airport (hint: The coordinates are: (40.7766, -73.8742))
- Plot the histogram in that LGA bin, just like we did for `manh` and `jfk`.
- Plot the histogram of `trip_distance` by binning `lga` along the `trip_distance` dimension, just like we did for `jfk`. Where does it peak?

:::
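One way to turn the hinted coordinates into bin indices is to look them up in the bin edges. The sketch below uses NumPy with made-up edges and a hypothetical helper `bin_index`; the actual edges of the grid are stored in the coordinates of `binned`, for example `binned.coords["dropoff_longitude"]`.

```python
import numpy as np

# Hypothetical bin edges of an 8-bin longitude axis (not the real taxi grid)
lon_edges = np.linspace(-74.05, -73.75, 9)


def bin_index(edges, value):
    """Return the index of the bin [edges[i], edges[i+1]) containing value."""
    i = int(np.searchsorted(edges, value, side="right")) - 1
    if not 0 <= i < len(edges) - 1:
        raise ValueError(f"{value} lies outside the bin edges")
    return i


lon_index = bin_index(lon_edges, -73.8742)
print(lon_index)
```

The same lookup applied to the latitude edges gives the second index needed to slice out the tile.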

:::{note} Solution
:class: dropdown
```python
lga = binned["dropoff_longitude", 4]["dropoff_latitude", 4]

lga.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

lga_dist = lga.bin(trip_distance=100)
lga_dist.hist().plot()
```

:::