Support a mean aggregation strategy #78

Closed

Conversation

keller-mark

Currently, the implementation always takes the sum. However, we might want to use a different aggregation strategy, in particular the mean. This is possible with the deck.gl ContourLayer, for example: https://deck.gl/docs/api-reference/aggregation-layers/contour-layer#aggregation

This PR adds a .aggregation('MEAN') option to specify this as an aggregation strategy.
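A minimal usage sketch of the proposed option (the data and accessor names are placeholders, following the existing contourDensity accessors):

```js
import {contourDensity} from "d3-contour";

// Proposed API sketch: aggregate per-cell weights by their mean rather than
// their sum before thresholding. "SUM" would remain the default.
const density = contourDensity()
    .x(d => d.x)
    .y(d => d.y)
    .weight(d => d.weight)
    .aggregation("MEAN"); // proposed in this PR

const contours = density(data); // `data` is a placeholder array of points
```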

@Fil (Member) left a comment

I don't understand the use case this addresses. It would be helpful to have a concrete example.

When talking about the mean, we presumably want the mean of some quantity, not just of ones, and I don't believe weights and values are the same thing. Are we passing the value as the weight? Shouldn't there be a value independent of the weights?

src/density.js (outdated review comment on lines +63 to +64):

    if (valueCounts[i]) {
      values[i] /= valueCounts[i];
@Fil (Member):

What is the mean in locations where there is no data around? Should it be NaN, or 0?

@keller-mark (Author):

I think zero makes sense when the values are non-negative. That appears to be the assumption in deck.gl: https://github.com/visgl/deck.gl/blob/f4a9afc07b925c5c9bb6b53d96d15541a2f0fb4f/modules/aggregation-layers/src/utils/aggregation-operation-utils.ts#L46. Ideally there would also be a way to override it with a different value, e.g. .emptyValue(-Infinity).
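A rough sketch of how the suggested (hypothetical) .emptyValue accessor could plug into the mean step; the name and default are only illustrative, and `values`/`valueCounts` refer to the grid arrays from this PR's diff:

```js
// Hypothetical accessor: value reported for grid cells that received no data.
let emptyValue = 0; // deck.gl effectively uses 0; NaN would be the alternative

density.emptyValue = function(_) {
  return arguments.length ? (emptyValue = +_, density) : emptyValue;
};

// ...then, in the mean aggregation step:
for (let i = 0; i < values.length; ++i) {
  values[i] = valueCounts[i] ? values[i] / valueCounts[i] : emptyValue;
}
```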

@@ -139,6 +156,12 @@ export default function() {

    return arguments.length ? (threshold = typeof _ === "function" ? _ : Array.isArray(_) ? constant(slice.call(_)) : constant(_), density) : threshold;
  };

  density.aggregation = function(_) {
    if (!arguments.length) return aggregation;
    if (_ !== 'SUM' && _ !== 'MEAN') throw new Error("invalid aggregation");
@Fil (Member):

Could this be passed as a reducer function (in JavaScript)? What arguments would it take?

(Also, it's a cosmetic detail, but I don't think we'd use these ALL-CAPS keywords.)

@keller-mark (Author):

It definitely could. In order to pass a list of the values per grid cell, we would need to store all of them. That could have performance implications, in contrast to the current incremental computation, which only requires a constant number of values per grid cell (1 previously, 2 in this PR), updated with += operations.

Would you be open to an implementation of

    .aggregation(strategy: 'sum' | 'min' | 'max' | 'mean' | (cellValues: number[]) => number)

where the per-cell values are only stored if the user passes a custom reducer? It would certainly make the code more complex, but I am currently using this with over 300,000 points (and millions in the future), so I would want the option to avoid the overhead of storing the per-cell data for the most common aggregation strategies.
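A self-contained sketch of the trade-off described above (the grid indexing and accessors are simplified placeholders, not the library's code):

```js
// Built-in strategies ("sum", "mean") can be computed incrementally with a
// constant number of accumulators per grid cell; a custom reducer needs every
// per-cell contribution retained, which costs O(points) extra memory.
function aggregate(points, cellIndex, n, strategy) {
  const values = new Float64Array(n);
  if (typeof strategy === "function") {
    // Custom reducer: store every contribution per cell, reduce at the end.
    const cellValues = Array.from({length: n}, () => []);
    for (const p of points) cellValues[cellIndex(p)].push(p.weight);
    for (let i = 0; i < n; ++i) values[i] = strategy(cellValues[i]);
  } else {
    // "sum" or "mean": constant memory per cell, incremental updates.
    const counts = new Uint32Array(n);
    for (const p of points) {
      const i = cellIndex(p);
      values[i] += p.weight;
      counts[i] += 1;
    }
    if (strategy === "mean") {
      for (let i = 0; i < n; ++i) if (counts[i]) values[i] /= counts[i];
    }
  }
  return values;
}
```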

@keller-mark (Author) commented:

> I don't understand the use case this addresses. It would be helpful to have a concrete example.

I am using d3-contour to obtain the contour thresholds only. I am not using the geometries.

Therefore, I need the contour threshold values to be meaningful relative to the per-data-point weights. With sum-based aggregation, the threshold values cannot be compared to the per-data-point weights, because they depend on how many data points fall in a local region (for example, ten points of weight 1 contribute the same local sum as one point of weight 10).

[Screenshot attached, 2024-03-21]

> Shouldn't there be a value independent from the weights?

I am not sure I follow. Are you saying there needs to be support for a separate accessor to be passed, like .value()?

contourDensity()
      .x(d => d.x)
      .y(d => d.y)
      .weight(d => d.weight)
      .value(d => d.value)

@Fil (Member) commented Mar 22, 2024

(Conceptually, for me, weights should be "multipliers", not the same thing as "values". However, it seems that the formulas end up being the same for both "sum" and "mean", so maybe I should just accept that it isn't an issue if the term is "wrong", as long as the math works out.)

As I understand it, the algorithm in deck.gl is only binning — that is, points are grouped into square bins and then reduced (by count, sum, mean, min or max). You can do this more or less efficiently in D3 with something like https://observablehq.com/@fil/n-dimensions-binning
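For instance, square binning plus a per-bin reduction can be written directly with d3-array's rollup (a quick sketch, not the linked notebook's code; `points` and `binSize` are placeholders):

```js
import {rollup, mean} from "d3-array";

// Group points into square bins of side `binSize`, then reduce each bin
// (here by the mean of the weights; sum, min, max, or count work the same way).
const binSize = 10;
const bins = rollup(
  points,
  bin => mean(bin, d => d.weight),
  d => `${Math.floor(d.x / binSize)},${Math.floor(d.y / binSize)}`
);
```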

What D3's contourDensity method does is a bit similar to binning, but with a more efficient code path that doesn't actually compute bins and reduce them later — it just increments a counter on a grid. Additionally, the counter is fractional, to apply bilinear interpolation.

A second step diffuses the resulting grid according to a "bandwidth", using blur2 under the hood (itself an optimized version of the density function's older code). The extent of the resulting values on the grid is then used to determine the thresholds.

Finally, for each threshold a contour is computed with the marching squares algorithm. (But you're saying that you don't need that part, just the thresholds?)
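A rough sketch of that pipeline using d3-array (nearest-cell accumulation instead of bilinear interpolation for brevity; `data`, `xScale`, `yScale`, and `bandwidth` are placeholders):

```js
import {blur2, max, ticks} from "d3-array";

// 1. Splat each point's weight onto a grid (contourDensity spreads it over
//    the four neighbouring cells; nearest cell is used here to keep it short).
const width = 256, height = 256;
const grid = new Float64Array(width * height);
for (const d of data) {
  const x = Math.round(xScale(d.x)), y = Math.round(yScale(d.y));
  if (x >= 0 && x < width && y >= 0 && y < height) grid[y * width + x] += d.weight;
}

// 2. Diffuse the grid according to the bandwidth.
blur2({data: grid, width, height}, bandwidth);

// 3. Derive evenly spaced thresholds from the extent of the blurred values.
const thresholds = ticks(0, max(grid), 20);

// 4. Each threshold would then be traced with marching squares
//    (d3-contour's contours() generator).
```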


Your question made me go off on a tangent. I've tried to replicate (sort of) the chart you show above, by making a map of the 191,842 building addresses in the French département of Loir-et-Cher and coloring them by their street number (using a log10 scale). The map shows the dense areas (cities and villages, and both banks of the rivers), where these numbers are in the 10s, and a large area in the south-east where numbers are in the 100s and sometimes 1000s: these are mostly forest areas, with long, uninterrupted roads, where the numbering system looks quite different. (This was a new insight to me! Fun!)

[map0: building addresses colored by street number]

Now, if I want to compute the extent of the values on that map, I don't need to place points and do local averages… I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile. In my case, the 95th percentile gives 2.54 instead of 3.95. If I want about 20 thresholds, this gives me:

    d3.ticks(0, d3.quantile(C, 0.95), 20) // [0, 0.1, … 2.4, 2.5]

If I want to identify the region where the local average is higher than, say, the 90th percentile, I can feed that number directly to the random-walk spatial interpolator. With a bit of blurring to make smooth contours, it results in this map:

[map1: region where the local average exceeds the 90th percentile]

Finally, this should be "clipped" so that it doesn't extend too far into the "void" (which is not actually void in this case, but nearby départements). To do this, I'm using a modified version of interpolatorRandomWalk with a maximum distance threshold (observablehq/plot#2032):

[map2: the same map, clipped with a maximum-distance threshold]

I hope some of this makes sense for what you are doing. I started by prototyping a "mean" spatial interpolator, but the results I got on this dataset were not great, because the "void" had too much of an impact on the "local mean". (I'll share the code soon.)

Note: I'm using Plot because it's faster to iterate — but you can use its spatial interpolators as independent functions: they're not tied to Plot charts.
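For reference, a minimal sketch of what that looks like with Plot's raster mark and its built-in random-walk interpolator (the option values are assumptions based on the Plot documentation; `points` is a placeholder dataset):

```js
import * as Plot from "@observablehq/plot";

// Spatially interpolate scattered point values onto a raster, blurring
// slightly so the resulting regions have smooth boundaries.
const chart = Plot.plot({
  color: {type: "log", scheme: "viridis"},
  marks: [
    Plot.raster(points, {
      x: "x",
      y: "y",
      fill: "value",
      interpolate: "random-walk", // Plot's random-walk spatial interpolator
      blur: 4
    })
  ]
});
```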

@keller-mark (Author) commented Mar 22, 2024

> I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile.

Thank you so much for this suggestion! I am playing with this idea now. I think it is a better fit for my use case, as it will make a legend for the contours more interpretable (it can be expressed in terms of the percentiles), and also because the gene-expression distributions can be very skewed, so the mean is probably not the best option anyway.

Further, I had previously considered pre-computing the contour thresholds and/or the grids server-side, but with this approach I may be able to pre-compute the percentiles, which would be much simpler.
