Support a mean aggregation strategy #78

Closed

Conversation

keller-mark

Currently, the implementation always takes the sum. However, we might want to use a different aggregation strategy, in particular the mean. This is possible with the deck.gl ContourLayer, for example: https://deck.gl/docs/api-reference/aggregation-layers/contour-layer#aggregation

This PR adds a .aggregation('MEAN') option to specify this as an aggregation strategy.
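A minimal usage sketch of the proposed option (the data and accessor names are placeholders, following the existing contourDensity accessors):

```js
import {contourDensity} from "d3-contour";

// Proposed API sketch: aggregate per-cell weights by their mean rather than
// their sum before thresholding. "SUM" would remain the default.
const density = contourDensity()
    .x(d => d.x)
    .y(d => d.y)
    .weight(d => d.weight)
    .aggregation("MEAN"); // proposed in this PR

const contours = density(data); // `data` is a placeholder array of points
```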

@Fil (Member) left a comment

I don't understand the use case this addresses. It would be helpful to have a concrete example.

When talking about the mean, we presumably want the mean of some quantity, not just of ones, and I don't believe weights and values are the same thing. Are we passing the value as the weight? Shouldn't there be a value independent of the weights?

src/density.js (outdated review comment on lines +63 to +64):

    if (valueCounts[i]) {
      values[i] /= valueCounts[i];
@Fil (Member):

What is the mean in locations where there is no data around? Should it be NaN, or 0?

@keller-mark (Author):

I think zero makes sense when the values are non-negative. That appears to be the assumption in deck.gl: https://github.com/visgl/deck.gl/blob/f4a9afc07b925c5c9bb6b53d96d15541a2f0fb4f/modules/aggregation-layers/src/utils/aggregation-operation-utils.ts#L46. Ideally there would also be a way to override it with a different value, e.g. .emptyValue(-Infinity).
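A rough sketch of how the suggested (hypothetical) .emptyValue accessor could plug into the mean step; the name and default are only illustrative, and `values`/`valueCounts` refer to the grid arrays from this PR's diff:

```js
// Hypothetical accessor: value reported for grid cells that received no data.
let emptyValue = 0; // deck.gl effectively uses 0; NaN would be the alternative

density.emptyValue = function(_) {
  return arguments.length ? (emptyValue = +_, density) : emptyValue;
};

// ...then, in the mean aggregation step:
for (let i = 0; i < values.length; ++i) {
  values[i] = valueCounts[i] ? values[i] / valueCounts[i] : emptyValue;
}
```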

@@ -139,6 +156,12 @@ export default function() {

    return arguments.length ? (threshold = typeof _ === "function" ? _ : Array.isArray(_) ? constant(slice.call(_)) : constant(_), density) : threshold;
  };

  density.aggregation = function(_) {
    if (!arguments.length) return aggregation;
    if (_ !== 'SUM' && _ !== 'MEAN') throw new Error("invalid aggregation");
@Fil (Member):

Could this be passed as a reducer function (in JavaScript)? What arguments would it take?

(Also, it's a cosmetic detail, but I don't think we'd use these ALL-CAPS keywords.)

@keller-mark (Author):

It definitely could. In order to pass a list of the values per grid cell, we would need to store all of them. That could have performance implications, in contrast to the current incremental computation, which only requires a constant number of values per grid cell (1 previously, 2 in this PR), updated with += operations.

Would you be open to an implementation of

    .aggregation(strategy: 'sum' | 'min' | 'max' | 'mean' | (cellValues: number[]) => number)

where the per-cell values are only stored if the user passes a custom reducer? It would certainly make the code more complex, but I am currently using this with over 300,000 points (and millions in the future), so I would want the option to avoid the overhead of storing the per-cell data for the most common aggregation strategies.
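A self-contained sketch of the trade-off described above (the grid indexing and accessors are simplified placeholders, not the library's code):

```js
// Built-in strategies ("sum", "mean") can be computed incrementally with a
// constant number of accumulators per grid cell; a custom reducer needs every
// per-cell contribution retained, which costs O(points) extra memory.
function aggregate(points, cellIndex, n, strategy) {
  const values = new Float64Array(n);
  if (typeof strategy === "function") {
    // Custom reducer: store every contribution per cell, reduce at the end.
    const cellValues = Array.from({length: n}, () => []);
    for (const p of points) cellValues[cellIndex(p)].push(p.weight);
    for (let i = 0; i < n; ++i) values[i] = strategy(cellValues[i]);
  } else {
    // "sum" or "mean": constant memory per cell, incremental updates.
    const counts = new Uint32Array(n);
    for (const p of points) {
      const i = cellIndex(p);
      values[i] += p.weight;
      counts[i] += 1;
    }
    if (strategy === "mean") {
      for (let i = 0; i < n; ++i) if (counts[i]) values[i] /= counts[i];
    }
  }
  return values;
}
```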

@keller-mark (Author) commented:

> I don't understand the use case this addresses. It would be helpful to have a concrete example.

I am using d3-contour to obtain the contour thresholds only. I am not using the geometries.

Therefore, I need the contour threshold values to be meaningful relative to the per-data-point weights. With sum-based aggregation, the threshold values cannot be compared to the per-data-point weights, because they depend on how many data points fall in a local region (for example, ten points of weight 1 contribute the same local sum as one point of weight 10).

[Screenshot attached, 2024-03-21]

> Shouldn't there be a value independent from the weights?

I am not sure I follow. Are you saying there needs to be support for a separate accessor to be passed, like .value()?

contourDensity()
      .x(d => d.x)
      .y(d => d.y)
      .weight(d => d.weight)
      .value(d => d.value)

@Fil (Member) commented Mar 22, 2024

(Conceptually, for me, weights should be "multipliers", not the same thing as "values". However, it seems that the formulas end up being the same for both "sum" and "mean", so maybe I should just accept that it isn't an issue if the term is "wrong", as long as the math works out.)

As I understand it, the algorithm in deck.gl is only binning — that is, points are grouped into square bins and then reduced (by count, sum, mean, min or max). You can do this more or less efficiently in D3 with something like https://observablehq.com/@fil/n-dimensions-binning
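For instance, square binning plus a per-bin reduction can be written directly with d3-array's rollup (a quick sketch, not the linked notebook's code; `points` and `binSize` are placeholders):

```js
import {rollup, mean} from "d3-array";

// Group points into square bins of side `binSize`, then reduce each bin
// (here by the mean of the weights; sum, min, max, or count work the same way).
const binSize = 10;
const bins = rollup(
  points,
  bin => mean(bin, d => d.weight),
  d => `${Math.floor(d.x / binSize)},${Math.floor(d.y / binSize)}`
);
```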

What D3's contourDensity method does is a bit similar to binning, but with a more efficient code path that doesn't actually compute bins and reduce them later — it just increments a counter on a grid. Additionally, the counter is fractional, to apply bilinear interpolation.

A second step diffuses the resulting grid according to a "bandwidth", using blur2 under the hood (itself an optimized version of the density function's older code). The extent of the resulting values on the grid is then used to determine the thresholds.

Finally, for each threshold a contour is computed with the marching squares algorithm. (But you're saying that you don't need that part, just the thresholds?)
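A rough sketch of that pipeline using d3-array (nearest-cell accumulation instead of bilinear interpolation for brevity; `data`, `xScale`, `yScale`, and `bandwidth` are placeholders):

```js
import {blur2, max, ticks} from "d3-array";

// 1. Splat each point's weight onto a grid (contourDensity spreads it over
//    the four neighbouring cells; nearest cell is used here to keep it short).
const width = 256, height = 256;
const grid = new Float64Array(width * height);
for (const d of data) {
  const x = Math.round(xScale(d.x)), y = Math.round(yScale(d.y));
  if (x >= 0 && x < width && y >= 0 && y < height) grid[y * width + x] += d.weight;
}

// 2. Diffuse the grid according to the bandwidth.
blur2({data: grid, width, height}, bandwidth);

// 3. Derive evenly spaced thresholds from the extent of the blurred values.
const thresholds = ticks(0, max(grid), 20);

// 4. Each threshold would then be traced with marching squares
//    (d3-contour's contours() generator).
```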


Your question made me go off on a tangent. I've tried to replicate (sort of) the chart you show above, by making a map of the 191,842 building addresses in the French département of Loir-et-Cher and coloring them by their street number (using a log10 scale). The map shows the dense areas (cities and villages, and both banks of the rivers), where these numbers are in the 10s, and a large area in the south-east where numbers are in the 100s and sometimes 1000s: these are mostly forest areas, with long, uninterrupted roads, where the numbering system looks quite different. (This was a new insight to me! Fun!)

[map0: building addresses colored by street number]

Now, if I want to compute the extent of the values on that map, I don't need to place points and do local averages… I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile. In my case, the 95th percentile gives 2.54 instead of 3.95. If I want about 20 thresholds, this gives me:

    d3.ticks(0, d3.quantile(C, 0.95), 20) // [0, 0.1, … 2.4, 2.5]

If I want to identify the region where the local average is higher than, say, the 90th percentile, I can feed that number directly to the random-walk spatial interpolator. With a bit of blurring to make smooth contours, it results in this map:

[map1: region where the local average exceeds the 90th percentile]

Finally, this should be "clipped" so that it doesn't extend too far into the "void" (which is not actually void in this case, but nearby départements). To do this, I'm using a modified version of interpolatorRandomWalk with a maximum distance threshold (observablehq/plot#2032):

[map2: the same map, clipped with a maximum-distance threshold]

I hope some of this makes sense for what you are doing. I started by prototyping a "mean" spatial interpolator, but the results I got on this dataset were not great, because the "void" had too much of an impact on the "local mean". (I'll share the code soon.)

Note: I'm using Plot because it's faster to iterate — but you can use its spatial interpolators as independent functions: they're not tied to Plot charts.
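For reference, a minimal sketch of what that looks like with Plot's raster mark and its built-in random-walk interpolator (the option values are assumptions based on the Plot documentation; `points` is a placeholder dataset):

```js
import * as Plot from "@observablehq/plot";

// Spatially interpolate scattered point values onto a raster, blurring
// slightly so the resulting regions have smooth boundaries.
const chart = Plot.plot({
  color: {type: "log", scheme: "viridis"},
  marks: [
    Plot.raster(points, {
      x: "x",
      y: "y",
      fill: "value",
      interpolate: "random-walk", // Plot's random-walk spatial interpolator
      blur: 4
    })
  ]
});
```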

@keller-mark (Author) commented Mar 22, 2024

> I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile.

Thank you so much for this suggestion! I am playing with this idea now. I think it is a better fit for my use case, as it will make a legend for the contours more interpretable (it can be expressed in terms of the percentiles), and also because the gene-expression distributions can be very skewed, so the mean is probably not the best option anyway.

Further, I had previously considered pre-computing the contour thresholds and/or the grids server-side, but with this approach I may be able to pre-compute the percentiles, which would be much simpler.
