Weighted Arrays? #776

Open
ParadaCarleton opened this issue Mar 23, 2022 · 2 comments
@ParadaCarleton
Contributor

I've been considering this for a while. Would it make sense to define a new struct, a weighted_array, which contains both an array and a set of weights? The primary advantages are as follows:

  1. The weighted array can be stored contiguously in memory as an array of (element, weight) tuples. Weights and array elements are almost always accessed together, so this allows for faster access.
  2. Allows weighted_arrays to be passed as a single argument in place of an array.
  3. The user can conveniently manipulate weights together with observations. For example, dropping missing values would also automatically drop the weights associated with them.
    (The old interface can also be kept.)
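A minimal sketch of what such a struct could look like, assuming an array-of-tuples layout. The names `WeightedArray`, `dropmissing`, and `wmean` are hypothetical and only illustrate the three points above; none of them is an existing StatsBase API.

```julia
# Hypothetical sketch: `WeightedArray`, `dropmissing` and `wmean` are
# illustrative names, not existing StatsBase functions.
struct WeightedArray{T,W<:Real} <: AbstractVector{Tuple{T,W}}
    data::Vector{Tuple{T,W}}   # (element, weight) pairs stored contiguously
end

# Build from separate value and weight vectors (this copies both inputs).
function WeightedArray(x::AbstractVector{T}, w::AbstractVector{W}) where {T,W<:Real}
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    return WeightedArray{T,W}(collect(zip(x, w)))
end

# Minimal AbstractVector interface.
Base.size(a::WeightedArray) = size(a.data)
Base.getindex(a::WeightedArray, i::Int) = a.data[i]

# Dropping missing observations also drops the weights stored with them.
function dropmissing(a::WeightedArray{T,W}) where {T,W}
    kept = Tuple{nonmissingtype(T),W}[(v, w) for (v, w) in a.data if !ismissing(v)]
    return WeightedArray{nonmissingtype(T),W}(kept)
end

# Weighted mean computed from the stored pairs.
wmean(a::WeightedArray) = sum(v * w for (v, w) in a) / sum(w for (_, w) in a)

a = WeightedArray([1.0, missing, 3.0], [1.0, 2.0, 3.0])
b = dropmissing(a)   # two observations remain, each with its own weight
```

Because the object is a single `AbstractVector`, it can be passed wherever one array argument is expected.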
@nalimilan
Member

There's been some discussion about something similar at JuliaLang/julia#33310 (comment) (and following comments) and https://github.com/JuliaLang/Statistics.jl/issues/88. It could be an interesting alternative to passing weights as a separate argument. But I find the syntax a bit weird with functions that take several arguments, like cor(weighted(w, x), weighted(w, y)) or (more compact but weirder) cor(weighted(w, x, y)) -- and ideally we want to have a consistent syntax for single- and multiple-argument functions. Of course we could support two different syntaxes, but for now I'd rather focus on getting a single syntax to work correctly in all cases (notably skipping missing values).

Regarding performance, it would probably not be faster:

  • It's not clear that it would be faster to store (element, weight) tuples. When processing arrays in loops, AFAIK it's easier to get the compiler to use SIMD instructions when working on two separate arrays. And if you combine e.g. an Int8 value with a Float64 weight, you have to add some padding in the array to ensure all elements are aligned:
julia> Base.summarysize(fill((Int8(1), 1.0), 10_000))
160040

julia> Base.summarysize(fill(Int8(1), 10_000)) + Base.summarysize(fill(1.0, 10_000))
90080
  • Even if it were faster, having weighted_array copy the values and weights into a newly allocated vector of (element, weight) tuples would be prohibitively slow when you need to compute weighted stats on several variables. That said, we could implement such a wrapper as a view of the inputs (like AbstractWeights currently).
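
A rough sketch of the view-style wrapper mentioned in the last bullet, assuming it only holds references to the two input vectors. `WeightedView` and `wsum` are hypothetical names, not StatsBase types or functions. The kernel keeps looping over the two underlying arrays separately, which is the layout the first bullet suggests is friendlier to SIMD.

```julia
# Hypothetical sketch: `WeightedView` and `wsum` are illustrative names only.
struct WeightedView{T,W<:Real,X<:AbstractVector{T},V<:AbstractVector{W}} <: AbstractVector{Tuple{T,W}}
    vals::X   # reference to the caller's data, not a copy
    wts::V    # reference to the caller's weights, not a copy
    function WeightedView(x::X, w::V) where {T,W<:Real,X<:AbstractVector{T},V<:AbstractVector{W}}
        length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
        return new{T,W,X,V}(x, w)
    end
end

# Elements look like (element, weight) tuples, but are built on access.
Base.size(a::WeightedView) = size(a.vals)
Base.getindex(a::WeightedView, i::Int) = (a.vals[i], a.wts[i])

# Weighted sum that reads the two separate arrays directly.
function wsum(a::WeightedView)
    x, w = a.vals, a.wts
    s = zero(eltype(x)) * zero(eltype(w))
    @inbounds @simd for i in eachindex(x, w)
        s += x[i] * w[i]
    end
    return s
end

x = Int8[1, 2, 3]
w = [0.5, 1.0, 1.5]
v = WeightedView(x, w)   # no tuple vector is allocated
total = wsum(v)
```

Since nothing is copied, the same weight vector can be paired with many different variables at no extra allocation cost, and no padding is introduced between an `Int8` value and a `Float64` weight.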

@perrette

It's a semantic issue rather than a syntax issue. Think about x = WeightedArray(a, w) and y = WeightedArray(b, v). What should cov(x, y) return?

If one could define a meaningful operation then the semantic problem would be gone entirely.

E.g.

x_centered = a - ... # uses w weight
y_centered = b - ... # uses v weight 
cov = ... # uses sqrt(v * w) ???

Here, weighting by sqrt(v * w) does the trick of making the result consistent with cov(x, x), but it is not satisfying because I don't think it has a statistical meaning. There might be a solution to this, but until one is found, I would simply not extend cov or other similarly problematic multivariate operators to weighted arrays.
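
To make the construction concrete, here is a sketch with plain vectors. `wmean` and `wcov` are hypothetical helpers, and normalizing by the sum of the combined weights is an assumption, since the pseudocode above leaves that part open. It only shows that the sqrt weighting reduces to the weighted variance when both arguments share data and (non-negative) weights; it is not an established estimator.

```julia
# Weighted mean with plain vectors (hypothetical helper).
wmean(x, w) = sum(x .* w) / sum(w)

# Hypothetical cross-weighted covariance: each series is centered with its own
# weights, and the products are weighted by sqrt.(w .* v).
# Assumption: normalize by the sum of the combined weights (no bias correction).
function wcov(a, w, b, v)
    ac = a .- wmean(a, w)    # uses w
    bc = b .- wmean(b, v)    # uses v
    joint = sqrt.(w .* v)    # equals w when v == w and weights are non-negative
    return sum(joint .* ac .* bc) / sum(joint)
end

a = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 2.0]

# Uncorrected weighted variance of a, for comparison.
wvar = sum(w .* (a .- wmean(a, w)) .^ 2) / sum(w)
same = wcov(a, w, a, w)
```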

But I'm not a fan of hiding things by putting everything into classes. Long live Vector functions!

Coming from Python, I find that having to define classes for weights and for weighted arrays is not very transparent, because it is a judgement about the "quality" of the objects instead of their "function", which I find problematic in general, in coding (lack of transparency) as in society. I would much prefer separate functions such as wmedian (possibly with type 1, ... , 7?? as an additional parameter) that take Vector-typed values and weights as separate arguments, like now, but without the Weight classes (or at least also accepting a Vector and converting it to Weight internally?). In any case, the distinction between Frequency and Probability Weights already goes too far IMO.
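
As an illustration of that function-based style, a sketch of a `wmedian` that takes plain vectors. The name comes from the suggestion above, but the signature and the definition used (first sorted value whose cumulative weight reaches half of the total weight) are assumptions; other weighted-median definitions exist and the type 1, ... , 7 parameter is not modeled.

```julia
# Hypothetical `wmedian`: plain vectors in, no weight types.
# Returns the first sorted value whose cumulative weight reaches half of the
# total weight (one of several possible weighted-median definitions).
function wmedian(x::AbstractVector{<:Real}, w::AbstractVector{<:Real})
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    isempty(x) && throw(ArgumentError("x must not be empty"))
    order = sortperm(x)      # indices that sort the values
    half = sum(w) / 2
    acc = zero(half)
    for i in order
        acc += w[i]          # cumulative weight in sorted order
        acc >= half && return x[i]
    end
    return x[last(order)]    # only reached if rounding keeps acc below half
end

m1 = wmedian([3.0, 1.0, 2.0], [1.0, 1.0, 1.0])
m2 = wmedian([1.0, 2.0, 3.0], [1.0, 1.0, 10.0])
```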

3 participants