Discussion: A more feature-complete histogram, where and how? #704

Closed
Moelf opened this issue Jul 30, 2021 · 14 comments

Comments

@Moelf
Contributor

Moelf commented Jul 30, 2021

The current Histogram is great, but it falls a bit short in some areas:

  1. We have weights, but in order to track errors properly, we need to keep track of the sum of weights .^ 2 in each bin.
  2. This is more critical when our data doesn't fit in memory and we have to push!(hist, val, weight) incrementally, which is also not supported right now.
  3. For 2 to work nicely, we want to make histograms thread-safe. The easy way to do this is to add a SpinLock, but that can be slow in certain workloads. Another way is to have a buffer, which is what ROOT does, but I think that's a bit overkill. In any case, push! should be thread-safe.
  4. Overflow bin support: for 1D only, keep 2 extra bins to track overflow and underflow data points. (A rough sketch combining 1, 2 and 4 follows this list.)
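
To make 1, 2 and 4 concrete, here is a rough sketch of what such a histogram could look like. This is purely illustrative code: WeightedHist1D and binerrors are made-up names, not an existing StatsBase or FHist type.

struct WeightedHist1D{E<:AbstractRange}
    edges::E
    sumw::Vector{Float64}      # per-bin sum of weights
    sumw2::Vector{Float64}     # per-bin sum of squared weights (for error estimates)
    overflow::Vector{Float64}  # [underflow, overflow] weighted counts
end

WeightedHist1D(edges::AbstractRange) =
    WeightedHist1D(edges, zeros(length(edges) - 1), zeros(length(edges) - 1), zeros(2))

function Base.push!(h::WeightedHist1D, x, w = 1.0)
    i = searchsortedlast(h.edges, x)   # index of last edge <= x; 0 if x is below the first edge
    if i < 1
        h.overflow[1] += w             # underflow
    elseif i >= length(h.edges)
        h.overflow[2] += w             # overflow (x at or beyond the last edge)
    else
        h.sumw[i] += w
        h.sumw2[i] += w^2
    end
    return h
end

# Per-bin uncertainty estimate from the sum of squared weights.
binerrors(h::WeightedHist1D) = sqrt.(h.sumw2)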

My question for the StatsBase community is: do we welcome these changes (of course they will not be breaking)? Or is this too "bloated", such that we should make a new type of histogram? (I would hate this option because having two kinds of normal histogram is going to be annoying.)

cc for perspective: @oschulz

@oschulz
Contributor

oschulz commented Jul 30, 2021

May be related: #650

@oschulz
Contributor

oschulz commented Jul 30, 2021

in order to track errors properly, we need to keep track of the sum of weights .^ 2

Hm, maybe I misunderstand, but sum of weights .^ 2 is only a rough estimate of that "error" (more precisely, the uncertainty on the assumption that the weight equals the expected value), and is anyhow only appropriate if the number of events is very high, right?

we want to make histograms thread-safe

Yes, I think that should be non-controversial.

Overflow bin support.

I guess we'd want a specialized histogram type for that; not everyone will want overflow bins, and a lot of current code makes assumptions about what's in weights.

@Moelf
Contributor Author

Moelf commented Jul 30, 2021

Hm, maybe I misunderstand, but sum of weights .^ 2 is only a rough estimate of that "error"

It's not the error itself, but it is needed if you want to compute errors. Say you have 2 entries, with weights 0.3 and 0.7. If you only track the count, you will see 1.0, which is the same as if you had 1 entry with unit weight. But sqrt(0.3^2 + 0.7^2) != 1, and that quantity is useful for error computation down the road and for combining histograms with correct error propagation.
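
For concreteness, the bookkeeping for that example is just:

weights = [0.3, 0.7]
sumw  = sum(weights)        # 1.0  -- indistinguishable from a single unit-weight entry
sumw2 = sum(weights .^ 2)   # 0.58
sqrt(sumw2)                 # ≈ 0.762, the bin's uncertainty estimate, != 1.0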

@oschulz
Contributor

oschulz commented Jul 30, 2021

Ah, I see - yes, but the validity of that calculation depends on your statistical model. I have a feeling that would be a bit too "biased" for Histogram itself. But maybe we could have thread safety in general, and then a specialized histogram type with both overflow bins and sum of weights .^ 2? That should probably live in a separate package, to decouple it from the StatsBase release cycle.

@Moelf
Contributor Author

Moelf commented Jul 30, 2021

I don't think it's that "biased". Basically the only alternative is to record ALL the weights in each bin, which I don't think is going to happen -- or, if it were that simple, the user could easily do it in memory themselves.

Btw, I think if we are going to make a separate histogram package anyway for the general physics (maybe just HEP?) community, we can just do it there without adding thread safety here.

Because adding a thread-safe push!(hist, val, weight) is almost useless if we can't track weights beyond knowing the final bin counts.

@nalimilan
Member

My question for the StatsBase community is: do we welcome these changes (of course they will not be breaking)? Or is this too "bloated", such that we should make a new type of histogram? (I would hate this option because having two kinds of normal histogram is going to be annoying.)

I'm not super familiar with the histograms code, but this kind of improvement seems worth discussing for possible inclusion in StatsBase. If you two can agree on features that are generic enough (and/or can be enabled only for those who need them), we can have a look at what kind of API would be needed.

Regarding thread safety, we will have to check that the overhead isn't too large when you work from a single thread.

@Moelf
Contributor Author

Moelf commented Aug 31, 2021

Check out unsafe_push!() and push!() from FHist.
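
The idea is roughly this (a simplified sketch with made-up names, not FHist's actual source): push! takes a lock so concurrent fills from multiple threads don't lose increments, while unsafe_push! skips the lock for single-threaded hot loops.

using Base.Threads: SpinLock

struct LockedHist1D{E<:AbstractRange}
    edges::E
    counts::Vector{Int}
    lk::SpinLock
end
LockedHist1D(edges::AbstractRange) =
    LockedHist1D(edges, zeros(Int, length(edges) - 1), SpinLock())

function unsafe_push!(h::LockedHist1D, x)
    i = searchsortedlast(h.edges, x)
    1 <= i <= length(h.counts) && (h.counts[i] += 1)   # silently drop out-of-range values
    return h
end

function Base.push!(h::LockedHist1D, x)
    lock(h.lk)
    try
        unsafe_push!(h, x)
    finally
        unlock(h.lk)
    end
    return h
end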

@nalimilan
Member

I see the definition, but that doesn't tell me whether the overhead is high or not. :-) I assume that you added unsafe_push! because it's relatively high? I'm hesitant to adopt this design, as I'm not aware of any precedent of using unsafe_* to indicate a non-thread-safe method (neither in StatsBase, nor in Base, nor in other packages). Why should this method be thread-safe when the rest of the API isn't?

@Moelf
Contributor Author

Moelf commented Aug 31, 2021

julia> const ary = rand(10^5);

julia> @benchmark for a in ary
           push!(h1, a)
       end setup=(h1=Histogram(0:0.03:1)) evals=1
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
 Range (min … max):  2.474 ms …  2.778 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.504 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.507 ms ± 14.866 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▅▇█▄▅▃          ▁                           
  ▂▁▁▂▁▂▂▁▁▁▂▂▂▃▅█████████▅▆▆▆▅█▆███████▇▇▅▄▄▄▄▂▃▂▂▂▂▁▂▂▂▂▂▂ ▄
  2.47 ms        Histogram: frequency by time        2.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.


julia> @benchmark for a in ary
           push!(h1, a)
       end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 4180 samples with 1 evaluation.
 Range (min … max):  1.120 ms …  1.389 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.192 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.194 ms ± 27.280 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▁▃▂▂▁▃▆▅▆▆▇██▅▂ ▁  ▁                       
  ▂▂▁▁▁▁▂▂▂▃▅▆▇▆▆███████████████████▇▇██▆▅▆▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂ ▄
  1.12 ms        Histogram: frequency by time        1.28 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark for a in ary
           unsafe_push!(h1, a)
       end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 8811 samples with 1 evaluation.
 Range (min … max):  123.033 μs … 590.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     565.413 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   565.395 μs ±  20.235 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▆ ▂
  123 μs           Histogram: frequency by time          575 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@Moelf
Contributor Author

Moelf commented Aug 31, 2021

because it's relatively high?

You assumed wrong, in the sense that it's still faster (or at least no slower) than StatsBase's thread-unsafe push!. I hope one day the @atomic work in Base will help with the array element increments here.

unsafe_* to indicate a non-thread-safe method.

Good point, maybe a rename before 1.0. But the naming comes from thread-safety.

Why should this method be thread-safe when the rest of the API isn't?

You mean this method unsafe_push!() isn't thread-safe while the rest of the API is? I guess we could let push!() be thread-unsafe and add a safe_push!() instead?

@nalimilan
Member

You assumed wrong, in the sense that it's still faster (or at least no slower) than StatsBase's thread-unsafe push!. I hope one day the @atomic work in Base will help with the array element increments here.

Well that's not what I said. I mean that push! is slower than unsafe_push!. I'm afraid atomics can't help here as that would mean all functions working on histograms would have to work with atomic types, which are generally noticeably slower (otherwise we would always use atomic types).

Good point, maybe a rename before 1.0. But the naming comes from thread-safety.

Yes, I get it, but do you have other examples of this terminology? In general it would be interesting to see whether a convention emerges in the ecosystem to distinguish thread-safe from non-thread-safe functions.

You mean this method unsafe_push!() isn't thread-safe while the rest of the API is? I guess we could let push!() be thread-unsafe and add a safe_push!() instead?

No, I mean that none of the StatsBase API is thread-safe currently, so why should push! be thread-safe, or why should we provide a thread-safe safe_push! function but not equivalent functions in other areas? Basically any function in the API that mutates its argument could gain a safe_* variant that takes a lock (for example addcounts!, setindex! for weight vectors...). That approach doesn't really scale.

@Moelf
Contributor Author

Moelf commented Aug 31, 2021

mean all functions working on histograms would have to work with atomic types

I thought @atomic fields have to be paired with @atomic at the access site too, otherwise it's just a normal access.

have other examples of this terminology

No, do you have another recommendation?

why should we provide a thread-safe safe_push! function but not equivalent functions in other areas

Because AFAIK only histograms have the practical demand of "pushing into from multiple threads, because we're reducing over a large amount of data and benefiting from parallelism".

so why should push! be thread-safe. That approach doesn't really scale.

OK, let's not have push! be thread-safe.

@Moelf Moelf closed this as completed Aug 31, 2021
@nalimilan
Member

I thought @atomic fields have to be paired with @atomic at the access site too, otherwise it's just a normal access.

No, AFAIK if you don't pair them with @atomic you get an error:

julia> mutable struct A
           @atomic x::Int
       end

julia> x = A(1);

julia> x.x = 1
ERROR: ConcurrencyViolationError("setfield!: atomic field cannot be written non-atomically")
Stacktrace:
 [1] setproperty!(x::A, f::Symbol, v::Int64)
   @ Base ./Base.jl:39
 [2] top-level scope
   @ REPL[10]:1
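
Pairing the write with @atomic at the access site works (continuing the same session):

julia> @atomic x.x = 1
1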

(AFAIK the storage is different.)

No, do you have another recommendation?

Unfortunately not. :-/

Because AFAIK only histograms have the practical demand of "pushing into from multiple threads, because we're reducing over a large amount of data and benefiting from parallelism".

I'm not sure that's the only case. For example, addcounts! may quite similarly be called from multiple threads doing independent computations. Also, in Base, push! on Dict is very similar, and it's not thread-safe because that would hurt performance.

OK, let's not have push! be thread-safe.

OK. Are you still interested in other features though? Maybe worth filing separate issues to discuss them?

@bkamins
Contributor

bkamins commented Aug 31, 2021

My thinking on this issue is:

  • Following the Julia Base contract for push!, it should be made as fast as possible (i.e. if thread safety would reduce speed, we do not have to make it thread-safe).
  • The unsafe prefix in Julia Base usually means that the operation does not perform bounds checking or pointer validity checking; therefore I would restrict the use of unsafe to such cases.
  • Regarding threaded/distributed computation of histograms, I find a map-reduce pattern natural to use: compute several histograms in parallel and then merge them, as I assume merging would have a negligible cost in comparison to the initial histogram construction, which is what gets parallelized (see the sketch after this list).
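
A minimal sketch of that map-reduce pattern on top of the existing StatsBase API (threaded_hist is a made-up helper name; merging relies on Base.merge for Histograms with identical edges):

using StatsBase, Base.Threads

function threaded_hist(data; edges = 0:0.03:1)
    # Split the data into one chunk per thread.
    chunks = collect(Iterators.partition(data, cld(length(data), nthreads())))
    partials = Vector{Histogram}(undef, length(chunks))
    # Map: fill one independent histogram per chunk.
    @threads for i in eachindex(chunks)
        partials[i] = fit(Histogram, chunks[i], edges)
    end
    # Reduce: merge the per-chunk histograms (cheap relative to the fills).
    return merge(partials...)
end

h = threaded_hist(rand(10^6))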
