Split Dashboard can take minutes to load after upgrading to Split v4.x (with RubyStats) #713

Open
vladr opened this issue Jul 23, 2023 · 5 comments


vladr commented Jul 23, 2023

Describe the bug
When using Split 3.4.1, the Dashboard page loaded quasi-instantaneously.
After upgrading to Split 4.0.2, the Dashboard can take so long to load (render) that the request times out, making the Dashboard unusable. See below for a possible root cause.

To Reproduce
Steps to reproduce the behavior:

  1. Ensure a number of experiments exist, where each alternative has a few million participants and ~25% completion rate.
  2. Go to the Split Dashboard
  3. Observe the time that it takes the page to load

Alternatively, execute the following (which is what Split::Experiment#estimate_winning_alternative calls when invoked by the Dashboard for an alternative with 4M participants and ~25% trial completion); in my case, just this one call takes >10s!

> Benchmark.measure { 10000.times { Split::Algorithms::beta_distribution_rng(1_000_000, 4_000_000) } }
=> #<Benchmark::Tms:0x0000555f7b388508 @label="", @real=10.572727171995211, @cstime=0.0, @cutime=0.0, @stime=0.006018999999999997, @utime=10.566891999999996, @total=10.572910999999996>
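
That works out to roughly 10.57 s / 10,000 ≈ 1 ms per draw. To confirm the cost sits in RubyStats itself rather than in Split's wrapper, here is a minimal sketch that times a single draw directly against the rubystats gem; the +1 offsets on the shape parameters are only an assumption about how Split maps the two counts onto the Beta distribution, not Split's verbatim code:

require "benchmark"
require "rubystats"

# Mirror the arguments from the benchmark above; the +1 offsets are an
# assumed mapping from counts to Beta shape parameters, adjust as needed.
dist = Rubystats::BetaDistribution.new(1_000_000 + 1, 4_000_000 + 1)

# Each #rng call goes through icdf -> find_root -> incomplete_beta -> beta_fraction,
# which is where the StackProf samples below land.
puts Benchmark.measure { dist.rng }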

Expected behavior
The Dashboard loads in roughly the same time as it did under Split 3.4.1, without timing out.

Additional context
This is the stack trace at the time of the request timeout:

Rack::Timeout::RequestTimeoutException - Request ran for longer than 28000ms :
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:491:in `*'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:491:in `beta_fraction'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/modules.rb:464:in `incomplete_beta'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:61:in `cdf'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/probability_distribution.rb:148:in `find_root'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:84:in `icdf'
  ruby/2.7.0/gems/rubystats-0.4.1/lib/rubystats/beta_distribution.rb:89:in `rng'
  ruby/2.7.0/gems/split-4.0.2/lib/split/algorithms.rb:18:in `beta_distribution_rng'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:364:in `block in calc_simulated_conversion_rates'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:361:in `each'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:361:in `calc_simulated_conversion_rates'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:301:in `block in estimate_winning_alternative'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:299:in `times'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:299:in `estimate_winning_alternative'
  ruby/2.7.0/gems/split-4.0.2/lib/split/experiment.rb:280:in `calc_winning_alternatives'

This is the StackProf summary for Split::Experiment#estimate_winning_alternative on one of the more problematic experiments:

==================================
  Mode: cpu(1000)
  Samples: 7719 (0.00% miss rate)
  GC: 44 (0.57%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      7309  (94.7%)        7309  (94.7%)     Rubystats::SpecialMath#beta_fraction
      7466  (96.7%)          71   (0.9%)     Rubystats::SpecialMath#incomplete_beta
        86   (1.1%)          64   (0.8%)     Rubystats::BetaDistribution#pdf
      7619  (98.7%)          45   (0.6%)     Rubystats::ProbabilityDistribution#find_root
        61   (0.8%)          43   (0.6%)     Rubystats::SpecialMath#log_gamma
       102   (1.3%)          41   (0.5%)     Rubystats::SpecialMath#log_beta
        28   (0.4%)          28   (0.4%)     (marking)
        21   (0.3%)          21   (0.3%)     Rubystats::BetaDistribution#initialize
        18   (0.2%)          18   (0.2%)     Rubystats::ProbabilityDistribution#check_range
        17   (0.2%)          17   (0.2%)     ActiveSupport::EachTimeWithZone#ensure_iteration_allowed
        16   (0.2%)          16   (0.2%)     (sweeping)
        ...
@andrehjr
Member

Thanks for the detailed report @vladr !

I'll check this out 🤔 My expectation is the same as yours: performance should be in the same range as the previous version, or at the very least it shouldn't break/time out the whole dashboard.

As for the workaround, I also need to review the math involved, it's been a while 😅 but if I'm not mistaken, lowering the 10K beta_probability_simulations should impact the confidence in the winning alternative.
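
For reference, a minimal sketch of that workaround, assuming beta_probability_simulations is set through Split.configure like the other options (the 1,000 value is purely illustrative of the trade-off between render time and estimate precision, not a recommendation):

# Sketch only: fewer Monte Carlo draws for the "Probability of being Winner"
# estimate; the default is the 10_000 seen in the benchmark above.
Split.configure do |config|
  config.beta_probability_simulations = 1_000
end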


vladr commented Aug 2, 2023

If I read the documentation correctly, altering beta_probability_simulations should only affect the "Probability of being Winner" calculation on the dashboard, but not the "Confidence" (default) calculation. Is this correct?

In addition to the dashboard, beta_distribution_rng is, I believe, also used by the Whiplash (multi-armed bandit) algorithm; we aren't using Whiplash at the moment, but could there also be an impact there (potentially in excess of 1 millisecond per trial for large participant counts, based on the benchmark above)?


andrehjr commented Aug 2, 2023

Yes, that's correct. Thanks for digging further into this. I haven't been able to set aside quality time for it yet. :(

The default confidence calculation is done via ZScore and should be fast.

The only impact of altering beta_probability_simulations is the % shown for "Probability of being Winner" on the dashboard. We try to cache the probability result in Redis, as it can be slow to recalculate it every time the dashboard opens... still, I think this should be improved.


vladr commented Aug 2, 2023

Thanks for confirming, and for reminding me of the Redis cache! If anyone else stumbles over this ticket after running into a Rack::Timeout::RequestTimeoutException or equivalent (i.e. one that prevents the calculation from completing) when loading the Dashboard, I can confirm that manually "priming" the cache from the Rails console first, and then accessing the Dashboard, also works as a palliative measure:

irb(main):001:0> Split::ExperimentCatalog.all_active_first.each { |x| print "#{Time.now} - #{x.name} ...\n" ; x.calc_winning_alternatives }
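
For anyone who wants to prime the cache on a schedule instead of by hand, a hypothetical rake task wrapping the same call could look like this (task name and file placement are illustrative only):

# lib/tasks/split.rake (hypothetical)
namespace :split do
  desc "Pre-calculate winning alternatives so the Dashboard renders quickly"
  task warm_winning_alternatives: :environment do
    Split::ExperimentCatalog.all_active_first.each do |experiment|
      puts "#{Time.now} - #{experiment.name} ..."
      experiment.calc_winning_alternatives
    end
  end
end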

@maciesielka

@andrehjr 👋🏻 I'm facing the same problem. Any progress on this issue, or any best guess as to when it might be addressed? The workaround above works, but ideally my team wouldn't need to know about this gotcha in future versions.
