Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kludge: doubly `ensure` clearing outermost realtime_block on Timeout #263

Merged
merged 3 commits into from Sep 11, 2017

Conversation

@cben
Copy link
Contributor

cben commented Aug 22, 2017

Based on #269

Fixes ManageIQ/manageiq#9642.
Timeout::Error is just one regular exception, it can be caught. But it can strike anywhere.
ensure is not broken, if it hits inside begin..ensure that's fine, the cleanup runs.
BUT if it strikes after ensure, it can abort the cleanup!
Also it can strike between performing current_realtime= and even entering begin..ensure.
(Instrumenting the test to log stacktrace where exception hit confirms these are the 2 problematic points:
https://gist.github.com/cben/d5ea134a1f556023905488f1a6a756bf)

The Hobbit - Bain gives Bard the Black Arrow "Timeout Errow"

Good news is there is only one Timeout::ArrowError!
(Unless we do nested Timeouts, and they expire in same millisecond.)
In this code the cleanup is safe to attempt twice. At first I wanted to do something like this, which still had weak spots:

      self.current_realtime = hash if outermost
# 🏹 ➸ Weak spot here!
      begin
        ret = realtime_store(hash, key, &block)
        return ret, hash
      ensure
# 🏹   ➸ Weak spot here!  not covered by either begin..ensure
+       begin
          delete_current_realtime if outermost
+       ensure
+         delete_current_realtime if outermost && self.in_realtime_block?
+       end
      end

This is why dragons have overlapping scales. A second begin..ensure covering the whole thing works, at least one ensure will complete uninterrupted. :-)
Successfully run 10M iterations βœ”οΈ

Spec shamelessly stolen from @jrafanie's POC in ManageIQ/manageiq#9642.
100k iterations take ~7seconds, doubling this repo's test suite :-(
10k usually missed the bug, it averages ~1/30k for me.
Can't narrow the time range down much (see gist above, wide time distribution to hit either spot).

@miq-bot add-label bug

@miq-bot miq-bot added the bug label Aug 22, 2017
@cben cben force-pushed the cben:overlapping-timeout-armor branch 2 times, most recently from 2243ea5 to 5c490c2 Aug 22, 2017
@Fryguy

This comment has been minimized.

Copy link
Member

Fryguy commented Aug 24, 2017

"Here be dragons" 😬

I'm good with this, and I don't mind that it adds 7 seconds. Since this happen 1/30k can you make the number 50k iteration?

ensure
delete_current_realtime if outermost
delete_current_realtime if outermost && in_realtime_block?

This comment has been minimized.

Copy link
@Fryguy

Fryguy Aug 24, 2017

Member

I'm trying to grok this, but I don't understand. outermost = !in_realtime_block, so this code is saying delete_current_realtime if !in_realtime_block? && in_realtime_block?, which is always, right?

This comment has been minimized.

Copy link
@cben

cben Aug 24, 2017

Author Contributor

outermost is cached !in_realtime_block? at begining; in_realtime_block? is a call checking current state (original had self.in_realtime_block? which might have stressed it, but rubocop objected to redundant self.).
At this point it should have original value restored, yes, so this && should never happen β€” unless Timeout::Error...

Rewritten as 2 branches β€” nested / outermost β€” and got rid of outermost var.
IMO reads clearer (?), PTAL.

@cben cben force-pushed the cben:overlapping-timeout-armor branch from 5c490c2 to 04610e7 Aug 24, 2017
@Fryguy

This comment has been minimized.

Copy link
Member

Fryguy commented Aug 31, 2017

@jrafanie Would like you πŸ‘€ on this as well.

# A second layer of protection in case TimeoutError struck right after
# setting self.current_realtime, or right before `delete_current_realtime`.
# In those cases, current_realtime might (wrongly) still exist.
delete_current_realtime && in_realtime_block?

This comment has been minimized.

Copy link
@jrafanie

jrafanie Aug 31, 2017

Member

@cben I don't understand what the && in_realtime_block? is doing.

Is that needed?

This comment has been minimized.

Copy link
@jrafanie

jrafanie Aug 31, 2017

Member

Was it intended to be if in_realtime_block? ?

This comment has been minimized.

Copy link
@cben

cben Aug 31, 2017

Author Contributor

Oops bad rewrite, yes, s/&&/if/.
This means it's safe to do without checking :-)
Huh. TIL ruby Hash#delete just returns nil if no such key. I should have guessed but I keep getting surprised...

IMHO it's more readable with the check, but don't mind either way.

This comment has been minimized.

Copy link
@jrafanie

jrafanie Aug 31, 2017

Member

Ok, cool. I'm fine with keeping the if in_realtime_block? check.

@jrafanie

This comment has been minimized.

Copy link
Member

jrafanie commented Aug 31, 2017

@cben This looks good to me. It's a simple solution to a very complicated problem. You're 100% right that it's perfectly fine for us to try to delete_current_realtime twice and at least one of them will always complete.

@cben cben force-pushed the cben:overlapping-timeout-armor branch from 04610e7 to 7b807af Aug 31, 2017
it "Timeout raising within .realtime_block" do
expect(Benchmark.in_realtime_block?).to be_falsey
50_000.times do |i|
timeout = 10**Random.rand(-13.0..-10.0)

This comment has been minimized.

Copy link
@cben

cben Aug 31, 2017

Author Contributor

Wait this is funny, 1e-10 is already 0.1ns, less than a cpu cycle! (Moore proof haha).
My "distribution" showed that timeouts in this range do trigger the bug, sometimes, but I think it just expired immediately and depended on thread context switch...

I'll measure with longer values.

@cben cben force-pushed the cben:overlapping-timeout-armor branch from 7b807af to 177aea1 Sep 5, 2017
@cben

This comment has been minimized.

Copy link
Contributor Author

cben commented Sep 6, 2017

Turned out, whatever the timeout number, less than 1 in 1000 Timeouts were hitting the realtime_block at all, most blocks were completing sucessfully.
=> Rewrote test to loop entering and exiting blocks until the Timeout hits. Now about 1 in 10 Timeouts catches the bug, but due to loop count it's only mildly faster (1st and 2nd commit have details). Whatever.
=> Prettier stacks tally https://gist.github.com/cben/d5ea134a1f556023905488f1a6a756bf but nothing new.

What I'm stuck on now, is that running the full suite rather than just this test, the infinite loop never finishes. I could limit the loop but then the test is ineffective. Anyway, how come Timeout doesn't work?!
It's an order dependency, I've bisected it to Tempfile.new. If I add Tempfile.new("x") at the start of this spec, and run just this spec, it loops forever.
Any ideas?

@jrafanie

This comment has been minimized.

Copy link
Member

jrafanie commented Sep 6, 2017

It's an order dependency, I've bisected it to Tempfile.new. If I add Tempfile.new("x") at the start of this spec, and run just this spec, it loops forever.
Any ideas?

@cben Weird. I recreated it "sporadically" and ran it with debug and it's definitely the cleanup of the tempfile that is hanging...

Here's a good run:

11:57:14 ~/Code/manageiq-gems-pending (cben-overlapping-timeout-armor) (2.3.4) - RUBYOPT=-d be rspec spec/util/extensions/miq-benchmark_spec.rb:69
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems.rb:1241 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems.rb:1250 - cannot load such file -- rubygems/defaults/ruby
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- did_you_mean
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- bundler
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
...
Exception `Errno::EEXIST' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/fileutils.rb:254 - File exists @ dir_s_mkdir - /Users/joerafaniello/.gem/ruby/2.3.4
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:558 - method `gem' not defined in #<Class:Kernel>
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:556 - undefined method `find_spec_for_exe' for class `#<Class:Gem>'
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:556 - undefined method `activate_bin_path' for class `#<Class:Gem>'
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-core-3.5.4/lib/rspec/core/pending.rb:20 - uninitialized constant RSpec::Core::Pending::Test
Using Ext extension for JSON.
Run options: include {:locations=>{"./spec/util/extensions/miq-benchmark_spec.rb"=>[69]}}

Randomized with seed 7990

Benchmark
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
removing /var/folders/fq/blrz820d3qz7nm7vj8mbtfs40000gq/T/x20170906-57250-s683fs...
done
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
...
  Timeout raising within .realtime_block

Finished in 1.96 seconds (files took 0.15031 seconds to load)
1 example, 0 failures

Here's when it hangs:

11:57:22 ~/Code/manageiq-gems-pending (cben-overlapping-timeout-armor) (2.3.4) - RUBYOPT=-d be rspec spec/util/extensions/miq-benchmark_spec.rb:69
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems.rb:1241 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems.rb:1250 - cannot load such file -- rubygems/defaults/ruby
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- did_you_mean
Exception `LoadError' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- bundler
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
...
Exception `TypeError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/vendor/thor/lib/thor/base.rb:637 - can't dup FalseClass
Exception `Errno::EEXIST' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/fileutils.rb:254 - File exists @ dir_s_mkdir - /Users/joerafaniello/.gem/ruby/2.3.4
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:558 - method `gem' not defined in #<Class:Kernel>
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:556 - undefined method `find_spec_for_exe' for class `#<Class:Gem>'
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/bundler-1.15.4/lib/bundler/rubygems_integration.rb:556 - undefined method `activate_bin_path' for class `#<Class:Gem>'
Exception `NameError' at /Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-core-3.5.4/lib/rspec/core/pending.rb:20 - uninitialized constant RSpec::Core::Pending::Test
Using Ext extension for JSON.
Run options: include {:locations=>{"./spec/util/extensions/miq-benchmark_spec.rb"=>[69]}}

Randomized with seed 61986

Benchmark
Exception `Timeout::Error' at /Users/joerafaniello/.rubies/ruby-2.3.4/lib/ruby/2.3.0/timeout.rb:112 - execution expired
removing /var/folders/fq/blrz820d3qz7nm7vj8mbtfs40000gq/T/x20170906-57351-jfeu6l...
@jrafanie

This comment has been minimized.

@jrafanie

This comment has been minimized.

Copy link
Member

jrafanie commented Sep 6, 2017

@cben So, tl;dr: Timeout is horrible and appears to be raising the exception in the finalizer when the tempfile object is being GC'd.

I'm able to pretty reliably recreate it on osx with this script:

require 'tempfile'
require 'timeout'
Tempfile.new("x")

def stuff
  ""
end

20.times do |i|
  begin
    Timeout.timeout(1e-9) do
      loop do
        begin
          "result"
        ensure
        end
      end
    end
  rescue Timeout::Error => e
  end
end

So, I'm leaning towards not testing this explicitly.

@jrafanie

This comment has been minimized.

Copy link
Member

jrafanie commented Sep 6, 2017

@cben I don't expect they'll find a way to reliably fix this, since it's Timeout's thread unsafe code at fault but I opened a bug anyway: https://bugs.ruby-lang.org/issues/13876

@cben

This comment has been minimized.

Copy link
Contributor Author

cben commented Sep 7, 2017

@cben cben force-pushed the cben:overlapping-timeout-armor branch 2 times, most recently from 7907ecf to c7ef5bf Sep 7, 2017
@cben

This comment has been minimized.

Copy link
Contributor Author

cben commented Sep 7, 2017

Rebased on #269. Added GC.start, made loop bounded. PTAL.

@Fryguy

This comment has been minimized.

Copy link
Member

Fryguy commented Sep 8, 2017

Merged #269

@Fryguy

This comment has been minimized.

Copy link
Member

Fryguy commented Sep 8, 2017

Why fail, as opposed to raise?

end
end
end
fail("Completed without Timeout!? Tip: may result from unclosed Tempfile object.")

This comment has been minimized.

Copy link
@jrafanie

jrafanie Sep 8, 2017

Member

Good πŸ‘ @Fryguy, yeah, I'm not sure why we fail here. From testing this, we get to this line when the block is fast and the timeout is never raised OR when it's slow and the tempfile is cleaned up without the exception being raised.

This comment has been minimized.

Copy link
@jrafanie

This comment has been minimized.

Copy link
@cben

cben Sep 9, 2017

Author Contributor

My intent was get error from this test when it's stuck due to tempfile, instead of having Travis kill job at 10min.
When the block is fast β€” I tried to set enough loops so it never happens. For me timeout normally hits around 10,000 loops.
But 100,000 is <2sec, so yeah, bumped to 2,000,000 which is ~30sec.
Anyway if for any reason it completes without Timeout, that means this test is not testing anything, and I wanted an indication. Is failing too harsh?

["fail" because that's how stack overflow told me how to fail an rspec test, sounded right :-) changed to "raise".]

Slow.  Prints detailed success/fail stats per execution point.
Test doesn't actually fail.
To be replaced with simpler test based on this data.
@cben cben force-pushed the cben:overlapping-timeout-armor branch from c7ef5bf to ed1efda Sep 9, 2017
cben added 2 commits Sep 5, 2017
Dropped statistics printing from previous commit.
On my machine, each loop does ~9000 benchmark_block's until Timeout hits.
Approximately 1 in 10 loops triggers the failure.
Set to 20 loops - around 2 seconds - overall around 90% to catch the bug.
Fixes ManageIQ/manageiq#9642.
Timeout::Error is just one regular exception, it can be caught,
but it can strike anywhere.
If it hits begin..ensure that's fine, cleanup is done.
But if it happens between performing `current_realtime=` and entering
begin..ensure, or inside ensure before cleaning up,
outermost realtime_block could exit uncleanly,
and from that points ALL timings would accumulate forever
(as if we're still in a giant block).

In this case the cleanup is safe to attempt twice, so a 2nd `ensure`
works: at least one `ensure` will complete uninterrupted.
(Could still be defeated by Timeout within Timeout that elapse nearly
simultaneously.)
@cben cben force-pushed the cben:overlapping-timeout-armor branch from ed1efda to c5be670 Sep 10, 2017
@miq-bot

This comment has been minimized.

Copy link
Member

miq-bot commented Sep 10, 2017

Checked commits cben/manageiq-gems-pending@b98146b~...c5be670 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. πŸͺ

@jrafanie

This comment has been minimized.

Copy link
Member

jrafanie commented Sep 11, 2017

LGTM. Let's see how it goes πŸ‘

@jrafanie jrafanie merged commit 9f5aff0 into ManageIQ:master Sep 11, 2017
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jrafanie jrafanie self-assigned this Sep 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.