@kaste commented Nov 28, 2018

Sublime's worker thread is a task queue. If you put
tasks on it, they run *in order*.

We can use this to make the `yield` timing deterministic.
If we just `yield` (without an argument), we enqueue a task
to be run ASAP. That means it will wait for the tasks already
in the queue and then yield back to the test code.

E.g. from the test we invoke the unit under test, which in turn
puts tasks (A) and (B) on the worker. We then `yield`, which puts
itself as (C) on the queue and waits for it. Sublime will now run
(A), (B), and finally (C), in order. The test runner then yields
control back to the test.

If (A) in turn enqueued a task (D) on the queue, it will not have
run yet, and in fact will not run at all if we do not `yield` again.

See the test included here.

Using this, I could bring down the execution time of the
test in timbrel/GitSavvy#1056 from ~7.7s
to 2.7s. I could replace all `yield <condition>` expressions with
single or double `yield`s, making the test more reasonable. Because
I know that a command 'hops', e.g., two times, I know exactly that
I have to `yield` two times.

This comes with the big caveat that coverage doesn't work on the
worker thread. This is already true right now, see
timbrel/GitSavvy#1056 (comment),
and it is a big annoyance for writing functional tests. E.g. as soon
as we issue a command which itself is implemented to run
'async' (i.e. on the worker thread) and then just `yield <condition>`
to wait for it, the code under test will not be marked as 'covered'.
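
A minimal sketch of such a test, assuming a hypothetical `my_async_command` that enqueues tasks (A) and (B) on the worker:

```python
import sublime
from unittesting import DeferrableTestCase


class MyAsyncTests(DeferrableTestCase):
    def test_deferred_work(self):
        view = sublime.active_window().new_file()
        try:
            view.run_command("my_async_command")  # enqueues (A) and (B) on the worker
            yield  # enqueues (C); Sublime runs (A), (B), (C) in order, then resumes here
            yield  # one more hop, in case (A) enqueued a follow-up task (D)
            # ... assert on the observable result, e.g. the view's content
        finally:
            view.set_scratch(True)
            view.close()
```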
@randy3k commented Nov 28, 2018

Does this resolve the coverage issue in GitSavvy?

@randy3k commented Nov 28, 2018

Um, I didn't read your post carefully at first glance. Coverage indeed doesn't work with the Sublime worker thread; I don't have any solution except using the threading library.

In the first series of commits we tried to run all `yield` blocks,
and with that a lot of the code under test, on the worker thread. This
gives us more predictable timing, because a developer can count how many
times their code 'hops' and then yield that many times.

Generally, the FIFO principle also holds for the main thread. However,
choosing the main thread is a bad option. (a) Usually the code under test
defers some work to the worker thread, and we really want to await that
work. (b) The main thread runs faster. We chose the slower thread because
when the worker yields a task, the main thread is usually done already.

There is one major caveat when we run on the worker thread, namely that
coverage doesn't work anymore.

This commit introduces a switch which will, *globally* for the duration of
the tests, swap `set_timeout_async` with `set_timeout`. By doing that,

(a) everything runs super fast,
(b) every scheduled work/task runs on one thread, the main thread, so the FIFO
    principle holds and ensures deterministic timing,
(c) all code under test gets coverage.

This comes with one known caveat: Sublime puts '_async' events on the worker
thread on its own, i.e. it doesn't use the patched API. As a result, you cannot
await such an event by just `yield`ing. You have to use `yield <delay>` or
`yield <condition>` as before.
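
A minimal sketch of that global swap; `run_all_tests` is a hypothetical stand-in for the actual runner entry point:

```python
import sublime

# Remember the original and restore it after the test run so the patch
# does not outlive the tests.
original_set_timeout_async = sublime.set_timeout_async
sublime.set_timeout_async = sublime.set_timeout  # all tasks now land on the main thread
try:
    run_all_tests()  # hypothetical runner entry point
finally:
    sublime.set_timeout_async = original_set_timeout_async
```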
@kaste commented Nov 29, 2018

I pushed this PR. There is a long explainer in 63b08cc.

Goals:

  • Overall faster tests
  • Predictable async tests. If the developer knows their code 'hops' twice, we need to `yield` twice
  • Correct coverage for deferred work

I compare this PR against current master. The units under test are (a) basically timbrel/GitSavvy#1056 and (b) the SublimeLinter test suite.

Note: `controlled_timing` is the wrong name. `controlled_timing=True` means
everything runs on the main thread, and `set_timeout_async` is patched.
`controlled_timing=False` means everything runs on the worker thread, i.e.
the framework 'unittesting' runs on the worker thread, nothing is patched.


SublimeLinter
-------------

current master  test_user_args  93 tests   6.3s
                package        194 tests  14s      40% coverage       (A)

this PR controlled_timing=True
                test_user_args  93 tests   2s
                package        194 tests   4s      40% coverage       (B)

this PR controlled_timing=False
                test_user_args  93 tests   4s
                package        194 tests  11s      29% coverage       (C)

Note: this test suite is focused on backend stuff, not really view-related.


GitSavvy#1056
-------------

current master  test_graph_view  9 tests  7s      50%/52% coverage    (D)

this PR controlled_timing=True
                test_graph_view  9 tests  2.4s    69%/95% coverage    (E)

this PR controlled_timing=False
                test_graph_view  9 tests  4.3s    47%/52% coverage    (F)

Note: the units under test are in two files, so I show x%/y% coverage.
The correct coverage is 69%/95%.


GitSavvy#1056 optimized*
------------------------

this PR controlled_timing=True
                test_graph_view  9 tests  1.2s    69%/95% coverage    (G)
          
this PR controlled_timing=False
                test_graph_view  9 tests  2.3s    47%/52% coverage    (H)

* E.g. instead of `yield <condition>` I just `yield` twice. 

Summary: this PR runs the GitSavvy and the SublimeLinter test suites
unmodified. So it is not wildly breaking. controlled_timing=True gives
the correct coverage while at the same time giving the best performance.

Notable:

(D) slow and wrong coverage
(C) (F) (H)  controlled_timing=False  all have wrong coverage; they
intentionally run only on the worker
~(C) is slower than (A)~

EDIT: wrong numbers for (B) and (C); (B) runs in 4s, not 11s; (C) runs in 11s, not 16s.

@kaste commented Nov 29, 2018

I want to make another comment on the difference between (E) and (G). The tests in the GitSavvy PR (E) mostly use `yield <condition>` or `yield <delay>`. (G) uses plain `yield` (None) and `yield <delay>`. Generally, with async code, we have these three bad options.

With just `yield` we depend on an implementation detail in that we know how often we 'hop'. It is not generally advisable to rely on that, but it is good to have.

`yield <condition>` depends on the actual side effect being observable. If the view changes, this condition usually breaks: e.g. we depend on a concrete rendering when we only want to know that we rendered at all.

With `yield <delay>` we depend on the CPU. It's basically just number guessing and is known to be flaky.
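
Side by side, inside a `DeferrableTestCase` test, the three options look like this (`my_command` and `view_content` are hypothetical):

```python
def test_three_ways_to_wait(self):
    self.view.run_command("my_command")

    yield                                   # plain: relies on knowing the hop count
    yield lambda: "done" in view_content()  # condition: needs an observable side effect
    yield 500                               # delay in ms: CPU-dependent and flaky
```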

@randy3k commented Nov 30, 2018

Wow, thank you for working on this matter. I currently don't have enough time to review this. Does @Thom1729 want to help?

@Thom1729 commented:

I can probably take a look this weekend, or possibly tomorrow.

I am concerned about patching the `sublime` module. Surely there must be another way to accomplish the same end.

@Thom1729 commented Dec 2, 2018

If the goal is to choose `set_timeout`/`set_timeout_async` based on `self.controlled_timing`, then why not simply make `defer` an instance method?

```python
from functools import partial

import sublime


def defer(self, delay, callback, *args, **kwargs):
    if self.controlled_timing:
        schedule = sublime.set_timeout_async
    else:
        schedule = sublime.set_timeout

    schedule(partial(callback, *args, **kwargs), delay)
```

@kaste commented Dec 2, 2018

No, that's not really the goal. There are three goals, see above: #146 (comment)

To get all three is tricky. Personally, from a tester's standpoint, I would have chosen controlled_timing=False, but it is still slow, and coverage is completely off, even worse than on current master.

controlled_timing=True is a good choice. It's the fastest and has correct (not only best, but correct) coverage. The only concern I have is that I cannot await Sublime's (native) events, because Sublime puts the event handler on the worker queue directly, obviously not using the `sublime.set_timeout(_async)` API.

E.g. you place a cursor somewhere; you know this will trigger `on_selection_modified_async`, but you cannot `yield` once and know that after that the event has been handled. (With controlled_timing=False this works.) Instead you have to use `yield <condition>` or `yield <delay>` to await an observable side effect. (This gets unfortunate if there actually is no side effect.)

Now, okay, the tests run fast, but you will have such async events in the queue, you cannot await them, so they pretty surely leak into the next test. (We would need to write a test to prove that.) Should we or must we take care of this?

For GitSavvy, I think I saw such outdated events, and in some sense they make the code better, because after the test I close the view, and so the events run with a detached view (e.g. view.window() is None, view.buffer_id() == 0), and I actually have to guard against these kinds of nulls, otherwise Sublime throws. This makes the code more defensive than what we usually hack in for such event handlers.

Still, if we want turn-based tasks, there is no concern with mocking/stubbing out the main enqueuing function (here `set_timeout(_async)`). I think that's the common way of doing this.

@Thom1729 commented Dec 2, 2018

So is the purpose of patching `sublime.set_timeout` to change the behavior of the tested code (to run everything on the main thread)? I'm not sure that this is a good idea for a unit test. The test should run the code in the way that it will be run in practice, or as closely as possible.

Notwithstanding the above, patching the `sublime` module is probably a bad idea. That module is shared by every package, and patching it is likely to cause problems that are very hard to detect, troubleshoot, and fix.

I recognize the coverage issue. I would suggest that breaking the code into smaller pieces might make it easier to unit-test those pieces without worrying about async issues. Alternatively, you could use dependency injection: pass a `defer` method into your code instead of calling `sublime.set_timeout_async`.

@kaste commented Dec 3, 2018

We should explore the design space we have here. Your 'concerns' are very broad and general opinions.

We have at least two problems/issues to solve:

a) The tests are too slow. (E.g. the GitSavvy tests: 9 tests in 7 seconds doesn't scale, and you basically can't practice TDD in a meaningful way.)

b) The coverage is completely off. (The workaround would be to write tests for the sake of coverage, not around the features you actually want to ensure and protect.)

As a plus, there is a third issue. This is a reasonable issue because we already have a `DeferrableTestCase` with `yield`, which promises to make testing 'async' code better/easier/feasible. So the issue is:

c) Plain `yield` (aka `yield None`) has no meaningful behavior. (Currently, it directly translates to `yield 10` [milliseconds], but 10ms has no meaningful semantics for a developer.) We should be able to tick the clock in a controlled way using `yield`.

For this PR:

  • Is there a minimal commit/changeset we can ship for all users which has a positive impact without breaking anything?

  • Is there a way to apply some of these changes à la carte? Does it make sense to turn these on/off for the scope of a class or a single test?

  • Can we still have a meaningful `yield 'AWAIT_WORKER_THREAD'`?

  • How can we deploy something like this so that Travis can run these changes? E.g. can I teach Travis to run with master and a next version of unittesting so we can create some confidence over time?

@Thom1729 commented Dec 3, 2018

As you say, it's tricky to try to solve all of those problems at once. So let's look at them one by one:

> a) The tests are too slow. (E.g. the GitSavvy tests: 9 tests in 7 seconds doesn't scale, and you basically can't practice TDD in a meaningful way.)

Why are the GitSavvy tests slow? Would a less drastic solution make them faster? (For instance, using `set_timeout` in the runner instead of `set_timeout_async`, but not patching `sublime` to do that globally.)

> b) The coverage is completely off. (The workaround would be to write tests for the sake of coverage, not around the features you actually want to ensure and protect.)

This is a general problem with coverage. It could be partly ameliorated by using `set_timeout` in the runner, but that doesn't help if the GitSavvy code uses `set_timeout_async` directly. The approach you propose would cause GitSavvy to use `set_timeout` instead.

However, you could achieve the same result by having an explicit switch inside GitSavvy that chooses which function to use. This would be much safer than patching the shared `sublime` object. There is a risk to having your code behave differently during testing, but that risk is no less if it behaves differently due to a patched `sublime` object.

> c) Plain `yield` (aka `yield None`) has no meaningful behavior. (Currently, it directly translates to `yield 10` [milliseconds], but 10ms has no meaningful semantics for a developer.) We should be able to tick the clock in a controlled way using `yield`.

This sounds less like a problem and more like a place where a feature could be added. I'm not sure that this is really related to the other questions.

I don't mean to be dismissive, but I don't think that we can "create some confidence over time" for monkeypatching the `sublime` object. That might be an absolute last-resort measure if it were impossible to implement core functionality without it. Here, though, it seems that there are workarounds.

@kaste commented Dec 3, 2018

What do you actually think 'patching sublime' does here? It happens during a test. You also mock getters like `sublime.active_window()` or `window.folders()`, or side effects like `view.add_regions`, in such a context.

The SublimeLinter tests are just as slow, and these are more-or-less narrow unit tests for the backend. They don't involve drawing, at least.

How many programs do you have at hand that actually observe on what thread or process they run? If a function offloads some tasks A and B, the function usually falls off and terminates after that, and actually doesn't care whether A and B run now or later or at all. And you basically use that in unit tests all the time. The functional test delegates/'offloads' to tasks, but a different unit test can simply invoke such a task on the main thread. E.g. usually you can just call `MyEventListener().on_selection_modified_async(view)` to test it.

In a unit test, you usually forbid and catch (aka patch) `set_timeout(_async)` calls, maybe taking a note and asserting that two tasks have been enqueued. In a functional test, you run them in a controlled fashion; for example, you still don't run them on a different computer or 'in a week'. A sketch of that unit-test pattern follows below.
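
A minimal sketch of that pattern, using `unittest.mock`; `function_under_test` is hypothetical:

```python
from unittest.mock import patch


def test_enqueues_two_tasks(self):
    # Capture the scheduling calls instead of letting the tasks run.
    with patch("sublime.set_timeout_async") as schedule:
        function_under_test()
    self.assertEqual(schedule.call_count, 2)  # exactly two tasks were enqueued
```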

Important note: the runner currently uses `set_timeout`.

@Thom1729 commented Dec 3, 2018

> You also mock getters like `sublime.active_window()` or `window.folders()`, or side effects like `view.add_regions`, in such a context.

Mocking is one thing; patching the real objects is something else.

> The SublimeLinter tests are just as slow, and these are more-or-less narrow unit tests for the backend. They don't involve drawing, at least.

I'm not familiar with the SublimeLinter tests. I can say that the 171 sublime_lib tests just ran in 14.237 seconds. This includes a lot of tests that open windows, write files, and perform other relatively slow operations.

Looking at the diff, I'm not sure that the performance gains you saw were actually due to patching `sublime.set_timeout_async`. It looks like you also changed the default timings in the runner. What happens if you use UnitTesting as-is, but explicitly specify these timings when you yield?

@kaste commented Dec 4, 2018

Yes, of course, the performance gains come from changing the timings in the runner. However, if you just lower these values you get sync problems, and your tests get flaky. It is basically a feature that when you run your tests single-threaded, you can actually push these values, because you have deterministic timing. The tasks run ordered and in a controlled way.

Changing the unit tests in GitSavvy or SublimeLinter has only a limited effect. In the example above I used SublimeLinter's test_user_args_dsl.py https://github.com/SublimeLinter/SublimeLinter/blob/master/tests/test_user_args_dsl.py which never yields and actually is 'pure' in that it doesn't interact with the Sublime API at all. As I said above, I can bring this test down from 6s to 2s for 93 tests.

For comparison since you mentioned sublime_lib.

sublime_lib master
------------------
                                           EDIT*
current master   171 tests        35s        20s
this PR controlled_timing=True     9s         9s
this PR controlled_timing=False   55s        16s

Note: 5 failing tests regardless of how I run these tests. 2 look like bugs, 3 of them are async issues.

this PR controlled_timing=True  & fixed and optimized tests
                                  7.5s

* EDIT: A small change in the test suite can produce these values.

sublime_lib maybe has a handful of `yield`s, so, just like in SublimeLinter, there aren't many options to make the unit tests themselves faster. As it stands, the framework itself is slow.

@Thom1729 commented Dec 5, 2018

Right now, test_user_args_dsl.py runs in about 4.5s on this machine. I modified runner.py to remove the 10ms delay in `_continue_testing`, and the tests took about 1s instead.

Running all of the SublimeLinter tests takes about 9s as-is. Removing the 10ms delay lowered the time to about 2.5s. (There were some failures, the same in either case.) The change also sped up the sublime_lib tests from 14s to 7s.

Based on these results, it seems like a very simple and unintrusive change will produce the performance gains you're looking for. I don't see why this change should break any tests unless those tests relied on race conditions.

Can you reproduce these results?

@randy3k commented Dec 5, 2018

To be honest, I didn't read the whole thread but I am excited to see contributions to UnitTesting.

@kaste feel free to merge any PRs to UnitTesting at will. I don't have enough time to take care of all the projects.

@kaste commented Dec 5, 2018

@randy3k Thanks for the kind words, but you actually didn't give me push rights, and I don't want to push something without some consensus.

I already rolled back the more controversial changes. For now, I only want to make sure a user can make such drastic changes as patching `sublime`.

So the current state would be:

  • Reduce the wait times to make the tests as fast as possible
  • Ensure via test that a user can patch `sublime.set_timeout_async` to run their program single-threaded and with the promised properties: ordered tasks, hopping using plain `yield`
  • Add `yield AWAIT_WORKER` 💃 this can be used to await the handling of async Sublime events (see the sketch below)
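
A minimal sketch of how awaiting such an event could read, assuming `AWAIT_WORKER` ends up exported from `unittesting`; the command and assertion are hypothetical:

```python
from unittesting import DeferrableTestCase, AWAIT_WORKER  # assumed export


class EventTests(DeferrableTestCase):
    def test_awaits_native_async_event(self):
        # `self.view` is assumed to be prepared in `setUp`. Moving the cursor
        # triggers Sublime's native `on_selection_modified_async` on the worker
        # thread; plain `yield` cannot await that, but `AWAIT_WORKER` can.
        self.view.run_command("move", {"by": "characters", "forward": True})
        yield AWAIT_WORKER
        # ... assert on whatever the async event handler did
```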

@Thom1729 Optimizing the timing produces flaky and then failing tests in GitSavvy. Now, that's because GitSavvy's code is shitty, but most Sublime plugins are shitty. And these would be super annoying failures, because maybe a project gets a PR and suddenly some tests fail, and we can't even produce good failure messages for it. So the first draft of this PR did not fail the usual tests, but of course, as a drastic change, it would have been opt-in anyway.

Okay, since this is breaking, I thought I'd add another option, fast_timings (default: False 😢). We can recommend its usage in the README. And it is really recommended, as I see it, because it can probably make your code better. At least all failures in GitSavvy are actually bugs and non-defensive coding.

@Thom1729 commented Dec 6, 2018

What is the purpose of `AWAIT_WORKER`?

To clarify, do the GitSavvy tests fail if you defer with a timeout of zero, or only if you remove `defer` in those places? I didn't test removing those `defer` calls altogether; the speedups I cited came from changing a single 10 to a 0. (By observation, the speedup is significantly more than 10ms times the number of delays. I suppose that non-zero delays must hit a different code path with more overhead. I'll see if I can run some benchmarks this weekend.)

I think that we should start with the smallest possible change rather than trying to do it all in one PR. In this case, I think that the smallest possible change would be changing the 10ms delay to zero. We can make this an option if necessary. (If it causes tests to fail, that probably means that the behavior was not correct and that the tests should fail; however, this could also be interpreted as a breaking change that should wait for a major release.)

I'm not sure that changing the condition timeout is necessary, or that it should change quite that much. This part, at least, is already user-customizable.

I can push changes when they're ready, if @randy3k approves in principle.

@randy3k commented Dec 6, 2018

@kaste you just need to join the SublimeText community.

@Thom1729 you have my approval (I don't have time to review everything). As a general guideline, a PR can be merged if someone reviews it (in addition to the original author).

@kaste commented Dec 6, 2018

@randy3k I didn't receive any invitation to join that community. I only care about unittesting anyway, I think :-)

@Thom1729 The title of the PR is 'Deterministic timing'; just changing numbers and making it faster is not my goal, it's a side effect of it. I want some meaningful timing here. I initially wanted single-threaded execution of the unit tests; now this PR is about allowing single-threaded execution, and I have a test in this PR that ensures some properties when a user opts in to single-threaded testing.

Changing the timing is a breaking change. It breaks more the faster you get. (Unless you constrain yourself to single-threadedness. That's what I did, but you're against it.) You start flaky, and then if you lower the timing further you get consistent failure behavior again.

Why is that? You want async test execution. If you do that you get synchronization problems, and the system under test is only eventually consistent. That's why we `yield 500`, for example. We just wait for the system to reach a stable, consistent state again.

Now, in fact, it seems unittesting on master yields way more time than a developer explicitly tells it to. For example, the SublimeLinter test suite never yields explicitly because it only runs pure tests. Still, 2/3 of the execution time is wasted. We can actually go from 6s to 2s execution time without breaking anything; 4s are implicit waiting time introduced by unittesting.

For GitSavvy, we don't have 'pure' tests but feature tests, which actually do require waiting time, because Sublime must draw a view or panel etc. We start here with 7s, and already know that we can do the same in 2.4s, because I proved that above. (But the 2.4s come with constraints.) If the math is roughly correct, unittesting master has an implicit `yield 500` per test here. Can you just cut the waiting time in half by a random timing change? No, because you don't know how much headroom the test developer has calculated in.

Is changing a timing from 20 to 10 [ms] the same as changing it from 10 to 0? Usually not; 0 should usually enqueue the task as immediate. I.e. if you schedule it with 0, you actually add it to the end of the current turn, and the system doesn't go idle (or yield back to the scheduler) for a short period. (Which very likely means async events do not get handled.)

Generally, if you have less waiting time, it is more likely that events or tasks waiting in the worker queue leak into the next test. `yield AWAIT_WORKER` is a utility to await exactly these tasks. (It does a cheap double-hop right now (main -> worker -> main [thread]), which I'm not happy about, but otherwise the test would continue execution on the worker thread, which is surprising and breaks coverage again.)
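
A minimal sketch of what such a double-hop could look like; the function name is hypothetical and `continue_` stands for the runner's continuation:

```python
import sublime


def await_worker(continue_):
    # Hop to the worker thread first; everything already queued there runs
    # before our task. Then hop back so the test resumes on the main thread.
    sublime.set_timeout_async(lambda: sublime.set_timeout(continue_))
```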

@kaste mentioned this pull request Dec 6, 2018
@kaste mentioned this pull request Dec 10, 2018
@randy3k closed this in #150 Dec 20, 2018
@kaste deleted the deterministic-timing branch December 20, 2018 10:33