Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cythonize some hotspots #55

Merged
merged 21 commits into from Jul 3, 2018
Merged

Cythonize some hotspots #55

merged 21 commits into from Jul 3, 2018

Conversation

jamadden
Copy link
Member

Fixes #52

Some of the changes also make pure python mode faster as well.

I tried to stay compatible, and did a number of searches of our github repos to see if I was breaking anything. See CHANGES.rst for notes on the few things I know I broke (but that didn't appear to be used). I'm encouraged by the fact that there are no substantial test changes and we still have full coverage.

I need to do some refactoring to bring module sizes under control, but this PR is big enough as it is. Refactoring of these modules is next. Then I'll look at other modules to see if they might need some cython love (suggestions for known hotspots welcome).

Now the numbers. These are simple benchmarks, but for more complex types, I expect the speedups to scale well (types with lots of Singleton decorators could see a nice benefit). Comparing this to the pure-python version on CPython 2.7:

Benchmark 27_plain_ds_ext_checkpoint19 27_cython_ds_ext_checkpoint19
Construct Singleton 298 ns 138 ns: 2.17x faster (-54%)
Construct Singleton with args 328 ns 159 ns: 2.06x faster (-51%)
.bm_simple_iface: toExternalObject 57.5 us 41.9 us: 1.37x faster (-27%)
.bm_simple_iface: find factory 4.10 us 4.00 us: 1.02x faster (-2%)
.bm_simple_iface: fromExternalObject 93.8 us 75.0 us: 1.25x faster (-20%)
.bm_simple_registered_class: toExternalObject 49.8 us 34.0 us: 1.46x faster (-32%)
.bm_simple_registered_class: fromExternalObject 22.3 us 13.5 us: 1.66x faster (-40%)

…on and singleton. No optimizations yet. All tests pass.
…naive version so far. Hopefully we can get better still.
This was a pretty big win, getting us 9-13% faster on the
externalization benchmarks compared with the last commit.
…place) so it can be a tuple. This is again slightly faster.
…eading underscores, rename our private module back to private again.
This nets us a 4% improvement or so in externalization, maybe a tiny
slowdown in internalization (but I expect that to be fixed when I type
that module).

We could do better, but the class attribute _ext_excluded_out_ and the
like are slowing us down, and we have some code that doesn't keep
their usual types (mostly in housing). A metaclass may be able to fix
that, because they all seem to be defined at the class level.
Define them as an instance, make an accessor function, type the result
of the accessor in the pxd. This avoids needing to get the type in the
python file, which is what causes the leading underscore problems.

This avoids the need for re-enumerating all the constants while still
being direct access (yay!). Speeds things up by another percent or two
in datastructures.py.
…ning bottlenecks

Two changes of note:
- notify_modified alias was removed. I couldn't find any uses of it on github
- eventFactory was removed from notifyModified. I couldn't find any
  uses of it on github.

These are both easy to add back if needed.
@cutz
Copy link

cutz commented Jun 28, 2018

Looking specifically at CHANGES:

I do see some usage of to_standard_external_dictionary that have **kwargs passed to it.

nti.dataserver
nti.conenttypes.completion
nti.contentlibrary
nti.app.products.courseware_ims

I haven't traced back far enough to tell if those contain args that would be unexpected but it's something to look in to.

@jamadden
Copy link
Member Author

jamadden commented Jun 28, 2018

I do see some usage of to_standard_external_dictionary that have **kwargs passed to it.

I later developed a pattern to deal with that, so it's easy enough to restore **kwargs.

(I came across that one very early and intended to go back and deal with it, search the repos, but I didn't get there. Thanks.)

@cutz cutz requested review from papachoco and jzuech3 June 28, 2018 18:49
Copy link

@jzuech3 jzuech3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jamadden jamadden merged commit f769190 into master Jul 3, 2018
@jamadden jamadden deleted the issue52 branch July 3, 2018 14:47
@jamadden
Copy link
Member Author

jamadden commented Jul 5, 2018

I've pushed this to PyPI as 1.0a2 so we can start finding out what I unintentionally broke 😄

@cutz
Copy link

cutz commented Jul 5, 2018

I know there were some other associated PRs that went along with this. Are the benchmarks in the original post still accurate?

@jamadden
Copy link
Member Author

jamadden commented Jul 5, 2018

Running the current benchmarks now and comparing against 1.0a1, the results are either no different than originally posted, or notably faster (in the last case you need to look at absolute time; for whatever reason the pure-python version was suddenly faster):

Benchmark 27_pure_10a1 27_cython_10a2-2
Construct Singleton 281 ns 136 ns: 2.07x faster (-52%)
Construct Singleton with args 295 ns 157 ns: 1.88x faster (-47%)
.bm_simple_iface: toExternalObject 59.7 us 37.5 us: 1.59x faster (-37%)
.bm_simple_iface: fromExternalObject 96.9 us 70.3 us: 1.38x faster (-27%)
.bm_simple_registered_class: toExternalObject 48.9 us 30.5 us: 1.60x faster (-38%)
.bm_simple_registered_class: fromExternalObject 19.8 us 13.8 us: 1.43x faster (-30%)

Compared to the benchmarks posted originally here, quite a bit of progress was made (again, that last one is within margin of error; I need more substantial benchmarks of a larger object tree):

Benchmark 27_cython_ds_ext_checkpoint19 27_cython_10a2-2
.bm_simple_iface: toExternalObject 41.9 us 37.5 us: 1.12x faster (-11%)
.bm_simple_iface: fromExternalObject 75.0 us 70.3 us: 1.07x faster (-6%)
.bm_simple_registered_class: toExternalObject 34.0 us 30.5 us: 1.11x faster (-10%)
.bm_simple_registered_class: fromExternalObject 13.5 us 13.8 us: 1.03x slower (+3%)

@jamadden
Copy link
Member Author

jamadden commented Jul 5, 2018

Of course, PyPy is still dramatically faster. We're talking order of magnitude or more:

Benchmark 27_cython_10a2-2 pypy27_cython_10a2-2
Construct Singleton 136 ns 1.72 ns: 78.93x faster (-99%)
Construct non-Singleton 188 ns 1.75 ns: 107.66x faster (-99%)
Construct Singleton with args 157 ns 1.89 ns: 82.95x faster (-99%)
Construct non-Singleton with args 200 ns 1.73 ns: 115.78x faster (-99%)
.bm_simple_iface: toExternalObject 37.5 us 7.79 us: 4.82x faster (-79%)
.bm_simple_iface: find factory 3.95 us 150 ns: 26.33x faster (-96%)
.bm_simple_iface: fromExternalObject 70.3 us 18.2 us: 3.87x faster (-74%)
.bm_simple_registered_class: toExternalObject 30.5 us 4.99 us: 6.11x faster (-84%)
.bm_simple_registered_class: fromExternalObject 13.8 us 1.11 us: 12.49x faster (-92%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants