nuitka binaries slow performance with Python3 in-place string add #150

Closed
geozeke opened this issue Sep 24, 2018 · 21 comments

@geozeke

geozeke commented Sep 24, 2018

Nuitka version:

0.6.0rc4
Python: 3.6.6 (default, Sep 12 2018, 18:26:19) 
Executable: /usr/bin/python3
OS: Linux
Arch: x86_64

Installed via: apt

Greetings,

Really love this project. Thanks for all your hard work on it. My nuitka binaries generally perform really well, even for complex projects. The one exception is when I implement code that generates a large volume of random numbers. The code below is much slower in a nuitka build when compared to running it in the python3 interpreter:

import random

chars = []
for i in range(256):
    chars += [chr(i)]
wrapper = ""
for i in range(10000000):
    wrapper += random.choice(chars)

@kayhayen
Member

You are right, check this: http://speedcenter.nuitka.net/constructs/construct-inplaceoperationstringadd.html

For Python2 it is barely faster, for Python3 it is actually slower. Since this is mostly copying memory around, there is not much gain to be had, except maybe for faster loop iteration. I will check how this works; in-place unicode operations (Python3) must have at least one trick I am not aware of yet.

Yours,
Kay

@geozeke
Author

geozeke commented Sep 24, 2018

Ah! Thank you. So perhaps it has nothing to do with the random library, but rather the string concatenation. Very interesting. I'll try a different approach to see if it can be made faster.

@kayhayen kayhayen changed the title nuitka binaries slow performance with python's random library nuitka binaries slow performance with in-place string add Sep 25, 2018
@kayhayen
Member

Going with a string join would be faster for your code either way. You take references to the pieces in a list or generator, and then assemble only one final string, where I think the string join computes the length ahead of time.

In this in-place scenario, the worst thing is that a lot of copying of the already large string to a new place happens whenever the over-allocation is exceeded. Nuitka used to be incredibly slow with this kind of thing, doing that copy for every in-place operation instead of only once in a while, when the re-alloc fails.

But right now I couldn't find where that re-alloc happens in CPython, so I can't say how often it really occurs; it probably involves luck too.
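
For illustration, a minimal timing sketch of the two approaches (loop count reduced from the original so it finishes quickly; the numbers are only illustrative):

import random
import timeit

chars = [chr(i) for i in range(256)]
n = 100000  # reduced from 10000000 to keep the timing runs short

def inplace_add():
    # grows the string one character at a time, copying whenever the
    # over-allocation is exceeded
    s = ""
    for _ in range(n):
        s += random.choice(chars)
    return s

def join_once():
    # str.join assembles the final string in one go
    return "".join(random.choice(chars) for _ in range(n))

print("in-place add:", timeit.timeit(inplace_add, number=3))
print("str.join:    ", timeit.timeit(join_once, number=3))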

My finding is that this code here:

    /* copy 'w' into the newly allocated area of 'v' */
    memcpy(PyUnicode_AS_UNICODE(*operand1) + operand1_size, PyUnicode_AS_UNICODE(operand2),
           operand2_size * sizeof(Py_UNICODE));

is done in a much more sophisticated way in CPython, at least as of Python 3.5. Depending on the unicode sub-type, PyUnicode_AS_UNICODE will make a call to a function, whereas CPython has a funky function "copy_characters" to achieve the same, which probably doesn't involve conversions, and handles strings of different kinds way better.

So what is probably happening is that, by calling that function, Nuitka may be forcing the "ASCII" strings you have there to go WCHAR, and then use more memory and more time for the copying as a result. Actually, I think Nuitka may be understating the performance loss in the benchmark, because it only does the operation once per loop iteration and does not re-use the string; with more copying to do, the penalty at re-alloc time would be even larger.
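
As an aside, the memory cost of the different unicode sub-types (PEP 393: 1, 2, or 4 bytes per character depending on the widest code point in the string) is easy to observe from plain Python:

import sys

ascii_s = "a" * 1000           # 1 byte per character
ucs2_s = "\u0394" * 1000       # 2 bytes per character (code point above U+00FF)
ucs4_s = "\U0001F600" * 1000   # 4 bytes per character (code point above U+FFFF)

for s in (ascii_s, ucs2_s, ucs4_s):
    print(len(s), sys.getsizeof(s))  # same length, very different sizes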

I definitely need to mirror the approach CPython takes to catch up in this test case. I just need to check again which version introduced those sub-types of Unicode. Very probably it once was good, but CPython optimized behind my back.

Thanks for making me aware; this is probably critical performance stuff that had escaped my attention so far.

I should also critically examine other uses of PyUnicode_AS_UNICODE. There seems to be one in the print statement of Python2 (probably not an issue, there were no unicode sub-types there yet, so probably no converting there). And one for the ord built-in of Python3, very probably also forcing conversions.

I will be aiming to make this part of 0.6.1 or even a hotfix to 0.6.0 which I mean to release today.

Yours,
Kay

@kayhayen kayhayen changed the title nuitka binaries slow performance with in-place string add nuitka binaries slow performance with Python3 in-place string add Sep 25, 2018
@kayhayen kayhayen self-assigned this Sep 25, 2018
@kayhayen kayhayen added the bug label Sep 25, 2018
@geozeke
Author

geozeke commented Sep 25, 2018

Thank you again! I can see how creating strings (which are immutable in Python) over and over would be a waste. I re-factored the code to use lists rather than strings, with a final .join() at the end, and it is indeed much faster in nuitka. It now looks like this:

import random

n = 10000000

chars = []
for i in range(256):
    chars += [chr(i)]

wrapper = ['*'] * n
for i in range(n):
    wrapper[i] = random.choice(chars)

s = "".join(wrapper)

@kayhayen
Member

Looking at it, I noticed that the code I was assuming guilty was in fact Python2-only code that effectively remained unused so far, and has nothing to do with Python3 here. I need to find another explanation for what is going on.

@kayhayen
Member

As for your example, creating a large list ahead of time will indeed be faster, but it uses a lot of memory. I would have done something like this:

s = "".join(random.choice(chars) for i in range(n))

But I am not so sure that the join operation won't build a temporary list and grow it over and over, giving you back the original problem. Maybe use a bytearray, which is mutable, and convert to a string only at the very end? That should save a lot of unicode overhead.

@geozeke
Author

geozeke commented Sep 26, 2018

Hmmm... I'll run some performance comparisons to see if there's a memory-vs-speed "sweet spot".

@kayhayen
Member

@geozeke Well, a bytearray will be compact like the unicode string, but mutable in-place, so you can create it ahead of time too, then modify it, and no trade-off is needed.
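
A minimal sketch of that bytearray approach, assuming the characters fit in the 0..255 range so they round-trip through latin-1 (random.randrange stands in for picking from the chars list, since the values map one-to-one):

import random

n = 10000000

buf = bytearray(n)  # allocated once up front, mutable in place
for i in range(n):
    buf[i] = random.randrange(256)

s = buf.decode("latin-1")  # convert to a str only at the very end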

@geozeke
Author

geozeke commented Sep 26, 2018

Great point!

@kayhayen
Member

What I have found out so far is that Nuitka apparently uses PyUnicode_Append(), but somehow the input values lead to different variants of in-place extension, where the CPython run time gets a cheaper one. In fact, applying a function that forces internal changes to the string on the right-hand side makes things go faster and use the other code path.

So this basically now comes down to the question of why Nuitka ends up with different kinds of unicode strings. It could well mean that the creation of them, or certain operations like multiply, start from or end up with different unicode types, and not the best ones.

If I e.g. change the right-hand side of the in-place operation, i.e. the value added, each time, then Nuitka is faster. This also lends credence to the idea that CPython has some advantage when adding the same string over and over, because it probably matches better internally.

But when I tried to look, I have so far not found the actual difference that makes the code paths diverge at run time. This is very annoying stuff. I will fight on however. :)
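
A rough sketch of that experiment, meant to be run both under CPython and as a Nuitka-compiled binary; under plain CPython the two variants behave much the same, and the interesting difference only shows up in the compiled case (numbers purely illustrative):

import timeit

def add_constant(n=100000):
    s = ""
    for _ in range(n):
        s += "x"  # identical right-hand side every iteration
    return s

def add_varied(n=100000):
    s = ""
    parts = ["x", "y", "z"]
    for i in range(n):
        s += parts[i % 3]  # right-hand side changes each iteration
    return s

print("constant rhs:", timeit.timeit(add_constant, number=5))
print("varied rhs:  ", timeit.timeit(add_varied, number=5))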

@geozeke
Author

geozeke commented Sep 29, 2018

Thanks for the hard work! Nuitka is turning out to be a really valuable tool :-)

@kayhayen
Member

kayhayen commented Oct 1, 2018

@geozeke So, following up on my idea of unicode strings in Nuitka somehow being different from those in CPython, I noticed that its marshal module, which is used for loading constants, handles "ascii" strings differently. So I thought, hell yeah, and changed Nuitka to do that too. But it didn't make a difference.

It will be slightly faster at startup, because it saves the extra step of checking whether a string is ascii or not for almost every constant unicode string in Nuitka. But that is about it.

Not sure now if I am not chasing a ghost. It seems the memory usage pattern of Nuitka may just somehow lend itself less to resizing existing strings in place, and that I should re-formulate the test to make more in-place additions in a loop instead of over function calls, which maybe spoil memory usage. But on the other hand, why would they? It might be worth finding where that happens.

@geozeke
Author

geozeke commented Oct 1, 2018

Wow! This is turning into quite a project. I appreciate you chasing this down. It is curious why in-place string operations are so quick in the Python interpreter. It must be doing something under the hood to optimize / cache things to make it faster.

@kayhayen
Member

kayhayen commented Oct 1, 2018

I think I am down to very similar code flow. However, when a "realloc" happens, it can either just use free space after the string and mark it as used (very fast), or it might have to copy the whole string into a larger buffer, because there is no free space after it.

And what I seem to observe is that running the code compiled, with compiled loop iteration, compiled function calls, and previous compiled object uses, it tends not to find free space after the unicode string, which may have to do with other objects allocated in the meantime.

So this is rather esoteric. I will try to prove that in a tight loop there is no difference. And for the larger scale loop, I am not sure it's a valid test case anyway, due to memory manager fragmentation happening differently. But if Nuitka were allocating objects and not releasing them more often, that would be bad, which is what I meant.

This is also validated by adding more in-place operations immediately. I got to 100% speed once by carefully adjusting the in-place operations and operands. I feel, however, this may well indicate a larger performance problem if Nuitka fragments memory more.
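
A crude way to watch this from Python: since a compact unicode object keeps its character data in the same allocation as the object itself, a re-alloc that has to move the buffer also changes the object's address, so id() stability is a rough (not exact) proxy for the re-alloc finding free space in place:

s = ""
prev = id(s)
in_place = 0
for _ in range(1000):
    s += "x"
    if id(s) == prev:  # address unchanged: re-alloc extended in place
        in_place += 1
    prev = id(s)
print(f"{in_place} of 1000 appends kept the same address")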

@geozeke
Author

geozeke commented Oct 2, 2018

That's puzzling. It seems like realloc() should be happening, as long as the memory for the string isn't "freed", recreated, and copied with each iteration. I'm pretty sure realloc() only works if the original pointer hasn't been freed. Could be some heap thrashing going on.

@Kobzol

Kobzol commented Oct 15, 2018

String concatenation is actually quite fast in Python 3 because it has a heuristic optimization that changes the string in-place if there are no other references to it (http://blog.mclemon.io/python-efficient-string-concatenation-in-python-2016-edition).

You can test this by introducing an artificial reference to the appended string:

import random

chars = []
for i in range(256):
    chars += [chr(i)]
wrapper = ""
for i in range(10000000):
    ref = wrapper
    wrapper += random.choice(chars)

Suddenly it will become much slower in Python 3.

@kayhayen
Member

Nuitka has done those for a long time already. It also does this for module variables in the develop version, and once I get around to it, also for closure variables.

In this issue I am more confused about how the differences come about, as in how the in-place realloc works or doesn't. My guesstimate is that when Nuitka comes back for another function call from an outside loop, it has put something on the heap that blocks the re-allocation, and it would be nice to find and/or avoid that.

In-place in principle appears as fast as with CPython normally. But obviously if the hit rate is lower, then performance can still break down in comparison.

Also not entirely sure I am not hunting a measurement problem.

@kayhayen
Member

As discovered in issue #177, this has everything to do with CPython and its use of dynamic linking, which makes it slow when embedding in a binary.

There is nothing we can do really, except to inline PyUnicode_Append(), which will make it go away for compiled code; doing that ought to solve the issue even relative to CPython.

@kayhayen
Member

So I am pushing code to factory for 3.5 and higher that does the unicode append differently, with a specialized version of PyUnicode_Add, which requires a fair bit of code duplication with CPython, but apparently that is what it takes to get this under control.

Let's see how the performance graphs react to this. I was seeing a gain of 110% instead of 75%, so Nuitka should now be consistently faster. Of course, the memcpy will normally dominate this benchmark, and it should be closer to 100% the more that happens. However, the specialized version does away with a fair bit of the overhead needed by PyUnicode_Add, and that should still give slight improvements.

@geozeke
Author

geozeke commented Oct 17, 2018

Thanks for the hard work!

@kayhayen
Member

This has long been resolved by recent changes that avoid the CPython API for in-place operations on these types. In Nuitka 0.6.13 this should no longer happen.

@Nuitka Nuitka locked and limited conversation to collaborators Oct 25, 2023