Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast xoring in python #686

Closed
socketpair opened this issue Dec 17, 2015 · 33 comments
Closed

Fast xoring in python #686

socketpair opened this issue Dec 17, 2015 · 33 comments
Labels

Comments

@socketpair
Copy link
Contributor

instead of

def _websocket_mask_python(mask, data):
    return bytes(b ^ mask[i % 4] for i, b in enumerate(data))

please use this:

native_byteorder = sys.byteorder

def _websocket_mask_python(mask, data):
    assert len(mask) == 4
    datalen = len(data)
    if datalen == 0:
        return b''  #  everything work without this, but may be changed later in python.
    data = int.from_bytes(data, native_byteorder)
    mask = int.from_bytes(mask * (datalen // 4) + mask[: datalen % 4], native_byteorder)
    return (data ^ mask).to_bytes(datalen, native_byteorder)
@kxepal
Copy link
Member

kxepal commented Dec 17, 2015

Why?

@socketpair
Copy link
Contributor Author

this is MUCH faster

@kxepal
Copy link
Member

kxepal commented Dec 17, 2015

How much?

@jettify
Copy link
Member

jettify commented Dec 17, 2015

Could you elaborate on this? Could you post benchmark results or insights why this works faster?

@socketpair
Copy link
Contributor Author

depending on datasize, on my computer, from 1.1 to 31 times. The most "optimal" size for benchmarking is data size of 3 MB

@socketpair
Copy link
Contributor Author

#!/usr/bin/python3.5
import sys
import itertools
from time import monotonic

native_byteorder = sys.byteorder

def _websocket_mask_python(mask, data):
    assert len(mask) == 4
    datalen = len(data)
    if datalen == 0:
        return b''  #  everything work without this, but may be changed later in python.
    data = int.from_bytes(data, native_byteorder)
    mask = int.from_bytes(mask * (datalen // 4) + mask[: datalen % 4], native_byteorder)
    #mask = int.from_bytes(itertools.islice(itertools.cycle(mask), 0, datalen), native_byteorder)
    return (data ^ mask).to_bytes(datalen, native_byteorder)

def _websocket_mask_python1(mask, data):
    return bytes(b ^ mask[i % 4] for i, b in enumerate(data))


data = b'f' * (1024*1024*3)

a = monotonic()
res = _websocket_mask_python(b'1234', data)
b = monotonic()

a1 = monotonic()
res1 = _websocket_mask_python1(b'1234', data)
b1 = monotonic()

print (res==res1, (b1-a1)/(b-a))

@socketpair
Copy link
Contributor Author

it is faster even on one-byte message :)

@kxepal
Copy link
Member

kxepal commented Dec 17, 2015

Looks not oblivious why more operations(and code) is faster than shorter version:

Python 3.4.3 (default, Nov 12 2015, 20:43:56)
...
In [14]: %timeit websocket_mask(b'1234', data)
1 loops, best of 3: 967 ms per loop

In [15]: %timeit websocket_mask2(b'1234', data)
10 loops, best of 3: 22.6 ms per loop

Do you have an explanation for this?

@socketpair
Copy link
Contributor Author

sure, original implementation is executed in python, while my implementation is executed in C.

@asvetlov
Copy link
Member

I don't care too much because cythonized version 7 times faster than your optimized python one.

You may make a pull request though.

@kxepal
Copy link
Member

kxepal commented Dec 17, 2015

@socketpair But enumeration and bytes operations are not made in Python, right? Math is the same. What I see in yours is more function/method calls which means more work to do. Still curious.

@socketpair
Copy link
Contributor Author

@asvetlov about cython, what you compare with what ?

@jashandeep-sohi
Copy link
Contributor

@kxepal b ^ mask[i % 4] for i, b in enumerate(data) creates a new int for every byte.
@socketpair's version just creates one long int from the bytes and xors them.

@socketpair
Copy link
Contributor Author

This is a rule: block operations are always faster than byte-by-byte.

@kxepal
Copy link
Member

kxepal commented Dec 17, 2015

True, thanks!

@socketpair
Copy link
Contributor Author

would this be merged so ?

@socketpair
Copy link
Contributor Author

also, constructing big integer is cheap operation. xoring two bigintegers with same size is cheap too.

The most time (as I think) this function spent in constructing mask...I tried 4 approaches, maybe you know faster one ?

@asvetlov
Copy link
Member

@socketpair I've compared your _websocket_mask_python with aiohttp.websocket._websocket_mask_python and aiohttp.websocket._websocket_mask_cython.
The last is 7 faster than your code.

Sorry, I cannot merge still nonexistent pull request :)

@socketpair
Copy link
Contributor Author

will make pull request. Also I have improved performance of cython-based implementation :)

@socketpair
Copy link
Contributor Author

I have installed pytest from pip...

AttributeError: module 'pytest' has no attribute 'raises_regexp' :(

@asvetlov
Copy link
Member

Please install pytest-raisesregexp manually.
I don't want adding it to requirements.txt because I have a wish to drop raises_regexp usage at all -- the librry is not so useful as I expected.

@asvetlov
Copy link
Member

Fixed by 66c4234

asvetlov added a commit that referenced this issue Dec 18, 2015
Weboscket XOR performance improved. Fixes #686
@socketpair
Copy link
Contributor Author

oops :) race-condition :) :) I have some small improvements on that, but applied 15 second later then you merge PR

@socketpair
Copy link
Contributor Author

Also, you have replaced size_t with uintmax_t it is not right.

uintmax_t is the largest unsigned type provided by the implementation.
size_t is the type of the result of the sizeof operator, big enough to hold the size of any object

And also, you did not import size_t :(. Please ask me to change something, instead of silently fixing in that in own commits.

@socketpair
Copy link
Contributor Author

Rebased and updated branch

@asvetlov
Copy link
Member

size_t == 8 for 64-bit systems.
uintmax_t is an alias for unsigned long int on 64-bit OS.
For 32-bit OS it's still defined and is an alias for unsigned long long int.
In other words it's always uint64 on both 32bit and 64bit systems.

@asvetlov
Copy link
Member

If your latest update are missed in master branch -- please create new PR.
But I don't buy your latest changes.

I wrote about uintmax_t above.
Regarding forcing little endianess: I believe keeping native byteorder makes sense. Keep in mind hypothetical machine with big endian order. Forcing to reconvert all orders looks like performance hurt.

@socketpair
Copy link
Contributor Author

I have check Python sources. endiannes in that function just change order in which bytes are read from bytearray, and nothing more. Since little-endiannes mean reading from begin to end, this is more optimal.

Since no interpretation to actual number values is done (i.e. no reconversion), there will be no performance hurt, but instead, since memory read left to right, performance improvements will be acieved :)

@asvetlov
Copy link
Member

Memory is read by cache lines, not byte-by-byte.
I don't think it worth committing anyway.

@socketpair
Copy link
Contributor Author

OK, agree, changes are not so significant

@socketpair
Copy link
Contributor Author

Same issue: http://bugs.python.org/issue19251

@cowlicks
Copy link

FYI I wrote a patch for that python issue @socketpair linked to, to allow things like b'abc' ^ b'xyz'. But it needs more feedback if it is to be included.

@lock
Copy link

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there's important points made in this discussion,
please include those exceprts into that new issue.

@lock lock bot added the outdated label Oct 29, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

6 participants