
Fix thread safety in http pool manager #102

Closed
wants to merge 2 commits

Conversation

wabscale
Contributor

Problem

There is a thread safety issue in the way the HttpClient initializes. Here is a snippet to trigger it:

import multiprocessing

import clickhouse_connect


def get_client() -> clickhouse_connect.driver.client.Client:
    return clickhouse_connect.get_client(host='localhost')


def test_client(*_):
    client = get_client()
    client.command('SELECT version();')
    client.close()


def test_concurrency():
    # Call test client in original thread, before creating new
    # process. This ensures that any globals that may be initialized
    # in the library are initialized before forking next
    test_client()

    # Fork 8 processes, each initializing a fresh client. Any globals
    # from this process that end up being used in child processes may
    # very well become broken.
    with multiprocessing.Pool(8) as pool:
        pool.map(test_client, range(100))


if __name__ == '__main__':
    test_concurrency()

1. We initialize a client in the parent process, request the server version, and close the client.
2. We fork 8 worker processes, each running the same version request with a fresh client that is closed at the end.

This gives us this lovely error:

Traceback (most recent call last):
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/httpclient.py", line 322, in _raw_request
    response: HTTPResponse = self.http.request(method, url,
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/conda/envs/py310_64/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/conda/envs/py310_64/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/conda/envs/py310_64/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/conda/envs/py310_64/lib/python3.10/http/client.py", line 300, in _read_status
    raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\r\n'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py310_64/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/py310_64/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/jc/vola/git/experimental/data1backup/tests/test-concurrency.py", line 11, in test_client
    client = get_client()
  File "/home/jc/vola/git/experimental/data1backup/tests/test-concurrency.py", line 7, in get_client
    return clickhouse_connect.get_client(host='localhost')
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/__init__.py", line 8, in get_client
    return create_client(**kwargs)
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/__init__.py", line 69, in create_client
    return HttpClient(interface, host, port, username, password, database, settings=settings, **kwargs)
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/httpclient.py", line 150, in __init__
    super().__init__(database=database, query_limit=coerce_int(query_limit), uri=self.url)
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/client.py", line 45, in __init__
    tuple(self.command('SELECT version(), timezone(), database()', use_database=False))
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/httpclient.py", line 288, in command
    response = self._raw_request(payload, params, headers, method)
  File "/home/jc/vola/git/experimental/data1backup/tests/clickhouse-connect/clickhouse_connect/driver/httpclient.py", line 341, in _raw_request
    raise OperationalError(f'Error executing HTTP request {self.url}') from ex
clickhouse_connect.driver.exceptions.OperationalError: Error executing HTTP request http://localhost:8123

The problem is the global clickhouse_connect.driver.httputil.default_pool_manager, which is used only as the default HTTP pool in clickhouse_connect.driver.httpclient.HttpClient.__init__.

When this global is initialized in one process and then referenced in another, it completely breaks any attempt at concurrency.

Solution

Remove the clickhouse_connect.driver.httputil.default_pool_manager global, and initialize a new pool manager on each new HttpClient.

Unless there is some huge performance cost to initializing new pool managers that I am unaware of, this should be the way it is done.
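A middle ground between the current global and a pool-per-client would be to key the default pool manager on the current process ID, so a forked child never reuses connections inherited from its parent. This is only a sketch, not the library's actual code; `make_pool` is a hypothetical stand-in for whatever constructs a urllib3 PoolManager:

```python
import os
import threading

_lock = threading.Lock()
_pools = {}  # pid -> pool manager instance

def default_pool_manager(make_pool):
    """Return the default pool manager for the *current* process,
    building a fresh one the first time it is requested after a fork.
    Keying on os.getpid() means a child process never sees the pool
    (and the sockets) created by its parent."""
    pid = os.getpid()
    with _lock:
        if pid not in _pools:
            _pools[pid] = make_pool()
        return _pools[pid]

# Within one process, repeated lookups share a single pool manager.
# object stands in for a real pool-manager factory here.
a = default_pool_manager(object)
b = default_pool_manager(object)
print(a is b)  # True
```

This keeps the connection reuse of a shared default while sidestepping the fork problem, at the cost of a dictionary that grows by one entry per process.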

@wabscale
Contributor Author

wabscale commented Jan 25, 2023

I have identified another thread safety issue. The default session_id is generated with uuid.uuid1, which is not multiprocessing safe. This means that if two processes call the function at very nearly the same time, you can get the same UUID, which breaks the session locking on the server side. Due to its timing-dependent nature, this can be a bit annoying to trigger.

I have switched uuid.uuid1 to uuid.uuid4, which has significantly higher entropy.
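For context, a quick look at where the bits of each UUID variant come from (a generic stdlib illustration, not clickhouse-connect code):

```python
import uuid

# uuid1 packs a 60-bit timestamp and the host's MAC address into the
# ID; only the 14-bit clock sequence is random. Two processes that
# generate an ID in the same 100 ns tick on the same host can
# therefore collide.
u1 = uuid.uuid1()
print(u1.node == uuid.getnode())  # True: node field is MAC-derived, shared host-wide

# uuid4 is 122 random bits, so accidental collisions are not a
# practical concern even across many processes.
u4 = uuid.uuid4()
print(u4.version)  # 4
```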

@genzgd
Collaborator

genzgd commented Jan 25, 2023

Thanks for your investigations into multiprocessing! It's an area that we simply haven't had the time to test the limits of.

I'm very curious about the error you see when using the default_pool_manager. It's my understanding the urllib3 pool manager is thread safe and even before that PR the window where it was not thread safe was tiny (and would raise a different exception). In particular I'm fairly sure the actual connections are never shared between threads, although it's possible that the actual Python http_client somehow gets shared. Is it possible that the "bad status" line is just the result of some other ClickHouse server or loopback issue, possibly related to the load of 100 clients hitting the same ClickHouse server at once?

In any case the choice to use a single PoolManager by default was intended for the simple and I assume most common use case of having a few clients always hitting the same host. Reusing connections in that case is desirable, especially if they can reuse "keep alive" HTTP connections. (However, looking through the code I did notice that the original maxsize of the pool was lost when removing the requests library - it was 8, unfortunately now I think it's defaulting to 1 - so that might be worth fixing.)

In the case of a large number of clients/threads, my intended solution was to construct those clients with their own PoolManager, preferably obtained via the get_pool_manager method. I consider that an advanced use case, but it should fix your particular issue without a code change. Just change this line:

client = get_client()

to

client = get_client(pool_mgr=get_pool_manager())

You could also use that mechanism to share a few different pool managers among multiple hosts, instead of having a one-to-one assignment.
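That sharing scheme could look something like the sketch below. `PoolRegistry` and `make_pool` are hypothetical names, with `object` standing in for a real pool-manager factory such as the library's get_pool_manager:

```python
import threading

class PoolRegistry:
    """Hands out one shared pool manager per host, so clients for the
    same host reuse connections while different hosts stay isolated."""

    def __init__(self, make_pool):
        self._make_pool = make_pool
        self._lock = threading.Lock()
        self._pools = {}

    def for_host(self, host):
        with self._lock:
            if host not in self._pools:
                self._pools[host] = self._make_pool()
            return self._pools[host]

registry = PoolRegistry(make_pool=object)
a = registry.for_host('ch1.example.com')
b = registry.for_host('ch1.example.com')
c = registry.for_host('ch2.example.com')
print(a is b, a is c)  # True False
```

Each client would then be constructed with `pool_mgr=registry.for_host(host)`, keeping keep-alive reuse within a host without one process-wide global.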

As for uuid1 vs uuid4, there's still 14 bits of random entropy on a single host in the same timestamp, vs 64 bits of entropy across all possible clients hitting the same ClickHouse server. I think there's no practical difference, although I can see uuid4 as being marginally "safer". But this is also something you can solve without a code change by setting the session_id parameter in the get_client factory method call.

@genzgd
Collaborator

genzgd commented Jan 25, 2023

By the way, your concurrency code snippet does not trigger any exceptions on my M1 Mac -- Python 3.9.15, urllib3 1.26.13

@wabscale
Contributor Author

It is occurring on every single run on Arch Linux with my 3950X (16 cores / 32 threads). I also confirmed this issue on my XPS 9710 running Manjaro Linux.

If you are running your Python through the hacky emulation layer they made for M1s, I suspect that Python multiprocessing in general may be handled very differently.

@genzgd
Collaborator

genzgd commented Jan 25, 2023

It's native ARM64, not Rosetta, but again that doesn't actually look like a concurrency error.

@wabscale
Contributor Author

As it turns out I was sort of right in my assumption about macOS doing multiprocessing differently:

https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Updating my example:

import multiprocessing

import clickhouse_connect


def get_client() -> clickhouse_connect.driver.client.Client:
    return clickhouse_connect.get_client(host='localhost')


def test_client(*_):
    client = get_client()
    client.command('SELECT version();')
    client.close()


def test_concurrency():
    # Force the non-default spawn start method
    multiprocessing.set_start_method('spawn')

    # Call test client in original thread, before creating new
    # process. This ensures that any globals that may be initialized
    # in the library are initialized before forking next
    test_client()

    # Fork 8 processes, each initializing a fresh client. Any globals
    # from this process that end up being used in child processes may
    # very well become broken.
    with multiprocessing.Pool(8) as pool:
        pool.map(test_client, range(100))


if __name__ == '__main__':
    test_concurrency()

I would not consider calling multiprocessing.set_start_method like this an acceptable solution.
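If the start method does have to be forced, multiprocessing.get_context at least scopes the choice to one pool instead of mutating process-wide state that set_start_method can only change once. A minimal stdlib sketch:

```python
import multiprocessing

# 'spawn' is available on every platform; 'fork' and 'forkserver' are
# POSIX-only. Since Python 3.8, 'spawn' is the default on macOS while
# 'fork' remains the default on Linux.
print(multiprocessing.get_all_start_methods())

# get_context returns a context object whose Pool/Process API is bound
# to one start method, leaving the global default untouched.
ctx = multiprocessing.get_context('spawn')
print(type(ctx).__name__)  # SpawnContext
```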

If this library is being developed only for macOS, then you need to disclose this up front in the documentation. My org is simply not going to use ClickHouse if this is the case.

@genzgd
Collaborator

genzgd commented Jan 26, 2023

We build and test on Linux, but I happen to develop on a Mac. And as I alluded to above, we don't currently have automated tests for large parallel processing.

ClickHouse Connect is still very much beta and open source, so you know everything there is to know about tests and builds. Again, the urllib3 PoolManager is considered thread safe, but it appears you have found an exception. If you can reproduce the problem independent of ClickHouse Connect (which shouldn't be difficult, it really amounts to a bunch of parallel HTTP calls) I would suggest you open an issue in that project.

It also seems like if you're going to use multiprocessing on Linux/Unix, you should not use a start method that is "problematic" according to the documentation and then blame the library.

In any case, there's a simple workaround for the issue as I described above for your use case that doesn't require a code change and doesn't disable the current default connection sharing, and you have identified yet another workaround. As we have time and resources to update the documentation and move toward production status, those things will be improved. In the meantime, you are of course free to fork this project, use another Python client (clickhouse-driver is well established) or use other database software entirely.

@genzgd genzgd closed this Jan 26, 2023
@wabscale
Contributor Author

wabscale commented Jan 26, 2023

For the record, the Python documentation is clear in saying that the "problematic" fork start method is only problematic on macOS. It is still the standard on Linux because Linux can handle POSIX forking.

The issue they reference makes clear that it is only a macOS problem:

Python crashes on macOS after fork with no exec python/cpython#77906

@genzgd
Collaborator

genzgd commented Jan 26, 2023

https://bugs.python.org/issue40379 -- not just macOS
