Handle Broken HTTP Pooled Connections For Replication #5287
Comments
@alexey-milovidov I'm curious what your thoughts on this are! I wouldn't mind taking a stab at fixing it, but it might take me a while as I'm pretty unfamiliar with C++. :-) Thanks!
Related to: ClickHouse#5287. According to the Poco docs, we should be resetting the connection when sendRequest or receiveResponse throws: https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613
Alright, I believe I've found it. Here we're creating a normal session:
PooledReadWriteBufferFromHTTP's initializer calls makePooledHTTPSession.
makePooledHTTPSession tries to get the pool from the singleton by a key designated by uri (which is still the hostname, not the IP).
The first time it sees this hostname, it creates a new pool and adds it to the singleton's cache.
The initializer for SingleEndpointHTTPSessionPool calls makeSessionImpl.
makeHTTPSessionImpl calls setHost after resolving DNS!
However, that means that we've got a cache keyed by hostname, but pooled on the IP address. When we go to call that again based on that hostname, it pulls up the pool talking to the IP address. Resets will of course do nothing in this situation, because the next call will once again get the pool from the cache by hostname and fail to connect, since the pool is still referencing the IP address. When that hostname changes IP addresses, we look up the hostname in the cache and get the old IP address.
I think this is what's needed:
I don't see any reason why the pool can't resolve the address itself.
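To make the stale-address problem above concrete, here is a minimal, self-contained sketch. It does not use the real ClickHouse or Poco classes: `Resolver`, `IpPinnedPool`, `HostPool`, and `PoolCache` are hypothetical names invented for illustration, and "connecting" is modelled as simply returning the IP address that would be dialed. It contrasts a pool that resolves the hostname once at construction with one that resolves it on every new connection.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical DNS lookup: hostname in, IP address out.
using Resolver = std::function<std::string(const std::string &)>;

// Models the behaviour described above: the address is resolved once,
// when the pool is created, and never again.
struct IpPinnedPool
{
    std::string ip;

    IpPinnedPool(const std::string & host, const Resolver & resolve)
        : ip(resolve(host))
    {
    }

    // Returns the IP this pool would connect to.
    std::string connect() const { return ip; }
};

// Models the suggested direction: the pool keeps the hostname and
// resolves it itself each time it opens a new connection.
struct HostPool
{
    std::string host;
    Resolver resolve;

    // Returns the IP this pool would connect to right now.
    std::string connect() const { return resolve(host); }
};

// Cache of pools keyed by hostname, as in the issue.
struct PoolCache
{
    Resolver resolve;
    std::map<std::string, HostPool> pools;

    HostPool & get(const std::string & host)
    {
        auto it = pools.find(host);
        if (it == pools.end())
            it = pools.emplace(host, HostPool{host, resolve}).first;
        return it->second;
    }
};
```

With a resolver backed by a mutable table, an `IpPinnedPool` keeps returning the old address after the table changes, while a `HostPool` fetched from `PoolCache` follows the change. Whether the actual fix resolved per connection or invalidated the cached pool is not stated in this thread.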
We were able to reproduce it with Altinity Operator, @Enmk will help with the fix.
Where the line was last changed was when
@alexey-milovidov @alex-zaitsev @Enmk, is there anything I can do to help move this along? Just say the words and I'll get on it. :-)
This was fixed in a recent release 🎉
While investigating #4970 I noticed that replicas are using a connection pool to manage their connections to the master nodes.
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/StorageReplicatedMergeTree.cpp#L2766-L2770
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/MergeTree/DataPartsExchange.cpp#L189-L197
We're fetching the pool from a cache keyed by the host:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L153
But on that connection, reset never gets called, even though it should be: https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613
In the code, both receiveResponse and sendRequest get called many times without being wrapped in a try/catch block or otherwise doing some error checking that would need to be followed up with a reset() call.
I think this may be why people are seeing issues with Kubernetes when restarting pods or in any dynamic DNS deployment. My C++ is not that great though, so please let me know if I'm barking up the wrong tree. 🐶
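A rough sketch of the pattern I mean, assuming a simplified session type: `FakeSession` is a hypothetical stand-in for Poco::Net::HTTPClientSession (the real sendRequest and receiveResponse take request/response objects and return streams), and `requestWithReset` is a made-up helper name. Retrying once after the reset is my own assumption; the Poco docs only say that reset() should be called after an exception.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a pooled HTTP session. Only the three
// calls discussed in this issue are modelled.
struct FakeSession
{
    bool broken = true;  // simulates a stale pooled connection
    int resets = 0;      // counts how often reset() was called

    void sendRequest(const std::string & /*request*/)
    {
        if (broken)
            throw std::runtime_error("connection reset by peer");
    }

    std::string receiveResponse()
    {
        if (broken)
            throw std::runtime_error("connection reset by peer");
        return "200 OK";
    }

    // Drops the underlying connection so the next request reconnects.
    void reset()
    {
        broken = false;
        ++resets;
    }
};

// Sends a request and reads the response. If either call throws,
// the session is reset and the exchange is retried exactly once;
// an exception from the retry propagates to the caller.
template <typename Session>
std::string requestWithReset(Session & session, const std::string & request)
{
    try
    {
        session.sendRequest(request);
        return session.receiveResponse();
    }
    catch (const std::exception &)
    {
        session.reset();
        session.sendRequest(request);
        return session.receiveResponse();
    }
}
```

The first call on a broken `FakeSession` triggers one reset and then succeeds; later calls go through without resetting again.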
Thanks for the awesome work!