
Handle Broken HTTP Pooled Connections For Replication #5287

Closed
abraithwaite opened this issue May 15, 2019 · 6 comments
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

@abraithwaite

While investigating #4970 I noticed that replicas use a connection pool to manage their connections to the master nodes.

https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/StorageReplicatedMergeTree.cpp#L2766-L2770
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/MergeTree/DataPartsExchange.cpp#L189-L197

We're fetching the pool from a cache keyed by the host:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L153

But reset() never gets called on that connection, even though it should be:
https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613

In the code, both receiveResponse and sendRequest are called in many places without a try/catch block or other error handling that would be followed up with a reset() call.

I think this may be why people are seeing issues with Kubernetes when restarting pods or in any dynamic dns deployment. My C++ is not that great though, so please let me know if I'm barking up the wrong tree. 🐶

Thanks for the awesome work!

@abraithwaite abraithwaite added the bug Confirmed user-visible misbehaviour in official release label May 15, 2019
@abraithwaite abraithwaite changed the title Handle Broken HTTP Pooled Connections Handle Broken HTTP Pooled Connections For Replication May 15, 2019
@abraithwaite
Author

@alexey-milovidov I'm curious what your thoughts on this are! I wouldn't mind taking a stab at fixing it but it might take me a while as I'm pretty unfamiliar with C++. :-)

Thanks!

abraithwaite pushed a commit to segmentio/ClickHouse that referenced this issue May 22, 2019
Related to:
ClickHouse#5287

According to the Poco docs, we should be resetting the connection when
sendRequest or receiveResponse throws.

https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613
@abraithwaite
Author

abraithwaite commented May 23, 2019

Alright, I believe I've found it.

Here we're creating a normal session:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/MergeTree/DataPartsExchange.cpp#L189-L197

The PooledReadWriteBufferFromHTTP constructor calls makePooledHTTPSession:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/ReadWriteBufferFromHTTP.h#L121

makePooledHTTPSession fetches the pool from the singleton's cache, keyed by the URI (which still contains the hostname, not the IP):
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L190-L193

The first time we see this hostname, we create a new pool and add it to the singleton's cache:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L156

The SingleEndpointHTTPSessionPool constructor calls makeHTTPSessionImpl:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L96-L106

makeHTTPSessionImpl calls setHost after resolving DNS!
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L68-L83

However, that means the cache is keyed by hostname, but the pooled sessions are bound to an IP address. Every subsequent lookup by that hostname returns the same pool, whose sessions still point at the originally resolved IP. reset() does nothing useful in this situation, because the next request fetches the same pool from the cache by hostname and reconnects to the same stale IP.

So when the hostname's IP address changes, we look up the hostname in the cache and keep talking to the old IP address.

I think this is what's needed:
#5383

I don't see any reason why the pool can't resolve the address itself.

@alex-zaitsev
Contributor

We were able to reproduce it with Altinity Operator, @Enmk will help with the fix.

@abraithwaite
Author

The line in question was last changed when SYSTEM DROP DNS CACHE was added:

48f5d8f

@abraithwaite
Author

@alexey-milovidov @alex-zaitsev @Enmk , is there anything I can do to help move this along? Just say the words and I'll get on it. :-)

@abraithwaite
Author

This was fixed in a recent release 🎉
