
Handle Broken HTTP Pooled Connections For Replication #5287

Closed
abraithwaite opened this issue May 15, 2019 · 6 comments
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

@abraithwaite

While investigating #4970 I noticed that replicas use a connection pool to manage their connections to the master nodes.

https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/StorageReplicatedMergeTree.cpp#L2766-L2770
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/MergeTree/DataPartsExchange.cpp#L189-L197

We're fetching the pool from a cache keyed by the host:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L153

But reset() never gets called on that connection, even though it should be:
https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613

In the code, both receiveResponse and sendRequest are called in many places without a try/catch block or other error handling that would be followed up with a reset() call.

I think this may be why people are seeing issues with Kubernetes when restarting pods or in any dynamic dns deployment. My C++ is not that great though, so please let me know if I'm barking up the wrong tree. 🐶

Thanks for the awesome work!

@abraithwaite abraithwaite added the bug Confirmed user-visible misbehaviour in official release label May 15, 2019
@abraithwaite abraithwaite changed the title Handle Broken HTTP Pooled Connections Handle Broken HTTP Pooled Connections For Replication May 15, 2019
@abraithwaite
Author

@alexey-milovidov I'm curious what your thoughts on this are! I wouldn't mind taking a stab at fixing it but it might take me a while as I'm pretty unfamiliar with C++. :-)

Thanks!

abraithwaite pushed a commit to segmentio/ClickHouse that referenced this issue May 22, 2019
Related to:
ClickHouse#5287

According to the Poco docs, we should be resetting the connection when
sendRequest or receiveResponse throws.

https://pocoproject.org/docs/Poco.Net.HTTPClientSession.html#21613
@abraithwaite
Author

abraithwaite commented May 23, 2019

Alright, I believe I've found it.

Here we're creating a normal session:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/Storages/MergeTree/DataPartsExchange.cpp#L189-L197

The PooledReadWriteBufferFromHTTP constructor calls makePooledHTTPSession:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/ReadWriteBufferFromHTTP.h#L121

makePooledHTTPSession fetches the pool from the singleton's cache, keyed by the URI (which still contains the hostname, not the IP):
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L190-L193

The first time we see this hostname, we create a new pool and add it to the singleton's cache:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L156

The SingleEndpointHTTPSessionPool constructor calls makeHTTPSessionImpl:
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L96-L106

makeHTTPSessionImpl calls setHost after resolving DNS!
https://github.com/yandex/ClickHouse/blob/98359a6a0978dfa3d7e171ad8c30db9e814a8b9d/dbms/src/IO/HTTPCommon.cpp#L68-L83

However, that means the cache is keyed by hostname, but the pooled sessions are bound to an IP address. Every subsequent lookup by that hostname returns the same pool, whose sessions still point at the originally resolved IP. reset() does nothing useful in this situation, because the next request fetches the same pool from the cache by hostname and reconnects to the same stale IP.

So when the hostname's IP address changes, we look up the hostname in the cache and keep talking to the old IP address.

I think this is what's needed:
#5383

I don't see any reason why the pool can't resolve the address itself.

@alex-zaitsev
Contributor

We were able to reproduce it with Altinity Operator, @Enmk will help with the fix.

@abraithwaite
Author

The line in question was last changed when SYSTEM DROP DNS CACHE was added:

48f5d8f

@abraithwaite
Author

@alexey-milovidov @alex-zaitsev @Enmk , is there anything I can do to help move this along? Just say the words and I'll get on it. :-)

@abraithwaite
Author

This was fixed in a recent release 🎉
