The sink-connector lightweight stops replicating #380

Open
aadant opened this issue Nov 17, 2023 · 20 comments · Fixed by #414 or #436
Comments

aadant (Collaborator) commented Nov 17, 2023

Last log before it stops:

154629410 2023-11-16 22:27:30.332 [pool-15502-thread-5] INFO com.altinity.clickhouse.sink.connector.db.DbWriter  - *************** EXECUTED BATCH Successfully Records: 12************** task(0) Thread ID: pool-15502-thread-5
Nov 16, 2023 10:29:15 PM com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to dbl1:3306

Restarting the container fixes the issue.

aadant added the high-priority and bug labels Nov 17, 2023
subkanthi (Collaborator) commented:

Steps to reproduce: stop the MySQL server while the sink connector is already running and replicating. The resulting exception is not handled.

aadant (Collaborator, Author) commented Nov 27, 2023

I suspect that there is a deadlock and that the connection waits forever to close. This can be confirmed with a Java thread dump.

The solution found for such a situation is to restart Debezium when no changes are detected for a while:
airbytehq/airbyte@e48492e
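
For reference, a thread dump of the running connector JVM can be captured with the standard JDK tools (fill in the connector's PID):

    jstack -l <pid> > threaddump.txt
    # or, equivalently:
    jcmd <pid> Thread.print > threaddump.txt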

aadant (Collaborator, Author) commented Nov 27, 2023

subkanthi (Collaborator) commented:

Store the timestamp of the last record received and, if it is older than 5–10 minutes, restart the Debezium engine.
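
A minimal sketch of that idea (the class and method names here are hypothetical, not the connector's actual code): a scheduled check restarts the engine whenever the last-seen record is older than the timeout.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical watchdog: restarts the embedded Debezium engine when no
    // record has been received for timeoutSecs.
    public class EventLoopWatchdog {
        private final AtomicLong lastRecordMillis = new AtomicLong(System.currentTimeMillis());
        private final long timeoutMillis;
        private final Runnable restartEngine; // supplied by caller: stop + recreate the engine

        public EventLoopWatchdog(long timeoutSecs, Runnable restartEngine) {
            this.timeoutMillis = TimeUnit.SECONDS.toMillis(timeoutSecs);
            this.restartEngine = restartEngine;
        }

        // Call this from the change-event handler for every record received.
        public void recordSeen() {
            lastRecordMillis.set(System.currentTimeMillis());
        }

        public void start() {
            ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();
            checker.scheduleAtFixedRate(() -> {
                long idleMillis = System.currentTimeMillis() - lastRecordMillis.get();
                if (idleMillis > timeoutMillis) {
                    restartEngine.run();                              // tear down and rebuild the engine
                    lastRecordMillis.set(System.currentTimeMillis()); // avoid an immediate second restart
                }
            }, 30, 30, TimeUnit.SECONDS);
        }
    }

One caveat: a genuinely idle binlog (no upstream writes) is indistinguishable from a stuck one by this test alone, so the timeout must exceed the longest expected quiet period; Debezium's heartbeat events could also be counted as activity to reduce that ambiguity.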

aadant (Collaborator, Author) commented Nov 29, 2023

Root cause traced to osheroff/mysql-binlog-connector-java#133 (we are the reporter).

aadant (Collaborator, Author) commented Dec 1, 2023

Looks like an OpenJDK 11 bug. @subkanthi, let us upgrade!

aadant (Collaborator, Author) commented Dec 8, 2023

Reproduced with OpenJDK 17; see osheroff/mysql-binlog-connector-java#133.

subkanthi added the GA-1 label Dec 18, 2023
aadant (Collaborator, Author) commented Dec 24, 2023

The problem continues after the fix ...

aadant reopened this Dec 24, 2023
subkanthi (Collaborator) commented Jan 12, 2024

From analyzing the thread dump provided, the cause seems to be a high number of GC threads:

 There are 179 Garbage Collection(GC) Threads. This is a very high number of GC Threads. It may affect the application's performance. You may consider reducing GC threads.

https://blog.fastthread.io/2015/09/02/thread-dump-analysis-pattern-several-scavengers/
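
If the GC thread count really is the problem, HotSpot allows capping it with standard JVM flags. A sketch, with illustrative values that should be tuned to the container's CPU limit (passed however the container receives JVM options, e.g. JAVA_OPTS):

    # illustrative values; these are standard HotSpot flags
    JAVA_OPTS="-XX:ActiveProcessorCount=4 -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2"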

As shown in the attached screenshots, only 15 threads are created by the sink connector itself.
[two thread-dump screenshots]

@subkanthi
Copy link
Collaborator

There are also 19 threads in the TIMED_WAITING state, for example:

main
PRIORITY: 5
THREAD ID: 0X00007FFFF0024C40
NATIVE ID: 0X7
NATIVE ID (DECIMAL): 7
STATE: TIMED_WAITING

stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00007fdfd03b7508> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.2/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@17.0.2/AbstractQueuedSynchronizer.java:1757)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:239)
at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:180)
at io.debezium.embedded.EmbeddedEngine.pollRecords(EmbeddedEngine.java:695)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:475)
at io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:229)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setupDebeziumEventCapture(DebeziumChangeEventCapture.java:483)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setup(DebeziumChangeEventCapture.java:512)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.start(ClickHouseDebeziumEmbeddedApplication.java:212)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.main(ClickHouseDebeziumEmbeddedApplication.java:88)

subkanthi (Collaborator) commented:

Workaround solution: two configuration variables, restart.event.loop and restart.event.loop.timeout.period.secs, define the restart logic. The Debezium event loop is restarted when the age of the last received record exceeds restart.event.loop.timeout.period.secs.

The event loop is also restarted when the record timestamp stored in altinity_sink_connector.replica_source_info exceeds the threshold.
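
As a sketch, in the lightweight connector's YAML configuration this looks like the following (the file name and the 300-second threshold are illustrative assumptions):

    # config.yml (illustrative)
    restart.event.loop: "true"
    restart.event.loop.timeout.period.secs: "300"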

Selfeer (Collaborator) commented Feb 26, 2024

The issue was manually verified by the Altinity QA team and marked as qa-verified.

Build used for testing: altinityinfra/clickhouse-sink-connector:464-c979a8c2583ff701dd5b63efb8f36f7fff8111e7-lt

We've verified that setting restart.event.loop to true and specifying a value in seconds for restart.event.loop.timeout.period.secs results in a CDC event-loop restart, with a corresponding log message, when the timeout is reached.

Values specified in restart.event.loop.timeout.period.secs are respected, and using these settings does not appear to break any other clickhouse-sink-connector-lightweight functionality.

Selfeer added the qa-verified label Feb 26, 2024
aadant (Collaborator, Author) commented Apr 1, 2024

@subkanthi, a workaround is in place, but it still impacts sink-connector latency (by up to restart.event.loop.timeout.period.secs). IMHO the root cause should be chased down with the Debezium folks and possibly the binlog library they use.

aadant reopened this Apr 1, 2024
subkanthi (Collaborator) commented:

Analyzing the latest thread dump, all 8 change-event-source-coordinator threads are in the TIMED_WAITING state:

debezium-mysqlconnector-embeddedconnector-change-event-source-coordinator
PRIORITY: 5
THREAD ID: 0X00007FDDF07899B0
NATIVE ID: 0X10A
NATIVE ID (DECIMAL): 266
STATE: TIMED_WAITING

stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00007fdfd04475f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.2/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.2/AbstractQueuedSynchronizer.java:1672)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@17.0.2/ThreadPoolExecutor.java:1464)
at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base@17.0.2/Executors.java:743)
at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$727/0x00000008010f77b8.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.2/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base@17.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.2/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.2/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)

subkanthi (Collaborator) commented Apr 2, 2024

A similar issue was reported in the Debezium Zulip chat:
https://debezium.zulipchat.com/#narrow/stream/348104-community-mysql/topic/2.2E5.2E1.2C.20stuck.20indefinitely.20until.20connector.20restart

Quoting Alexandru Cristu:

It seems to happen after we see this:
com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to ...

Task thread is waiting in doPoll:

"task-thread-source_65afd9cf169951285b70c4a1-0" #79816 prio=5 os_prio=0 cpu=2977135.32ms elapsed=79038.87s tid=0x00007f69b4205540 nid=0x13922 waiting on condition  [0x00007f6958504000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
    - parking to wait for  <0x00000005ad829900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.9/LockSupport.java:252)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@17.0.9/AbstractQueuedSynchronizer.java:1757)
    at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
    at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:252)
    at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:182)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.poll(AbstractWorkerSourceTask.java:469)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.execute(AbstractWorkerSourceTask.java:357)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.run(AbstractWorkerSourceTask.java:77)
    at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
    at org.apache.kafka.connect.runtime.isolation.Plugins$$Lambda$953/0x00007f6a65b1cc50.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
    at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
The "change-event-source-coordinator" thread is stuck in BinaryLogClient.disconnect, which in turn is stuck waiting for the keepalive thread to finish:

"debezium-mysqlconnector-source_65afd9cf169951285b70c4a1-change-event-source-coordinator" #397 prio=5 os_prio=0 cpu=63726.27ms elapsed=482672.61s tid=0x00007f6a54458510 nid=0x1fd waiting on condition  [0x00007f6961e9c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
        - parking to wait for  <0x000000050c001a90> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.9/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.9/AbstractQueuedSynchronizer.java:1672)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@17.0.9/ThreadPoolExecutor.java:1464)
        at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base@17.0.9/Executors.java:743)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
        at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$1777/0x00007f6a660beac0.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
        at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
The keepalive thread is stuck waiting for SSLSocket to close:

"blc-keepalive-production-aurora-80-cluster.cluster-cfqdxwvdz9bt.us-east-1.rds.amazonaws.com:3306" #401 prio=5 os_prio=0 cpu=56.99ms elapsed=569119.58s tid=0x00007f69d8079dc0 nid=0x201 waiting on condition  [0x00007f69624a3000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
    - parking to wait for  <0x000000050c032238> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/LockSupport.java:211)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:715)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:938)
    at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.9/ReentrantLock.java:153)
    at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.9/ReentrantLock.java:322)
    at sun.security.ssl.SSLSocketImpl.closeNotify(java.base@17.0.9/SSLSocketImpl.java:735)
    at sun.security.ssl.TransportContext.closeNotify(java.base@17.0.9/TransportContext.java:287)
    at sun.security.ssl.TransportContext.initiateOutboundClose(java.base@17.0.9/TransportContext.java:598)
    at sun.security.ssl.TransportContext.closeOutbound(java.base@17.0.9/TransportContext.java:582)
    at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base@17.0.9/SSLSocketImpl.java:875)
    at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
    at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
    at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
Restarting the connector caused streaming to resume, but the above threads remain stuck. A restart is currently needed every few hours, and it seems to be becoming more frequent; at some point we will need to restart the JVM to clean up these leaked threads.

Investigating potentially related JDK issue https://bugs.openjdk.org/browse/JDK-8241239?focusedId=14646711&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14646711

Any suggestions how to troubleshoot this further? Have you seen this before?

subkanthi (Collaborator) commented:

Debezium issue: https://issues.redhat.com/browse/DBZ-7570
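
One possible mitigation in the meantime (an untested assumption on our side, not something proposed in DBZ-7570): Debezium's documented connect.keep.alive property controls whether the binlog client starts the keepalive thread at all, and that is the thread stuck in the dumps above. Disabling it would sidestep the stuck termination path, at the cost of losing dead-connection detection:

    # illustrative; connect.keep.alive is a documented Debezium MySQL connector property
    # (default true); disabling it trades dead-connection detection for avoiding the
    # stuck keepalive-termination path
    connect.keep.alive: "false"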

subkanthi (Collaborator) commented Apr 2, 2024

Chris Cranford (Naros) said: "It's on my radar for 2.7.0.Alpha1."

I asked: "Thanks Chris, are there any workarounds for now, like a MySQL version change maybe, as discussed in the other thread?"

Chris Cranford (Naros) replied: "Unfortunately nothing really at this time."

svb-alt added the lightweight label Apr 4, 2024
subkanthi added this to the 2.2.0 milestone May 26, 2024
subkanthi (Collaborator) commented:

Even though the Debezium team addressed the issue in 2.7.0 (https://issues.redhat.com/browse/DBZ-7570), they introduced another bug: https://issues.redhat.com/browse/DBZ-7873

subkanthi added the p1 label May 28, 2024
Labels: bug (Something isn't working), GA-1 (issues in the release scheduled Dec 2023), high-priority, lightweight (issues related to the lightweight version), p1, qa-verified (verified by QA)
Projects: none yet
4 participants