Application hang after Neo4j connection did not respond #1504

pattanmzy · 2023-11-14T09:50:00Z

Hi,

I have a Spring Boot application with Neo4j AuraDB. The application was working fine initially. However, occasionally we observed that queries to the Neo4j AuraDB hang and did not respond after a long period of time, e.g. 10 - 20 minutes. When this happened, we saw the following message in the logs.

[Neo4jDriverIO-2-1] - Closing connection pool towards p-4dc8adc8-7ed2-7.production-orch-0361.neo4j.io:7687, it has no active connections and is not in the routing table registry.

We generated a few thread dumps every 30s. We observed that those threads still hang and the number of threads waiting for response from neo4j connection kept piling up.
There are no other related errors in the log.

Connection uri: neo4j+s://.databases.neo4j.io
Spring Boot starter version: 3.0.8
Spring Data Neo4j version: 7.0.7
Neo4j java driver version: 5.9.0
Neo4j version: 5
Neo4j Aura Professional on AWS
Operating System: Alpine Linux v3.18 on Kubernetes EKS
Java version: openjdk version "17.0.9"

Expected behavior

Database query should execute successfully

Actual behavior

Connection to AuraDB did not respond causing thread to hang.

Thread dumps
ThreadDumps.zip

Could you please advise what could be the root cause of this issue.

Do let me know if you need more info.

Thank you.

injectives · 2023-11-15T14:32:25Z

@pattanmzy, thanks for your message. Could you let us know if you are using reactive API by any chance please?

pattanmzy · 2023-11-20T06:41:49Z

Thanks @injectives for your reply.

The application does not use reactive API but it has dependency on Spring Boot Actuator for monitoring purpose. From the thread dumps, I can see that the Actuator HealthEndpoint uses the ReactiveHealthContributors to perform health check. There were many similar threads hanging waiting for response from Actuator health check. Is this the cause of the problem?

Extracted from thread dump
"http-nio-8080-exec-1" #52 daemon prio=5 os_prio=0 cpu=2079.61ms elapsed=84690.92s tid=0x00007f8bc5150fd0 nid=0x3a waiting on condition [0x00007f8bbe453000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
- parking to wait for <0x00000000db562688> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.9/AbstractQueuedSynchronizer.java:1047)
at java.util.concurrent.CountDownLatch.await(java.base@17.0.9/CountDownLatch.java:230)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:90)
at reactor.core.publisher.Mono.block(Mono.java:1712)
at org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1.getHealth(HealthEndpointConfiguration.java:160)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.getHealth(HealthEndpointWebExtension.java:94)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.getHealth(HealthEndpointWebExtension.java:47)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getLoggedHealth(HealthEndpointSupport.java:172)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getContribution(HealthEndpointSupport.java:145)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getAggregateContribution(HealthEndpointSupport.java:156)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getContribution(HealthEndpointSupport.java:141)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getHealth(HealthEndpointSupport.java:110)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getHealth(HealthEndpointSupport.java:81)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.health(HealthEndpointWebExtension.java:80)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.health(HealthEndpointWebExtension.java:69)
at jdk.internal.reflect.GeneratedMethodAccessor104.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.9/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@17.0.9/Method.java:568)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:281)
at org.springframework.boot.actuate.endpoint.invoke.reflect.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:74)
at org.springframework.boot.actuate.endpoint.annotation.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60)
at org.springframework.boot.actuate.endpoint.web.servlet.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:321)

injectives · 2023-11-21T23:55:17Z

Is this the cause of the problem?

If I remember correctly, the healthcheck uses the driver session run method. If subscription is cancelled (for whatever reason), it might be a cause of such issue, but this may also be something else.

Neo4j java driver version: 5.9.0

I suggest trying the latest driver version 5.14.0 as 5.11.0 introduced the cancellation support, see the release notes.

pattanmzy · 2023-11-24T08:10:00Z

Thanks for the feedback. I have updated the application to use the latest driver version. I will continue to monitor the application and will provide an update here if the problem still persist.

pattanmzy · 2023-12-05T06:13:23Z

Hi @injectives ,

The connection issue happened again a few days after I upgraded the Neo4j Java driver to the latest version, i.e. 5.14.0. From the attached thread dumps that were generated at 30s to 1 min intervals, we observed that there were application initiated queries to Neo4j AuraDB queries as well as Actuator health check threads hang and did not respond.

There were errors such as connection to the database terminated, connection read timed out, server is no longer available, failed to obtain connection towards WRITE server, etc. found in the application logs. Please find the details in the application and trace logs attached.

Spring Boot starter version: 3.0.12
Spring Data Neo4j version: 7.0.11
Neo4j java driver version: 5.14.0

Thread dumps
ThreadDumps.zip

Logs
application-log.txt
trace-log.txt

Could you please look into this? Thank you.

injectives · 2023-12-06T10:26:49Z

Hello.

Sure. The thread dumps show the http threads are in WAITING state, while the driver threads are actually in RUNNABLE state. It also seems like the main issue is around begin transaction area.

The application-log.txt and trace-log.txt show mostly connectivity issues, which could be because of some server-side maintenance or some other factor. Although, I am a little concerned about this one: org.neo4j.driver.exceptions.ConnectionReadTimeoutException: Connection read timed out due to it taking longer than the server-supplied timeout value via configuration hint. - it basically means that the server did not respond within the timeout it suggested to the driver upon connection.

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please? The driver logging can be configured though the following config option.

As a side question, could you let me know if you set the target database explicitly in the driver session please? See the following config option for more details. If possible, it would be better to set this to narrow down the issue search area a little bit.

And one more general question please. Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment? I am just trying to understand if this issue just started happening on a certain date, which could mean some server change.

Thank you.

pattanmzy · 2023-12-08T12:14:06Z

@injectives

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please?

I have set the logging level to debug. I will provide the logs when the issue happens again.

As a side question, could you let me know if you set the target database explicitly in the driver session please?

The target database was not set previously. It's now set.

Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment?

It is a new deployment of less than 2 months. This issue started happening since the beginning.

pattanmzy · 2024-01-02T05:52:04Z

Hi @injectives ,

The issue happened again on December 26th. The thread dumps show similar issue as before. There's no error found in the application logs. However, there were warning messages on Actuator Health contributor taking long time to respond in the console log as below. There were also no application activity related log messages after this time. It appears that the issue started to happen around this time.

2023-12-26 11:20:34 WARN  [http-nio-8080-exec-2] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 30988ms to respond
2023-12-26 11:20:34 WARN  [http-nio-8080-exec-3] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 61001ms to respond

The driver debug logs are attached.

Thread dumps
ThreadDumps_20231226.zip

Driver Debug Log
neo4j.log.zip

Console log
console.log

Appreciate if you could take a look at it. Thank you.

injectives · 2024-01-12T12:39:45Z

@pattanmzy, thanks a lot for providing the vital details. We will review soon and will get back to you.

shenhj2016 · 2024-03-21T03:51:42Z

After view the neo4j.log.zip，i have found the same issue in my application.The application is running in AWS.
Log detail:{
"log": "Suppressed: org.neo4j.driver.exceptions.DiscoveryException: Failed to update routing table with server '58534a99.databases.neo4j.io(54.216.115.14):7687'.",
"stream": "stdout",
"time": "2024-02-21T08:51:21.223046235Z"
}
neo4j-driver version is :5.15.0, and it have been update to 5.18.0 which is the latest version.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Application hang after Neo4j connection did not respond #1504

Application hang after Neo4j connection did not respond #1504

pattanmzy commented Nov 14, 2023 •

edited

injectives commented Nov 15, 2023

pattanmzy commented Nov 20, 2023

injectives commented Nov 21, 2023 •

edited

pattanmzy commented Nov 24, 2023

pattanmzy commented Dec 5, 2023

injectives commented Dec 6, 2023

pattanmzy commented Dec 8, 2023

pattanmzy commented Jan 2, 2024

injectives commented Jan 12, 2024

shenhj2016 commented Mar 21, 2024

Application hang after Neo4j connection did not respond #1504

Application hang after Neo4j connection did not respond #1504

Comments

pattanmzy commented Nov 14, 2023 • edited

Expected behavior

Actual behavior

injectives commented Nov 15, 2023

pattanmzy commented Nov 20, 2023

injectives commented Nov 21, 2023 • edited

pattanmzy commented Nov 24, 2023

pattanmzy commented Dec 5, 2023

injectives commented Dec 6, 2023

pattanmzy commented Dec 8, 2023

pattanmzy commented Jan 2, 2024

injectives commented Jan 12, 2024

shenhj2016 commented Mar 21, 2024

pattanmzy commented Nov 14, 2023 •

edited

injectives commented Nov 21, 2023 •

edited