
Spring Boot 3.1.6 health endpoint latency is very high when other query is running. #1898

Closed · patiramyadav opened this issue Dec 5, 2023 · 10 comments
Labels: type: bug (Something isn't working)

@patiramyadav (Author) commented Dec 5, 2023:

Library Version
7.0.2

Describe the bug
The health endpoint is very slow while other queries are continuously running to fetch data.

With Other Query

✓ health query is 200

✓ checks.........................: 100.00% ✓ 3 ✗ 0
data_received..................: 2.4 kB 94 B/s
data_sent......................: 344 B 14 B/s
http_req_blocked...............: avg=564.33µs min=3µs med=4µs max=1.68ms p(90)=1.34ms p(95)=1.51ms
http_req_connecting............: avg=55.33µs min=0s med=0s max=166µs p(90)=132.8µs p(95)=149.4µs
http_req_duration..............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
{ expected_response:true }...: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s //slow
http_req_failed................: 0.00% ✓ 0 ✗ 3
http_req_receiving.............: avg=84.66µs min=45µs med=48µs max=161µs p(90)=138.4µs p(95)=149.69µs
http_req_sending...............: avg=35.66µs min=19µs med=24µs max=64µs p(90)=56µs p(95)=60µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
http_reqs......................: 3 0.118047/s
iteration_duration.............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
iterations.....................: 3 0.118047/s
vus............................: 1 min=1 max=1
vus_max........................: 1 min=1 max=1

Account query:
✓ account query is 200

✓ checks.........................: 100.00% ✓ 754 ✗ 0
data_received..................: 143 kB 4.6 kB/s
data_sent......................: 507 kB 16 kB/s
http_req_blocked...............: avg=94.44µs min=1µs med=3µs max=2.82ms p(90)=352.4µs p(95)=451.35µs
http_req_connecting............: avg=47.11µs min=0s med=0s max=557µs p(90)=298µs p(95)=377.4µs
http_req_duration..............: avg=3.78s min=534.59ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
{ expected_response:true }...: avg=3.78s min=534.59ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
http_req_failed................: 0.00% ✓ 0 ✗ 754
http_req_receiving.............: avg=54.57µs min=19µs med=50µs max=372µs p(90)=77.7µs p(95)=94.4µs
http_req_sending...............: avg=33.74µs min=11µs med=24µs max=407µs p(90)=50µs p(95)=66.35µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=3.78s min=534.15ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
http_reqs......................: 754 24.427578/s
iteration_duration.............: avg=3.81s min=538.34ms med=3.44s max=13.93s p(90)=6.61s p(95)=7.83s
iterations.....................: 754 24.427578/s
vus............................: 100 min=100 max=100
vus_max........................: 100 min=100 max=100

Without Query
✓ health query is 200

✓ checks.........................: 100.00% ✓ 317 ✗ 0
data_received..................: 253 kB 68 kB/s
data_sent......................: 27 kB 7.4 kB/s
http_req_blocked...............: avg=8.18µs min=1µs med=3µs max=1.74ms p(90)=4µs p(95)=4µs
http_req_connecting............: avg=577ns min=0s med=0s max=183µs p(90)=0s p(95)=0s
http_req_duration..............: avg=1.88ms min=637µs med=1.03ms max=252.7ms p(90)=1.37ms p(95)=1.5ms
{ expected_response:true }...: avg=1.88ms min=637µs med=1.03ms max=252.7ms p(90)=1.37ms p(95)=1.5ms
http_req_failed................: 0.00% ✓ 0 ✗ 317
http_req_receiving.............: avg=34.28µs min=18µs med=32µs max=147µs p(90)=46µs p(95)=56.2µs
http_req_sending...............: avg=15.51µs min=6µs med=14µs max=446µs p(90)=18µs p(95)=19.19µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=1.83ms min=606µs med=988µs max=252.19ms p(90)=1.32ms p(95)=1.45ms
http_reqs......................: 317 85.200396/s
iteration_duration.............: avg=11.71ms min=1.38ms med=5.28ms max=255.35ms p(90)=24.4ms p(95)=40.3ms
iterations.....................: 317 85.200396/s
vus............................: 1 min=1 max=1
vus_max........................: 1 min=1 max=1

To Reproduce
Steps to reproduce the behavior:

I have a bunch of dependencies and am not sure whether this is a side effect of any of them, but all are up to date. Here are the major ones:

    // GraphQL/Webflux
    implementation("com.expediagroup:graphql-kotlin-spring-server:7.0.2")
    implementation("com.graphql-java:graphql-java-extended-scalars:21.0")

    // kotlin deps
    implementation("org.jetbrains.kotlin:kotlin-reflect")
    implementation("org.jetbrains.kotlin:kotlin-stdlib-jdk8")
    implementation("org.json:json:20231013")

    // arrow kt
    val arrowVersion = "1.0.0"
    implementation(platform("io.arrow-kt:arrow-stack:$arrowVersion"))
    implementation("io.arrow-kt:arrow-core")
    implementation("io.arrow-kt:arrow-optics")
    implementation("io.arrow-kt:arrow-meta")
    implementation("io.arrow-kt:arrow-fx-coroutines")

    // coroutines
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:$coroutinesVersion")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-slf4j:$coroutinesVersion")
    testImplementation("org.jetbrains.kotlinx:kotlinx-coroutines-test:$coroutinesVersion")
@Component
class AccountQuery : Query {
    @Auth(minRole = Role.ADMIN)
    suspend fun account(
        customerId: String,
        accountId: String
    ): Account {
        return Account(customerId, accountId)
    }
}

class Account(
    private val customerId: String,
    private val accountId: String
) {
    suspend fun activities(
        fromDate: String? = null,
        toDate: String? = null,
        @GraphQLIgnore @Autowired
        activityService: ActivityService
    ): List<Activity> {
        return activityService.getTransactions(
            customerId,
            accountId,
            startDate = fromDate,
            endDate = toDate
        )
    }
}

Example query:

query AccountActivities {
  account(accountId: "${accountId}", customerId: "${customerId}") {
    statuses {
      isDelinquent
    }
  }
}

Expected behavior
The /health endpoint shouldn't be blocked by normal traffic in a reactive framework.

patiramyadav added the type: bug label Dec 5, 2023
patiramyadav changed the title from "Spring Boot 3.1.6 health endpoint latency very high when other query is running." to "Spring Boot 3.1.6 health endpoint latency is very high when other query is running." Dec 5, 2023
@samuelAndalon (Contributor) commented:

Please provide a repository link where the issue can be reproduced.

@patiramyadav (Author) commented:

> please provide a repository link where issue can be reproduced

It is a private project; I will try to create a public project and add it there.

@patiramyadav (Author) commented Dec 6, 2023:

This happens under heavy IO load, and it is hard to upload the whole project. I don't have a 24-hour live mock server. If you have a mock server and an OpenAPI spec, I will generate a client; I need at least two mock servers (microservices).

This problem happens when all the reactive threads are busy, so requests behave as if they were blocking. My management port is a different one (9999); does it share the same thread pool as port 8080?

Something like this should solve the problem:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.withContext

@OptIn(ExperimentalCoroutinesApi::class)
val customDispatcher = Dispatchers.IO.limitedParallelism(300)

suspend fun <T> withIOContext(block: suspend () -> T) = withContext(customDispatcher) {
    block()
}

But I am not sure how to configure this globally as a singleton.
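
One possible approach, sketched below, is to expose that dispatcher as a singleton Spring bean and inject it wherever the wrapping is needed; the bean name and the parallelism value here are assumptions for illustration, not graphql-kotlin API:

import kotlinx.coroutines.CoroutineDispatcher
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class DispatcherConfig {
    // Hypothetical singleton bean: every injection point shares this single
    // limited-parallelism view of Dispatchers.IO instead of creating new views.
    @OptIn(ExperimentalCoroutinesApi::class)
    @Bean
    fun ioBoundDispatcher(): CoroutineDispatcher = Dispatchers.IO.limitedParallelism(300)
}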

I used k6 to generate the traffic:
query.ts

import { check, sleep } from 'k6';
import http from 'k6/http';

// Use istio url
const BASE_URL = 'http://localhost:8080/graphql'

// Add customerId and accountId so that each request will pick randomly from here.
const listOfAccountDetails = [
  { 'accountId': '1', 'customerId': '2' },
  { 'accountId': '3', 'customerId': '4' },
  { 'accountId': '5', 'customerId': '6' },
  { 'accountId': '7', 'customerId': '8' },
  { 'accountId': '9', 'customerId': '10' },
  { 'accountId': '11', 'customerId': '12' },
];

export const options = {
  duration: '10000d', // Run forever
  vus: 500,
  thresholds: {
    checks: ['rate>0.9'] // Ensures > 90% of function checks are successful
  }
};


function account_query2(accountId, customerId) {
  return `
      query fetchAccount1{
        account1(accountId: "${accountId}", customerId:"${customerId}"){
          activities(fromDate: "2010-01-01", toDate:"2023-12-02"){
            amount
          }
        }
      }`;
}

// Main function to execute the request
export default function () {
  const randomIndex = Math.floor(Math.random() * listOfAccountDetails.length);
  const accDetails = listOfAccountDetails[randomIndex];
  const request = account_query2(accDetails.accountId, accDetails.customerId)
  console.log(JSON.stringify(accDetails));
  const accountResponse = http.post(BASE_URL, JSON.stringify({ query: request }), {
    headers:
    {
      "Content-Type": "application/json",
      'requester-id': "12345666666",
      'role': 'agent',
      "x-b3-traceid": 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
        let r = Math.random() * 16 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8);
        return Math.random() * 16 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8).toString(16);
      }),
      "x-b3-spanid": "xxxxxxx",
      "x-b3-parentspanid": "1242314314",
      "x-b3-sampled": "1"
    }
  });
  console.log(accountResponse.body, accountResponse.status);
  console.log(JSON.parse(accountResponse.body));
  check(accountResponse, {
    'account query is 200': (r) => { return r.status === 200 && !("errors" in JSON.parse(r.body)) }
  });
}

health.ts

import { check, sleep } from 'k6';
import http from 'k6/http';

// Use istio url
const BASE_URL = 'http://localhost:9999/health'


export const options = {
  duration: '10000d', // Run forever
  vus: 1,
  thresholds: {
    checks: ['rate>0.9'] // Ensures > 90% of function checks are successful
  }
};

// Main function to execute the request
export default function () {
  const accountResponse = http.get(BASE_URL);
  console.log(accountResponse.body, accountResponse.status);
  console.log(JSON.parse(accountResponse.body));
  check(accountResponse, {
    'health query is 200': (r) => { return r.status === 200 && !("errors" in JSON.parse(r.body)) }
  });
  // sleep(3)
}

@patiramyadav (Author) commented Dec 6, 2023:

Screen.Recording.2023-12-05.at.11.29.57.PM.mov

For security reasons I can't show the whole screen. Left side: calling the /health endpoint with a single user (health checks in the cloud usually make a single call). Right side: the actual query with 500 users.

query throughput:

 http_reqs......................: 4720   37.184813/s
     http_req_duration..............: avg=8.6s     min=0s       med=7.53s max=1m0s   p(90)=9.94s  p(95)=32.46s
       { expected_response:true }...: avg=8.09s    min=678.83ms med=7.86s max=59.76s p(90)=9.61s  p(95)=13.52s

health endpoint throughput:

     http_reqs......................: 7       0.062976/s
     http_req_duration..............: avg=15.4s    min=11.85s med=16.29s max=17.46s p(90)=17.07s  p(95)=17.26s  
       { expected_response:true }...: avg=15.4s    min=11.85s med=16.29s max=17.46s p(90)=17.07s  p(95)=17.26s  

The health check timeout could be increased, but ideally not beyond 5 seconds.
I am not an expert on writing core frameworks, but in a reactive framework the health check, if enabled, should have higher priority, or a configurable or dedicated thread.

@patiramyadav (Author) commented Dec 6, 2023:

When I wrap it like this the problem is solved, but I have lots of queries. Service1 is the actual service, with its name hidden.

[three screenshots of the wrapped queries attached]

@dariuszkuc (Collaborator) commented:

Spring Boot WebFlux defaults to the Reactor Netty thread pool, which in turn defaults to max(number of CPU cores, 4) threads. By default we don't switch graphql-java execution to other threads (as it is expensive), and it should generally be up to clients to configure long-running queries to use different dispatchers... but maybe we should.

You have a few options:

  • Per the Netty documentation, you could try providing a dedicated IO selector thread by specifying -Dreactor.netty.ioSelectCount=1, and also potentially increase the worker threads (a programmatic sketch follows the code below).
    • I haven't tried it, but it does sound like it should alleviate your health endpoint issue.
  • Run graphql-java requests on a different thread by wrapping them in a coroutine dispatcher*:
    • extend the default SpringGraphQLServer and just wrap its execution with the default dispatcher:
import com.expediagroup.graphql.server.execution.GraphQLRequestHandler
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLContextFactory
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLRequestParser
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLServer
import com.expediagroup.graphql.server.types.GraphQLServerResponse
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.springframework.web.reactive.function.server.ServerRequest

class MySpringGraphQLServer(
    requestParser: SpringGraphQLRequestParser,
    contextFactory: SpringGraphQLContextFactory,
    requestHandler: GraphQLRequestHandler
) : SpringGraphQLServer(requestParser, contextFactory, requestHandler) {
    override suspend fun execute(request: ServerRequest): GraphQLServerResponse? = withContext(Dispatchers.Default) {
        super.execute(request)
    }
}

*we may end up pushing this change to the repo as well
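
For the first option, a minimal sketch of setting those Reactor Netty properties programmatically rather than on the command line; the worker count of 16 is an arbitrary illustration, and the properties must be set before Netty initializes its event loops:

import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication

@SpringBootApplication
class Application

fun main(args: Array<String>) {
    // These must be set before Reactor Netty creates its event loops,
    // i.e. before the Spring application context starts.
    System.setProperty("reactor.netty.ioSelectCount", "1")
    System.setProperty("reactor.netty.ioWorkerCount", "16")
    runApplication<Application>(*args)
}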

@patiramyadav (Author) commented Dec 7, 2023:

@dariuszkuc Something like this doesn't work:

java -Dreactor.netty.ioSelectCount=200 -jar application/build/libs/application-0.0.1-SNAPSHOT.jar com.company.orchestrator.MainKt

This worked for me instead; thanks a lot:

class IOSpringGraphQLServer(
    requestParser: SpringGraphQLRequestParser,
    contextFactory: SpringGraphQLContextFactory,
    requestHandler: GraphQLRequestHandler
) : SpringGraphQLServer(requestParser, contextFactory, requestHandler) {
    // Create the limited-parallelism view once; calling limitedParallelism()
    // inside execute() would create a fresh view (and a fresh 300-thread cap)
    // on every request.
    @OptIn(ExperimentalCoroutinesApi::class)
    private val ioDispatcher = Dispatchers.IO.limitedParallelism(300)

    override suspend fun execute(request: ServerRequest): GraphQLServerResponse? =
        withContext(ioDispatcher) {
            super.execute(request)
        }
}

And config:

@Configuration
class GraphQLConfig(
    private val requestParser: SpringGraphQLRequestParser,
    private val contextFactory: SpringGraphQLContextFactory,
    private val requestHandler: GraphQLRequestHandler
) {
    @Bean
    fun ioSpringGraphQLServer(): IOSpringGraphQLServer {
        return IOSpringGraphQLServer(requestParser, contextFactory, requestHandler)
    }
}

@dariuszkuc (Collaborator) commented:

I'm guessing you should be increasing the workers instead of the IO selector threads; per the Netty docs it should be -Dreactor.netty.ioSelectCount=1 -Dreactor.netty.ioWorkerCount=200.

That being said... since graphql-kotlin is based on the reactive paradigm, IMHO 200/300 threads is waaaaaay too many (this looks like the kind of config you would use for a blocking, servlet-based Tomcat setup), as most of them will never be used. If you are non-blocking (i.e. using coroutines and switching work as necessary) then you generally only need a couple of compute threads -> Dispatchers.Default should be plenty. You would then switch to Dispatchers.IO for IO calls, or to some other thread for heavy computation work.

Also note that the switch in the GraphQLServer only covers the overall graphql-java engine orchestration; individual data fetchers will run in the default coroutine scope (assuming you are using coroutines) unless you provide a custom context with a different dispatcher.
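
For example, an individual suspending data fetcher can hop to Dispatchers.IO around just its blocking call. A sketch reusing the activities fetcher from the issue description (ActivityService and Activity are the issue author's types; everything else is standard kotlinx.coroutines and Spring):

import com.expediagroup.graphql.generator.annotations.GraphQLIgnore
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.springframework.beans.factory.annotation.Autowired

class Account(
    private val customerId: String,
    private val accountId: String
) {
    suspend fun activities(
        fromDate: String? = null,
        toDate: String? = null,
        @GraphQLIgnore @Autowired
        activityService: ActivityService
    ): List<Activity> =
        // Only the IO-bound service call hops to Dispatchers.IO; the rest of
        // the execution stays on the caller's compute dispatcher.
        withContext(Dispatchers.IO) {
            activityService.getTransactions(
                customerId,
                accountId,
                startDate = fromDate,
                endDate = toDate
            )
        }
}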

@patiramyadav (Author) commented Dec 7, 2023:

Agreed; I see the same discussion in spring-cloud/spring-cloud-gateway#1844. It works with those settings.

@samuelAndalon (Contributor) commented:

Will create a PR to execute operations outside of the Reactor Netty epoll event loop.
