
Spring Boot 3.1.6 health endpoint latency is very high when other query is running. #1898

Closed · patiramyadav opened this issue Dec 5, 2023 · 10 comments
Labels: type: bug (Something isn't working)

@patiramyadav (Author) commented Dec 5, 2023:

Library Version
7.0.2

Describe the bug
The health endpoint is very slow while other queries are continuously running to fetch data.

With Other Query

✓ health query is 200

✓ checks.........................: 100.00% ✓ 3 ✗ 0
data_received..................: 2.4 kB 94 B/s
data_sent......................: 344 B 14 B/s
http_req_blocked...............: avg=564.33µs min=3µs med=4µs max=1.68ms p(90)=1.34ms p(95)=1.51ms
http_req_connecting............: avg=55.33µs min=0s med=0s max=166µs p(90)=132.8µs p(95)=149.4µs
http_req_duration..............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
{ expected_response:true }...: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s //slow
http_req_failed................: 0.00% ✓ 0 ✗ 3
http_req_receiving.............: avg=84.66µs min=45µs med=48µs max=161µs p(90)=138.4µs p(95)=149.69µs
http_req_sending...............: avg=35.66µs min=19µs med=24µs max=64µs p(90)=56µs p(95)=60µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
http_reqs......................: 3 0.118047/s
iteration_duration.............: avg=7.02s min=6s med=6.06s max=8.99s p(90)=8.41s p(95)=8.7s
iterations.....................: 3 0.118047/s
vus............................: 1 min=1 max=1
vus_max........................: 1 min=1 max=1

Account query:
✓ account query is 200

✓ checks.........................: 100.00% ✓ 754 ✗ 0
data_received..................: 143 kB 4.6 kB/s
data_sent......................: 507 kB 16 kB/s
http_req_blocked...............: avg=94.44µs min=1µs med=3µs max=2.82ms p(90)=352.4µs p(95)=451.35µs
http_req_connecting............: avg=47.11µs min=0s med=0s max=557µs p(90)=298µs p(95)=377.4µs
http_req_duration..............: avg=3.78s min=534.59ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
{ expected_response:true }...: avg=3.78s min=534.59ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
http_req_failed................: 0.00% ✓ 0 ✗ 754
http_req_receiving.............: avg=54.57µs min=19µs med=50µs max=372µs p(90)=77.7µs p(95)=94.4µs
http_req_sending...............: avg=33.74µs min=11µs med=24µs max=407µs p(90)=50µs p(95)=66.35µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=3.78s min=534.15ms med=3.43s max=13.88s p(90)=6.61s p(95)=7.79s
http_reqs......................: 754 24.427578/s
iteration_duration.............: avg=3.81s min=538.34ms med=3.44s max=13.93s p(90)=6.61s p(95)=7.83s
iterations.....................: 754 24.427578/s
vus............................: 100 min=100 max=100
vus_max........................: 100 min=100 max=100

Without Query
✓ health query is 200

✓ checks.........................: 100.00% ✓ 317 ✗ 0
data_received..................: 253 kB 68 kB/s
data_sent......................: 27 kB 7.4 kB/s
http_req_blocked...............: avg=8.18µs min=1µs med=3µs max=1.74ms p(90)=4µs p(95)=4µs
http_req_connecting............: avg=577ns min=0s med=0s max=183µs p(90)=0s p(95)=0s
http_req_duration..............: avg=1.88ms min=637µs med=1.03ms max=252.7ms p(90)=1.37ms p(95)=1.5ms
{ expected_response:true }...: avg=1.88ms min=637µs med=1.03ms max=252.7ms p(90)=1.37ms p(95)=1.5ms
http_req_failed................: 0.00% ✓ 0 ✗ 317
http_req_receiving.............: avg=34.28µs min=18µs med=32µs max=147µs p(90)=46µs p(95)=56.2µs
http_req_sending...............: avg=15.51µs min=6µs med=14µs max=446µs p(90)=18µs p(95)=19.19µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=1.83ms min=606µs med=988µs max=252.19ms p(90)=1.32ms p(95)=1.45ms
http_reqs......................: 317 85.200396/s
iteration_duration.............: avg=11.71ms min=1.38ms med=5.28ms max=255.35ms p(90)=24.4ms p(95)=40.3ms
iterations.....................: 317 85.200396/s
vus............................: 1 min=1 max=1
vus_max........................: 1 min=1 max=1

To Reproduce
Steps to reproduce the behavior:

I have a bunch of dependencies and am not sure whether this is a side effect of any of them, but all are up to date. Here are the major ones:

    // GraphQL/Webflux
    implementation("com.expediagroup:graphql-kotlin-spring-server:7.0.2")
    implementation("com.graphql-java:graphql-java-extended-scalars:21.0")

    // kotlin deps
    implementation("org.jetbrains.kotlin:kotlin-reflect")
    implementation("org.jetbrains.kotlin:kotlin-stdlib-jdk8")
    implementation("org.json:json:20231013")

    // arrow kt
    val arrowVersion = "1.0.0"
    implementation(platform("io.arrow-kt:arrow-stack:$arrowVersion"))
    implementation("io.arrow-kt:arrow-core")
    implementation("io.arrow-kt:arrow-optics")
    implementation("io.arrow-kt:arrow-meta")
    implementation("io.arrow-kt:arrow-fx-coroutines")

    // coroutines
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:$coroutinesVersion")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-slf4j:$coroutinesVersion")
    testImplementation("org.jetbrains.kotlinx:kotlinx-coroutines-test:$coroutinesVersion")
@Component
class AccountQuery : Query {
    @Auth(minRole = Role.ADMIN)
    suspend fun account(
        customerId: String,
        accountId: String
    ): Account {
        return Account(customerId, accountId)
    }
}

class Account(
    private val customerId: String,
    private val accountId: String
) {
    suspend fun activities(
        fromDate: String? = null,
        toDate: String? = null,
        @GraphQLIgnore @Autowired
        activityService: ActivityService
    ): List<Activity> {
        return activityService.getTransactions(
            customerId,
            accountId,
            startDate = fromDate,
            endDate = toDate
        )
    }
}

Example query:

query AccountActivities {
  account(accountId: "${accountId}", customerId: "${customerId}") {
    statuses {
      isDelinquent
    }
  }
}

Expected behavior
The /health endpoint shouldn't be blocked by normal traffic in a reactive framework.

patiramyadav added the type: bug label Dec 5, 2023
patiramyadav changed the title from "Spring Boot 3.1.6 health endpoint latency very high when other query is running." to "Spring Boot 3.1.6 health endpoint latency is very high when other query is running." Dec 5, 2023
@samuelAndalon (Contributor) commented:

Please provide a repository link where the issue can be reproduced.

@patiramyadav (Author) commented:

> please provide a repository link where issue can be reproduced

It is a private project; I will try to create a public project and add it there.

@patiramyadav (Author) commented Dec 6, 2023:

This happens under heavy IO load, and it is hard to upload the whole project. I don't have a 24-hour live mock server. If you have a mock server and an OpenAPI spec, I will generate a client; I need at least two mock servers (microservices).

This problem happens when all the reactive threads are busy, so requests behave as if they were blocking. My management port is a different one (9999); does it share the same thread pool as port 8080?

Something like this should solve the problem:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.withContext

@OptIn(ExperimentalCoroutinesApi::class)
val customDispatcher = Dispatchers.IO.limitedParallelism(300)

suspend fun <T> withIOContext(block: suspend () -> T) = withContext(customDispatcher) {
    block()
}

But I am not sure how to configure this globally as a singleton.
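
One possible approach, sketched below, is to expose that dispatcher as a singleton Spring bean and inject it wherever the wrapping is needed; the bean name and the parallelism value here are assumptions for illustration, not graphql-kotlin API:

import kotlinx.coroutines.CoroutineDispatcher
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class DispatcherConfig {
    // Hypothetical singleton bean: every injection point shares this single
    // limited-parallelism view of Dispatchers.IO instead of creating new views.
    @OptIn(ExperimentalCoroutinesApi::class)
    @Bean
    fun ioBoundDispatcher(): CoroutineDispatcher = Dispatchers.IO.limitedParallelism(300)
}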

I used k6 to generate the traffic:
query.ts

import { check, sleep } from 'k6';
import http from 'k6/http';

// Use istio url
const BASE_URL = 'http://localhost:8080/graphql'

// Add customerId and accountId so that each request will pick randomly from here.
const listOfAccountDetails = [
  { 'accountId': '1', 'customerId': '2' },
  { 'accountId': '3', 'customerId': '4' },
  { 'accountId': '5', 'customerId': '6' },
  { 'accountId': '7', 'customerId': '8' },
  { 'accountId': '9', 'customerId': '10' },
  { 'accountId': '11', 'customerId': '12' },
];

export const options = {
  duration: '10000d', // Run forever
  vus: 500,
  thresholds: {
    checks: ['rate>0.9'] // Ensures > 90% of function checks are successful
  }
};


function account_query2(accountId, customerId) {
  return `
      query fetchAccount1{
        account1(accountId: "${accountId}", customerId:"${customerId}"){
          activities(fromDate: "2010-01-01", toDate:"2023-12-02"){
            amount
          }
        }
      }`;
}

// Main function to execute the request
export default function () {
  const randomIndex = Math.floor(Math.random() * listOfAccountDetails.length);
  const accDetails = listOfAccountDetails[randomIndex];
  const request = account_query2(accDetails.accountId, accDetails.customerId)
  console.log(JSON.stringify(accDetails));
  const accountResponse = http.post(BASE_URL, JSON.stringify({ query: request }), {
    headers:
    {
      "Content-Type": "application/json",
      'requester-id': "12345666666",
      'role': 'agent',
      "x-b3-traceid": 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
        let r = Math.random() * 16 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8);
        return Math.random() * 16 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8).toString(16);
      }),
      "x-b3-spanid": "xxxxxxx",
      "x-b3-parentspanid": "1242314314",
      "x-b3-sampled": "1"
    }
  });
  console.log(accountResponse.body, accountResponse.status);
  console.log(JSON.parse(accountResponse.body));
  check(accountResponse, {
    'account query is 200': (r) => { return r.status === 200 && !("errors" in JSON.parse(r.body)) }
  });
}

health.ts

import { check, sleep } from 'k6';
import http from 'k6/http';

// Use istio url
const BASE_URL = 'http://localhost:9999/health'


export const options = {
  duration: '10000d', // Run forever
  vus: 1,
  thresholds: {
    checks: ['rate>0.9'] // Ensures > 90% of function checks are successful
  }
};

// Main function to execute the request
export default function () {
  const accountResponse = http.get(BASE_URL);
  console.log(accountResponse.body, accountResponse.status);
  console.log(JSON.parse(accountResponse.body));
  check(accountResponse, {
    'health query is 200': (r) => { return r.status === 200 && !("errors" in JSON.parse(r.body)) }
  });
  // sleep(3)
}

@patiramyadav (Author) commented Dec 6, 2023:

Screen.Recording.2023-12-05.at.11.29.57.PM.mov

For security reasons I can't show the whole screen. Left side: calling the /health endpoint with a single user (health checks in the cloud usually make a single call). Right side: the actual query with 500 users.

query throughput:

 http_reqs......................: 4720   37.184813/s
     http_req_duration..............: avg=8.6s     min=0s       med=7.53s max=1m0s   p(90)=9.94s  p(95)=32.46s
       { expected_response:true }...: avg=8.09s    min=678.83ms med=7.86s max=59.76s p(90)=9.61s  p(95)=13.52s

health endpoint throughput:

     http_reqs......................: 7       0.062976/s
     http_req_duration..............: avg=15.4s    min=11.85s med=16.29s max=17.46s p(90)=17.07s  p(95)=17.26s  
       { expected_response:true }...: avg=15.4s    min=11.85s med=16.29s max=17.46s p(90)=17.07s  p(95)=17.26s  

The health check timeout could be increased, but ideally not beyond 5 seconds.
I am not an expert on writing core frameworks, but in a reactive framework the health check, if enabled, should have higher priority, or a configurable or dedicated thread.

@patiramyadav (Author) commented Dec 6, 2023:

When I wrap it like this the problem is solved, but I have lots of queries. Service1 is the actual service, with its name hidden.

[three screenshots of the wrapped queries attached]

@dariuszkuc (Collaborator) commented:

Spring Boot WebFlux defaults to the Reactor Netty thread pool, which in turn defaults to max(number of CPU cores, 4) threads. By default we don't switch graphql-java execution to other threads (as it is expensive), and it should generally be up to clients to configure long-running queries to use different dispatchers... but maybe we should.

You have a few options:

  • Per the Netty documentation, you could try providing a dedicated IO selector thread by specifying -Dreactor.netty.ioSelectCount=1, and also potentially increase the worker threads (a programmatic sketch follows the code below).
    • I haven't tried it, but it does sound like it should alleviate your health endpoint issue.
  • Run graphql-java requests on a different thread by wrapping them in a coroutine dispatcher*:
    • extend the default SpringGraphQLServer and just wrap its execution with the default dispatcher:
import com.expediagroup.graphql.server.execution.GraphQLRequestHandler
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLContextFactory
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLRequestParser
import com.expediagroup.graphql.server.spring.execution.SpringGraphQLServer
import com.expediagroup.graphql.server.types.GraphQLServerResponse
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.springframework.web.reactive.function.server.ServerRequest

class MySpringGraphQLServer(
    requestParser: SpringGraphQLRequestParser,
    contextFactory: SpringGraphQLContextFactory,
    requestHandler: GraphQLRequestHandler
) : SpringGraphQLServer(requestParser, contextFactory, requestHandler) {
    override suspend fun execute(request: ServerRequest): GraphQLServerResponse? = withContext(Dispatchers.Default) {
        super.execute(request)
    }
}

*we may end up pushing this change to the repo as well
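
For the first option, a minimal sketch of setting those Reactor Netty properties programmatically rather than on the command line; the worker count of 16 is an arbitrary illustration, and the properties must be set before Netty initializes its event loops:

import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.runApplication

@SpringBootApplication
class Application

fun main(args: Array<String>) {
    // These must be set before Reactor Netty creates its event loops,
    // i.e. before the Spring application context starts.
    System.setProperty("reactor.netty.ioSelectCount", "1")
    System.setProperty("reactor.netty.ioWorkerCount", "16")
    runApplication<Application>(*args)
}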

@patiramyadav (Author) commented Dec 7, 2023:

@dariuszkuc Something like this doesn't work:

java -Dreactor.netty.ioSelectCount=200 -jar application/build/libs/application-0.0.1-SNAPSHOT.jar com.company.orchestrator.MainKt

This worked for me instead; thanks a lot:

class IOSpringGraphQLServer(
    requestParser: SpringGraphQLRequestParser,
    contextFactory: SpringGraphQLContextFactory,
    requestHandler: GraphQLRequestHandler
) : SpringGraphQLServer(requestParser, contextFactory, requestHandler) {
    // Create the limited-parallelism view once; calling limitedParallelism()
    // inside execute() would create a fresh view (and a fresh 300-thread cap)
    // on every request.
    @OptIn(ExperimentalCoroutinesApi::class)
    private val ioDispatcher = Dispatchers.IO.limitedParallelism(300)

    override suspend fun execute(request: ServerRequest): GraphQLServerResponse? =
        withContext(ioDispatcher) {
            super.execute(request)
        }
}

And config:

@Configuration
class GraphQLConfig(
    private val requestParser: SpringGraphQLRequestParser,
    private val contextFactory: SpringGraphQLContextFactory,
    private val requestHandler: GraphQLRequestHandler
) {
    @Bean
    fun ioSpringGraphQLServer(): IOSpringGraphQLServer {
        return IOSpringGraphQLServer(requestParser, contextFactory, requestHandler)
    }
}

@dariuszkuc (Collaborator) commented:

I'm guessing you should be increasing the workers instead of the IO selector threads; per the Netty docs it should be -Dreactor.netty.ioSelectCount=1 -Dreactor.netty.ioWorkerCount=200.

That being said... since graphql-kotlin is based on the reactive paradigm, IMHO 200/300 threads is waaaaaay too many (this looks like the kind of config you would use for a blocking, servlet-based Tomcat setup), as most of them will never be used. If you are non-blocking (i.e. using coroutines and switching work as necessary) then you generally only need a couple of compute threads -> Dispatchers.Default should be plenty. You would then switch to Dispatchers.IO for IO calls, or to some other thread for heavy computation work.

Also note that the switch in the GraphQLServer only covers the overall graphql-java engine orchestration; individual data fetchers will run in the default coroutine scope (assuming you are using coroutines) unless you provide a custom context with a different dispatcher.
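
For example, an individual suspending data fetcher can hop to Dispatchers.IO around just its blocking call. A sketch reusing the activities fetcher from the issue description (ActivityService and Activity are the issue author's types; everything else is standard kotlinx.coroutines and Spring):

import com.expediagroup.graphql.generator.annotations.GraphQLIgnore
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import org.springframework.beans.factory.annotation.Autowired

class Account(
    private val customerId: String,
    private val accountId: String
) {
    suspend fun activities(
        fromDate: String? = null,
        toDate: String? = null,
        @GraphQLIgnore @Autowired
        activityService: ActivityService
    ): List<Activity> =
        // Only the IO-bound service call hops to Dispatchers.IO; the rest of
        // the execution stays on the caller's compute dispatcher.
        withContext(Dispatchers.IO) {
            activityService.getTransactions(
                customerId,
                accountId,
                startDate = fromDate,
                endDate = toDate
            )
        }
}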

@patiramyadav (Author) commented Dec 7, 2023:

Agreed; I see the same discussion in spring-cloud/spring-cloud-gateway#1844. It works with those settings.

@samuelAndalon (Contributor) commented:

Will create a PR to execute operations outside of the Reactor Netty epoll event loop.
