Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fqdn perf test #32514

Merged
merged 1 commit into from
May 28, 2024
Merged

Add fqdn perf test #32514

merged 1 commit into from
May 28, 2024

Conversation

marseel
Copy link
Contributor

@marseel marseel commented May 13, 2024

Start measuing fqdn performance. This test runs 30 qps of DNS requests
to S3, which return different set of IPs with each request.
We keep track of client-side latency, latencies reported by cilium-agent
metrics and also cpu/memory usage while also gathering profiling
information.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 13, 2024
@marseel marseel force-pushed the add_fqdn_perf_test branch 10 times, most recently from 555927d to ef5856d Compare May 15, 2024 11:07
@marseel
Copy link
Contributor Author

marseel commented May 15, 2024

Example results
Based on Cilium metrics:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Proxy dataplane time - Perc50": 0.0025064447433310916,
        "DNS Proxy dataplane time - Perc99": 0.0049627605917955606,
        "DNS Proxy policy check time - Perc50": 0.0025,
        "DNS Proxy policy check time - Perc99": 0.00495,
        "DNS Proxy policy generation time - Perc50": 0.0028914533229893974,
        "DNS Proxy policy generation time - Perc99": 0.09597826086956514,
        "DNS Proxy policy semaphore time - Perc50": 0.0025,
        "DNS Proxy policy semaphore time - Perc99": 0.00495,
        "DNS Proxy processing time - Perc50": 0.0028933238452581184,
        "DNS Proxy processing time - Perc99": 0.09811848958333333,
        "DNS Proxy total time - Perc50": 0.0029030897053096195,
        "DNS Proxy total time - Perc99": 0.11715346534653429,
        "DNS Proxy upstream time - Perc50": 0.002575138185168125,
        "DNS Proxy upstream time - Perc99": 0.021066249999999915
      },
      "unit": "s"
    }
  ]
}

Based on client-side metrics:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Error Count": 0,
        "DNS Error Percentage": 0,
        "DNS Lookup Count": 10734,
        "DNS Lookup Latency - Perc50": 0.00993109151047409,
        "DNS Lookup Latency - Perc99": 0.17374999999999893,
        "DNS Timeout Count": 0
      },
      "unit": "s"
    }
  ]
}

CPU/mem usage:

50th percentile
    {
      "Name": "cilium-pgg9c/cilium-agent",
      "CPU": 0.483284872,
      "Mem": 236933120
    },
    {
      "Name": "cilium-v5bq6/cilium-agent",
      "CPU": 0.634017295,
      "Mem": 228868096
    },
99th percentile
    {
      "Name": "cilium-pgg9c/cilium-agent",
      "CPU": 0.518770507,
      "Mem": 238755840
    },
    {
      "Name": "cilium-v5bq6/cilium-agent",
      "CPU": 0.716143043,
      "Mem": 243933184
    },

CPU pprof:
image

@marseel marseel force-pushed the add_fqdn_perf_test branch 2 times, most recently from 1932362 to 2ff9d7c Compare May 16, 2024 12:27
@marseel marseel added the release-note/ci This PR makes changes to the CI. label May 16, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 16, 2024
@marseel marseel force-pushed the add_fqdn_perf_test branch 4 times, most recently from 644162b to a71acee Compare May 16, 2024 14:09
@marseel marseel requested a review from squeed May 16, 2024 14:09
@marseel marseel marked this pull request as ready for review May 16, 2024 14:09
@marseel marseel requested review from a team as code owners May 16, 2024 14:09
@marseel marseel requested review from thorn3r, aanm and tklauser May 16, 2024 14:09
@marseel
Copy link
Contributor Author

marseel commented May 16, 2024

One interesting observation, when I increased the number of distinct DNS names from 10 to 100, without changing qps , most of the requests start to fail, timing out on policy generation time:

{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Proxy dataplane time - Perc50": 0.0027479629109300363,
        "DNS Proxy dataplane time - Perc99": 0.6157446808510627,
        "DNS Proxy policy check time - Perc50": 0.0025,
        "DNS Proxy policy check time - Perc99": 0.00495,
        "DNS Proxy policy generation time - Perc50": 10,
        "DNS Proxy policy generation time - Perc99": 10,
        "DNS Proxy policy semaphore time - Perc50": 0.0025,
        "DNS Proxy policy semaphore time - Perc99": 0.00495,
        "DNS Proxy processing time - Perc50": 10,
        "DNS Proxy processing time - Perc99": 10,
        "DNS Proxy total time - Perc50": 10,
        "DNS Proxy total time - Perc99": 10,
        "DNS Proxy upstream time - Perc50": 0.002936055238667067,
        "DNS Proxy upstream time - Perc99": 0.04322471910112352
      },
      "unit": "s"
    }
  ]
}
{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "DNS Error Count": 8127.043253333333,
        "DNS Error Percentage": 80.77929442324005,
        "DNS Lookup Count": 10060.8,
        "DNS Lookup Latency - Perc50": 10,
        "DNS Lookup Latency - Perc99": 10,
        "DNS Timeout Count": 8127.043253333333
      },
      "unit": "s"
    }
  ]
}

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, A+ for modularizing the cl2 bits 😎

@marseel
Copy link
Contributor Author

marseel commented May 21, 2024

/test

Start measuing fqdn performance. This test runs 30 qps of DNS requests
to S3, which return different set of IPs with each request.
We keep track of client-side latency, latencies reported by cilium-agent
metrics and also cpu/memory usage while also gathering profiling
information.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@marseel
Copy link
Contributor Author

marseel commented May 28, 2024

rebased on main to pull change with EKS clusters not using preemtibles.

@marseel
Copy link
Contributor Author

marseel commented May 28, 2024

/test

@marseel
Copy link
Contributor Author

marseel commented May 28, 2024

Not sure why it doesn't get label ready-to-merge, all required tests passed, reviews are in, no pending comments or blocking labels. Marking as ready-to-merge.

@marseel marseel added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
@tklauser tklauser added this pull request to the merge queue May 28, 2024
Merged via the queue into main with commit fcc717d May 28, 2024
263 of 267 checks passed
@tklauser tklauser deleted the add_fqdn_perf_test branch May 28, 2024 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants