Skip to content

feat(packetparser): Allow sampling of packets #1628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

mmckeen
Copy link
Contributor

@mmckeen mmckeen commented May 27, 2025

Description

For nodes with lots of forwarded packets packetparser can easily get overwhelmed and utilize a large amount of resources.

This allows for optional sampling of packetparser packets with best-effort weighting of metrics based on observed packets. This only applies to dataAggregationLevel: high.

There's only a minor change in the existing behavior when the default sampling rate of 1 is applied, this is that we only look at the current packet for deciding when to report for important TCP flags. Previously we would report if any important flag had been seen on the connection. In practice this would mean we'd always report packets since SYN was considered an important flag. Since we also take into account whether or not any new flags are seen on the connection the number of packets now being reported at the reporting interval versus always as before should be minimal.

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

At sampling rate of 100, forwarded bytes and packets remain similar to those reported by cadvisor.

Screenshot 2025-05-27 at 1 52 55 PM Screenshot 2025-05-27 at 1 49 37 PM

This correlation gets stronger the lower the sampling rate (below at 2).

Screenshot 2025-05-27 at 2 24 54 PM Screenshot 2025-05-27 at 2 25 14 PM

Additional Notes

I was unsure whether or not reporting packets for important TCP flags across the entire lifetime of the connection was intentional or not. Without looking only at the current packet sampling doesn't work. I can change the logic to keep the existing behavior for non-sampled packets if it makes sense.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

@mmckeen mmckeen requested a review from a team as a code owner May 27, 2025 21:01
@mmckeen mmckeen requested review from mainred and byte-msft May 27, 2025 21:01
@mmckeen mmckeen force-pushed the packetParserSampling branch 4 times, most recently from 28b1e68 to 345f2aa Compare May 27, 2025 22:51
@nddq nddq requested review from nddq and SRodi and removed request for mainred and byte-msft May 28, 2025 18:34
Copy link
Member

@SRodi SRodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this contribution and the detailed context, @mmckeen! The sampling feature is a great idea for improving resource utilization on nodes with high traffic. I noticed there are also some improvements to existing packet reporting logic bundled with the new implementation. To keep reviews focused and changes easier to track, would you be open to splitting the new sampling functionality and the general code improvements into two separate PRs? This would help us ensure each change gets the attention it deserves.

Really appreciate your work and your openness to feedback. Let us know how we can support you with the split—happy to help however we can! cc @nddq

@mmckeen
Copy link
Contributor Author

mmckeen commented Jun 4, 2025

Thanks so much for this contribution and the detailed context, @mmckeen! The sampling feature is a great idea for improving resource utilization on nodes with high traffic. I noticed there are also some improvements to existing packet reporting logic bundled with the new implementation. To keep reviews focused and changes easier to track, would you be open to splitting the new sampling functionality and the general code improvements into two separate PRs? This would help us ensure each change gets the attention it deserves.

Really appreciate your work and your openness to feedback. Let us know how we can support you with the split—happy to help however we can! cc @nddq

@SRodi do you mind pointing out the code improvements?

If you mean the change to the reporting of important packets I can split that out. Without it though the sampling won't work (as mentioned in the PR description).

@SRodi
Copy link
Member

SRodi commented Jun 5, 2025

If you mean the change to the reporting of important packets I can split that out. Without it though the sampling won't work (as mentioned in the PR description).

@mmckeen I would prefer to see the change to packets reporting and the sampling split into 2 separate PRs.

@mmckeen
Copy link
Contributor Author

mmckeen commented Jun 6, 2025

I split those changes off into #1665, as noted that PR relies on this one to report accurate packet counts and maintain backwards compatibility.

I can also split the metric weighting and counting of packets between reports into that PR so it's isolated, let me know what makes sense 🙇.

@SRodi
Copy link
Member

SRodi commented Jun 9, 2025

I can also split the metric weighting and counting of packets between reports into that PR so it's isolated, let me know what makes sense 🙇.

Yes, that makes sense. Thanks. cc @mmckeen

@mmckeen mmckeen marked this pull request as draft June 11, 2025 23:02
@mmckeen
Copy link
Contributor Author

mmckeen commented Jun 11, 2025

@SRodi #1665 has been extended to include the weighting functionality, I'll put this into draft until I have the chance to rebase this on top of that if things look good 🙇

github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2025
…s, improve scalability (#1665)

# Description

Previously `packetparser` in `high` `dataAggregationLevel` would report
(mostly) every single packet since important flags were observed over
the lifetime of the connection.

This changes that behavior to only observe the important flags on
individual packets and report when necessary.

This will mean less packets are reported. However, it also adds back
weighting for bytes, packets, and TCP flags so that metrics remain
accurate versus before.

I also noticed the current docs for the TCP flags metrics are
inaccurate, we only report a subset of the supported flags. Not sure if
this is intentional, however supporting more flags will put more memory
pressure on both conntrack as well as performance pressure on packet
reporting. With sampling in place, this should be more than worth it but
there may be repercussions for the performance of `low`
`dataAggregationLevel`.

## Checklist

- [X] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [X] I have updated the documentation, if necessary.
- [X] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

eBPF objects compile and load as expected.

# `main` Branch

<img width="1463" alt="tcpflags main"
src="https://github.com/user-attachments/assets/167908f0-7c37-4498-a7f6-20a41110d925"
/>
<img width="1463" alt="prometheus packets retina main"
src="https://github.com/user-attachments/assets/de8ad834-ed91-4673-8643-aa9cc51b3451"
/>
<img width="1463" alt="prometheus bytes retina main"
src="https://github.com/user-attachments/assets/420f54dc-fbbe-4a7b-a290-24c5d6666518"
/>

# This Branch

<img width="1463" alt="tcpflags patched"
src="https://github.com/user-attachments/assets/88460ee0-f769-4992-b9be-392b38a64b19"
/>
<img width="1463" alt="prometheus packets retina patched"
src="https://github.com/user-attachments/assets/304bbb97-8c57-477f-9c19-91bb1510b9c7"
/>
<img width="1463" alt="prometheus bytes retina patched"
src="https://github.com/user-attachments/assets/ef917562-487e-45be-8fbb-c9573fa708c1"
/>

## Additional Notes

#1628 will be a follow-up to
this to add additional sampling functionality.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Matthew McKeen <matthew.mckeen@fastly.com>
@mmckeen mmckeen force-pushed the packetParserSampling branch 2 times, most recently from 371e50c to 0e9cd0c Compare July 22, 2025 21:19
@mmckeen mmckeen closed this Jul 22, 2025
@mmckeen mmckeen force-pushed the packetParserSampling branch from 0e9cd0c to 1983b55 Compare July 22, 2025 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants