docs: add performance debugging section #4525
Conversation
Nice work Andreas. Can we rename this from 'debug' to 'analysis'?

Sure, that is a better name. I will change it in another PR after any other feedback.
Looks good!
> ``sudo perf top -p $(pidof suricata)``
>
> If you see specific function calls at the top and red it's a hint that those
nit: s/at top and red/at top in red/
> sure you have it installed and also the debug symbols installed for suricata or
> the output won't be very helpful. This output is also helpful when you report
> performance issues as the Suricata Development team can narrow down possible
> bugs with that.
nit: s/bugs/issues/ ?
> eBPF/XDP.
>
> Another helpful tool is **perf** which helps to spot performance issues. Make
> sure you have it installed and also the debug symbols installed for suricata or
nit: s/suricata/Suricata/
> If you see specific function calls at the top and red it's a hint that those
> are the bottlenecks. For example if you see **IPOnlyMatchPacket** it can be
> either a result of high drop rates or incomplete flows which result in
> decreased performance.
It can be helpful to add text about checking perf top for a specific CPU and/or a thread (`-t`, `-c` I think).
will add those.
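The per-CPU and per-thread options suggested above could look roughly like the following sketch; `<tid>` is a placeholder, and the exact option set should be verified against your perf version:

```shell
# Profile a single CPU core, e.g. a worker core pinned at 100%
sudo perf top -C 3

# Profile one Suricata worker thread by its TID
# (list TIDs with: ps -T -p $(pidof suricata))
sudo perf top -t <tid>
```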
> Another recommendation is to run Suricata without any rules to see if it's
> mainly related to the traffic. It can also be helpful to use rule-profiling
> and/or packet-profiling at this step.
It can be worth mentioning that for that part Suricata needs to be compiled with `--enable-profiling`, and that this has a perf impact, so it is advised not to leave it like that in prod.
will add this note
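The profiling build and the no-rules run discussed above can be sketched as follows (paths are examples; as noted, the profiling build adds overhead and is not meant for production):

```shell
# Build with profiling support to get rule-/packet-profiling output
./configure --enable-profiling
make && sudo make install

# Run without any rules to see whether the load is traffic-related;
# /dev/null acts as an empty rule file
sudo suricata -c /etc/suricata/suricata.yaml -i eth0 -S /dev/null
```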
Good job / endeavor :)
> First steps to check are:
>
> - Check if the traffic is bidirectional, if it's mostly unidirectional you're missing relevant parts of the flow (see **tshark** example at the bottom)
Could also check if there is a big discrepancy between SYN vs SYN-ACKs and RSTs in the stats/eve logs.
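The SYN vs SYN-ACK check suggested above can be done against eve.json `stats` records. A minimal sketch, assuming the `tcp.syn` and `tcp.synack` counters present in current Suricata stats output (field names may vary between versions, and the sample record below is illustrative only):

```python
import json

def syn_counters(eve_lines):
    """Return the latest (syn, synack) totals from eve.json 'stats' records.

    Stats counters are cumulative, so the last seen value is the total.
    """
    syn = synack = 0
    for line in eve_lines:
        rec = json.loads(line)
        if rec.get("event_type") != "stats":
            continue
        tcp = rec.get("stats", {}).get("tcp", {})
        syn = tcp.get("syn", syn)
        synack = tcp.get("synack", synack)
    return syn, synack

# Hypothetical sample record for illustration
sample = ['{"event_type":"stats","stats":{"tcp":{"syn":1000,"synack":120}}}']
syn, synack = syn_counters(sample)
if synack < syn // 2:
    print("large SYN/SYN-ACK gap - traffic may be mostly unidirectional")
```

A big gap between SYNs and SYN-ACKs (or a flood of RSTs) points at one-sided capture, matching the bidirectionality check in the quoted hunk.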
> - Check for encapsulated traffic, while GRE, MPLS etc. are supported they could also lead to performance issues. Especially if there are several layers of encapsulation
> - Use tools like **iftop** to spot elephant flows. Flows that have a rate of over 1Gbit/s for a long time can result in one CPU core at 100% all the time and an increasing drop rate, while it doesn't make sense to dig deep into this traffic.
> - If VLAN is used it might help to disable **vlan.use-for-tracking**, especially in scenarios where only one direction of the flow has the VLAN tag
> - If VLAN QinQ (IEEE 802.1ad) is used be very cautious if you use **cluster_qm** in combination with Intel drivers. While the RFC expects ethertype 0x8100 and 0x88A8 in this case (see https://en.wikipedia.org/wiki/IEEE_802.1ad) most implementations only add 0x8100 on each layer. If the first seen layer has the same VLAN tag but the inner one has different VLAN tags it will still end up in the same queue in **cluster_qm** mode.
It should be mentioned what kernel level and which specific Intel drivers (e.g. i40e/ixgbe etc.) this is observed under. It may not be true for all Intel drivers/all kernel versions. Mentioning af-packet might make it easier to differentiate the runmode used.
I won't be able to test all old ones but I can at least add "up to version/firmware XY"
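The **iftop** suggestion from the quoted hunk can be sketched as below; the interface name is a placeholder:

```shell
# Per-flow bandwidth on the capture interface; -n/-N skip DNS and
# port-name lookups so sustained >1Gbit/s elephant flows stand out
sudo iftop -i eth0 -n -N
```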
> - Check for other unusual or complex protocols that aren't supported very well. In several cases we've seen that Cisco Fabric Path (ethertype 0x8903) causes performance issues. It's recommended to filter it, one option would be a bpf filter with **not ether proto 0x8903**
A useful addition could be mentioning that bulk debug targeting could help in pinpointing an issue. For example running Suricata with a bpf filter like `port 80`, `port 25` or `not port 443` could help zeroing in on a problematic protocol or rules category, in combination with perf top for a specific CPU or thread.
will add a section for this as well
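The BPF-based bisection discussed in this thread can be sketched as follows; interface and filters are examples, assuming a capture method that accepts a trailing BPF expression on the command line:

```shell
# Exclude Cisco Fabric Path frames (ethertype 0x8903)
sudo suricata -c /etc/suricata/suricata.yaml -i eth0 'not ether proto 0x8903'

# Bisect by protocol: drop one traffic class at a time and watch
# perf top / drop counters to zero in on the problematic traffic
sudo suricata -c /etc/suricata/suricata.yaml -i eth0 'not port 443'
```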
Make sure these boxes are signed before submitting your Pull Request -- thank you.
Link to redmine ticket:
Describe changes:
PRScript output (if applicable):