docs: add performance debugging section #4525

Closed · wants to merge 1 commit
Conversation

@norg (Member) commented Feb 6, 2020

Make sure these boxes are signed before submitting your Pull Request -- thank you.

Link to redmine ticket:

Describe changes:

  • Add additional information about performance issues and how to debug them

PRScript output (if applicable):

@victorjulien (Member)

Nice work Andreas. Can we rename this from 'debug' to 'analysis'?

@norg (Member Author) commented Feb 7, 2020

> Nice work Andreas. Can we rename this from 'debug' to 'analysis'?

Sure, that is a better name. I will simply change it in another PR after any other feedback.

@jlucovsky (Contributor) left a comment


Looks good!


sudo perf top -p $(pidof suricata)

If you see specific function calls at the top and red it's a hint that those
Contributor:

nit: s/at top and red/at top in red/

sure you have it installed and also the debug symbols installed for suricata or
the output won't be very helpful. This output is also helpful when you report
performance issues as the Suricata Development team can narrow down possible
bugs with that.
Contributor:

nit: s/bugs/issues/ ?

eBPF/XDP.

Another helpful tool is **perf** which helps to spot performance issues. Make
sure you have it installed and also the debug symbols installed for suricata or
Contributor:

nit: s/suricata/Suricata/

If you see specific function calls at the top and red it's a hint that those
are the bottlenecks. For example if you see **IPOnlyMatchPacket** it can be
either a result of high drop rates or incomplete flows which result in
decreased performance.
Member:

It can be helpful to add text about checking perf top for a specific CPU and/or a thread (-t, -c I think).

Member Author:

will add those.
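
A minimal sketch of what those additions might look like, based on standard perf options (-C limits profiling to given CPUs, -t to an existing thread id); the CPU number and <tid> are placeholders:

# profile only CPU 3, e.g. a worker core pinned there
sudo perf top -C 3

# list Suricata threads to find a worker's thread id (the SPID column)
ps -T -p $(pidof suricata)

# profile a single worker thread by its TID
sudo perf top -t <tid>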


Another recommendation is to run Suricata without any rules to see if it's
mainly related to the traffic. It can also be helpful to use rule-profiling
and/or packet-profiling at this step.
Member:

It can be worth mentioning that for that part Suricata needs to be compiled with --enable-profiling, and that this has a performance impact, so it is advised not to leave it enabled in production.

Member Author:

will add this note
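
A rough sketch of that note; the configure flag is the standard --enable-profiling switch, and the suricata.yaml layout below is an assumption, so compare it with the shipped default config:

# build-time: profiling support has to be compiled in; it adds overhead,
# so avoid leaving it enabled on production sensors
./configure --enable-profiling
make && sudo make install

# suricata.yaml (assumed layout)
profiling:
  rules:
    enabled: yes
    filename: rule_perf.log
  packets:
    enabled: yes
    filename: packet_stats.log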

@pevma (Member) left a comment

Good job / endeavor :)


First steps to check are:

- Check if the traffic is bidirectional; if it's mostly unidirectional you're missing relevant parts of the flow (see the **tshark** example at the bottom)
Member:

Could also check if there is a big discrepancy between SYN vs SYN-ACKs and RSTs in the stats/eve logs.
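
One hedged way to eyeball that from eve.json, assuming the stats event carries the usual tcp.syn/tcp.synack/tcp.rst counters (check your own stats output for the exact names):

jq 'select(.event_type=="stats") | .stats.tcp | {syn, synack, rst}' eve.json | tail -n 3

On a healthy bidirectional tap the SYN and SYN-ACK counts should be in the same ballpark; a large gap points at one-sided traffic.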

- Check for encapsulated traffic; while GRE, MPLS etc. are supported, they could also lead to performance issues, especially if there are several layers of encapsulation
- Use tools like **iftop** to spot elephant flows. Flows that run at over 1 Gbit/s for a long time can keep one CPU core at 100% all the time and increase the drop rate, even though it rarely makes sense to dig deep into this traffic.
- If VLAN is used it might help to disable **vlan.use-for-tracking**, especially in scenarios where only one direction of the flow has the VLAN tag (see the sketch after this list)
- If VLAN QinQ (IEEE 802.1ad) is used, be very cautious if you use **cluster_qm** in combination with Intel drivers. While the standard expects ethertype 0x8100 and 0x88A8 in this case (see https://en.wikipedia.org/wiki/IEEE_802.1ad), most implementations only add 0x8100 on each layer. If the first seen layer has the same VLAN tag but the inner one has different VLAN tags, it will still end up in the same queue in **cluster_qm** mode.
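
A sketch of the **vlan.use-for-tracking** setting referenced above; it is assumed to sit at the top level of suricata.yaml, where it defaults to true:

vlan:
  use-for-tracking: false
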
Member:

It should be mentioned what kernel level and which specific Intel drivers (e.g. i40e/ixgbe) this is observed under. It may not be true for all Intel drivers/all kernel versions.

Mentioning af-packet might make it easier to differentiate the runmode used.

Member Author:

I won't be able to test all old ones but I can at least add "up to version/firmware XY"

- Check for other unusual or complex protocols that aren't supported very well. In several cases we've seen that Cisco Fabric Path (ethertype 0x8903) causes performance issues. It's recommended to filter it out; one option would be a BPF filter with **not ether proto 0x8903**
Member:

A useful addition could be mentioning that bulk debug targeting can help in pinpointing an issue.
For example, running Suricata with a BPF filter such as 'port 80', 'port 25' or 'not port 443' could help zero in on a problematic protocol or rule category, in combination with perf top for a specific CPU or thread.

Member Author:

will add a section for this as well
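
A sketch of that kind of targeted run; Suricata takes a BPF filter as the trailing command-line argument, and af-packet also has a per-interface bpf-filter option in suricata.yaml (interface name and filters below are placeholders):

# only look at HTTP while watching a single worker core with perf top -C
suricata -c suricata.yaml -i eth0 'port 80'

# or take TLS out of the picture
suricata -c suricata.yaml -i eth0 'not port 443'

# assumed equivalent per-interface option in the af-packet section of suricata.yaml
#   bpf-filter: "not port 443"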
