docs: add performance debugging section #4525

Closed
wants to merge 1 commit into from
67 changes: 67 additions & 0 deletions doc/userguide/performance/debug.rst
@@ -0,0 +1,67 @@
Performance Debugging
=====================

There are many possible causes of performance issues. In this section we
guide you through some options for narrowing them down.

General
-------

First of all, check all the log files, with a focus on stats.log and
suricata.log, for any obvious issues. There are several tools that can help
to find a root cause.
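
For example, a quick way to keep an eye on the capture drop counters; the log
paths below are common defaults and may differ on your system, and **jq** is
assumed to be available for the EVE variant:

::

# watch the drop counters as stats.log is updated
tail -f /var/log/suricata/stats.log | grep -i drop

# or inspect the capture counters in the EVE JSON stats events
tail -f /var/log/suricata/eve.json | jq 'select(.event_type=="stats") | .stats.capture'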

A first step is to run a tool like **htop** to get an overview of the system
load and to see whether there is a bottleneck in the traffic distribution. For
example, if only a small number of CPU cores hit 100% all the time while
others don't, it could be related to a bad traffic distribution or to elephant
flows. In the first case, try to improve the configuration; in the second,
try to filter or shunt those big flows with either a BPF filter, bypass rules
or eBPF/XDP.
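
As an illustration, assuming af-packet capture on **eth0** and a known
elephant-flow host **10.0.0.5** (both placeholders), such a flow can be
excluded with a BPF filter passed as the last command line argument (the
per-interface **bpf-filter** option in the af-packet section of suricata.yaml
is an alternative):

::

sudo suricata -c /etc/suricata/suricata.yaml --af-packet=eth0 'not host 10.0.0.5'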

Another helpful tool is **perf**, which helps to spot performance issues. Make
sure you have it installed, along with the debug symbols for Suricata, or the
output won't be very helpful.
Contributor: nit: s/suricata/Suricata/

This output is also helpful when you report performance issues, as the
Suricata development team can use it to narrow down possible bugs.
Contributor: nit: s/bugs/issues/ ?


::

sudo perf top -p $(pidof suricata)

If you see specific function calls at the top in red, it's a hint that those
are the bottlenecks.
Contributor: nit: s/at top and red/at top in red/

For example, if you see **IPOnlyMatchPacket** it can be either a result of
high drop rates or incomplete flows, which result in decreased performance.
Member: It can be helpful to add in text for checking out the perf top for a specific cpu and/or a thread (-t, -c I think).

Member Author: Will add those.
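
For reference, a sketch of how **perf top** can be limited to a single CPU
core or a single Suricata thread; the core number and thread id below are
placeholders (see **man perf-top** for the exact options in your version):

::

# only profile CPU core 3, e.g. a core that htop shows pinned at 100%
sudo perf top -C 3

# only profile one Suricata thread; thread ids can be listed with "top -H -p $(pidof suricata)"
sudo perf top -t $THREAD_ID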


Another recommendation is to run Suricata without any rules to see if the
issue is mainly related to the traffic itself. It can also be helpful to use
rule-profiling and/or packet-profiling at this step.
Member: It can be worth mentioning that for that part Suricata needs to be compiled with --enable-profiling and that has a perf impact, so it is advised not to leave it like that in prod.

Member Author: Will add this note.
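
A minimal sketch of both of these steps; paths and the build procedure are
examples and may need adjusting for your setup:

::

# run without any rules by exclusively loading an empty rule file
sudo suricata -c /etc/suricata/suricata.yaml --af-packet=eth0 -S /dev/null

# rule/packet profiling requires a build configured with profiling support;
# this adds overhead, so avoid leaving it enabled on production sensors
./configure --enable-profiling
make && sudo make install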


Traffic
-------

In most cases where the hardware is fast enough to handle the traffic but the
drop rate is still high, the cause lies in specific traffic issues.

First steps to check are:

- Check if the traffic is bidirectional; if it's mostly unidirectional you're missing relevant parts of the flow (see the **tshark** example at the bottom)
Member: Could also check if there is a big discrepancy between SYN vs SYN-ACKs and RSTs in the stats/eve logs.

- Check for encapsulated traffic; while GRE, MPLS etc. are supported, they can also lead to performance issues, especially if there are several layers of encapsulation
- Use tools like **iftop** to spot elephant flows. Flows with a rate of over 1 Gbit/s for a long time can keep one CPU core at 100% all the time and increase the drop rate, while it often doesn't make sense to dig deep into this traffic.
- If VLAN is used it might help to disable **vlan.use-for-tracking**, especially in scenarios where only one direction of the flow has the VLAN tag (see the configuration sketch after this list)
- If VLAN QinQ (IEEE 802.1ad) is used, be very cautious if you use **cluster_qm** in combination with Intel drivers. While the RFC expects ethertype 0x8100 and 0x88A8 in this case (see https://en.wikipedia.org/wiki/IEEE_802.1ad), most implementations only add 0x8100 on each layer. If the outer layer has the same VLAN tag but the inner layers have different VLAN tags, the packets will still end up in the same queue in **cluster_qm** mode.
Member: It should be mentioned what kernel level and specific Intel drivers (e.g. i40/ixgbe) this is observed under; it may not be true for all Intel drivers/all kernel versions. Mentioning af-packet might make it easier to differentiate the runmode used.

Member Author: I won't be able to test all old ones but I can at least add "up to version/firmware XY".

- Check for other unusual or complex protocols that aren't supported very well. In several cases we've seen that Cisco Fabric Path (ethertype 0x8903) causes performance issues. It's recommended to filter it; one option is a bpf filter with **not ether proto 0x8903** (see the command sketch after this list)
Member: A useful addition could be mentioning that bulk debug targeting could help in pinpointing an issue. For example, running Suricata with a bpf filter such as **port 80**, **port 25** or **not port 443** could help zero in on a problematic protocol or rules category, in combination with perf top for a specific cpu or thread.

Member Author: Will add a section for this as well.
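
Following the review suggestion on bulk debug targeting, a few hedged examples
of narrowing the traffic down with BPF filters; interface and ports are
placeholders:

::

# drop Cisco Fabric Path frames before Suricata processes them
sudo suricata -c /etc/suricata/suricata.yaml --af-packet=eth0 'not ether proto 0x8903'

# zero in on, or exclude, a suspected protocol while watching perf top and the stats
sudo suricata -c /etc/suricata/suricata.yaml --af-packet=eth0 'port 80'
sudo suricata -c /etc/suricata/suricata.yaml --af-packet=eth0 'not port 443'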


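For the VLAN related points in the list above, the relevant suricata.yaml
settings are sketched here; the interface name and cluster-id are
placeholders:

::

# suricata.yaml: don't use the VLAN id for flow tracking
vlan:
  use-for-tracking: false

# suricata.yaml: af-packet section using the cluster_qm mode discussed above
af-packet:
  - interface: eth0
    cluster-id: 99
    cluster-type: cluster_qm
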
Suricata also provides several traffic-related signatures in the rules folder
that can be enabled for testing to spot specific traffic issues.

If you want to use **tshark** to get an overview of the traffic direction, use this command:

::

sudo tshark -i $INTERFACE -q -z conv,ip -a duration:10

The output will show all flows seen within those 10 seconds. If one direction
shows 0 packets you have unidirectional traffic, which means you are missing,
for example, the ACK packets.
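
As suggested in one of the review comments above, a big discrepancy between
SYN and SYN-ACK (and RST) counters also points to missing traffic. A quick way
to compare them; the log paths are common defaults, **jq** is assumed to be
available, and counter names can vary between versions:

::

# compare the TCP SYN, SYN-ACK and RST counters in stats.log
grep -E 'tcp\.(syn|synack|rst)' /var/log/suricata/stats.log | tail

# or pull them from the EVE JSON stats events
tail -f /var/log/suricata/eve.json | jq 'select(.event_type=="stats") | .stats.tcp | {syn, synack, rst}'
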
1 change: 1 addition & 0 deletions doc/userguide/performance/index.rst
@@ -13,3 +13,4 @@ Performance
packet-profiling
rule-profiling
tcmalloc
debug