@AltraMayor
Owner

On rare occasions, the flow table of a GK instance may become corrupted. This has been observed on a production Gatekeeper server that had been running for a month without a reboot. The problem can be identified by the presence of log entries similar to the one below:

GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 1.1.1.1, and destination address 2.2.2.2

The IP addresses in the log entry above have been anonymized.
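
The text "No such file or directory" in the log entry is the C library's description of ENOENT, which is consistent with a hash-table delete that could not find the key (DPDK's rte_hash_del_key(), for example, returns -ENOENT in that case). Below is a minimal sketch of how such a message could be assembled, assuming the error arrives as a negative errno value; describe_hash_error() is a hypothetical helper, not a Gatekeeper function.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a negative-errno return value as "<what>: <strerror text>".
 * describe_hash_error() is a hypothetical helper for illustration only.
 */
static int
describe_hash_error(char *buf, size_t len, const char *what, int ret)
{
	/* ret is expected to be negative, e.g. -ENOENT. */
	return snprintf(buf, len, "%s: %s", what, strerror(-ret));
}
```

Calling describe_hash_error(buf, sizeof(buf), "delete", -ENOENT) on a typical Linux system fills buf with the same error text that appears in the log entry.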

Gatekeeper keeps running, but if corruption keeps happening, a reboot will eventually be required. Even if no further corruption occurs, the Gatekeeper log will fill up with warnings and errors.

This patch identifies corruption, heals the flow table, and logs information that will enable one to track down the source of the corruption. Once the flow table is healed, Gatekeeper keeps working normally, with no extra log entries besides the ones added during the healing process.

@AltraMayor added the labels "enhancement", "Production requirement" (Either the issue is solved, or Gatekeeper doesn't work in production), and "workaround available" (A temporary solution has been found) on Oct 25, 2021
@AltraMayor added this to the Version 1.1 milestone on Oct 25, 2021
Owner Author

@AltraMayor left a comment

gk: heal corrupted flow tables

Collaborator

@mengxiang0811 left a comment

Patch "lib/flow: improve print_flow_err_msg()" is ready for merge.

Collaborator

@mengxiang0811 left a comment

Patch "gk: improve print_flow_state()" is ready for merge.

Collaborator

@mengxiang0811 left a comment

Patch "gk: heal corrupted flow tables" is ready for merge.

Collaborator

@mengxiang0811 left a comment

Patch "gk: add an event to scan the keys of flow tables" is ready for merge.

Collaborator

@mengxiang0811 left a comment

Patch "gk: pass keys to test function of flush_flow_table()" is ready for merge.

Collaborator

@mengxiang0811 left a comment

Patch "gk: scan keys of the flow table" is ready for merge.

lib/flow: improve print_flow_err_msg()

Whenever print_flow_err_msg() could not convert one of
the IP addresses of a flow to a string, it would log the error,
but not log the error message in its parameter err_msg.

This patch makes print_flow_err_msg() use the string "<ERROR>"
for the IP address that cannot be converted to a string and log
the full error message.

Moreover, this patch adds the lcore information to the full error
message since flows depend on their GK instances to make sense.
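
The fallback described in this commit message can be sketched as follows. This is only an illustration under stated assumptions: struct demo_flow, addr_or_error(), and format_flow_err_msg() are hypothetical names, the lcore is passed in explicitly, and the message layout is modeled on the log entry quoted earlier rather than taken from Gatekeeper's source.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for a flow; not Gatekeeper's own flow structure. */
struct demo_flow {
	int family;             /* AF_INET or AF_INET6 */
	unsigned char src[16];  /* source address bytes */
	unsigned char dst[16];  /* destination address bytes */
};

/* Convert an address to text, or fall back to the string "<ERROR>". */
static const char *
addr_or_error(int family, const void *addr, char *buf, size_t len)
{
	if (inet_ntop(family, addr, buf, (socklen_t)len) == NULL)
		return "<ERROR>";
	return buf;
}

/*
 * Build the full message even when an address fails to convert.
 * The lcore is a plain parameter here (an assumption of this sketch).
 */
static int
format_flow_err_msg(char *out, size_t len, unsigned int lcore,
	const struct demo_flow *flow, const char *err_msg)
{
	char src[INET6_ADDRSTRLEN], dst[INET6_ADDRSTRLEN];

	return snprintf(out, len,
		"%s for the flow with IP source address %s, "
		"and destination address %s at lcore %u",
		err_msg,
		addr_or_error(flow->family, flow->src, src, sizeof(src)),
		addr_or_error(flow->family, flow->dst, dst, sizeof(dst)),
		lcore);
}
```

With an unsupported address family, inet_ntop() fails and both addresses are rendered as "<ERROR>", but err_msg and the lcore still reach the output.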

@AltraMayor force-pushed the flow-tb branch 2 times, most recently from bf06e22 to b274624 on November 12, 2021 16:33

gk: improve print_flow_state()

This patch makes print_flow_state() tolerant of corrupted
flow states and makes it log as much information
about a flow as possible.
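
One way to make a state printer tolerant of corruption is to treat any unrecognized state as data to report rather than as a reason to give up. The sketch below assumes a simplified, hypothetical entry layout (struct demo_entry and enum demo_flow_state are not Gatekeeper's definitions) and dumps the raw bytes of the entry when the state is unknown.

```c
#include <stdio.h>

/* Hypothetical, simplified flow states for illustration. */
enum demo_flow_state { DEMO_REQUEST, DEMO_GRANTED, DEMO_DECLINED };

struct demo_entry {
	int state;            /* may hold garbage if the entry is corrupted */
	unsigned int in_use;
};

/* Describe an entry; never refuses to print, whatever the state holds. */
static int
format_flow_state(char *out, size_t len, const struct demo_entry *e)
{
	switch (e->state) {
	case DEMO_REQUEST:
		return snprintf(out, len, "state: REQUEST, in_use: %u", e->in_use);
	case DEMO_GRANTED:
		return snprintf(out, len, "state: GRANTED, in_use: %u", e->in_use);
	case DEMO_DECLINED:
		return snprintf(out, len, "state: DECLINED, in_use: %u", e->in_use);
	default: {
		/* Unknown state: report it and dump the raw bytes. */
		const unsigned char *raw = (const unsigned char *)e;
		int n = snprintf(out, len, "state: UNKNOWN (%d), raw:", e->state);
		size_t i;

		for (i = 0; i < sizeof(*e) && n >= 0 && (size_t)n < len; i++)
			n += snprintf(out + n, len - n, " %02x", raw[i]);
		return n;
	}
	}
}
```

The raw dump gives whoever investigates the corruption something to work with even when the state field itself is meaningless.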

gk: heal corrupted flow tables

This patch enhances gk_del_flow_entry_from_hash() to detect when
a flow table is corrupted, to heal it, and to log information to
enable one to investigate the source of corruption.
Thus, Gatekeeper can still work while the investigation goes on.
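
The detect, heal, and log sequence can be illustrated with a toy table. This is a sketch under loud assumptions, not the logic of gk_del_flow_entry_from_hash(): the table layout, demo_index_del(), demo_del_entry(), and the needs_key_scan flag are all hypothetical. The idea shown is that a failed key delete is treated as evidence of corruption, the entry is released anyway, and the table is flagged for a later cleanup.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DEMO_SLOTS 4

/* Toy flow table: an index of keys plus a key copy kept in each entry. */
struct demo_table {
	unsigned int index_keys[DEMO_SLOTS];  /* keys as the index knows them */
	bool index_used[DEMO_SLOTS];
	unsigned int entry_keys[DEMO_SLOTS];  /* key copy stored in each entry */
	bool entry_in_use[DEMO_SLOTS];
	bool needs_key_scan;                  /* set once corruption is seen */
};

/* Delete a key from the index; returns -ENOENT if the key is absent. */
static int
demo_index_del(struct demo_table *t, unsigned int key)
{
	for (int i = 0; i < DEMO_SLOTS; i++) {
		if (t->index_used[i] && t->index_keys[i] == key) {
			t->index_used[i] = false;
			return i;
		}
	}
	return -ENOENT;
}

/* Returns 0 on a clean delete, 1 if corruption was found and handled. */
static int
demo_del_entry(struct demo_table *t, int pos)
{
	int ret = demo_index_del(t, t->entry_keys[pos]);

	t->entry_in_use[pos] = false;  /* always release the entry */
	if (ret >= 0)
		return 0;

	/* The entry's key is not in the index: log and flag the table. */
	fprintf(stderr, "corruption at position %d: %s\n", pos, strerror(-ret));
	t->needs_key_scan = true;
	return 1;
}
```

In this toy, the stale key left behind in the index is not removed immediately; the flag only records that a later pass over the keys is needed.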

gk: add an event to scan the keys of flow tables

When corruption is found in a flow table, the GK instance waits
for a full scan of expired entries before scanning the keys of
its flow table.
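
The "wait for a full scan, then scan the keys" ordering can be sketched with a countdown. The names below (struct demo_instance, demo_report_corruption(), demo_scan_step()) are hypothetical, and the sketch assumes the expiration scan visits one entry per step with a wrapping cursor; Gatekeeper's actual event mechanism may differ.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_instance {
	size_t table_size;
	size_t scan_cursor;         /* next entry to check for expiration */
	size_t entries_until_scan;  /* > 0 while waiting for a full pass */
	unsigned int key_scans;     /* number of key scans triggered so far */
};

/* Called when corruption is detected: require one full pass first. */
static void
demo_report_corruption(struct demo_instance *gk)
{
	gk->entries_until_scan = gk->table_size;
}

/* Check one entry for expiration; returns true when a key scan is due. */
static bool
demo_scan_step(struct demo_instance *gk)
{
	/* The expiration check of entry gk->scan_cursor would go here. */
	gk->scan_cursor = (gk->scan_cursor + 1) % gk->table_size;

	if (gk->entries_until_scan == 0)
		return false;  /* no corruption pending */
	if (--gk->entries_until_scan > 0)
		return false;  /* full pass not finished yet */

	gk->key_scans++;
	return true;
}
```

Counting entries rather than watching for the cursor to wrap ensures that every entry has been checked for expiration after the corruption was reported, wherever the cursor was at that moment.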

gk: pass keys to test function of flush_flow_table()

This patch should put less pressure on the processor cache when
scanning for network prefixes. But the real motivation for
this patch is to enable a future patch to check the health of
the flow table.
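
A test function that receives keys instead of flow entries might look like the sketch below. Everything here is hypothetical (struct demo_key, demo_key_test_t, demo_dst_in_prefix(), demo_flush()) and does not reproduce the signature of flush_flow_table(). The presumed cache benefit is that the predicate reads only the compact keys, so a flow entry is touched only when its key matches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_key { uint32_t src; uint32_t dst; };  /* host byte order */
struct demo_flow_entry { bool in_use; char state[56]; };
struct demo_prefix { uint32_t addr; uint32_t mask; };

/* The test function sees only the key, never the flow entry. */
typedef bool (*demo_key_test_t)(void *arg, const struct demo_key *key);

/* Does the destination address of the key fall inside the prefix? */
static bool
demo_dst_in_prefix(void *arg, const struct demo_key *key)
{
	const struct demo_prefix *p = arg;

	return (key->dst & p->mask) == (p->addr & p->mask);
}

/* Flush the in-use entries whose keys pass the test; returns the count. */
static size_t
demo_flush(const struct demo_key *keys, struct demo_flow_entry *entries,
	size_t n, demo_key_test_t test, void *arg)
{
	size_t flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!test(arg, &keys[i]))
			continue;  /* entries[i] is not touched */
		if (entries[i].in_use) {
			entries[i].in_use = false;
			flushed++;
		}
	}
	return flushed;
}
```

Because the callback works on keys, the same loop shape could also be reused to validate the keys themselves, which fits the stated motivation of checking the health of the flow table.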

gk: scan keys of the flow table

When corruption is found in the flow table and all flow entries
have been checked for expiration, scan the keys of the flow table
for invalid keys.
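
A scan for invalid keys could take the following shape. The validity criteria used here are assumptions made for illustration (the key's protocol field must be the IPv4 or IPv6 EtherType, and the entry at the key's position must be in use), and struct demo_slot, demo_key_is_valid(), and demo_scan_keys() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETHER_IPV4 0x0800  /* EtherType of IPv4 */
#define DEMO_ETHER_IPV6 0x86DD  /* EtherType of IPv6 */

struct demo_slot {
	bool key_present;   /* the index holds a key at this position */
	uint16_t proto;     /* protocol field stored in the key */
	bool entry_in_use;  /* the flow entry at this position is live */
};

/* Assumed criteria: known protocol and a live entry behind the key. */
static bool
demo_key_is_valid(const struct demo_slot *s)
{
	if (s->proto != DEMO_ETHER_IPV4 && s->proto != DEMO_ETHER_IPV6)
		return false;
	return s->entry_in_use;
}

/* Remove every invalid key; returns how many keys were removed. */
static size_t
demo_scan_keys(struct demo_slot *slots, size_t n)
{
	size_t removed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!slots[i].key_present || demo_key_is_valid(&slots[i]))
			continue;
		slots[i].key_present = false;
		slots[i].entry_in_use = false;
		removed++;
	}
	return removed;
}
```

Running the scan only after all entries have been checked for expiration means that keys of legitimately expired flows are already gone, so what the scan removes is more likely to be the product of corruption.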

@AltraMayor merged commit 35d8a17 into master on Nov 12, 2021
@AltraMayor deleted the flow-tb branch on November 12, 2021 17:47
@AltraMayor
Owner Author

The code of this pull request has been tested in production, but the problem that originally led to this pull request has not come up again. Today's merge prepares production environments for a future occurrence, so that we can identify the root cause and solve the problem down the line.
