gk: heal corrupted flow tables #531
AltraMayor
left a comment
gk: heal corrupted flow tables
mengxiang0811
left a comment
Patch lib/flow: improve print_flow_err_msg() is ready for merge.
mengxiang0811
left a comment
Patch gk: improve print_flow_state() is ready for merge.
mengxiang0811
left a comment
Patch gk: heal corrupted flow tables is ready for merge
mengxiang0811
left a comment
Patch gk: add an event to scan the keys of flow tables is ready for merge
mengxiang0811
left a comment
Patch gk: pass keys to test function of flush_flow_table() is ready for merge
mengxiang0811
left a comment
Patch gk: scan keys of the flow table is ready for merge
Whenever print_flow_err_msg() could not convert one of the IP addresses of a flow to a string, it would log the error, but not log the error message in its parameter err_msg. This patch makes print_flow_err_msg() use the string "<ERROR>" for the IP address that cannot be converted to a string and log the full error message. Moreover, this patch adds the lcore information to the full error message since flows depend on their GK instances to make sense.
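The fallback described above can be sketched in a few lines of C. This is only an illustration, not the code of the patch: `struct toy_flow`, `addr_or_error()`, and `format_flow_err_msg()` are hypothetical names, the flow is reduced to two IPv4 addresses plus an address family, and the message is written to a buffer instead of Gatekeeper's log.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified flow; the real flow structure differs. */
struct toy_flow {
	int af;              /* Address family handed to inet_ntop(). */
	struct in_addr src;
	struct in_addr dst;
};

/* Convert an address to text; use "<ERROR>" when the conversion fails. */
static const char *
addr_or_error(int af, const void *addr, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (inet_ntop(af, addr, buf, len) == NULL)
		snprintf(buf, len, "<ERROR>");
	return buf;
}

/*
 * Always emit the caller's err_msg together with the lcore and
 * whatever could be recovered from the flow's addresses.
 */
static int
format_flow_err_msg(char *out, size_t out_len, const struct toy_flow *flow,
	unsigned int lcore, const char *err_msg)
{
	char src[INET_ADDRSTRLEN];
	char dst[INET_ADDRSTRLEN];

	addr_or_error(flow->af, &flow->src, src, sizeof(src));
	addr_or_error(flow->af, &flow->dst, dst, sizeof(dst));
	return snprintf(out, out_len, "%s (lcore=%u, src=%s, dst=%s)",
		err_msg, lcore, src, dst);
}
```

The point of the design is that a failed address conversion no longer suppresses the caller's message: the unreadable address degrades to a placeholder while the rest of the entry is still logged.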
bf06e22 to
b274624
Compare
This patch makes print_flow_state() tolerant of corrupted flow states and makes it log as much information about a flow as possible.
This patch enhances gk_del_flow_entry_from_hash() to detect when a flow table is corrupted, to heal it, and to log information to enable one to investigate the source of corruption. Thus, Gatekeeper can still work while the investigation goes on.
When corruption is found in a flow table, the GK instance waits for a full scan of expired entries before scanning the keys of its flow table.
This patch should put less pressure on the processor cache when scanning for network prefixes. But the real motivation for this patch is to enable a future patch to check the health of the flow table.
When corruption is found in the flow table and all flow entries have been checked for expiration, scan the keys of the flow table for invalid keys.
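The ordering described in these commit messages (corruption detected, then a full pass of the expiration scan, then a scan of the keys) can be pictured as a small state machine. The sketch below is an assumption-laden illustration rather than the code of the patches: `enum heal_state`, `struct toy_instance`, and every function name are hypothetical, and the actual key validation is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical healing states of a GK instance. */
enum heal_state {
	HEAL_IDLE,             /* No corruption pending. */
	HEAL_WAIT_EXPIRE_SCAN, /* Waiting for a full expiration pass. */
	HEAL_SCAN_KEYS,        /* Keys of the flow table must be scanned. */
};

struct toy_instance {
	enum heal_state state;
	size_t table_size;  /* Number of slots in the flow table. */
	size_t expire_left; /* Slots left before the pass is complete. */
};

/* Called where corruption is detected, e.g. while deleting an entry. */
static void
note_corruption(struct toy_instance *inst)
{
	if (inst->state != HEAL_IDLE)
		return; /* Healing is already scheduled. */
	if (inst->table_size == 0) {
		inst->state = HEAL_SCAN_KEYS;
		return;
	}
	inst->state = HEAL_WAIT_EXPIRE_SCAN;
	inst->expire_left = inst->table_size;
}

/* Called once per slot visited by the periodic expiration scan. */
static void
expire_scan_visited_slot(struct toy_instance *inst)
{
	if (inst->state != HEAL_WAIT_EXPIRE_SCAN)
		return;
	if (--inst->expire_left == 0)
		inst->state = HEAL_SCAN_KEYS;
}

static bool
key_scan_pending(const struct toy_instance *inst)
{
	return inst->state == HEAL_SCAN_KEYS;
}

/* Called after all keys have been checked and invalid ones removed. */
static void
key_scan_finished(struct toy_instance *inst)
{
	assert(inst->state == HEAL_SCAN_KEYS);
	inst->state = HEAL_IDLE;
}
```

Counting a full table's worth of visited slots after the corruption is noticed ensures that every entry has been checked for expiration at least once before the key scan starts.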
The code of this pull request has been tested in production, but the problem that originally led to this pull request has not come up again. The merge today is going to prepare production environments for a future occurrence, so we can identify the root cause and solve the problem down the line.
On rare occasions, the flow table of a GK instance may be corrupted. This has been seen in a Gatekeeper server in production running for a month without a reboot. The problem can be identified with the presence of log entries similar to the one below:
The IP addresses in the log entry above have been anonymized.
Gatekeeper keeps running after the corruption, but if corruption keeps happening, a reboot will eventually be required. Even if no further corruption occurs, the log of Gatekeeper will be full of warnings and errors.
This patch identifies corruption, heals the flow table, and logs information that will enable one to track down the source of corruption. Once the flow table is healed, Gatekeeper will keep working normally and without extra log entries besides the ones added during the healing process.