Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kdump support #729

Merged
merged 6 commits into from
Jan 25, 2020
Merged

kdump support #729

merged 6 commits into from
Jan 25, 2020

Conversation

olivier-singla
Copy link
Contributor

In the event of a kernel crash, we need to gather as much information as possible to understand and identify the root cause of the crash. Currently, the kernel does not provide much information, which make kernel crash investigation difficult and time consuming.

Fortunately, there is a way in the kernel to provide more information in the case of a kernel crash. kdump is a feature of the Linux kernel that creates crash dumps in the event of a kernel crash. This PR will add kernel kdump support. Please note that there is another PR in sonic-utilities which is also needed:
sonic-net/sonic-buildimage#3722

An extension to the CLI utilities config and show is provided to configure and manage kdump:

  • view kdump status (enabled/disabled, active, configuration, stored crash files)
  • enable / disable kdump functionality
  • configure kdump (how many kernel crash logs can be saved, memory
    allocated for capture kernel)
  • view kernel crash logs

There is a design document which describes this kdump implementation:
sonic-net/SONiC#510

- What I did
Added kdump support

- How I did it
Please read HLD:
sonic-net/SONiC#510

- How to verify it
config kdump enable
config save
reboot
echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
show kdump
show kdump log 1 20

@lguohan
Copy link
Contributor

lguohan commented Nov 9, 2019

retest this please

show/main.py Show resolved Hide resolved
show/main.py Show resolved Hide resolved
show/main.py Show resolved Hide resolved
scripts/sonic-kdump-config Outdated Show resolved Hide resolved
scripts/sonic-kdump-config Outdated Show resolved Hide resolved
Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comments.

Also, I'm wondering whether we should move the kdump group under a "debug" group, where we can place all debug-related functionality that most network engineers would not use. (e.g., config debug kdump ..., and show debug kdump ...

config/main.py Outdated Show resolved Hide resolved
config/main.py Outdated
run_command("sonic-kdump-config --disable")

@kdump.command()
@click.option('-y', '--yes', is_flag=True, help="Force rebooting after enabling kdump")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comments need to be modified.

scripts/sonic-kdump-config Outdated Show resolved Hide resolved
scripts/sonic-kdump-config Outdated Show resolved Hide resolved
# @param verbose If True, the function will display a few additinal information
# @param no If True, will not display reboot needed message
# @return True is the grub configuration has changed, and False if it has not
def cmd_kdump_enable(verbose, no):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the parsing of grub.cfg and the logic be separated into different functions?
The idea would be facilitating the integration of other bootloaders than grub by adding an abstraction layer.

I would propose something like the following in the case of cmd_kdump_enable. Some other places in this file would need some changes too. If a cmd_function refers to grub rather than a function of bootloader then it's lacking genericity.

bootloader = get_bootloader() # this would return Grub() or Aboot() based on autodetection
if bootloader is None:
   print("Feature not supported on this platform")
   return False
image = get_current_image()
old_crashkernel = bootloader.get_crash_kernel( image )
new_crashkernel = "crashkernel: ..."
if old_crashkernel != new_crashkernel:
   bootloader.set_crash_kernel( image, new_crashkernel )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olivier-singla , since the sonic support multiple boot loader, not just grub. We have aboot, now we are working on u-boot. The concern here is that current code does not provide good abstraction for the community to add support for other types of boot load. Can you simplify the code logic and define the api. So that others can provide their own implementation for their own bootloader?

Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need better abstraction so that the community can add support for different types of boot loader.

Comment on lines +427 to +428
if os.path.exists(grub_cfg):
return kdump_disable_grub(verbose, kdump_enabled, memory, num_dumps)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better than the previous version but I still feel some abstraction is lacking.
Basically with this design for every operation each bootloader has to be added inside a if/elif/elif. This is redundant and unnecessary.
Both the kdump_disable_grub and kdump_config_next_grub functions have a mix of common logic tied with grub specific code. Adding a new bootloader to this file would likely end up with a lot of copy pasting.

Could we instead just rely on a get_bootloader() function? This function would hold the if/elif/elif that returns the Bootloader instance or raise UnknownBootloaderError.
Each bootloader would then override the method of the base class Bootloader to add support.

Adding a new Bootloader would now just be about adding a new class definition and changing 2 lines in get_bootloader` which is a much better design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olivier-singla , can you address them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure more abstraction will make things easier or simpler when adding support for another boot loader. There is little code duplicated between kdump_enable_grub, kdump_config_next_grub and cmd_kdump_disable_grub. grub and u-boot (for instance) use a different mechanism: grub use a flat text file, where u-boot use environment variables.

as possible to understand and identify the root cause of the crash.
Currently, the kernel does not provide much information, which make
kernel crash investigation difficult and time consuming.

Fortunately, there is a way in the kernel to provide more information
in the case of a kernel crash. kdump is a feature of the Linux kernel
that creates crash dumps in the event of a kernel crash. This PR
will add kernel kdump support.

An extension to the CLI utilities config and show is provided to
 configure and manage kdump:
 - enable / disable kdump functionality
 - configure kdump
 - view kernel crash logs
- remove some duplicated code
- when running kdump configuration command on Arista switch, display a message
  "Feature not supported on this platform"
Removed --no and --yes option on kdump enable and disable operations.
Restructured the code so it will be easier to support other bootloader than
grub (such as u-boot on arm64 platform).
@olivier-singla
Copy link
Contributor Author

retest please

@lguohan
Copy link
Contributor

lguohan commented Jan 24, 2020

retest this please

@lguohan lguohan merged commit 6babd1c into sonic-net:master Jan 25, 2020
VMCORE_FILE=/proc/vmcore
if [ -e $VMCORE_FILE -a -s $VMCORE_FILE ]; then
debug "We have a /proc/vmcore, then we just kdump'ed"
/sbin/reboot
Copy link
Collaborator

@stephenxs stephenxs Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that to call /sbin/reboot ahead of echoing user request reboot to /host/reboot-cause/reboot-cause.txt fails the reboot cause?
@jleveque Could you please take a look?

stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 28, 2022
e98a7af95a9767093904d9e8fd320067163d5f87 (HEAD -> 201911, origin/201911) [syncd] Translate removed RIDs in fdb notification (sonic-net#729)
3ceeae5371eee5b69064fa1af88f51e27caa2d36 [syncd] Process all cases fdb flush notification (sonic-net#726)
115ba0783edf85658fd0329eb23796d758c309f5 fix compile error when compiling with g++-4.8.4 (sonic-net#718)
a67f94d3d91325516069ef8c0d99bdec30bafbce Fix typo at SAI_ATTR_VALUE_TYPE_ACL_FIELD_DATA_UINT32 (sonic-net#662)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants