system hang during linuxki #17

Closed
dfiduk opened this issue Oct 23, 2018 · 15 comments

Comments

@dfiduk

dfiduk commented Oct 23, 2018

When I try to start runki, the system hangs. I have to restart the system through IPMI. There is no output on the console.
I found that the problem is related to Open vSwitch. When I stop openvswitch-switch.service, runki works without issues.

# uname -a
Linux ds1-cpu-01.ds1 4.13.0-31-generic #34~16.04.1-Ubuntu SMP Fri Jan 19 17:11:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# ovs-vsctl -V    
ovs-vsctl (Open vSwitch) 2.5.2
Compiled Oct 17 2017 16:38:57
DB Schema 7.12.1

# dpkg --status linuxki | grep Version
Version: 5.5-1
@MarkCRay
Contributor

Thank you for reporting this issue. To find the root cause, I'll have to set up a system to reproduce it. I haven't used Open vSwitch before, so it may take me some time.

I can update the runki script to quit if OpenVswitch is present, and print a warning.

@MarkCRay
Contributor

I have not been able to duplicate the issue so far. I thought there might be a conflict between the LinuxKI kernel module (likit.ko) and the openvswitch kernel module (openvswitch.ko). However, the 2.5.X version will not build the kernel module if installed on kernel versions newer than 4.3.

Could you provide more details on how openvswitch was built (i.e. ./configure; make; make install), how it was started, and whether I need to actually configure a VLAN as well or if just starting the service was sufficient?

@dfiduk
Author

dfiduk commented Oct 25, 2018

OpenvSwitch installed from official ubuntu repo.

# dpkg --status openvswitch-common | grep Version
Version: 2.5.2-0ubuntu0.16.04.3
# dpkg --status openvswitch-switch | grep Version
Version: 2.5.2-0ubuntu0.16.04.3

I start it through systemd: systemctl start openvswitch-switch.service
About the config: if there are no interfaces, everything is OK.

But if I add a bridge with a bond, runki hangs the node.

But if I add a bridge without a bond, everything is OK.

@MarkCRay
Contributor

MarkCRay commented Jan 2, 2019

Sorry it's taken me so long to reply. I was starting to get Ubuntu Xenial installed on a server and I noticed the hang occurred when running lsof, rather than during the actual trace collection.

You can try to execute "runki -p" to omit the lsof and some collections from /proc. You can also try to duplicate by just running lsof in the same manner that the runki script does and see if that hangs as well.

$ lsof -M -P -n

@dfiduk
Author

dfiduk commented Apr 24, 2019

$ lsof -M -P -n
This works as expected, without any problem.
"runki -p" still does not work...

@MarkCRay
Contributor

Can you download and try the latest version, LinuxKI 5.9? I did fix an issue that caused problems right after the likit.ko module was unloaded.

@dfiduk
Author

dfiduk commented Apr 24, 2019

Already tested. Nothing new...
Also I tried a new version of openvswitch - 2.5.5-0ubuntu0.16.04.2 - and still nothing new.

I think I am now facing a different problem, because it now hangs at the "spooling trace data to disk" stage.

@MarkCRay
Contributor

Thanks for trying. Unless I can duplicate the issue, or get a memory dump of the crash/hang, it's very tough to figure it out.

If you are interested, there are a few things you can try out to narrow down the issue.

You could try to capture only certain events or subsystems with runki using the "-e" or "-s" options. For example:

$ runki -e hardclock   << would only capture the hardclock trace events
$ runki -s syscalls    << would only capture the system call trace events
$ runki -s block       << would only capture block subsystem events

It would be interesting to know if the issue happens only with capturing certain subsystems or events.

@dfiduk
Author

dfiduk commented Apr 24, 2019

OK, we will try the proposed commands and check the results. I collected a crash dump, but I'm no guru in dump analysis. So if you are interested in analyzing our hang, you can get the dump (kernel with debug symbols plus the dump) from this link: https://webdav.digitalenergy.online/runki-crashdump.tar.gz

Thanks for help

@MarkCRay
Contributor

Unfortunately, I am having issues loading the crash dump, as crash gives me the following error:

crash: vmlinux-4.13.0-31-generic and dump.201904241450 do not match!

I'll try to pull the Ubuntu bits from their site for 16.04.1 and try to duplicate on a physical server.

@dfiduk
Author

dfiduk commented Apr 26, 2019

That's strange. For me, crash does not complain.
If I can help, please tell me.

@MarkCRay
Contributor

I have not been able to duplicate this issue yet. However, another customer reported a problem with LinuxKI on a version modified for Power servers. The problem was due to a change in the perf_callchain_entry structure. Prior to Linux version 4.7, it was defined as follows:

struct perf_callchain_entry {
        __u64 nr;
        __u64 ip[PERF_MAX_STACK_DEPTH];
};

With Linux version 4.7, the definition was changed to:

struct perf_callchain_entry {
        __u64 nr;
        __u64 ip[0];
};

The LiKI module used this structure to store stack trace information. On version 4.7 or later the structure is smaller, so writing stack entries into it corrupted whatever followed it in memory. However, none of my testing ever showed an issue; it only showed up on Power servers.

LinuxKI version 6.0 has fixed this. I hope this is related to the Openvswitch issue.

@dfiduk
Author

dfiduk commented Nov 24, 2019

Thanks for keeping me posted. I will check whether the problem still occurs as soon as possible.

@dfiduk
Author

dfiduk commented Jan 17, 2020

It seems the problem is really solved. We tried LinuxKI 6.0-1 from the deb package, and the node now runs openvswitch 2.11.1-2, built into a deb package from vanilla sources.
With the same openvswitch configuration, the problem does not occur.

Thanks a lot! Now we can use runki to analyze problems on our nodes! =)

@MarkCRay
Contributor

I'm glad to hear that everything is working now! If you have any questions, feel free to contact me at mark.ray@hpe.com.
