
Erasing GPT/MBR slowness #19

Closed
Sebastian-Roth opened this issue Oct 3, 2018 · 13 comments

@Sebastian-Roth
Member

commented Oct 3, 2018

As posted in the forums (1, 2, 3, 4, 5) we seem to have an issue where erasing old partition tables is taking a very long time (4-5 minutes compared to a couple of seconds).

So far we know that kernel 4.15.2 does not have the issue but 4.16.6 does. We have started testing and so far it looks like all 4.15.x kernels are fine.

Great find by @Quazz: https://www.clonezilla.org/downloads/stable/changelog.php

Clonezilla live 2.5.6-22
...

  • Downgrade the Linux kernel to 4.16.16-2 due to an issue of Linux kernel 4.17 that accesses local device very slow.

Though the kernel versions don't match exactly, it looks like others are seeing similar behavior as well.

@Sebastian-Roth

Member Author

commented Oct 3, 2018

Alright, in our tests it turns out that 4.16.4 introduced the issue. @Quazz tested all the versions and found that 4.15.3 through 4.16.3 were all working nicely and fast, while 4.16.4 and later kernels show the issue. Working on bisecting the kernel commits now. More tests will follow.

@Sebastian-Roth

Member Author

commented Oct 3, 2018

One interesting find by @Quazz is that the issue only triggers when a normal job is scheduled. Either running sgdisk -Z by hand or scheduling a debug task does not show the same behavior! But it happens reliably on normal tasks using 4.16.4 and newer kernels. So I started building kernel images for all the commits between 4.16.3 and 4.16.4 to figure out what exactly is causing this - changelog.
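For reference, the comparison we have been doing is essentially this; a minimal sketch, assuming /dev/sda is the target disk (destructive, so only on a scratch disk or VM):

# run by hand from a debug task console: returns within a couple of seconds
time sgdisk -Z /dev/sda
# the very same call issued from a normal scheduled task on 4.16.4+ sits for minutes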

@Quazz

Contributor

commented Oct 4, 2018

Doing this in order of testing:

001 - Slow
196 - Slow
100 - Slow
150 - Slow
050 - Slow
195 - Slow
025 - Slow

First commit (196) is this one torvalds/linux@e09070c I think.
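The manual binary search above is essentially what git bisect automates; a rough sketch, assuming a checkout of the stable kernel tree:

# mark the known-bad and known-good releases, then build/boot each suggested commit
git bisect start
git bisect bad v4.16.4      # first release showing the slow erase
git bisect good v4.16.3     # last known fast release
# after testing the erase step on the suggested commit, mark it:
git bisect good             # or: git bisect bad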

@Quazz

Contributor

commented Oct 4, 2018

Since Sebastian told me that he didn't experience the slowness on his Oracle VirtualBox VMs, I figured I'd play around with some settings to see if I could narrow down where to look.

When I disable IO-APIC (and thus only use one core), erasing goes fast again!

I thought the kernel parameter nosmp would therefore have the same effect, but no dice. That makes this even stranger to me!
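In case anyone wants to repeat that test, a small sketch, assuming nosmp can be appended to the client's kernel command line (e.g. via the host's kernel arguments):

# after booting with the extra parameter, confirm it took effect
cat /proc/cmdline   # should list nosmp among the kernel arguments
nproc               # should print 1 when only the boot CPU is up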

@Quazz

Contributor

commented Oct 5, 2018

Interesting stuff: with IO-APIC enabled again, erasing goes slow as expected.

However, if you wait 3-5 seconds for it to get 'stuck' and then press a key such as PageUp several times in a row (or other keys that produce odd output on the console; interestingly enough, Shift+PageUp seems to work too), the task manages to complete in a timely fashion.

I have no idea what that is about; maybe something to do with the console? Only when something changes on the console does the task manage to complete, which might also help explain why debug mode is fine.

Just a thought (unsubstantiated), but perhaps it's not the sgdisk command getting stuck; perhaps it hasn't even been invoked yet because the console output is 'still going'. I don't have anything to back that up, but it might be worth looking into.

edit: just tested and it is 100% sgdisk getting stuck. And it's sgdisk in general, not specifically sgdisk -Z

edit2: I used postinit scripts to replace sgdisk -Z with wipefs -a, which seems to work the same way and finishes instantly.

Exit codes might differ of course; here's the man page anyway, could be interesting:

https://linux.die.net/man/8/wipefs
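For anyone who wants to try the same swap, a rough sketch of what it boils down to, assuming /dev/sda (the postinit hook itself is FOG-specific, so only the commands are shown):

# sgdisk -Z /dev/sda        # original call: zap GPT and MBR structures (hangs on 4.16.4+)
wipefs -a /dev/sda          # wipe all filesystem and partition-table signatures instead
echo $?                     # exit codes may differ from sgdisk's, check before relying on them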

@Sebastian-Roth

Member Author

commented Oct 5, 2018

I got something horribly wrong when building those 196 kernel binaries. Not sure what exactly. Sorry for that!

Talking to @Quazz about it, I had the idea that this could be related to the kernel random number generator. It uses different kinds of user input as sources of entropy, and keyboard input is one of them. You can hit pretty much any key on the keyboard, and after 10-15 keystrokes it has gathered enough entropy and finishes straight away. This is also why we don't see the issue in debug mode: the entropy pool is presumably already filled up by the time we have typed in the commands.
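A quick way to check this theory on a hung client, assuming you can reach a console (debug task or another tty):

# watch the kernel entropy pool while the task hangs; on affected kernels it
# stays near zero until keystrokes top it up
watch -n1 cat /proc/sys/kernel/random/entropy_avail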

Now testing step by step...
26696cdda301830a16511391a3b1515c9b3b17fb (nr. 196) = fast
ab5860f5ce700bc4becc4d6abf01cc380c7ffe85 (nr. 035) = fast
1d0d9058215e75533f01fbb3db93621f142e1a3d (nr. 025) = slow
6efa23d5851f1702a3cddbdde63607ea6588b665 (nr. 032) = slow
89b59f050347d376c2ace8b1ceb908a218cfdc2e (nr. 033) = slow
cd8d7a5778a4abf76ee8fe8f1bfcf78976029f8d (nr. 034) = slow

So we found which kernel commit is causing this: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cd8d7a5778a4abf76ee8fe8f1bfcf78976029f8d

@leshik


commented Oct 10, 2018

Does any workaround exist for this issue when using new kernels?

@Quazz

Contributor

commented Oct 10, 2018

@leshik The hang is caused by a lack of available entropy for the RNG, which is used by the erasing step.

Currently you should either use kernel 4.15.2 (or older), or, if you have to use the newer kernels, generate entropy manually by pressing keys and moving the mouse at the station in question.

While we did find the kernel commit that causes this, I think it is best to patch the affected packages (or wait for an update to them). Preliminary testing looks good anyway.

@leshik


commented Oct 11, 2018

@Quazz does the patch exist already? I was unable to find any PRs related to crng in FOG.

@Sebastian-Roth

Member Author

commented Oct 11, 2018

@leshik It's not actually an RNG (random number generator) issue caused by FOG code but a combination of the kernel change (as posted above) and the Buildroot toolchain. We just figured out that newer Buildroot versions fix that issue for us (ref - search for Util-linux: Fix blocking on getrandom()). @Quazz and I already did test builds and the slowness issue is indeed fixed! But updating to the latest Buildroot version comes with some minor hurdles that we need to fix properly before releasing new Buildroot init files for FOG.
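For the curious, the blocking call can be observed directly; a sketch, assuming strace is available in the init and /dev/sda is a scratch disk:

# the hang shows up as a getrandom() call that does not return until the CRNG
# is initialized; the newer util-linux in Buildroot avoids that blocking path
strace -f -e trace=getrandom sgdisk -Z /dev/sda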

Coming soon!

@Sebastian-Roth

Member Author

commented Oct 13, 2018

Had to fix a couple of things when updating to the latest Buildroot environment, which took me a little while. The latest inits and kernels are now uploaded. To use them, run:

sudo -i
cd /var/www/fog/service/ipxe
mv bzImage bzImage.orig
wget https://fogproject.org/kernels/bzImage
mv bzImage32 bzImage32.orig
wget https://fogproject.org/kernels/bzImage32
mv init.xz init.xz.orig
wget https://fogproject.org/inits/init.xz
mv init_32.xz init_32.xz.orig
wget https://fogproject.org/inits/init_32.xz
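If anything goes wrong with the new files, the backups made above can simply be moved back:

cd /var/www/fog/service/ipxe
mv bzImage.orig bzImage
mv bzImage32.orig bzImage32
mv init.xz.orig init.xz
mv init_32.xz.orig init_32.xz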

Closing this issue as fixed now!

@Sebastian-Roth

Member Author

commented Apr 28, 2019

@Quazz Haha, seems like we are running into the same kind of issue with Buildroot 2019.02.1, but this time it's not util-linux's libuuid call but openssh's ssh-keygen hanging!!

Seems like this commit in openssh later made it into Buildroot, and now we see the same hang we had earlier on "Erasing GPT/MBR ..." (hanging on uuid generation).
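For reference, the hang can be reproduced on a freshly booted, entropy-starved client; a sketch, assuming the init creates host keys with ssh-keygen -A or similar:

# generate any missing host keys; on an entropy-starved machine this blocks
# the same way the uuid generation did
ssh-keygen -A
cat /proc/sys/kernel/random/entropy_avail   # typically very low while it hangs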

@Sebastian-Roth

Member Author

commented Apr 28, 2019

After an extensive debugging session I was able to find a nice solution to this problem. Fixed in 61abe48.

I tracked it down to where ssh-keygen calls RAND_status, and that call seems to block on some machines, maybe just virtual machines. Adding haveged works great to fill the entropy pool on bootup, so SSH key generation no longer hangs and the user doesn't have to add entropy via the keyboard either.
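In essence the fix amounts to something like this early in the init (a sketch, not the exact init code):

# start haveged so the entropy pool is topped up before any key generation runs
haveged -w 1024    # keep at least 1024 bits in the pool
# ssh host key generation and uuid calls no longer block afterwards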
