"Buffer I/O error" regression between kernels 4.11 and 4.12 #57

Open
alkisg opened this Issue Sep 29, 2017 · 10 comments

alkisg commented Sep 29, 2017

Using Ubuntu 16.04, with nbd-server and nbd-client version 1:3.13-1.

a) With mainline (vanilla) kernel 4.11.12 on the client, the commands below run fine without errors.
b) Then I upgrade to mainline kernel 4.12.0, and I run again:

modprobe nbd
nbd-client server-ip -N /opt/ltsp/i386 /dev/nbd5
dmesg

And I see the following errors:

[ 73.824873] nbd: registered device at major 43
[ 84.791001] nbd5: detected capacity change from 0 to 20936916992
[ 84.791071] block nbd5: Attempted send on invalid socket
[ 84.791077] blk_update_request: I/O error, dev nbd5, sector 0
[ 84.791080] Buffer I/O error on dev nbd5, logical block 0, async page read
<the 3 lines above repeated 10 times>
[ 84.791132] block nbd5: Attempted send on invalid socket
[ 84.791133] blk_update_request: I/O error, dev nbd5, sector 2
[ 84.791134] Buffer I/O error on dev nbd5, logical block 1, async page read
[ 84.791140] ldm_validate_partition_table(): Disk read failed.
[ 84.791175] Dev nbd5: unable to read RDB block 0
[ 84.791228] nbd5: unable to read partition table

I can reproduce this in many installations, real or VMs.
Thanks!

alkisg changed the title from "Buffer I/O error" regression between kernels 4.8 and 4.10 to "Buffer I/O error" regression between kernels 4.11 and 4.12 Sep 30, 2017

alkisg commented Sep 30, 2017

Ubuntu uses patched kernels with various backports, so the numbers were misleading.
I updated the issue description to reflect the mainline (vanilla) numbers instead.
Just for completeness, the issue happens:
In Ubuntu's kernels: after 4.8.0-58 and before 4.10.0-14
In mainline kernels: after 4.11.12 and before 4.12.0

I also saw that to make things 100% reproducible, I needed to call udevadm settle, so my test case now is:

modprobe nbd
udevadm settle
nbd-client server-ip -N /opt/ltsp/i386 /dev/nbd5
udevadm settle
dmesg | grep nbd
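
For checking several kernels in a row, the test case above could be wrapped in a small script that reports pass or fail. This is only a sketch, not something posted in this thread: the names `has_nbd_io_error`, `run_test`, `SERVER` and the `--run` switch are hypothetical, while the export name, the device and the matched error strings are taken from the commands and dmesg output above. It assumes it is run as root and that clearing the kernel log with `dmesg -C` is acceptable on the test machine.

```shell
#!/bin/sh
# Sketch of a pass/fail wrapper around the test case above.
# Hypothetical names: has_nbd_io_error, run_test, SERVER, --run.

SERVER="${SERVER:-server-ip}"   # placeholder, as in the original commands
EXPORT="/opt/ltsp/i386"
DEV="/dev/nbd5"

# Reads kernel log text on stdin; succeeds (exit 0) if it contains either
# of the two error signatures seen in this report.
has_nbd_io_error() {
    grep -q -e 'Attempted send on invalid socket' \
            -e 'Buffer I/O error on dev nbd'
}

run_test() {
    modprobe nbd
    udevadm settle
    dmesg -C                      # clear old messages so only this run is checked
    nbd-client "$SERVER" -N "$EXPORT" "$DEV"
    udevadm settle
    if dmesg | has_nbd_io_error; then
        echo "BAD: nbd I/O errors on $(uname -r)"
        result=1
    else
        echo "OK: no nbd I/O errors on $(uname -r)"
        result=0
    fi
    nbd-client -d "$DEV"          # disconnect again
    return "$result"
}

# Only touch the system when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    run_test
fi
```

Saved under any file name and run as root with `SERVER` set and `--run` as the only argument, it exits non-zero on an affected kernel, so the exit status can be used directly when comparing kernel builds.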

alkisg commented Sep 30, 2017

I also tested on Debian Stretch:

  • 4.9.0-3-686-pae: OK
  • 4.11.0-0.bpo.1-686-pae: OK
  • 4.12.0-0.bpo.2-686-pae: Has the problem

Owner

yoe commented Oct 1, 2017

Are you saying that the problem does not exist in the most recent kernels you could find? Or do I misunderstand you there?

alkisg commented Oct 2, 2017

Hi Wouter, let me phrase it better:

The problem started with kernel 4.12 and is still happening in the most recent kernel I could find, which was 4.14-rc2.

Owner

yoe commented Oct 3, 2017

Oh, okay then.

@josefbacik, any idea?

josefbacik commented

Oops, I'll take a look in the morning.

Oh, actually I think this is my timeout patch, which I fixed later. Can you try Linus' master?

alkisg commented Oct 6, 2017

Hi Josef, I tried the latest vanilla kernel from Ubuntu's daily builds and the problem still happens there:
http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/
cod/tip/daily/2017-10-05
42b76d0e6b1fe0fcb90e0ff6b4d053d50597b031

I.e. Linux master torvalds/linux@42b76d0

josefbacik commented

Alright, got it nailed down. Sorry about that: apparently all of my regression tests only exercise the netlink interface, except the one that checks that the ioctl and netlink interfaces behave with each other. I've submitted the patch

[PATCH] nbd: don't set the device size until we're connected

to fix it and cc'ed stable, so it'll make its way back to distro kernels. @yoe sorry, I thought I had updated the mailing list address, but I accidentally sent it to the old sf list. I've fixed my stuff so that won't happen again.

alkisg commented Oct 10, 2017

Thanks a lot! I'll try to detect when it lands on daily builds, so that I can test it.
