Bug in drbd 9.1.5 on CentOS 7 #26

Open
izyk opened this issue Feb 11, 2022 · 8 comments

izyk commented Feb 11, 2022

Hello.
I'm not sure it's a drbd problem, but after I upgraded the package kmod-drbd90-9.1.4 -> kmod-drbd90-9.1.5 from elrepo, I see an error in the messages log on an md raid.

My block stack is:
mdraid -> lvm -> drbd -> vdo -> lvm

I have trouble only with raid devices that use chunks (usually 512K), i.e. raid0 and raid10. With raid1 there is no problem.
Could you please give me a hint as to where the error could be?

Feb 11 02:48:58 arh kernel: md/raid10:md124: make_request bug: can't convert block across chunks or bigger than 512k 2755544 32
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: disk( UpToDate -> Failed )
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: Local IO failed in drbd_request_endio. Detaching...
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: local READ IO error sector 2752472+64 on dm-3
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: sending new current UUID: 3E82544B6FC832F1
Feb 11 02:48:59 arh kernel: drbd r1/0 drbd2: disk( Failed -> Diskless )
Feb 11 02:48:59 arh kernel: drbd r1/0 drbd2: Should have called drbd_al_complete_io(, 4294724168, 4096), but my Disk seems to have failed :(

After this, the primary works in diskless mode. If the primary is on raid1, everything works normally and the secondary stays UpToDate, even if the secondary is on raid0.

drbd90-utils-9.19.1-1.el7.elrepo.x86_64
kmod-drbd90-9.1.5-1.el7_9.elrepo.x86_64

I haven't tried reverting to kmod 9.1.4 yet, but with the previous kernel and 9.1.5 I get the same error.
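
For reference, this is roughly how the stack and the md chunk size can be checked (device names are examples from my setup and will differ):

# show the whole block stack with per-layer I/O limits
lsblk -t
# raid level and chunk size of the md array
mdadm --detail /dev/md124 | grep -Ei 'raid level|chunk'
# chunk size in bytes straight from sysfs
cat /sys/block/md124/md/chunk_size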

izyk commented Feb 14, 2022

With 9.1.4 everything is fine, as before: "Online verify done" without errors.
uname -a
Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

modinfo drbd
filename: /lib/modules/3.10.0-1160.49.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias: block-major-147-*
license: GPL
version: 9.1.4
description: drbd - Distributed Replicated Block Device v9.1.4
author: Philipp Reisner <phil@linbit.com>, Lars Ellenberg <lars@linbit.com>
retpoline: Y
rhelversion: 7.9
srcversion: DC4A3A79803F1566C1F7ABE
depends: libcrc32c
vermagic: 3.10.0-1160.el7.x86_64 SMP mod_unload modversions
signer: The ELRepo Project (http://elrepo.org): ELRepo.org Secure Boot Key
sig_key: F3:65:AD:34:81:A7:B2:0E:34:27:B6:1B:2A:26:63:5B:83:FE:42:7B
sig_hashalgo: sha256
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: minor_count:Approximate number of drbd devices (1-255) (uint)
parm: usermode_helper:string
parm: protocol_version_min:drbd_protocol_version
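
For anyone who wants to do the same, the revert is essentially just a package downgrade; something along these lines (the exact elrepo package versions may differ):

# see which kmod builds elrepo still offers
yum --showduplicates list kmod-drbd90
# go back to the 9.1.4 kmod
yum downgrade kmod-drbd90-9.1.4
# reload the module (or simply reboot)
drbdadm down all
rmmod drbd_transport_tcp drbd
modprobe drbd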

Jaybus2 commented Mar 8, 2022

I can confirm this problem with kmod-drbd90-9.1.5-1.el7_9.elrepo.x86_64 on both kernel-3.10.0-1160.53.1 and kernel-3.10.0-1160.49.1, and also that it occurs only for raid10 mdraid devices. The previous 9.1.4 kmod works with both of those kernels as well as the latest kernel-3.10.0-1160.59.1. No errors are logged for the raid10 device when using the 9.1.4 kmod, only when using 9.1.5.

JoelColledge (Contributor) commented

Does this issue also occur with 9.1.6? Does it occur when you build DRBD from a release tarball instead of using the elrepo packages?
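
For reference, building from a release tarball is roughly the following (version and paths are only an example; adjust KDIR to your kernel):

tar xf drbd-9.1.6.tar.gz
cd drbd-9.1.6
make KDIR=/lib/modules/$(uname -r)/build
make install
# swap out the currently loaded module for the freshly built one
rmmod drbd_transport_tcp drbd
modprobe drbd
modinfo drbd | grep ^version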

mitzone commented Jun 7, 2022

TL;DR: DRBD 9.1.7 is a no-go on CentOS 7 when using raid10 md arrays as the underlying drbd disk.

Just want to state that I have the exact same issue. My cluster failed after upgrading from 9.0.x (I don't know the exact version) to 9.1.7, running on the 3.10.0-1160.66.1.el7.x86_64 kernel.
I get this in syslog when using a raid10 MD array for drbd (it works with raid1 or linear, and also works when running directly on the disks, i.e. no md raid):

[ 1996.269915] drbd storage: Committing cluster-wide state change 2875711901 (0ms)
[ 1996.269930] drbd storage: role( Secondary -> Primary )
[ 1996.269933] drbd storage/0 drbd0: disk( Inconsistent -> UpToDate )
[ 1996.270004] drbd storage/0 drbd0: size = 32 GB (33532892 KB)
[ 1996.286479] drbd storage: Forced to consider local data as UpToDate!
[ 1996.288057] drbd storage/0 drbd0: new current UUID: 2BDAB23564612AE9 weak: FFFFFFFFFFFFFFFD
[ 2010.790278] md/raid10:md200: make_request bug: can't convert block across chunks or bigger than 512k 33530432 256
[ 2010.790307] drbd storage/0 drbd0: disk( UpToDate -> Failed )
[ 2010.790359] drbd storage/0 drbd0: Local IO failed in drbd_request_endio. Detaching...
[ 2010.790455] drbd storage/0 drbd0: local WRITE IO error sector 33530432+512 on md200
[ 2010.792848] drbd storage/0 drbd0: disk( Failed -> Diskless )
[ 2277.261791] drbd storage: Preparing cluster-wide state change 391032618 (1->-1 3/2)

I also tried compiling it from source; same issue.

Tried it on Rocky Linux 8, where it works like a charm.
I could not find a way to sign the kernel module on Rocky 8, so I disabled UEFI Secure Boot, as described here:
https://askubuntu.com/questions/762254/why-do-i-get-required-key-not-available-when-install-3rd-party-kernel-modules

Writing this in case anyone encounters the same issue; I hope they won't lose 20 hours of debugging as I did.
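
For anyone who would rather keep Secure Boot enabled, the usual MOK-based signing procedure is roughly the following (key names and paths are just examples):

# create a signing key and enroll it as a Machine Owner Key (confirm on next reboot)
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
  -subj "/CN=local drbd module signing/" \
  -keyout MOK.priv -outform DER -out MOK.der
mokutil --import MOK.der
# after the reboot, sign whatever drbd modules were installed
for m in $(find /lib/modules/$(uname -r) -name 'drbd*.ko'); do
  /usr/src/kernels/$(uname -r)/scripts/sign-file sha256 MOK.priv MOK.der "$m"
done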

Jaybus2 commented Jul 27, 2022

I'm still getting the "make_request bug: can't convert block across chunks or bigger than 512k" error with CentOS 7 kernel 3.10.0-1160.71.1 and the kernel module from DRBD 9.1.7 when the backing store is an LVM logical volume whose PV is an md RAID device. The 9.1.7 kmod shows the same error with several previous versions of the CentOS kernel as well. By contrast, the 9.1.4 kmod works with all CentOS kernels since at least 3.10.0-1160.49.1.

Btw, other LVs in that same VG (also on the same md RAID10 PV) have no issues; only the LVs being used as DRBD backing storage are affected, and only with DRBD > 9.1.4. Something in the newer DRBD kmods breaks md devices. I see no LVM messages, only the md error, which of course leads to the DRBD messages about moving from UpToDate to Failed to Diskless.

Jaybus2 commented Dec 21, 2022

Update: This issue still persists in 9.1.12. The issue always starts with an md raid10 error:
md/raid10:md200: make_request bug: can't convert block across chunks or bigger than 512k 33530432 256
It does not happen when the storage for the DRBD device is on md raid1, only for md raid10. I am not set up to test any other raid levels.

Info on the DRBD device (and the underlying LVM and md raid10 devices) causing the issue for me is below. Note that other LVs on this same raid10 PV that are locally mounted or used for iSCSI (i.e. not used for DRBD backing storage) work just fine. Also note that the raid10 device chunk size is 512K.

[root@cnode3 drbd.d]# uname -r
3.10.0-1160.71.1.el7.x86_64

[root@cnode3 drbd.d]# cat r13_access_home.res
resource drbd_access_home {
  meta-disk internal;
  on cnode3 {
    node-id 0;
    device /dev/drbd13 minor 13;
    disk /dev/vg_b/lv_access_home;
    address ipv4 10.0.99.3:7801;
  }
  on cnode2 {
    node-id 1;
    device /dev/drbd13 minor 13;
    disk /dev/vg_b/lv_access_home;
    address ipv4 10.0.99.2:7801;
  }
}

[root@cnode3 ~]# lvdisplay /dev/vg_b/lv_access_home
--- Logical volume ---
LV Path /dev/vg_b/lv_access_home
LV Name lv_access_home
VG Name vg_b
LV UUID jiTZLD-CGmp-x9W3-AxcF-kcjH-GWgW-DsDJ0I
LV Write Access read/write
LV Creation host, time cnode3, 2017-09-06 14:33:06 -0400
LV Status available
# open 2
LV Size 350.00 GiB
Current LE 89600
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 4096
Block device 253:18

[root@cnode3 ~]# pvdisplay /dev/md125
--- Physical volume ---
PV Name /dev/md125
VG Name vg_b
PV Size <3.64 TiB / not usable 4.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 953799
Free PE 159175
Allocated PE 794624
PV UUID CY3lVe-5E4f-fIS5-RmMv-HYI1-nH0Z-Iil0Ig

[root@cnode3 ~]# mdadm -D /dev/md125
/dev/md125:
Version : 1.2
Creation Time : Wed Jan 13 12:12:53 2016
Raid Level : raid10
Array Size : 3906764800 (3.64 TiB 4.00 TB)
Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

 Intent Bitmap : Internal

   Update Time : Wed Dec 21 10:13:45 2022
         State : active, checking
Active Devices : 4

Working Devices : 4
Failed Devices : 0
Spare Devices : 0

        Layout : near=2
    Chunk Size : 512K

Consistency Policy : bitmap

  Check Status : 33% complete

          Name : cnode3:3  (local to host cnode3)
          UUID : 1204deeb:5393b7c0:7630ffc9:b6f7d835
        Events : 921962

Number   Major   Minor   RaidDevice State
   0       8       17        0      active sync set-A   /dev/sdb1
   1       8        1        1      active sync set-B   /dev/sda1
   2       8       33        2      active sync set-A   /dev/sdc1
   3       8       49        3      active sync set-B   /dev/sdd1
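
For completeness, the per-layer request-size limits can be read like this (device names as above; dm-18 is the 253:18 block device from the lvdisplay output):

# chunk size of the raid10 array, in bytes
cat /sys/block/md125/md/chunk_size
# largest request each layer will accept, in KB
grep . /sys/block/md125/queue/max_sectors_kb \
       /sys/block/dm-18/queue/max_sectors_kb \
       /sys/block/drbd13/queue/max_sectors_kb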

Jaybus2 commented Mar 1, 2023

This bug still persists in 9.1.13, with a caveat. As of 9.1.13 it works with an md raid10 backing device as long as the DRBD device is secondary, and resync works at startup. However, when the DRBD device is made primary, the same error persists. Tested on CentOS 7 with the latest kernel 3.10.0-1160.83.1.el7. Kernel log messages:

Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: Preparing cluster-wide state change 3028602266 (1->-1 3/1)
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: State change 3028602266: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: Committing cluster-wide state change 3028602266 (0ms)
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: role( Secondary -> Primary )
Mar 1 08:43:39 cnode2 kernel: md/raid10:md127: make_request bug: can't convert block across chunks or bigger than 256k 448794880 132
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: disk( UpToDate -> Failed )
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: Local IO failed in drbd_request_endio. Detaching...
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: local READ IO error sector 29362432+264 on ffff9fcff9a389c0
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: sending new current UUID: 9C66E258C0F9F361
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: disk( Failed -> Diskless )
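
Roughly, the failing path can be exercised like this (resource and device names are from my earlier comments; the dd is just an example of a large direct read):

drbdadm primary drbd_access_home
# any sufficiently large I/O on the promoted device may then trigger the md error
dd if=/dev/drbd13 of=/dev/null bs=1M count=256 iflag=direct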

josedev-union commented Sep 27, 2023

I faced a similar issue with 9.1.16.
The backing device is an md raid0 device in my case.
After I detached the device from the primary node and attached it again, the same error occurred.
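
For reference, the standard drbdadm detach/attach sequence (resource name is a placeholder) is:

drbdadm detach <resource>
drbdadm attach <resource>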
