
lmt 3.2.10 displays INACTIVE for most targets #53

Open
LaHaine opened this issue Dec 11, 2019 · 13 comments

@LaHaine
Contributor

LaHaine commented Dec 11, 2019

I've upgraded from 3.2.7 to 3.2.10, and now all but one OST display the message "INACTIVE 0s remaining" instead of the current statistics. Only one OST went through recovery and shows status: COMPLETE in recovery_status; all the others show INACTIVE, even though they are mounted and working fine.

I'm running Lustre 2.10.8.

@ofaaland
Contributor

That's the same LMT and Lustre version combination we are running on my test system. For the working OST and one of the others, please post the output of the following:
(1) systemctl status cerebrod
(2) lmtmetric -m ost

thanks

@LaHaine
Contributor Author

LaHaine commented Dec 12, 2019

Here's the requested output:

[miscoss14] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:56 CET; 2min 53s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 124811 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 124821 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─124830 /usr/sbin/cerebrod

Dez 12 08:30:56 miscoss14.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: Starting cerebrod...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: [  OK  ]
Dez 12 08:30:56 miscoss14.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: MODULE DIR = /usr...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss14] /root # lmtmetric -m ost
ost: 2;miscoss14.example.com;0.929274;98.051369;fs23-OST0002;106652818;111858688;61314606412;113584425328;111942704533504;119326363060308;287094948;247;18067;0;4;41806;131;COMPLETE 115/115 0s remaining;
[miscoss13] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:55 CET; 3min 54s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 11451 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 11462 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─11471 /usr/sbin/cerebrod

Dez 12 08:30:55 miscoss13.example.com systemd[1]: Stopped LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: Starting cerebrod:...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: [  OK  ]
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: MODULE DIR = /usr/...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss13] /root # lmtmetric -m ost
ost: 2;miscoss13.example.com;0.936406;97.671312;fs23-OST0001;106654593;111858688;60841203828;113584425328;109235123597312;122235912992865;274506509;247;17829;0;6;44786;16;INACTIVE  0s remaining;

@ofaaland
Contributor

ofaaland commented Dec 12, 2019

Hi,
So that "INACTIVE" is coming from the recovery_status file. On those two OSS nodes, please provide the contents of that file, like this

$ find /proc/fs/lustre/ -name recovery_status | xargs cat 
status: COMPLETE
recovery_start: 1576042037
recovery_duration: 74
completed_clients: 124/124
replayed_requests: 0
last_transno: 1129576398848
VBR: DISABLED
IR: DISABLED

@LaHaine
Contributor Author

LaHaine commented Dec 13, 2019

Sure:

[miscoss13] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: INACTIVE
[miscoss14] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: COMPLETE
recovery_start: 1572261173
recovery_duration: 72
completed_clients: 115/115
replayed_requests: 6
last_transno: 12885434153
VBR: DISABLED
IR: ENABLED

@ofaaland
Contributor

It looks to me like that means 0 clients have connected to fs23-OST0001.

Can you check one of your Lustre client nodes with "lfs check osts" and confirm whether those same two OSTs appear? I suspect fs23-OST0002 will report "active" and fs23-OST0001 will either be missing or report "inactive".
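
A quick way to compare just those two OSTs from a client (a sketch; it only assumes lfs is available on the client):

$ lfs check osts | grep -E 'OST0001|OST0002'

Each matching line should report the state of that OST as seen by the client.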

If they both say "active", please post the following output so we can compare those two OSTs:

find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done

Thanks

@LaHaine
Contributor Author

LaHaine commented Dec 17, 2019

All OSTs appear just fine on the clients.

Here's the output of your command on the OSS:

[miscoss13] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0001/num_exports 247
[miscoss14] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0002/num_exports 247


@ofaaland
Contributor

Are the servers and clients both Lustre 2.10.8?

@LaHaine
Contributor Author

LaHaine commented Feb 19, 2020

I think there was a single 2.12.3 client; all the others were 2.10.8.

@ofaaland
Contributor

ofaaland commented Mar 5, 2020

Have these targets (MDTs and OSTs, on the server nodes) ever, in their lifetime, been unmounted and then re-mounted?

I just created a new Lustre 2.12.4 file system from scratch and observed the same behavior you describe: after the targets have been mounted for the first time, the recovery_status file just says "status: INACTIVE". After unmounting and mounting again, the recovery_status files have the expected content.
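
For anyone who wants to check this on their own system, a rough sketch on an OSS (the mount point /mnt/ost0 and device /dev/sdb are hypothetical placeholders for your own target):

cat /proc/fs/lustre/obdfilter/*/recovery_status   # only "status: INACTIVE" after the first-ever mount
umount /mnt/ost0
mount -t lustre /dev/sdb /mnt/ost0                # remount the same target
cat /proc/fs/lustre/obdfilter/*/recovery_status   # now shows status, recovery_start, completed_clients, etc.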

@LaHaine
Contributor Author

LaHaine commented Mar 9, 2020

I can't say for sure, but I guess they have been mounted several times already.

@defaziogiancarlo
Contributor

There is a related (and possibly the same) issue at https://jira.whamcloud.com/browse/LU-14930

@ofaaland
Contributor

ofaaland commented Nov 1, 2024

Another user, @alvaromartin990, ran into this recently. Some background in case it helps him or others who end up here:

When a Lustre target (e.g. MDT0000) starts, it connects with clients and goes through a process called "recovery", which handles the case where a Lustre target failed and was restarted. An example might be that MDT0000 is hosted on server "lustre1", which lost power and was then powered back on, after which MDT0000 started again. There may have been in-progress I/Os at the time, and the clients and servers must ensure that any such I/Os either landed on disk or are replayed. During this "recovery" process, the server accepts no new I/O requests and allows no new clients to mount the file system.

After clients and servers have synchronized state, the lustre target exits recovery and resumes normal operation.

Lustre reports this to sysadmins and to tools like LMT via the recovery_status procfile. Here's an example of typical contents for a target that has exited recovery:

$ cat /proc/fs/lustre/mdt/lflood-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1730246008
recovery_duration: 67
completed_clients: 99/99
replayed_requests: 0
last_transno: 1090964840825
VBR: DISABLED
IR: ENABLED

LMT tries to let sysadmins know when a Lustre target is in recovery, so they understand why normal operation isn't occurring.

For LMT users running into this issue, however, the recovery_status file for one or more targets contains "status: INACTIVE" even though the target is clearly up and not in recovery (it's allowing new clients to mount the FS and handling new I/O requests).

This seems to me like a Lustre bug, and https://jira.whamcloud.com/browse/LU-14930 did result in patches for Lustre 2.12 (never merged), 2.14 (never merged), and 2.15 (landed).

That said, it would be good if LMT could work around this behavior for sites running Lustre versions before 2.15.
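
As a stopgap on those older versions, something along these lines run on an OSS would flag the affected targets. This is a minimal sketch using only the procfiles already shown in this thread, not LMT code:

for dir in /proc/fs/lustre/obdfilter/*/; do
    tgt=$(basename "$dir")
    status=$(awk '/^status:/ {print $2}' "$dir/recovery_status")
    exports=$(cat "$dir/num_exports")
    # A target reporting INACTIVE while clients are connected is showing the stale state described above.
    if [ "$status" = "INACTIVE" ] && [ "$exports" -gt 0 ]; then
        echo "$tgt: recovery_status says INACTIVE but $exports exports are connected"
    fi
done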

@alvaromartin990

Hi @ofaaland,

Thank you so much for your explanation; it makes sense now. It sounds like the best way to proceed with this is to update to Lustre 2.15. I will definitely consider doing so, since I'd like to use all LMT features.

Thanks!
