
lmt 3.2.10 displays INACTIVE for most targets #53

Open
LaHaine opened this issue Dec 11, 2019 · 13 comments

@LaHaine
Contributor

LaHaine commented Dec 11, 2019

I've upgraded from 3.2.7 to 3.2.10, and now all but one OST display the message "INACTIVE 0s remaining" instead of the current statistics. Only one OST went through recovery and shows status: COMPLETE in recovery_status; all the others show INACTIVE, even though they are mounted and working fine.

I'm running Lustre 2.10.8.

@ofaaland
Contributor

That's the same LMT and Lustre version combination we are running on my test system. For the working OST and one of the others, please post the output of the following:
(1) systemctl status cerebrod
(2) lmtmetric -m ost

thanks

@LaHaine
Contributor Author

LaHaine commented Dec 12, 2019

Here's the requested output:

[miscoss14] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:56 CET; 2min 53s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 124811 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 124821 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─124830 /usr/sbin/cerebrod

Dez 12 08:30:56 miscoss14.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: Starting cerebrod...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: [  OK  ]
Dez 12 08:30:56 miscoss14.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: MODULE DIR = /usr...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss14] /root # lmtmetric -m ost
ost: 2;miscoss14.example.com;0.929274;98.051369;fs23-OST0002;106652818;111858688;61314606412;113584425328;111942704533504;119326363060308;287094948;247;18067;0;4;41806;131;COMPLETE 115/115 0s remaining;
[miscoss13] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:55 CET; 3min 54s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 11451 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 11462 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─11471 /usr/sbin/cerebrod

Dez 12 08:30:55 miscoss13.example.com systemd[1]: Stopped LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: Starting cerebrod:...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: [  OK  ]
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: MODULE DIR = /usr/...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss13] /root # lmtmetric -m ost
ost: 2;miscoss13.example.com;0.936406;97.671312;fs23-OST0001;106654593;111858688;60841203828;113584425328;109235123597312;122235912992865;274506509;247;17829;0;6;44786;16;INACTIVE  0s remaining;

@ofaaland
Contributor

ofaaland commented Dec 12, 2019

Hi,
So that "INACTIVE" is coming from the recovery_status file. On those two OSS nodes, please provide the contents of that file, like this

$ find /proc/fs/lustre/ -name recovery_status | xargs cat 
status: COMPLETE
recovery_start: 1576042037
recovery_duration: 74
completed_clients: 124/124
replayed_requests: 0
last_transno: 1129576398848
VBR: DISABLED
IR: DISABLED

@LaHaine
Contributor Author

LaHaine commented Dec 13, 2019

Sure:

[miscoss13] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: INACTIVE
[miscoss14] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: COMPLETE
recovery_start: 1572261173
recovery_duration: 72
completed_clients: 115/115
replayed_requests: 6
last_transno: 12885434153
VBR: DISABLED
IR: ENABLED

@ofaaland
Contributor

It looks to me like that means 0 clients have connected to fs23-OST0001.

Can you check one of your Lustre client nodes with "lfs check osts" and confirm whether those same two OSTs appear? I suspect fs23-OST0002 will report "active" and fs23-OST0001 will either be missing or report "inactive".
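
A quick way to compare just those two OSTs from a client (a sketch; it only assumes lfs is available on the client):

$ lfs check osts | grep -E 'OST0001|OST0002'

Each matching line should report the state of that OST as seen by the client.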

If they both say "active", please post the following output so we can compare those two OSTs:

find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done

Thanks

@LaHaine
Contributor Author

LaHaine commented Dec 17, 2019

All OSTs appear just fine on the clients.

Here's the output of your command on the OSS:

[miscoss13] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0001/num_exports 247
[miscoss14] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0002/num_exports 247


@ofaaland
Contributor

Are the servers and clients both Lustre 2.10.8?

@LaHaine
Contributor Author

LaHaine commented Feb 19, 2020

I think there was a single 2.12.3 client; all the others were 2.10.8.

@ofaaland
Contributor

ofaaland commented Mar 5, 2020

Have these targets (MDTs and OSTs, on the server nodes) ever, in their lifetime, been unmounted and then re-mounted?

I just created a new Lustre 2.12.4 file system from scratch and observed the same behavior you describe: after the targets have been mounted for the first time, the recovery_status file just says "status: INACTIVE". After unmounting and mounting again, the recovery_status files have the expected content.
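
For anyone who wants to check this on their own system, a rough sketch on an OSS (the mount point /mnt/ost0 and device /dev/sdb are hypothetical placeholders for your own target):

cat /proc/fs/lustre/obdfilter/*/recovery_status   # only "status: INACTIVE" after the first-ever mount
umount /mnt/ost0
mount -t lustre /dev/sdb /mnt/ost0                # remount the same target
cat /proc/fs/lustre/obdfilter/*/recovery_status   # now shows status, recovery_start, completed_clients, etc.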

@LaHaine
Contributor Author

LaHaine commented Mar 9, 2020

I can't say for sure, but I guess they have been mounted several times already.

@defaziogiancarlo
Contributor

There is a related (and possibly the same) issue at https://jira.whamcloud.com/browse/LU-14930

@ofaaland
Contributor

ofaaland commented Nov 1, 2024

Another user, @alvaromartin990, ran into this recently. Some background in case it helps him or others who end up here:

When a Lustre target (e.g. MDT0000) starts, it connects with clients and goes through a process called "recovery", which handles the case where a Lustre target failed and was restarted. An example might be that MDT0000 is hosted on server "lustre1", which lost power and was then powered back on, after which MDT0000 started again. There may have been in-progress I/Os at the time, and the clients and servers must ensure that any such I/Os either landed on disk or are replayed. During this "recovery" process, the server accepts no new I/O requests and allows no new clients to mount the file system.

After clients and servers have synchronized state, the lustre target exits recovery and resumes normal operation.

Lustre reports this to sysadmins and to tools like LMT via the recovery_status procfile. Here's an example of typical contents for a target that has exited recovery:

$ cat /proc/fs/lustre/mdt/lflood-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1730246008
recovery_duration: 67
completed_clients: 99/99
replayed_requests: 0
last_transno: 1090964840825
VBR: DISABLED
IR: ENABLED

LMT tries to let sysadmins know when a Lustre target is in recovery, so they understand why normal operation isn't occurring.

For LMT users running into this issue, however, the recovery_status file for one or more targets contains "status: INACTIVE" even though the target is clearly up and not in recovery (it's allowing new clients to mount the FS and handling new I/O requests).

This seems to me like a Lustre bug, and https://jira.whamcloud.com/browse/LU-14930 did result in patches for Lustre 2.12 (never merged), 2.14 (never merged), and 2.15 (landed).

That said, it would be good if LMT could work around this behavior for sites running Lustre versions before 2.15.
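
As a stopgap on those older versions, something along these lines run on an OSS would flag the affected targets. This is a minimal sketch using only the procfiles already shown in this thread, not LMT code:

for dir in /proc/fs/lustre/obdfilter/*/; do
    tgt=$(basename "$dir")
    status=$(awk '/^status:/ {print $2}' "$dir/recovery_status")
    exports=$(cat "$dir/num_exports")
    # A target reporting INACTIVE while clients are connected is showing the stale state described above.
    if [ "$status" = "INACTIVE" ] && [ "$exports" -gt 0 ]; then
        echo "$tgt: recovery_status says INACTIVE but $exports exports are connected"
    fi
done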

@alvaromartin990

Hi @ofaaland,

Thank you so much for your explanation; it makes sense now. It sounds like the best way to proceed with this is to update to Lustre 2.15. I will definitely consider doing so, since I'd like to use all LMT features.

Thanks!
