Add NVMe support #32

Closed
Rohlik opened this issue Apr 16, 2019 · 13 comments

@Rohlik
Contributor

Rohlik commented Apr 16, 2019

Hello,
The NVMe interface type is not currently supported, but NVMe drives are becoming popular.
My suggestion is to add an nvme option to the -i parameter.

@Napsty
Owner

Napsty commented Apr 16, 2019

Good idea. Could you please share the smartctl -a output of an NVMe drive?

@Rohlik
Contributor Author

Rohlik commented Apr 17, 2019

Of course.

[root@fooo ~]# smartctl --all /dev/nvme1n1
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.5.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQLW960HMJP-00003
Serial Number:                      S35XNX0KA02248
Firmware Version:                   CXV8601Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 960 197 124 096 [960 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          960 197 124 096 [960 GB]
Namespace 1 Utilization:            803 477 762 048 [803 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Apr 17 10:10:54 2019 CEST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4 092 023 [2,09 TB]
Data Units Written:                 3 044 102 [1,55 TB]
Host Read Commands:                 31 971 434
Host Write Commands:                23 782 181
Controller Busy Time:               128
Power Cycles:                       25
Power On Hours:                     2 114
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      9
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               32 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x001b  0x4004  0x028            0     0     -
  1          8     0  0x001b  0x4004  0x028            0     0     -
  2          7     0  0x001b  0x4004  0x028            0     0     -
  3          6     0  0x001b  0x4004  0x028            0     0     -
  4          5     0  0x001b  0x4004  0x028            0     0     -
  5          4     0  0x001b  0x4004  0x028            0     0     -
  6          3     0  0x001b  0x4004  0x028            0     0     -
  7          2     0  0x001b  0x4004  0x028            0     0     -
  8          1     0  0x001b  0x4004  0x028            0     0     -

@Napsty
Owner

Napsty commented Jun 6, 2019

Here's the smartctl output of another NVMe:

# smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.18.5-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       UCS-SDHPCIE 800GB
Serial Number:                      STM0001BAE33
Firmware Version:                   KMCCP108
PCI Vendor ID:                      0x1c58
PCI Vendor Subsystem ID:            0x1137
IEEE OUI Identifier:                0x000cca
Controller ID:                      414
Number of Namespaces:               1
Namespace 1 Size/Capacity:          800,166,076,416 [800 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000cca 00602ba300
Local Time is:                      Thu Jun  6 08:06:40 2019 UTC
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,667,851 [853 GB]
Data Units Written:                 5,430,405 [2.78 TB]
Host Read Commands:                 11,553,415
Host Write Commands:                23,371,696
Controller Busy Time:               89
Power Cycles:                       92
Power On Hours:                     6,563
Unsafe Shutdowns:                   80
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged

And another one:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.75] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       UCS-SDHPCIE 1.6TB
Serial Number:                      CJH00100C4C9
Firmware Version:                   KMCCP105
PCI Vendor ID:                      0x1c58
PCI Vendor Subsystem ID:            0x1137
IEEE OUI Identifier:                0x000cca
Controller ID:                      415
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,600,321,314,816 [1.60 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Jun  6 04:23:50 2019 EDT
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000
 4 -    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,394,509 [713 GB]
Data Units Written:                 173,186,724 [88.6 TB]
Host Read Commands:                 131,022,884
Host Write Commands:                11,448,977,782
Controller Busy Time:               51,941
Power Cycles:                       43
Power On Hours:                     18,856
Unsafe Shutdowns:                   40
Media and Data Integrity Errors:    0
Error Information Log Entries:      3

Error Information (NVMe Log 0x01, max 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          3     1  0x00f7  0xdead  0x000            0     0  0xf7
  1          2     1  0x00f2  0xdead  0x000            0     0  0xf2
  2          1     1  0x0109  0xdead  0x000            0     0  0x09

@Napsty Napsty self-assigned this Jun 6, 2019
@Napsty
Owner

Napsty commented Jun 6, 2019

The SMART attributes of an NVMe drive are exposed in the log page with identifier 02h.
This information is based on the current NVMe specification (https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3d-2019.03.20-Ratified.pdf)

Attributes worth checking for NVMe devices

Critical Warning

This field indicates critical warnings for the state of the controller. Each bit corresponds to a critical warning type; multiple bits may be set. If a bit is cleared to ‘0’, then that critical warning does not apply. Critical warnings may result in an asynchronous event notification to the host. Bits in this field represent the current associated state and are not persistent.

Bit 0: If set to ‘1’, then the available spare capacity has fallen below the threshold.
Bit 1: If set to ‘1’, then a temperature is above an over temperature threshold or below an under temperature threshold.
Bit 2: If set to ‘1’, then the NVM subsystem reliability has been degraded due to significant media related errors or any internal error that degrades NVM subsystem reliability.
Bit 3: If set to ‘1’, then the media has been placed in read only mode.
Bit 4: If set to ‘1’, then the volatile memory backup device has failed. This field is only valid if the controller has a volatile memory backup solution.

So, to my current understanding, a value of 0x00 means everything is OK so far and this NVMe does not have a memory backup. A value of 0x10 would mean serious degradation of the device.
Nope, that's wrong. I'm trying to find the relevant information or specs on what the value would actually look like. I believe I have found two cases so far:

Any hint in the right direction to understand how the bits are actually set and how this represents the final value would be much appreciated!

Update: Yes! Seems I found it in the smartmontools source code: https://github.com/smartmontools/smartmontools/blob/e3fdde7aff4cd069e629ee987bf33ac8ccd621ad/smartmontools/nvmeprint.cpp#L300

These are the possible values for attribute Critical Warning as of now:

  • 0x01 = available spare has fallen below threshold
  • 0x02 = temperature is above or below threshold
  • 0x04 = NVM subsystem reliability has been degraded
  • 0x08 = media has been placed in read only mode
  • 0x10 = volatile memory backup device has failed
  • any bit outside 0x1f = unknown critical warning(s)

But what I still don't understand is what if multiple errors happen at the same time. E.g. available spare (0x01) and temperature threshold (0x02). Would that result in 0x03? I have nowhere seen any example like this.
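
A minimal sketch of how such a value could be decoded, assuming the field is a plain bitmask as the spec text quoted above suggests ("multiple bits may be set"). Under that reading, simultaneous conditions are OR-ed together, so available spare (0x01) plus temperature (0x02) would indeed show up as 0x03. The function and constant names are hypothetical and not taken from check_smart or smartmontools; treating bits outside 0x1f as "unknown" is an assumption based on my reading of the linked smartmontools source.

```python
# Bit meanings copied from the list above; the names here are illustrative only.
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
}
KNOWN_MASK = 0x1F  # bits 0 to 4 are the ones listed above


def decode_critical_warning(value):
    """Return the warning messages encoded in a Critical Warning byte."""
    messages = [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]
    # Assumption: any remaining bit in the byte is reported as unknown.
    unknown = value & ~KNOWN_MASK & 0xFF
    if unknown:
        messages.append("unknown critical warning(s) (0x%02x)" % unknown)
    return messages


for raw in (0x00, 0x03, 0x10):
    print("0x%02x -> %s" % (raw, decode_critical_warning(raw)))
```

An empty list corresponds to the 0x00 seen in all three smartctl outputs in this thread.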

According to the source code, smartctl itself will already report a fail on the self-assessment check (step 1 in check_smart). In this case we could skip this attribute and focus on the other ones with performance data.

Available Spare

Contains a normalized percentage (0 to 100%) of the remaining spare capacity available.

This means that as soon as the value drops below 100%, the device is slowly wearing out. This is an important indicator to see when a device will likely be "too old/too used" and needs to be replaced.

Percentage Used

Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life.

Not sure yet if this should be counted in.

Media and Data Integrity Errors

Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field.

Probably the most important attribute to be checked. Similar to "bad sectors" of a hard drive.

Error Information Log Entries

Contains the number of Error Information log entries over the life of the controller.

Not sure yet, however this could be a helpful hint to see increasing issues on a device.

Performance data to be collected

All attributes except "Critical Warning"

  • Temperature
  • Available Spare
  • Percentage Used
  • Data Units Read
  • Data Units Written
  • Host Read Commands
  • Host Write Commands
  • Controller Busy Time
  • Power Cycles
  • Power On Hours
  • Unsafe Shutdowns
  • Media and Data Integrity Errors
  • Error Information Log Entries
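
A rough sketch of how these attributes could be pulled out of the smartctl text, assuming the layout shown in the outputs earlier in this thread. Note that the samples use different thousands separators (spaces in "4 092 023", commas in "1,667,851"), so both are stripped here. The function names are hypothetical and this is not the parser used in check_smart.pl. The byte conversion relies on the NVMe convention that data units are reported in thousands of 512-byte blocks.

```python
import re


def parse_nvme_health(smartctl_output):
    """Map attribute names in the SMART/Health section to integer values."""
    values = {}
    in_health = False
    for line in smartctl_output.splitlines():
        if line.startswith("SMART/Health Information"):
            in_health = True
            continue
        if in_health and not line.strip():
            break  # a blank line ends the section in the sample outputs
        if not in_health or ":" not in line:
            continue
        name, _, raw = line.partition(":")
        raw = raw.split("[")[0]  # drop the human-readable "[2.09 TB]" suffix
        match = re.match(r"\s*(0x[0-9a-fA-F]+|\d[\d ,]*)", raw)
        if not match:
            continue
        token = match.group(1)
        if token.startswith("0x"):
            values[name.strip()] = int(token, 16)
        else:
            # Remove space or comma thousands separators before converting.
            values[name.strip()] = int(re.sub(r"[ ,]", "", token))
    return values


def data_units_to_bytes(units):
    """One data unit is 1000 blocks of 512 bytes."""
    return units * 1000 * 512


# Lines taken from the outputs above, deliberately mixing both separator styles.
sample = """SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Data Units Read:                    4 092 023 [2,09 TB]
Data Units Written:                 5,430,405 [2.78 TB]

Error Information (NVMe Log 0x01, max 64 entries)
"""
health = parse_nvme_health(sample)
print(health)
print(data_units_to_bytes(health["Data Units Read"]))
```

The byte count for the first drive's "Data Units Read" comes out at roughly 2.09 × 10^12, which agrees with the "[2,09 TB]" that smartctl prints next to it.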

@roben

roben commented Mar 12, 2020

Hi, is there any news on this? Can I offer help with something?

@Napsty
Owner

Napsty commented Mar 13, 2020

@roben I have the code "in my mind" already, but I need a system with an NVMe drive to test on. Would anyone be willing to give me remote access to a system that has an NVMe drive? Contact me on https://www.claudiokuenzler.com/about/.

@roben

roben commented Mar 13, 2020

Sorry, I only have company servers available, and I can't provide access to those.

I stumbled upon this, though: https://github.com/thomas-krenn/check_smart_attributes#NVMedevices
It seems to do similar checks and already supports NVMEs, so maybe it can help to confirm your ideas.

@Napsty
Owner

Napsty commented Mar 17, 2020

Working on it. Someone got me remote access to a server with NVMe.

@Napsty
Owner

Napsty commented Mar 17, 2020

@roben

roben commented Mar 19, 2020

Thanks! It looks good:

/usr/lib/nagios/plugins/check_nrpe -H xxx -c check_smart_nvme_all
OK: [/dev/nvme0] - Device is clean --- [/dev/nvme1] - Device is clean|

with

command[check_smart_nvme_all]=/usr/local/.../check_smart.pl -g "/dev/nvme[0-9]" -i nvme

It's hard to test for the faulty drive case, though, because they are all working fine.

@Napsty
Owner

Napsty commented Mar 19, 2020

@roben Thanks for testing. I just pushed another important change (regex adjusted). Can you test again with the newest version from the nvme branch please:

https://raw.githubusercontent.com/Napsty/check_smart/nvme/check_smart.pl

Please also run a check against a single NVMe drive if you can, to see whether the performance data appears correctly (it worked on the server I got access to).

@roben

roben commented Mar 19, 2020

Here's the single device check:

./check_smart.pl -d /dev/nvme0 -i nvme
OK: Drive  KXG60ZNV1T02 TOSHIBA S/N 89CS10Z1T0RM: no SMART errors detected. |Temperature=42 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=0 Data_Units_Read=13608073 Data_Units_Written=6004240 Host_Read_Commands=3734157080 Host_Write_Commands=41754653 Controller_Busy_Time=684 Power_Cycles=6 Power_On_Hours=2906 Unsafe_Shutdowns=2 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=0 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=42

The output for the multi device check was the same as above.
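
For reference, the performance data labels in the output above look like the smartctl attribute names with every space replaced by an underscore (which is why "Warning  Comp. Temperature Time", with its two spaces, becomes "Warning__Comp._Temperature_Time"). A small sketch of that formatting, inferred only from this output and not from the plugin source; the function name is hypothetical.

```python
def format_perfdata(values):
    """Build a Nagios-style perfdata string from attribute name/value pairs."""
    # Each space becomes one underscore, so double spaces give double underscores.
    return " ".join(
        "%s=%d" % (name.replace(" ", "_"), value) for name, value in values.items()
    )


attributes = {
    "Temperature": 42,
    "Available Spare": 100,
    "Warning  Comp. Temperature Time": 0,
}
print("OK: no SMART errors detected. |" + format_perfdata(attributes))
```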

@Napsty
Owner

Napsty commented Mar 25, 2020

NVMe support officially released with 6.7.0.

@Napsty Napsty closed this as completed Mar 25, 2020