
[Feature]: Extend metrics being evaluated in the plugin #17

Open
1 task done
martialblog opened this issue Feb 9, 2024 · 6 comments
@martialblog (Member):

Please try to fill out as much of the information below as you can. Thank you!

  • Yes, I've searched similar issues on GitHub and didn't find any.

Which version contains the bug?

3.1.0

Describe the bug

The usage output might be wrong and/or confusing. Maybe we should just use the percentage that the btrfs output shows.

Output from the plugin:

'Data,RAID1': 11.66842% used (0.42458TB/3.6387TB), 
'Metadata,RAID1': 0.01452% used (0.00053TB/3.6387TB), 
'System,RAID1': 0.0% used (0.0TB/3.6387TB)

Output from btrfs filesystem usage

Data,RAID1: Size:469225177088, Used:466829971456 (99.49%)
   /dev/sda     469225177088
   /dev/sdb     469225177088

Metadata,RAID1: Size:2147483648, Used:580681728 (27.04%)
   /dev/sda     2147483648
   /dev/sdb     2147483648

System,RAID1: Size:41943040, Used:81920 (0.20%)
   /dev/sda       41943040
   /dev/sdb       41943040
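Computing both percentages side by side shows where the numbers diverge: the plugin divides each profile's used bytes by the total device size, while btrfs divides by the profile's own allocated size. A minimal sketch (variable names are mine; the byte values are taken from the outputs in this issue):

```python
# Raw byte values from `btrfs filesystem usage -b /tank` above.
device_size = 4000797868032   # Overall "Device size"
data_size = 469225177088      # Data,RAID1 allocated size
data_used = 466829971456      # Data,RAID1 used bytes

# What the plugin currently reports: used vs. total device size.
plugin_pct = 100 * data_used / device_size
print(f"plugin: {plugin_pct:.5f}% used")  # ~11.66842%

# What `btrfs filesystem usage` reports: used vs. allocated size.
btrfs_pct = 100 * data_used / data_size
print(f"btrfs:  {btrfs_pct:.2f}% used")   # ~99.49%
```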

Full output:

check_disk_btrfs --btrfs-path /usr/bin/btrfs -V /tank --no-sudo -w 100 -c 100 -v --missing --error
/usr/bin/btrfs filesystem usage -b /tank
Label: Data,RAID1, Total: 469225177088, Used: 466829971456
Label: Metadata,RAID1, Total: 2147483648, Used: 580730880
Label: System,RAID1, Total: 41943040, Used: 81920
/usr/bin/btrfs filesystem show /tank
/usr/bin/btrfs scrub stat /tank
OK: 'Data,RAID1': 11.66842% used (0.42458TB/3.6387TB), 'Metadata,RAID1': 0.01452% used (0.00053TB/3.6387TB), 'System,RAID1': 0.0% used (0.0TB/3.6387TB) | dataraid1_used=466829971456;100;100;; dataraid1_total=4000797868032;100;100;; metadataraid1_used=580730880;100;100;; metadataraid1_total=4000797868032;100;100;; systemraid1_used=81920;100;100;; systemraid1_total=4000797868032;100;100;;
/usr/bin/btrfs filesystem usage -b /tank
Overall:
    Device size:                     4000797868032
    Device allocated:                 942829207552
    Device unallocated:              3057968660480
    Device missing:                              0
    Device slack:                                0
    Used:                             934821470208
    Free (estimated):                1531379535872      (min: 1531379535872)
    Free (statfs, df):               1531378462720
    Data ratio:                               2.00
    Metadata ratio:                           2.00
    Global reserve:                      485900288      (used: 0)
    Multiple profiles:                          no

Data,RAID1: Size:469225177088, Used:466829971456 (99.49%)
   /dev/sda     469225177088
   /dev/sdb     469225177088

Metadata,RAID1: Size:2147483648, Used:580681728 (27.04%)
   /dev/sda     2147483648
   /dev/sdb     2147483648

System,RAID1: Size:41943040, Used:81920 (0.20%)
   /dev/sda       41943040
   /dev/sdb       41943040

Unallocated:
   /dev/sda     1528984330240
   /dev/sdb     1528984330240

See also #13

How to recreate the bug?

No response

@martialblog added the bug and needs-triage labels Feb 9, 2024
@martialblog self-assigned this Feb 9, 2024
@martialblog removed the needs-triage label Feb 9, 2024
@martialblog (Member, Author):

I think showing the same percentage as the filesystem usage would be sufficient.

Example:
OK: 'Data,RAID1': 99.49% used, 'Metadata,RAID1': 27.04% used, 'System,RAID1': 0.20% used | dataraid1_used=4668299...

@bratkartoffel any thoughts?

@martialblog added this to the 3.1.1 milestone Feb 9, 2024
@bratkartoffel:

Yes, that's what I would expect

@martialblog (Member, Author):

First)

Had a look at the code a bit. The unallocated and no-unallocated flags are strange indeed.

Setting the unallocated flag will use the Device size for all usage percentage calculations instead of data, metadata, system separately. The flag is also set to true by default, so there was no way to unset it, since no-unallocated is unused.

So removing or reworking the no-unallocated and unallocated flags might be an idea.

Second)

I think the df output and btrfs filesystem usage cannot be directly compared.

But I have to read up on btrfs and the intention of this plugin (I didn't write it; it was ported from an older Perl-based script, from what I can tell).

Right now, I guess the intention was to compare these usage percentages against the given warning/critical thresholds:

Data,RAID1: Size:469225177088, Used:466829971456 (99.49%)
Metadata,RAID1: Size:2147483648, Used:580681728 (27.04%)
System,RAID1: Size:41943040, Used:81920 (0.20%)

So -w 80 -c 90 with 99.49% would give you a CRITICAL. But I could be wrong.
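That threshold comparison could be sketched like this (check_usage is a hypothetical helper for illustration, not the plugin's actual code):

```python
def check_usage(pct: float, warning: float, critical: float) -> str:
    """Map a usage percentage to a Nagios-style state."""
    if pct >= critical:
        return "CRITICAL"
    if pct >= warning:
        return "WARNING"
    return "OK"

# With -w 80 -c 90 and the percentages btrfs reports:
print(check_usage(99.49, 80, 90))  # Data     -> CRITICAL
print(check_usage(27.04, 80, 90))  # Metadata -> OK
print(check_usage(0.20, 80, 90))   # System   -> OK
```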

@martialblog removed this from the 3.1.1 milestone Feb 9, 2024
@martialblog (Member, Author):

@RincewindsHat as discussed, I think we need to take a closer look at the intentions of this plugin. Feel free to chime in

@bratkartoffel commented Feb 9, 2024:

You're right, the question is what you expect this plugin to do. When I look at the 99.49% usage for Data, the only way to use this plugin sensibly is with -w 100 -c 100, as any other option would always lead to WARNING or even CRITICAL.

I'd like to monitor the disk and scrub status; the free disk space can also be monitored by the generic check_disk provided by the https://github.com/monitoring-plugins/monitoring-plugins collection.

But maybe I don't understand or misinterpret the output of the btrfs program, as I'm new to btrfs (been using it for just 3 months now). I also need to dig into the gory details of the Metadata / System / Data split done by the btrfs folks and what it is really for.

// Edit: After some reading (mainly):

(for reference here an excerpt description for the reported values from the btrfs doc)

Device size -- sum of raw device capacity available to the filesystem, note that this may not be the same as the total device size (the difference is accounted as slack)

Device allocated -- sum of total space allocated for data/metadata/system profiles, this also accounts space reserved but not yet used for extents

Device unallocated -- the remaining unallocated space for future allocations (difference of the above two numbers)

Device missing -- sum of capacity of all missing devices

Device slack -- sum of slack space on all devices (difference between entire device size and the space occupied by filesystem)

Used -- sum of the used space of data/metadata/system profiles, not including the reserved space

Free (estimated) -- approximate size of the remaining free space usable for data, including currently allocated space and estimating the usage of the unallocated space based on the block group profiles, the min is the lower bound of the estimate in case multiple profiles are present

Free (statfs, df) -- the amount of space available for data as reported by the statfs/statvfs syscall, also returned as Avail in the output of df. The value is calculated in a different way and may not match the estimate in some cases (e.g. multiple profiles).

Data ratio -- ratio of total space for data including redundancy or parity to the effectively usable data space, e.g. single is 1.0, RAID1 is 2.0 and for RAID5/6 it depends on the number of devices

Metadata ratio -- ditto, for metadata

Global reserve -- portion of metadata currently used for global block reserve, used for emergency purposes (like deletion on a full filesystem)

So 99%+ usage for any of the types (metadata, system or data) is not an issue unless there is no more unallocated space left. Or did I miss something?

So I'd expect this plugin to check the following metrics:

  1. Verify the "global reserve" usage is below the configured threshold (ideally the usage should be 0? maybe a new flag?)
  2. Verify the "device size" vs "device allocated" ratio against the configured threshold (as a hint for disk usage?)
  3. Verify "device missing" is 0 (indicates missing drives), maybe via a new flag?
  4. Verify "data ratio" and "metadata ratio" are the same; ideally they should match the configured RAID level. May be different / lower than expected during resilver / balance operations?
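The four proposed checks could be sketched roughly as follows (the BtrfsOverall structure and evaluate function are hypothetical; a real implementation would parse these values from `btrfs filesystem usage -b`):

```python
from dataclasses import dataclass

@dataclass
class BtrfsOverall:
    """Values from the 'Overall:' section (field names are hypothetical)."""
    device_size: int
    device_allocated: int
    device_missing: int
    global_reserve_used: int
    data_ratio: float
    metadata_ratio: float

def evaluate(o: BtrfsOverall, alloc_warn_pct: float, expected_ratio: float) -> list:
    problems = []
    # 1. Global reserve should ideally be unused.
    if o.global_reserve_used > 0:
        problems.append("global reserve in use")
    # 2. Allocated vs. device size as a hint for overall disk usage.
    if 100 * o.device_allocated / o.device_size >= alloc_warn_pct:
        problems.append("allocation ratio above threshold")
    # 3. Missing devices indicate a degraded array.
    if o.device_missing > 0:
        problems.append("devices missing")
    # 4. Ratios should match the configured RAID level (e.g. 2.0 for RAID1).
    if o.data_ratio != expected_ratio or o.metadata_ratio != expected_ratio:
        problems.append("unexpected data/metadata ratio")
    return problems

# Values from the output in this issue: allocation is only ~23.6%, so all OK.
overall = BtrfsOverall(device_size=4000797868032, device_allocated=942829207552,
                       device_missing=0, global_reserve_used=0,
                       data_ratio=2.0, metadata_ratio=2.0)
print(evaluate(overall, alloc_warn_pct=80, expected_ratio=2.0))  # []
```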

@martialblog (Member, Author):

Thanks for the additional info. I also found our old article about the plugin: https://www.netways.de/blog/2013/02/01/uberwachung-von-btrfs-filesystemen/

So the byte conversion stuff is due to the --raw option not being present yet!

I agree with your suggestion for metrics and that we might need new flags.
The current --missing flag checks for missing devices.

Next step would be to agree on a CLI design. Example:

check_disk_btrfs 
# new flags
  --global-reserve-warning Global reserve is below the configured threshold 
  --global-reserve-critical

  --device-allocation-ratio-warning  Device size - allocated ratio against the configured threshold
  --device-allocation-ratio-critical

# current flags
  -m, --missing-devices
  --volume Path to the Btrfs volume
  --btrfs-path Specify the btrfs path to the executable
  --sudo-path Specify the sudo path to the executable

The "data ratio" and "metadata ratio" verification could be implicit or enabled with a flag; I'm not sure what's best here.
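For illustration, the proposed CLI could be wired up with argparse like this (a sketch only; the new flag names mirror the proposal above and do not exist in any released version):

```python
import argparse

parser = argparse.ArgumentParser(prog="check_disk_btrfs")
# Proposed new flags
parser.add_argument("--global-reserve-warning", type=float,
                    help="WARNING if global reserve usage exceeds this")
parser.add_argument("--global-reserve-critical", type=float)
parser.add_argument("--device-allocation-ratio-warning", type=float,
                    help="WARNING if allocated/size ratio exceeds this")
parser.add_argument("--device-allocation-ratio-critical", type=float)
# Current flags
parser.add_argument("-m", "--missing-devices", action="store_true")
parser.add_argument("--volume", help="Path to the Btrfs volume")
parser.add_argument("--btrfs-path", help="Path to the btrfs executable")
parser.add_argument("--sudo-path", help="Path to the sudo executable")

args = parser.parse_args(["--global-reserve-warning", "0",
                          "--volume", "/tank", "-m"])
print(args.volume, args.missing_devices)  # /tank True
```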

@RincewindsHat any thoughts?

@martialblog added the feature label and removed the bug label Feb 23, 2024
@martialblog changed the title from "[Bug]: Confirm usage information calculation is correct" to "[Feature]: Extend metrics being evaluated in the plugin" Feb 23, 2024