Fix FsInfo device deduplication #94744
Pinging @elastic/es-data-management (Team:Data Management)
Hi @joegallo, I've created a changelog YAML for you.
Note: the splash damage on this is pretty small -- as far as I understand, it only applies when using multiple data paths on a single drive. It is related to #24472 only in the absolutely loosest sense -- I was investigating that ticket and happened across this behavior. That ticket is about mistaken de-duplication at the cluster level when multiple nodes share an IP address but have independent storage. This PR doesn't help there at all.
I like it but I don't have enough information to approve it. Sorry. I'm a bit terrified. It really seems right.
Another note to myself for myself, the deduplication in question for this PR only applies within a single node. The Lines 77 to 85 in 6e87449
Coming back to my single node cluster, the
1TB, ✅.
Your reasoning seems sound. However, mount is Nullable (I assume because of Windows?). So we should fall back to path if mount is null?
That code for spins hasn't existed in years.
They're not especially nullable as compared to, say, path. All of them can be null in the case of an FsInfo.Path that represents a total rather than a concrete entry. Any of them can be null in tests.
I don't think so -- Elasticsearch won't start if the paths are duplicated, so that's one thing, but more than that I just don't think it makes sense to de-duplicate on paths for this purpose.
Today has been a github archaeology day, which makes it a good day. #1622 is the original issue for adding fs stats. It was closed via 0a3c941. In that original implementation, there were two ways of populating an
So in that world, yes, "in that way" sure is an interesting way of phrasing this, but sure, it can totally be
I added some commits that drop the
Thanks for the great explanation! LGTM
yw! Thanks for the review, this PR has been a lot of fun (not all that much code writing, but the investigation was delightful).
#12053 changed the deduplication to be based on `path` (prior to that it had been by `dev`). I submit that deduplication based on `path` is not correct, and that what we actually want is deduplication based on `mount`.
Consider the following `GET _nodes/stats/fs` snippet for a single node cluster on my box:
The paths are different, as they must be, so no de-duplication is done and the total is 2 terabytes. I assure you, though, there's only one drive attached to my laptop, and it's 1TB. 😉
By way of contrast, with this PR in place, we'd see:
And the `total.total_in_bytes` reflects the reality that I have just 1TB of storage.
Not entirely that different from #32569 which I closed yesterday.
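The difference between de-duplicating on path and on mount can be sketched in a few lines of Java. This is only an illustration of the idea, not the code in this PR: `FsDedupSketch`, `PathStats`, `totalByPath` and `totalByMount` are hypothetical names, and `PathStats` is a simplified stand-in for `FsInfo.Path`. Counting an entry with a null mount unconditionally is an assumption made for the sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; PathStats is a simplified stand-in for FsInfo.Path.
final class FsDedupSketch {

    static final class PathStats {
        final String path;     // data path, unique per entry on a node
        final String mount;    // mount point backing the path, may be null
        final long totalBytes; // size of the filesystem behind the path

        PathStats(String path, String mount, long totalBytes) {
            this.path = path;
            this.mount = mount;
            this.totalBytes = totalBytes;
        }
    }

    // De-duplicate on path: data paths are always distinct on one node,
    // so every entry is counted and a shared drive is counted repeatedly.
    static long totalByPath(List<PathStats> entries) {
        Set<String> seenPaths = new HashSet<>();
        long total = 0;
        for (PathStats entry : entries) {
            if (seenPaths.add(entry.path)) {
                total += entry.totalBytes;
            }
        }
        return total;
    }

    // De-duplicate on mount: each distinct mount is counted once.
    // Assumption for this sketch: an entry with a null mount is always counted.
    static long totalByMount(List<PathStats> entries) {
        Set<String> seenMounts = new HashSet<>();
        long total = 0;
        for (PathStats entry : entries) {
            if (entry.mount == null || seenMounts.add(entry.mount)) {
                total += entry.totalBytes;
            }
        }
        return total;
    }
}
```

For two data paths that sit on the same mount and each report the full size of the drive, `totalByPath` returns twice the drive size while `totalByMount` returns the drive size once, which is the behavior described above.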