Determine depth level for directories #3

Closed
doulikecookiedough opened this issue Jan 11, 2023 · 4 comments
Labels: question (Further information is requested)
@doulikecookiedough
Contributor

Discuss the implications of dataset submissions growing exponentially over time, and how that growth will affect our proposed hashstore solution at its current directory depth.

We may need to increase the depth level to ensure that our solution accounts for future submissions that exceed current estimates.

@mbjones added the "question (Further information is requested)" label on Jan 12, 2023
@mbjones
Member

mbjones commented Jan 12, 2023

From the Physical Layout design doc:

Because each hex digit in the hash can take 16 values, each two-character directory level has 256 possible names, so two levels give the directory structure 65,536 (256^2) subdirectories.

If we move to 3 levels, we get 256^3 = 16,777,216 (~16M) directories, so a big jump. The main discussion is whether 65K or 16M directories is better for organizing our content. How many files do we have now and expect in the future, and, assuming they are fairly evenly distributed, how many files per directory would we expect?
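
For reference, a minimal sketch (Python, illustrative only, not part of hashstore itself) of how the subdirectory count scales with depth, assuming 2 hex characters (256 possible values) per directory level as described in the design doc:

```python
# Sketch: how many subdirectories a hash-based layout provides per depth,
# assuming each directory level is named with 2 hex characters
# (16 * 16 = 256 possible values per level).
VALUES_PER_LEVEL = 16 ** 2  # 256

for depth in (1, 2, 3):
    total_dirs = VALUES_PER_LEVEL ** depth
    print(f"depth {depth}: {total_dirs:,} directories")

# Output:
# depth 1: 256 directories
# depth 2: 65,536 directories
# depth 3: 16,777,216 directories
```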

@doulikecookiedough
Contributor Author

Referencing the ADC Report Y7Q2: from 2016-2022, approximately 7,000 datasets have been uploaded, consisting of ~1,000,000 objects and totaling ~75-80TB of data.

  • Current growth rate: ~12TB per year
    ~ Monthly submissions of 1-5TB datasets, with occasional datasets of 10-15TB in size
    ~ Frequently consisting of between 50K and 500K objects (and a dataset of 1 million objects is not out of the question)


The type of dataset we are trying to better accommodate is one like the KNB Ofir Levy dataset, which is about 2TB in size and contains ~450K files.

  • https://knb.ecoinformatics.org/view/doi%3A10.5063%2FF1Z899CZ
  • Due to limitations on file counts, we were unable to load the 450K files individually; instead, we broke them up into hierarchical packages to overcome the hurdle (they were tarred into a set of regional archives, each containing a subset of the dataset).

@doulikecookiedough
Contributor Author

doulikecookiedough commented Feb 1, 2023

Context:

If we follow the existing growth rate of ~12TB and 1 million objects per year, over 10 years that would be 10 million objects and ~120TB of data. Assuming an even distribution of objects:

  • If there were 65K (65,536) directories, there would be ~153 objects per directory
  • If there were 16M (16,777,216) directories, there would be ~0.6 objects per directory

If we use an aggressive estimate of ~60TB and 5 million objects per year, over 10 years that would be 50 million objects and ~600TB of data.

  • If there were 65K directories, there would be ~763 objects per directory
  • If there were 16M directories, there would be ~3 objects per directory
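
A small sketch (Python, illustrative only) reproducing the objects-per-directory estimates above, using the exact directory counts (65,536 and 16,777,216) as divisors:

```python
# Sketch: expected objects per directory, assuming objects end up evenly
# distributed across all leaf directories.
DIR_COUNTS = {"2 levels (65K)": 256 ** 2, "3 levels (16M)": 256 ** 3}

# Growth scenarios taken from the discussion above: (label, objects per year)
SCENARIOS = [("current (~12TB/yr)", 1_000_000), ("aggressive (~60TB/yr)", 5_000_000)]
YEARS = 10

for label, objects_per_year in SCENARIOS:
    total_objects = objects_per_year * YEARS
    for layout, dir_count in DIR_COUNTS.items():
        print(f"{label}, {layout}: ~{total_objects / dir_count:.1f} objects/directory")

# Output:
# current (~12TB/yr), 2 levels (65K): ~152.6 objects/directory
# current (~12TB/yr), 3 levels (16M): ~0.6 objects/directory
# aggressive (~60TB/yr), 2 levels (65K): ~762.9 objects/directory
# aggressive (~60TB/yr), 3 levels (16M): ~3.0 objects/directory
```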

Findings & Rationale:

  • Using the aggressive estimate with 65K directories, ~763 files in a directory is not unreasonable. However, the general online consensus is that keeping many files in one folder is not good practice, and the common recommendation is to keep as few files per directory as possible.
    -- CephFS's documentation also notes that this is not a best practice (the example used is 1 million files in a directory).
  • There does not appear to be a directory limit for more recent file systems like XFS (one user reported reaching 16 million directories), nor is there any mention of a directory limit in CephFS.
    -- Given our current proposed implementation, where files are retrieved based on the PIDs provided (which resolve to the direct path to the object), we should not have performance issues when retrieving files even if they are stored several levels deep.
  • While modern filesystems can technically hold many files in a single directory, performance of common operations eventually degrades as the file count grows, for example when listing the files in the directory.
  • Links of interest:
    -- CephFS discussion RE: maximum number of files per directory
    -- NTFS Performance and Large Volumes
    -- Discussion on # of folders in a Windows Folder
    -- Performance discussion of retrieving files via web
  • Assuming that we want to future-proof our storage subsystem and avoid another refactor down the line, we should prepare to receive increasingly large datasets in terms of both size (TB) and object counts.
    -- Even if we double our aggressive estimate (10 million objects per year, i.e. 100 million objects over 10 years), the number of files per directory with 16M directories remains very reasonable:
    ---- If there were 65K directories, there would be ~1,526 objects per directory
    ---- If there were 16M directories, there would be ~6 objects per directory

Recommended Directory Levels:

  • 3 Levels, 16M directories
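
As an illustration of the agreed layout, a minimal sketch (Python; the exact sharding logic in hashstore may differ) of how a content hash could map to a 3-level path with 2 hex characters per level:

```python
import hashlib

def shard_path(content: bytes, depth: int = 3, width: int = 2) -> str:
    """Map an object's SHA-256 hex digest to a nested directory path.

    With depth=3 and width=2 there are 256**3 (~16.7M) possible leaf directories.
    """
    digest = hashlib.sha256(content).hexdigest()
    tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(tokens + [digest[depth * width:]])

# Example: SHA-256 of b"hello world" starts with "b94d27...", so the object
# would be stored at b9/4d/27/<remainder of the digest>.
print(shard_path(b"hello world"))
```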

Next Steps:

  • Review the findings with Matt and discuss whether this needs to be benchmarked (and, if so, how) before approval

@doulikecookiedough
Contributor Author

After discussing with the team, we have agreed to proceed with a directory depth of 3 levels.
