Determine depth level for directories #3

Closed
doulikecookiedough opened this issue Jan 11, 2023 · 4 comments
Labels: question (Further information is requested)
@doulikecookiedough
Contributor

Discuss the implications of dataset submissions growing exponentially over time, and how that growth will affect our proposed hashstore solution at its current directory depth.

We may need to increase the depth level to ensure that our solution accounts for future submissions that exceed current estimates.

@mbjones added the "question (Further information is requested)" label on Jan 12, 2023
@mbjones
Member

mbjones commented Jan 12, 2023

From the Physical Layout design doc:

Because each hex digit in the hash can take 16 values, each two-character directory level has 256 possible names, so two levels give the directory structure 65,536 (256^2) subdirectories.

If we move to 3 levels, we get 256^3 = 16,777,216 (~16M) directories, so a big jump. The main discussion is whether 65K or 16M directories is better for organizing our content. How many files do we have now and expect in the future, and, assuming they are fairly evenly distributed, how many files per directory would we expect?
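
For reference, a minimal sketch (Python, illustrative only, not part of hashstore itself) of how the subdirectory count scales with depth, assuming 2 hex characters (256 possible values) per directory level as described in the design doc:

```python
# Sketch: how many subdirectories a hash-based layout provides per depth,
# assuming each directory level is named with 2 hex characters
# (16 * 16 = 256 possible values per level).
VALUES_PER_LEVEL = 16 ** 2  # 256

for depth in (1, 2, 3):
    total_dirs = VALUES_PER_LEVEL ** depth
    print(f"depth {depth}: {total_dirs:,} directories")

# Output:
# depth 1: 256 directories
# depth 2: 65,536 directories
# depth 3: 16,777,216 directories
```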

@doulikecookiedough
Contributor Author

Referencing the ADC Report Y7Q2: from 2016-2022, approximately 7,000 datasets have been uploaded, consisting of ~1,000,000 objects and totaling ~75-80TB of data.

  • Current growth rate: ~12TB per year
    ~ Monthly submissions of 1-5TB datasets, with occasional datasets of 10-15TB in size
    ~ Frequently consisting of between 50K and 500K objects (and a dataset of 1 million objects is not out of the question)


The type of dataset we are trying to better accommodate is one like the KNB Ofir Levy dataset, which is about 2TB in size and contains ~450K files.

  • https://knb.ecoinformatics.org/view/doi%3A10.5063%2FF1Z899CZ
  • Due to limitations on file counts, we were unable to load the 450K files individually; instead, we broke them up into hierarchical packages to overcome the hurdle (they were tarred into a set of regional archives, each containing a subset of the dataset).

@doulikecookiedough
Contributor Author

doulikecookiedough commented Feb 1, 2023

Context:

If we follow the existing growth rate of ~12TB and 1 million objects per year, over 10 years that would be 10 million objects and ~120TB of data. Assuming an even distribution of objects:

  • If there were 65K (65,536) directories, there would be ~153 objects per directory
  • If there were 16M (16,777,216) directories, there would be ~0.6 objects per directory

If we use an aggressive estimate of ~60TB and 5 million objects per year, over 10 years that would be 50 million objects and ~600TB of data.

  • If there were 65K directories, there would be ~763 objects per directory
  • If there were 16M directories, there would be ~3 objects per directory
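
A small sketch (Python, illustrative only) reproducing the objects-per-directory estimates above, using the exact directory counts (65,536 and 16,777,216) as divisors:

```python
# Sketch: expected objects per directory, assuming objects end up evenly
# distributed across all leaf directories.
DIR_COUNTS = {"2 levels (65K)": 256 ** 2, "3 levels (16M)": 256 ** 3}

# Growth scenarios taken from the discussion above: (label, objects per year)
SCENARIOS = [("current (~12TB/yr)", 1_000_000), ("aggressive (~60TB/yr)", 5_000_000)]
YEARS = 10

for label, objects_per_year in SCENARIOS:
    total_objects = objects_per_year * YEARS
    for layout, dir_count in DIR_COUNTS.items():
        print(f"{label}, {layout}: ~{total_objects / dir_count:.1f} objects/directory")

# Output:
# current (~12TB/yr), 2 levels (65K): ~152.6 objects/directory
# current (~12TB/yr), 3 levels (16M): ~0.6 objects/directory
# aggressive (~60TB/yr), 2 levels (65K): ~762.9 objects/directory
# aggressive (~60TB/yr), 3 levels (16M): ~3.0 objects/directory
```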

Findings & Rationale:

  • Using the aggressive estimate with 65K directories, ~763 files in a directory is not unreasonable. However, the general online consensus is that keeping many files in one folder is not good practice, and the common recommendation is to keep as few files per directory as possible.
    -- CephFS's documentation also notes that this is not a best practice (the example used is 1 million files in a directory).
  • There does not appear to be a directory limit for more recent file systems like XFS (one user reported reaching 16 million directories), nor is there any mention of a directory limit in CephFS.
    -- Given our current proposed implementation, where files are retrieved based on the PIDs provided (which resolve to the direct path to the object), we should not have performance issues when retrieving files even if they are stored several levels deep.
  • While modern filesystems can technically hold many files in a single directory, performance of common operations eventually degrades as the file count grows, for example when listing the files in the directory.
  • Links of interest:
    -- CephFS discussion RE: maximum number of files per directory
    -- NTFS Performance and Large Volumes
    -- Discussion on # of folders in a Windows Folder
    -- Performance discussion of retrieving files via web
  • Assuming that we want to future-proof our storage subsystem and avoid another refactor down the line, we should prepare to receive increasingly large datasets in terms of both size (TB) and object counts.
    -- Even if we double our aggressive estimate (10 million objects per year, i.e. 100 million objects over 10 years), the number of files per directory with 16M directories remains very reasonable:
    ---- If there were 65K directories, there would be ~1,526 objects per directory
    ---- If there were 16M directories, there would be ~6 objects per directory

Recommended Directory Levels:

  • 3 Levels, 16M directories
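
As an illustration of the agreed layout, a minimal sketch (Python; the exact sharding logic in hashstore may differ) of how a content hash could map to a 3-level path with 2 hex characters per level:

```python
import hashlib

def shard_path(content: bytes, depth: int = 3, width: int = 2) -> str:
    """Map an object's SHA-256 hex digest to a nested directory path.

    With depth=3 and width=2 there are 256**3 (~16.7M) possible leaf directories.
    """
    digest = hashlib.sha256(content).hexdigest()
    tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(tokens + [digest[depth * width:]])

# Example: SHA-256 of b"hello world" starts with "b94d27...", so the object
# would be stored at b9/4d/27/<remainder of the digest>.
print(shard_path(b"hello world"))
```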

Next Steps:

  • Review the findings with Matt and discuss whether this needs to be benchmarked (and, if so, how) before approval

@doulikecookiedough
Contributor Author

After discussing with the team, we have agreed to proceed with a directory depth of 3 levels.
