
Limit the number of dirs and files an allocation could have #639

Closed · 2 tasks
peterlimg opened this issue Apr 17, 2022 · 6 comments · Fixed by #676
Labels: good first issue · mainnet · Optimization
@peterlimg (Member)

This is (perhaps) a temporary fix for the slow allocation-root calculation when an allocation contains an enormous number of dirs and files. Issue #627 detailed the problem, but we may not want to address it by splitting tables. Instead, we will consider optimizing it by organizing the dirs and files in an MPT tree, as discussed here, just like what we do for the blockchain state. In any case, we will not put effort into that before mainnet. What we currently need to do is limit the number of dirs and files each allocation can have. The steps could be (see the sketch after this list):

  • Add max_dirs_files to the config file
  • Check the count of files and dirs in the allocation each time new files are added, and return an error if the count has reached the maximum allowed number
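A minimal sketch of that check, assuming the blobber's GORM-backed reference_objects table (mentioned later in this thread); the column names, helper name, and error text are illustrative, not the actual implementation in #676:

```go
// Hypothetical sketch: enforce max_dirs_files when adding a new file.
// MaxDirsFiles would be populated from the blobber config (max_dirs_files).
package reference

import (
	"errors"

	"gorm.io/gorm"
)

var MaxDirsFiles int64 = 1000 // loaded from config; the default here is a placeholder

var ErrTooManyRefs = errors.New("max_dirs_files: allocation has reached the maximum number of dirs and files")

// CheckRefLimit counts the dirs and files an allocation already has and
// rejects the insert once the configured ceiling is reached.
func CheckRefLimit(db *gorm.DB, allocationID string) error {
	var count int64
	err := db.Table("reference_objects").
		Where("allocation_id = ?", allocationID).
		Count(&count).Error
	if err != nil {
		return err
	}
	// NOTE: soft-deleted rows may need to be excluded here
	// (see lpoli's comment below about removing soft-deletion).
	if count >= MaxDirsFiles {
		return ErrTooManyRefs
	}
	return nil
}
```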
peterlimg added the mainnet and Optimization labels on Apr 17, 2022
@sculptex
It seems that even with optimized database and root-hash calculations, having many files and folders in a single allocation is inevitably going to become slow and unwieldy, if only for the sake of fetching directory/file listings.

The list-all option is a particular problem. I know we have pagination in progress, but a user would rarely need all of it in one go. More likely they would want the complete folder structure first, and then fetch lists of files only where they are interested.

So I believe we need tools that encourage users to organise their data in the most efficient way.

For example: the S3 migration tool could perform an analysis prior to migration. If there are more than, say, 1 million files, the user could be advised that storing them all in a single allocation would reduce performance and (in particular) result in slow directory listings, and that it would be better to split the data across multiple allocations. With the ability to 'name' allocations, those allocations could effectively 'emulate' folders.

The main drawbacks with this, of course, are that files and folders couldn't be moved between allocations and that the user would have multiple allocations to manage, but I think it is better to support a graceful method like this than to impose hard limits and deter potential big-data storage. (A sketch of such a pre-migration check follows below.)
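A minimal sketch of the kind of pre-migration analysis described above, assuming the AWS SDK for Go (v1); the bucket name, threshold, and warning wording are illustrative, and the real migration tool's plumbing may differ:

```go
// Hypothetical pre-migration check: count the objects in a bucket and warn
// when a single allocation would hold too many files.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

const recommendedMax = 1_000_000 // illustrative threshold from the comment above

func main() {
	sess := session.Must(session.NewSession())
	client := s3.New(sess)

	var total int64
	err := client.ListObjectsV2Pages(
		&s3.ListObjectsV2Input{Bucket: aws.String("my-bucket")},
		func(page *s3.ListObjectsV2Output, lastPage bool) bool {
			total += int64(len(page.Contents))
			return true // keep paging through the bucket
		},
	)
	if err != nil {
		log.Fatal(err)
	}

	if total > recommendedMax {
		fmt.Printf("bucket holds %d objects; consider splitting the migration "+
			"across multiple named allocations to keep listings fast\n", total)
	}
}
```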

@guruhubb (Member)

@sculptex this is something the user can manage. We will limit the number of files/directories regardless of our future optimization.

@lpoli (Contributor)

lpoli commented Apr 24, 2022

In my opinion, splitting the tables for reference_objects would be very good; it's like having separation of concerns.
Regarding the limit on files/directories: @guruhubb, I think it should also be registered in the smart contract, just like blobber capacity, so that the user knows it beforehand, before creating an allocation.

One obvious point is that we should only limit one of the two: the number of files/directories in a directory, or the depth of directories. In our case we should limit the number of files/directories a directory can have (see the sketch below).

If anyone is taking over this task, don't forget to remove soft-deletion where applicable.
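The per-directory variant suggested here differs from the allocation-wide check sketched earlier only in what is counted; a sketch under the same assumptions (the reference_objects table and its parent_path column are illustrative):

```go
// Hypothetical sketch: cap the number of children a single directory can hold.
package reference

import (
	"errors"

	"gorm.io/gorm"
)

var MaxChildrenPerDir int64 = 1000 // illustrative; would come from config / smart contract

var ErrDirFull = errors.New("directory has reached the maximum number of children")

// CheckDirFanout counts the direct children of parentPath within an
// allocation and rejects the insert once the per-directory ceiling is hit.
func CheckDirFanout(db *gorm.DB, allocationID, parentPath string) error {
	var count int64
	err := db.Table("reference_objects").
		Where("allocation_id = ? AND parent_path = ?", allocationID, parentPath).
		Count(&count).Error
	if err != nil {
		return err
	}
	if count >= MaxChildrenPerDir {
		return ErrDirFull
	}
	return nil
}
```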

cnlangzi added the good first issue label on Apr 25, 2022
@sculptex
sculptex commented May 2, 2022

.

cnlangzi assigned avanaur and unassigned cnlangzi on May 5, 2022
@avanaur (Contributor)

avanaur commented May 5, 2022

@peterlimg @cnlangzi @lpoli @guruhubb @sculptex

I am planning to work on this next. I would like to confirm the following (a config-loading sketch follows this list):

  1. The new limit applies to the total combined number of directories and files allowed per allocation.
  2. The config will be named max_dirs_files and will live in the blobber config (is this a global config, or can it vary per blobber?)
  3. It will start with a limit of 1000 (is this a sensible number?)
  4. A maximum depth for subdirectories is out of scope.
  5. A maximum number of files/subdirectories per directory is out of scope.
  6. The restriction applies when adding new files (I assume this automatically covers other file actions that result in a new file?)
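A minimal sketch of point 2, assuming viper-style config loading (the blobber's actual config plumbing may differ); the key name max_dirs_files comes from the issue, the default of 1000 is point 3, and everything else is illustrative:

```go
// Hypothetical sketch: load max_dirs_files from the blobber config,
// falling back to the proposed default of 1000.
package config

import "github.com/spf13/viper"

// MaxDirsFiles caps the combined number of dirs and files per allocation.
var MaxDirsFiles int64

// SetupDefaultConfig registers the proposed default before the file is read.
func SetupDefaultConfig() {
	viper.SetDefault("max_dirs_files", 1000)
}

// ReadConfig pulls the configured value into the package-level variable.
func ReadConfig() {
	MaxDirsFiles = viper.GetInt64("max_dirs_files")
}
```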

@sculptex
sculptex commented May 6, 2022

Conversation moved to Slack.
