Improve data onboarding speed: ipfs add and ipfs dag import|export #10383

Open
3 tasks done
mishmosh opened this issue Apr 3, 2024 · 9 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature need/author-input Needs input from the original author need/triage Needs initial labeling and prioritization

Comments

@mishmosh
Contributor

mishmosh commented Apr 3, 2024

Checklist

  • My issue is specific & actionable.
  • I am not suggesting a protocol enhancement.
  • I have searched on the issue tracker for my issue.

Description

This is a follow-up to user @endomorphosis's comment in the Filecoin community discussions about IPFS hashing being slow:

I noticed when trying to index large ML models that the IPFS daemon hashing seems to be single-threaded, and therefore somewhat slow when indexing large files. If this is funded, I hope that someone in your org can try to create a new spec to parallelize the hashing of large files.

Per @lidel in an ipfs-steering conversation on 2 April 2024:

In my mind this is not about inventing new hashing specifications; this is about making the most popular implementation, the one the majority of the ecosystem uses for data onboarding (Kubo), better. My translation:

the IPFS daemon hashing [..] slow when indexing large files
→ Kubo's commands like ipfs add are not as fast as they "should be", when comparing with sha256sum over the number of chunks

parallelize the hashing of large files.
→ improve implementation, make core commands like ipfs dag import|export and ipfs add as fast as possible (we know they are not)

Once we have a reference implementation, we can add some rules of thumb on how to implement UnixFS hashing and chunking to the "notes for implementers" section of the WIP UnixFS specification.
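
To make the "sha256sum over the number of chunks" comparison concrete, here is a minimal sketch of a single-threaded baseline that hashes a buffer chunk by chunk and reports throughput. The 256 KiB chunk size is an assumption (it matches Kubo's default fixed-size chunker as far as I know), and the function names are hypothetical. This is not how Kubo computes CIDs; it only gives a raw hashing number to compare ipfs add against on the same machine.

```python
import hashlib
import time

# Assumption: 256 KiB, believed to match Kubo's default fixed-size chunker.
CHUNK_SIZE = 256 * 1024


def hash_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Return the SHA-256 digest of each fixed-size chunk, in order."""
    view = memoryview(data)  # slice without copying the underlying bytes
    return [
        hashlib.sha256(view[off:off + chunk_size]).digest()
        for off in range(0, len(data), chunk_size)
    ]


def baseline_mib_per_s(data: bytes) -> float:
    """Single-threaded chunk-hashing throughput over `data`, in MiB/s."""
    start = time.perf_counter()
    hash_chunks(data)
    elapsed = time.perf_counter() - start
    return len(data) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # 64 MiB of zeros as a stand-in for real file data.
    sample = bytes(64 * 1024 * 1024)
    print(f"{baseline_mib_per_s(sample):.0f} MiB/s")
```

If ipfs add on the same hardware is far below this number, the bottleneck is probably not the hash function itself.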

@mishmosh mishmosh added the kind/enhancement A net-new feature or improvement to an existing feature label Apr 3, 2024
@endomorphosis

endomorphosis commented Apr 3, 2024

I don't know what hashing method is running under the hood, but these were the questions that I had when contemplating it.

Normally the entire file is passed through in chunks, and each chunk is processed serially and depends on the previous chunk, which means that parallelizing this will not work. Alternatively, it should be possible to hash chunks in parallel, and then take the hashes of those chunks and compute a sha256 over them.
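
The alternative described above (hash chunks in parallel, then hash the resulting hashes) can be sketched roughly as follows. This is an illustration under stated assumptions, not Kubo's implementation and not a CID computation: the chunk size, worker count, and function names are hypothetical, and the result is a simple two-level hash rather than a UnixFS DAG. It relies on the fact that Python's hashlib releases the GIL for inputs larger than 2047 bytes, so threads can hash different chunks on different cores.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 256 * 1024  # hypothetical chunk size, chosen for illustration


def _sha256(chunk) -> bytes:
    """Hash one chunk; each call is independent of every other chunk."""
    return hashlib.sha256(chunk).digest()


def parallel_root_hash(data: bytes, chunk_size: int = CHUNK_SIZE,
                       workers: int = 4) -> bytes:
    """Hash chunks concurrently, then SHA-256 the concatenated chunk digests."""
    view = memoryview(data)  # slice without copying
    chunks = [view[off:off + chunk_size]
              for off in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the root hash is
        # deterministic regardless of which thread finishes first.
        digests = list(pool.map(_sha256, chunks))
    return hashlib.sha256(b"".join(digests)).digest()


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data
    print(parallel_root_hash(payload).hex())
```

Note that the root value differs from a plain sha256 of the whole file, so both sides need to agree on the chunk size and on this construction.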

Additionally, I know that there is hardware acceleration and there are various hardware instructions for sha256; however, I don't know how much you will really enjoy supporting different hardware, or which applications automatically handle the accelerators and what level of support they have.

I seem to remember it taking around 2 days to index ~4TB of models on a 2TB NVME cached ZFS 8 drive SAS array

@lidel
Member

lidel commented Apr 3, 2024

@endomorphosis would you mind answering the questions below? These will let us narrow down the areas that require optimization to better serve your use case (AI model data?):

  1. data itself: what does the 4TB of your model data look like? (e.g. average file sizes, number of files in a directory, directory tree depth)
  2. data onboarding: is it done with plain ipfs add or a different command? Do you use any custom parameters?
  3. storage backend: are you using flatfs+leveldb (default) or something else (badgerds?)
  4. is ipfs daemon running while you perform the import? (or did you pass --offline flag to skip announcements?)

@lidel lidel added the need/triage Needs initial labeling and prioritization label Apr 3, 2024
@lidel lidel changed the title from "Make core commands faster: ipfs add and ipfs dag import|export" to "Improve data onboarding speed: ipfs add and ipfs dag import|export" Apr 3, 2024
@endomorphosis

endomorphosis commented Apr 3, 2024

It looks like:

The data is mostly large language models, between 7GB and 70GB each, from Hugging Face repositories, e.g. https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main. I am using badgerds, the IPFS daemon is running online, and I use -r to add the entire folder.

https://github.com/endomorphosis/ipfs_transformers/blob/main/ipfs_transformers/ipfs_kit_lib/ipfs.py#L73
/usr/local/bin/ipfs daemon --enable-gc --enable-pubsub-experiment

@endomorphosis

endomorphosis commented Apr 3, 2024

ipfs@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [=============================================================================================================================================================================] 100.00%
real 15m32.782s
user 0m22.393s
sys 1m54.226s

(copy from zfs to ssd, then zfs to zfs)
barberb@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time cp mixtral-8x7b-v0.1.Q8_0.gguf /tmp/mixtral.bin

real 1m59.033s
user 0m0.681s
sys 1m17.771s

barberb@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time cp mixtral-8x7b-v0.1.Q8_0.gguf ../

real 2m28.763s
user 0m0.624s
sys 0m45.349s

Hardware: dual Xeon E5 v4 CPUs with 8x WD Gold drives and 1x Samsung 3x8 PCIe lane enterprise NVMe on ZFS.
I will try on my Windows laptop workstation, on NVMe only, using the desktop client in a second and will update this. On this machine it looks like ~50MB/s, and the CPU utilization is always very low.

Windows nvme -> nvme (different devices) Intel(R) Xeon(R) CPU E3-1535M v6 using the desktop client
12m 30s
80% total cpu util (4 cores)
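
For reference, the ~50MB/s estimate can be reproduced from the timings above. The sketch below parses the "real" durations reported by time and divides the 46.22 GiB file size by them; the helper names are hypothetical. Under this arithmetic the ipfs add run comes out at roughly 51 MiB/s, while the two cp runs land in the hundreds of MiB/s.

```python
import re

MIB_PER_GIB = 1024


def parse_duration(text: str) -> float:
    """Convert a `time`-style duration such as '15m32.782s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput_mib_per_s(size_gib: float, real: str) -> float:
    """Average throughput in MiB/s for `size_gib` processed in `real` time."""
    return size_gib * MIB_PER_GIB / parse_duration(real)


if __name__ == "__main__":
    size_gib = 46.22  # file size shown in the ipfs add progress bar
    runs = [
        ("ipfs add", "15m32.782s"),
        ("cp zfs -> ssd", "1m59.033s"),
        ("cp zfs -> zfs", "2m28.763s"),
    ]
    for label, real in runs:
        print(f"{label}: {throughput_mib_per_s(size_gib, real):.1f} MiB/s")
```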

@lidel
Member

lidel commented Apr 16, 2024

@endomorphosis Triage questions:

@lidel lidel added the need/author-input Needs input from the original author label Apr 16, 2024
@endomorphosis

endomorphosis commented Apr 16, 2024

devel@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [===========================================================================================================================================================================================================] 100.00%
real 14m8.824s
user 12m19.404s
sys 1m31.919s

running offline

fregg@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [===========================================================================================================================================================================================================] 100.00%
real 9m1.508s
user 12m46.422s
sys 2m11.760s

syncwrite to false
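
One rough way to read the two runs above is to divide CPU time (user + sys) by wall-clock time (real), which approximates how many cores the timed process kept busy on average. The sketch below does that arithmetic; the helper names are hypothetical. For these two runs it gives about 0.98 and about 1.66 respectively. Keep in mind that time only counts CPU used by the timed process itself, so work done in a separate daemon process would not show up.

```python
import re


def seconds(duration: str) -> float:
    """Convert a `time`-style duration such as '14m8.824s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cores(real: str, user: str, sys: str) -> float:
    """Average number of cores kept busy: (user + sys) / real."""
    return (seconds(user) + seconds(sys)) / seconds(real)


if __name__ == "__main__":
    first = busy_cores("14m8.824s", "12m19.404s", "1m31.919s")
    second = busy_cores("9m1.508s", "12m46.422s", "2m11.760s")
    print(f"first run:  {first:.2f} cores busy on average")
    print(f"second run: {second:.2f} cores busy on average")
```

A value close to 1.0 is what you would expect if the heavy lifting happens on a single thread.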

@endomorphosis

I do want to mention that I am also writing a wrapper to import datasets into IPFS, where the number of files will be on the order of 7 million legal caselaw documents. Do you have any feedback about what optimizations ought to be made for that case?

@hsanjuan
Contributor

hsanjuan commented Apr 16, 2024

Per #9678 I have an old branch with a bunch of optimizations for data import that I could rescue/cleanup. (edit: not for merging though as it hardcodes some stuff, just for using).

One low-hanging fruit would be to add support for badger4 and pebble backends too. Badger v1 is ages old.

@endomorphosis

I will look at that for my repositories
https://github.com/endomorphosis/ipfs_transformers
https://github.com/endomorphosis/ipfs_datasets

I would like to impress on your org (Protocol Labs) that I am trying to make this ergonomic for machine learning developers, and what I have seen is that other projects (e.g. Bacalhau, Iroh) are migrating away from libp2p / IPFS because the project needs to be performance optimized.

I only have two more weeks I can spend on this Hugging Face bridge, and right now it's more of a life raft in case the government decides to overregulate machine learning than a viable solution that ML devs would turn to. But if something reasonably effective for decentralized low-latency MLOps existed, I would probably stop development and just buy filecoins.
