Clip_back H14: making it work on an SSD < 2TB; making it fast #304

Open
FlimFlamm opened this issue Aug 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@FlimFlamm
Contributor

Goal: My end goal is to use the clip back-end as part of an image-captioning pipeline (I'll need to process many millions of images). The constraint I'm working under is that my SSD is only 2TB, so my intent is to use only the English metadata in conjunction with the H14 index.

Hardware is as follows:

CPU: 13900k
SSD: Samsung 990 pro 2TB
GPU: 4090
RAM: 96GB @ 6400 MT/s

I have managed to get the clip backend up and running, but I still need to figure out how to align the LAION-2B English-only metadata with the H-14 5B index. The guide indicates that the no-language and multi-language metadata should also be used, but I simply don't have the storage capacity for them.

I converted the metadata I did get into a single pyarrow file (as the guide instructs; my merge step is sketched below), and the backend runs, but the results returned by front-end searches are mostly incorrect. Any guidance on this would be very much appreciated.
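Roughly, my merge step looked like the sketch below (simplified; the guide has its own conversion tooling, and the directory layout and output name here are just placeholders for what I used):

```python
# Simplified sketch of merging the English metadata parquet files into one
# arrow file, in filename order. Paths and the output name are placeholders.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

parquet_files = sorted(Path("metadata").glob("metadata_*.parquet"))

with pa.OSFile("0.arrow", "wb") as sink:
    writer = None
    for path in parquet_files:
        table = pq.read_table(str(path))
        if writer is None:
            # all parts share the same schema, so take it from the first file
            writer = pa.ipc.new_file(sink, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```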

I'm also interested in any tweaks that might improve performance, e.g. reorder_metadata_by_ivf_index (I wonder if this would also address my metadata dilemma...).

Thanks in advance for any help or insights.

@rom1504
Owner

rom1504 commented Aug 13, 2023 via email

@FlimFlamm
Contributor Author

> About correctness: did you get the metadata from the same source as the index?

As far as I can tell, they're the correct ones, following the H-14 guide:

This is the index file

This is the English metadata

> About English only: you could change the clip back code to filter out result ids above 2B

This sounds like the perfect fix; some of my results are correct (about a third of them), so I suspect it will work. Any pointers on the most succinct way to achieve this? (e.g. perhaps modifying the load_clip_indices() function in clip_back?) I'd also be interested in trimming down the index itself instead, given that would free up a non-trivial amount of storage space.

Thanks very much in any case!

@rom1504
Owner

rom1504 commented Aug 13, 2023

My suggestion is to filter as soon as possible, right after the search, so here: https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_back.py#L366

Filtering the index itself is also possible, but it would require you to process all the index files and remove any ids > 2B.
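Something along these lines should work (a rough sketch, not the actual clip_back code; the 2B cutoff constant, the function shape, and the variable names are assumptions for illustration):

```python
# Rough sketch of the suggested post-search filter: run the faiss search as
# usual, then drop any result id outside the assumed English range.
import numpy as np

ENGLISH_ID_LIMIT = 2_000_000_000  # ids >= ~2B are assumed to be non-English entries


def search_english_only(index, query_embeddings: np.ndarray, nb_results: int):
    """Search the faiss index and keep only ids below the English cutoff."""
    distances, indices = index.search(query_embeddings, nb_results)
    # faiss pads missing results with -1, so drop those as well
    mask = (indices[0] >= 0) & (indices[0] < ENGLISH_ID_LIMIT)
    return distances[0][mask], indices[0][mask]
```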

@FlimFlamm
Contributor Author

FlimFlamm commented Aug 17, 2023

> My suggestion is to filter as soon as possible, right after the search, so here: https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_back.py#L366
>
> Filtering the index itself is also possible, but it would require you to process all the index files and remove any ids > 2B.

After implementing your suggestion I narrowed my issue down a bit further, but I'm unsure how to proceed: with an id threshold in place, it turns out that any id above roughly 100 million is misaligned in the front-end and ClipClient results. I can't tell whether the cause is malformed index files or a bad metadata arrow file, but I did re-download and re-merge both of them (one at a time) as a sanity check.

I've been trying many things in the hope that I simply missed a step, but no luck so far. One of my suspicions is that a recent change to the instructions may have broken something, specifically where
for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index; done
was changed to:
for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done

That change was made so the default filenames wouldn't be wrong, but the index parts are in fact .index files, so perhaps saving them as .parquet before merging them affects the alignment somehow?

It seems like my issue is with metadata alignment; the only other suspicion I have right now is that by not having the rest of the metadata (i.e. compiling only the English metadata), the alignment somehow gets thrown off.

I'm hoping to get your thoughts on this before I potentially re-download the index files with the correct naming scheme.

In case it helps: my 0.arrow (English metadata) file is 385.8 GB, the merged_index_ivfdata (5B H-14) file is 754.9 GB, and the populated.index file is 513.2 MB.
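For what it's worth, the kind of spot-check I've been doing looks roughly like this (assuming row i of the merged 0.arrow file is supposed to correspond to index id i, and that the metadata has "url" and "caption" columns):

```python
# Spot-check alignment by looking up a few ids directly in the merged arrow file.
# Assumption: row i of 0.arrow corresponds to index id i.
import pyarrow as pa

source = pa.memory_map("0.arrow", "r")
table = pa.ipc.open_file(source).read_all()  # zero-copy read over the memory map

print("rows:", table.num_rows)

# ids below ~100M look fine for me; ids above that come back misaligned
for i in (1_000, 100_000_000, 1_000_000_000):
    row = table.slice(i, 1).to_pylist()[0]
    print(i, row.get("url"), row.get("caption"))
```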

@FlimFlamm
Contributor Author

I managed to get everything working and aligned. Although I'm not positive which of the following was the fix (mostly because storage limits prevented me from testing systematically), here is how I modified the steps from the guide:

for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done

becomes:

for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.index; done

and

for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done

becomes (along with the other metadata downloads)

for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done

Other than that, a final sanity check that may have been at play was verifying that I actually had all 2314 English metadata files before compiling them (a quick check is sketched below). After the first download I stopped confirming they were all there (I assumed aria2 was fool-proof), but on this last attempt I checked and found a few missing files. I'll open a pull request that adds a warning advising people to check for missing files, along with the changes above, to keep the filenames consistent with the dataset.
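For anyone else hitting this, the check I ended up doing was essentially the following (filenames per the aria2c command above):

```python
# List any of the 2314 English metadata parquet files that failed to download,
# i.e. metadata_0000.parquet .. metadata_2313.parquet missing from the current directory.
from pathlib import Path

missing = [f"metadata_{i:04d}.parquet" for i in range(2314)
           if not Path(f"metadata_{i:04d}.parquet").exists()]

print(f"{len(missing)} missing files")
print("\n".join(missing))
```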

I'm still interested in figuring out how to maximize clip_back's performance for my use case, so I'll leave this issue open and eventually update it with my findings (or perhaps ask some related questions :D).

Thanks again for your help @rom1504, I'm looking forward to using this for big data jobs!

@rom1504 rom1504 added the enhancement New feature or request label Jan 13, 2024