Clip_back H14: making it work on an SSD < 2TB; making it fast #304

Open
FlimFlamm opened this issue Aug 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@FlimFlamm
Contributor

Goal: My end goal is to use the clip back-end as part of an image-captioning pipeline (I'll need to process many millions of images). The constraint I'm working under is that my SSD is only 2TB, so my intent is to use only the English metadata in conjunction with the H14 index.

Hardware is as follows:

CPU: 13900k
SSD: Samsung 990 pro 2TB
GPU: 4090
RAM: 96GB @ 6400 MT/s

I have managed to get the clip backend up and running, but I still need to figure out how to align the LAION-2B English-only metadata with the H-14 5B index. The guide indicates that the no-language and multi-language metadata should also be used, but I simply don't have the storage capacity for them.

I converted the metadata I did get into a single pyarrow file (as the guide instructs; my merge step is sketched below), and the backend runs, but the results returned by front-end searches are mostly incorrect. Any guidance on this would be very much appreciated.
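Roughly, my merge step looked like the sketch below (simplified; the guide has its own conversion tooling, and the directory layout and output name here are just placeholders for what I used):

```python
# Simplified sketch of merging the English metadata parquet files into one
# arrow file, in filename order. Paths and the output name are placeholders.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

parquet_files = sorted(Path("metadata").glob("metadata_*.parquet"))

with pa.OSFile("0.arrow", "wb") as sink:
    writer = None
    for path in parquet_files:
        table = pq.read_table(str(path))
        if writer is None:
            # all parts share the same schema, so take it from the first file
            writer = pa.ipc.new_file(sink, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```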

I'm also interested in any tweaks that might improve performance, e.g. reorder_metadata_by_ivf_index (I wonder if this would also address my metadata dilemma...).

Thanks in advance for any help or insights.

@rom1504
Owner

rom1504 commented Aug 13, 2023 via email

@FlimFlamm
Contributor Author

> About correctness: did you get the metadata from the same source as the index?

As far as I can tell, they're the correct ones, following the H-14 guide:

This is the index file

This is the English metadata

> About English only: you could change the clip back code to filter out result ids above 2B

This sounds like the perfect fix; some of my results are correct (about a third of them), so I suspect it will work. Any pointers on the most succinct way to achieve this? (e.g. perhaps modifying the load_clip_indices() function in clip_back?) I'd also be interested in trimming down the index itself instead, given that would free up a non-trivial amount of storage space.

Thanks very much in any case!

@rom1504
Owner

rom1504 commented Aug 13, 2023

My suggestion is to filter as soon as possible, right after the search, so here: https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_back.py#L366

Filtering the index itself is also possible, but it would require you to process all the index files and remove any ids > 2B.
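Something along these lines should work (a rough sketch, not the actual clip_back code; the 2B cutoff constant, the function shape, and the variable names are assumptions for illustration):

```python
# Rough sketch of the suggested post-search filter: run the faiss search as
# usual, then drop any result id outside the assumed English range.
import numpy as np

ENGLISH_ID_LIMIT = 2_000_000_000  # ids >= ~2B are assumed to be non-English entries


def search_english_only(index, query_embeddings: np.ndarray, nb_results: int):
    """Search the faiss index and keep only ids below the English cutoff."""
    distances, indices = index.search(query_embeddings, nb_results)
    # faiss pads missing results with -1, so drop those as well
    mask = (indices[0] >= 0) & (indices[0] < ENGLISH_ID_LIMIT)
    return distances[0][mask], indices[0][mask]
```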

@FlimFlamm
Contributor Author

FlimFlamm commented Aug 17, 2023

> My suggestion is to filter as soon as possible, right after the search, so here: https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_back.py#L366
>
> Filtering the index itself is also possible, but it would require you to process all the index files and remove any ids > 2B.

After implementing your suggestion I narrowed my issue down a bit further, but I'm unsure how to proceed: with an id threshold in place, it turns out that any id above roughly 100 million is misaligned in the front-end and ClipClient results. I can't tell whether the cause is malformed index files or a bad metadata arrow file, but I did re-download and re-merge both of them (one at a time) as a sanity check.

I've been trying many things in the hope that I simply missed a step, but no luck so far. One of my suspicions is that a recent change to the instructions may have broken something, specifically where
for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index; done
was changed to:
for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done

That change was made so the default filenames wouldn't be wrong, but the index parts are in fact .index files, so perhaps saving them as .parquet before merging them affects the alignment somehow?

It seems like my issue is with metadata alignment; the only other suspicion I have right now is that by not having the rest of the metadata (i.e. compiling only the English metadata), the alignment somehow gets thrown off.

I'm hoping to get your thoughts on this before I potentially re-download the index files with the correct naming scheme.

In case it helps: my 0.arrow (English metadata) file is 385.8 GB, the merged_index_ivfdata (5B H-14) file is 754.9 GB, and the populated.index file is 513.2 MB.
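For what it's worth, the kind of spot-check I've been doing looks roughly like this (assuming row i of the merged 0.arrow file is supposed to correspond to index id i, and that the metadata has "url" and "caption" columns):

```python
# Spot-check alignment by looking up a few ids directly in the merged arrow file.
# Assumption: row i of 0.arrow corresponds to index id i.
import pyarrow as pa

source = pa.memory_map("0.arrow", "r")
table = pa.ipc.open_file(source).read_all()  # zero-copy read over the memory map

print("rows:", table.num_rows)

# ids below ~100M look fine for me; ids above that come back misaligned
for i in (1_000, 100_000_000, 1_000_000_000):
    row = table.slice(i, 1).to_pylist()[0]
    print(i, row.get("url"), row.get("caption"))
```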

@FlimFlamm
Contributor Author

I managed to get everything working and aligned. Although I'm not positive which of the following was the fix (mostly because storage limits prevented me from testing systematically), here is how I modified the steps from the guide:

for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.parquet; done

becomes:

for i in {00..79}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion5b-h14-index/resolve/main/index-parts/$i.index -o $i.index; done

and

for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o $i.parquet; done

becomes (along with the other metadata downloads)

for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done

Other than that, a final sanity check that may have been at play was verifying that I actually had all 2314 English metadata files before compiling them (a quick check is sketched below). After the first download I stopped confirming they were all there (I assumed aria2 was fool-proof), but on this last attempt I checked and found a few missing files. I'll open a pull request that adds a warning advising people to check for missing files, along with the changes above, to keep the filenames consistent with the dataset.
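For anyone else hitting this, the check I ended up doing was essentially the following (filenames per the aria2c command above):

```python
# List any of the 2314 English metadata parquet files that failed to download,
# i.e. metadata_0000.parquet .. metadata_2313.parquet missing from the current directory.
from pathlib import Path

missing = [f"metadata_{i:04d}.parquet" for i in range(2314)
           if not Path(f"metadata_{i:04d}.parquet").exists()]

print(f"{len(missing)} missing files")
print("\n".join(missing))
```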

I'm still interested in figuring out how to maximize clip_back's performance for my use case, so I'll leave this issue open and eventually update it with my findings (or perhaps ask some related questions :D).

Thanks again for your help @rom1504, I'm looking forward to using this for big data jobs!

@rom1504 rom1504 added the enhancement New feature or request label Jan 13, 2024