Speeding up clip-retrieval back for large number of images #213

Open · varadgunjal opened this issue Dec 8, 2022 · 44 comments
Labels: enhancement (New feature or request)

varadgunjal commented Dec 8, 2022

I'm experimenting with retrieving a large number of images (providing num_images as 10-20k in the query). However, I notice that the response is very slow - for 2k images it took ~38s to complete. To speed it up, I tried some of the suggestions from the README -

  1. I tried to follow the instructions here - https://github.com/rom1504/clip-retrieval#clip-back-benchmark-and-monitoring - and turned off memory mapping (set enable_faiss_memory_mapping, use_arrow and enable_hdf5 to false), but then it throws an error saying
    RuntimeError: Error in faiss::Index* faiss::read_index(faiss::IOReader*, int) at /project/faiss/faiss/impl/index_read.cpp:527: Error: 'ret == (1)' failed: read error in /efs/data/laion-5b-index/image.index: 0 != 1 (Is a directory)

Did I misunderstand "turn off memory mapping"?


  2. As mentioned under the Options section of clip back, I tried to set reorder_metadata_by_ivf_index to true (while keeping enable_faiss_memory_mapping and use_arrow to true as before). But this gives the following stack trace -
[2022-12-08 19:14:36,767] ERROR in app: Exception on /knn-service [POST]
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/flask/views.py", line 107, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "<decorator-gen-2>", line 2, in post
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/prometheus_client/context_managers.py", line 81, in wrapped
    return func(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 488, in post
    return self.query(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 451, in query
    distances, indices = self.knn_search(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/clip_retrieval/clip_back.py", line 360, in knn_search
    results = np.take(ivf_old_to_new_mapping, indices[0])
  File "<__array_function__ internals>", line 180, in take
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 190, in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 54, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
IndexError: index 5437962295 is out of bounds for axis 0 with size 1

  3. The clip benchmarking section also mentions using GPUs for fast clip inference - is there an option to enable this in clip back?
rom1504 (Owner) commented Dec 8, 2022

Hey, glad you got things working locally

What kind of hardware do you have?
Are the files on an NVMe SSD? How much RAM do you have?

Probably the best way to speed things up is #125 so that the reordering option will work with arrow

You may also disable safety and near deduplication

varadgunjal (Author) commented Dec 8, 2022

I'm currently keeping the index files on EFS - could that be a source of problems? I can move them to an SSD if that would result in better performance.

varadgunjal (Author) commented:

> Hey, glad you got things working locally
>
> What kind of hardware do you have? Are the files on an NVMe SSD? How much RAM do you have?
>
> Probably the best way to speed things up is #125 so that the reordering option will work with arrow
>
> You may also disable safety and near deduplication

So #125 should be possible with this in the config -

{   
   ...
   "enable_faiss_memory_mapping": true,
   "use_arrow": true,
   "reorder_metadata_by_ivf_index: true
   ...
}

Right?

rom1504 (Owner) commented Dec 8, 2022

No it needs new code I'm afraid

Yes, prefer using an SSD

varadgunjal (Author) commented:

Sorry, I'm a little confused - where exactly does the reorder_metadata_by_ivf_index option help, since that issue states it doesn't work with arrow yet?

Also, regarding speedup: would just using the original config on an SSD yield that much benefit (I believe 20 queries/s is mentioned in the benchmarking section)? No GPUs required?

rom1504 (Owner) commented Dec 9, 2022

reorder_metadata_by_ivf_index cannot currently help with the arrow files
You have 2 options

  1. Rebuild the hdf5 collection from the parquet metadata using that option
  2. Implement the re-ordering with arrow and open a PR

I advise 2

Regarding speedup: the number in the readme is for a smaller index. However, it is indeed possible to get good speeds; it will need some work though. Here are the slow things

  • implement batching: that will speed up 10x
  • metadata reordering: 100x on metadata fetching speed
  • disable safety and near dedup, or use a faster implementation (see the payload sketch below)
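
For the last point, disabling safety and near dedup is just a matter of the query flags - for example, a payload along these lines (a sketch; same fields as the knn-service example further down this thread):

# knn-service payload with the safety model and near-duplicate filtering disabled
payload = {
    "text": "red car",
    "modality": "image",
    "num_images": 20000,
    "indice_name": "laion5B",
    "use_mclip": False,
    "deduplicate": False,           # skip near-duplicate removal
    "use_safety_model": False,      # skip the safety model
    "use_violence_detector": False,
    "aesthetic_score": "",
    "aesthetic_weight": 0.5,
}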

rom1504 (Owner) commented Dec 9, 2022

GPU won't help without batching

varadgunjal (Author) commented:

I see. Thank you so much for all your help! I will look into your suggestions and try to implement at least one.

One last clarification about the original post above: the benchmarking section mentions that "turning off memory mapping options can also speed up requests, at the cost of high ram usage". How does this work?

rom1504 (Owner) commented Dec 9, 2022

Turning off memory mapping means putting the whole index in RAM. For an 800GB index that would mean either getting a machine with a lot of RAM or splitting it across many machines.
I recommend investigating the other options first

varadgunjal (Author) commented:

Got it. I started by rebuilding the hdf5 collection with the reorder_metadata_by_ivf_index option. It does make responses faster, but the responses only return the 'id' & 'similarity' columns even though I set "columns_to_return": ["url", "caption", "NSFW", "id", "similarity"]

Should I reduce it down to fewer columns (like only ["url", "caption"])? Or does the reordering limit the output to only id & similarity?

rom1504 (Owner) commented Dec 9, 2022 via email

varadgunjal (Author) commented:

No, I don't think I explicitly disabled it. Is there a flag that does that?

rom1504 (Owner) commented Dec 9, 2022

To enable it you need to either use --enable_hdf5 True or use arrow, have a metadata collection that contains all items, and have no errors in the console

varadgunjal (Author) commented Dec 9, 2022

I do have enable_hdf5=True and there is no error in the console. My config looks like this -

{
    "laion5B": {
            "indice_folder": "laion-5b-index",
            "provide_safety_model": false,
            "provide_violence_detector": false,
            "enable_faiss_memory_mapping": true,
            "use_arrow": false,
            "enable_hdf5": true,
            "reorder_metadata_by_ivf_index": true,
            "columns_to_return": ["url", "caption", "NSFW", "id", "similarity"],
            "clip_model": "ViT-L/14",
            "enable_mclip_option": false
    }
}

... and running it with reorder_metadata_by_ivf_index did cause some extra processing to happen (2 progress bars show up). I re-ran it after removing the metadata_reordered.hdf5 & ivf_old_to_new_mapping.npy files that were created, but I still get the same result.

varadgunjal (Author) commented:

I tried to call the MetadataService explicitly using the ids returned by the KnnService, but with the above config it doesn't return any metadata for any of the listed ids. However, if I switch to "use_arrow": true (and "enable_hdf5": false), the MetadataService does return the requested metadata.

I guess that's why the check if meta is not None: fails, and I get only 'id' & 'similarity' in the output.

rom1504 (Owner) commented Dec 9, 2022 via email

varadgunjal (Author) commented:

Yes, I see metadata_reordered.hdf5 in the folder.

varadgunjal (Author) commented:

For testing, I'm querying it like so -

payload = {
    "text":"red car",
    "modality":"image",
    "num_images":20,
    "indice_name":"laion5B",
    "use_mclip":False,
    "deduplicate":True,
    "use_safety_model":True,
    "use_violence_detector":True,
    "aesthetic_score":"",
    "aesthetic_weight":0.5
}

response = requests.post(
    "http://127.0.0.1:1234/knn-service",
    data=json.dumps(payload)
)

rom1504 (Owner) commented Dec 9, 2022 via email

rom1504 (Owner) commented Dec 9, 2022 via email

varadgunjal (Author) commented Dec 9, 2022

Oh hmm. The ivf_old_to_new_mapping.npy is around 42G and metadata_reordered.hdf5 is only a few KB. What could the cause of that be? Isn't it generated automatically by the reorder_metadata_by_ivf_index flag?

rom1504 (Owner) commented Dec 9, 2022 via email

varadgunjal (Author) commented Dec 9, 2022

Ahh, that's the problem. I was still pointing to the metadata folder with the arrow files provided on HF, and not to the local folder with these parquet files. Thank you!

varadgunjal (Author) commented:

One comment regarding the metadata parquet files - when I downloaded them, it was more manageable and informative to keep them in their respective folders (laion1B-nolang, laion2B-en & laion2B-multi), rather than dumping them all into one metadata folder. Would you consider adapting the reordering code to work with such a folder structure, rather than requiring the user to put in extra effort (e.g. renaming the parquet files so they don't overwrite each other)?

I think it would just require updating -

for parquet_files in tqdm(sorted(data_dir.glob("*.parquet"))):

...to .rglob instead, and the order of the parquet files would remain the same (1B-nolang, then 2B-en, then 2B-multi) in the result - see the sketch below.
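
For illustration, a minimal sketch of the change I mean (data_dir is a placeholder for the parent metadata folder holding the three subset subfolders):

from pathlib import Path
from tqdm import tqdm

data_dir = Path("/path/to/metadata")  # placeholder: contains laion1B-nolang, laion2B-en, laion2B-multi

# rglob recurses into the subset subfolders; sorting the full paths keeps the
# overall order 1B-nolang, then 2B-en, then 2B-multi
for parquet_file in tqdm(sorted(data_dir.rglob("*.parquet"))):
    print(parquet_file)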

rom1504 (Owner) commented Dec 13, 2022

yeah absolutely

I think using something like this https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/get_file_list.py#L38 should do the trick (this is what gets used in autofaiss so it's the right order)

rom1504 (Owner) commented Dec 15, 2022

@varadgunjal hey, just wondering - did you have any success?
Feel free to talk about it on discord as well if you want, I'm rom1504#5008 there

varadgunjal (Author) commented Dec 16, 2022

@rom1504 I've been experimenting with this for the past 2 days and have a few notes -

  1. Firstly, I retried the hdf5-based reordering we had discussed earlier. This time I'm fairly certain it worked as intended - it took over a day for the processing to complete after I ran clip-retrieval back ... with a config file as here. I have verified that the ivf_old_to_new_mapping.npy is 42G & metadata_reordered.hdf5 is ~1TB. However, I don't see any speedup as compared to using the arrow files without reordering. I did a test with requesting 20000 images per query and it still takes ~40s to return a response - I was expecting it to reduce by a factor of 2 at least. Is it possible I've done something incorrect again? Or is a significant speedup not to be expected?

  2. Regarding the layout of the metadata parquet files, using .rglob seems to do the job for me. Is this change enough or should I use your code suggestion from get_file_list ? I'm not sure what the preferred way is.

  3. Regarding "adapt ivf metadata reordering to work with arrow" (#125), the only thing left is to make the ArrowSink class efficient. I could use some guidance there - is Discord better for discussing this, or should I keep the discussion here for easy reference later?

rom1504 (Owner) commented Dec 21, 2022

  1. That seems surprising. Is ivf_old_to_new_mapping.npy on an SSD? Can you benchmark what takes the time (index vs metadata)?
  2. https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/get_file_list.py#L38 is the best way, but anything that works is ok
  3. Feel free to talk in discord, but indeed making ArrowSink efficient is important

varadgunjal (Author) commented:

About 1: yes, I'm certain that ivf_old_to_new_mapping.npy is on the SSD. It is ~42GB. I did benchmark it, and it's similar to what I observed earlier: the index search returns very quickly (a few ms) with the id and similarity, but the metadata fetch takes all the remaining time.
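
For reference, the end-to-end number comes from a simple wall-clock measurement around the request, roughly like this sketch (it times the whole request only, not the index/metadata split):

import json
import time
import requests

payload = {
    "text": "red car",
    "modality": "image",
    "num_images": 20000,
    "indice_name": "laion5B",
    "use_mclip": False,
    "deduplicate": True,
    "use_safety_model": True,
    "use_violence_detector": True,
    "aesthetic_score": "",
    "aesthetic_weight": 0.5,
}

start = time.perf_counter()
response = requests.post("http://127.0.0.1:1234/knn-service", data=json.dumps(payload))
print(f"{len(response.json())} results in {time.perf_counter() - start:.1f}s")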

rom1504 (Owner) commented Dec 23, 2022

Ok. Curious what you see with arrow

This problem of efficiently mapping an incremental id to a string is surprisingly hard.

At some point I benchmarked all the popular on-disk KV stores (leveldb, rocksdb, ...) and didn't find them faster than hdf5/arrow

However reordering was faster for me

I think we should maybe set up an easily reproducible benchmark script, maybe independent from this repo, so we can easily benchmark all possible solutions.

It's a much simpler problem than approximate knn. There must be a solution
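
For example, a rough skeleton of the random-access part of such a benchmark could look like this (the file path and dataset/column names are assumptions, to be adapted to the actual metadata layout):

import time
import numpy as np
import h5py

N_QUERIES = 10_000
with h5py.File("metadata_reordered.hdf5", "r") as f:
    dataset = f["dataset"]   # assumed group name
    urls = dataset["url"]    # assumed column name
    ids = np.random.randint(0, len(urls), N_QUERIES)
    start = time.perf_counter()
    fetched = [urls[i] for i in ids]
    print(f"{N_QUERIES} random reads in {time.perf_counter() - start:.2f}s")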

varadgunjal (Author) commented:

Out of curiosity, do you have any numbers on how much faster reordering was for you?

I'm using a gp3 SSD on AWS. Not sure if there's a better-suited one?

BTW for reference, the ivf_old_to_new_mapping.npy is 42GB & the metadata_reordered.hdf5 is ~1.3TB - do those numbers sound correct?

varadgunjal (Author) commented:

I wanted to also check - which metadata column does the returned id from the index search map to? The columns I saw in the metadata files here are - 'image_path', 'caption', 'NSFW', 'similarity', 'LICENSE', 'url', 'key', 'status', 'error_message', 'width', 'height', 'original_width', 'original_height', 'exif', 'md5'. Would it map to 'key' perhaps? And is there a relationship between the id & which of the shards it would be in?

I ask because, as an alternative while I'm debugging this, I was thinking I could make do with id & similarity for now, since those are returned quickly, and then batch the returned ids from multiple text queries together and look them up in the sharded metadata files.

rom1504 (Owner) commented Dec 23, 2022

The id maps to the line number, with the metadata files sorted in alphabetical order

rom1504 (Owner) commented Dec 23, 2022

If your goal is to do a lot of queries, you can definitely do a full scan of the metadata a single time instead of using random access
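
Something along these lines, for example (just a sketch - paths and columns are placeholders, and it assumes the non-reordered collection where ids are line numbers over the alphabetically sorted parquet files):

from pathlib import Path
import pyarrow.parquet as pq

metadata_dir = Path("/path/to/metadata")   # placeholder
wanted = {1266299506, 2370603503}          # ids returned by the knn service
found = {}

offset = 0
for parquet_file in sorted(metadata_dir.rglob("*.parquet")):
    num_rows = pq.ParquetFile(parquet_file).metadata.num_rows
    hits = [i for i in wanted if offset <= i < offset + num_rows]
    if hits:
        table = pq.read_table(parquet_file, columns=["url", "caption", "NSFW"])
        for i in hits:
            found[i] = {c: table[c][i - offset].as_py() for c in table.column_names}
    offset += num_rows

print(found)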

varadgunjal (Author) commented:

I see. Thanks! These are line numbers of the metadata files from the-eye, right? Because I noticed they are set up differently from the ones on HF (and, as I was mentioning on Discord, have a smaller total number of samples).

rom1504 (Owner) commented Dec 23, 2022

It's the ones next to the embeddings - see the table in the download section here: https://laion.ai/blog/laion-5b/

varadgunjal (Author) commented:

Ahh yes. Those are the ones at the-eye as well. Thank you! I'll run this experiment and see if it works for my use case.

varadgunjal (Author) commented Dec 23, 2022

Do these ids / line numbers go from 0 to 5.85B in the order 1B-nolang, 2B-en, 2B-multi?

So, for example, if I get a returned id 2370603503, I should be looking in the 2B-en metadata files, since it is greater than the ~1.2B samples in 1B-nolang? And the line number within 2B-en would be approximately 2370603503 - 1.23B? (A sketch of what I mean is below.)
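
In code, the lookup I have in mind is roughly this (the 1B-nolang count is my own tally; the other two counts are placeholders):

# Sketch of mapping a returned id to (subset, line number within the subset).
SUBSET_SIZES = [
    ("laion1B-nolang", 1_231_502_026),   # my own count
    ("laion2B-en", 2_322_161_808),       # placeholder
    ("laion2B-multi", 2_266_000_000),    # placeholder
]

def locate(global_id):
    offset = 0
    for subset, size in SUBSET_SIZES:
        if global_id < offset + size:
            return subset, global_id - offset
        offset += size
    raise ValueError("id out of range")

print(locate(2370603503))  # expected: ('laion2B-en', 2370603503 - 1231502026)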

rom1504 (Owner) commented Dec 23, 2022

Yes

varadgunjal (Author) commented:

This doesn't seem to hold up from my initial tests. Here's an example of what I tried, maybe you can point out where I'm wrong -

  1. I sent a query to a locally running clip-back and got a response with metadata. One of the responses looks like this -
{'caption': 'Ekstra rzadki Pepe',
  'url': 'https://pobierak.jeja.pl/images_thumb/5/5/3/251008_300x160.jpg',
  'similarity': 0.23226161301136017,
  'NSFW': 'UNLIKELY',
  'id': 1266299506}
  2. Since the returned id was greater than the total number of samples in the laion1B-nolang metadata next to the embeddings (which I counted as 1,231,502,026), I searched for this example in the laion2B-en metadata.

  3. The offset was 1,266,299,506 - 1,231,502,026 = 34,797,480.

  4. Now I counted the number of samples in each of the parquet metadata files from 0000 onwards until I exceeded 34,797,480. The example should therefore be within the parquet file at which the count was exceeded. This came out to be metadata_0037.parquet under the laion2B-en metadata files.

  5. However, I don't see this example anywhere in that parquet file when I tried to search for it by loading it into pandas.

Am I going about this correctly?

rom1504 (Owner) commented Dec 24, 2022

You are now using the non re-ordered collection, right?

rom1504 (Owner) commented Dec 24, 2022

The reordered collection uses a completely different ordering

varadgunjal (Author) commented:

Well that was super dumb of me. Thanks for pointing that out!

rom1504 added the enhancement (New feature or request) label on Jan 13, 2024