Directly get image from indices #377

Open
Maxwells-Demons opened this issue Mar 18, 2024 · 6 comments
Comments

@Maxwells-Demons

Hi, Rom. I have downloaded laion400m and launched KnnService with the following arguments:

    indices_paths="indices_paths_ViTL14.json"
    clip_model="ViT-L/14"
    enable_hdf5=False
    enable_faiss_memory_mapping=True
    columns_to_return = ["url", "image_path", "caption", "NSFW"]
    reorder_metadata_by_ivf_index=False
    enable_mclip_option=True
    use_jit=False
    use_arrow=True
    provide_safety_model=False
    provide_violence_detector=False
    provide_aesthetic_embeddings=True

    clip_resources = load_clip_indices(
        indices_paths=indices_paths,
        clip_options=ClipOptions(
            indice_folder="",
            clip_model=clip_model,
            enable_hdf5=enable_hdf5,
            enable_faiss_memory_mapping=enable_faiss_memory_mapping,
            columns_to_return=columns_to_return,
            reorder_metadata_by_ivf_index=reorder_metadata_by_ivf_index,
            enable_mclip_option=enable_mclip_option,
            use_jit=use_jit,
            use_arrow=use_arrow,
            provide_safety_model=provide_safety_model,
            provide_violence_detector=provide_violence_detector,
            provide_aesthetic_embeddings=provide_aesthetic_embeddings,
        ),
    )
    knnservice = KnnService(clip_resources=clip_resources)

In clip_retrieval/clip_back.py, in KnnService.query (line 466), there is this call:

    results = self.map_to_metadata(
        indices, distances, num_images, clip_resource.metadata_provider, clip_resource.columns_to_return
    )

In results, I only get 'url' and 'caption', like this:

    [{'url': 'https://s3.us-west-2.amazonaws.com/prod.retreat.guru/images/16212/medium/photo%20%280000000E%29.JPG',
      'caption': 'Soul Safari Holistic Retreats'}]

But I noticed that indices is a list like this:

    [193396883, 169693704, 226852546, 94594796, 10774506, 139003161, 3917167, 217605597, 191966779, 197146260]

As you mentioned in other issues before, the url links are gradually becoming unavailable. I think it would be more reliable to access the images from the files I have downloaded. So my question is: how can I directly get the image data from such indices?

@rom1504
Owner

rom1504 commented Mar 18, 2024 via email

Hi, you can replace the urls in the metadata by urls pointing to an http service (eg nginx) where you host the images you have downloaded

@Maxwells-Demons
Author

Hmm, sorry, I'm not very familiar with this. Could you please explain in more detail what you mean by "urls pointing to an http service where you host the images"?

BTW, I noticed that the image_path column exists in the dataframe loaded from a parquet file:

       image_path                                            caption      NSFW  ...  original_height                                               exif                                             sha256
0       000000007  Ben Affleck Could Be Latest Addition To <em>Th...  UNLIKELY  ...              320                                                 {}  6561021576f886c0334b06955cea13e973101f296e0280...
1       000000015  60 Pcs Table Decorations Supplies Moana Themed...  UNLIKELY  ...              200                                                 {}  2432d4ca862e078d911e9becdd7aa7bd85e5832ec5e44f...
2       000000001  Silverline Air Framing Nailer 90mm 10 - 12 Gau...  UNLIKELY  ...              225                                                 {}  b453f327a45b2b734772d8b38d12c1a441b0d69ceb458e...
3       000000049                Mini girls green crochet floral top  UNLIKELY  ...              300                                                 {}  0ba5c4d3842b670ec67a95227121c84944d73436b95fcf...
4       000000075  HARRY CHAPIN - Soundstage: An Evening With Har...  UNLIKELY  ...              200                                                 {}  1cc2add844cdab60decf867ba4242e88fa95b814e6799b...
...

But image_path did not show up in the retrieved metadata (only caption and url did) when I started KnnService with the Arrow file. If I start KnnService directly from the parquet files (use_arrow=False), image_path is present in the retrieved metadata, but it does not match the indices:

>>> metas[0]['image_path'], indices[0]
('194406653', 193396883)
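
One possible reading of the mismatch above, offered only as a sketch: the numbers in indices may be row positions in the concatenated metadata (in the order the metadata files were loaded), while image_path is a per-image key written at download time, so the two would not be expected to agree. This is an assumption, not something confirmed in this thread. Under that assumption, a global row position can be mapped back to a metadata file and a row inside it. The names locate_row and rows_per_file are hypothetical helpers, not part of clip-retrieval.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(global_index, rows_per_file):
    """Map a global row position to (file_number, row_within_file).

    ASSUMPTION: global_index counts rows across all metadata files,
    taken in the same order as rows_per_file lists them.
    """
    if global_index < 0:
        raise IndexError("index must be non-negative")
    # Cumulative end offsets, e.g. [3, 5, 9] for files with 3, 2 and 4 rows.
    ends = list(accumulate(rows_per_file))
    if not ends or global_index >= ends[-1]:
        raise IndexError("index is past the end of the metadata")
    # First file whose end offset is strictly greater than global_index.
    file_number = bisect_right(ends, global_index)
    start = ends[file_number - 1] if file_number > 0 else 0
    return file_number, global_index - start
```

With the file number and local row in hand, the matching parquet file could be loaded (for example with pandas) and the image_path read from that row by position.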

@rom1504
Owner

rom1504 commented Mar 18, 2024 via email

@Maxwells-Demons
Author

First, I downloaded laion400m and launched KnnService with use_arrow=True. The retrieved metas only contain 'url' and 'caption', and in the query function I noticed there is a list called indices.
Second, I launched KnnService with use_arrow=False (i.e. using the metadata parquet files). Now the metas contain 'image_path', but its value is different from the corresponding entry in indices.

So my question is: is it possible to access the image data locally from an index number? If so, which is the correct index?
Besides, you mentioned:

> replace the urls in the metadata by urls pointing to an http service (eg nginx) where you host the images you have downloaded

Could you please explain this in more detail?

Thank you so much for taking the time to answer my question. I apologize for any misunderstanding caused.

@rom1504
Owner

rom1504 commented Mar 18, 2024

Where would you like the jpeg bytes to appear after querying the index exactly? Is it in the browser or some other place?

@Maxwells-Demons
Author

> Where would you like the jpeg bytes to appear after querying the index exactly? Is it in the browser or some other place?

In a Python script, since I'm going to do retrieval-augmented generation.
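
For the Python use case, a minimal sketch of reading the bytes straight from disk once an image_path value is known. It assumes the downloaded images sit in plain folders laid out as <root>/<first 5 digits of key>/<key>.jpg, with keys zero-padded to 9 digits as in the dataframe shown earlier. That layout, the function name read_image_bytes and its parameters are all assumptions for illustration; a download stored as tar shards would need a different reader.

```python
from pathlib import Path


def read_image_bytes(image_root, image_path, extension=".jpg"):
    """Return the raw bytes of one downloaded image.

    ASSUMPTION: files are stored as <image_root>/<shard>/<key><extension>,
    where <key> is the 9-digit image_path and <shard> is its first 5 digits.
    """
    key = str(image_path).zfill(9)  # accept ints or unpadded strings
    shard = key[:5]
    file_path = Path(image_root) / shard / f"{key}{extension}"
    return file_path.read_bytes()
```

The returned bytes can then be handed to an image library or passed on to the generation step without going through a url at all.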
