Idea: Instance open-vocab segmentation with RADIO? #42

Open

javiabellan opened this issue Apr 23, 2024 · 3 comments

@javiabellan

javiabellan commented Apr 23, 2024

I want to explore this idea of:

  1. Doing instance segmentation with an open-set vocabulary
  2. Computing an embedding of each segmentation.

These are, a priori, other tasks not related to RADIO, but RADIO has all the ingredients (CLIP + DINO + SAM) to solve this problem.
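
As a starting point, here is a minimal sketch of what step 2 could look like with RADIO's torch.hub interface, assuming the `radio_v2.1` checkpoint, a 16px patch size, and a placeholder image/mask (in practice the mask would come from an instance segmenter such as SAM):

```python
import torch

# Load RADIO from torch.hub (the version string is an assumption; use whichever
# release you want to test).
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.1', progress=True)
model.cuda().eval()

# Placeholder image and a placeholder binary instance mask (in practice, a real
# image and a mask produced by an instance segmenter).
H, W = 512, 512
image = torch.rand(1, 3, H, W, device='cuda')
mask = torch.zeros(H, W, dtype=torch.bool, device='cuda')
mask[100:300, 150:350] = True

with torch.no_grad():
    # RADIO returns a global summary vector and per-patch spatial features.
    summary, spatial = model(image)   # spatial: (1, num_patches, C)

# Reshape the patch features into a grid; a 16px patch size is assumed here.
P = 16
h, w = H // P, W // P
feat = spatial.reshape(h, w, -1)

# Downsample the mask to the patch grid and average-pool the features inside it
# to obtain one embedding per instance.
mask_small = torch.nn.functional.interpolate(
    mask[None, None].float(), size=(h, w), mode='nearest')[0, 0] > 0.5
instance_embedding = feat[mask_small].mean(dim=0)
print(instance_embedding.shape)   # (C,)
```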

If we look at SOTA methods we will find Grounded-SAM, which is a 2-step process (sketched below):

  1. Compute bboxes with Grounding DINO
    1. Compute text embeddings of the desired classes (RADIO has CLIP)
    2. Compute DINOv2 features
    3. Use a DETR-style model to merge both and produce bboxes
  2. Use SAM with the bbox prompt encoder to produce segmentation maps for each bbox.
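
A rough sketch of that two-step pipeline, where the detector call is a hypothetical placeholder (any open-vocabulary detector that returns boxes for text queries would do) and the SAM part uses the `segment_anything` package with an assumed checkpoint path:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def detect_boxes(image, text_queries):
    """Hypothetical step 1: an open-vocabulary detector (e.g. Grounding DINO)
    returning xyxy boxes for the given text queries. Placeholder only."""
    raise NotImplementedError

# Step 2: prompt SAM with each detected box to get one mask per instance.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # checkpoint path is an assumption
predictor = SamPredictor(sam)

def grounded_sam(image: np.ndarray, text_queries):
    boxes = detect_boxes(image, text_queries)    # (N, 4) xyxy boxes
    predictor.set_image(image)                   # HxWx3 uint8 RGB image
    masks = []
    for box in boxes:
        m, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        masks.append(m[0])                       # (H, W) boolean mask
    return boxes, masks
```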

I also find the tree/nested detection of the NanoOWL work very interesting. I think (because it is based on OWL-ViT) it only uses the CLIP vision and text encoders.
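
For reference, plain OWL-ViT (what NanoOWL builds on) can be run through Hugging Face transformers roughly like this; the model name, image path, and threshold are only illustrative, and NanoOWL's tree logic itself is not shown:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")              # placeholder path
texts = [["a person", "a dog", "a frisbee"]]   # one list of queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes / scores / labels in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(texts[0][label], float(score), [round(v, 1) for v in box.tolist()])
```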

I would like to dig into these (and more) ideas to explore more use cases for the RADIO model :)

@mranzinger
Collaborator

Hey, that's a really cool idea! I would love any updates on progress if you get them. Feel free to just email us directly, in case you don't want to broadcast.

One thing that we discovered in the past month is that RADIO currently has a weird behavior where it operates differently at low resolution and high resolution. The switch appears to occur at around 720px, but it's not a hard cutoff. Below 720px, the CLIP and DINO heads work as expected, but SAM does not; above 720px, the opposite happens. We are calling it "mode switching", and we briefly discuss it in the latest arXiv version.
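
If you want to poke at it yourself, a quick way is to run the same image through the CLIP and SAM adaptors below and above the ~720px point and compare the outputs; treat the following as a sketch, since the version string, adaptor names, and output format may differ depending on the release you're using:

```python
import torch

# Version string and adaptor handling may differ depending on the release.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.1',
                       adaptor_names=['clip', 'sam'], progress=True)
model.cuda().eval()

for res in (432, 1024):              # one side below ~720px, one above
    x = torch.rand(1, 3, res, res, device='cuda')
    with torch.no_grad():
        out = model(x)               # assumed: dict of (summary, features) per head
    clip_summary, clip_feat = out['clip']
    sam_summary, sam_feat = out['sam']
    print(res,
          'clip', tuple(clip_feat.shape), clip_feat.norm(dim=-1).mean().item(),
          'sam', tuple(sam_feat.shape), sam_feat.norm(dim=-1).mean().item())
```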

We think we have figured out how to address it, so I expect we'll release a new version of the model in the near future.

In the interim, if it's giving you trouble, shoot me an email and we'll try to get you sorted out.

@javiabellan
Author

javiabellan commented Apr 23, 2024

That switch makes sense; as I remember from the paper, CLIP and DINO were distilled at something near 224~448 px and SAM at 1024 px.

Back to my current interest: I want to figure out tree-based detection/segmentation; if I find something I will tell you.

PS: I don't like the Grounded-SAM approach because it is not end-to-end -> 2 phases, and in the later phase SAM has to segment with only the box as input and not the original text.

@SimonGeb

SimonGeb commented May 7, 2024

I'm very curious to hear how this is going to go. I have been exploring Grounding DINO heavily for producing pseudo 'class probabilities' due to its open-set nature. Basically, I'm using Grounding DINO to produce a class probability distribution for each mask given any set of input classes. These can then be used for downstream tasks.

I wonder whether RADIO can be used to do the same, but in an integrated end-to-end manner.
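
One rough sketch of what that integrated version could look like: pool RADIO's CLIP-adaptor spatial features inside each mask, embed the class names with the adaptor's text encoder, and softmax the similarities. The `adaptor_names` argument, `model.adaptors['clip'].tokenizer` / `.encode_text`, the patch size, and the placeholder image/masks are all assumptions on my side (check the NVlabs/RADIO repo for the exact interface):

```python
import torch
import torch.nn.functional as F

# Load RADIO with its CLIP adaptor. The adaptor_names argument, .adaptors,
# .tokenizer and .encode_text are assumptions about the interface.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.1',
                       adaptor_names=['clip'], progress=True)
model.cuda().eval()
clip_adaptor = model.adaptors['clip']

class_names = ["a cat", "a dog", "background"]
tokens = clip_adaptor.tokenizer(class_names).cuda()
with torch.no_grad():
    text_emb = F.normalize(clip_adaptor.encode_text(tokens), dim=-1)   # (K, D)

# Placeholder image and masks (replace with a real image and e.g. SAM masks).
H, W, P = 512, 512, 16   # P: assumed patch size
image = torch.rand(1, 3, H, W, device='cuda')
masks = torch.zeros(2, H, W, dtype=torch.bool, device='cuda')
masks[0, 50:200, 50:200] = True
masks[1, 300:450, 300:450] = True

with torch.no_grad():
    out = model(image)
    clip_summary, clip_feat = out['clip']   # spatial features, assumed to live in CLIP space
clip_feat = clip_feat.reshape(H // P, W // P, -1)

# Pool the CLIP-space features inside each mask and softmax the text
# similarities to get a pseudo class-probability distribution per mask.
for m in masks:
    m_small = F.interpolate(m[None, None].float(), size=clip_feat.shape[:2],
                            mode='nearest')[0, 0] > 0.5
    emb = F.normalize(clip_feat[m_small].mean(dim=0), dim=-1)
    probs = (emb @ text_emb.T * 100).softmax(dim=-1)   # 100 ~ CLIP logit scale
    print({c: round(float(p), 3) for c, p in zip(class_names, probs)})
```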
