
Questions from applying mm-shap to the new model (LLaVA-next) #7

Open
ChengYuChuan opened this issue Apr 1, 2024 · 2 comments

@ChengYuChuan

The first question:

masked_X[0, 0] = 49406

I checked this code and tried to apply it to the model I am interested in, LLaVA-Next (https://huggingface.co/docs/transformers/model_doc/llava_next). I understand that the number 49406 (49408 − 2, i.e., the vocab_size after ruling out CLS and SEP) is derived from CLIP's vocabulary size. Since the corresponding parameter is None by default in LLaVA-Next, I am wondering how to pick an appropriate value for it, and for the other parameters as well. If you have any idea, please let me know.
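For reference, here is a minimal sketch of how I would compare the two tokenizers directly (it uses the llava-v1.6-mistral-7b-hf checkpoint I link further down; which id the masking should actually use is exactly my open question, so the unk/pad lines are only candidates, not a decision):

from transformers import AutoTokenizer, CLIPTokenizer

# CLIP's tokenizer: id 49406 out of the 49408-token vocabulary
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(clip_tok.convert_ids_to_tokens(49406))        # '<|startoftext|>'

# LLaVA-Next (mistral-7b variant) uses a LLaMA/Mistral tokenizer with a 32000-token base vocab
llava_tok = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
print(len(llava_tok))                               # base vocab plus any added special tokens (e.g. <image>)
print(llava_tok.unk_token, llava_tok.unk_token_id)  # '<unk>', 0 -- one candidate for a masking value
print(llava_tok.pad_token_id)                       # whatever pad token (if any) this checkpoint defines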

The second question:

I found the example below:

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

# Reference: https://huggingface.co/docs/transformers/model_doc/clip

Usually, the model would need a text input that asks for the caption. However, I didn't see that prompting part in mm-shap_clip_dataset.py.
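For contrast, a generative model like LLaVA-Next does need an explicit instruction prompt containing an <image> placeholder. A minimal sketch, adapted from the LLaVA-Next docs linked above (the [INST] ... [/INST] template is the one from the mistral-7b model card; how such a prompt should be phrased for image-sentence alignment is part of what I am asking):

from PIL import Image
import requests
import torch

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
)
model.to("cuda:0")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Unlike CLIP, the caption has to be embedded in an instruction prompt
prompt = "[INST] <image>\nDoes the caption 'a photo of a cat' match this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))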

There are some parameters I would need to revise when I implement LLaVA-Next (a small sketch for reading these values programmatically follows the two lists below).

LLaVA-Next: https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/blob/main/config.json

  • image_size: 336
  • vocab_size: 32000

It seems to me that LLaVA-Next is more complex than the CLIP model, since it splits a picture into four parts.

CLIP: https://huggingface.co/openai/clip-vit-base-patch32/blob/main/config.json

  • image_size: 224
  • vocab_size: 49408
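Sketch for reading these values (the field paths follow the two config.json files linked above):

from transformers import AutoConfig

# Compare the fields listed above for both checkpoints
llava_cfg = AutoConfig.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
clip_cfg = AutoConfig.from_pretrained("openai/clip-vit-base-patch32")

print(llava_cfg.vision_config.image_size, llava_cfg.text_config.vocab_size)  # 336, 32000
print(clip_cfg.vision_config.image_size, clip_cfg.text_config.vocab_size)    # 224, 49408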

Here is the setting of the experiment I would like to run on LLaVA-Next with the MM-SHAP metric:

  • num_samples: all
  • task: image_sentence_alignment
  • Dataset: existence
@LetiP
Member

LetiP commented Apr 4, 2024

What a coincidence, I am currently also looking into LLaVA and extending MM-SHAP to such models as well.
MM-SHAP as presented in the paper works only for encoders; what you are talking about, and what I am looking into, are autoregressive / decoder models.

I am also currently writing my thesis and submitting it this month, so I am super busy; I will look into this more deeply in May and can get back to you around then.

@LetiP
Member

LetiP commented May 7, 2024

It could be that you find what you were asking for here: https://github.com/Heidelberg-NLP/CC-SHAP-VLM
While working on my thesis I ran new experiments that include MM-SHAP on three VL decoder models. The new experiments are featured in the paper linked in the new repo.
Just wanted to drop this now; I am still busy with thesis writing and do not have much time to polish things. 🙈
