
Questions from applying mm-shap to the new model (LLaVA-next) #7

Open
ChengYuChuan opened this issue Apr 1, 2024 · 2 comments

@ChengYuChuan

The first question:

masked_X[0, 0] = 49406

I checked this code and tried to apply it to the model I am interested in, LLaVA-Next (https://huggingface.co/docs/transformers/model_doc/llava_next). I understand that the number 49406 (49408 − 2, i.e., the vocab_size after ruling out CLS and SEP) is derived from CLIP's vocabulary size. Since the corresponding parameter is None by default in LLaVA-Next, I am wondering how to pick an appropriate value for it, and for the other parameters as well. If you have any idea, please let me know.
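For reference, here is a minimal sketch of how I would compare the two tokenizers directly (it uses the llava-v1.6-mistral-7b-hf checkpoint I link further down; which id the masking should actually use is exactly my open question, so the unk/pad lines are only candidates, not a decision):

from transformers import AutoTokenizer, CLIPTokenizer

# CLIP's tokenizer: id 49406 out of the 49408-token vocabulary
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(clip_tok.convert_ids_to_tokens(49406))        # '<|startoftext|>'

# LLaVA-Next (mistral-7b variant) uses a LLaMA/Mistral tokenizer with a 32000-token base vocab
llava_tok = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
print(len(llava_tok))                               # base vocab plus any added special tokens (e.g. <image>)
print(llava_tok.unk_token, llava_tok.unk_token_id)  # '<unk>', 0 -- one candidate for a masking value
print(llava_tok.pad_token_id)                       # whatever pad token (if any) this checkpoint defines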

The second question:

I found the example below:

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

# Reference: https://huggingface.co/docs/transformers/model_doc/clip

Usually, the model would need a text input that asks for the caption. However, I didn't see that prompting part in mm-shap_clip_dataset.py.
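For contrast, a generative model like LLaVA-Next does need an explicit instruction prompt containing an <image> placeholder. A minimal sketch, adapted from the LLaVA-Next docs linked above (the [INST] ... [/INST] template is the one from the mistral-7b model card; how such a prompt should be phrased for image-sentence alignment is part of what I am asking):

from PIL import Image
import requests
import torch

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16
)
model.to("cuda:0")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Unlike CLIP, the caption has to be embedded in an instruction prompt
prompt = "[INST] <image>\nDoes the caption 'a photo of a cat' match this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))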

There are some parameters I would need to revise when I implement LLaVA-Next (a small sketch for reading these values programmatically follows the two lists below).

LLaVA-Next: https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/blob/main/config.json

  • image_size: 336
  • vocab_size: 32000

It seems to me that LLaVA-Next is more complex than the CLIP model, since it splits a picture into four parts.

CLIP: https://huggingface.co/openai/clip-vit-base-patch32/blob/main/config.json

  • image_size: 224
  • vocab_size: 49408
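Sketch for reading these values (the field paths follow the two config.json files linked above):

from transformers import AutoConfig

# Compare the fields listed above for both checkpoints
llava_cfg = AutoConfig.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
clip_cfg = AutoConfig.from_pretrained("openai/clip-vit-base-patch32")

print(llava_cfg.vision_config.image_size, llava_cfg.text_config.vocab_size)  # 336, 32000
print(clip_cfg.vision_config.image_size, clip_cfg.text_config.vocab_size)    # 224, 49408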

Here is the setting of the experiment I would like to run on LLaVA-Next with the MM-SHAP metric:

  • num_samples: all
  • task: image_sentence_alignment
  • Dataset: existence
@LetiP
Member

LetiP commented Apr 4, 2024

What a coincidence, I am currently also looking into LLaVA and extending MM-SHAP to such models as well.
MM-SHAP as presented in the paper works only for encoders; what you are talking about, and what I am looking into, are autoregressive / decoder models.

I am also currently writing my thesis and submitting it this month, so I am super busy; I will look into this more deeply in May and can get back to you around then.

@LetiP
Member

LetiP commented May 7, 2024

It could be that you find what you were asking for here: https://github.com/Heidelberg-NLP/CC-SHAP-VLM
While working on my thesis I ran new experiments that include MM-SHAP on three VL decoder models. The new experiments are featured in the paper linked in the new repo.
Just wanted to drop this now; I am still busy with thesis writing and do not have much time to polish things. 🙈
