# Building and Evaluating Multimodal AI

## Author: Dr. Nimrita Koul
## www.linkedin.com/in/nimritakoul

# Task 1: Retrieving Embeddings and Verifying the Similarity Between Them

1. Use the transformers library from Hugging Face to load the CLIP input processor (CLIPProcessor) and the corresponding model (CLIPModel).
2. Load an image and some text descriptions.
3. Use the CLIPProcessor to preprocess the image and the text statements into model inputs.
4. Extract the image and text embeddings that the CLIPModel computes in its joint embedding space.
5. Calculate the cosine similarity between the image embedding and each of the text embeddings.
6. Verify that the most relevant text caption has the highest similarity score with the image embedding.
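
Before loading any model, the similarity check in steps 5 and 6 can be sketched on toy vectors. The snippet below is a minimal, dependency-free illustration; the 3-dimensional vectors and the helper name `cosine_similarity` are hypothetical stand-ins for the 512-dimensional CLIP embeddings used later.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (illustrative values, not produced by CLIP)
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "matching caption": [0.8, 0.2, 0.1],
    "unrelated caption": [0.1, 0.9, 0.3],
}

# Score every caption against the image and pick the highest score
scores = {name: cosine_similarity(image_vec, vec) for name, vec in caption_vecs.items()}
best_caption = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Best match:", best_caption)
```

The caption whose vector points in nearly the same direction as the image vector receives the higher score, which is the behaviour we expect from CLIP embeddings of a matching image and caption.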

In [None]:
# Required imports
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from IPython.display import display
import numpy as np

In [None]:
def compute_clip_similarity(model, processor, image, texts):
    """
    Compute CLIP similarity scores between an image and multiple text descriptions.

    Args:
        model: Loaded CLIP model
        processor: Loaded CLIP processor
        image: PIL Image object
        texts: List of text descriptions

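    Returns:
        tuple: (image_features, text_features, best_idx, similarities), where the
        feature tensors are L2-normalized and similarities holds one cosine score per text
    """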
    # Preprocess inputs using CLIP processor
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

    # Compute embeddings with the CLIP model (no gradients are needed for inference)
    with torch.no_grad():
        image_features = model.get_image_features(pixel_values=inputs["pixel_values"])  # Image embedding
        text_features = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )  # Text embeddings

    # Normalize embeddings for cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

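    # Compute cosine similarity (the vectors are unit length, so the dot product is the cosine)
    similarities = (image_features @ text_features.T).squeeze(0)
    # Softmax over the raw cosine scores; no temperature is applied, so the values stay close together
    probs = similarities.softmax(dim=0)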

    # Display results
    for text, sim, prob in zip(texts, similarities, probs):
        print(f"Text: '{text}' | Similarity: {sim.item():.4f} | Probability: {prob.item():.4f}")

    # Find best match
    best_idx = similarities.argmax().item()
    best_text = texts[best_idx]
    best_similarity = similarities[best_idx]
    print(f"\nBest aligned text: '{best_text}' | Similarity: {best_similarity:.4f}")

    return image_features, text_features, best_idx, similarities
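
Note that `probs` in the function above is a softmax over raw cosine similarities, which lie in [-1, 1], so the resulting probabilities tend to look nearly uniform. CLIP itself multiplies the similarities by a learned temperature (`logit_scale`) before the softmax. The sketch below shows the effect; the similarity values and the scale of 100 are assumptions for illustration, not measurements from the model.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for three captions (illustrative only)
cosine_sims = [0.20, 0.18, 0.30]
assumed_logit_scale = 100.0  # assumption: roughly the scale CLIP applies to its logits

raw_probs = softmax(cosine_sims)
scaled_probs = softmax([assumed_logit_scale * s for s in cosine_sims])

print("Raw softmax:   ", [round(p, 4) for p in raw_probs])
print("Scaled softmax:", [round(p, 4) for p in scaled_probs])
```

The ranking is the same in both cases; only the sharpness of the distribution changes, so the raw cosine scores remain the more informative numbers to compare.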


## You need a Hugging Face account and a valid access token to access models from the Hugging Face Hub.


Create your free HF account here:
https://huggingface.co/

Create your HF access token from your account settings:
https://huggingface.co/settings/account

In [None]:
# Example usage with your existing code:
# Load CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Let us try with an image of Indian Food

In [None]:
# Sample image
image_food = Image.open("Indianfood.jpg").convert("RGB")
#display(image_food)
# Text descriptions
texts = [
    "A historic city square in Warsaw",
    "A beach with palm trees",
    "Delicious Indian Food"
]
# Call the function
image_features, text_features, best_idx, similarities = compute_clip_similarity(model, processor, image_food, texts)


## Now, let us try with an image of Warsaw City Square

In [None]:
image_Warsaw = Image.open("Warsaw.jpg").convert("RGB")
#display(image_Warsaw)


# Text descriptions
texts_Warsaw = [
    "A historic city square in Warsaw",
    "A beach with palm trees",
    "Delicious Indian Food"
]

# Call the function with the Warsaw captions defined above
image_features, text_features, best_idx, similarities = compute_clip_similarity(model, processor, image_Warsaw, texts_Warsaw)


# Task 2: Investigate the Effect of Fusion Strategies on Model Response

In this task, we frame a binary classification problem:

Given an image and its true caption (a positive sample), we check how confidently a classifier labels this pair as a match, and how the fusion strategy influences that confidence.

We will check the influence of these 3 fusion techniques:

- Early Fusion
- Late Fusion
- Cross-Modal Attention


## First let us extract the embeddings of the image and the corresponding caption as in previous cells:

In [None]:
## This cell contains code just like the previous cell for similarity calculation.
# Prerequisites: pip install transformers torch Pillow requests numpy

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from IPython.display import display
import io
import numpy as np

# Load CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


# Compute CLIP similarity on the image of Warsaw and all captions

In [None]:
# Call the function to compute CLIP similarity on the image of Warsaw and all captions
image_features, text_features, best_idx, similarities = compute_clip_similarity(model, processor, image_Warsaw, texts_Warsaw)

In [None]:
import torch
import torch.nn.functional as F

# Fusion techniques (using the first text as the positive example)
# Extract the image embedding
image_emb = image_features  # [1, 512]

# Extract the text embedding of its matching caption (the first one in texts_Warsaw)
text_emb = text_features[0:1]  # [1, 512], "A historic city square in Warsaw"

# This is a positive sample (image and caption match, so the label is 1.0)
label = torch.tensor([1.0])  # Ground truth: match

# Let us implement the three fusion schemes one by one

# 1. Early Fusion
class EarlyFusionModel(nn.Module):
    def __init__(self, input_dim=1024, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim) #One Hidden Layer
        self.fc2 = nn.Linear(hidden_dim, 1) # The output layer with one node

    def forward(self, image_emb, text_emb):
        #Fuse embeddings before passing through the classifier network
        fused = torch.cat((image_emb, text_emb), dim=1)  # [1, 1024]
        x = F.relu(self.fc1(fused))#pass the fused embeddings through hidden layer
        #The output is the probability that image and text embeddings are a positive pair.
        return torch.sigmoid(self.fc2(x))#pass activations through output layer and sigmoid

# 2. Late Fusion
class LateFusionModel(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=128):
        super().__init__()
        self.image_branch = nn.Linear(input_dim, hidden_dim) # Hidden layer for image embeddings
        self.text_branch = nn.Linear(input_dim, hidden_dim)# Hidden layer for text embeddings
        self.fc = nn.Linear(hidden_dim * 2, 1) #Output layer with one node

    def forward(self, image_emb, text_emb):
        #separately process the image and text embeddings
        img_out = F.relu(self.image_branch(image_emb)) #Pass image embeddings through separate hidden layer
        txt_out = F.relu(self.text_branch(text_emb))#Pass text embeddings through separate hidden layer
        #Fuse the outputs for image and text before passing to the final classification layer
        fused = torch.cat((img_out, txt_out), dim=1)  # [1, 256]
        #The output is the probability that image and text embeddings are a positive pair.
        return torch.sigmoid(self.fc(fused))

# 3. Cross-Modal Attention
class CrossModalAttentionModel(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4)#MHA layer
        self.fc1 = nn.Linear(embed_dim, hidden_dim) # one hidden layer
        self.fc2 = nn.Linear(hidden_dim, 1) #one output layer with one node

    def forward(self, image_emb, text_emb):
        # nn.MultiheadAttention expects [seq_len, batch, embed_dim] by default
        image_emb = image_emb.unsqueeze(0)  # [1, 1, 512]
        text_emb = text_emb.unsqueeze(0)    # [1, 1, 512]
        # The text embedding is the query; the image embedding supplies keys and values
        attn_out, _ = self.attn(text_emb, image_emb, image_emb)
        x = F.relu(self.fc1(attn_out.squeeze(0)))  # pass the attention output through the hidden layer
        # The output is the probability that image and text embeddings are a positive pair.
        return torch.sigmoid(self.fc2(x))

In [None]:
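# Instantiate and run fusion strategies
# Note: these models are randomly initialized and untrained, so the predictions
# below reflect the initial weights rather than a learned notion of "match".
early_model = EarlyFusionModel()
late_model  = LateFusionModel()
attn_model  = CrossModalAttentionModel()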

print("\nFusion Results (Text: 'A historic city square in Warsaw'):")
early_output = early_model(image_emb, text_emb)
print(f"Early Fusion Prediction: {early_output.item():.4f} | Label: {label.item()}")

late_output = late_model(image_emb, text_emb)
print(f"Late Fusion Prediction: {late_output.item():.4f} | Label: {label.item()}")

attn_output = attn_model(image_emb, text_emb)
print(f"Cross-Modal Attention Prediction: {attn_output.item():.4f} | Label: {label.item()}")
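
The three fusion models above are never trained, so their outputs typically land close to 0.5 and say little about the fusion strategy itself. To compare strategies meaningfully, each head would need to be trained on positive and negative pairs. The sketch below shows one way this could look for an early-fusion head. It is a minimal illustration under stated assumptions: the embeddings are random stand-ins (not CLIP outputs), and the names `head`, `features`, and `losses` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for CLIP embeddings: random unit vectors, NOT real data
dim, n_pairs = 512, 32
image_embs = F.normalize(torch.randn(n_pairs, dim), dim=-1)
# "Matching" captions: the image vector plus a little noise
matching_text = F.normalize(image_embs + 0.01 * torch.randn(n_pairs, dim), dim=-1)
# "Mismatched" captions: unrelated random vectors
mismatched_text = F.normalize(torch.randn(n_pairs, dim), dim=-1)

# Early fusion: concatenate image and text embeddings -> [2 * n_pairs, 1024]
features = torch.cat([
    torch.cat([image_embs, matching_text], dim=1),
    torch.cat([image_embs, mismatched_text], dim=1),
])
labels = torch.cat([torch.ones(n_pairs, 1), torch.zeros(n_pairs, 1)])

# Small classifier head that outputs logits (the sigmoid is folded into the loss)
head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(head(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Loss before training: {losses[0]:.4f} | after 50 steps: {losses[-1]:.4f}")
```

The same loop could be reused for the late-fusion and cross-modal attention heads, with a held-out set of pairs to compare their confidence on true matches.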


# Task 3: Evaluating Multimodal RAG

## Steps:

1. We will give a text query to a multimodal model and ask it to retrieve a relevant image and its caption from a small dataset of images and captions.

2. Then we will ask the model to generate a detailed description of the retrieved image.

3. Finally, we will evaluate the generated description against the caption in our dataset or a reference caption.

## Components in this RAG Task:
- Retrieval: Use the CLIP model to retrieve an image and its caption based on the query (context).
- Generation: Use GPT-2 to generate a text description using the retrieved context.
- Evaluation: Use
1. BLEU score to compare the generated description to a reference description.
2. LLM as a judge: we use GPT2-medium as a judge to evaluate the generated description.

## In the cells below,
- we load the CLIP model and processor for image+caption retrieval,
- load the gpt2 model for generating a text description of the retrieved image,
- define our image+text dataset.

In [None]:
# Prerequisites: pip install transformers torch Pillow requests numpy nltk
import torch
from transformers import CLIPProcessor, CLIPModel, GPT2Tokenizer, GPT2LMHeadModel
from PIL import Image
import requests
import io
from nltk.translate.bleu_score import sentence_bleu
import numpy as np

In [None]:
# Load models for multimodal image+caption retrieval
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

#gpt2 for generation of description of the retrieved image
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")


In [None]:
# Let us create a small dataset (image + text pairs)
knowledge_base = [
    {
        "image_url": "Warsaw.jpg",
        "caption": "A historic city square in Warsaw with colorful buildings."
    },

    {
        "image_url": "Beach.jpg",
        "caption": "A beautiful beach with crystal clear waters."
    },
    {
        "image_url": "Palm.jpg",
        "caption": "Luscious palm trees under a sunny sky."
    },

    {
        "image_url": "Indianfood.jpg",
        "caption": "Delicious Indian Food."
    }

    ]


## In the cell below,
- we add all images to an images[] list and all captions to a captions[] list,
- and display the images.

In [None]:
# Load and preprocess knowledge base images
images = []
captions = []
for item in knowledge_base:
    try:
        img = Image.open(item['image_url']).convert("RGB")
        display(img)
        images.append(img)
        captions.append(item["caption"])
    except Exception as e:
        print(f"Error loading {item['image_url']}: {e}")


## In the cell below:

1. we define a query describing an image,
2. then we use the CLIP model to retrieve the image from our dataset that best matches the query, along with its caption.

In [None]:
# Query provides the context for retrieval, you can use your documents for this too.
query = "Describe a historic square with colourful, tall buildings"

# Step 1: Retrieval with CLIP
inputs = clip_processor(text=[query] + captions, images=images, return_tensors="pt", padding=True)

# Get embeddings
with torch.no_grad():
    image_features = clip_model.get_image_features(inputs["pixel_values"])  # [num_images, 512]
    text_features = clip_model.get_text_features(inputs["input_ids"], inputs["attention_mask"])  # [1 + num_captions, 512]


# Normalize and compute similarities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Compute similarity between the query embedding (row 0) and every image embedding
similarities = (text_features[0:1] @ image_features.T).squeeze(0)  # Query vs. images

# Identify the index of the image that best matches the query
best_idx = similarities.argmax().item()
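
Taking only the argmax hides how close the runner-up images were. A small ranking helper makes the retrieval step easier to inspect. The sketch below is a hypothetical illustration: `rank_by_similarity` is not part of any library, and the scores are made-up values rather than outputs of the cell above.

```python
def rank_by_similarity(scores, labels, top_k=2):
    """Return the top_k (label, score) pairs, highest score first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Hypothetical query-to-image similarity scores (illustrative values only)
example_scores = [0.31, 0.22, 0.19, 0.17]
example_captions = ["Warsaw square", "Beach", "Palm trees", "Indian food"]

top_matches = rank_by_similarity(example_scores, example_captions, top_k=2)
for caption, score in top_matches:
    print(f"{caption}: {score:.2f}")
```

With real data, the same helper could be called as `rank_by_similarity(similarities.tolist(), captions)` to list the best candidates alongside their scores.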


In [None]:
# Retrieved context
retrieved_image = images[best_idx] #image that best matches the query
retrieved_caption = captions[best_idx] #caption that best matches the retrieved image
display(retrieved_image)
print(f"Retrieved Caption: '{retrieved_caption}'")


# Now, we use the gpt2 text-generation model to generate a description based on the query defined above and the caption retrieved from our dataset by CLIP (a text-to-text generation task).

In [None]:
# Step 2: Description Generation with GPT-2
# Construct your prompt with query and retrieved caption
prompt = f"Query: {query}\nContext: {retrieved_caption}\nDescription: "

# Tokenize the prompt with gpt2_tokenizer
inputs = gpt2_tokenizer(prompt, return_tensors="pt")

# Call the generate() API on the inputs
outputs = gpt2_model.generate(
    inputs["input_ids"],
    max_new_tokens=30,  # budget for generated tokens only; max_length would also count the prompt
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    pad_token_id=gpt2_tokenizer.eos_token_id,
    attention_mask=inputs["attention_mask"],
    top_k=50
)

# decode the outputs
generated_text = gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\nGenerated Description:\n{generated_text}")


# Now, let us evaluate the description generated by the gpt2 model against a reference text description using the BLEU score

In [None]:

# Step 3: Evaluation with BLEU
# This is reference text description to evaluate against
reference = "A historic city square in Warsaw with colorful buildings and cobblestone streets."

# Tokenize the reference description (simple whitespace split)
ref_tokens = reference.split()
# Tokenize the generated description: keep only the text after the prompt's "Description:" marker
gen_tokens = generated_text.split("Description:", 1)[1].split()

# Print for debugging
print(f"Reference Tokens: {ref_tokens}")
print(f"Generated Tokens: {gen_tokens}")


### BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between a generated text and one or more reference texts. Scores lie between 0 and 1, and higher scores indicate closer agreement with the reference.

### With its default weights, sentence_bleu combines 1-gram to 4-gram precision, so short generations with no matching 4-grams score near zero unless smoothing is applied. To count unigram overlap only, pass weights=(1, 0, 0, 0).

In [None]:

# Calculate BLEU with smoothing
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoothie = SmoothingFunction().method1
bleu_score = sentence_bleu([ref_tokens], gen_tokens, smoothing_function=smoothie)
print(f"\nBLEU Score with Smoothing: {bleu_score:.4f}")
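
To make the metric less of a black box, here is a hand-written sketch of unigram BLEU (BLEU-1): clipped unigram precision multiplied by the brevity penalty. The function name `bleu_unigram` and the two example sentences are hypothetical; this illustrates the idea and is not the NLTK implementation.

```python
import math
from collections import Counter

def bleu_unigram(reference_tokens, candidate_tokens):
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    if not candidate_tokens:
        return 0.0
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    precision = clipped / len(candidate_tokens)
    # Penalize candidates that are shorter than the reference
    ref_len, cand_len = len(reference_tokens), len(candidate_tokens)
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * precision

# Hypothetical example sentences (lowercased, whitespace-tokenized)
ref = "a historic city square in warsaw".split()
cand = "a historic square in poland".split()
score = bleu_unigram(ref, cand)
print(f"BLEU-1: {score:.4f}")
```

Four of the five candidate tokens appear in the reference, giving a precision of 0.8, and the candidate is one token shorter than the reference, so the brevity penalty lowers the final score further.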


# Next, we will evaluate the Generated Description using another LLM as a Judge

Steps:
1. Use the gpt2-medium model as a judge.
2. Design a prompt for the LLM judge.
3. Invoke the judge LLM to evaluate the reference and generated descriptions.

In [None]:
from transformers import pipeline

# Instantiate gpt2-medium model and its tokenizer for text-generation
evaluator = pipeline("text-generation", model="gpt2-medium", tokenizer="gpt2-medium")

# Configure the tokenizer
evaluator.tokenizer.pad_token = evaluator.tokenizer.eos_token
evaluator.tokenizer.pad_token_id = evaluator.tokenizer.eos_token_id

## Define the function that uses an LLM judge to evaluate two texts
def llm_judge_with_model(reference, generated_text):
    ## Design the prompt for LLM Judge
    prompt = (
        f"You are an evaluator. Compare the following texts and assign a numeric score from 0 to 10 based on relevance, coherence, and factual accuracy.\n"
        f"Provide your response in this format:\n"
        f"Score: [number]/10\n"
        f"Explanation: [Your reasoning]\n\n"
        f"Example 1:\n"
        f"Reference: 'The sky is blue and vast.'\n"
        f"Generated: 'The sky is blue and wide.'\n"
        f"Score: 8/10\n"
        f"Explanation: The generated text is relevant and coherent, with minor variation in word choice.\n\n"
        f"Example 2:\n"
        f"Reference: 'A tall mountain in the Alps.'\n"
        f"Generated: 'A small hill in France.'\n"
        f"Score: 3/10\n"
        f"Explanation: The generated text is less relevant and factually inaccurate.\n\n"
        f"Now evaluate the following pair of reference and generated text:\n"
        f"Reference: '{reference}'\n"
        f"Generated: '{generated_text}'\n"
    )

    # Call the judge
    response = evaluator(prompt, max_new_tokens=50, num_return_sequences=1, pad_token_id=evaluator.tokenizer.pad_token_id)[0]["generated_text"]

    # The pipeline returns prompt + continuation; keep only the continuation so the
    # regexes below cannot match the few-shot examples inside the prompt
    completion = response[len(prompt):] if response.startswith(prompt) else response

    print("\n--- LLM Judgment (Real Model) ---")

    # Extract the numeric score and the explanation from the judge's continuation
    import re
    score_match = re.search(r"Score:\s*(\d+|X)/10", completion)
    explanation_match = re.search(r"Explanation:\s*(.+)", completion)

    if score_match:
        score_str = score_match.group(1)
        score = int(score_str) if score_str.isdigit() else 5  # neutral fallback for a non-numeric score
    else:
        score = 5  # neutral fallback when no score can be parsed
    explanation = explanation_match.group(1) if explanation_match else "Could not parse response properly."

    #Print the score and the explanation from the response
    print(f"Score: {score}/10")
    print(f"Explanation: {explanation}")
    return score


# Specify a reference description and call the Judge LLM to evaluate
reference = "A historic city square in Warsaw with colorful buildings and cobblestone streets."

#Call the function to evaluate two texts
llm_score = llm_judge_with_model(reference, generated_text)
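
The score-parsing step can be exercised in isolation, without loading a model. A text-generation pipeline returns the prompt followed by the continuation by default, so a regex searched over the whole string can match a few-shot example such as "Score: 8/10" before it reaches the judge's own answer. The sketch below uses a hypothetical helper, `parse_judge_score`, and made-up strings to show one way to guard against that.

```python
import re

def parse_judge_score(full_text, prompt, default=None):
    """Extract 'Score: N/10' from the model's continuation, ignoring the prompt."""
    completion = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", completion)
    if not match:
        return default
    return min(int(match.group(1)), 10)  # clamp out-of-range scores such as 12/10

# Hypothetical prompt and model output, for illustration only
demo_prompt = "Example 1:\nScore: 8/10\nNow evaluate:\n"
demo_output = demo_prompt + "Score: 6/10\nExplanation: Mostly relevant."

naive_score = int(re.search(r"Score:\s*(\d+)/10", demo_output).group(1))
parsed_score = parse_judge_score(demo_output, demo_prompt)
print(f"Naive parse: {naive_score} | Prompt-aware parse: {parsed_score}")
```

Returning `None` when nothing can be parsed, rather than a default number, keeps unparseable judgments distinguishable from genuine mid-range scores.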


## Recap: In this Notebook, we saw three multimodal tasks in action:
1. Retrieving multimodal embeddings from CLIP model and verifying their similarity scores.
2. Verifying the impact of fusion strategies on the performance of the classifier head of an AI model.
3. Evaluating a multimodal RAG application using BLEU score and LLM as a judge.

