<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/MLM1_Vision_Transformers_and_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vision Transformers & Image Embeddings


### Learning Objectives:
1. Understand how Vision Transformers convert images to embeddings
2. Explore CLIP's contrastive learning for text-image alignment
3. Apply these embeddings to financial document similarity and classification



### Setting up the Environment

Before we dive into the models, we need to set up our Python environment by installing and importing the necessary libraries. We'll be using `transformers` for the Vision Transformer and CLIP models, `torch` for tensor operations, `pillow` for image handling, `sentence-transformers` for its cosine similarity utility, `matplotlib` for plotting, `numpy` for numerical operations, `requests` for fetching images from URLs, and `pandas` (imported later) for tabulating the embeddings.

The code also checks if a GPU (`cuda`) is available, which significantly speeds up deep learning computations, and sets the `device` variable accordingly.

In [None]:
# ========== SETUP ==========
!pip -q install transformers torch pillow sentence-transformers matplotlib numpy requests pandas

import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel, ViTImageProcessor, ViTModel
from sentence_transformers import util

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")


### Vision Transformer Embeddings

Vision Transformers (ViT) are a type of neural network that applies the transformer architecture (originally designed for natural language processing) directly to images. Instead of using convolutional layers, ViTs split an image into fixed-size patches, linearly embed each patch, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
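The patch-splitting step can be illustrated with a small, dependency-free sketch. The helper names `split_into_patches` and `vit_sequence_shape` are hypothetical and written only for this illustration; a real ViT performs the same operation on tensors and then applies a learned linear projection, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (nested lists) into flattened, non-overlapping patches.

    Patches are returned in row-major order: left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def vit_sequence_shape(image_size=224, patch_size=16, channels=3):
    """Return (number of tokens including [CLS], values per flattened patch)."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1, patch_size * patch_size * channels


# Toy 4x4 single-channel "image" whose pixel values are 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = split_into_patches(toy_image, patch_size=2)
print(toy_patches)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]

num_tokens, patch_dim = vit_sequence_shape()
print(num_tokens, patch_dim)  # 197 768
```

For a 224x224 RGB image with 16x16 patches, this gives 196 patches plus one `[CLS]` token, so the Transformer encoder sees a sequence of 197 vectors. Each flattened patch holds 768 raw values, which happens to equal the hidden size of the base-sized ViT.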

A special `[CLS]` token is prepended to the patch sequence, and its final hidden state serves as a global representation (embedding) of the entire image. Images with similar content tend to have similar embeddings in this high-dimensional space.

This section demonstrates:
1.  **Loading a pre-trained ViT model:** We use `google/vit-base-patch16-224-in21k`, a base-sized ViT model with 16x16 patches, pre-trained on ImageNet-21k at a resolution of 224x224.
2.  **Loading sample financial images:** These images will be used to demonstrate how ViT processes different visual content.
3.  **Processing images:** The `vit_processor` converts the images into the format expected by the ViT model.
4.  **Generating embeddings:** The model processes the images and extracts a 768-dimensional embedding for each. The `[CLS]` token's embedding is typically used as the image representation.
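5.  **Calculating cosine similarity:** We use cosine similarity to measure how alike the generated embeddings are. Values range from -1 to 1: a value close to 1 indicates that two embeddings point in nearly the same direction (high similarity), a value close to 0 indicates little similarity, and negative values indicate embeddings pointing in opposing directions.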
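To make the similarity measure in the last step concrete, here is a minimal sketch of cosine similarity in plain Python. The function `cosine_similarity` and the toy 2-dimensional vectors are illustrative assumptions; the notebook itself relies on `util.cos_sim` applied to the 768-dimensional ViT embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen so the three characteristic cases are easy to see.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # unrelated directions
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])       # opposing directions

print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector (as in the first case) leaves the similarity unchanged.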

In [None]:
# Vision Transformer Embeddings

# Load ViT model and processor
vit_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
vit_model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k').to(device)

# Sample financial images (you can replace with your own)
image_urls = [
    "https://cdn-icons-png.flaticon.com/512/2906/2906274.png",  # Document
    "https://images.unsplash.com/photo-1554224155-6726b3ff858f?w=400",  # Financial chart
    "https://images.unsplash.com/photo-1611974789855-9c2a0a7236a3?w=400",  # Credit cards
]

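def load_image(url):
    """Load an image from a URL and convert it to RGB."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of passing an error page to PIL
    return Image.open(BytesIO(response.content)).convert('RGB')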

# Load and process images
images = [load_image(url) for url in image_urls]

# Display images
fig, axes = plt.subplots(1, len(images), figsize=(15, 5))
for idx, img in enumerate(images):
    axes[idx].imshow(img)
    axes[idx].set_title(f"Image {idx+1}")
    axes[idx].axis('off')
plt.tight_layout()
plt.show()

# Get embeddings
inputs = vit_processor(images=images, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = vit_model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # [CLS] token

print(f"Embedding shape per image: {embeddings[0].shape}")
print(f"Total embeddings: {embeddings.shape[0]}")

# Compute cosine similarity
similarity_matrix = util.cos_sim(embeddings, embeddings)
print("\nImage Similarity Matrix:")
print(similarity_matrix.numpy())

# Create a DataFrame for embeddings
import pandas as pd
embedding_df = pd.DataFrame(embeddings, index=[f"Image {i+1}" for i in range(embeddings.shape[0])])
print("\nEmbedding Vectors (DataFrame):")
print(embedding_df)


### CLIP - Text-Image Alignment

**Contrastive Language-Image Pre-training (CLIP)** is a neural network trained on a wide variety of (image, text) pairs. It learns to associate images with their descriptive texts. Unlike ViT, which generates embeddings only for images, CLIP generates embeddings for both images and text in a shared, multimodal embedding space.

The key idea behind CLIP is **contrastive learning**: given a batch of (image, text) pairs, the model learns to predict which text snippet belongs to which image. Matching pairs are pulled together in the embedding space while mismatched pairs are pushed apart, which allows the model to capture the semantic relationship between images and text.
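The contrastive objective can be sketched as a symmetric cross-entropy over a matrix of image-text similarities. This is a simplified illustration and not CLIP's training code: the function names, the toy similarity matrices, and the fixed temperature of 0.07 are assumptions made for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_style_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the cosine similarity of image i and text j;
    the matching pairs sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    # Image -> text: each image should pick its own caption (row-wise softmax).
    image_loss = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: each caption should pick its own image (column-wise softmax).
    text_loss = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (image_loss + text_loss) / 2


# Hypothetical similarities: matching pairs score highest on the diagonal...
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
# ...versus a batch where the first two images prefer each other's captions.
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

aligned_loss = clip_style_loss(aligned)
shuffled_loss = clip_style_loss(shuffled)
print(f"aligned pairs:  {aligned_loss:.4f}")
print(f"shuffled pairs: {shuffled_loss:.4f}")
```

The loss is low when each image is most similar to its own caption and high otherwise, so minimizing it pushes the two encoders toward a shared embedding space.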

In this section, we:
1.  **Load a pre-trained CLIP model:** We use `openai/clip-vit-base-patch32`.
2.  **Define text queries:** These are natural language descriptions related to financial concepts.
3.  **Process images and text:** The `clip_processor` prepares both the images (from the previous section) and the text queries for the CLIP model.
4.  **Get predictions:** The CLIP model outputs `logits_per_image`, a matrix of similarity scores with one row per image and one column per text query. A softmax over each row converts these scores into probabilities.
5.  **Display results:** For each image, the output shows how the probability is distributed across the text queries. The values for a single image sum to 1, so they are relative to the set of queries supplied.
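The conversion from similarity scores to probabilities can be sketched as follows. To our understanding, the Hugging Face CLIP implementation computes `logits_per_image` as cosine similarities between normalized image and text embeddings, multiplied by a learned scale. The similarity values and the scale of 100 below are assumptions chosen for illustration, not outputs of the model.

```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities: rows are images, columns are text queries.
cosine_scores = [
    [0.28, 0.21, 0.19],
    [0.20, 0.30, 0.22],
]
logit_scale = 100.0  # assumed value for illustration

# Scale the similarities, then normalize each image's row independently.
logits_per_image = [[logit_scale * s for s in row] for row in cosine_scores]
probs = [softmax(row) for row in logits_per_image]

for row in probs:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

Because of the large scale factor, small gaps in cosine similarity become large gaps in probability, which is why CLIP's softmax outputs often look very confident.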

In [None]:
#  CLIP - Text-Image Alignment
print("\n=== Part 2: CLIP Text-Image Matching ===\n")

# Load CLIP model
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Financial domain queries
text_queries = [
    "a financial document",
    "a stock market chart",
    "credit cards and banking",
    "a business meeting",
    "cryptocurrency trading"
]

# Process images and text
inputs = clip_processor(
    text=text_queries,
    images=images,
    return_tensors="pt",
    padding=True
).to(device)

# Get predictions
with torch.no_grad():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

# Display results (softmax is taken per image, so each image column sums to 1)
print("Image-to-Text Matching Probabilities:\n")
print(f"{'Query':<30} | {'Image 1':>8} | {'Image 2':>8} | {'Image 3':>8}")
print("-" * 63)
for idx, query in enumerate(text_queries):
    print(f"{query:<30} | {probs[0][idx].item():>8.4f} | {probs[1][idx].item():>8.4f} | {probs[2][idx].item():>8.4f}")


### FinTech Application - Document Classification

One powerful application of CLIP's text-image alignment capability is zero-shot image classification. "Zero-shot" means the model can classify images into categories for which it was never given labeled training examples, simply by comparing the image's embedding to the embeddings of text descriptions of the categories.

In this financial technology (FinTech) example, we simulate classifying financial documents:

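1.  **Define document types:** We create a list of candidate categories described in natural language. This example reuses the five phrases from the previous section (e.g., "a financial document", "a stock market chart"); in practice you could substitute more specific labels such as "bank statement" or "investment prospectus".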
2.  **Classify each image:** For each of our sample images, we:
    *   Use the `clip_processor` to prepare the image and all document type queries.
    *   Pass them to the `clip_model` to get similarity scores (`logits_per_image`).
    *   Convert these scores to probabilities using softmax.
    *   Identify the document type with the highest probability as the predicted class.

This demonstrates how CLIP can be used to automatically categorize various financial documents based on textual descriptions of those categories, without needing a large, labeled dataset of financial images for each category.
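When interpreting the reported confidence, it helps to remember that the softmax is taken only over the candidate labels supplied. The sketch below uses hypothetical logits for a single image and a hypothetical `classify` helper to show that the same image can receive a very different confidence when the label list changes.

```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores_by_label):
    """Return (best label, its softmax probability) for a dict of label -> logit."""
    labels = list(scores_by_label)
    probs = softmax([scores_by_label[label] for label in labels])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for one image (not produced by the model).
logits = {
    "a financial document": 24.0,
    "a stock market chart": 23.0,
    "a business meeting": 20.0,
}
label_all, conf_all = classify(logits)

# Same image, same logits, but the closest competitor is removed from the list.
reduced = {k: v for k, v in logits.items() if k != "a stock market chart"}
label_reduced, conf_reduced = classify(reduced)

print(f"{label_all}: {conf_all:.2%}")
print(f"{label_reduced}: {conf_reduced:.2%}")
```

The predicted label stays the same, but the confidence rises once the competing label is dropped. A high confidence therefore means "best among the listed options", not that the image necessarily belongs to any of them.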

In [None]:
#  FinTech Application - Document Classification

# Candidate document types (reusing the Part 2 queries; replace with your own labels)

doc_types = [
    "a financial document",
    "a stock market chart",
    "credit cards and banking",
    "a business meeting",
    "cryptocurrency trading"
]


# Classify each image
print("Document Classification Results:\n")
for img_idx, img in enumerate(images):
    inputs = clip_processor(
        text=doc_types,
        images=img,
        return_tensors="pt",
        padding=True
    ).to(device)

    with torch.no_grad():
        outputs = clip_model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)[0]

    # Get top prediction
    top_idx = probs.argmax().item()
    confidence = probs[top_idx].item()

    print(f"Image {img_idx + 1}:")
    print(f"  Predicted: {doc_types[top_idx]} (confidence: {confidence:.2%})")
    print(f"  All probabilities: {dict(zip(doc_types, probs.tolist()))}\n")
