# Applied Machine Learning Homework 4 Question 1 

### Group 111
<br>Aryan Dhuru
<br>Saloni Gandhi
<br>Mitanshu Bhoot
<br>Shivali Mate


## CLIP (Contrastive Language-Image Pre-training)

Source: openai/clip-vit-base-patch32

### 1. Architecture

- **Image Encoder**:
    - **ResNet**: Variants include ResNet-50, ResNet-101, and compute-scaled versions (e.g., RN50x64). These use the ResNet-D modifications, antialiased (blur) pooling, and an attention pooling layer in place of global average pooling.
    - **Vision Transformer (ViT)**: Variants include ViT-B/32 (roughly 88M parameters) and ViT-L/14 (roughly 300M parameters). The checkpoint used here, `openai/clip-vit-base-patch32`, uses ViT-B/32 as its image backbone.

- **Text Encoder**: 
    - The text encoder is a Transformer that follows the GPT-2 architecture and uses masked (causal) self-attention.
    - The input text is tokenized with byte-pair encoding and mapped to vectors by a learned embedding layer.
    - Positional Embedding: As in the ViT, learned positional embeddings are added to the token embeddings to retain sequence order.


- **Shared Embedding Space**:
    - **Projection Layers**: 
    After the vision and text encoders process their respective inputs, the resulting embeddings are projected into a shared space using separate linear projection layers for images and text.
    - **Contrastive Learning**: 
    During training, CLIP uses a contrastive loss to align the image and text embeddings in this shared space. Matching pairs (image and its corresponding caption) are brought closer together, while non-matching pairs are pushed apart.
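The projection and contrastive steps above can be sketched in a few lines. This is a minimal NumPy illustration, not the training code of the released model: the random features, the projection matrices `w_image` and `w_text`, and the fixed `temperature=0.07` are assumptions made for the example (CLIP learns its temperature during training).

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_features, text_features, w_image, w_text,
                          temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    # Separate linear projections map both modalities into the shared space.
    image_emb = l2_normalize(image_features @ w_image)
    text_emb = l2_normalize(text_features @ w_text)

    # Entry (i, j) is the scaled cosine similarity of image i and text j.
    logits = image_emb @ text_emb.T / temperature

    # The matching pair for row i sits on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Toy batch of 4 pairs with hypothetical encoder output sizes (768 and 512).
rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 768))
text_features = rng.normal(size=(4, 512))
w_image = rng.normal(size=(768, 512)) * 0.02
w_text = rng.normal(size=(512, 512)) * 0.02

loss = clip_contrastive_loss(image_features, text_features, w_image, w_text)
print(f"contrastive loss on random features: {loss:.4f}")
```

The loss falls as the diagonal entries (matching pairs) grow relative to the off-diagonal entries (non-matching pairs), which is the "pull together, push apart" behaviour described above.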

### 2. Number of Layers and Parameters

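- **Image Encoder (ViT-B/32)**: 
    - The Vision Transformer in this checkpoint has 12 layers with a hidden width of 768 and 12 attention heads, and it operates on 32×32-pixel patches.
    - Approximate parameter count: ~88M. For comparison, a standard ResNet-50 backbone has roughly 23M parameters.
- **Text Encoder (GPT-2-style Transformer)**: 
    - The text encoder in this checkpoint has 12 layers with a hidden width of 512 and 8 attention heads.
    - Approximate parameter count: ~63M.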
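The parameter counts can be sanity-checked by adding up layer shapes. The sketch below assumes the published ViT-B/32 configuration (image tower: 12 layers, width 768, 32×32 patches on 224×224 inputs; text tower: 12 layers, width 512, vocabulary of 49,408 tokens, context length 77; shared embedding size 512) and a standard Transformer block with a 4× MLP. The helper names are hypothetical. With the model loaded, `sum(p.numel() for p in model.parameters())` is the authoritative check.

```python
def transformer_block_params(width):
    """Parameters in one Transformer block with a 4x-wide MLP (assumed layout)."""
    attention = 4 * width * width + 4 * width      # Q, K, V, output projections + biases
    mlp = 2 * (width * 4 * width) + 4 * width + width  # two linear layers + biases
    layer_norms = 2 * (2 * width)                  # two LayerNorms (scale and shift)
    return attention + mlp + layer_norms

def text_encoder_params(layers=12, width=512, vocab_size=49408,
                        context_length=77, embed_dim=512):
    token_embedding = vocab_size * width
    position_embedding = context_length * width
    final_layer_norm = 2 * width
    projection = width * embed_dim                 # bias-free projection to shared space
    return (token_embedding + position_embedding
            + layers * transformer_block_params(width)
            + final_layer_norm + projection)

def vision_encoder_params(layers=12, width=768, patch_size=32,
                          image_size=224, embed_dim=512):
    patch_embedding = 3 * patch_size * patch_size * width  # bias-free conv over RGB patches
    class_embedding = width
    num_positions = (image_size // patch_size) ** 2 + 1    # patches plus class token
    position_embedding = num_positions * width
    layer_norms = 2 * (2 * width)                  # LayerNorm before and after the blocks
    projection = width * embed_dim                 # bias-free projection to shared space
    return (patch_embedding + class_embedding + position_embedding
            + layers * transformer_block_params(width)
            + layer_norms + projection)

text_total = text_encoder_params()
vision_total = vision_encoder_params()
print(f"text encoder:   {text_total / 1e6:.1f}M parameters")
print(f"vision encoder: {vision_total / 1e6:.1f}M parameters")
```

Under these assumptions the Transformer blocks dominate the image tower, while the token embedding table accounts for a large share of the text tower.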

### Parameters Breakdown

- **Embedding Layer**: Converts input images/text into embeddings.
- **Transformer Layers**: Each layer consists of multi-head self-attention mechanisms and feed-forward neural networks.
- **Parameters**: The parameters include weights for the attention mechanism, feed-forward networks, and normalization layers.
- **K, Q, V Matrices**: These belong to the self-attention mechanism. The Query (Q) and Key (K) projections are compared to compute attention scores, and those scores weight the Value (V) projections to produce the layer's output.
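To make the role of the Q, K, and V matrices concrete, here is a minimal single-head scaled dot-product attention sketch in NumPy. The sizes and random weights are placeholders chosen for illustration, and the sketch omits multi-head splitting, the output projection, and the causal mask that a GPT-2-style text encoder would apply.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of token (or patch) vectors."""
    q = x @ w_q                     # queries: what each position is looking for
    k = x @ w_k                     # keys: what each position offers
    v = x @ w_v                     # values: the content that gets mixed

    # Scores compare every query with every key, scaled by sqrt(head_dim).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Toy input: 5 positions with width 16, projected to a head dimension of 8.
rng = np.random.default_rng(0)
seq_len, width, head_dim = 5, 16, 8
x = rng.normal(size=(seq_len, width))
w_q, w_k, w_v = (rng.normal(size=(width, head_dim)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)
print(weights.sum(axis=-1))
```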

### 3. Functionality

- **Image and Text Embeddings**: The model learns to map images and text into a shared embedding space.
- **Contrastive Loss**: The model is trained using a contrastive loss function, which encourages the embeddings of matching image-text pairs to be closer together, while non-matching pairs are pushed apart.
- **Zero-Shot Transfer**: After pre-training, the model can perform tasks without additional training by leveraging natural language descriptions.

### 4. Training and Objectives
- The model is trained on a dataset of 400M image-text pairs using a contrastive loss to maximize similarity between correct pairs and minimize similarity for incorrect ones.
- Zero-shot transfer is achieved by encoding class names or descriptions as text embeddings and comparing them to image embeddings.
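Zero-shot classification then reduces to a nearest-neighbour lookup in the shared space. The sketch below uses hand-made toy embeddings instead of real encoder outputs, and the `logit_scale=100.0` value is an assumption standing in for the model's learned scale. The function name `zero_shot_classify` is hypothetical.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names, logit_scale=100.0):
    """Pick the class whose text embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)

    # One scaled cosine similarity per class name.
    logits = logit_scale * (text_embs @ image_emb)

    # Softmax turns the similarities into a distribution over class names.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return class_names[int(np.argmax(probs))], probs

# Toy 3-dimensional embeddings: one row per class name, plus one image vector.
class_names = ["airplane", "car", "chair"]
text_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])   # closest to the "car" direction

prediction, probs = zero_shot_classify(image_emb, text_embs, class_names)
print(prediction, probs.round(3))
```

No class-specific weights are trained here: adding a new category only requires encoding one more text prompt.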


In [2]:
import os
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from tqdm import tqdm

# Set up device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the CLIP model
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)

# Define the dataset directory and categories
data_dir = "./dataset"  # Replace with your dataset directory
categories = ["airplane", "car", "chair", "cup", "dog", "donkey", "duck", "hat"]
conditions = ["blurred", "features", "geons", "realistic", "silhouettes"]

# Function to evaluate the model
def evaluate_model(data_dir, categories, conditions):
    accuracy = {condition: 0 for condition in conditions}
    total_images = {condition: 0 for condition in conditions}  # New dictionary to store the count of images
    
    for condition in conditions:
        correct = 0
        total = 0
        condition_dir = os.path.join(data_dir, condition)
        
        for category in categories:
            category_images = [f for f in os.listdir(condition_dir) if f.startswith(category)]
            
            for img_name in tqdm(category_images, desc=f"Evaluating {condition}/{category}"):
                img_path = os.path.join(condition_dir, img_name)
                image = Image.open(img_path).convert("RGB")
                
                inputs = processor(text=categories, images=image, return_tensors="pt", padding=True)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():  # inference only, so skip gradient tracking
                    outputs = model(**inputs)
                
                logits_per_image = outputs.logits_per_image
                probs = logits_per_image.softmax(dim=-1)
                
                # Get the predicted category
                pred = torch.argmax(probs).item()
                if categories[pred] == category:
                    correct += 1
                total += 1
        
        accuracy[condition] = correct / total if total > 0 else 0
        total_images[condition] = total  # Store the total number of images evaluated
    
    return accuracy, total_images

# Evaluate the model
accuracy, total_images = evaluate_model(data_dir, categories, conditions)

# Print the accuracy and number of images evaluated
for condition, acc in accuracy.items():
    print(f"Accuracy for {condition}: {acc * 100:.2f}%")
    print(f"Number of images evaluated for {condition}: {total_images[condition]}")



# Print the accuracy only, reusing the results computed above
# (evaluate_model returns a tuple, so its result cannot be used as the accuracy dict directly)
for condition, acc in accuracy.items():
    print(f"Accuracy for {condition}: {acc * 100:.2f}%")


Evaluating blurred/airplane: 100%|██████████| 2/2 [00:00<00:00,  3.55it/s]
Evaluating blurred/car: 100%|██████████| 2/2 [00:00<00:00,  5.15it/s]
Evaluating blurred/chair: 100%|██████████| 1/1 [00:00<00:00,  4.80it/s]
Evaluating blurred/cup: 0it [00:00, ?it/s]
Evaluating blurred/dog: 0it [00:00, ?it/s]
Evaluating blurred/donkey: 0it [00:00, ?it/s]
Evaluating blurred/duck: 0it [00:00, ?it/s]
Evaluating blurred/hat: 0it [00:00, ?it/s]
Evaluating features/airplane: 100%|██████████| 1/1 [00:00<00:00,  4.56it/s]
Evaluating features/car: 100%|██████████| 1/1 [00:00<00:00,  5.79it/s]
Evaluating features/chair: 100%|██████████| 1/1 [00:00<00:00,  5.05it/s]
Evaluating features/cup: 0it [00:00, ?it/s]
Evaluating features/dog: 0it [00:00, ?it/s]
Evaluating features/donkey: 0it [00:00, ?it/s]
Evaluating features/duck: 0it [00:00, ?it/s]
Evaluating features/hat: 0it [00:00, ?it/s]
Evaluating geons/airplane: 0it [00:00, ?it/s]
Evaluating geons/car: 0it [00:00, ?it/s]
Evaluating geons/chair: 0it [00:0

Accuracy for blurred: 80.00%
Accuracy for features: 66.67%
Accuracy for geons: 36.36%
Accuracy for realistic: 0.00%
Accuracy for silhouettes: 100.00%



