### Q1) Do the visual encoders have the same architecture?
- No, CLIP does not use a single visual encoder architecture. Instead, it explores the use of both modified ResNet architectures (CNNs) and Vision Transformer architectures. These architectures differ fundamentally in their approach to processing visual information. ResNets rely on convolutional layers to extract features hierarchically, while ViTs treat an image as a sequence of patches and apply Transformer layers with self-attention mechanisms. The specific modifications made to the ResNet architecture within CLIP, such as the attention pooling, further distinguish it from a standard ResNet. The use of both types of architectures allows the CLIP model to leverage different strengths in visual representation learning.
- The text encoder in CLIP is also a Transformer, but it is a separate model from both the ResNet and Vision Transformer visual encoders. It takes text as input, which is first converted into a lower-cased byte pair encoding (BPE) representation, and outputs a textual feature representation. The image and text feature representations are then linearly projected into a shared multi-modal embedding space, where their cosine similarity is computed.
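
The projection and similarity step described above can be sketched in plain Python. This is a toy illustration only: the feature vectors, the projection matrices `W_image` and `W_text`, and the helper names are hypothetical and far smaller than anything CLIP uses.

```python
import math

def project(features, weights):
    """Linear projection: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical encoder outputs with different dimensionalities
image_features = [1.0, 0.0, 2.0]   # 3-d output of a visual encoder
text_features = [0.5, 1.5]         # 2-d output of a text encoder

# Hypothetical projection matrices mapping both into a shared 2-d space
W_image = [[1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
W_text = [[2.0, 0.0],
          [0.0, 1.0]]

image_embedding = project(image_features, W_image)
text_embedding = project(text_features, W_text)
similarity = cosine_similarity(image_embedding, text_embedding)
print(image_embedding, text_embedding, round(similarity, 4))
```

Because each modality has its own projection, the two encoders are free to produce features of different sizes; only the projected embeddings need to share a dimension.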

### Q2) ILSVRC: dataset setup
- ImageNet's label hierarchy is based on the WordNet hierarchy.
- Each concept, usually described by several words or phrases, is called a "synonym set", or "synset" for short.
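
As a small illustration of how synsets show up in practice, the sketch below maps WordNet IDs (the folder names in ImageNet-style datasets) to their lemmas. The two entries are taken from the ImageNet class list; the relative path and the helper `class_name_from_path` are illustrative assumptions, not part of any library.

```python
from pathlib import PurePosixPath

# WordNet ID -> lemmas of the synset (all names for the same concept)
synsets = {
    "n01440764": ["tench", "Tinca tinca"],
    "n03394916": ["French horn", "horn"],
}

def class_name_from_path(image_path):
    """Read the WordNet ID from the parent folder and return the first lemma."""
    wnid = PurePosixPath(image_path).parent.name
    return synsets[wnid][0]

# Assumed layout: <split>/<wnid>/<file>.JPEG
label = class_name_from_path("imagenette2/val/n03394916/n03394916_29142.JPEG")
print(label)
```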

### Q3) Could grouping objects based on synsets lead to problems for visual recognition?
- Yes, grouping objects by synset can cause problems for visual recognition.
    - **Polysemy:** Without word context, models like CLIP can struggle to determine the intended visual concept. For example, ImageNet contains synsets for both construction cranes and the birds called cranes, and both are referred to simply as "crane".
    - **Varying Granularity:** The level of granularity used in ImageNet may not be optimal for a given recognition task. Some classes are very fine-grained (for example, individual dog breeds), while others are much broader.
    - **Hierarchical Overlap:** For certain tasks, like the image classification task in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the 1000 selected synsets are chosen such that there is no hierarchical overlap between them (no synset is an ancestor of another within this subset). This suggests that directly using the full WordNet hierarchy, where broader synsets contain more specific ones, could lead to complications or ambiguities in classification tasks if not handled carefully.
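
The "no hierarchical overlap" condition can be checked mechanically. The sketch below uses a tiny hand-made hierarchy; the `parent` map and the function names are hypothetical and only stand in for the real WordNet graph.

```python
# Hypothetical child -> parent map standing in for a WordNet-style hierarchy
parent = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "crane (bird)": "bird",
    "bird": "animal",
}

def ancestors(node, parent):
    """Return every ancestor of `node`, nearest first."""
    found = []
    while node in parent:
        node = parent[node]
        found.append(node)
    return found

def has_hierarchical_overlap(selected, parent):
    """True if any selected class is an ancestor of another selected class."""
    chosen = set(selected)
    return any(a in chosen for node in selected for a in ancestors(node, parent))

print(has_hierarchical_overlap(["dog", "cat", "crane (bird)"], parent))
print(has_hierarchical_overlap(["dog", "mammal"], parent))
```

A label set that fails this check contains images with more than one valid label (a dog is also a mammal), which makes single-label accuracy ambiguous.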

### Q4) Visual differences in same synset
- **Variation in visual characteristics:** Objects within the same synset may differ in appearance due to factors such as style, material, colour, or pose. Conversely, visually similar synsets, such as seals and sea otters, may end up close to each other even though they are separate synsets.
- **Differences in image context and background:** Images of the same synset may be captured in different environments and contexts.
- **Changes in scale, viewpoint, and articulation:** Objects in the same synset may appear at different scales, from different viewpoints, and in different states of articulation.

---

In [16]:
! pip install tensorflow_datasets

Collecting tensorflow_datasets
  Downloading tensorflow_datasets-4.9.2-py3-none-any.whl.metadata (9.0 kB)
Collecting array-record (from tensorflow_datasets)
  Downloading array_record-0.4.0-py38-none-any.whl.metadata (502 bytes)
Collecting dm-tree (from tensorflow_datasets)
  Downloading dm_tree-0.1.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting etils>=0.9.0 (from etils[enp,epath]>=0.9.0->tensorflow_datasets)
  Downloading etils-1.3.0-py3-none-any.whl.metadata (5.5 kB)
Collecting promise (from tensorflow_datasets)
  Downloading promise-2.3.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorflow-metadata (from tensorflow_datasets)
  Downloading tensorflow_metadata-1.14.0-py3-none-any.whl.metadata (2.1 kB)
Collecting toml (from tensorflow_datasets)
  Using cached toml-0.10.2-py2.py3-none-any.whl.metadata (7.1 kB)
Collecting absl-py (from tensorflow_datasets)
  Downloading absl_py-1.4.0-py3-none-any.whl.metadata (2.3 kB)

In [1]:
from torchvision import datasets, transforms

from torchvision.datasets import ImageNet
import torchvision.transforms as T
import torch
import clip
from PIL import Image
from torch.utils.data import DataLoader


In [2]:


class CLIPClassifier:
    """Zero-shot image classifier built on a pretrained CLIP model."""

    def __init__(self, imagenet_classes, model_type="transformer"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.imagenet_classes = imagenet_classes
        # One text prompt per class, tokenized once and reused for every image
        self.text_inputs = torch.cat([
            clip.tokenize(f"a photo of a {c}") for c in imagenet_classes
        ]).to(self.device)
        # Choose the visual encoder: Vision Transformer or modified ResNet-50
        if model_type == "transformer":
            self.model, self.preprocess = clip.load("ViT-B/32", self.device)
        elif model_type == "rn50":
            self.model, self.preprocess = clip.load("RN50", self.device)
        else:
            raise ValueError("model_type must be 'transformer' or 'rn50'")
        self.model_type = model_type

    def classify_image(self, image_path):
        # Preprocess the image and add a batch dimension
        image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)
        with torch.no_grad():
            image_features = self.model.encode_image(image)
            text_features = self.model.encode_text(self.text_inputs)
        # L2-normalize so the dot product below is a cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Scale by the learned temperature, then softmax over the classes
        logit_scale = self.model.logit_scale.exp()
        logits = logit_scale * (image_features @ text_features.T)
        probs = logits.softmax(dim=-1)
        return probs.detach().cpu().numpy()
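
To see why `classify_image` multiplies the cosine similarities by `logit_scale` before the softmax, the following sketch compares a softmax with and without that scaling. The similarity values are made up, and the scale of 100 is an assumption based on the upper bound described in the CLIP paper rather than a value read from the loaded model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three class prompts
cosine_sims = [0.30, 0.25, 0.20]

# Assumption: logit_scale.exp() is about 100 for the released CLIP weights
logit_scale = 100.0

unscaled = softmax(cosine_sims)
scaled = softmax([logit_scale * s for s in cosine_sims])

print([round(p, 4) for p in unscaled])
print([round(p, 4) for p in scaled])
```

Cosine similarities live in a narrow range, so without the scale the softmax is close to uniform; with it, small similarity gaps turn into confident predictions.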




In [3]:
imagenet_classes = ["tench", "English springer", "cassette player", "chain saw", "church", "French horn", 
                    "garbage truck", "gas pump", "golf ball", "parachute"]

In [4]:

vit_classifier = CLIPClassifier(imagenet_classes, model_type="transformer")
rn50_classifier = CLIPClassifier(imagenet_classes, model_type="rn50")

In [64]:
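# Alternative test image (gas pump class, n03425413); disabled because it was overwritten below
# image_path = "/home/akash/ws/cv_assignment/assignment-5-MlLearnerAkash/Q3/ImageNetData/imagenette2/train/n03425413/ILSVRC2012_val_00005183.JPEG"

# French horn class (n03394916); this is the image classified in the cells below
image_path = "/home/akash/ws/cv_assignment/assignment-5-MlLearnerAkash/Q3/ImageNetData/imagenette2/val/n03394916/n03394916_29142.JPEG"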

In [65]:
# Get top predictions
probs = vit_classifier.classify_image(image_path)
top_indices = probs.argsort()[0][-5:][::-1]

print("Top 5 predictions:")
for idx in top_indices:
    print(f"{imagenet_classes[idx]:<25} {probs[0][idx]*100:.2f}%")

probs = rn50_classifier.classify_image(image_path)
top_indices = probs.argsort()[0][-5:][::-1]

print("Top 5 predictions:")
for idx in top_indices:
    print(f"{imagenet_classes[idx]:<25} {probs[0][idx]*100:.2f}%")


Top 5 predictions:
French horn               99.95%
garbage truck             0.02%
church                    0.01%
cassette player           0.01%
English springer          0.00%
Top 5 predictions:
French horn               92.63%
cassette player           3.54%
church                    1.40%
parachute                 0.87%
English springer          0.57%
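
The top-5 selection and printing logic is repeated for each model. One possible way to factor it out is sketched below; `top_k` is a hypothetical helper, and the probabilities are toy values rather than model outputs.

```python
def top_k(probs, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy probabilities for three classes
labels = ["tench", "French horn", "church"]
probs = [0.1, 0.7, 0.2]

best = top_k(probs, labels, k=2)
for name, p in best:
    print(f"{name:<25} {p * 100:.2f}%")
```

With the classifiers above, the call would look like `top_k(probs[0], imagenet_classes)`, since `classify_image` returns an array of shape `(1, num_classes)`.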


### FP16

In [66]:
import torch
from PIL import Image
import clip

class CLIPClassifierFP16:
    def __init__(self, imagenet_classes, model_type="transformer"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.imagenet_classes = imagenet_classes
        # Tokenize text prompts
        self.text_inputs = torch.cat([
            clip.tokenize(f"a photo of a {c}") for c in imagenet_classes
        ]).to(self.device)
        # Load model
        if model_type == "transformer":
            self.model, self.preprocess = clip.load("ViT-B/32", self.device)
        elif model_type == "rn50":
            self.model, self.preprocess = clip.load("RN50", self.device)
        else:
            raise ValueError("model_type must be 'transformer' or 'rn50'")
        # Convert model to FP16
        self.model.half()
        assert self.model.dtype == torch.half, "model should be in FP16"

    def classify_image(self, image_path):
        # Preprocess image and convert to FP16
        image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device).half()
        with torch.no_grad():
            # torch.autocast is the current API; torch.cuda.amp.autocast is deprecated
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                image_features = self.model.encode_image(image)
                text_features = self.model.encode_text(self.text_inputs)
        # Normalize features
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Compute logits
        logit_scale = self.model.logit_scale.exp()
        logits = logit_scale * (image_features @ text_features.T)
        probs = logits.softmax(dim=-1)
        return probs.detach().cpu().numpy()


In [67]:

vit_classifier16 = CLIPClassifierFP16(imagenet_classes, model_type="transformer")
rn50_classifier16 = CLIPClassifierFP16(imagenet_classes, model_type="rn50")

In [68]:
# Get top predictions
probs = vit_classifier16.classify_image(image_path)
top_indices = probs.argsort()[0][-5:][::-1]

print("Top 5 predictions:")
for idx in top_indices:
    print(f"{imagenet_classes[idx]:<25} {probs[0][idx]*100:.2f}%")

print("="*10)
probs = rn50_classifier16.classify_image(image_path)
top_indices = probs.argsort()[0][-5:][::-1]

print("Top 5 predictions:")
for idx in top_indices:
    print(f"{imagenet_classes[idx]:<25} {probs[0][idx]*100:.2f}%")


Top 5 predictions:
French horn               99.95%
garbage truck             0.02%
church                    0.01%
cassette player           0.01%
English springer          0.00%
Top 5 predictions:
French horn               92.87%
cassette player           3.41%
church                    1.34%
parachute                 0.83%
English springer          0.55%
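
To quantify how much the FP16 run differs from the FP32 run, the sketch below compares the RN50 percentages printed in the two outputs above. The numbers are copied from those outputs; the variable names are illustrative.

```python
# RN50 top-5 percentages copied from the FP32 and FP16 outputs above
fp32 = {"French horn": 92.63, "cassette player": 3.54, "church": 1.40,
        "parachute": 0.87, "English springer": 0.57}
fp16 = {"French horn": 92.87, "cassette player": 3.41, "church": 1.34,
        "parachute": 0.83, "English springer": 0.55}

# Absolute difference per class, in percentage points
diffs = {name: round(abs(fp32[name] - fp16[name]), 2) for name in fp32}
worst = max(diffs, key=diffs.get)

# The ranking is unchanged if both runs order the classes identically
same_ranking = (sorted(fp32, key=fp32.get, reverse=True)
                == sorted(fp16, key=fp16.get, reverse=True))

print(diffs)
print(worst, diffs[worst], same_ranking)
```

For this image the largest shift is well under one percentage point and the class ranking is identical, while the printed ViT-B/32 values match to two decimal places in both runs.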


