## **Embedding Generation Using DINO (Self-Distillation with No Labels)**

In this notebook, we will:
- Load the pre-processed and augmented images.
- Generate embeddings using the DINO model.
- Save the embeddings and related data for evaluation.

---

## **Table of Contents**
---
1. Import Libraries
2. Set Up Device
3. Load Augmented Image Mapping
4. Load DINO Model
5. Generate Embeddings
6. Select Representative Embeddings
7. Save Embeddings
8. Clear Memory
9. Conclusion

---
### **Step 1: Import Libraries**

We begin by importing the necessary libraries.

In [2]:
import os
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
import torch
from torchvision import transforms
import gc

---
### **Step 2: Set Up Device**

In [3]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cpu


---
### **Step 3: Load Augmented Image Mapping**

In [4]:
# Load augmented image mapping
augmented_df = pd.read_csv('augmented_image_mapping.csv')
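
The later cells read two columns from this mapping, `original_image` and `augmented_image`. As a minimal sketch of the expected layout, the standard-library example below parses a small in-memory CSV and groups augmented files by their original. The file names and the inline CSV are hypothetical; the real file is assumed to come from an earlier preprocessing step.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents for illustration only.
sample_csv = """original_image,augmented_image
img_001.jpg,img_001_aug0.jpg
img_001.jpg,img_001_aug1.jpg
img_002.jpg,img_002_aug0.jpg
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Fail early if a column used later in the notebook is missing.
required = {'original_image', 'augmented_image'}
missing = required - set(rows[0].keys())
if missing:
    raise ValueError(f'missing columns: {sorted(missing)}')

# Each original image maps to one or more augmented variants.
by_original = defaultdict(list)
for row in rows:
    by_original[row['original_image']].append(row['augmented_image'])

print(dict(by_original))
```

A check like this on the real DataFrame (for example, comparing `augmented_df.columns` against the required names) can catch a mismatched mapping file before the long embedding run starts.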

---
### **Step 4: Load DINO Model**

In [5]:
# Load the DINO model
model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16').to(device)
model.eval()

# Define the transform
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # Standard ImageNet channel statistics (the red-channel std is 0.229)
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

Using cache found in /Users/mohammedkhodorfirasal-tal/.cache/torch/hub/facebookresearch_dino_main
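
To make the last two steps of the transform concrete, the sketch below reproduces their arithmetic for a single pixel in plain Python: `ToTensor` scales 8-bit values to the range [0, 1], and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation. The helper `normalize_pixel` is a hypothetical name used only for illustration, and the constants are the standard ImageNet statistics.

```python
# Standard ImageNet channel statistics in RGB order.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the values produced by ToTensor + Normalize."""
    scaled = [c / 255.0 for c in rgb]  # ToTensor: [0, 255] -> [0.0, 1.0]
    return [(v - m) / s for v, m, s in zip(scaled, MEAN, STD)]

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

After normalization, white maps to positive values and black to negative values in every channel, which matches the input range the pretrained weights were trained on.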


---
### **Step 5: Generate Embeddings**

In [10]:
batch_size = 168  # Adjust to fit available memory (GPU memory, or RAM when running on CPU)
embeddings = []
original_image_files = []
augmented_image_files = []

# Directory containing augmented images
augmented_dir = './augmented_dataset/'

# Prepare a list of all image paths and corresponding original images
image_paths = augmented_df['augmented_image'].apply(lambda x: os.path.join(augmented_dir, x)).tolist()
original_images = augmented_df['original_image'].tolist()
augmented_images = augmented_df['augmented_image'].tolist()

# Total number of images
total_images = len(image_paths)

# Generate embeddings in batches
for start_idx in tqdm(range(0, total_images, batch_size), desc='Generating Embeddings'):
    end_idx = min(start_idx + batch_size, total_images)
    batch_image_paths = image_paths[start_idx:end_idx]
    batch_original_images = original_images[start_idx:end_idx]
    batch_augmented_images = augmented_images[start_idx:end_idx]

    # Load and preprocess images
    imgs = [Image.open(p).convert('RGB') for p in batch_image_paths]
    img_tensors = torch.stack([transform(img) for img in imgs]).to(device)

    # Generate embeddings
    with torch.no_grad():
        batch_embeddings = model(img_tensors)
        batch_embeddings = batch_embeddings.cpu().numpy()
        # Normalize embeddings
        batch_embeddings = batch_embeddings / np.linalg.norm(batch_embeddings, axis=1, keepdims=True)
        embeddings.extend(batch_embeddings)
        original_image_files.extend(batch_original_images)
        augmented_image_files.extend(batch_augmented_images)

    # Clear memory
    del imgs, img_tensors, batch_embeddings
    torch.cuda.empty_cache()
    gc.collect()

# Convert embeddings to NumPy array
embeddings = np.vstack(embeddings)

Generating Embeddings: 100%|██████████| 180/180 [38:42<00:00, 12.90s/it]
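
The loop above relies on two small pieces of logic: slicing the image list into consecutive batches, and scaling each embedding to unit length. The following is a minimal pure-Python sketch of both, using hypothetical helper names (`batch_ranges`, `l2_normalize`); the notebook itself does the normalization with NumPy over whole batches.

```python
import math

def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

ranges = list(batch_ranges(total=10, batch_size=4))
print(ranges)  # [(0, 4), (4, 8), (8, 10)]

unit = l2_normalize([3.0, 4.0])
print(unit)    # [0.6, 0.8]
```

The last batch is shorter whenever the total is not a multiple of the batch size, which is why the end index is clamped with `min`. After normalization, the dot product of two embeddings equals their cosine similarity.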


---
### **Step 6: Select Representative Embeddings**

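We use the Highest Norm Criterion: for each original image, we keep the augmented embedding with the largest Euclidean norm. Note that the embeddings were already L2-normalized in Step 5, so every norm computed here is approximately 1 and the choice is effectively decided by floating-point rounding. For the criterion to discriminate between augmentations, the norms would have to be computed before normalization.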

In [11]:
# Create a DataFrame for grouping
data = pd.DataFrame({
    'original_image': original_image_files,
    'augmented_image': augmented_image_files,
    'embedding_index': range(len(embeddings))
})

selected_embeddings = []
selected_image_files = []

# Group by original image and select the embedding with the highest norm
for original_image, group in data.groupby('original_image'):
    indices = group['embedding_index'].tolist()
    group_embeddings = embeddings[indices]
    # Compute norms of embeddings
    norms = np.linalg.norm(group_embeddings, axis=1)
    # Select the embedding with the highest norm
    best_idx_in_group = np.argmax(norms)
    best_idx = indices[best_idx_in_group]
    selected_embeddings.append(embeddings[best_idx])
    selected_image_files.append(original_image)

---
### **Step 7: Save Embeddings**

In [12]:
# Convert selected embeddings to NumPy array
selected_embeddings = np.vstack(selected_embeddings)

# Save embeddings and image files
np.save('embeddings_dino.npy', selected_embeddings)
np.save('image_files_dino.npy', selected_image_files)

print('Embeddings for DINO saved.')

Embeddings for DINO saved.
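
For the evaluation step, the two files can be read back with `np.load`. The round trip below is a small sketch using a temporary directory and hypothetical demo file names and values; it also shows that a list of file-name strings is stored as a NumPy Unicode array, so no pickling is involved.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for selected_embeddings and selected_image_files.
demo_embeddings = np.array([[0.6, 0.8], [1.0, 0.0]], dtype=np.float32)
demo_files = ['img_a.jpg', 'img_b.jpg']

with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, 'embeddings_demo.npy')
    files_path = os.path.join(tmp, 'image_files_demo.npy')

    np.save(emb_path, demo_embeddings)
    np.save(files_path, demo_files)  # the list is converted to a string array

    loaded_embeddings = np.load(emb_path)
    loaded_files = np.load(files_path)

# Row i of the embedding matrix belongs to file i.
print(loaded_embeddings.shape, loaded_files.tolist())
```

Keeping the two arrays in the same row order is what ties each embedding back to its image, so they should always be saved and loaded as a pair.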


---
### **Step 8: Clear Memory**

In [13]:
# Clear variables and free memory
del embeddings, augmented_image_files, original_image_files, data
del selected_embeddings, selected_image_files, model
torch.cuda.empty_cache()
gc.collect()

0

---
### **Conclusion**

We have generated embeddings using the DINO model, selected representative embeddings, and saved them for evaluation.