# Guide for Task 5: Library and API Usage

This guide provides a step-by-step walkthrough for each subtask in Task 5. It includes code snippets and explanations to help you complete the assignment.

## Environment Setup

First, set up a dedicated Python environment. You can use `conda` or `venv`.

```bash
# Using conda
conda create -n ai-pi-task5 python=3.10
conda activate ai-pi-task5

# Or using venv
python -m venv .venv
# On Windows
.\.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate
```

Next, install the necessary libraries. We'll add more as needed for specific tasks.

```bash
pip install torch torchvision torchaudio
pip install transformers datasets Pillow
pip install matplotlib seaborn pandas numpy
pip install umap-learn scikit-learn
pip install tqdm # For progress bars
pip install ipywidgets # For notebook progress bars
pip install tensorboard
pip install google-generativeai
```
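Before moving on, it can help to confirm that the packages are importable from the active environment. The following is a minimal sketch, not part of the assignment: the helper name `check_modules` and the module list are assumptions made for illustration, and it only checks that each top-level module can be found, not that its version is compatible.

```python
import importlib.util

# Import names differ from some pip package names
# (Pillow -> PIL, umap-learn -> umap, scikit-learn -> sklearn).
REQUIRED_MODULES = [
    "torch", "transformers", "datasets", "PIL", "matplotlib",
    "seaborn", "pandas", "numpy", "umap", "sklearn", "tqdm",
]


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name:<14} {'ok' if found else 'MISSING'}")
```

Any module reported as `MISSING` can be installed with the matching `pip install` line above.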

---

## Subtask 1: Deploy SigLIP and Run Example

This task involves setting up the environment and running the basic example from the model's Hugging Face page.

### 1. Development Environment Configuration

My process for setting up the environment was as follows:
1.  I created a new Conda environment with Python 3.10 to ensure a clean workspace and avoid dependency conflicts.
2.  I installed PyTorch, as it's a core dependency for the `transformers` library.
3.  I installed the `transformers` library to load the model, `Pillow` for image processing, and `datasets` for the upcoming tasks.
4.  To handle potential download issues with Hugging Face, I configured environment variables to use a mirror.
    ```bash
    # This step is optional if you have no network issues
    # Windows (cmd)
    set HF_ENDPOINT=https://hf-mirror.com
    # macOS/Linux
    export HF_ENDPOINT=https://hf-mirror.com
    ```
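The same setting can also be applied from inside Python, which is convenient in notebooks. This is a sketch under the assumption that the variable is set before the Hugging Face libraries are imported, since they generally read it at import time. The helper name `use_hf_mirror` is hypothetical.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com", env=os.environ):
    """Set HF_ENDPOINT unless one is already configured; return the value in effect."""
    env.setdefault("HF_ENDPOINT", endpoint)
    return env["HF_ENDPOINT"]


# Call this before `import transformers` or `import datasets`.
print(use_hf_mirror())
```

Using `setdefault` keeps any endpoint that was already exported in the shell, so the snippet does not silently override an existing configuration.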

### 2. Running the Example Code

The following code is adapted from the `google/siglip2-base-patch16-224` README. It loads the model, an image, and classifies the image against a set of text labels.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel
import torch

# Load the model and processor
model_name = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# Load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Define candidate labels
texts = ["a photo of a cat", "a photo of a dog"]

# Preprocess the image and text (SigLIP2 was trained with text padded to 64 tokens)
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

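# logits_per_image holds one similarity logit per (image, text) pair
logits_per_image = outputs.logits_per_image
# SigLIP is trained with a pairwise sigmoid loss, so torch.sigmoid(logits_per_image)
# gives an independent match probability per label. Softmax is used here instead
# to get a distribution over the candidate labels that sums to 1.
probs = logits_per_image.softmax(dim=1)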

# Print the results
print("Image-Text Similarity Probabilities:")
for i, label in enumerate(texts):
    print(f"- {label}: {probs[0][i].item():.4f}")
```

### 3. Results

When you run the code, the output should look similar to this (exact values may differ slightly across library versions and hardware):

```
Image-Text Similarity Probabilities:
- a photo of a cat: 0.9996
- a photo of a dog: 0.0004
```

This indicates the model is highly confident that the image contains a cat, which is correct.
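SigLIP is trained with a sigmoid loss, so its logits can be read either per pair (sigmoid) or relative to the other candidate labels (softmax). The sketch below contrasts the two using plain Python. The logit values are hypothetical and are not real model output.

```python
import math


def sigmoid(x):
    """Independent probability that one (image, text) pair matches."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Probabilities over all candidates, normalised to sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against two labels.
logits = [2.0, -6.0]

pairwise = [sigmoid(x) for x in logits]  # each label scored on its own
relative = softmax(logits)               # labels compete with each other

print("sigmoid:", [round(p, 4) for p in pairwise])
print("softmax:", [round(p, 4) for p in relative])
```

Softmax always assigns the full probability mass to the listed labels, even when none of them fits the image. Sigmoid scores can all be low in that case.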

---

## Subtask 2: Zero-Shot Classification on food101

Here, we'll test SigLIP's zero-shot classification performance on a subset of the `food101` dataset.

```python
from datasets import load_dataset
from PIL import Image
from transformers import AutoProcessor, AutoModel
import torch
from tqdm.auto import tqdm
import numpy as np

# 1. Load Model and Processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(model_name).to(device)
processor = AutoProcessor.from_pretrained(model_name)

# 2. Load and Prepare Dataset
print("Loading food101 dataset...")
# Load the validation set
full_dataset = load_dataset("ethz/food101", split="validation")
# Get the mapping from label ID to label name
labels = full_dataset.features['label'].names
# Create a clean version of labels for the model
text_labels = [f"a photo of {label.replace('_', ' ')}" for label in labels]

# 3. Create the test subset (10 images per class)
print("Creating subset of 1010 images...")
# Reading only the label column avoids decoding every image, and select()
# avoids relying on a stateful lambda inside filter()
counts = {i: 0 for i in range(len(labels))}
subset_indices = []
for idx, label in enumerate(full_dataset['label']):
    if counts[label] < 10:
        counts[label] += 1
        subset_indices.append(idx)
subset = full_dataset.select(subset_indices)
print(f"Subset created with {len(subset)} images.")

# 4. Perform Zero-Shot Classification and Evaluate Top-5 Accuracy
correct_top5 = 0
total = 0

# Process images in batches for efficiency
batch_size = 32 
for i in tqdm(range(0, len(subset), batch_size), desc="Evaluating"):
    batch = subset[i:i+batch_size]
    images = batch['image']
    true_labels = batch['label']

    # Preprocess inputs (SigLIP2 was trained with text padded to 64 tokens)
    inputs = processor(text=text_labels, images=images, padding="max_length", max_length=64, return_tensors="pt").to(device)

    # Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)
    
    logits_per_image = outputs.logits_per_image
    
    # Get top 5 predictions for each image in the batch
    top5_preds = torch.topk(logits_per_image, 5, dim=1).indices.cpu().numpy()

    # Check if the true label is in the top 5 predictions
    for j, label_idx in enumerate(true_labels):
        if label_idx in top5_preds[j]:
            correct_top5 += 1
    total += len(true_labels)

# 5. Report Results
accuracy_top5 = (correct_top5 / total) * 100
print(f"\nTotal images evaluated: {total}")
print(f"Top-5 Correct Predictions: {correct_top5}")
print(f"Zero-Shot Top-5 Accuracy on food101 subset: {accuracy_top5:.2f}%")
```
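The top-5 check in the loop above can be isolated into a small function, which makes it easy to report top-1 and top-5 side by side. This is a minimal pure-Python sketch: the function name `top_k_accuracy` and the toy scores are made up for illustration.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    if not scores:
        return 0.0
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy scores for 3 images over 4 classes (made-up numbers).
scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.50, 0.10, 0.30, 0.10],
    [0.30, 0.15, 0.05, 0.50],
]
labels = [1, 2, 0]

print("top-1:", top_k_accuracy(scores, labels, k=1))
print("top-2:", top_k_accuracy(scores, labels, k=2))
```

With the real model, `logits_per_image.tolist()` could be passed as `scores`, although `torch.topk` remains the faster choice for large batches.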

---

## Subtask 3: Embedding Generation and Visualization

This task involves generating embeddings for a specific subset of images and visualizing them using UMAP.

```python
from datasets import load_dataset
from transformers import AutoProcessor, AutoModel
import torch
import numpy as np
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# 1. Load Model and Processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/siglip2-base-patch16-224"
# Note: We only need the image tower of the model
model = AutoModel.from_pretrained(model_name).to(device)
processor = AutoProcessor.from_pretrained(model_name)

# 2. Load and Filter Dataset
print("Loading and filtering dataset...")
full_train_dataset = load_dataset("ethz/food101", split="train")
label_names = full_train_dataset.features['label'].names

target_classes = {
    'pizza': [], 'sushi': [], 'hamburger': [], 
    'ice_cream': [], 'dumplings': []
}
target_class_ids = {label_names.index(name) for name in target_classes}

# Filter for the 5 classes
filtered_dataset = full_train_dataset.filter(lambda x: x['label'] in target_class_ids)

# Take the first 100 images for each of the 5 classes
image_list = []
label_list = []
for label_name in target_classes:
    class_id = label_names.index(label_name)
    class_subset = filtered_dataset.filter(lambda x: x['label'] == class_id).select(range(100))
    image_list.extend(class_subset['image'])
    label_list.extend([label_name] * 100)

print(f"Created subset with {len(image_list)} images.")

# 3. Generate Embeddings
embeddings = []
batch_size = 32
with torch.no_grad():
    for i in tqdm(range(0, len(image_list), batch_size), desc="Generating Embeddings"):
        batch_images = image_list[i:i+batch_size]
        inputs = processor(images=batch_images, return_tensors="pt").to(device)
        image_features = model.get_image_features(**inputs)
        embeddings.append(image_features.cpu().numpy())

embeddings = np.vstack(embeddings)
print("Embeddings shape:", embeddings.shape)

# 4. UMAP Dimensionality Reduction
print("Running UMAP...")
reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
embedding_2d = reducer.fit_transform(embeddings)
print("2D Embeddings shape:", embedding_2d.shape)

# 5. Plotting
plt.figure(figsize=(12, 10))
sns.scatterplot(
    x=embedding_2d[:, 0],
    y=embedding_2d[:, 1],
    hue=label_list,
    palette=sns.color_palette("hsv", len(target_classes)),
    s=50,
    alpha=0.7
)
plt.title('UMAP Projection of Food101 Image Embeddings')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.legend(title='Food Category')
plt.grid(True)
plt.show()
```
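A UMAP plot is a qualitative view, so it can be useful to back it with a rough number computed on the original embeddings. The sketch below compares the mean cosine similarity of same-class pairs with that of different-class pairs. The function names and toy vectors are assumptions for illustration, and the code assumes at least one pair of each kind exists.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def intra_inter_similarity(vectors, labels):
    """Mean cosine similarity of same-label pairs and of different-label pairs."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = same if labels[i] == labels[j] else diff
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(same) / len(same), sum(diff) / len(diff)


# Toy 2-D "embeddings" standing in for the real SigLIP features.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["pizza", "pizza", "sushi", "sushi"]

intra, inter = intra_inter_similarity(vectors, labels)
print(f"intra-class: {intra:.3f}  inter-class: {inter:.3f}")
```

A clearly higher intra-class value suggests the clusters seen in the plot reflect structure in the embedding space rather than an artifact of the projection.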

---

## Subtask 4: Linear Probing

Here, we train a simple linear layer on top of the frozen SigLIP embeddings.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter
from datasets import load_dataset
from transformers import AutoProcessor, AutoModel
from tqdm.auto import tqdm
import numpy as np

# 0. Setup TensorBoard
writer = SummaryWriter('runs/food101_linear_probe')

# 1. Load Model and Processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "google/siglip2-base-patch16-224"
siglip_model = AutoModel.from_pretrained(model_name).to(device)
siglip_model.eval()  # Inference mode (disables dropout); weights stay fixed because embeddings are computed under torch.no_grad()
processor = AutoProcessor.from_pretrained(model_name)

# 2. Prepare Training and Validation Data
print("Preparing data...")
train_dataset_full = load_dataset("ethz/food101", split="train")
val_dataset_full = load_dataset("ethz/food101", split="validation")
num_classes = len(train_dataset_full.features['label'].names)

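# Read the label columns once; filtering the full dataset per class would
# decode every image 101 times
train_label_col = train_dataset_full['label']
val_label_col = val_dataset_full['label']

# Training data: 1st image of each class (101 images)
first_index = {}
for idx, label in enumerate(train_label_col):
    first_index.setdefault(label, idx)
train_subset = train_dataset_full.select([first_index[c] for c in range(num_classes)])
train_images = list(train_subset['image'])
train_labels = list(train_subset['label'])

# Validation data: images at positions 10-19 of each class (1010 images)
# These differ from the Subtask 2 images (positions 0-9 of each class)
positions = {c: [] for c in range(num_classes)}
for idx, label in enumerate(val_label_col):
    if len(positions[label]) < 20:
        positions[label].append(idx)
val_indices = [idx for c in range(num_classes) for idx in positions[c][10:20]]
val_subset = val_dataset_full.select(val_indices)
val_images = list(val_subset['image'])
val_labels = list(val_subset['label'])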

# 3. Generate Embeddings for Training and Validation Sets
def get_embeddings(images, batch_size=32):
    embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(images), batch_size), desc="Generating Embeddings"):
            batch = images[i:i+batch_size]
            inputs = processor(images=batch, return_tensors="pt").to(device)
            features = siglip_model.get_image_features(**inputs)
            embeddings.append(features.cpu())
    return torch.cat(embeddings)

train_embeddings = get_embeddings(train_images)
val_embeddings = get_embeddings(val_images)

train_labels = torch.LongTensor(train_labels)
val_labels = torch.LongTensor(val_labels)

train_loader = DataLoader(TensorDataset(train_embeddings, train_labels), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(val_embeddings, val_labels), batch_size=32)

# 4. Define and Train the Linear Probe
embedding_dim = train_embeddings.shape[1]
linear_probe = nn.Linear(embedding_dim, num_classes).to(device)

# Hyperparameters
learning_rate = 0.01 # This is a key hyperparameter to tune
epochs = 100
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(linear_probe.parameters(), lr=learning_rate)

print("Starting training...")
for epoch in tqdm(range(epochs), desc="Training Epochs"):
    linear_probe.train()
    running_loss = 0.0
    for embeddings, labels in train_loader:
        embeddings, labels = embeddings.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = linear_probe(embeddings)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    
    avg_train_loss = running_loss / len(train_loader)
    writer.add_scalar('Loss/train', avg_train_loss, epoch)

# 5. Evaluate the Model
linear_probe.eval()
correct = 0
total = 0
with torch.no_grad():
    for embeddings, labels in val_loader:
        embeddings, labels = embeddings.to(device), labels.to(device)
        outputs = linear_probe(embeddings)
        predicted = outputs.argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy on the validation set: {accuracy:.2f}%')
writer.close()
```
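Subtasks 2 and 4 both pick a fixed slice of images per class. That selection can be derived from the label column alone, without decoding any image. The helper below is a sketch with a hypothetical name (`indices_per_class`) and toy labels.

```python
from collections import defaultdict


def indices_per_class(labels, start, stop):
    """Return dataset indices of occurrences start..stop-1 of every label."""
    positions = defaultdict(list)
    for idx, label in enumerate(labels):
        positions[label].append(idx)
    selected = []
    for label in sorted(positions):
        selected.extend(positions[label][start:stop])
    return selected


# Toy label column with two classes.
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1]

print(indices_per_class(toy_labels, 0, 1))  # first image of each class
print(indices_per_class(toy_labels, 1, 3))  # second and third image of each class
```

With a Hugging Face dataset, the result can be passed to `dataset.select(...)`, for example `val_dataset_full.select(indices_per_class(val_dataset_full['label'], 10, 20))`.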

### 6. Analysis

*   **Accuracy Comparison**: The accuracy from linear probing will likely be lower than the zero-shot accuracy from Subtask 2. The probe sees only one image per class (101 images in total), so it can easily overfit to those specific examples. Note that the script above reports top-1 accuracy while Subtask 2 reports top-5 accuracy, and the two scripts evaluate different validation images, so compute the same metric on the same images before drawing conclusions.
*   **Bonus (Data Augmentation)**: Data augmentation would likely improve accuracy.
    *   **Why?** Augmentation (random flips, rotations, color jitter) creates new, slightly different training examples from the existing ones, which increases the effective size and diversity of the tiny training set and reduces overfitting. Because the probe is trained on precomputed embeddings, the augmentations have to be applied to the images before they pass through SigLIP, so that each augmented image produces its own embedding.

---

## Subtask 5: Classification with Gemini API

This task uses a multimodal LLM to classify the images.

**Note:** The following code uses the official Google Gemini Python SDK. The assignment prompt provides an OpenAI-compatible endpoint, but the native library is often easier to use. The core logic (prompting, image handling) remains the same.

```python
import google.generativeai as genai
from datasets import load_dataset
from PIL import Image
import io
import time
import asyncio
from tqdm.asyncio import tqdm as aio_tqdm
import json

# 1. Configure API
# Use arbitrary placeholders as requested
API_KEY = "YOUR_GEMINI_API_KEY" 
genai.configure(api_key=API_KEY)

# 2. Prepare Dataset and Labels
print("Loading dataset...")
# Use the same validation subset as Subtask 2 for a fair comparison
full_dataset = load_dataset("ethz/food101", split="validation")
labels = full_dataset.features['label'].names
counts = {i: 0 for i in range(len(labels))}
subset_indices = []
for idx, label in enumerate(full_dataset['label']):
    if counts[label] < 10:
        counts[label] += 1
        subset_indices.append(idx)
subset = full_dataset.select(subset_indices)
print(f"Using subset with {len(subset)} images.")

# 3. Define the Classification Function (Asynchronous)
async def classify_image_async(image: Image.Image, true_label_name: str, model, semaphore):
    async with semaphore:
        try:
            # Convert PIL image to bytes
            img_byte_arr = io.BytesIO()
            image.save(img_byte_arr, format='JPEG')
            img_byte_arr = img_byte_arr.getvalue()
            
            prompt = f"""
            Analyze the attached image of food.
            What is the single most likely food category for this image?
            Choose your answer ONLY from the following list: {', '.join(labels)}.
            Respond with a JSON object containing a single key "food_item" with the category name as the value.
            Example: {{"food_item": "pizza"}}
            """
            
            response = await model.generate_content_async([prompt, {'mime_type': 'image/jpeg', 'data': img_byte_arr}])
            
            # Extract the answer from the JSON response
            cleaned_text = response.text.strip().replace("```json", "").replace("```", "")
            prediction_json = json.loads(cleaned_text)
            predicted_label = prediction_json.get("food_item", "unknown")
            
            return predicted_label == true_label_name
        except Exception as e:
            print(f"An error occurred: {e}")
            return False

# 4. Run Asynchronous Classification
async def main():
    # Use a smaller subset for demonstration to avoid excessive API calls/cost
    # Set sample_size to len(subset) to run on the full 1010 images
    sample_size = 50 
    test_sample = subset.shuffle(seed=42).select(range(sample_size))

    # Use Gemini 1.5 Flash, which is fast and cost-effective
    model = genai.GenerativeModel('gemini-1.5-flash')
    
    # Use a semaphore to limit concurrent API calls to avoid rate limiting
    semaphore = asyncio.Semaphore(10) # Limit to 10 concurrent requests
    
    tasks = []
    for item in test_sample:
        image = item['image'].convert("RGB") # Ensure image is in RGB
        true_label_name = labels[item['label']]
        tasks.append(classify_image_async(image, true_label_name, model, semaphore))
        
    results = await aio_tqdm.gather(*tasks, desc="Classifying with Gemini")
    
    correct_predictions = sum(results)
    total_predictions = len(results)
    accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0
    
    print(f"\n--- Gemini Classification Results ---")
    print(f"Evaluated {total_predictions} images.")
    print(f"Correct Predictions: {correct_predictions}")
    print(f"Accuracy: {accuracy:.2f}%")

# Run the async main function
if __name__ == "__main__":
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running (plain script), so start one
        asyncio.run(main())
    else:
        # A loop is already running (e.g. in Jupyter), where run_until_complete()
        # would raise RuntimeError; run `await main()` in a notebook cell instead
        print("An event loop is already running; call `await main()` instead.")
```
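Model replies do not always follow the requested format exactly, so the parsing step is worth testing on its own. The sketch below is one possible stricter parser: it strips an optional Markdown fence, normalises the label, and rejects anything outside the allowed list. The function name `parse_food_item` and the normalisation rules are assumptions, not part of the Gemini SDK.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_food_item(text, valid_labels):
    """Extract "food_item" from a model reply; return None if it is unusable."""
    cleaned = text.strip()
    # Remove an optional Markdown code fence wrapped around the JSON.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Normalise to the dataset's naming style, e.g. "Ice Cream" -> "ice_cream".
    label = str(data.get("food_item", "")).strip().lower().replace(" ", "_")
    return label if label in valid_labels else None


valid = {"pizza", "sushi", "ice_cream"}
print(parse_food_item('{"food_item": "pizza"}', valid))
print(parse_food_item("I think it is pizza", valid))
```

Returning `None` for unusable replies lets the caller count parse failures separately from genuine misclassifications.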

### 5. Analysis

*   **Accuracy Comparison**: Gemini's accuracy will likely be high, potentially exceeding both the zero-shot SigLIP and the linear probe results, since current multimodal LLMs have strong visual understanding. Keep in mind that this script measures top-1 accuracy (one answer per image) on a 50-image sample by default, and that any request or parsing error is counted as a wrong prediction.
*   **Bonus (Async IO)**: The code above implements asynchronous I/O using `asyncio` and `tqdm.asyncio` (imported as `aio_tqdm`).
    *   **Benefit**: Instead of sending one API request and waiting for the response before sending the next, `asyncio` lets us keep several requests in flight at once. While one request waits on the remote server (I/O-bound work), the program can send others. This substantially reduces the total time needed to process the images, especially on the full 1010-image subset. The `Semaphore` caps the number of concurrent requests (10 here) to help stay within the API's rate limits.
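The effect of the semaphore can be observed without calling any API. In the sketch below, `asyncio.sleep` stands in for network latency, and the names `fake_request` and `run_all` are hypothetical. The code records the highest number of requests that were active at the same time.

```python
import asyncio


async def fake_request(i, semaphore, state, delay=0.01):
    """Stand-in for an API call: sleeps instead of using the network."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(delay)
        state["active"] -= 1
        return i


async def run_all(n_requests=20, limit=5):
    """Run n_requests fake calls with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_request(i, semaphore, state) for i in range(n_requests))
    )
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(run_all())
    print(f"{len(results)} requests finished, peak concurrency {peak}")
```

`asyncio.gather` returns results in the order the tasks were passed in, so predictions stay aligned with their inputs even though requests complete out of order.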
async def main():
    # Use a smaller subset for demonstration to avoid excessive API calls/cost
    # Set sample_size to len(subset) to run on the full 1010 images
    sample_size = 50 
    test_sample = subset.shuffle(seed=42).select(range(sample_size))

    # Use Gemini 1.5 Flash, which is fast and cost-effective
    model = genai.GenerativeModel('gemini-1.5-flash')
    
    # Use a semaphore to limit concurrent API calls to avoid rate limiting
    semaphore = asyncio.Semaphore(10) # Limit to 10 concurrent requests
    
    tasks = []
    for item in test_sample:
        image = item['image'].convert("RGB") # Ensure image is in RGB
        true_label_name = labels[item['label']]
        tasks.append(classify_image_async(image, true_label_name, model, semaphore))
        
    results = await aio_tqdm.gather(*tasks, desc="Classifying with Gemini")
    
    correct_predictions = sum(results)
    total_predictions = len(results)
    accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0
    
    print(f"\n--- Gemini Classification Results ---")
    print(f"Evaluated {total_predictions} images.")
    print(f"Correct Predictions: {correct_predictions}")
    print(f"Accuracy: {accuracy:.2f}%")

# Run the async main function
if __name__ == "__main__":
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Plain script: no event loop is running, so asyncio.run() can create one
        asyncio.run(main())
    else:
        # Jupyter/IPython already runs an event loop, where run_until_complete()
        # would raise; run `await main()` in a notebook cell instead
        print("Event loop already running: use `await main()` in a notebook cell.")
```
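
The exact string comparison in `classify_image_async` is sensitive to how the model formats its answer: a reply such as `"Ice Cream"` would not match the dataset label `ice_cream`. Below is a hedged sketch of a more tolerant parser. The helper name `parse_prediction` and the normalization rules are assumptions for illustration, not part of the Gemini SDK.

```python
import json
import re

def parse_prediction(text, labels):
    """Extract a label from a model reply; return "unknown" if none can be recovered."""
    # Take the first {...} span so code fences or extra prose do not break parsing
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return "unknown"
    try:
        value = json.loads(match.group(0)).get("food_item", "")
    except json.JSONDecodeError:
        return "unknown"
    # Normalize case and separators: "Ice Cream" -> "ice_cream"
    normalized = str(value).strip().lower().replace(" ", "_")
    return normalized if normalized in labels else "unknown"

labels = ["pizza", "ice_cream", "sushi"]
fence = "`" * 3  # a Markdown code fence, built programmatically for readability

print(parse_prediction(fence + 'json\n{"food_item": "Ice Cream"}\n' + fence, labels))  # ice_cream
print(parse_prediction('The answer is {"food_item": "pizza"}.', labels))               # pizza
print(parse_prediction("I am not sure.", labels))                                       # unknown
```

Returning `"unknown"` for anything outside the label list keeps malformed replies from being silently counted as a valid class.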

### 5. Analysis

*   **Accuracy Comparison**: Gemini's accuracy will likely be high, potentially exceeding both the zero-shot SigLIP and the linear probe results, since state-of-the-art multimodal LLMs have strong visual understanding. Two caveats apply when comparing numbers: the code above scores an exact top-1 label match on a 50-image sample by default, and any request that fails or returns unparsable output is counted as incorrect.
*   **Bonus (Async IO)**: The code above implements asynchronous I/O using `asyncio` and `aio_tqdm`.
    *   **Benefit**: Instead of sending one API request and waiting for the response before sending the next, `asyncio` lets us keep multiple requests in flight concurrently. While one request is waiting for the remote server to respond (I/O-bound work), the program can send others. This substantially reduces the total time needed to process the images (50 by default, 1010 for the full subset). The `Semaphore` caps the number of concurrent requests, which helps stay within the API's rate limits.
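
The effect of a semaphore on concurrency can be seen without calling any API. The sketch below replaces the network call with `asyncio.sleep`; the names `fake_request` and `run_all`, the latency, and the request counts are all illustrative assumptions.

```python
import asyncio
import time

async def fake_request(i, semaphore, stats, latency=0.05):
    """Stand-in for one API call: waits `latency` seconds instead of contacting a server."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(latency)  # simulated network wait
        stats["in_flight"] -= 1
        return i

async def run_all(n_requests=20, limit=5):
    semaphore = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_request(i, semaphore, stats) for i in range(n_requests))
    )
    elapsed = time.perf_counter() - start
    return results, stats["peak"], elapsed

# In a notebook, use `await run_all()` instead of asyncio.run()
results, peak, elapsed = asyncio.run(run_all())
print(f"{len(results)} requests, peak concurrency {peak}, {elapsed:.2f}s")
```

With 20 simulated requests of 0.05 s each, a sequential loop would need about 1 s, while a limit of 5 lets them finish in roughly four waves. The peak concurrency never exceeds the semaphore limit.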

# This single block contains the complete code for all 5 subtasks.
# Ensure you have the necessary libraries installed before running.
#
# --- ENVIRONMENT SETUP ---
# pip install torch torchvision torchaudio
# pip install transformers datasets Pillow
# pip install matplotlib seaborn pandas numpy
# pip install umap-learn scikit-learn
# pip install tqdm ipywidgets
# pip install tensorboard
# pip install google-generativeai

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel
import torch
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter
import google.generativeai as genai
import io
import asyncio
from tqdm.asyncio import tqdm as aio_tqdm
import json

# --- SUBTASK 1: Deploy SigLIP and Run Example ---
# This task involves setting up the environment and running the basic 
# example from the model's Hugging Face page.

print("--- Running Subtask 1: SigLIP Basic Example ---")
# Load the model and processor
model_name_subtask1 = "google/siglip2-base-patch16-224"
model_subtask1 = AutoModel.from_pretrained(model_name_subtask1)
processor_subtask1 = AutoProcessor.from_pretrained(model_name_subtask1)

# Load an example image
url_subtask1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_subtask1 = Image.open(requests.get(url_subtask1, stream=True).raw)

# Define candidate labels
texts_subtask1 = ["a photo of a cat", "a photo of a dog"]

# Preprocess the image and text
inputs_subtask1 = processor_subtask1(text=texts_subtask1, images=image_subtask1, padding="max_length", return_tensors="pt")

# Get model outputs
with torch.no_grad():
    outputs_subtask1 = model_subtask1(**inputs_subtask1)

# The logits_per_image gives the similarity score between the image and each text label
logits_per_image_subtask1 = outputs_subtask1.logits_per_image
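# Apply softmax to normalize the scores across the candidate labels.
# Note: SigLIP is trained with a sigmoid loss, and its documentation applies
# torch.sigmoid(logits_per_image) to get independent per-pair probabilities.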
probs_subtask1 = logits_per_image_subtask1.softmax(dim=1)

# Print the results
print("Image-Text Similarity Probabilities:")
for i, label in enumerate(texts_subtask1):
    print(f"- {label}: {probs_subtask1[0][i].item():.4f}")
print("-" * 20, "\n")

# --- Question/Analysis for Subtask 1 ---
# Q: What are the results?
# A: The output should look similar to the following (exact values may vary with library and checkpoint versions):
#    Image-Text Similarity Probabilities:
#    - a photo of a cat: 0.9996
#    - a photo of a dog: 0.0004
#    This indicates the model is highly confident that the image contains a cat.


# --- SUBTASK 2: Zero-Shot Classification on food101 ---
# Here, we'll test SigLIP's zero-shot classification performance on a subset of the `food101` dataset.

print("--- Running Subtask 2: Zero-Shot Classification ---")
# 1. Load Model and Processor
device_subtask2 = "cuda" if torch.cuda.is_available() else "cpu"
model_name_subtask2 = "google/siglip2-base-patch16-224"
model_subtask2 = AutoModel.from_pretrained(model_name_subtask2).to(device_subtask2)
processor_subtask2 = AutoProcessor.from_pretrained(model_name_subtask2)

# 2. Load and Prepare Dataset
print("Loading food101 dataset...")
full_dataset_subtask2 = load_dataset("ethz/food101", split="validation")
labels_subtask2 = full_dataset_subtask2.features['label'].names
text_labels_subtask2 = [f"a photo of {label.replace('_', ' ')}" for label in labels_subtask2]

# 3. Create the test subset (10 images per class)
print("Creating subset of 1010 images...")
counts_subtask2 = {i: 0 for i in range(101)}
subset_subtask2 = full_dataset_subtask2.filter(
    lambda example: counts_subtask2[example['label']] < 10 and (counts_subtask2.update({example['label']: counts_subtask2[example['label']] + 1}) or True)
)
print(f"Subset created with {len(subset_subtask2)} images.")

# 4. Perform Zero-Shot Classification and Evaluate Top-5 Accuracy
correct_top5_subtask2 = 0
total_subtask2 = 0
batch_size_subtask2 = 32
for i in tqdm(range(0, len(subset_subtask2), batch_size_subtask2), desc="Evaluating (Subtask 2)"):
    batch = subset_subtask2[i:i+batch_size_subtask2]
    images = batch['image']
    true_labels = batch['label']
    inputs = processor_subtask2(text=text_labels_subtask2, images=images, padding="max_length", return_tensors="pt").to(device_subtask2)
    with torch.no_grad():
        outputs = model_subtask2(**inputs)
    logits_per_image = outputs.logits_per_image
    top5_preds = torch.topk(logits_per_image, 5, dim=1).indices.cpu().numpy()
    for j, label_idx in enumerate(true_labels):
        if label_idx in top5_preds[j]:
            correct_top5_subtask2 += 1
    total_subtask2 += len(true_labels)

# 5. Report Results
accuracy_top5_subtask2 = (correct_top5_subtask2 / total_subtask2) * 100
print(f"\nTotal images evaluated: {total_subtask2}")
print(f"Top-5 Correct Predictions: {correct_top5_subtask2}")
print(f"Zero-Shot Top-5 Accuracy on food101 subset: {accuracy_top5_subtask2:.2f}%")
print("-" * 20, "\n")


# --- SUBTASK 3: Embedding Generation and Visualization ---
# This task involves generating embeddings for a specific subset of images and visualizing them using UMAP.

print("--- Running Subtask 3: Embedding Visualization ---")
# 1. Load Model and Processor
device_subtask3 = "cuda" if torch.cuda.is_available() else "cpu"
model_name_subtask3 = "google/siglip2-base-patch16-224"
model_subtask3 = AutoModel.from_pretrained(model_name_subtask3).to(device_subtask3)
processor_subtask3 = AutoProcessor.from_pretrained(model_name_subtask3)

# 2. Load and Filter Dataset
print("Loading and filtering dataset for visualization...")
full_train_dataset_subtask3 = load_dataset("ethz/food101", split="train")
label_names_subtask3 = full_train_dataset_subtask3.features['label'].names
target_classes_subtask3 = {'pizza': [], 'sushi': [], 'hamburger': [], 'ice_cream': [], 'dumplings': []}
target_class_ids_subtask3 = {label_names_subtask3.index(name) for name in target_classes_subtask3}
filtered_dataset_subtask3 = full_train_dataset_subtask3.filter(lambda x: x['label'] in target_class_ids_subtask3)
image_list_subtask3, label_list_subtask3 = [], []
for label_name in target_classes_subtask3:
    class_id = label_names_subtask3.index(label_name)
    class_subset = filtered_dataset_subtask3.filter(lambda x: x['label'] == class_id).select(range(100))
    image_list_subtask3.extend(class_subset['image'])
    label_list_subtask3.extend([label_name] * 100)
print(f"Created subset with {len(image_list_subtask3)} images.")

# 3. Generate Embeddings
embeddings_subtask3 = []
batch_size_subtask3 = 32
with torch.no_grad():
    for i in tqdm(range(0, len(image_list_subtask3), batch_size_subtask3), desc="Generating Embeddings (Subtask 3)"):
        batch_images = image_list_subtask3[i:i+batch_size_subtask3]
        inputs = processor_subtask3(images=batch_images, return_tensors="pt").to(device_subtask3)
        image_features = model_subtask3.get_image_features(**inputs)
        embeddings_subtask3.append(image_features.cpu().numpy())
embeddings_subtask3 = np.vstack(embeddings_subtask3)

# 4. UMAP Dimensionality Reduction
print("Running UMAP...")
reducer_subtask3 = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
embedding_2d_subtask3 = reducer_subtask3.fit_transform(embeddings_subtask3)

# 5. Plotting
plt.figure(figsize=(12, 10))
sns.scatterplot(x=embedding_2d_subtask3[:, 0], y=embedding_2d_subtask3[:, 1], hue=label_list_subtask3, palette=sns.color_palette("hsv", len(target_classes_subtask3)), s=50, alpha=0.7)
plt.title('UMAP Projection of Food101 Image Embeddings')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.legend(title='Food Category')
plt.grid(True)
print("Displaying UMAP plot...")
plt.show()
print("-" * 20, "\n")


# --- SUBTASK 4: Linear Probing ---
# Here, we train a simple linear layer on top of the frozen SigLIP embeddings.

print("--- Running Subtask 4: Linear Probing ---")
# 0. Setup TensorBoard
writer_subtask4 = SummaryWriter('runs/food101_linear_probe')

# 1. Load Model and Processor
device_subtask4 = "cuda" if torch.cuda.is_available() else "cpu"
model_name_subtask4 = "google/siglip2-base-patch16-224"
siglip_model_subtask4 = AutoModel.from_pretrained(model_name_subtask4).to(device_subtask4)
siglip_model_subtask4.eval() # Switch to inference mode; gradients are disabled by torch.no_grad() in get_embeddings
processor_subtask4 = AutoProcessor.from_pretrained(model_name_subtask4)

# 2. Prepare Training and Validation Data
print("Preparing data for linear probing...")
train_dataset_full_subtask4 = load_dataset("ethz/food101", split="train")
val_dataset_full_subtask4 = load_dataset("ethz/food101", split="validation")
num_classes_subtask4 = len(train_dataset_full_subtask4.features['label'].names)
train_images_subtask4, train_labels_subtask4 = [], []
for class_id in range(num_classes_subtask4):
    img = train_dataset_full_subtask4.filter(lambda x: x['label'] == class_id)[0]
    train_images_subtask4.append(img['image'])
    train_labels_subtask4.append(img['label'])
val_images_subtask4, val_labels_subtask4 = [], []
for class_id in range(num_classes_subtask4):
    imgs = val_dataset_full_subtask4.filter(lambda x: x['label'] == class_id).select(range(10, 20))
    val_images_subtask4.extend(imgs['image'])
    val_labels_subtask4.extend(imgs['label'])

# 3. Generate Embeddings for Training and Validation Sets
def get_embeddings(images, batch_size=32, desc=""):
    embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(images), batch_size), desc=desc):
            batch = images[i:i+batch_size]
            inputs = processor_subtask4(images=batch, return_tensors="pt").to(device_subtask4)
            features = siglip_model_subtask4.get_image_features(**inputs)
            embeddings.append(features.cpu())
    return torch.cat(embeddings)

train_embeddings_subtask4 = get_embeddings(train_images_subtask4, desc="Train Embeddings (Subtask 4)")
val_embeddings_subtask4 = get_embeddings(val_images_subtask4, desc="Validation Embeddings (Subtask 4)")
train_labels_subtask4 = torch.LongTensor(train_labels_subtask4)
val_labels_subtask4 = torch.LongTensor(val_labels_subtask4)
train_loader_subtask4 = DataLoader(TensorDataset(train_embeddings_subtask4, train_labels_subtask4), batch_size=16, shuffle=True)
val_loader_subtask4 = DataLoader(TensorDataset(val_embeddings_subtask4, val_labels_subtask4), batch_size=32)

# 4. Define and Train the Linear Probe
embedding_dim_subtask4 = train_embeddings_subtask4.shape[1]
linear_probe_subtask4 = nn.Linear(embedding_dim_subtask4, num_classes_subtask4).to(device_subtask4)
learning_rate_subtask4, epochs_subtask4 = 0.01, 100
criterion_subtask4 = nn.CrossEntropyLoss()
optimizer_subtask4 = optim.Adam(linear_probe_subtask4.parameters(), lr=learning_rate_subtask4)

print("Starting training...")
for epoch in tqdm(range(epochs_subtask4), desc="Training Epochs (Subtask 4)"):
    linear_probe_subtask4.train()
    running_loss = 0.0
    for embeddings, labels in train_loader_subtask4:
        embeddings, labels = embeddings.to(device_subtask4), labels.to(device_subtask4)
        optimizer_subtask4.zero_grad()
        outputs = linear_probe_subtask4(embeddings)
        loss = criterion_subtask4(outputs, labels)
        loss.backward()
        optimizer_subtask4.step()
        running_loss += loss.item()
    avg_train_loss = running_loss / len(train_loader_subtask4)
    writer_subtask4.add_scalar('Loss/train', avg_train_loss, epoch)

# 5. Evaluate the Model
linear_probe_subtask4.eval()
correct_subtask4, total_subtask4 = 0, 0
with torch.no_grad():
    for embeddings, labels in val_loader_subtask4:
        embeddings, labels = embeddings.to(device_subtask4), labels.to(device_subtask4)
        outputs = linear_probe_subtask4(embeddings)
        _, predicted = torch.max(outputs.data, 1)
        total_subtask4 += labels.size(0)
        correct_subtask4 += (predicted == labels).sum().item()
accuracy_subtask4 = 100 * correct_subtask4 / total_subtask4
print(f'Accuracy on the validation set: {accuracy_subtask4:.2f}%')
writer_subtask4.close()
print("-" * 20, "\n")

# --- Questions/Analysis for Subtask 4 ---
# Q1: How does the linear probing accuracy compare to the zero-shot accuracy from Subtask 2?
# A1: The accuracy from linear probing will likely be lower. This is because we are training on an extremely small dataset (only 101 images), which can lead to overfitting. The model learns the specific training examples well but fails to generalize to the unseen validation set.
#
# Q2: Would data augmentation improve accuracy? Why?
# A2: Yes, data augmentation would almost certainly improve accuracy. Augmentation (like random flips, rotations, color jitter) creates new, slightly different training examples. This effectively increases the size and diversity of our tiny training set, helping the model learn more robust features and reducing overfitting.


# --- SUBTASK 5: Classification with Gemini API ---
# This task uses a multimodal LLM to classify the images.

print("--- Running Subtask 5: Classification with Gemini API ---")
# 1. Configure API
# NOTE: Replace "YOUR_GEMINI_API_KEY" with your actual key to run this.
API_KEY_subtask5 = "YOUR_GEMINI_API_KEY" 
if API_KEY_subtask5 == "YOUR_GEMINI_API_KEY":
    print("Skipping Subtask 5: Please provide a valid Gemini API Key.")
else:
    genai.configure(api_key=API_KEY_subtask5)

    # 2. Prepare Dataset and Labels
    print("Loading dataset for Gemini...")
    full_dataset_subtask5 = load_dataset("ethz/food101", split="validation")
    labels_subtask5 = full_dataset_subtask5.features['label'].names
    counts_subtask5 = {i: 0 for i in range(101)}
    subset_subtask5 = full_dataset_subtask5.filter(
        lambda example: counts_subtask5[example['label']] < 10 and (counts_subtask5.update({example['label']: counts_subtask5[example['label']] + 1}) or True)
    )
    print(f"Using subset with {len(subset_subtask5)} images.")

    # 3. Define the Asynchronous Classification Function
    async def classify_image_async(image: Image.Image, true_label_name: str, model, semaphore):
        async with semaphore:
            try:
                img_byte_arr = io.BytesIO()
                image.save(img_byte_arr, format='JPEG')
                prompt = f"""Analyze the attached image of food. What is the single most likely food category for this image? Choose your answer ONLY from the following list: {', '.join(labels_subtask5)}. Respond with a JSON object containing a single key "food_item" with the category name as the value. Example: {{"food_item": "pizza"}}"""
                response = await model.generate_content_async([prompt, {'mime_type': 'image/jpeg', 'data': img_byte_arr.getvalue()}])
                cleaned_text = response.text.strip().replace("```json", "").replace("```", "")
                predicted_label = json.loads(cleaned_text).get("food_item", "unknown")
                return predicted_label == true_label_name
            except Exception as e:
                print(f"An error occurred: {e}")
                return False

    # 4. Run Asynchronous Classification
    async def main_subtask5():
        sample_size = 50 # Use a smaller subset to avoid excessive API calls/cost
        test_sample = subset_subtask5.shuffle(seed=42).select(range(sample_size))
        model = genai.GenerativeModel('gemini-1.5-flash')
        semaphore = asyncio.Semaphore(10) # Limit to 10 concurrent requests
        tasks = [classify_image_async(item['image'].convert("RGB"), labels_subtask5[item['label']], model, semaphore) for item in test_sample]
        results = await aio_tqdm.gather(*tasks, desc="Classifying with Gemini (Subtask 5)")
        correct_predictions = sum(results)
        total_predictions = len(results)
        accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0
        print(f"\n--- Gemini Classification Results ---")
        print(f"Evaluated {total_predictions} images.")
        print(f"Correct Predictions: {correct_predictions}")
        print(f"Accuracy: {accuracy:.2f}%")

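    # Run the async main function
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Plain script: no event loop is running, so asyncio.run() can create one
        asyncio.run(main_subtask5())
    else:
        # Jupyter/IPython already runs an event loop, where run_until_complete()
        # would raise; run `await main_subtask5()` in a notebook cell instead
        print("Event loop already running: use `await main_subtask5()` in a notebook cell.")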

# --- Questions/Analysis for Subtask 5 ---
# Q1: How does the Gemini accuracy compare to the other methods?
# A1: The accuracy of Gemini will likely be very high, potentially exceeding both the zero-shot SigLIP and the linear probe results. State-of-the-art multimodal LLMs have powerful visual understanding and reasoning capabilities, allowing them to perform well even with complex classification tasks and strict output formatting.
#
# Q2: What is the benefit of using asynchronous I/O for this task?
# A2: Instead of sending one API request and waiting for the response before sending the next (synchronous), `asyncio` allows us to send multiple requests concurrently. While one request is waiting for the remote server to process and respond (I/O-bound work), our program can send other requests. This dramatically reduces the total time required to process all images, as many requests are "in-flight" at the same time. The `Semaphore` is crucial for controlling the level of concurrency to stay within the API's rate limits.

The complete code for Task 3 follows, divided into blocks corresponding to each subtask.

### Subtask 1: Core Operator Implementation

This block contains the implementation for the `linear_layer`, `relu`, and `flatten` functions as required by the first subtask.

````python
import numpy as np

def linear_layer(x, w, b):
    """
    Forward pass of a fully connected layer.
    Formula: y = xW^T + b
    x: (N, in_features)
    w: (out_features, in_features)
    b: (out_features,)
    Returns: (N, out_features)
    """
    return np.dot(x, w.T) + b

def relu(x):
    """
    Apply element-wise ReLU (Rectified Linear Unit) to the input tensor x.
    Formula: f(x) = max(0, x)
    """
    # ===== Implement here =====
    return np.maximum(0, x)

def flatten(x):
    """
    Flatten a 4D tensor (N, C, H, W) into a 2D tensor (N, C*H*W).
    N is the batch size and must stay unchanged.
    """
    # ===== Implement here =====
    N = x.shape[0]
    return x.reshape(N, -1)
````

### Subtask 1: Comparative Experiment

This block contains the code to perform the comparative experiment between a purely linear network and one with a ReLU activation function.

````python
# 1. Create the input data
x = np.array([[-2], [-1], [0], [1], [2]], dtype=np.float32)

# 2. Set the network weights
w1, b1 = np.array([[2]], dtype=np.float32), np.array([-1], dtype=np.float32)
w2, b2 = np.array([[-1]], dtype=np.float32), np.array([0.5], dtype=np.float32)

# 3. Simulate "purely linear network A"
print("--- Purely linear network A ---")
# First layer
out_a1 = linear_layer(x, w1, b1)
# Second layer
out_a2 = linear_layer(out_a1, w2, b2)
print("Final output of network A:\n", out_a2)


# 4. Simulate "network B with a non-linearity"
print("\n--- Network B with a non-linearity ---")
# First layer
out_b1 = linear_layer(x, w1, b1)
print("Network B before ReLU:\n", out_b1)
# ReLU activation
out_b1_relu = relu(out_b1)
print("Network B after ReLU:\n", out_b1_relu)
# Second layer
out_b2 = linear_layer(out_b1_relu, w2, b2)
print("Final output of network B:\n", out_b2)
````
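
The point of the comparison can be checked numerically: two stacked linear layers collapse into a single linear layer with $W_{\text{eff}} = W_2 W_1$ and $b_{\text{eff}} = W_2 b_1 + b_2$, whereas inserting ReLU breaks this equivalence. The sketch below reuses the same weights; the names `w_eff`, `b_eff`, and `out_single` are introduced here for illustration only.

```python
import numpy as np

def linear_layer(x, w, b):
    return x @ w.T + b

def relu(x):
    return np.maximum(0, x)

x = np.array([[-2], [-1], [0], [1], [2]], dtype=np.float32)
w1, b1 = np.array([[2]], dtype=np.float32), np.array([-1], dtype=np.float32)
w2, b2 = np.array([[-1]], dtype=np.float32), np.array([0.5], dtype=np.float32)

# Network A: two stacked linear layers
out_a = linear_layer(linear_layer(x, w1, b1), w2, b2)

# Equivalent single layer: W_eff = W2 W1, b_eff = W2 b1 + b2
w_eff = w2 @ w1
b_eff = w2 @ b1 + b2
out_single = linear_layer(x, w_eff, b_eff)

# Network B: ReLU between the two layers
out_b = linear_layer(relu(linear_layer(x, w1, b1)), w2, b2)

print("A      :", out_a.ravel())
print("single :", out_single.ravel())
print("B      :", out_b.ravel())
```

For these weights, network A reduces to $y = -2x + 1.5$, while network B outputs a constant 0.5 wherever the first layer's output is negative, which no single linear layer can reproduce.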

### Subtask 2: 2D Convolution Implementation & Experiment

This block provides the implementation for the `conv2d` function and the guided experiment to demonstrate feature detection and translation invariance.

````python
def conv2d(x, w, b, stride=1, padding=0):
    """
    A naive 2D convolution implemented with loops.
    x: (N, C_in, H, W)
    w: (C_out, C_in, kH, kW)
    b: (C_out,)
    """
    # ===== Implement here =====
    N, C_in, H, W = x.shape
    C_out, _, kH, kW = w.shape

    # Compute the output size
    H_out = (H + 2 * padding - kH) // stride + 1
    W_out = (W + 2 * padding - kW) // stride + 1

    # Add padding
    if padding > 0:
        x_padded = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)), mode='constant')
    else:
        x_padded = x

    # Initialize the output
    out = np.zeros((N, C_out, H_out, W_out))

    # Perform the convolution
    for n in range(N):  # iterate over the batch
        for c_out in range(C_out):  # iterate over output channels
            for h_out in range(H_out):  # iterate over output height
                for w_out in range(W_out):  # iterate over output width
                    h_start = h_out * stride
                    w_start = w_out * stride

                    # Extract the current window
                    window = x_padded[n, :, h_start:h_start + kH, w_start:w_start + kW]

                    # Compute the convolution (element-wise product, then sum)
                    conv_sum = np.sum(window * w[c_out, :, :, :])

                    # Add the bias
                    out[n, c_out, h_out, w_out] = conv_sum + b[c_out]
    return out

if __name__ == "__main__":
    # --- Phase 1: verify feature detection ---
    print("--- Phase 1: Feature detection ---")
    # 1. Define a 5x5 image with a "cross" pattern in the center
    image_centered = np.array([[
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0]
    ]], dtype=np.float32).reshape(1, 1, 5, 5)

    # 2. Design a 3x3 "cross" convolution kernel
    kernel_cross = np.array([[
        [0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]
    ]], dtype=np.float32).reshape(1, 1, 3, 3)
    bias_zero = np.array([0], dtype=np.float32)

    # 3. Run the convolution and inspect the output
    output_centered = conv2d(image_centered, kernel_cross, bias_zero, stride=1, padding=0)
    print("Convolution output for the centered pattern:\n", output_centered[0, 0])

    # --- Phase 2: translation invariance ---
    print("\n--- Phase 2: Translation invariance ---")
    # 1. Create a new image with the "cross" pattern shifted one cell down and to the right
    image_shifted = np.array([[
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 0]
    ]], dtype=np.float32).reshape(1, 1, 5, 5)

    # 2. Convolve with exactly the same kernel and inspect the output
    output_shifted = conv2d(image_shifted, kernel_cross, bias_zero, stride=1, padding=0)
    print("Convolution output for the shifted pattern:\n", output_shifted[0, 0])
````
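
To read the experiment's result at a glance, one can locate the strongest response in each output map. The sketch below is a self-contained, single-channel simplification of the experiment above; `correlate2d_valid`, `place_cross`, and `peak_location` are hypothetical helper names introduced here, not part of the assignment template.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Single-channel sliding-window product-sum without padding (stride 1)."""
    kH, kW = kernel.shape
    H_out, W_out = image.shape[0] - kH + 1, image.shape[1] - kW + 1
    out = np.zeros((H_out, W_out), dtype=np.float32)
    for i in range(H_out):
        for j in range(W_out):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.float32)

def place_cross(center_row, center_col, size=5):
    """Draw the cross pattern on a blank size x size image."""
    img = np.zeros((size, size), dtype=np.float32)
    img[center_row - 1:center_row + 2, center_col - 1:center_col + 2] = cross
    return img

def peak_location(feature_map):
    """Return the (row, col) of the maximum response as plain ints."""
    idx = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return tuple(int(v) for v in idx)

out_centered = correlate2d_valid(place_cross(2, 2), cross)  # same as image_centered
out_shifted = correlate2d_valid(place_cross(3, 3), cross)   # same as image_shifted

peak_centered = peak_location(out_centered)
peak_shifted = peak_location(out_shifted)
print("peak (centered):", peak_centered, "value", float(out_centered[peak_centered]))
print("peak (shifted): ", peak_shifted, "value", float(out_shifted[peak_shifted]))
```

The peak value is 5 in both cases, but its position moves from (1, 1) to (2, 2) along with the pattern. Strictly speaking, this property is translation equivariance: the same kernel detects the feature anywhere, and the response shifts with the input.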

### Subtask 3: Max Pooling Implementation & Experiment

This block contains the implementation for the `max_pool2d` function and the experiment demonstrating its effect on feature maps.

````python
def max_pool2d(x, kernel_size=2, stride=2):
    """
    A naive 2D max pooling operation.
    x: (N, C, H, W)
    """
    # ===== Implement here =====
    N, C, H, W = x.shape

    # Compute the output size
    H_out = (H - kernel_size) // stride + 1
    W_out = (W - kernel_size) // stride + 1

    # Initialize the output
    out = np.zeros((N, C, H_out, W_out))

    # Perform the pooling
    for n in range(N):
        for c in range(C):
            for h_out in range(H_out):
                for w_out in range(W_out):
                    h_start = h_out * stride
                    w_start = w_out * stride

                    # Extract the current window
                    window = x[n, c, h_start:h_start + kernel_size, w_start:w_start + kernel_size]

                    # Take the maximum value inside the window
                    out[n, c, h_out, w_out] = np.max(window)
    return out

# --- Experiment ---
print("--- Pooling layer experiment ---")
# 1. Construct the data
# Feature map 1: a 2x2 activation pattern in the top-left corner
feature_map1 = np.array([[
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1]
]], dtype=np.float32).reshape(1, 1, 4, 4)

# Feature map 2: the same activation pattern shifted right by 1 pixel
feature_map2 = np.array([[
    [1, 9, 9, 1],
    [1, 9, 9, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1]
]], dtype=np.float32).reshape(1, 1, 4, 4)

print("Original feature map 1:\n", feature_map1[0, 0])
print("Original feature map 2:\n", feature_map2[0, 0])

# 2. Run the experiment
pool_out1 = max_pool2d(feature_map1, kernel_size=2, stride=2)
pool_out2 = max_pool2d(feature_map2, kernel_size=2, stride=2)

print("\nPooled feature map 1:\n", pool_out1[0, 0])
print("Pooled feature map 2:\n", pool_out2[0, 0])
````
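
As a cross-check on the loop-based version, non-overlapping max pooling can also be written with a reshape. This is an alternative technique, not the implementation the subtask asks for, and it only applies under the stated assumptions (stride equal to the kernel size, and `H` and `W` divisible by it). The function name `max_pool2d_reshape` is introduced here for illustration.

```python
import numpy as np

def max_pool2d_reshape(x, k=2):
    """Non-overlapping k x k max pooling; assumes stride == k and H, W divisible by k."""
    N, C, H, W = x.shape
    assert H % k == 0 and W % k == 0, "H and W must be divisible by k"
    # Split each spatial axis into (block index, offset within block),
    # then take the maximum over the two offset axes
    return x.reshape(N, C, H // k, k, W // k, k).max(axis=(3, 5))

feature_map1 = np.array([
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
], dtype=np.float32).reshape(1, 1, 4, 4)

feature_map2 = np.array([
    [1, 9, 9, 1],
    [1, 9, 9, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
], dtype=np.float32).reshape(1, 1, 4, 4)

pooled1 = max_pool2d_reshape(feature_map1)[0, 0]
pooled2 = max_pool2d_reshape(feature_map2)[0, 0]
print(pooled1)
print(pooled2)
```

For these inputs the first map pools to `[[9, 1], [1, 1]]` and the shifted one to `[[9, 9], [1, 1]]`. The one-pixel shift makes the pattern straddle two pooling windows, so max pooling gives only approximate tolerance to small shifts rather than an identical output.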

### Subtask 4: Full CNN Model & Test

This final block assembles all the previously defined functions into a complete `TinyCNN_for_MNIST` class and runs it on a sample from the MNIST dataset.

````python
import numpy as np
import gzip
import os
import struct
from array import array

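# (The conv2d, relu, max_pool2d, flatten, and linear_layer functions implemented earlier must be defined before this point)
# Fix the random seed so weight initialization is reproducible
np.random.seed(114514)

def softmax(logits):
    """
    Softmax function.
    Subtracts the maximum first to prevent numerical overflow.
    """
    # ===== Implement here =====
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

# --- MNIST dataset reader ---
def read_images(filename):
    """
    Read an MNIST image file.
    Args:
      filename: path to the MNIST image file
    Returns:
      images: list of image arrays
    """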
    with gzip.open(filename, 'rb') as file:
        magic, size, rows, cols = struct.unpack(">IIII", file.read(16))
        if magic != 2051:
            raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
        
        image_data = array("B", file.read())
        
    images = []
    for i in range(size):
        img = np.array(image_data[i * rows * cols:(i + 1) * rows * cols])
        img = img.reshape(rows, cols)
        images.append(img)
    
    return images

# =========================================================
# ===== Task: implement the TinyCNN_for_MNIST class here, following the spec below =====
# =========================================================
#
# --- Model architecture spec ---
# 1. Constructor `__init__(self)`:
#    Fixed architecture: Conv(1->4, k=3, stride=1, pad=1) -> ReLU -> MaxPool(2x2, s=2) -> Flatten -> Linear(->10 classes)
# 2. Forward method `forward(self, x)`:
#    - Takes a tensor x of shape (N, 1, 28, 28).
#    - Calls the operators you implemented in the following order:
#      Conv2d -> ReLU -> MaxPool2d -> Flatten -> Linear -> Softmax
#    - Returns the final logits (Linear layer output) and probs (Softmax layer output).

class TinyCNN_for_MNIST:
    # ===== Implement your class here =====
    def __init__(self):
        # Conv(1->4, k=3, stride=1, pad=1)
        # Input (N, 1, 28, 28), padding=1 -> (N, 1, 30, 30)
        # Convolution (k=3, s=1) -> (N, 4, 28, 28)
        self.conv_w = np.random.randn(4, 1, 3, 3) * 0.1
        self.conv_b = np.zeros(4)

        # MaxPool(2x2, s=2)
        # Input (N, 4, 28, 28) -> (N, 4, 14, 14)

        # Flatten
        # Input (N, 4, 14, 14) -> (N, 4*14*14) = (N, 784)
        
        # Linear(784 -> 10)
        self.fc_w = np.random.randn(10, 784) * 0.1
        self.fc_b = np.zeros(10)

    def forward(self, x):
        # Conv2d -> ReLU
        conv_out = conv2d(x, self.conv_w, self.conv_b, stride=1, padding=1)
        relu_out = relu(conv_out)
        
        # MaxPool2d
        pool_out = max_pool2d(relu_out, kernel_size=2, stride=2)
        
        # Flatten
        flat_out = flatten(pool_out)
        
        # Linear
        logits = linear_layer(flat_out, self.fc_w, self.fc_b)
        
        # Softmax
        probs = softmax(logits)
        
        return logits, probs

# --- Test script ---
if __name__ == "__main__":
    # 1. Set the path to the MNIST test set file
    # !! Change this path to your own file location
    # Assumes 't10k-images-idx3-ubyte.gz' is in the current directory
    mnist_test_file = './t10k-images-idx3-ubyte.gz'

    if not os.path.exists(mnist_test_file):
        print(f"Error: MNIST test set file '{mnist_test_file}' not found")
        print("Download t10k-images-idx3-ubyte.gz from http://yann.lecun.com/exdb/mnist/ and place it in the same directory as this script")
    else:
        # 2. Load all test images
        test_images = read_images(mnist_test_file)
        # 3. Use the first image as the test input
        first_test_image = test_images[0]
        # 4. Preprocess the image
        # Normalize to [-1, 1]
        input_tensor = (first_test_image.astype(np.float32) / 255.0 - 0.5) * 2.0
        # Add batch and channel dimensions -> (1, 1, 28, 28)
        input_tensor = np.expand_dims(input_tensor, axis=(0, 1))

        # 5. Instantiate the model and run the forward pass
        model = TinyCNN_for_MNIST()
        logits, probs = model.forward(input_tensor)

        print("Input Tensor Shape:", input_tensor.shape)
        print("Logits shape:", logits.shape, "Probs shape:", probs.shape)
        np.set_printoptions(precision=8, suppress=True)
        print("\nLogits:", logits[0])
        print("Probs:", probs[0])
        print("\nPredicted class:", np.argmax(probs))
        print("\nChecksum logits sum:", float(np.sum(logits)))
        print("Checksum probs sum:", float(np.sum(probs)))
````
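
The shape comments in the model above can be checked with the standard output-size formula for convolution and pooling layers. The sketch below is a minimal illustration; `conv_output_size` is a hypothetical helper name, and the layer settings (3x3 convolution with padding 1, 2x2 pooling with stride 2, 4 output channels) are taken from the code above.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1

h = 28                                                              # MNIST image side
h_conv = conv_output_size(h, kernel_size=3, stride=1, padding=1)    # 28
h_pool = conv_output_size(h_conv, kernel_size=2, stride=2)          # 14
flat_features = 4 * h_pool * h_pool                                 # 784

print("conv:", h_conv, "pool:", h_pool, "flattened features:", flat_features)
```

The flattened feature count is what determines the input width of the final linear layer.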


### Subtask 1: Concept and Formula Explanation

#### 1. Word Embedding

*   **Definition and Purpose:**
    Word Embedding is a technique in Natural Language Processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers. Its primary purpose is to capture the semantic meaning, context, and syntactic relationships between words in a dense, low-dimensional space. Instead of treating words as isolated atomic symbols, embeddings place words with similar meanings closer to each other in the vector space.

*   **Solving Limitations of Traditional Methods:**
    Traditional methods like **One-Hot Encoding** represent each word as a sparse vector with a single '1' and the rest '0's. This approach has two major limitations:
    1.  **Curse of Dimensionality**: For a large vocabulary (e.g., 50,000 words), each vector is 50,000-dimensional, which is computationally inefficient and requires vast amounts of data to learn from.
    2.  **Lack of Semantic Relationship**: One-hot vectors are orthogonal to each other (their dot product is zero). This means the representation for "king" is no more similar to "queen" than it is to "apple," failing to capture any underlying semantic relationships.
    Word embeddings solve this by representing words in a dense, continuous vector space where semantic similarity corresponds to vector proximity.

*   **Example of a Common Model:**
    A common and foundational word embedding model is **Word2Vec**.
    *   **Characteristics**: Word2Vec is not a single algorithm but a family of models (Skip-gram and CBOW) that learn embeddings from large text corpora.
        *   **Skip-gram**: Predicts the context words (surrounding words) given a target word. It works well even with a small amount of training data and represents rare words well.
        *   **CBOW (Continuous Bag-of-Words)**: Predicts the target word based on its context words. It is faster to train and slightly better for frequent words.
    *   A key feature of Word2Vec is its ability to capture complex analogies, famously demonstrated by the vector arithmetic `vector('king') - vector('man') + vector('woman') ≈ vector('queen')`.
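
To make the contrast between one-hot vectors and dense embeddings concrete, here is a small sketch on a toy four-word vocabulary. The dense vectors are hand-picked illustrative values, not learned embeddings, and `skipgram_pairs` is a hypothetical helper that shows how (target, context) training pairs for Skip-gram could be drawn from a sentence.

```python
import numpy as np

vocab = ["king", "queen", "man", "apple"]

# One-hot encoding: each word is a row of the identity matrix.
one_hot = np.eye(len(vocab))

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0.
sim_onehot_king_queen = cosine(one_hot[0], one_hot[1])
sim_onehot_king_apple = cosine(one_hot[0], one_hot[3])

# Hand-picked 3-dimensional dense vectors (illustrative values, NOT learned).
dense = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9]),
}
sim_dense_king_queen = cosine(dense["king"], dense["queen"])
sim_dense_king_apple = cosine(dense["king"], dense["apple"])

def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs of the kind used to train Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["the", "king", "rules"], window=1)

print("one-hot king~queen:", sim_onehot_king_queen, "king~apple:", sim_onehot_king_apple)
print("dense   king~queen:", round(sim_dense_king_queen, 3), "king~apple:", round(sim_dense_king_apple, 3))
print("skip-gram pairs:", pairs)
```

CBOW would use the same windows in the opposite direction, predicting each target from its surrounding context words.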

#### 2. Multi-Head Self-Attention

*   **Core Idea:**
    The core idea of Multi-Head Self-Attention is to allow a model to jointly attend to information from different representation subspaces at different positions. Instead of performing a single attention function, it projects the queries, keys, and values `h` times (once per head) with different, learned linear projections. Attention is then performed in parallel on each of these projected versions. The head outputs are concatenated and projected once more to produce the final output. This allows each head to specialize and learn different aspects of relationships (e.g., one head might focus on syntactic relationships while another focuses on semantic ones), enriching the model's ability to capture complex dependencies.

*   **Scaled Dot-Product Attention Formula:**
    The formula for Scaled Dot-Product Attention is:
    $$
    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
    $$
    *   `Q` (**Query**): A matrix representing a set of queries. In self-attention, this is a projection of the input sequence. Each query vector represents a word asking for attention from all other words.
    *   `K` (**Key**): A matrix representing a set of keys. This is another projection of the input sequence. Each key vector can be thought of as a "label" for a word, which is matched against the queries.
    *   `V` (**Value**): A matrix representing a set of values. This is a third projection of the input sequence. Each value vector contains the actual information of a word that should be passed on.
    *   `d_k`: The dimension of the key vectors (and query vectors). Dividing by `sqrt(d_k)` is crucial. If the components of a query and a key are independent with zero mean and unit variance, their dot product has zero mean and variance `d_k`, so for large `d_k` the dot products grow large in magnitude and push the softmax into regions where its gradients are extremely small. Scaling brings the variance back to 1 and counteracts this effect, leading to more stable training.
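
The effect of the `sqrt(d_k)` scaling can be checked numerically. The following sketch assumes query and key components drawn independently from a standard normal distribution (an idealization, not a property of trained models); `score_stats` is a hypothetical helper that compares raw and scaled scores for several values of `d_k`.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def score_stats(d_k, n_keys=16, n_trials=500):
    """Compare raw and scaled dot-product scores for random q, k ~ N(0, 1)."""
    q = rng.standard_normal((n_trials, 1, d_k))
    k = rng.standard_normal((n_trials, n_keys, d_k))
    raw = np.matmul(q, k.swapaxes(-2, -1))      # (n_trials, 1, n_keys)
    scaled = raw / np.sqrt(d_k)
    return {
        "raw_var": float(raw.var()),
        "scaled_var": float(scaled.var()),
        # Mean of the largest attention weight; values near 1.0 indicate
        # a saturated softmax that attends to a single key.
        "raw_peak": float(softmax(raw).max(axis=-1).mean()),
        "scaled_peak": float(softmax(scaled).max(axis=-1).mean()),
    }

stats = {d_k: score_stats(d_k) for d_k in (4, 64, 512)}
for d_k, s in stats.items():
    print(d_k, {name: round(v, 3) for name, v in s.items()})
```

Under these assumptions the raw variance should track `d_k` while the scaled variance stays near 1, and the unscaled softmax becomes increasingly one-hot as `d_k` grows.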


### Subtask 2: Multi-Head Self-Attention in NumPy
import numpy as np

# Set the random seed for reproducibility
np.random.seed(114514)

def softmax(x):
    """Numerically stable softmax for a matrix."""
    # Subtract max for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Calculate scaled dot-product attention.
    
    Args:
        Q (np.array): Queries, shape (..., seq_len_q, d_k)
        K (np.array): Keys, shape (..., seq_len_k, d_k)
        V (np.array): Values, shape (..., seq_len_v, d_v)
        mask (np.array, optional): Mask broadcastable to the attention logits; entries equal to 1 are masked out. Defaults to None.
        
    Returns:
        output (np.array): Weighted sum of values.
        attention_weights (np.array): Attention weights.
    """
    # MatMul Q and K.T
    matmul_qk = np.matmul(Q, K.swapaxes(-2, -1))
    
    # Scale
    d_k = K.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(d_k)
    
    # Apply mask (if provided)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
        
    # Softmax to get attention weights
    attention_weights = softmax(scaled_attention_logits)
    
    # MatMul with V
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

def multi_head_attention(embed_size, num_heads, input_seq, mask=None):
    """
    Calculate multi-head attention.
    
    Args:
        embed_size (int): The embedding dimension.
        num_heads (int): The number of attention heads.
        input_seq (np.array): The input sequence, shape (batch_size, seq_len, embed_size).
        mask (np.array, optional): Mask to apply. Defaults to None.
        
    Returns:
        output (np.array): Final attention vector.
        weights (np.array): Attention weights for all heads, shape (batch_size, num_heads, seq_len, seq_len).
    """
    batch_size, seq_len, _ = input_seq.shape
    assert embed_size % num_heads == 0, "embed_size must be divisible by num_heads"
    
    head_dim = embed_size // num_heads
    
    # Initialize weight matrices
    Wq = np.random.randn(embed_size, embed_size)
    Wk = np.random.randn(embed_size, embed_size)
    Wv = np.random.randn(embed_size, embed_size)
    Wo = np.random.randn(embed_size, embed_size)
    
    # 1. Linear projections
    Q = np.matmul(input_seq, Wq)
    K = np.matmul(input_seq, Wk)
    V = np.matmul(input_seq, Wv)
    
    # 2. Reshape and transpose for multi-head
    # (batch_size, seq_len, embed_size) -> (batch_size, seq_len, num_heads, head_dim) -> (batch_size, num_heads, seq_len, head_dim)
    Q = Q.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)
    K = K.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)
    V = V.reshape(batch_size, seq_len, num_heads, head_dim).swapaxes(1, 2)
    
    # 3. Apply scaled dot-product attention
    attention_output, attention_weights = scaled_dot_product_attention(Q, K, V, mask)
    
    # 4. Concatenate heads
    # (batch_size, num_heads, seq_len, head_dim) -> (batch_size, seq_len, num_heads, head_dim) -> (batch_size, seq_len, embed_size)
    attention_output = attention_output.swapaxes(1, 2).reshape(batch_size, seq_len, embed_size)
    
    # 5. Final linear projection
    output = np.matmul(attention_output, Wo)
    
    return output, attention_weights

if __name__ == "__main__":
    batch_size = 10
    seq_len = 20
    embed_size = 128
    num_heads = 8
    
    input_data = np.random.randn(batch_size, seq_len, embed_size) 
    output, weights = multi_head_attention(embed_size, num_heads, input_data)
    
    print("Output shape:", output.shape)
    print("Attention weights shape:", weights.shape)
    
    print("\nSample output vector (first 10 values of [0, 0]):\n", output[0, 0, :10])
    # The weights shape is (batch_size, num_heads, seq_len, seq_len)
    # We print the first head's attention for the first word in the first batch item
    print("\nSample attention weights (first 10 values of [0, 0, 0]):\n", weights[0, 0, 0, :10])
    
    
    
    
    
### Subtask 3: Multi-Head Self-Attention in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention implemented with PyTorch."""
    def __init__(self, embed_size, num_heads, dropout=0.1):
        super().__init__()
        assert embed_size % num_heads == 0, "embed_size must be divisible by num_heads"
        
        self.embed_size = embed_size
        self.num_heads = num_heads
        self.head_dim = embed_size // num_heads
        
        # Linear layers for Q, K, V projections
        self.wq = nn.Linear(embed_size, embed_size)
        self.wk = nn.Linear(embed_size, embed_size)
        self.wv = nn.Linear(embed_size, embed_size)
        
        # Final output linear layer
        self.fc_out = nn.Linear(embed_size, embed_size)
        
        self.dropout = nn.Dropout(dropout)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V shape: (batch_size, num_heads, seq_len, head_dim)
        d_k = K.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
            
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        
        # Apply linear layers
        Q = self.wq(query)
        K = self.wk(key)
        V = self.wv(value)
        
        # Reshape for multi-head attention
        # (batch_size, seq_len, embed_size) -> (batch_size, num_heads, seq_len, head_dim)
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Calculate attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        # (batch_size, num_heads, seq_len, head_dim) -> (batch_size, seq_len, embed_size)
        attention_output = attention_output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_size)
        
        # Final linear layer
        output = self.fc_out(attention_output)
        
        return output, attention_weights

# Test code
if __name__ == "__main__":
    # Set seed for reproducibility
    torch.manual_seed(114514)
    
    # Construct test input (same shape as task 2)
    batch_size = 10
    seq_len = 20
    embed_size = 128
    num_heads = 8
    
    input_tensor = torch.randn(batch_size, seq_len, embed_size)
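    model = MultiHeadAttention(embed_size, num_heads)
    # Switch to eval mode so dropout is disabled and each row of attention weights sums to 1
    model.eval()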

    # Perform self-attention (query=key=value)
    output, attn_weights = model(input_tensor, input_tensor, input_tensor)

    print("Output shape:", output.shape)
    print("Attention weights shape:", attn_weights.shape)
    
    # Detach from graph for printing
    output_detached = output.detach()
    attn_weights_detached = attn_weights.detach()
    
    print("\nSample output vector (first 10 values of [0, 0]):\n", output_detached[0, 0, :10])
    print("\nSample attention weights (first 10 values of [0, 0, 0]):\n", attn_weights_detached[0, 0, 0, :10])

### Subtask 4: Attention in Vision

#### Image Serialization in ViT

The Vision Transformer (ViT) converts a 2D image into a 1D sequence of vectors through a simple yet effective process:
1.  **Image Patching**: The input image (e.g., 224x224 pixels) is first split into a grid of smaller, non-overlapping square patches (e.g., 16x16 pixels). This results in a sequence of `(224/16) * (224/16) = 14 * 14 = 196` patches.
2.  **Flatten and Project**: Each of these 2D patches is then flattened into a 1D vector (e.g., `16 * 16` pixels times 3 color channels = 768 values). This raw vector is then passed through a trainable linear projection (a single fully connected layer) to produce a patch embedding of the desired model dimension (e.g., 768, which happens to equal the flattened patch length in this example but need not).
This procedure effectively transforms the 2D image into a 1D sequence of token embeddings, analogous to a sequence of word embeddings in NLP, making it suitable for a standard Transformer encoder.
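
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the ViT reference implementation: `patchify` is a hypothetical helper, the image is random data standing in for a real RGB input, and the projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (left to right, top to bottom).
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    gh, gw = H // patch_size, W // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a real RGB image
patches = patchify(image, patch_size=16)     # (196, 768)

# Linear projection to the model dimension (random weights, not learned ones).
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
b_proj = np.zeros(d_model)
tokens = patches @ W_proj + b_proj           # (196, 768)

print("patches:", patches.shape, "tokens:", tokens.shape)
```

In practice the same operation is often written as a single convolution whose kernel size and stride both equal the patch size, which is mathematically equivalent to flattening each patch and applying one shared linear layer.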

#### Encoding Spatial Position Information

Self-attention on its own is permutation-equivariant: reordering the input tokens merely reorders the outputs in the same way, so a Transformer without positional information effectively treats its input as an unordered set. This is problematic for images, where the spatial arrangement of patches is critical. ViT addresses this by explicitly adding spatial information:
*   **Learnable Positional Embeddings**: Before the patch embeddings are fed into the Transformer encoder, a set of **learnable 1D positional embeddings** is added to them, one unique vector for each position in the sequence. These positional embeddings are initialized randomly and are learned jointly with the rest of the model during training. By adding these vectors, the model can learn to interpret the relative and absolute positions of the patches in the original image.
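
A small experiment illustrates why the positional embeddings matter. The sketch below uses a single-head self-attention layer with random weights (stand-ins for learned parameters) and random "patch embeddings"; all names are hypothetical. It checks that, without positions, shuffling the patches only shuffles the outputs, whereas with positional embeddings added the shuffled input yields different representations.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Fixed random projections stand in for learned weights.
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.3 for _ in range(3))

def self_attention(x):
    """Single-head self-attention over a (seq_len, d_model) input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

patch_emb = rng.standard_normal((seq_len, d_model))        # patch embeddings
pos_emb = rng.standard_normal((seq_len, d_model)) * 0.5    # one vector per position

perm = np.array([1, 2, 3, 4, 5, 0])   # shuffle the patch order

# Without positions: shuffling the input only shuffles the output rows.
out_plain = self_attention(patch_emb)
out_plain_shuffled = self_attention(patch_emb[perm])
order_blind = bool(np.allclose(out_plain_shuffled, out_plain[perm]))

# With positions: slot i always receives pos_emb[i], so the shuffled image
# produces different token representations, not just reordered ones.
out_pos = self_attention(patch_emb + pos_emb)
out_pos_shuffled = self_attention(patch_emb[perm] + pos_emb)
order_aware = not np.allclose(out_pos_shuffled, out_pos[perm])

print("order-blind without positions:", order_blind)
print("order-aware with positions:", order_aware)
```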

#### Core Advantage Over CNNs

The most significant advantage of ViT's self-attention mechanism over a traditional CNN is its ability to model **long-range, global dependencies** from the very first layer.
*   **CNNs (Local Receptive Fields)**: A convolutional kernel operates on a small, local neighborhood of pixels. To capture global context, a CNN must stack many layers, gradually increasing the receptive field size. Information from distant parts of the image can only be integrated in the deeper layers of the network.
*   **ViT (Global Receptive Fields)**: In self-attention, every patch embedding (query) directly interacts with every other patch embedding (keys/values) in the sequence. This means that from the very first layer, the model can weigh and integrate information from across the entire image to compute the representation for a single patch. This global receptive field allows ViT to capture relationships between distant features more effectively, which is particularly powerful for tasks requiring an understanding of the overall image structure, especially when trained on very large datasets.
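
The receptive-field argument can be made quantitative with a short calculation. The sketch below assumes a plain stack of identical convolutions with no pooling or dilation, which is a simplification; `receptive_field` and `layers_to_cover` are hypothetical helper names.

```python
import math

def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in input pixels) of a stack of identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump   # each layer widens the field
        jump *= stride                   # spacing between adjacent outputs
    return rf

def layers_to_cover(image_size, kernel_size=3):
    """Smallest number of stride-1 conv layers whose field spans the image."""
    return math.ceil((image_size - 1) / (kernel_size - 1))

rf_1 = receptive_field(1)
rf_10 = receptive_field(10)
needed = layers_to_cover(224)

print("1 conv layer sees  :", rf_1, "pixels across")
print("10 conv layers see :", rf_10, "pixels across")
print("stride-1 3x3 layers needed to span 224 pixels:", needed)
```

Real CNNs shorten this path with strides and pooling, which make the field grow much faster, but the growth is still gradual with depth. A single self-attention layer, by contrast, connects all 196 patches of a 224x224 image directly.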