# Task VIII QML-HEP: Vision Transformer / Quantum Vision Transformer
<br>

Implement a classical Vision transformer and apply it to MNIST. Show its performance on the test data. Comment on potential ideas to extend this classical vision transformer architecture to a quantum vision transformer and sketch out the architecture in detail.



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from transformers import ViTModel, ViTConfig

In [None]:
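# Load MNIST, upscaled from 28x28 to 32x32 so the image splits evenly into ViT patches
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),                # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize((0.5,), (0.5,))  # rescale to [-1, 1]
])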

train_dataset = torchvision.datasets.MNIST(root="./data", train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.MNIST(root="./data", train=False, transform=transform, download=True)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)


In [None]:
# Define Vision Transformer model
class ViTClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Randomly initialised ViT backbone (no pretrained weights), sized for 32x32 inputs
        self.vit = ViTModel(ViTConfig(image_size=32, num_labels=num_classes))
        # Linear head mapping the [CLS] embedding to class logits
        self.classifier = nn.Linear(self.vit.config.hidden_size, num_classes)

    def forward(self, x):
        # Token 0 of the last hidden state is the [CLS] token
        cls_embedding = self.vit(pixel_values=x).last_hidden_state[:, 0, :]
        return self.classifier(cls_embedding)
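
With `image_size=32` and no `patch_size` override, the backbone sees very few tokens per image. The arithmetic below is a minimal sketch of that count. It assumes the `ViTConfig` default `patch_size` of 16 (the notebook does not set it explicitly) and the `[CLS]` token that `ViTModel` prepends, which is the token that `last_hidden_state[:, 0, :]` selects.

```python
# Assumed values: 32 comes from ViTConfig(image_size=32) in ViTClassifier;
# 16 is the library default patch_size, which the notebook does not override.
image_size = 32
patch_size = 16
num_channels = 3  # the grayscale image is replicated to 3 channels before the model

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
seq_len = num_patches + 1  # +1 for the [CLS] token at index 0
pixels_per_patch = num_channels * patch_size * patch_size  # flattened patch length

print(f"patches: {num_patches}, sequence length: {seq_len}, "
      f"values per patch: {pixels_per_patch}")
```

A smaller `patch_size` (for example 4 or 8) would give the attention layers more patches to relate to each other, at a higher compute cost.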


In [None]:
# Initialize model, loss function, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ViTClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)


In [3]:

# Training model
epochs = 5
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.repeat(1, 3, 1, 1)  # Replicate the grayscale channel 3x to match ViT's 3-channel input
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")

# Evaluation
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        images = images.repeat(1, 3, 1, 1)
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")



Epoch 1, Loss: 0.2887
Epoch 2, Loss: 0.1393
Epoch 3, Loss: 0.1214
Epoch 4, Loss: 0.1044
Epoch 5, Loss: 0.1002
Test Accuracy: 97.12%


**Comment on potential ideas to extend this classical vision transformer architecture to a quantum vision transformer and sketch out the architecture in detail.**
<br>
A Quantum Vision Transformer (QViT) aims to improve on the classical model by exploiting quantum computing principles such as superposition, entanglement, and quantum parallelism. In my experience classical transformers are slow and expensive to train, and a QViT is one possible way to reduce that cost.
<br>
The potential benefits of a QViT include **speedups on some subroutines (exponential in the best case), more efficient learning, richer feature representations, better generalization, and memory efficiency**.
<br>
<br>
**Sketch of QViT architecture:** <br>
* Quantum Data Encoding: Translating Images into Quantum States. Quantum computers don’t process regular pixel values directly. They operate on quantum states, so the image must first be converted into a form that a quantum processor can handle. Instead of storing pixel values as numbers, we encode them into the amplitudes of a quantum state or into the rotation angles of qubits:
<br>Amplitude Encoding: Stores pixel values as the amplitudes of a quantum state, so n qubits hold 2^n values (qubit-efficient, but the state is hard to prepare).<br>
Angle Encoding: Maps each pixel value to the rotation angle of a qubit (simpler and more hardware-friendly, but it needs one qubit per value).
<br>
* Quantum Patch Embeddings: Converting Image Patches into Quantum Circuits. In classical ViTs, an image is divided into small square patches, which are then flattened and projected into a feature space. In a QViT, each patch is instead encoded into qubits and passed through a quantum circuit that takes over the role of the projection.
<br>
Benefit: Rather than learning explicit filters as in CNNs, the quantum operations mix and entangle features across qubits, which may make feature extraction more efficient.
<br>
* Quantum Self-Attention Mechanism: Understanding Patch Relationships. Self-attention is the heart of transformers: it determines how important each patch is relative to the others. Instead of classical matrix multiplications, a QViT uses quantum operations to compute attention, with Parameterized Quantum Circuits (PQCs) processing information across all patches in parallel.
<br>
Benefit: This could enable faster and more memory-efficient attention, because relationships between patches are held in the quantum state itself rather than in an explicitly computed attention matrix.
<br>
* Hybrid Quantum-Classical MLP Head: Making the Final Decision. Since quantum computers are still limited in size, we use a hybrid approach in which the early layers (feature extraction and attention) are quantum-based and the final classification is done by a classical neural network or a simple quantum classifier.
<br>
Benefit: This combines the strengths of both: quantum efficiency for feature extraction and classical robustness for decision-making.
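
The encoding and attention steps in the sketch above can be made concrete with a small state-vector simulation. This is a minimal illustration in plain Python under stated assumptions, not a definitive QViT. The function names (`angle_encode`, `amplitude_encode`, `fidelity`, `attention_weights`) are hypothetical, pixel values are assumed to lie in [0, 1], and using the state fidelity between two encoded patches (a quantity a swap test could estimate on hardware) as the attention score is one possible choice rather than a method taken from this notebook.

```python
import math


def angle_encode(pixels):
    """Angle encoding: one qubit per pixel, each prepared as RY(pi * x)|0>."""
    state = [1.0]  # amplitude vector of the growing product state
    for x in pixels:
        theta = math.pi * x  # pixel in [0, 1] -> rotation angle in [0, pi]
        qubit = [math.cos(theta / 2), math.sin(theta / 2)]
        # Kronecker product appends this qubit to the register
        state = [a * b for a in state for b in qubit]
    return state


def amplitude_encode(pixels):
    """Amplitude encoding: pixel values become amplitudes of a unit-norm state."""
    norm = math.sqrt(sum(x * x for x in pixels))
    if norm == 0:
        raise ValueError("cannot amplitude-encode an all-zero patch")
    return [x / norm for x in pixels]


def fidelity(psi, phi):
    """Squared overlap |<psi|phi>|^2 for real amplitude vectors."""
    overlap = sum(a * b for a, b in zip(psi, phi))
    return overlap * overlap


def attention_weights(patches, encode=angle_encode):
    """Row-wise softmax over pairwise fidelities between encoded patches."""
    states = [encode(p) for p in patches]
    weights = []
    for query in states:
        scores = [fidelity(query, key) for key in states]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three toy "patches" of four pixels each (illustrative values only)
patches = [
    [0.0, 0.5, 1.0, 0.25],
    [0.1, 0.5, 0.9, 0.25],
    [1.0, 0.0, 0.0, 1.0],
]
weights = attention_weights(patches)
for row in weights:
    print([round(w, 3) for w in row])
```

Each four-pixel patch here needs four qubits under angle encoding (16 amplitudes) but only two under amplitude encoding. Passing `encode=amplitude_encode` switches the scheme.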

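**Why does the approach used in this task work well?**
* Efficient image processing: 32x32 inputs keep the token sequence short
* Grayscale input adapted for ViT by replicating the single channel three times
* A standard Vision Transformer (ViT) backbone from the `transformers` library
* Simple training setup: Adam with a learning rate of 1e-4 and cross-entropy loss
* Fast training with good accuracy: 97.12% on the test set after 5 epochs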