## **Vision Transformers (ViT)**

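Vision Transformers (ViT) apply the transformer architecture, originally designed for NLP, to computer vision. A ViT splits an image into fixed-size patches (16×16 pixels in the checkpoint used below), embeds each patch as a token, and processes the resulting sequence with a standard transformer encoder. When pre-trained on large datasets, ViT models are competitive with strong convolutional networks on image classification, and ViT backbones are also used for object detection and segmentation.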
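The patch-splitting step can be illustrated with a minimal NumPy sketch. It assumes a 224×224 RGB input and 16×16 patches, matching the `patch16-224` checkpoint name; `split_into_patches` is a hypothetical helper written for illustration, not part of the `transformers` library, and it leaves out the learned linear projection and position embeddings that the real model applies afterwards.

```python
import numpy as np


def split_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (rows, patch, cols, patch, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Group the two grid axes together: (rows, cols, patch, patch, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch_size * patch_size * C values.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Illustrative all-zero image standing in for a real photo.
dummy_image = np.zeros((224, 224, 3), dtype=np.float32)
patch_sequence = split_into_patches(dummy_image, patch_size=16)
print(patch_sequence.shape)  # (196, 768): 14 * 14 patches, 16 * 16 * 3 values each
```

Each of the 196 rows plays the role of one token in the sequence fed to the encoder.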

**Imports**

In [3]:
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

**Load Pre-trained ViT Model and Image Processor**

In [None]:
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
# ViTImageProcessor replaces the deprecated ViTFeatureExtractor.
# The variable keeps its original name so the later cells run unchanged.
feature_extractor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")


**Prepare Input Image**

In [None]:
image_path = "sample_image.jpg"  # Replace with your image path
image = Image.open(image_path).convert("RGB")
inputs = feature_extractor(images=image, return_tensors="pt")

**Perform Inference**

In [None]:
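with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)
predicted_class = logits.argmax(-1).item()  # index of the highest-scoring class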

**Print Predicted Class**

In [None]:
labels = model.config.id2label  # Map class indices to labels
print(f"Predicted Class: {labels[predicted_class]}")
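A single argmax hides how confident the model is. The sketch below shows one way to turn logits into the top-k labels with softmax probabilities. `top_k_predictions` is a hypothetical helper, and the logits and label map are made-up stand-ins so the snippet runs without downloading a checkpoint.

```python
import torch


def top_k_predictions(logits: torch.Tensor, id2label: dict, k: int = 5):
    """Return the k most likely (label, probability) pairs for one image."""
    probs = torch.softmax(logits, dim=-1)[0]  # drop the batch dimension
    top_probs, top_ids = torch.topk(probs, k)
    return [
        (id2label[class_id], prob)
        for class_id, prob in zip(top_ids.tolist(), top_probs.tolist())
    ]


# Made-up values for illustration only; not real model output.
dummy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
dummy_labels = {0: "cat", 1: "dog", 2: "car", 3: "tree"}
for label, prob in top_k_predictions(dummy_logits, dummy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the notebook's own variables, the call would be `top_k_predictions(logits, model.config.id2label)`.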