# Vision Transformer for Classification Example
## Chapter 3 Module 3

Now that we are acquainted with Transformers and FiftyOne, we can take our first pass at using a Transformer on a dataset. We are going to jump right in with a sample of [ImageNet](https://www.image-net.org/) and the [ViT](https://huggingface.co/google/vit-base-patch16-224) model.

## Loading a Dataset

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "imagenet-sample",
    dataset_name="Imagenet-Sample",
    max_samples=50,
    shuffle=True,
    overwrite=True,
)
session = fo.launch_app(dataset)

## Running Inference with the Help of FiftyOne

You may remember from the last chapter ...

In [None]:
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224"
)
dataset.apply_model(model, label_field="ViT_predictions")



session.show()

## Running Inference Manually

We will again use a FiftyOne dataset to store our model predictions and visualize the results, but this time we will run the inference ourselves!

Remember, the pipeline almost always looks like this:

1. Load your image
2. Preprocess your image
3. Run inference on your image
4. Decode the prediction

Let's start with just a single image from our dataset.

In [None]:
# Grab just the first image
sample = dataset.first()
print(sample)

# Get the image file path
filepath = sample.filepath
print(filepath)

In [None]:
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load the matching processor
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Load a pretrained ViT model
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# 1. Load our image for inference
# Force RGB in case the file is grayscale or has an alpha channel
image = Image.open(filepath).convert("RGB")

# 2. Preprocess
inputs = processor(images=image, return_tensors="pt")

# 3. Run inference (gradients are not needed for prediction)
with torch.no_grad():
    outputs = model(**inputs)

# 4. Decode prediction: take the highest-scoring class and look up its name
predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]

print(f"Predicted class: {predicted_label}")
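
The decode step above keeps only the single best class. If you also want to see how confident the model is, the logits can be turned into probabilities with a softmax and ranked. The cell below is a minimal sketch of that idea in plain Python; `top_k_predictions` is a hypothetical helper (not part of Transformers or FiftyOne), and the toy logits and labels stand in for the real model outputs.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs from raw logits."""
    # Softmax, subtracting the max logit first for numerical stability
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins (assumptions) for outputs.logits[0].tolist() and model.config.id2label
toy_logits = [2.0, 0.5, 1.0]
toy_id2label = {0: "tabby", 1: "beagle", 2: "goldfish"}

for label, prob in top_k_predictions(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, you would likely pass `outputs.logits[0].tolist()` and `model.config.id2label` instead of the toy values. The top probability could then be stored alongside the label through the `confidence` argument of `fo.Classification`.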

Now that we have that label, let's add it to our FiftyOne dataset!

In [None]:
sample["Manual_ViT_Predictions"] = fo.Classification(label=predicted_label)
sample.save()
print(sample.field_names)

Now that it is saved to our dataset, we can view it in the App and compare it to the ground truth!

In [None]:
session.show()

Now let's predict on the entire dataset!

In [None]:
import torch

for sample in dataset:
    # 1. Load our image for inference
    # Force RGB in case the file is grayscale or has an alpha channel
    image = Image.open(sample.filepath).convert("RGB")

    # 2. Preprocess
    inputs = processor(images=image, return_tensors="pt")

    # 3. Run inference (gradients are not needed for prediction)
    with torch.no_grad():
        outputs = model(**inputs)

    # 4. Decode prediction
    predicted_class_idx = outputs.logits.argmax(-1).item()
    predicted_label = model.config.id2label[predicted_class_idx]

    # 5. Save the prediction to the sample
    sample["Manual_ViT_Predictions"] = fo.Classification(label=predicted_label)
    sample.save()
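
The loop above sends one image through the model per forward pass. Since the processor also accepts a list of images, a common variation is to group file paths into small batches first. The cell below is only a sketch of the grouping step; `batched` is a hypothetical helper written for this example, and the file names are placeholders for the paths you could get from `dataset.values("filepath")`.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder file paths (assumption) standing in for dataset.values("filepath")
filepaths = [f"img_{i}.jpg" for i in range(5)]

for batch in batched(filepaths, batch_size=2):
    print(batch)
    # With the real objects, each batch could be processed roughly like this:
    #   images = [Image.open(p).convert("RGB") for p in batch]
    #   inputs = processor(images=images, return_tensors="pt")
    #   indices = model(**inputs).logits.argmax(-1).tolist()
    #   labels = [model.config.id2label[i] for i in indices]
```

Batching mostly pays off on a GPU or with larger datasets; for a 50-image sample the one-at-a-time loop is perfectly fine.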

Ta-da! We have now run inference over the entire dataset! There are many ways to load an image for inference:

In [None]:
import cv2
import requests
from io import BytesIO

# Load from a file
image = Image.open("path/to/your/image.jpg").convert("RGB")

# Load from a URL
url = "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the download failed
image = Image.open(BytesIO(response.content)).convert("RGB")

# Open the default webcam (camera 0)
cap = cv2.VideoCapture(0)

# Grab a single frame, then release the camera
ret, frame = cap.read()
cap.release()

if ret:
    # OpenCV loads in BGR — convert to RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    image = Image.fromarray(frame_rgb)

    # Show the captured image
    image.show()
else:
    print("Failed to capture image from camera.")