# Workshop: Object Detection with Hugging Face, Ultralytics YOLOv8, and Lightning

In this workshop we will explore three approaches for object detection using PyTorch:

1. **Inference with Hugging Face DETR:** Using a pre-trained DETR model via a pipeline.
2. **Inference & Training with Ultralytics YOLOv8:** Running inference and training a YOLO model on your own dataset (requires a YOLO-formatted dataset and a `data.yaml` configuration file).
3. **Training Faster‑R‑CNN with PyTorch Lightning:** Wrapping a TorchVision Faster‑R‑CNN model in a LightningModule and training it on the PennFudanPed dataset, with data augmentation via TorchVision Transforms v2.

Follow along for hands-on experience!

## Installation

Install the required packages by running the cell below.

In [None]:
# !pip install -U transformers pillow matplotlib ultralytics timm lightning gdown

## Organize Imports

In [None]:
import os
import gc
import sys

In [None]:
import requests, zipfile, io
from pathlib import Path

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
import glob
from PIL import Image
from IPython.display import display

In [None]:
import torch
from torch.utils.data import DataLoader, random_split
import torch.optim as optim

In [None]:
import torchvision
# Use TorchVision Transforms v2 for data augmentation
from torchvision.transforms import v2 as T2
from torchvision.transforms import v2 as transforms
from torchvision.utils import draw_bounding_boxes
import torchvision.transforms.functional as F
from torchvision import models, datasets, ops, utils

In [None]:
from transformers import pipeline

In [None]:
import lightning as pl
import lightning as L

In [None]:
from ultralytics import YOLO

## Initialize Device

In [None]:
def init_device():
    # For the most part I'll try to import functions and classes near
    # where they are used
    # to make it clear where they come from.
    if torch.backends.mps.is_available():
        device = 'mps'
    else:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

    print(f'Device: {device}')

    return device

In [None]:
device = init_device()
device

## Initialize Paths

In [None]:
PATH = Path('data')
DATA = PATH
africanw = PATH / 'africanw' / 'african-wildlife.yaml'
pennfped = PATH / 'PennFudanPed' / 'PennFudanPed'
models_path = PATH / 'models'
models_path.mkdir(exist_ok=True, parents=True)

In [None]:
pennfped.parent

In [None]:
! ls {pennfped}

In [None]:
africanw

## Download Datasets

In [None]:
africanw.parent.mkdir(exist_ok=True, parents=True)

url = "https://raw.githubusercontent.com/ultralytics/ultralytics/refs/heads/main/ultralytics/cfg/datasets/african-wildlife.yaml"

if africanw.exists():
    print(f'File {africanw} exists')
else:
    response = requests.get(url)
    
    if response.status_code == 200:
        with africanw.open(mode="w") as f:
            f.write(response.text)
        print("File downloaded successfully as 'african-wildlife.yaml'!")
    else:
        print("Failed to download file. Status code:", response.status_code)

In [None]:
# URL for the PennFudanPed dataset zip file
url = "https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip"

if pennfped.exists():
    print('Data folder exists')
else:
    print("Downloading PennFudanPed dataset...")
    r = requests.get(url)
    if r.status_code == 200:
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(pennfped.parent)
        print(f"Downloaded and extracted PennFudanPed dataset to '{pennfped.parent}'")
    else:
        print("Download failed with status code:", r.status_code)

## Part 1: Inference with Hugging Face DETR

In this section we load a pre-trained DETR model via Hugging Face’s pipeline and run inference on a sample image.

In [None]:
# Download a sample image
url = "https://ultralytics.com/images/bus.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Initialize the Hugging Face object detection pipeline (using DETR)
detr_detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Run inference
results = detr_detector(image)

print("DETR Inference Results:")
for r in results:
    print(r)

display(image)
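Each entry in `results` is a dictionary with a `score`, a `label`, and a `box` holding `xmin`, `ymin`, `xmax`, `ymax` pixel coordinates. As a minimal sketch of post-processing that output, the helper below keeps only confident detections and sorts them by score. The function name `filter_detections`, the 0.9 threshold, and the `sample_results` values are illustrative assumptions, not real model output.

```python
def filter_detections(results, min_score=0.9):
    """Keep detections at or above min_score, as (label, score, box) tuples."""
    kept = []
    for det in results:
        if det["score"] >= min_score:
            box = det["box"]
            kept.append((det["label"], det["score"],
                         (box["xmin"], box["ymin"], box["xmax"], box["ymax"])))
    # Highest-confidence detections first
    return sorted(kept, key=lambda d: d[1], reverse=True)


# Hypothetical values shaped like the pipeline output (not real predictions)
sample_results = [
    {"score": 0.99, "label": "bus", "box": {"xmin": 20, "ymin": 230, "xmax": 800, "ymax": 740}},
    {"score": 0.42, "label": "person", "box": {"xmin": 50, "ymin": 400, "xmax": 240, "ymax": 900}},
    {"score": 0.95, "label": "person", "box": {"xmin": 670, "ymin": 390, "xmax": 810, "ymax": 880}},
]

for label, score, box in filter_detections(sample_results):
    print(f"{label:>8s}  {score:.2f}  {box}")
```

In the notebook you would pass the real `results` list instead of `sample_results`.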

## Part 2: Inference with Ultralytics YOLOv8

Next, we use Ultralytics YOLOv8 to run inference on the same sample image.

In [None]:
# Load the pre-trained YOLOv8 nano model
yolo_model = YOLO("yolov8n.pt")

# Run inference on the sample image
results_yolo = yolo_model("https://ultralytics.com/images/bus.jpg")

# Print YOLOv8 results
# print(results_yolo)

# Plot the image with predictions
plt.figure(figsize=(10, 10))
plt.imshow(results_yolo[0].plot())
plt.axis('off')
plt.title('Ultralytics YOLOv8 Inference')
plt.show()

## Part 3: Training with Ultralytics YOLOv8

To train a YOLO model using Ultralytics, you need a dataset in YOLO format along with a YAML configuration file (e.g., `data/my_dataset/data.yaml`).

For example, your `data.yaml` might look like:

```yaml
train: data/my_dataset/images/train
val: data/my_dataset/images/val
nc: 2
names: ['class1', 'class2']
```

Make sure the YAML file exists at the specified path. The cells below train for 5 epochs using the `african-wildlife.yaml` configuration downloaded earlier.
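In the YOLO detection format, every image has a matching `.txt` label file in which each row reads `class x_center y_center width height`, with all four coordinates normalized to the range 0 to 1. The sketch below converts a pixel-space `(xmin, ymin, xmax, ymax)` box into such a row. The helper name `xyxy_to_yolo_line` and the example numbers are assumptions made for illustration.

```python
def xyxy_to_yolo_line(class_id, box, img_w, img_h):
    """Format one pixel-space (xmin, ymin, xmax, ymax) box as a YOLO label row."""
    xmin, ymin, xmax, ymax = box
    x_c = (xmin + xmax) / 2 / img_w   # normalized box center, x
    y_c = (ymin + ymax) / 2 / img_h   # normalized box center, y
    w = (xmax - xmin) / img_w         # normalized box width
    h = (ymax - ymin) / img_h         # normalized box height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


line = xyxy_to_yolo_line(0, (160, 120, 480, 360), img_w=640, img_h=480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```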

In [None]:
# Initialize the YOLOv8 nano model with pre-trained weights
yolov8n = YOLO('yolov8n.pt')

In [None]:
yolov8n = yolov8n.to(device)

In [None]:
yolov8n.train(
    data=str(africanw),  # pass the YAML path as a plain string
    epochs=5, 
    imgsz=640,
    device=device,
    workers=8
)

In [None]:
yolov8n.save(models_path / 'yolov8n_afrwld.pt')

In [None]:
yolov8n.export(format='onnx')

## Part 4: Training Faster‑R‑CNN with PyTorch Lightning and TorchVision Transforms v2

In this section we train a Faster‑R‑CNN model on the PennFudanPed dataset using PyTorch Lightning, with a data augmentation pipeline built on TorchVision Transforms v2. Make sure the PennFudanPed dataset has been downloaded and extracted; the download cell above places it in `data/PennFudanPed/PennFudanPed`.

The data augmentation pipeline includes random resized cropping, horizontal flipping, and color jitter. These augmentations help improve model robustness.
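Geometric augmentations such as crops and flips change where objects sit in the image, so the bounding boxes have to be transformed along with the pixels. As a minimal illustration, a horizontal flip of an image of width `W` maps every x coordinate to `W - x`. The helper `hflip_box` below is a hypothetical name used only for this sketch; TorchVision Transforms v2 applies the equivalent update itself when boxes are wrapped as `tv_tensors.BoundingBoxes` and passed together with the image.

```python
def hflip_box(box, img_w):
    """Mirror an (xmin, ymin, xmax, ymax) box across the vertical center line."""
    xmin, ymin, xmax, ymax = box
    # The old right edge becomes the new left edge, and vice versa
    return (img_w - xmax, ymin, img_w - xmin, ymax)


box = (10, 20, 110, 220)
flipped = hflip_box(box, img_w=300)
print(flipped)  # (190, 20, 290, 220)
```

Flipping twice returns the original box, which is a quick sanity check for this kind of helper.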

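#### Train Face Detectors

The next cells first fine-tune a Faster R-CNN face detector on the WIDERFace dataset; the PennFudanPed pipeline follows afterwards.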

In [None]:
gc.collect()

#### Initialize Face Dataset

In [None]:
class FacesData(L.LightningDataModule):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        transforms.Resize(size=(800,), max_size=1333),
    ])

    @staticmethod
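    def convert_inputs(imgs, annots, device, small_thr=0.001):
        """Convert dataset items to the target structure accepted by the model."""
        # Imported here to keep the dependency next to where it is used
        from torchvision import tv_tensors

        images, targets = [], []
        for img, annot in zip(imgs, annots):
            bbox = annot['bbox'].float()
            # Drop boxes covering no more than `small_thr` of the image area (PIL size is (W, H))
            small = (bbox[:, 2] * bbox[:, 3]) <= (img.size[1] * img.size[0] * small_thr)
            boxes = ops.box_convert(bbox[~small], in_fmt='xywh', out_fmt='xyxy')
            # Wrap as BoundingBoxes so the v2 Resize rescales the boxes together with the image
            boxes = tv_tensors.BoundingBoxes(
                boxes, format='XYXY', canvas_size=(img.size[1], img.size[0])
            )
            output_dict = FacesData.transform({"image": img, "boxes": boxes})
            images.append(output_dict['image'].to(device))
            targets.append({
                'boxes': output_dict['boxes'].to(device),
                'labels': torch.ones(len(boxes), dtype=torch.int64, device=device)
            })
        return images, targets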
    
    @staticmethod
    def _collate_fn(batch):
        """Define a collate function to handle batches."""
        return tuple(zip(*batch))

    def train_dataloader(self):
        # Load the WIDERFace training split via torchvision.datasets
        train_dataset = datasets.WIDERFace(root=DATA, split='train', download=True)
        return DataLoader(
            train_dataset, batch_size=8, shuffle=True, num_workers=4, collate_fn=self._collate_fn
        )

    def val_dataloader(self):
        # Load the WIDERFace validation split; evaluation does not need shuffling
        val_dataset = datasets.WIDERFace(root=DATA, split='val', download=True)
        return DataLoader(
            val_dataset, batch_size=8, shuffle=False, num_workers=4, collate_fn=self._collate_fn
        )

    def test_dataloader(self):
        # The WIDERFace 'test' split has no annotations in torchvision, so this loader reads 'val'
        test_dataset = datasets.WIDERFace(root=DATA, split='val', download=True)
        return DataLoader(
            test_dataset, batch_size=8, shuffle=False, num_workers=4, collate_fn=self._collate_fn
        )
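The `convert_inputs` helper above does two things to the WIDERFace annotations: it discards faces whose area is at most `small_thr` times the image area, and it converts the remaining boxes from `(x, y, w, h)` to `(xmin, ymin, xmax, ymax)`. The plain-Python sketch below mirrors that logic on made-up numbers; `xywh_to_xyxy` and `drop_small` are illustrative names, not part of the notebook.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (xmin, ymin, xmax, ymax)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def drop_small(boxes_xywh, img_w, img_h, small_thr=0.001):
    """Keep only boxes whose area exceeds small_thr times the image area."""
    min_area = img_w * img_h * small_thr
    return [b for b in boxes_xywh if b[2] * b[3] > min_area]


# Hypothetical annotations for a 1024x768 image: one regular face, one tiny face
boxes = [(100, 50, 40, 60), (300, 200, 5, 5)]
kept = drop_small(boxes, img_w=1024, img_h=768)
print([xywh_to_xyxy(b) for b in kept])  # [(100, 50, 140, 110)]
```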

#### Initialize the Model

In [None]:
# Use a pretrained Faster R-CNN model from torchvision and modify it
class FaceDetectionModel(L.LightningModule):
    def __init__(self):
        super(FaceDetectionModel, self).__init__()
        self.model = models.detection.fasterrcnn_mobilenet_v3_large_fpn(weights="DEFAULT")

    def forward(self, images, targets=None):
        if targets is None:
            return self.model(images)
        return self.model(images, targets)

    def training_step(self, batch, batch_idx):
        imgs, annot = batch
        images, targets = FacesData.convert_inputs(imgs, annot, device=self.device)
        loss_dict = self.model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        self.log('train_loss', losses)
        return losses

    def _eval_loss(self, batch):
        """Compute the summed detection loss during validation or testing."""
        imgs, annot = batch
        images, targets = FacesData.convert_inputs(imgs, annot, device=self.device)
        # TorchVision detection models return losses only in train mode, while Lightning
        # switches to eval mode for validation/testing, so toggle the mode around the call.
        self.model.train()
        loss_dict = self.model(images, targets)
        self.model.eval()
        return sum(loss for loss in loss_dict.values())

    def validation_step(self, batch, batch_idx):
        losses = self._eval_loss(batch)
        self.log('val_loss', losses)
        return losses

    def test_step(self, batch, batch_idx):
        losses = self._eval_loss(batch)
        self.log('test_loss', losses)
        return losses

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)

#### Run Training

In [None]:
gc.collect()

In [None]:
data = FacesData()
model = FaceDetectionModel()
trainer = L.Trainer(
    max_epochs=5, 
    precision='16-mixed', 
    log_every_n_steps=10
)

In [None]:
trainer.fit(model, data)

In [None]:
gc.collect()

## Test the Model

In [None]:
plt.rcParams["savefig.bbox"] = "tight"
sample_idx = 0
print(f"selected image sample: {sample_idx}")

def show(imgs):
    if not isinstance(imgs, list):
        imgs = [imgs]
    fig, axs = plt.subplots(ncols=len(imgs), squeeze=False, figsize=(7 * len(imgs), 8))
    for i, img in enumerate(imgs):
        img = img.detach()
        img = F.to_pil_image(img)
        axs[0, i].imshow(np.asarray(img))
        axs[0, i].set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
    return fig

# Step 1: Define the transform
transform = transforms.Compose([transforms.ToTensor()])
# define the transform
normalize = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.Resize(size=(800,), max_size=1333),
])

# Step 2: Load an annotated WIDERFace split using torchvision.datasets.
# The 'test' split has no annotations, so 'val' is used to compare against ground truth.
sample_dataset = datasets.WIDERFace(root=DATA, split='val', download=True, transform=transform)
img, target = sample_dataset[sample_idx]
img = F.convert_image_dtype(img, dtype=torch.uint8)
boxes = ops.box_convert(target['bbox'], in_fmt='xywh', out_fmt='xyxy')

# visualize the annotation
annot = utils.draw_bounding_boxes(img, boxes, colors="red", width=5)

# Replace with the path to your trained checkpoint 'lightning_logs/version_x/checkpoints/...'
checkpoint_path = glob.glob("lightning_logs/version_6/checkpoints/*.ckpt")[0]
print(f"loading model from checkpoint '{checkpoint_path}'")

# Load the model
model = FaceDetectionModel.load_from_checkpoint(checkpoint_path).cpu()
model.eval()

# Get the model prediction
img2, _ = sample_dataset[sample_idx]
model_input = normalize(img2)
with torch.no_grad():
    output = model.model([model_input])
print(f"predictions: {output}")
boxes = output[0]['boxes'][output[0]['scores'] >= 0.15]
# Predicted boxes are in the coordinates of the resized model input, so draw them on a resized copy
img_resized = F.resize(img, list(model_input.shape[-2:]))
preds = utils.draw_bounding_boxes(img_resized, boxes, colors="pink", width=5)

# export figure
fig = show([annot, preds])
fig.savefig('figure.png')

#### Initialize Transforms for Data Augmentations

In [None]:
gc.collect()

In [None]:
def get_transform(train: bool):
    # ToImage + ToDtype(scale=True) is the v2 replacement for the deprecated
    # ToTensor / ConvertImageDtype pair.
    if train:
        return T2.Compose([
            T2.ToImage(),
            T2.RandomResizedCrop(size=(300, 300)),
            T2.RandomHorizontalFlip(p=0.5),
            T2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
            T2.ToDtype(torch.float32, scale=True),
            T2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
    else:
        return T2.Compose([
            T2.ToImage(),
            T2.Resize((300, 300)),
            T2.ToDtype(torch.float32, scale=True),
            T2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

# Minimal collate function for variable number of targets per image
def collate_fn(batch):
    return tuple(zip(*batch))
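The default DataLoader collation tries to stack samples into a single tensor, which fails when each image carries a different number of boxes. The `collate_fn` above sidesteps this by transposing the batch: a list of `(image, target)` pairs becomes one tuple of images and one tuple of targets. The sketch below shows the effect with placeholder strings standing in for image tensors.

```python
def collate_fn(batch):
    """Turn [(img0, tgt0), (img1, tgt1), ...] into ((img0, img1, ...), (tgt0, tgt1, ...))."""
    return tuple(zip(*batch))


# Placeholder samples: strings stand in for image tensors
batch = [
    ("img0", {"boxes": [[0, 0, 10, 10]]}),
    ("img1", {"boxes": [[5, 5, 20, 20], [1, 1, 4, 4]]}),
]
images, targets = collate_fn(batch)
print(images)                    # ('img0', 'img1')
print(len(targets[1]["boxes"]))  # 2
```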

## Train Model on PennFudanPed Dataset

#### Initialize Dataset

In [None]:
# Define the PennFudanPed Dataset (adapted from TorchVision tutorials)
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        mask = Image.open(mask_path)
        mask = np.array(mask)

        # Instances are encoded as different colors
        obj_ids = np.unique(mask)[1:]
        masks = mask == obj_ids[:, None, None]

        boxes = []
        for i in range(len(obj_ids)):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        num_objs = len(obj_ids)
        labels = torch.ones((num_objs,), dtype=torch.int64)  # one class: person

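        # Imported here to keep the dependency next to where it is used
        from torchvision import tv_tensors

        target = {}
        # Wrap the boxes so the v2 transforms crop, flip, and resize them together with the image
        target["boxes"] = tv_tensors.BoundingBoxes(
            boxes, format="XYXY", canvas_size=(img.height, img.width)
        )
        target["labels"] = labels
        target["image_id"] = torch.tensor([idx])

        if self.transforms is not None:
            img, target = self.transforms(img, target)
            # Drop boxes that a random crop reduced to zero width or height
            b = target["boxes"]
            keep = (b[:, 2] > b[:, 0]) & (b[:, 3] > b[:, 1])
            target["boxes"], target["labels"] = b[keep], target["labels"][keep]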

        return img, target

    def __len__(self):
        return len(self.imgs)
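In the PennFudanPed masks, background pixels are 0 and each pedestrian is painted with its own positive integer id. The dataset class above derives one bounding box per id from the extent of its pixels. The plain-Python sketch below reproduces that idea on a tiny hand-written mask; `boxes_from_instance_mask` is a hypothetical helper, and the notebook itself uses NumPy for the same computation.

```python
def boxes_from_instance_mask(mask):
    """mask: 2-D list of ints, 0 = background, k > 0 = instance id.

    Returns one (xmin, ymin, xmax, ymax) box per instance, ordered by id.
    """
    extents = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue
            xmin, ymin, xmax, ymax = extents.get(obj_id, (x, y, x, y))
            extents[obj_id] = (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))
    return [extents[k] for k in sorted(extents)]


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 2, 0],
]
print(boxes_from_instance_mask(mask))  # [(1, 1, 2, 2), (4, 2, 4, 3)]
```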

In [None]:
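# Prepare the datasets using the new transforms v2.
# Two views over the same files let the training split be augmented while validation is not.
dataset_train_full = PennFudanDataset(pennfped.absolute(), get_transform(train=True))
dataset_val_full = PennFudanDataset(pennfped.absolute(), get_transform(train=False))
n = len(dataset_train_full)
n_train = int(0.8 * n)
indices = torch.randperm(n).tolist()
dataset_train = torch.utils.data.Subset(dataset_train_full, indices[:n_train])
dataset_val = torch.utils.data.Subset(dataset_val_full, indices[n_train:])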

workers = 0

train_loader = DataLoader(
    dataset_train, 
    batch_size=2, 
    shuffle=True, 
    num_workers=workers,
    # persistent_workers=True,
    collate_fn=collate_fn
)
val_loader = DataLoader(
    dataset_val, 
    batch_size=4, 
    shuffle=False, 
    num_workers=workers,
    # persistent_workers=True,
    collate_fn=collate_fn
)

#### Initialize the Model

In [None]:
# Define a PyTorch Lightning Module for Faster-RCNN
class FasterRCNNLightning(pl.LightningModule):
    def __init__(self, num_classes=2, lr=0.005):
        super().__init__()
        # Load pre-trained Faster-RCNN model
        self.model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            weights=torchvision.models.detection.faster_rcnn.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
        )
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        self.model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes)
        self.lr = lr

    def forward(self, images, targets=None):
        return self.model(images, targets)

    def training_step(self, batch, batch_idx):
        images, targets = batch
        images = [img.to(self.device) for img in images]
        targets = [{k: v.to(self.device) for k, v in t.items()} for t in targets]
        loss_dict = self.model(images, targets)
        loss = sum(loss for loss in loss_dict.values())
        self.log("train_loss", loss, prog_bar=True)
        return loss

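    def validation_step(self, batch, batch_idx):
        images, targets = batch
        images = [img.to(self.device) for img in images]
        targets = [{k: v.to(self.device) for k, v in t.items()} for t in targets]
        # TorchVision detection models return losses only in train mode, while Lightning
        # switches to eval mode for validation, so toggle the mode around the call.
        self.model.train()
        loss_dict = self.model(images, targets)
        self.model.eval()
        loss = sum(loss for loss in loss_dict.values())
        self.log("val_loss", loss, prog_bar=True)
        return loss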

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.model.parameters(), lr=self.lr, momentum=0.9, weight_decay=0.0005)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
        return [optimizer], [scheduler]
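The `StepLR` scheduler configured above multiplies the learning rate by `gamma` once every `step_size` epochs. The closed-form sketch below prints the resulting schedule for the values used in this notebook (`lr=0.005`, `step_size=3`, `gamma=0.1`); `step_lr` is an illustrative helper, not a PyTorch function, and it assumes the scheduler is stepped once per epoch, which is Lightning's default.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under a StepLR schedule."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in range(7):
    print(epoch, f"{step_lr(0.005, epoch):.6f}")
```

Under these assumptions the first decay only happens at epoch 3, so a run limited to `max_epochs=2` trains at the base learning rate throughout.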

#### Train the Model

In [None]:
# Instantiate the Lightning module
model_lightning = FasterRCNNLightning(num_classes=2)

In [None]:
# Initialize a PyTorch Lightning Trainer
trainer = pl.Trainer(
    max_epochs=2, 
    accelerator='auto', 
    devices=1
)

In [None]:
gc.collect()

In [None]:
# Train the Faster-RCNN model
trainer.fit(
    model_lightning, 
    train_loader, 
    val_loader
)

## Visualize the Results

In [None]:
batch = next(iter(val_loader))

In [None]:
# Ensure the underlying TorchVision model is in eval mode
model_lightning.model.eval()

# Define mean and std used during training (for un-normalization)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3,1,1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3,1,1)

def unnormalize(img):
    """Reverse the normalization on an image tensor."""
    return img * std + mean

# Get a batch from the validation DataLoader (val_loader from training section)
batch = next(iter(val_loader))
images, targets = batch
# Move images to device (assumed same device as model)
images = [img.to(model_lightning.device) for img in images]

# Run inference (without gradients)
with torch.no_grad():
    outputs = model_lightning.model(images)

# Loop over each image in the batch and plot predictions
for i, img in enumerate(images):
    # Unnormalize the image for visualization
    img_unnorm = unnormalize(img).clamp(0, 1)
    # Convert tensor to uint8 for drawing
    img_uint8 = (img_unnorm * 255).type(torch.uint8)
    
    # Get predictions for the image and filter by confidence threshold (e.g., 0.5)
    boxes = outputs[i]["boxes"].detach().cpu()
    scores = outputs[i]["scores"].detach().cpu()
    keep = scores >= 0.5
    boxes = boxes[keep]
    
    # Draw boxes on the image
    drawn_img = draw_bounding_boxes(img_uint8, boxes, colors="red", width=2)
    
    # Convert to PIL image and display
    plt.figure(figsize=(8, 8))
    plt.imshow(F.to_pil_image(drawn_img))
    plt.title(f"Validation Image {i} Predictions")
    plt.axis("off")
    plt.show()

## Conclusion

In this notebook we demonstrated:

- **Inference with Hugging Face DETR:** Running inference on a sample image using a DETR model via Transformers.
- **Inference & Training with Ultralytics YOLOv8:** Running inference on a sample image and training a YOLO model using Ultralytics (ensure your dataset YAML file exists at the specified path).
- **Training Faster‑R‑CNN with PyTorch Lightning:** Wrapping TorchVision’s Faster‑R‑CNN in a LightningModule, using TorchVision Transforms v2 for data augmentation on the PennFudanPed dataset, and training the model.

Feel free to experiment further with hyperparameters, dataset splits, and alternative models. Happy detecting and training!
