<a href="https://www.kaggle.com/code/aisuko/object-detection?scriptVersionId=164769914" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Object detection is the computer vision task of detecting instances of objects in an image. Object detection models receive an image as input and output the coordinates of the bounding boxes and the associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and label, and the objects can appear in different parts of the image. This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.

Here we fine-tune DETR, introduced in [`End-to-End Object Detection with Transformers`](https://arxiv.org/abs/2005.12872), a model that combines a convolutional backbone with an encoder-decoder Transformer, on CPPE-5, a dataset tagged `Object Detection`.


Two supporting libraries are installed alongside the Hugging Face packages:

- **timm:** PyTorch Image Models collection.
- **albumentations:** Albumentations is a Python library for image augmentation. It is used in deep learning and computer vision tasks to improve the quality of trained models.


In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install timm==0.9.12
!pip install albumentations==1.3.1
!pip install pycocotools==2.0.7

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

# login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune DETR (facebook/detr-resnet-50) on CPPE-5"
os.environ["WANDB_NAME"] = "ft-detr-with-cppe-5"

# Loading the CPPE-5 dataset

The CPPE-5 dataset contains images with annotations identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.

In [3]:
from datasets import load_dataset

cppe5=load_dataset("cppe-5")
cppe5

HfHubHTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://huggingface.co/api/datasets/cppe-5

The dataset comes with a training set containing 1000 images and a test set with 29 images. The examples in the dataset have the following fields:
- `image_id`: the example image id
- `image`: a PIL.Image.Image object containing the image
- `width`: width of the image
- `height`: height of the image
- `objects`: a dictionary containing bounding box metadata for the objects in the image
   - `id`: the annotation id
   - `area`: the area of the bounding box
   - `bbox`: the object's bounding box (in the COCO, Common Objects in Context, format)
   - `category`: the object's category, with possible values including
      - Coverall (0)
      - Face_Shield (1)
      - Gloves (2)
      - Goggles (3)
      - Mask (4)

In [None]:
cppe5["train"][0]

# Preprocessing data

The `bbox` field follows the COCO format, which is the format that the DETR model expects. However, the grouping of the fields inside `objects` differs from the annotation format `DETR` requires. We will need to apply some preprocessing transformations before using this data for training.

## Visualization Sample

To get an even better understanding of the data, visualize an example in the dataset.

In [None]:
import numpy as np
from PIL import Image, ImageDraw

image=cppe5["train"][0]["image"]
annotations=cppe5["train"][0]["objects"]
draw=ImageDraw.Draw(image)

categories=cppe5["train"].features["objects"].feature["category"].names

id2label={index: x for index, x in enumerate(categories, start=0)}
label2id={v:k for k,v in id2label.items()}

for i in range(len(annotations["id"])):
    box=annotations["bbox"][i]
    class_idx=annotations["category"][i]
    x,y,w,h=tuple(box)
    draw.rectangle((x,y,x+w,y+h), outline="red", width=1)
    draw.text((x,y), id2label[class_idx], fill="white")

image

To visualize the bounding boxes with their labels, we get the label names from the dataset's metadata, specifically the `category` field. We also create dictionaries that map a label id to a label class (`id2label`) and the other way around (`label2id`). We will use them later when setting up the model. Including these maps makes the model reusable by others if we share it on the Hugging Face Hub.

As a final step of getting familiar with the data, explore it for potential issues. One common problem with datasets for object detection is bounding boxes that "stretch" beyond the edge of the image. Such "runaway" bounding boxes can raise errors during training and should be addressed at this stage. There are a few examples with this issue in this dataset. To keep things simple, we remove these images from the data.

In [None]:
remove_idx=[590, 821, 822, 875, 876,878,879]
keep=[i for i in range(len(cppe5["train"])) if i not in remove_idx]
cppe5["train"]=cppe5["train"].select(keep)
cppe5
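
The indices above are hard-coded. A check along the following lines could be used to locate "runaway" boxes in general. This is a sketch under assumptions: the helper names `box_is_inside` and `find_runaway_examples` are hypothetical, the toy records only imitate the dataset fields, and the check is not claimed to reproduce the exact index list used in this notebook.

```python
def box_is_inside(bbox, width, height):
    """Return True if a COCO box [x, y, w, h] lies fully within a width x height image."""
    x, y, w, h = bbox
    return x >= 0 and y >= 0 and w > 0 and h > 0 and x + w <= width and y + h <= height


def find_runaway_examples(examples):
    """Return the indices of examples that have at least one box outside the image."""
    bad = []
    for idx, ex in enumerate(examples):
        boxes = ex["objects"]["bbox"]
        if not all(box_is_inside(b, ex["width"], ex["height"]) for b in boxes):
            bad.append(idx)
    return bad


# Hypothetical toy records that imitate the `width`, `height`, and `objects` fields.
toy = [
    {"width": 100, "height": 80, "objects": {"bbox": [[10, 10, 20, 20]]}},
    {"width": 100, "height": 80, "objects": {"bbox": [[90, 10, 20, 20]]}},  # x + w = 110 > 100
]
print(find_runaway_examples(toy))  # [1]
```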

## Loading AutoImageProcessor

Here we use `AutoImageProcessor`, which takes care of processing image data to create the `pixel_values`, `pixel_mask`, and `labels` that a DETR model can train with. The image processor has these attributes:

- image_mean=[0.485, 0.456, 0.406]
- image_std=[0.229, 0.224, 0.225]

These are the mean and standard deviation used to normalize images during model pre-training. They are crucial to replicate when running inference or fine-tuning a pre-trained image model.

In [None]:
from transformers import AutoImageProcessor

model_checkpoint="facebook/detr-resnet-50"
image_processor=AutoImageProcessor.from_pretrained(model_checkpoint)
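
To illustrate what normalization with these statistics does, here is a minimal NumPy sketch. It assumes the pixel values are first rescaled from `[0, 255]` to `[0, 1]` and then standardized per channel; the function name `normalize` and the sample image are hypothetical, and the processor's real pipeline also handles resizing and tensor conversion.

```python
import numpy as np

# Per-channel statistics quoted for the image processor (RGB order).
image_mean = np.array([0.485, 0.456, 0.406])
image_std = np.array([0.229, 0.224, 0.225])


def normalize(image_uint8):
    """Rescale an (H, W, 3) uint8 RGB image to [0, 1], then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - image_mean) / image_std


# Hypothetical 2x2 mid-gray image.
gray = np.full((2, 2, 3), 128, dtype=np.uint8)
normalized = normalize(gray)
print(normalized.shape)
print(normalized[0, 0])
```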

Before passing the images to the `image_processor`, apply two preprocessing transformations to the dataset:

- Augmenting images
- Reformatting annotations to meet DETR expectations


## Augmenting Images

To make sure the model does not overfit on the training data, we can apply image augmentation with any data augmentation library. Here we use Albumentations, which ensures that transformations affect the image and update the bounding boxes accordingly. The Datasets documentation has an [example](https://huggingface.co/docs/datasets/object_detection) of this. We apply the same approach here: resize each image to (480, 480), flip it horizontally, and randomly adjust its brightness and contrast.

In [None]:
import albumentations
import numpy as np
import torch

transform=albumentations.Compose(
    [
        albumentations.Resize(480, 480),
        albumentations.HorizontalFlip(p=1.0),
        albumentations.RandomBrightnessContrast(p=1.0),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
)
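
To see why the bounding boxes must be updated together with the image, the geometry of the resize and horizontal flip can be sketched in plain Python. This is an illustration of the arithmetic only, not Albumentations' implementation; the helper names and the sample box are hypothetical.

```python
def resize_coco_bbox(bbox, old_size, new_size):
    """Scale a COCO box [x, y, w, h] when an image goes from old_size to new_size (width, height)."""
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    x, y, w, h = bbox
    return [x * sx, y * sy, w * sx, h * sy]


def hflip_coco_bbox(bbox, image_width):
    """Mirror a COCO box [x, y, w, h] across the vertical centre line of the image."""
    x, y, w, h = bbox
    return [image_width - x - w, y, w, h]


box = [100.0, 50.0, 200.0, 100.0]  # hypothetical box in a 960x480 image
resized = resize_coco_bbox(box, old_size=(960, 480), new_size=(480, 480))
flipped = hflip_coco_bbox(resized, image_width=480)
print(resized)  # [50.0, 50.0, 100.0, 100.0]
print(flipped)  # [330.0, 50.0, 100.0, 100.0]
```

Brightness and contrast changes only alter pixel values, so they leave the boxes untouched.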

## Reformatting Annotations

The `image_processor` expects the annotations to be in the following format `{'image_id':int, 'annotations': List[Dict]}`, where each dictionary is a COCO object annotation. Let's add a function to reformat annotations for a single example:

In [None]:
def formatted_anns(image_id, category, area, bbox):
    annotations=[]
    for i in range(0, len(category)):
        new_ann={
            "image_id":image_id,
            "category_id":category[i],
            "iscrowd":0,  # COCO annotation key is lowercase
            "area":area[i],
            "bbox": list(bbox[i]),
        }
        annotations.append(new_ann)
    return annotations

Let's combine the image and annotation transformations to use on a batch of examples:

In [None]:
# transforming a batch
def transform_aug_ann(examples):
    image_ids=examples["image_id"]
    images, bboxes, area, categories=[],[],[],[]
    for image, objects in zip(examples["image"], examples["objects"]):
        # Keep RGB channel order so training matches the evaluation and inference code
        image=np.array(image.convert("RGB"))
        out=transform(image=image, bboxes=objects["bbox"], category=objects["category"])
        
        area.append(objects["area"])
        images.append(out["image"])
        bboxes.append(out["bboxes"])
        categories.append(out["category"])

    targets=[
        {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]
    
    return image_processor(images=images, annotations=targets, return_tensors="pt")


cppe5["train"]=cppe5["train"].with_transform(transform_aug_ann)
cppe5["train"][15]

## Batch Data

We have successfully augmented the individual images and prepared their annotations. However, preprocessing isn't complete yet. In the final step, create a custom `collate_fn` to batch images together. Pad images to the size of the largest image in a batch, and create a corresponding `pixel_mask` to indicate which pixels are real (1) and which are padding (0).

In [None]:
def collate_fn(batch):
    pixel_values=[item["pixel_values"] for item in batch]
    encoding=image_processor.pad(pixel_values, return_tensors="pt")
    labels=[item["labels"] for item in batch]
    batch={}
    batch["pixel_values"]=encoding["pixel_values"]
    batch["pixel_mask"]=encoding["pixel_mask"]
    batch["labels"]=labels
    return batch
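
The padding step can be pictured with a small NumPy sketch. It assumes images are stored channels-first as `(C, H, W)` and that padding is added at the bottom and right; the function name `pad_batch` and the toy batch are hypothetical, and this is not the image processor's own code.

```python
import numpy as np


def pad_batch(images):
    """Zero-pad (C, H, W) arrays to the largest H and W in the batch and build pixel masks."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    channels = images[0].shape[0]
    pixel_values = np.zeros((len(images), channels, max_h, max_w), dtype=np.float32)
    pixel_mask = np.zeros((len(images), max_h, max_w), dtype=np.int64)
    for i, img in enumerate(images):
        _, h, w = img.shape
        pixel_values[i, :, :h, :w] = img  # real pixels go in the top-left corner
        pixel_mask[i, :h, :w] = 1         # 1 = real pixel, 0 = padding
    return pixel_values, pixel_mask


# Hypothetical batch of two images with different sizes.
batch = [np.ones((3, 4, 6), dtype=np.float32), np.ones((3, 5, 3), dtype=np.float32)]
pixel_values, pixel_mask = pad_batch(batch)
print(pixel_values.shape, pixel_mask.shape)  # (2, 3, 5, 6) (2, 5, 6)
print(pixel_mask[1])
```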

# Training

In [None]:
from transformers import AutoModelForObjectDetection

model=AutoModelForObjectDetection.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

print(model.config)

In [None]:
from transformers import TrainingArguments, Trainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_checkpointing=True,  # boolean flag: trades extra compute for lower memory
    num_train_epochs=5,
    fp16=True,
    save_steps=200,
    logging_steps=50,
    learning_rate=1e-5,
    weight_decay=1e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)

trainer=Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=cppe5["train"],
    tokenizer=image_processor,
)

# Related to the issue https://github.com/huggingface/transformers/issues/13197
trainer.train()

In [None]:
image_processor.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

# Evaluate

Object detection models are commonly evaluated with a set of COCO-style metrics. Here we will use the implementation from torchvision to evaluate the final model.

We need to prepare a ground truth COCO dataset. The API to build a COCO dataset requires the data to be stored in a certain format, so we will need to save images and annotations to disk first. There are three major steps:

- Prepare the `cppe5["test"]` set: format the annotations and save the data to disk.
- Prepare an instance of a `CocoDetection` class that can be used with the COCO evaluator.
- Load the metrics and run the evaluation.

In [None]:
import json
import torchvision


# format annotations the same way as for training; no data augmentation is needed
def val_formatted_anns(image_id, objects):
    annotations=[]
    for i in range(0, len(objects["id"])):
        new_ann={
            "id": objects["id"][i],
            "category_id": objects["category"][i],
            "iscrowd":0,
            "image_id":image_id,
            "area":objects["area"][i],
            "bbox":objects["bbox"][i],
        }
        annotations.append(new_ann)
    return annotations


# Save images and annotations into the files
def save_cppe5_annotation_file_images(cppe5):
    output_json={}
    path_output_cppe5=f"{os.getcwd()}/cppe5/"
    
    if not os.path.exists(path_output_cppe5):
        os.makedirs(path_output_cppe5)
    
    path_anno=os.path.join(path_output_cppe5, "cppe5_ann.json")
    categories_json=[{"supercategory": "none", "id":id, "name": id2label[id]} for id in id2label]
    output_json["images"]=[]
    output_json["annotations"]=[]
    for example in cppe5:
        ann=val_formatted_anns(example["image_id"], example["objects"])
        output_json["images"].append(
            {
                "id": example["image_id"],
                "width":example["image"].width,
                "height": example["image"].height,
                "file_name":f"{example['image_id']}.png",
            }
        )
        output_json["annotations"].extend(ann)
    output_json["categories"]=categories_json
    
    with open(path_anno, "w") as file:
        json.dump(output_json, file, ensure_ascii=False, indent=4)
    
    for im, img_id in zip(cppe5["image"], cppe5["image_id"]):
        path_img=os.path.join(path_output_cppe5, f"{img_id}.png")
        im.save(path_img)
    
    return path_output_cppe5, path_anno



# Preparing an instance of a CocoDetection class that can be used with cocoevaluator.
class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, img_folder, image_processor, ann_file):
        super().__init__(img_folder, ann_file)
        self.image_processor=image_processor
        
    def __getitem__(self, idx):
        img, target=super(CocoDetection, self).__getitem__(idx)
        
        image_id=self.ids[idx]
        target={"image_id": image_id,"annotations": target}
        encoding=self.image_processor(images=img, annotations=target, return_tensors="pt")
        pixel_values=encoding["pixel_values"].squeeze()
        target=encoding["labels"][0]
        
        return {"pixel_values":pixel_values, "labels":target}


im_processor=AutoImageProcessor.from_pretrained(os.getenv("WANDB_NAME"))

path_output_cppe5, path_anno=save_cppe5_annotation_file_images(cppe5["test"])
test_ds_coco_format=CocoDetection(path_output_cppe5, im_processor, path_anno)

Finally, load the metrics and run the evaluation:

In [None]:
import evaluate
from tqdm import tqdm

model=AutoModelForObjectDetection.from_pretrained(os.getenv("WANDB_NAME"))
module=evaluate.load("ybelkada/cocoevaluate", coco=test_ds_coco_format.coco)
val_dataloader=torch.utils.data.DataLoader(
    test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn
)

with torch.no_grad():
    for idx, batch in enumerate(tqdm(val_dataloader)):
        pixel_values=batch["pixel_values"]
        pixel_mask=batch["pixel_mask"]
        
        labels=[
            {k: v for k,v in t.items()} for t in batch["labels"]
        ]
        
        outputs=model(pixel_values=pixel_values, pixel_mask=pixel_mask)
        
        orig_target_sizes=torch.stack([target["orig_size"] for target in labels], dim=0)
        results=im_processor.post_process(outputs, orig_target_sizes)
        
        module.add(prediction=results, reference=labels)
        del batch

results=module.compute()
print(results)
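
COCO-style average precision and recall are built on intersection over union (IoU): a prediction counts as a match only when its IoU with a ground-truth box exceeds a threshold. The sketch below computes IoU for two boxes in COCO `[x, y, w, h]` format. The function name `iou_coco` and the sample boxes are hypothetical, and the evaluator itself does considerably more (matching, thresholds, averaging).

```python
def iou_coco(box_a, box_b):
    """Intersection over union of two COCO boxes [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


pred = [0.0, 0.0, 10.0, 10.0]   # hypothetical prediction
truth = [5.0, 0.0, 10.0, 10.0]  # hypothetical ground truth, shifted right by half a box
print(iou_coco(pred, truth))    # overlap 50, union 150, so IoU is one third
```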

# Inference

In [None]:
from transformers import pipeline
import requests

url="https://i.imgur.com/2lnWoly.jpg"
image=Image.open(requests.get(url, stream=True).raw)

obj_detector=pipeline("object-detection", model=os.getenv("WANDB_NAME"))
obj_detector(image)

We can also manually replicate the results of the pipeline.

In [None]:
image_processor=AutoImageProcessor.from_pretrained(os.getenv("WANDB_NAME"))
model=AutoModelForObjectDetection.from_pretrained(os.getenv("WANDB_NAME"))

with torch.no_grad():
    inputs=image_processor(images=image, return_tensors="pt")
    outputs=model(**inputs)
    target_sizes=torch.tensor([image.size[::-1]])
    results=image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box=[round(i,2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
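
Conceptually, post-processing turns the raw model outputs for each query into a score, a label, and a pixel-space box. The following pure-Python sketch shows that idea for a single query. It assumes the last logit is the "no object" class and that boxes are predicted as normalized `(center_x, center_y, width, height)`; the function names and sample values are hypothetical, and the library's own implementation works on batched tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess_query(logits, box, image_size, threshold=0.5):
    """Return score, label, and corner box for one query, or None if below the threshold."""
    probs = softmax(logits)[:-1]  # drop the trailing "no object" class
    score = max(probs)
    label = probs.index(score)
    if score <= threshold:
        return None
    height, width = image_size
    cx, cy, w, h = box
    corners = [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]
    return {"score": score, "label": label, "box": corners}


# Hypothetical query: two real classes plus "no object", on a 100 (height) x 200 (width) image.
detection = postprocess_query([0.0, 5.0, 0.0], [0.5, 0.5, 0.2, 0.4], image_size=(100, 200))
print(detection)
```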

In [None]:
draw=ImageDraw.Draw(image)

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box=[round(i,2) for i in box.tolist()]
    x,y,x2,y2=tuple(box)
    draw.rectangle((x,y,x2,y2), outline="red", width=1)
    draw.text((x,y), model.config.id2label[label.item()], fill="white")

image