# Tutorial 12 - Object Detection

## Dr. David C. Schedl

This tutorial is geared towards students **experienced in programming** and aims to introduce you to **Digital Imaging / Computer Vision** techniques.

We will look at object detection with a modern network architecture (DETR, a transformer-based detector) and the HuggingFace API (HuggingFace's `transformers` library, which builds on PyTorch).

For faster processing, it is recommended to use a **GPU**. In Google Colab go to the menu and select **Edit** -> **Notebook settings** -> **Hardware accelerator** -> switch to **GPU**.

Useful links:
* [DETR on HuggingFace](https://huggingface.co/facebook/detr-resnet-50)
* [DETR on GitHub (by Facebook research)](https://github.com/facebookresearch/detr)
* [PyTorch documentation](https://pytorch.org/docs/stable/index.html)


##### Acknowledgements
The code of this tutorial is based on the example code on [HuggingFace](https://huggingface.co/facebook/detr-resnet-50).

We will work with images today. So let's download some with `curl`.

In [None]:
!mkdir -p "data"
!curl -o "./data/couch.jpg" "http://images.cocodataset.org/val2017/000000039769.jpg" --silent
!curl -o "./data/teddy.jpg" "https://farm5.staticflickr.com/4100/4893226511_941ce57389_z.jpg" --silent
!curl -o "./data/safari.jpg" "https://farm4.staticflickr.com/3380/3519870985_2d2b50338d_z.jpg" --silent
!curl -o "./data/rider.jpg" "http://images.cocodataset.org/val2017/000000439715.jpg" --silent

## Initialization

As always, let's first import some useful libraries.
HuggingFace's `transformers` library is not installed by default on Colab, so the next cell installs the requirements with `pip install transformers timm` if the import fails.
If you get an error in the next cells, try restarting your runtime!
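
To see up front which packages are missing, without triggering a failing import, one option is a small check like the sketch below. It relies only on the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is a hypothetical one introduced here for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# hypothetical usage: list what still needs a `pip install`
print(missing_packages(["transformers", "timm", "torch"]))
```

An empty list means that all three packages can be found in the current environment.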

In [None]:
try:
    from transformers import DetrImageProcessor, DetrForObjectDetection
except ImportError:
    !pip install transformers timm  # first install the HuggingFace transformers API
    from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image
import requests
from matplotlib import pyplot as plt


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type != "cuda":
    print("Using CPU! Things will be slow! :(")

## Initialize the Model

In [None]:
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
class_names = model.config.id2label
model.to(device)  # move the model to the selected device (GPU if available)
print(class_names)  # display what our detector can detect

## Load the images

Let's load some images and display them.

In [None]:
images = [
    Image.open(f"./data/{name}.jpg") for name in ["couch", "teddy", "rider", "safari"]
]

# display images with matplotlib
for image in images:
    plt.imshow(image)
    plt.show()

torch.cuda.empty_cache()  # free up GPU memory
inputs = [processor(images=img, return_tensors="pt") for img in images]

In [None]:
predictions = []
for i, inp in enumerate(inputs):
    torch.cuda.empty_cache()  # free up GPU memory
    inp = inp.to(device)  # send inputs to the selected device

    # inference only: no gradients are needed, which saves memory
    with torch.no_grad():
        outputs = model(**inp)
    # move outputs to CPU
    for k in outputs.keys():
        if outputs[k] is not None:
            outputs[k] = outputs[k].to("cpu")

    # display the image
    image = images[i]
    plt.imshow(image)
    plt.show()

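    # convert outputs (normalized bounding boxes and class logits) to scored
    # detections with boxes in absolute pixel coordinates;
    # let's only keep detections with score > 0.9
    # PIL's image.size is (width, height); target_sizes expects (height, width)
    target_sizes = torch.tensor([image.size[::-1]])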
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=0.9
    )[0]

    for score, label, box in zip(
        results["scores"], results["labels"], results["boxes"]
    ):
        box = [round(v, 2) for v in box.tolist()]
        print(
            f"Detected {model.config.id2label[label.item()]} with confidence "
            f"{round(score.item(), 3)} at location {box}"
        )
    predictions.append(results)
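
For intuition, the thresholding step can be sketched in plain Python. This is a simplified illustration, not the library's code: it assumes each detection query yields one logit per class plus a final "no object" logit, turns the logits into probabilities with a softmax, and keeps the best real class only if its probability exceeds the threshold. The function names and the toy logits are made up for this example.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_confident(query_logits, threshold=0.9):
    """Return (query_index, class_index, score) for confident detections.

    Assumption: the last entry of each logit vector is the "no object" class.
    """
    kept = []
    for q, logits in enumerate(query_logits):
        probs = softmax(logits)[:-1]  # ignore the "no object" probability
        score = max(probs)
        if score > threshold:
            kept.append((q, probs.index(score), score))
    return kept


# toy logits for three queries over two classes + "no object" (made-up numbers)
toy_logits = [
    [8.0, 0.5, 1.0],  # clearly class 0
    [0.2, 0.1, 6.0],  # clearly "no object"
    [1.0, 1.2, 1.1],  # undecided, so below the threshold
]
for q, cls, score in keep_confident(toy_logits):
    print(f"query {q}: class {cls} with score {score:.3f}")
```

Lowering `threshold` keeps more, but less certain, detections.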

### Exercise 01 📝: Visualizing the bounding boxes

Draw the bounding boxes with labels and confidence scores on the images.

You can use the `images` and `predictions` lists (one element for each image) from above.
Each prediction is a dictionary with the following keys: `labels`, `scores`, `boxes`.
Boxes are given in pixel coordinates in the format `[x0, y0, x1, y1]` (top-left and bottom-right corner).

You can use pyplot's `plot` and `text` functions to draw the boxes (and text) on the images.
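
Since `plot` draws connected line segments, a box has to be turned into a closed outline first. A minimal sketch, assuming the `[x0, y0, x1, y1]` box format; the helper name `box_outline` is hypothetical:

```python
def box_outline(box):
    """Return x and y coordinate lists tracing the box as a closed polyline."""
    x0, y0, x1, y1 = box
    xs = [x0, x1, x1, x0, x0]  # left, right, right, left, back to the start
    ys = [y0, y0, y1, y1, y0]  # top, top, bottom, bottom, back to the start
    return xs, ys


# made-up box for illustration
xs, ys = box_outline([10.0, 20.0, 110.0, 70.0])
print(xs, ys)
```

The two lists can then be passed to pyplot, for example `plt.plot(xs, ys)`, and a label can be placed at the top-left corner with `plt.text(x0, y0, ...)`.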

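Optionally, you can also use the `cv2.rectangle` function to draw the boxes. OpenCV works on NumPy arrays, so convert the PIL image first (e.g., with `np.array(image)`) and display that array. Note that OpenCV draws into the array directly, so you don't need to return the image.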
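
OpenCV's drawing functions expect integer pixel coordinates, while the predicted boxes are floats and can extend slightly past the image border. One way to prepare them, sketched with a hypothetical helper `to_pixel_box`:

```python
def to_pixel_box(box, width, height):
    """Round a float [x0, y0, x1, y1] box and clamp it to the image bounds."""
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    x0, x1 = (min(max(x, 0), width - 1) for x in (x0, x1))
    y0, y1 = (min(max(y, 0), height - 1) for y in (y0, y1))
    return x0, y0, x1, y1


# made-up box on a 640x480 image
print(to_pixel_box([-3.2, 10.6, 650.0, 200.4], width=640, height=480))
```

The two corner points `(x0, y0)` and `(x1, y1)` are what `cv2.rectangle` takes as its `pt1` and `pt2` arguments.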


In [None]:
def visualize_detection(image, results):
    for score, label, box in zip(
        results["scores"], results["labels"], results["boxes"]
    ):
        box = [round(v, 2) for v in box.tolist()]
        x1, y1, x2, y2 = box

        # TODO: add predictions to images

        # print(
        #     f"Detected {class_names[label.item()]} with confidence "
        #     f"{round(score.item(), 3)} at location {box}"
        # )


for i, image in enumerate(images):
    plt.imshow(image)
    visualize_detection(image, predictions[i])
    plt.show()