# Object Detection

Object detection networks detect and localize objects within an image or video. They are used in a wide range of applications, such as autonomous vehicles and advanced driver assistance systems, surveillance, object tracking in video, and human-computer interaction. Object detection has become a core part of computer vision and has improved significantly in recent years thanks to deep learning.

In [None]:
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from pycocotools.coco import COCO
import torchvision.transforms as transforms
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import cv2
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from torchvision.transforms.functional import pil_to_tensor

First, let's set up the transforms we are going to use on our input image. We need to resize the image and convert it to a torch tensor so that we can pass it to the model. We also need a separate copy of the image, resized in the same way but kept as a `uint8` tensor, so that we can draw the bounding boxes of the detected objects on it.

In [None]:
# This transform resizes the image, and converts it to a tensor to use as an input to the model
transform = transforms.Compose([
    transforms.Resize(800),
    transforms.ToTensor(),
])

# This transform only resizes the image. We apply it to the uint8 image tensor that we use for plotting the bounding boxes
resize_transform = transforms.Resize(800)
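
When `Resize` is given a single integer, torchvision scales the shorter edge of the image to that length and keeps the aspect ratio, so the longer edge changes proportionally. The following is a plain-Python sketch of that size arithmetic. `resized_size` is a hypothetical helper, not part of torchvision, and the exact rounding of the longer edge is an assumption that may differ by a pixel between versions.

```python
def resized_size(height, width, size=800):
    """Approximate output (height, width) of Resize(size) with an integer size."""
    short, long = (height, width) if height <= width else (width, height)
    new_short = size
    # Assumption: the longer edge is truncated to an integer
    new_long = int(size * long / short)
    return (new_short, new_long) if height <= width else (new_long, new_short)


# A hypothetical 1920x1080 landscape photo ends up 800 pixels high
print(resized_size(1080, 1920))  # (800, 1422)
```

Because both the model input and the plotting copy go through the same `Resize(800)`, they end up with the same height and width.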

Next, we load the image, making sure it is converted to RGB format, and apply the transforms we defined in the previous code block.

In [None]:
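# Load the image and make sure it has three RGB channels
img = Image.open('street_scene.jpg').convert('RGB')

# uint8 tensor copy, resized like the model input, used later for drawing the boxes
image_for_output = resize_transform(pil_to_tensor(img))

# Float tensor in [0, 1] with a leading batch dimension, used as the model input
img_tensor = transform(img).unsqueeze(0)

img.show()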

Next, we download our pretrained model. We will be using `fasterrcnn_resnet50_fpn`, a popular object detection architecture. Faster R-CNN is a two-stage object detector: it first generates region proposals, then classifies each proposed region and refines its bounding box. ResNet-50, a deep residual network with 50 layers, is used as the backbone to extract features from the image. The FPN (Feature Pyramid Network) combines features from different scales, making the network more robust to objects of different sizes.

Once the model is downloaded, we set it to evaluation mode by calling its `eval()` method, and pass it our image tensor to predict the object bounding boxes.

In [None]:
# `weights="DEFAULT"` needs torchvision >= 0.13; older releases use `pretrained=True` instead
object_detection_model = fasterrcnn_resnet50_fpn(weights="DEFAULT", progress=False)
object_detection_model.eval()

# Run inference
with torch.no_grad():
    street_preds = object_detection_model(img_tensor)

The model outputs a number of boxes around objects it thinks are in the image, each with its own confidence score between 0 and 1. We can view the scores as follows:

In [None]:
street_preds[0]["scores"]
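
Each element of the returned list is a dictionary with three entries of equal length: `boxes`, `labels`, and `scores`. Each box is given as `[x1, y1, x2, y2]` pixel coordinates of its top-left and bottom-right corners. The sketch below mimics that layout with plain Python lists and made-up values (the real entries are tensors) and derives the width and height of each box. `box_sizes` is a hypothetical helper.

```python
# Made-up prediction in the same layout as the model output (real values are tensors)
example_pred = {
    "boxes": [[10.0, 20.0, 110.0, 220.0], [300.0, 50.0, 340.0, 90.0]],
    "labels": [1, 3],
    "scores": [0.98, 0.41],
}


def box_sizes(boxes):
    """Return (width, height) for each [x1, y1, x2, y2] box."""
    return [(x2 - x1, y2 - y1) for x1, y1, x2, y2 in boxes]


print(box_sizes(example_pred["boxes"]))  # [(100.0, 200.0), (40.0, 40.0)]
```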

We are not too interested in objects it predicted with a low confidence, so let's set a threshold of 0.8 to just select the ones it is pretty certain are there:

In [None]:
# Keep only the detections whose score is above the threshold
keep = street_preds[0]["scores"] > 0.8
street_preds[0]["boxes"] = street_preds[0]["boxes"][keep]
street_preds[0]["labels"] = street_preds[0]["labels"][keep]
street_preds[0]["scores"] = street_preds[0]["scores"][keep]
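
The filtering step relies on a boolean mask: comparing the scores with the threshold yields one `True`/`False` per detection, and indexing with that mask keeps only the `True` positions. Computing the mask once, before any entry is overwritten, guarantees that boxes, labels, and scores stay aligned. A plain-Python sketch of the same idea, using lists and made-up values instead of tensors (`filter_by_score` is a hypothetical helper):

```python
def filter_by_score(pred, threshold=0.8):
    """Keep only the detections whose score exceeds the threshold."""
    keep = [score > threshold for score in pred["scores"]]  # one flag per detection
    return {
        key: [value for value, flag in zip(values, keep) if flag]
        for key, values in pred.items()
    }


# Made-up detections for illustration
pred = {
    "boxes": [[10, 20, 110, 220], [300, 50, 340, 90], [5, 5, 50, 60]],
    "labels": [1, 3, 1],
    "scores": [0.98, 0.41, 0.85],
}
filtered = filter_by_score(pred)
print(filtered["labels"], filtered["scores"])  # [1, 1] [0.98, 0.85]
```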

As usual for a machine learning model, the labels are currently in integer format, so not very human readable!

In [None]:
street_preds[0]["labels"]

We can load the COCO category definitions from the annotation file and use them to make the labels human readable. We can also format them for display on the image when we visualise it:

In [None]:
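# COCO annotation file, which contains the category id -> name mapping
annFile = 'instances_val2017.json'
coco = COCO(annFile)

# Look up the category entry for each predicted label id
street_labels = coco.loadCats(street_preds[0]["labels"].tolist())
# Format each label as "name-score" for display on the image
street_annot_labels = ["{}-{:.2f}".format(label["name"], prob) for label, prob in zip(street_labels, street_preds[0]["scores"].tolist())]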

street_labels
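
`loadCats` returns one category dictionary per id, and the list comprehension joins each category name with its score. The sketch below reproduces that formatting step without the annotation file. `CATEGORIES` is a hand-written subset of the COCO id-to-name mapping, used here as a stand-in for `coco.loadCats`, and the ids and scores are made up.

```python
# Hand-written subset of the COCO category ids (stand-in for the annotation file)
CATEGORIES = {1: "person", 2: "bicycle", 3: "car", 10: "traffic light"}


def format_labels(label_ids, scores):
    """Build 'name-score' strings such as 'person-0.98'."""
    return [
        "{}-{:.2f}".format(CATEGORIES[label_id], score)
        for label_id, score in zip(label_ids, scores)
    ]


print(format_labels([1, 3, 10], [0.98, 0.91, 0.853]))
# ['person-0.98', 'car-0.91', 'traffic light-0.85']
```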

Finally, we use the `draw_bounding_boxes` function from `torchvision.utils` to draw the bounding boxes on the image, and display the result.

In [None]:
street_output = draw_bounding_boxes(image=image_for_output,
                             boxes=street_preds[0]["boxes"],
                             labels=street_annot_labels,
                             colors=["red" if label["name"]=="person" else "green" for label in street_labels],
                             width=2
                            )

to_pil_image(street_output)
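
The boxes are expressed in the pixel coordinates of the resized input, which is why they are drawn on `image_for_output` rather than on the original image. To draw them on the full-resolution image instead, each coordinate could be scaled by the ratio between the original and resized dimensions. A minimal sketch, assuming plain lists and a hypothetical `rescale_box` helper:

```python
def rescale_box(box, resized_hw, original_hw):
    """Map an [x1, y1, x2, y2] box from the resized image to the original image."""
    scale_x = original_hw[1] / resized_hw[1]  # width ratio
    scale_y = original_hw[0] / resized_hw[0]  # height ratio
    x1, y1, x2, y2 = box
    return [x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y]


# Made-up example: a 400x800 resized image that was originally 800x1600
print(rescale_box([100, 50, 200, 150], (400, 800), (800, 1600)))
# [200.0, 100.0, 400.0, 300.0]
```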