## Object Detection using PyTorch

This workshop shows you how to use PyTorch to detect objects in an image.

Upon completion you will have a basic understanding of:

1. Image loading and manipulation in Python and PyTorch
2. Loading pretrained models with Torchvision 
3. Batch processing in deep learning models
4. Inference and post-processing with object detection models 

**Note:** This file is intended to be run on [Google Colab](https://colab.research.google.com). If you're viewing this file on GitHub, [click here](https://githubtocolab.com/fbsamples/mit-dl-workshop/blob/main/object-detection/exercise.ipynb) to load it into Google Colab.

### 1. Import necessary libraries

To complete this workshop, import the following libraries.

In [1]:
import torch
import torchvision

from PIL import Image
from pprint import pprint
from collections import Counter
import requests
import ast

With the required libraries loaded, you will need to create a device to run the model on. A GPU is usually much faster than the CPU for deep learning workloads, but it may not always be available. The following code uses the GPU if one is available and falls back to the CPU otherwise.

In [2]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
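
As a minimal sketch of how a device object like this is typically used: tensors (and models) are moved onto it with `.to(DEVICE)`. The tensor `x` below is a made-up example and is not part of the workshop.

```python
import torch

# Same device selection as in the cell above
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# `x` is a hypothetical tensor used only for illustration
x = torch.zeros(2, 3)
x = x.to(DEVICE)  # .to() returns a copy on the target device (or the same tensor if it is already there)

print(x.device.type)  # 'cuda' on a GPU runtime, 'cpu' otherwise
```

Tensors and the model generally need to live on the same device before they can be used together.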

### 2. Load a pretrained torchvision model

Now that you've created the device, it's time to load a pretrained object detection model. For this workshop, you will use `fasterrcnn_resnet50_fpn`. It's built into `torchvision`, so it became available to you when you imported `torchvision` above.

So how do you load it? Like this: `torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)`

For this section you will update the `load_model` function by replacing `# write code here` with the code to load Faster R-CNN.

In [17]:
# EXERCISE: Write a function to load a pretrained object detection model from torchvision in eval mode

def load_model():
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # Set it to `eval` mode because we aren't training the model
    model.eval()
    return model

You can test `load_model` using the code below:

In [4]:
model = load_model()

# print(model)



🏆🏆 Do you see "Downloading" and "100% 160M/160M [00:09<00:00, 20.3MB/s]" in the output above? That means you have downloaded the model's pretrained weights. Congrats, you have successfully completed this exercise! 🏆🏆

### 3. Get images to analyze

Now that you've loaded the pretrained model, it's time to source some images to detect objects in. We've prepared two for this workshop.

To download them you will use the `curl` command, a tool for transferring data to or from a server.

In [5]:
!curl "https://www.sfmta.com/sites/default/files/imce-images/2021/pedestrian_scramble.jpg" -o pedestrian_scramble.jpg
!curl "https://static.wixstatic.com/media/0b1913_a8d6b79a2f624015b42ecf5b8efa54fc~mv2.jpg" -o cats.jpg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  188k  100  188k    0     0   132k      0  0:00:01  0:00:01 --:--:--  132k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.0M  100 17.0M    0     0  5079k      0  0:00:03  0:00:03 --:--:-- 5085k


Now that you've downloaded the images, try previewing one of them using the code below:

In [6]:
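# Import under an alias so it doesn't shadow PIL's `Image`, which is needed later for `Image.open`
from IPython.display import Image as IPyImage
IPyImage('pedestrian_scramble.jpg', width=240)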

### 4. Preprocess the images

Now that you have two images, you will need to preprocess them before the model can detect objects within them. To do that, you will need to convert each image into a tensor.
- PIL (Python Imaging Library) contains helper functions to read, manipulate and write images from disk.
- TorchVision includes a `transforms` module that you can use to convert PIL objects into tensors.

For this exercise you will need to write the `load_as_tensor` function that:

1. Loads an image from a file path as a PIL object.
2. Transforms it into a tensor.

To load an image you can use `Image.open`. Once it is opened, you can use `torchvision.transforms.ToTensor` to convert it to a tensor.

In [7]:
# EXERCISE: Write a function that accepts the image file path and returns a tensor

def load_as_tensor(img_path):
    image = Image.open(img_path) # Load as PIL image
    image = torchvision.transforms.ToTensor()(image) # Convert PIL image to tensor
    return image

Once you've completed the function, try loading the images one by one:

In [8]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

torch.Size([3, 806, 1200])


In [9]:
img2 = load_as_tensor("cats.jpg")
print(img2.size())

torch.Size([3, 5074, 5074])


🏆🏆 Do you see torch.Size([3, 806, 1200]) and torch.Size([3, 5074, 5074])? If so, congrats! 🏆🏆

Now it's time to start detecting objects.

### 5. Batchify

Our example only includes 2 images, but in the real world it's not uncommon to process thousands or even millions of images. Processing them one at a time is slow and leaves most of a modern GPU's capacity unused.

Is there a way to speed this up? Yes! The answer is to batchify it! 

The operations on each image are identical and independent of each other, so they can be performed in parallel. This is why inputs to deep learning models are batches of images (or text or audio or whatever your model consumes).
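
The idea can be sketched with a toy operation. Below, the mean pixel value is computed once per image in a Python loop and once for the whole batch in a single call; both give the same result. The random tensors stand in for real images and are purely illustrative.

```python
import torch

# Four hypothetical "images" of identical size
images = [torch.rand(3, 8, 8) for _ in range(4)]

# One at a time: a separate call for every image
one_by_one = torch.stack([img.mean() for img in images])

# Batched: stack into shape (4, 3, 8, 8) and reduce over channels, height and width at once
batched = torch.stack(images).mean(dim=(1, 2, 3))

print(torch.allclose(one_by_one, batched))  # True
```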

To batchify your images, create a list of images and stack it into a single tensor.

In [10]:
# Create list of all images of a batch
batch = [img1, img2]

# Convert list to tensor
input_batch = torch.stack(batch)

RuntimeError: stack expects each tensor to be equal size, but got [3, 806, 1200] at entry 0 and [3, 5074, 5074] at entry 1

Oh no! You just got an error! Don't fret, let's figure out what went wrong...

The error message says we couldn't create a batch because the image sizes are different.

When sizes are different, the operations are no longer identical (large images will need more operations). For parallel processing, the batch must contain images of the same size.

Also, in the real world it's unlikely to always get images of the same size. Our preprocessing function should also resize images to the same size. We can use `torchvision.transforms.Resize` in our preprocessing function. Let's try that!
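
The behavior of `torch.stack` can be reproduced with small placeholder tensors. This sketch uses made-up shapes rather than the workshop images.

```python
import torch

a = torch.zeros(3, 4, 4)
b = torch.zeros(3, 4, 4)
c = torch.zeros(3, 8, 8)  # different height and width

# Tensors of equal size stack into a new leading "batch" dimension
same = torch.stack([a, b])
print(same.shape)  # torch.Size([2, 3, 4, 4])

# Tensors of different sizes cannot be stacked
try:
    torch.stack([a, c])
    stacked_ok = True
except RuntimeError as err:
    stacked_ok = False
    print("stack failed:", err)
```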

### 6. Update the preprocessing function

Rewrite the preprocessing function from above so that, after the image is loaded as a tensor, it is resized to 224 pixels in height and width.

Use [torchvision.transforms.Resize](https://pytorch.org/vision/main/generated/torchvision.transforms.Resize.html)

In [11]:
# EXERCISE: Update `load_as_tensor` to resize the image tensor to 224x224

def load_as_tensor(img_path):
    image = Image.open(img_path) # Load as PIL image
    image = torchvision.transforms.ToTensor()(image) # Convert PIL image to tensor
    image = torchvision.transforms.Resize(size=(224,224))(image)
    return image

Once you've resized the tensor, test it to make sure the image tensor sizes are the same.

In [12]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

img2 = load_as_tensor("cats.jpg")
print(img2.size())

torch.Size([3, 224, 224])
torch.Size([3, 224, 224])


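**Question:** Why did we choose 224 x 224, giving each image a size of (3, 224, 224)?

**Answer:** 224 x 224 is the conventional input size for many ImageNet-pretrained models, and it is small enough to keep inference fast. It is a convenient choice rather than a strict requirement: torchvision's detection models rescale their inputs internally.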

### 7. Batchify... again

Now that you've updated `load_as_tensor` to resize images to 224 x 224, try batching them again. That pesky error message should go away.

In [13]:
batch = [img1, img2]
input_batch = torch.stack(batch)

In [14]:
# EXERCISE: What is the size of the `input_batch` tensor?

print(input_batch.size())

torch.Size([2, 3, 224, 224])


🏆🏆 Did you get *torch.Size([2, 3, 224, 224])*? If so, congrats! 🏆🏆

The input batch tensor resembles the classic (N, C, H, W) format you will encounter often in your computer vision journey.

N: Number of images 

C: Channels (like RGB, or CMYK)

H: Height  

W: Width  
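
As a short sketch of how this layout is used in practice, the dimensions can be unpacked directly from the shape, and indexing along the first axis selects a single image. The random tensor below is a stand-in for `input_batch`.

```python
import torch

# Stand-in for the real batch: 2 images, 3 channels, 224 x 224 pixels
example_batch = torch.rand(2, 3, 224, 224)

n, c, h, w = example_batch.shape
print(n, c, h, w)  # 2 3 224 224

first_image = example_batch[0]     # shape (3, 224, 224)
first_channel = example_batch[0, 0]  # first channel of the first image, shape (224, 224)
```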

Now it's time to ...

### 8. Run inference on the image
Pass the input batch through the model.

In [15]:
predictions = model(input_batch)
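
A common inference pattern, sketched here as an optional refinement, is to move both the model and the batch to `DEVICE` and to wrap the forward pass in `torch.no_grad()` so PyTorch does not track gradients. To keep the sketch fast and self-contained it uses a small `torch.nn.Conv2d` layer as a stand-in for the detection model; the names `stand_in_model` and `example_batch` are hypothetical.

```python
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in for the real model: a single convolution layer in eval mode
stand_in_model = torch.nn.Conv2d(3, 8, kernel_size=3).to(DEVICE).eval()

# The batch must be on the same device as the model
example_batch = torch.rand(2, 3, 224, 224).to(DEVICE)

# Gradients are not needed for inference, so skip tracking them
with torch.no_grad():
    out = stand_in_model(example_batch)

print(out.requires_grad)  # False
```

The same pattern should carry over to the detection model: `model.to(DEVICE)` once, then `model(input_batch.to(DEVICE))` inside the `torch.no_grad()` block.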

In [18]:
# EXERCISE: How many elements does `predictions` contain? 

print(len(predictions))  # must match input batch size

2


How does the number of elements in `predictions` relate to the number of images in the input batch?

In [19]:
# EXERCISE: Explore what each prediction contains. What do you think all these numbers mean?

p0 = predictions[0]

# print(p0)
print(p0.keys())

dict_keys(['boxes', 'labels', 'scores'])


The model returns 3 things:
- boxes: coordinates of the bounding boxes around detected objects
- labels: what it thinks the detected object is 
- scores: confidence in the predicted label (ranging from 0 - 1, higher is more confident)
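
To make the structure concrete, here is a sketch with a hand-written prediction dictionary. All values are invented for illustration. The three tensors line up by index, so entry `i` of `boxes`, `labels` and `scores` describes the same detection, and each box is given as `[x1, y1, x2, y2]` corner coordinates.

```python
import torch

# Hypothetical prediction with three detections (values are made up)
example_prediction = {
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0],
                           [50.0, 60.0, 90.0, 100.0],
                           [0.0, 0.0, 30.0, 30.0]]),
    "labels": torch.tensor([1, 3, 10]),
    "scores": torch.tensor([0.97, 0.85, 0.06]),
}

# A boolean mask selects the confident detections across all three tensors
keep = example_prediction["scores"] > 0.8
kept_boxes = example_prediction["boxes"][keep]
kept_labels = example_prediction["labels"][keep].tolist()

print(kept_labels)       # [1, 3]
print(kept_boxes.shape)  # torch.Size([2, 4])
```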

In [20]:
# EXERCISE: See what objects have been detected in the first image

p0['labels']

tensor([ 3,  3, 10,  8,  3,  1,  1,  1,  1,  3,  1,  1,  1,  3, 10, 10,  6,  3,
         1,  6,  3,  1,  3,  1,  8,  1,  1,  1,  3,  1,  1, 10,  6,  3,  8, 10,
         3,  3,  1,  1,  3, 10,  3,  8,  3,  8,  3,  1,  3,  1, 10,  3,  6,  1,
         8,  1,  1,  1,  3,  6,  6,  1,  8,  3,  3,  1,  3, 10,  1,  1,  1,  1,
        31,  6, 10,  1,  3,  6,  1, 33,  3,  1, 10, 41, 27,  1, 14, 31,  8,  8,
         3,  8,  6, 31,  6,  2,  1,  3,  1,  6])

In [21]:
# EXERCISE: What are the scores of the most-confident and least-confident predictions?

print("Score of most-confident prediction: ", p0['scores'].max().item())
print("Score of least-confident prediction: ", p0['scores'].min().item())

Score of most-confident prediction:  0.973721981048584
Score of least-confident prediction:  0.05608990788459778


### 9. Post-process output

The model has given us integers for labels. These integers are indices that map to object names in the COCO dataset.

Here's a function to load the lookup map:

In [22]:
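def get_mapping_dict():
    idx_to_labels_url = "https://gist.githubusercontent.com/suraj813/1fe4c9dd0bc7e1dd1ce79462712ac9ce/raw/0e2c65813946769a375d673a34a1c0236b0505f1/coco_idx_to_labels.txt"
    response = requests.get(idx_to_labels_url)
    response.raise_for_status()  # Fail early if the download did not succeed
    # Parse the downloaded text as a dict literal and convert the keys to integers
    mapping = {int(k): v for k, v in ast.literal_eval(response.text).items()}
    return mapping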

label_lookup = get_mapping_dict()
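
The parsing step can be tried without a network connection. The sketch below applies the same `ast.literal_eval` and `int(k)` conversion to a small inline string; the string is a hypothetical sample and is not the actual content of the downloaded file.

```python
import ast

# Hypothetical sample text in the form of a Python dict literal
raw_text = "{'1': 'person', '2': 'bicycle', '3': 'car'}"

# literal_eval only evaluates literals (dicts, strings, numbers, ...), never arbitrary code
parsed = ast.literal_eval(raw_text)

# Convert the keys to integers so they match the integer labels from the model
sample_lookup = {int(k): v for k, v in parsed.items()}

print(sample_lookup[1])  # person
```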

Try it out! `1` seems to be a common label in the first image. What does it correspond to?

In [23]:
# EXERCISE: What is the object the model predicts as `1`?

print(label_lookup[1])

person


### 10. Build a report

Now that you know how to translate the model's output labels to actual object names, try to build a report for each image.

The report should contain all the objects in the image, but the model isn't confident about every prediction it has made, so you should ignore predictions below a certain confidence threshold.

There might be multiple occurrences of an object in the image. Instead of listing every occurrence, the report can just contain an aggregate count for each object.
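
The two ideas, thresholding and counting, can be sketched on plain Python data before touching the model output. The detections below are invented for illustration.

```python
from collections import Counter

# Hypothetical (class name, confidence) pairs
detections = [("car", 0.97), ("person", 0.94), ("car", 0.42),
              ("person", 0.91), ("truck", 0.88)]
threshold = 0.85

# Keep only the detections the model is confident about
confident = [(name, score) for name, score in detections if score > threshold]

# Count how often each class name appears among the confident detections
counts = Counter(name for name, _ in confident)

print(counts["person"], counts["car"], counts["truck"])  # 2 1 1
```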

In [24]:


def create_detection_report(model_output, confidence_threshold=0.8):
    # Look up the labels and scores by key and convert them to lists for easier processing
    labels = model_output['labels'].tolist()
    scores = model_output['scores'].tolist()

    # Get a lookup dictionary for the class labels
    label_lookup = get_mapping_dict()

    # Loop through each label and its corresponding confidence score
    detected_objects = []
    for label, score in zip(labels, scores):
        # Check if the confidence score is above the threshold
        if score > confidence_threshold:
            # Use the label lookup to get the class name and add it to the list of detected objects
            classname = label_lookup[label]
            detected_objects.append((classname, score))

    # Use a Counter object to count the number of times each class appears in the detected_objects list
    counts = Counter(name for name, _ in detected_objects)

    # Return a tuple containing the list of detected objects and the class counts
    return detected_objects, counts


In [29]:
for c, pred in enumerate(predictions):
    detected_objects, counts = create_detection_report(pred, confidence_threshold=0.85)   

    print(f"Objects detected in image {c+1}:\n", "="*20)
    pprint(detected_objects)
    print()

    print("Count of objects:\n", "="*20)
    pprint(counts)
    
    print("\n\n")


Objects detected in image 1:
[('car', 0.973721981048584),
 ('car', 0.9571205973625183),
 ('traffic light', 0.9557090401649475),
 ('truck', 0.9524416327476501),
 ('car', 0.9516623020172119),
 ('person', 0.9422258734703064),
 ('person', 0.9256057739257812),
 ('person', 0.9153063893318176),
 ('person', 0.8720185160636902),
 ('car', 0.8719595074653625),
 ('person', 0.863501250743866)]

Count of objects:
Counter({'person': 5, 'car': 4, 'traffic light': 1, 'truck': 1})



Objects detected in image 2:
[('cat', 0.9777542352676392), ('cat', 0.967362105846405)]

Count of objects:
Counter({'cat': 2})





### Take-home assignment

Improve this report by drawing boxes on the input image and labelling each box with the detected object and confidence score.

HINT: https://pytorch.org/vision/main/generated/torchvision.utils.draw_bounding_boxes.html