### Importing the libraries

- **`torch`** is a strong choice for building neural networks for computer vision, because PyTorch uses dynamic graphs, which let us compute the gradients of composed functions very efficiently during backpropagation.
- **`autograd`** is the module responsible for automatic differentiation, that is, for computing the gradients that gradient descent relies on. We import the `Variable` class, which converts a tensor into a torch variable that contains both the tensor and a gradient. That torch variable then becomes one element of the graph.
- **`data`** is a folder that contains `BaseTransform` and `VOC_CLASSES`:
    - `BaseTransform` is a class that applies the transformations required to make the input images compatible with the neural network. (Images fed to the network must have a specific format, and `BaseTransform` converts them into that format.)
    - `VOC_CLASSES` holds the encoding of the classes (for example, planes are encoded as 1). We need this mapping because the network works with numbers, not text.
- **`ssd`** is the library of the Single Shot MultiBox Detector (SSD) model. `build_ssd`, which we import from it, is the constructor that builds the architecture of the SSD neural network.
- **`imageio`** is the library we will use to process the frames of the video and to apply the `detect` function, implemented below, to each of them.
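
To make the class-to-number mapping concrete, here is a minimal sketch. `EXAMPLE_CLASSES`, `label_for` and `index_of` are hypothetical names used only for illustration, and the three class names are an illustrative subset, not the real contents of `VOC_CLASSES`. It assumes, as the `labelmap[i - 1]` lookup in `detect` suggests, that class index 0 is reserved for the background and that labels are looked up by integer position.

```python
# Hypothetical, illustrative subset of class names (not the real VOC_CLASSES).
EXAMPLE_CLASSES = ("aeroplane", "bicycle", "bird")


def label_for(class_index, classes=EXAMPLE_CLASSES):
    """Return the text label for a network class index.

    Assumes index 0 is the background class, so class index 1 maps to
    classes[0], class index 2 to classes[1], and so on.
    """
    if class_index < 1 or class_index > len(classes):
        raise ValueError(f"no label for class index {class_index}")
    return classes[class_index - 1]


# Encoding goes the other way: from a name to the number the network uses.
index_of = {name: i + 1 for i, name in enumerate(EXAMPLE_CLASSES)}

print(index_of["aeroplane"])  # 1
print(label_for(1))           # aeroplane
```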

In [2]:
import torch
from torch.autograd import Variable
import cv2
from data import BaseTransform, VOC_CLASSES as labelmap
from ssd import build_ssd
import imageio
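# Downloads the ffmpeg binary that imageio needs to read videos.
# Newer imageio releases deprecate this helper in favour of installing the
# `imageio-ffmpeg` package, so this call may fail on a recent installation.
imageio.plugins.ffmpeg.download()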

### Doing single-shot multi-box detection through deep learning instead of OpenCV

- The detection function works frame by frame: instead of running on the video directly, it runs on each individual image.
- We use `imageio` to extract the frames from the video and to apply the detection function to them.
- We then reassemble the frames into a video in which rectangles mark the detections.

Now we apply a series of transformations to go from the original image to a torch variable that the SSD neural network accepts:
1. Apply `transform` to make sure that the image has the
    - right format,
    - right dimensions,
    - right colour values.
2. Convert the transformed frame from a **NumPy array** to a **torch tensor** (a tensor generalises a matrix to any number of dimensions).
3. Add a fake dimension to the torch tensor; this dimension corresponds to the batch.
4. Convert the result into a torch variable that contains both the tensor and the gradient. (The torch variable will be an element of the dynamic graph, which later allows fast and efficient computation of the gradients during backpropagation.)

#### First transformation

The `transform()` function returns a tuple, but we are only interested in its first element, which is the transformed frame in the right format. To get it, we index the result with `[0]`:

`frame_t = transform(frame)[0]`

#### Second transformation

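- Convert the NumPy array to a torch tensor: `x = torch.from_numpy(frame_t)`

- The transformed frame is laid out as (height, width, channels), but the network expects (channels, height, width), so we reorder the axes: `x = torch.from_numpy(frame_t).permute(2, 0, 1)`. Note that `permute(2, 0, 1)` moves the colour axis to the front; it does not change the order of the colour channels themselves.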
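
A quick shape check can make the effect of `permute(2, 0, 1)` visible. The sketch below uses NumPy's `transpose`, which follows the same axis-order convention, on a dummy array. The 300 x 300 size is an assumption based on the `ssd300` weights file name, and `frame_t` here is placeholder data, not a real frame.

```python
import numpy as np

# Dummy stand-in for a transformed frame: (height, width, channels).
frame_t = np.zeros((300, 300, 3), dtype=np.float32)
frame_t[..., 2] = 1.0  # mark the last colour channel so we can track it

# New axis 0 <- old axis 2 (channels), new axis 1 <- old axis 0 (height),
# new axis 2 <- old axis 1 (width).
chw = np.transpose(frame_t, (2, 0, 1))

print(frame_t.shape)  # (300, 300, 3)
print(chw.shape)      # (3, 300, 300)
```

The marked channel is still the last one after the transposition (`chw[2]` is all ones), which shows that only the position of the channel axis changes, not the channel order.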

#### Third transformation

The neural network does not accept a single input, such as a single vector or a single image; it only accepts batches of inputs. That is why we create a structure whose first dimension corresponds to the batch and whose remaining dimensions correspond to the input:

`x.unsqueeze(0)`
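
As an illustration of the extra batch dimension, the sketch below uses NumPy's `expand_dims`, which plays the same role for arrays that `unsqueeze` plays for tensors. The array contents and the 3 x 300 x 300 shape are placeholders.

```python
import numpy as np

# Placeholder for one image in (channels, height, width) layout.
image = np.zeros((3, 300, 300), dtype=np.float32)

# Insert a new axis of length 1 at position 0: a batch holding one image.
batch = np.expand_dims(image, axis=0)

print(batch.shape)  # (1, 3, 300, 300)
```

Like `expand_dims`, `unsqueeze` returns a new object and leaves `x` unchanged, so the result has to be assigned to a name to be used.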

#### Fourth transformation

Convert the batch of inputs into a torch variable, which contains both a tensor and a gradient. This torch variable becomes an element of the dynamic graph, which computes the gradients of composed functions very efficiently during backpropagation.

The `Variable` class creates a new object, the torch variable, so we assign it back to `x` and overwrite the previous value:

`x = Variable(x.unsqueeze(0))`

Then it's ready to be fed into the SSD neural network that has been pre-trained.

The positions of the detected objects returned by the network are normalised between 0 and 1. To map them back to pixel coordinates in the frame, we multiply them by a scale tensor with four elements:

`scale = torch.Tensor([width, height, width, height])`

Width and height appear twice because the first pair scales the coordinates of the upper-left corner of the detection rectangle, and the second pair scales the coordinates of the lower-right corner of the same rectangle.
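
As a worked example of this rescaling, assume a hypothetical 640 x 480 frame and a box whose normalised corners are (0.25, 0.5) and (0.75, 1.0). The helper name `to_pixels` is an assumption made for illustration; `detect` performs the same multiplication with a tensor.

```python
def to_pixels(box, width, height):
    """Scale a normalised (x0, y0, x1, y1) box to pixel coordinates."""
    scale = (width, height, width, height)
    return tuple(int(coord * s) for coord, s in zip(box, scale))


# Hypothetical frame size and normalised box, for illustration only.
print(to_pixels((0.25, 0.5, 0.75, 1.0), width=640, height=480))
# (160, 240, 480, 480)
```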

In [None]:
# Defining a function that will do the detections
# frame is a single image of the video (a NumPy array)
# net is the SSD neural network
# transform makes the image compatible with the neural network
def detect(frame, net, transform):
    height, width = frame.shape[:2]
    frame_t = transform(frame)[0]  # transformed frame (first element returned by transform)
    x = torch.from_numpy(frame_t).permute(2, 0, 1)  # NumPy array to torch tensor; (H, W, C) to (C, H, W)
    x = Variable(x.unsqueeze(0))  # add a fake dimension corresponding to the batch, then wrap in a Variable
    y = net(x)  # feed x to the neural network
    detections = y.data
    scale = torch.Tensor([width, height, width, height])  # maps normalised coordinates back to pixels
    # detections = [batch, number of classes, number of occurrences, (score, x0, y0, x1, y1)]
    for i in range(detections.size(1)):
        j = 0  # occurrence of a class
        while detections[0, i, j, 0] >= 0.6:  # keep occurrences with a score of at least 0.6
            pt = (detections[0, i, j, 1:] * scale).numpy()
            cv2.rectangle(frame, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255, 0, 0), 2)
            cv2.putText(frame, labelmap[i - 1], (int(pt[0]), int(pt[1])), cv2.FONT_HERSHEY_SIMPLEX,
                        2, (255, 255, 255), 2, cv2.LINE_AA)
            j += 1
    return frame

# Creating the SSD neural network
net = build_ssd('test')
net.load_state_dict(torch.load('ssd300_mAP_77.43_v2.pth', map_location = lambda storage, loc: storage))  
# torch.load reads the pre-trained weights from the file, and load_state_dict assigns them to our SSD net object
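
The `while` loop in `detect` keeps reading occurrences of a class until the score drops below 0.6, which seems to assume that occurrences are sorted by decreasing score and that a low score is always reached before the array ends. The sketch below is a variant that also bounds `j` by the number of occurrences. It runs on a small hand-made NumPy array with the same `[batch, class, occurrence, (score, x0, y0, x1, y1)]` layout, and the name `confident_boxes` is hypothetical.

```python
import numpy as np


def confident_boxes(detections, threshold=0.6):
    """Return (class_index, score, box) tuples with score >= threshold.

    Assumes occurrences of each class are sorted by decreasing score.
    """
    results = []
    num_classes, num_occurrences = detections.shape[1], detections.shape[2]
    for i in range(num_classes):
        j = 0
        # Stop at the first low score or at the end of the array.
        while j < num_occurrences and detections[0, i, j, 0] >= threshold:
            score = float(detections[0, i, j, 0])
            box = detections[0, i, j, 1:].tolist()
            results.append((i, score, box))
            j += 1
    return results


# Hand-made example: 1 image, 2 classes, 2 occurrences per class.
dets = np.zeros((1, 2, 2, 5), dtype=np.float32)
dets[0, 1, 0] = [0.9, 0.1, 0.2, 0.3, 0.4]  # one confident detection
dets[0, 1, 1] = [0.7, 0.5, 0.5, 0.6, 0.6]  # a second one fills the last slot
found = confident_boxes(dets)
print(len(found))  # 2
```

Without the `j < num_occurrences` check, the second class in this example would make the loop index past the end of the array.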