# Tello for AI-Hackers

## Overview
Now we are going to move Tello with our hands!

We'll use a [Neural Network (NN)](https://en.wikipedia.org/wiki/Neural_network) to predict what we want the drone to do; whether it should move forward, backward or stop.

### Artificial Neural Networks
An Artificial Neural Network (ANN) models the connections between biological neurons as weights between nodes. An ANN is usually composed of multiple layers of nodes, because this lets us insert multiple non-linearities. This allows a hierarchical decomposition of the input and may reduce the number of parameters needed to learn a given task.

<img src="resources/nn.png" width="600"/>

Each node computes a linear combination of its inputs and the output then passes through an activation function, used to insert non-linearities.

<img src="resources/nn_function.png" width="600"/>
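As a minimal sketch of what a single node does, the snippet below computes a weighted sum of the inputs and passes it through an activation function. The ReLU activation and the bias term are assumptions chosen for illustration, and the name `neuron` is hypothetical.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a ReLU activation."""
    # linear combination of the inputs (plus a bias term, assumed here)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ReLU: the non-linearity, which sets negative values to zero
    return max(0.0, z)


# the same inputs give different outputs depending on the weights
print(neuron([1.0, 2.0], [0.5, 1.0], -0.5))   # positive weighted sum passes through
print(neuron([1.0, 2.0], [0.5, -1.0], 0.5))   # negative weighted sum is clipped to 0
```

Stacking many such nodes in layers gives the network shown in the first figure.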

#### Our model: [ResNet-34](https://pytorch.org/hub/pytorch_vision_resnet/)
Over the years, different types of ANN have been created, and most of them are available in deep learning libraries such as [PyTorch](https://pytorch.org/) or [Keras](https://keras.io/).

We have decided to use a pre-trained version of ResNet-34 available in PyTorch. This model is taken from [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) and was trained on [ImageNet](https://www.image-net.org/), a dataset used for an annual competition between 2010 and 2017.

Given that the model is pre-trained on a huge number of images, we can assume that it has already learned many useful notions of how objects in images are represented, and we may hope that these features will in some way be similar to those of our few images.

Otherwise, it would be extremely difficult to train such a big network with a few pictures.

### Dataset introduction
Our task falls into the category of [Supervised Learning](https://en.wikipedia.org/wiki/Supervised_learning), because we want the NN to learn a function which maps 3 hand gestures (classes) to 3 commands using a set of example (image)-label (class) pairs, as shown in [Visualize some of the images in training](#visualize).

- **Fist** -> move forward
- **One Open Hand** -> move backward
- **Two Open Hands** -> stop

Actually, we're also going to create a 4th class, named "other", which covers all the other cases and is associated with the "stop" command.
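The mapping from gestures to commands can be sketched as a simple lookup table. The gesture keys and the helper name `command_for` are hypothetical and used only for illustration; anything unrecognised falls back to "stop", like the "other" class.

```python
# hypothetical gesture names mapped to the commands described above
GESTURE_TO_COMMAND = {
    "fist": "forward",
    "one_open_hand": "backward",
    "two_open_hands": "stop",
    "other": "stop",
}


def command_for(gesture):
    """Return the drone command for a gesture, defaulting to the safe 'stop'."""
    return GESTURE_TO_COMMAND.get(gesture, "stop")


print(command_for("fist"))
print(command_for("waving"))  # unknown gesture, treated like "other"
```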

This set of example-label pairs will be our [dataset](#dataset). We'll make the model predict a label for each example (image), and then we're going to modify its parameters, i.e. the weights between nodes, based on a [loss function](https://en.wikipedia.org/wiki/Loss_function).

### Loss Function & Metric
We want to maximise the accuracy, our metric. To do so, we have chosen to minimise the [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

We'll use this loss function instead of the accuracy to update the parameters of the model because, in contrast with our metric, it is continuous: a small change in the parameters produces a small change in the loss, which gives us a usable gradient.
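To see why a continuous loss is more useful than the accuracy, here is a minimal pure-Python sketch of the cross entropy for a single example. The helper names (`softmax`, `cross_entropy`, `is_correct`) and the example scores are made up for illustration; in the notebook the loss is computed by `nn.CrossEntropyLoss`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


def is_correct(logits, label):
    """Accuracy for one example: is the highest score on the true class?"""
    return max(range(len(logits)), key=logits.__getitem__) == label


weak = [1.0, 0.9, 0.0, 0.0]    # barely prefers class 0
strong = [5.0, 0.9, 0.0, 0.0]  # confidently prefers class 0

# both predictions count as correct, so the accuracy cannot tell them apart
print(is_correct(weak, 0), is_correct(strong, 0))
# the loss is lower for the more confident correct prediction
print(cross_entropy(weak, 0), cross_entropy(strong, 0))
```

The accuracy only changes when a prediction flips, while the loss keeps decreasing as the model grows more confident in the right class.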

### Training
To [train](#train) the model, one has to write a training loop with the following steps:
1) sample a minibatch, i.e. a set of example-label pairs, from the dataset

2) perform inference with the model

3) compute the loss between the predicted and true labels

4) compute the gradient of the loss with respect to the model's parameters

5) use the gradient to update the model's parameters

6) go back to 1) until the end of the dataset

7) with gradient computation disabled, perform inference on the whole validation dataset

8) if the loss on the validation dataset plateaus or starts to diverge, end training

TIPS:
- The minibatch size is usually chosen to be a power of 2, for performance reasons on a GPU.
- Evaluating the model on a validation dataset is needed to detect overfitting, i.e. learning regularities that are present only in the training set.
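Step 8 can be sketched as a small early-stopping check on the history of validation losses. This is only one possible criterion: the function name `should_stop` and the `patience` and `min_delta` parameters are assumptions for illustration and are not part of the training loop used in this notebook.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` epochs did not improve the validation loss.

    val_losses: one validation loss per epoch, oldest first.
    min_delta: minimum decrease that counts as an improvement.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # stop when the recent epochs failed to beat the earlier best by min_delta
    return recent_best >= best_before - min_delta


print(should_stop([1.0, 0.8, 0.6, 0.5]))       # still improving
print(should_stop([1.0, 0.5, 0.6, 0.7, 0.8]))  # diverging after epoch 2
```

A larger `patience` tolerates noisier validation curves at the cost of a few extra epochs.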


### Have Fun!
It's time to use the trained model to control our Tello.

Complete the code below and ask us to test it ;)

## Packages imports

In [None]:
import os
import sys
import traceback
import time
import copy
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torch.backends.cudnn as cudnn
import torchvision
from torchvision import datasets, models, transforms
import cv2

# custom libraries
from pkgs.telloCV import TelloCV

from sys import platform
if platform == "win32":
    os.environ['KMP_DUPLICATE_LIB_OK']='True'

<a id="dataset"></a>

## Load Data

In [None]:
DATA_PATH = '../data'
MODEL_PATH = '../models/best_model.th'
CLASS_NAMES = ["forward", "backward", "stop", "other"]

In [None]:
BATCH_SIZE = 8

# Data augmentation for training
# Just resize and gray scale for validation
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.CenterCrop((224, 224)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=.2, hue=.1),
        transforms.RandomAffine(degrees=(0, 10), translate=(0.2, 0.2), scale=(0.75, 1.0)),
        transforms.GaussianBlur((3, 3), sigma=(1.5, 2.5)),
        transforms.Grayscale(1),
        transforms.ToTensor(),
    ]),
    'val': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.CenterCrop((224, 224)),
        transforms.Grayscale(1),
        transforms.ToTensor(),
    ]),
}


image_datasets = {x: datasets.ImageFolder(os.path.join(DATA_PATH, x), 
                                          data_transforms[x]) for x in ['train', 'val']}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], 
                                              batch_size=BATCH_SIZE, 
                                              shuffle=True, 
                                              num_workers=2) for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

In [None]:
# check if it is possible to use the GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

<a id="visualize"></a>

## Visualize some of the images in training

In [None]:
def imshow(inp, title=None):
    """Imshow for Tensor."""
    inp = inp.numpy().transpose((1, 2, 0))
    inp = np.clip(inp, 0, 1).squeeze()
    f = plt.figure()
    f.set_figwidth(8)
    f.set_figheight(6)
    plt.imshow(inp, cmap='gray')
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause a bit so that plots are updated


# Get a batch of training data
inputs, classes = next(iter(dataloaders['train']))
inputs = inputs[:5]
classes = classes[:5]

# Make a grid from batch
out = torchvision.utils.make_grid(inputs)

imshow(out, title=[class_names[x] for x in classes])

<a id="model"></a>

## Baseline Model
The cell below builds the model: a pre-trained ResNet-34 whose parameters are frozen. Its first [convolutional layer](https://en.wikipedia.org/wiki/Convolutional_neural_network) is replaced so that it accepts single-channel (grayscale) images, and its last fully connected layer is replaced so that it outputs one score per class.

In [None]:
model = models.resnet34(pretrained=True)
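# freeze all the parameters of the pre-trained model
for param in model.parameters():
    param.requires_grad = False

num_ftrs = model.fc.in_features

# the two layers created below are new, so they are trainable by default
# first layer: accept 1-channel (grayscale) images instead of RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# last layer: one output per class
model.fc = nn.Linear(num_ftrs, len(CLASS_NAMES))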

model = model.to(device)

In [None]:
# selection of the Loss function
criterion = nn.CrossEntropyLoss()

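# All parameters are passed to the optimizer, but only the new conv1 and fc
# layers are updated: frozen parameters receive no gradient and are skipped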
optimizer = optim.Adam(model.parameters(), lr=0.0001, amsgrad=True)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

<a id="train"></a>

### Training Loop

In [None]:
def train_model(model, dataloaders,  criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [None]:
model = train_model(model, dataloaders, criterion, optimizer, exp_lr_scheduler, num_epochs=1)

### Run offline inference

In [None]:
def visualize_model(model, dataloaders, num_images=6):
    was_training = model.training
    model.eval()
    images_so_far = 0
    fig = plt.figure()

    with torch.no_grad():
        for i, (inputs, labels) in enumerate(dataloaders['val']):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs).cpu()
            outputs = nn.functional.softmax(outputs, dim=-1)
            _, preds = torch.max(outputs, 1)

            for j in range(inputs.size()[0]):
                images_so_far += 1
                ax = plt.subplot(num_images//2, 2, images_so_far)
                ax.axis('off')
                ax.set_title('predicted: {}'.format(class_names[preds[j]]))
                imshow(inputs.cpu().data[j])

                if images_so_far == num_images:
                    model.train(mode=was_training)
                    return
        model.train(mode=was_training)

In [None]:
visualize_model(model, dataloaders)

### Test live inference
Let's test the model on the computer's camera.

In [None]:
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.eval();

In [None]:
def preprocess(img):
    x = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    x = cv2.resize(x, (224, 224))
    x = cv2.medianBlur(x, 9)  # Reduce impulse noise
    x = cv2.GaussianBlur(x, (3, 3), 3.0)  # Reduce linear noise
    x = x/255.0
    x = x[None, None, ...]  # Adding batch and channel dimensions
    x = torch.from_numpy(x).float()
    x = x.to(device)
    return x

In [None]:
cam = cv2.VideoCapture(0)
cv2.namedWindow("test_model", cv2.WINDOW_NORMAL)
cv2.resizeWindow('test_model', 800, 600)
cont = 0
scores = np.zeros(len(CLASS_NAMES))
probs = np.zeros(len(CLASS_NAMES))
stats = []

while True:
    ret, frame = cam.read()

    cont += 1
    if not ret:
        print("failed to grab frame")
        break
    
    # cam.read() already returns BGR frames, which is what preprocess() expects
    process_frame = preprocess(frame)

    with torch.no_grad():
        output = model(process_frame).cpu()
        output = nn.functional.softmax(output, dim=-1)
        index_pred = np.argmax(output)
        scores[index_pred] += 1
        probs += output.detach().numpy()[0]
            
    if cont >= 5:  # Voting across 5 frames
        index_pred = np.argmax(scores)
        pred_str = CLASS_NAMES[index_pred]
        stats = ["", "Prediction: " + str(pred_str)]
        stats.append("Output: " + str(probs/5))
        cont = 0
        frame_window = []
        scores = np.zeros(len(CLASS_NAMES))
        probs = np.zeros(len(CLASS_NAMES))
        
    for idx, text in enumerate(stats):
        cv2.putText(frame, text, (0, 30 + (idx * 30)), 
                    cv2.FONT_HERSHEY_SIMPLEX, 
                    0.5, (0, 0, 255), lineType=30)

    cv2.imshow("test_model", frame)

    k = cv2.waitKey(1)
    if k%256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break

cam.release()
cv2.destroyAllWindows()
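The five-frame voting used in the loop above can be isolated in a small helper, sketched here in plain Python. The function name `vote` is hypothetical; ties resolve to the lowest class index, which matches the behaviour of `np.argmax` on the `scores` array.

```python
def vote(pred_indices, num_classes):
    """Return the class index predicted most often in a window of frames."""
    counts = [0] * num_classes
    for idx in pred_indices:
        counts[idx] += 1
    # index of the first maximum, so ties go to the lowest class index
    return counts.index(max(counts))


CLASS_NAMES = ["forward", "backward", "stop", "other"]
window = [0, 2, 2, 1, 2]  # per-frame predictions over 5 frames
print(CLASS_NAMES[vote(window, len(CLASS_NAMES))])
```

Voting over a few frames smooths out single-frame misclassifications before a command is chosen.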

## Inference and Command

Receiving images from Tello, performing inference with the model and sending a command back to Tello.

In [None]:
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.eval();

In [None]:
# initialize drone
tellotrack = TelloCV()
tellotrack.init_drone()

In [None]:
cv2.namedWindow("main_loop", cv2.WINDOW_NORMAL)
cv2.resizeWindow('main_loop', 800, 600)
cont = 0
scores = np.zeros(len(CLASS_NAMES))
probs = np.zeros(len(CLASS_NAMES))
stats = []

tellotrack.drone.takeoff()
try:
    # skip first 300 frames
    frame_skip = 300
    while True:
        for frame in tellotrack.container.decode(video=0):
            if 0 < frame_skip:
                frame_skip = frame_skip - 1
                continue
            start_time = time.time()
            img, frame = tellotrack.process_frame(frame)
            frame2 = np.array(frame.to_image())
            img = preprocess(img)
            cont += 1
            
            with torch.no_grad():
                print(img.shape)
                output = model(img).cpu()
                output = nn.functional.softmax(output, dim=-1)
                index_pred = np.argmax(output)
                scores[index_pred] += 1
                probs += output.detach().numpy()[0]
                
            if cont >= 5:  # Voting across 5 frames
                index_pred = np.argmax(scores)
                pred_str = CLASS_NAMES[index_pred]
                stats = ["", "Prediction: " + str(pred_str)]
                stats.append("Output: " + str(probs/5))
                cont = 0
                frame_window = []
                scores = np.zeros(len(CLASS_NAMES))
                probs = np.zeros(len(CLASS_NAMES))
                
                tellotrack.send_cmd(pred_str)

            for idx, text in enumerate(stats):
                cv2.putText(frame2, text, (0, 30 + (idx * 30)), 
                            cv2.FONT_HERSHEY_SIMPLEX, 
                            0.5, (0, 0, 255), lineType=30)
            cv2.imshow("main_loop", frame2)
            
            if frame.time_base < 1.0/60:
                time_base = 1.0/60
            else:
                time_base = frame.time_base
                
            frame_skip = int((time.time() - start_time)/time_base)
            
            k = cv2.waitKey(1)
            if k%256 == 27:
                # ESC pressed
                print("Escape hit, closing...")
                break
            
        k = cv2.waitKey(1)
        if k%256 == 27:
            # ESC pressed
            print("Escape hit, closing...")
            break
except Exception as ex:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    traceback.print_exception(exc_type, exc_value, exc_traceback)
    print(ex)
finally:
    tellotrack.drone.quit()
    cv2.destroyAllWindows()

In [None]:
tellotrack.drone.land()
tellotrack.drone.quit()

## Object Detection
As you have seen, this approach doesn't perform well at inference time, and it would be very risky to control a drone with this model. Therefore, we have decided to train an [object detection](https://en.wikipedia.org/wiki/Object_detection) model.

In object detection, the task is to both classify and localize an object. There are different ways to do so. Our model, [MobileNet-v2](https://paperswithcode.com/lib/torchvision/mobilenet-v2) taken from the [TensorFlow 2 Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/index.html), predicts a set of 6 numbers for each anchor, where an anchor is a predefined bounding box in the image.

The 6 numbers are composed as follows:
- 4 for bounding box translation and scale
- 1 for confidence score
- 1 for classification
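To illustrate how the 4 translation and scale numbers can turn an anchor into a predicted box, here is a sketch of a common anchor-offset parameterisation. It is an assumption for illustration only: the exact encoding used by the Object Detection API (for example, any scaling of the offsets) may differ, and the name `decode_box` is hypothetical.

```python
import math


def decode_box(anchor, offsets):
    """Apply predicted offsets to an anchor box.

    anchor:  (cx, cy, w, h) centre and size of the predefined box
    offsets: (tx, ty, tw, th) predicted translation and log-scale
    Returns (xmin, ymin, xmax, ymax).
    """
    a_cx, a_cy, a_w, a_h = anchor
    tx, ty, tw, th = offsets
    # translation is expressed relative to the anchor size
    cx = a_cx + tx * a_w
    cy = a_cy + ty * a_h
    # scale is predicted in log space, so exp() keeps the size positive
    w = a_w * math.exp(tw)
    h = a_h * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# zero offsets give back the anchor itself
print(decode_box((0.5, 0.5, 0.2, 0.2), (0.0, 0.0, 0.0, 0.0)))
# tx = 1 shifts the box to the right by one anchor width
print(decode_box((0.5, 0.5, 0.2, 0.2), (1.0, 0.0, 0.0, 0.0)))
```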

### Setup

In [None]:
import pathlib
from sys import platform
# Clone the tensorflow models repository
if "models_tf2_api" in pathlib.Path.cwd().parts:
    while "models_tf2_api" in pathlib.Path.cwd().parts:
        os.chdir('..')
elif not pathlib.Path('models_tf2_api').exists():
    !git clone --depth 1 https://github.com/tensorflow/models "models_tf2_api"

# Install Object Detection API
if platform == "win32":
    if "coco_api" in pathlib.Path.cwd().parts:
        while "coco_api" in pathlib.Path.cwd().parts:
            os.chdir('..')
    elif not pathlib.Path('coco_api').exists():
        !git clone --depth 1 https://github.com/cocodataset/cocoapi "coco_api"
    !(cd coco_api/PythonAPI/ && echo F|xcopy /S /Q /Y /F "../../resources/setup_coco.py" setup.py && python setup.py build_ext --inplace)
    
    !(cd models_tf2_api/research/ && protoc object_detection/protos/*.proto --python_out=. && echo F|xcopy /S /Q /Y /F "../../resources/setup_od.py" setup.py && python -m pip install . --user)
else:
    !(cd models_tf2_api/research/ && protoc object_detection/protos/*.proto --python_out=. && cp object_detection/packages/tf2/setup.py . && python -m pip install .)

In [None]:
import tensorflow as tf  # needed here: the @tf.function decorator runs when this cell is executed


def get_model_detection_function(model):
    """Get a tf.function for detection."""

    @tf.function
    def detect_fn(image):
        """Detect objects in image."""

        image, shapes = model.preprocess(image)
        prediction_dict = model.predict(image, shapes)
        detections = model.postprocess(prediction_dict, shapes)

        return detections, prediction_dict, tf.reshape(shapes, [-1])

    return detect_fn

### Download model & load config

In [None]:
import tensorflow as tf
from object_detection.utils import config_util
from object_detection.builders import model_builder

model_dir = '../models/graph_model/checkpoint/ckpt-0'
pipeline_config = '../models/graph_model/pipeline.config'
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
detection_model = model_builder.build(
      model_config=model_config, is_training=False)

# Restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(
      model=detection_model)
ckpt.restore(os.path.join(model_dir))

# get model function
detect_fn = get_model_detection_function(detection_model)

In [None]:
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils

# map labels for inference decoding
label_map_path = configs['eval_input_config'].label_map_path
label_map = label_map_util.load_labelmap(label_map_path)
categories = label_map_util.convert_label_map_to_categories(
    label_map,
    max_num_classes=label_map_util.get_max_label_map_index(label_map),
    use_display_name=True)
category_index = label_map_util.create_category_index(categories)
label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True)

In [None]:
cam = cv2.VideoCapture(0)
cv2.namedWindow("test_obj_det", cv2.WINDOW_NORMAL)
cv2.resizeWindow('test_obj_det', 800, 600)

while True:
    ret, frame = cam.read()
    if not ret:
        print("failed to grab frame")
        break
    img = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
            
    input_tensor = tf.convert_to_tensor(
        np.expand_dims(img, 0), dtype=tf.float32)
    input_tensor = tf.image.resize(input_tensor, (320, 320))
    detections, _, _ = detect_fn(input_tensor)
    scores = detections["detection_scores"][0].numpy()
    classes = detections["detection_classes"][0].numpy()
    bboxes = detections["detection_boxes"][0].numpy()
    index_scores = np.where(scores>0.40)[0]  # 0.40 threshold

    if len(index_scores) > 0:
        pred_str = "other"
        for id_score in index_scores:
            index_pred = int(classes[id_score])
            if not CLASS_NAMES[index_pred] == "other":  # not face found
                pred_str = CLASS_NAMES[index_pred]
                break                                   # Stopping at first valid command
    else:
        pred_str = "other"
            
    label_id_offset = 1
    image_np_with_detections = frame.copy()
    viz_utils.visualize_boxes_and_labels_on_image_array(
          image_np_with_detections,
          detections['detection_boxes'][0].numpy(),
          (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
          detections['detection_scores'][0].numpy(),
          category_index,
          use_normalized_coordinates=True,
          max_boxes_to_draw=3,
          min_score_thresh=.40,
          agnostic_mode=False,
    )
    cv2.imshow("test_obj_det", image_np_with_detections)

    k = cv2.waitKey(1)
    if k%256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break

cam.release()
cv2.destroyAllWindows()

### Inference

In [None]:
# initialize drone
tellotrack = TelloCV()
tellotrack.init_drone()

In [None]:
cv2.namedWindow("main_loop_obj_det", cv2.WINDOW_NORMAL)
cv2.resizeWindow('main_loop_obj_det', 800, 600)

tellotrack.drone.takeoff()
try:
    # skip first 300 frames
    frame_skip = 300
    while True:
        for frame in tellotrack.container.decode(video=0):
            if 0 < frame_skip:
                frame_skip = frame_skip - 1
                continue
            start_time = time.time()
            img, frame = tellotrack.process_frame(frame)
            # frame = np.array(frame.to_image())
            
            input_tensor = tf.convert_to_tensor(
                np.expand_dims(img, 0), dtype=tf.float32)
            input_tensor = tf.image.resize(input_tensor, (320, 320))
            detections, _, _ = detect_fn(input_tensor)
            scores = detections["detection_scores"][0].numpy()
            classes = detections["detection_classes"][0].numpy()
            bboxes = detections["detection_boxes"][0].numpy()
            index_scores = np.where(scores>0.40)[0]  # 0.40 threshold

            if len(index_scores) > 0:
                pred_str = "other"
                for id_score in index_scores:
                    index_pred = int(classes[id_score])
                    if not CLASS_NAMES[index_pred] == "other":  # not face found
                        pred_str = CLASS_NAMES[index_pred]
                        break                                   # Stopping at first valid command
            else:
                pred_str = "other"
                
            tellotrack.send_cmd(pred_str)  # Sending command to drone
            
            label_id_offset = 1
            image_np_with_detections = img.copy()
            viz_utils.visualize_boxes_and_labels_on_image_array(
                  image_np_with_detections,
                  detections['detection_boxes'][0].numpy(),
                  (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
                  detections['detection_scores'][0].numpy(),
                  category_index,
                  use_normalized_coordinates=True,
                  max_boxes_to_draw=3,
                  min_score_thresh=.40,
                  agnostic_mode=False,
            )
            cv2.imshow("main_loop_obj_det", image_np_with_detections)
            
            if frame.time_base < 1.0/60:
                time_base = 1.0/60
            else:
                time_base = frame.time_base
                
            frame_skip = int((time.time() - start_time)/time_base)
            
            k = cv2.waitKey(1)
            if k%256 == 27:
                # ESC pressed
                print("Escape hit, closing...")
                break
            
        k = cv2.waitKey(1)
        if k%256 == 27:
            # ESC pressed
            print("Escape hit, closing...")
            break
except Exception as ex:
    exc_type, exc_value, exc_traceback = sys.exc_info()
    traceback.print_exception(exc_type, exc_value, exc_traceback)
    print(ex)
finally:
    tellotrack.drone.quit()
    cv2.destroyAllWindows()

In [None]:
tellotrack.drone.land()
tellotrack.drone.quit()

## Future Work
You may try to improve on this by increasing the amount of data, trying different architectures and/or performing more preprocessing, such as removing the background.