# COCOnut - Recognizing Common Objects

## Project Outline

In this project, we’ll be utilizing the AeroCore 2 for Jetson’s CSI camera capabilities in conjunction with [Google’s Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) to recognize common objects around you.

<img src="./objdetect.jpeg">

You will see a live camera feed with superimposed labels for 80 different classes of objects, predicted by a deep convolutional neural network. You’ll gain experience with a large dataset, the [COCO (Common Objects in Context) dataset](http://cocodataset.org/#home), as well as with the [OpenCV API](https://opencv.org/). Let’s get started!

## Installing Dependencies

We first need the computer vision library that we’ll be using for this project: OpenCV (Open Source Computer Vision Library), a widely used, open source computer vision and machine learning library. Installing it can be a bit tedious, but luckily the folks over at JetsonHacks have a nice script that does it all for you. (NOTE: We’ll be using Python 3 for this project.) Running the cell below will install all the dependencies and get everything set up:

In [None]:
!./setup.sh

If you want to install things manually instead (optional), download their repository [here](https://github.com/jetsonhacks/buildOpenCVTX2) or clone it onto the Jetson TX2:

In [None]:
!git clone https://github.com/jetsonhacks/buildOpenCVTX2.git

Or for the TX1:

In [None]:
!git clone https://github.com/jetsonhacks/buildOpenCVTX1.git

cd into the folder and make a folder for the build:

In [None]:
!cd buildOpenCVTX* && mkdir opencv_build

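Now execute the build script. Each “!” command runs in its own shell, so the cd from the previous cell does not carry over and we change into the folder again:

In [None]:
!cd buildOpenCVTX* && sudo ./buildOpenCV.sh -s opencv_build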

The script will take around an hour and a half to build the library. After that is finished, <font color='blue'>Cython</font>, <font color='blue'>pillow</font>, <font color='blue'>lxml</font>, <font color='blue'>jupyter</font>, <font color='blue'>matplotlib</font>, and Google’s <font color='blue'>Protocol Buffers</font> all need to be installed to get up and running.

To install the Protocol Buffers compiler (protoc), download the protoc-x.x.x-linux-aarch_64.zip file from the [releases page](https://github.com/google/protobuf/releases) into this project directory.

Unzip it:

In [None]:
!unzip *aarch_64.zip -d protoc3

Install it by simply copying over the binaries:

In [None]:
!sudo mv protoc3/bin/* /usr/local/bin/
!sudo mv protoc3/include/* /usr/local/include/

Now we need to get the python dependencies:

In [None]:
!sudo pip3 install Cython
!sudo pip3 install pillow
!sudo pip3 install lxml
!sudo pip3 install jupyter

We’ll need some other libraries to install matplotlib: 

In [None]:
!sudo apt-get install libfreetype6-dev pkg-config libpng12-dev 

Now to install matplotlib:

In [None]:
!sudo pip3 install matplotlib

Now to download the [TensorFlow models repository](https://github.com/tensorflow/models). There are a lot of interesting models under the /research/ header, from full resolution image compression with RNNs to text summarization, but we’ll be using the object detection API for this project. Download it with:

In [None]:
!git clone https://github.com/tensorflow/models.git

cd into the /models/research/ directory and compile the object detection protobuf definitions with protoc:

In [None]:
!cd models/research && protoc object_detection/protos/*.proto --python_out=.

Now all the dependencies are installed.
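
Before moving on, it can help to confirm that Python can actually find everything. The following is a minimal sketch, not part of the original setup script: the helper name and the module list are our own assumptions (note that the import name can differ from the pip name, for example pillow is imported as PIL and OpenCV as cv2).

```python
import importlib.util

# Import names assumed for the packages used in this project.
REQUIRED_MODULES = ["cv2", "Cython", "PIL", "lxml", "matplotlib", "tensorflow"]


def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, rerun the matching install step above before continuing.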

## Making the Script and Configuring the Backend

Google provides a useful starting script for single images, but we want it to work on the Jetson with a live camera feed. We’ll keep many of the same imports and setup as the original:

In [None]:
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image

sys.path.append("..")
from object_detection.utils import ops as utils_ops

from utils import label_map_util
from utils import visualization_utils as vis_util

The first thing we need to do is let the script read frames from the CSI camera on the AeroCore. We’ll use OpenCV with a <font color='green'>nvcamerasrc</font> pipeline:

In [None]:
import cv2
pipeline = "nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, format=(string)I420, framerate=(fraction)60/1 ! nvvidconv flip-method=2 ! video/x-raw, format=(string)I420 ! videoconvert ! video/x-raw, format=(string)BGR ! appsink"
cap = cv2.VideoCapture(pipeline)

Here you can specify various options on your input pipeline such as framerate, resolution, orientation, and color. To change the resolution, simply edit the values for <font color='green'>width=</font> and <font color='green'>height=</font>. To change the framerate, change the <font color='green'>framerate=</font> value to whatever framerate you want. The <font color='green'>flip-method=</font> flag can also be changed to different modes to change camera orientation. This pipeline specifies a 1920x1080 resolution at 60FPS rotated 180°. 
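
If you expect to experiment with these options, one way to avoid editing the long string by hand is to build it from parameters. This is only a sketch: the function name build_pipeline and its parameter names are our own, and the defaults reproduce the pipeline string shown above.

```python
def build_pipeline(width=1920, height=1080, fps=60, flip_method=2):
    """Build the nvcamerasrc GStreamer pipeline string for cv2.VideoCapture."""
    return (
        "nvcamerasrc ! "
        "video/x-raw(memory:NVMM), width=(int){w}, height=(int){h}, "
        "format=(string)I420, framerate=(fraction){fps}/1 ! "
        "nvvidconv flip-method={flip} ! "
        "video/x-raw, format=(string)I420 ! "
        "videoconvert ! video/x-raw, format=(string)BGR ! appsink"
    ).format(w=width, h=height, fps=fps, flip=flip_method)


# Example: a lower resolution and framerate, keeping the 180° rotation.
print(build_pipeline(width=1280, height=720, fps=30))
```

The returned string can be passed to cv2.VideoCapture in place of the hard-coded pipeline.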

Now to specify and download the model to use. Pick a model from the list of COCO-trained models in the [TensorFlow detection model zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#coco-trained-models-coco-models). It’s recommended that you pick one of the MobileNet models, as these are more efficient CNNs designed for mobile use and thus will be faster on the Jetson. Click on it to download and you’ll see its file name (Ex. <font color='red'>ssd_mobilenet_v2_coco_2018_03_29.tar.gz</font>). This is what we’ll use in the script:

In [None]:
MODEL_NAME = 'NET_NAME'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

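Replace NET_NAME with the file name you just found, without the .tar.gz extension (the script appends it). When the script runs, it will download this pre-trained model. The dataset contains 80 different classes of objects, but the class IDs in the label map are not contiguous and run up to 90, so we set the maximum to 90 to keep every label:

In [None]:
NUM_CLASSES = 90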

Now to download the model and load it into memory:

In [None]:
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
    file_name = os.path.basename(file.name)
    if 'frozen_inference_graph.pb' in file_name:
        tar_file.extract(file, os.getcwd())

detection_graph = tf.Graph()

with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
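
The cell above downloads the archive every time it runs. A possible refinement, sketched here under our own assumptions (the helper names download_if_missing and extract_frozen_graph are hypothetical, and the standard library urllib.request is used instead of six), is to reuse an archive that is already on disk and to extract only the frozen graph:

```python
import os
import tarfile
import urllib.request

GRAPH_FILE = "frozen_inference_graph.pb"


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless that file already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    if os.path.exists(dest_path):
        return False
    urllib.request.urlretrieve(url, dest_path)
    return True


def extract_frozen_graph(archive_path, dest_dir="."):
    """Extract only the frozen graph from a model archive.

    Returns the path of the extracted file, or None if the archive
    does not contain a frozen graph.
    """
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if os.path.basename(member.name) == GRAPH_FILE:
                tar.extract(member, dest_dir)
                return os.path.join(dest_dir, member.name)
    return None
```

With the variables defined earlier, this could be called as download_if_missing(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE) followed by extract_frozen_graph(MODEL_FILE).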

We’ll create a label map to map class IDs to object names. Think of it like a standard hash table or hash map. This way, when our CNN predicts a “72,” we know this means a “tv”:

In [None]:
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)
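
To make the lookup concrete, here is a small stand-alone sketch of what the resulting index looks like. The three entries are a hand-written subset for illustration only (the real ones are read from mscoco_label_map.pbtxt), and the helper names are our own. Note that the class IDs have gaps and run up to 90 even though there are only 80 classes.

```python
# Hand-written subset of the COCO label map, for illustration only.
SAMPLE_CATEGORIES = [
    {"id": 1, "name": "person"},
    {"id": 72, "name": "tv"},
    {"id": 90, "name": "toothbrush"},
]


def build_category_index(categories):
    """Map each class ID to its category dictionary."""
    return {category["id"]: category for category in categories}


def class_name(category_index, class_id):
    """Look up a display name, falling back to 'N/A' for unknown IDs."""
    return category_index.get(class_id, {"name": "N/A"})["name"]


index = build_category_index(SAMPLE_CATEGORIES)
print(class_name(index, 72))
```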

Just like in the first project, we need to process each frame into an array of values to pass into our model. The helper below does this for PIL images; frames read through OpenCV already arrive as NumPy arrays:

In [None]:
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)
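
One detail worth knowing: OpenCV delivers frames in BGR channel order, while the detection models are, as far as we know, trained on RGB images. The sketch below (prepare_frame is a hypothetical helper, not part of the original script) shows how a frame could be converted and given the batch dimension the model expects, using only NumPy.

```python
import numpy as np


def prepare_frame(frame_bgr):
    """Convert an OpenCV BGR frame into an RGB batch containing one image."""
    # Reverse the last axis to swap BGR -> RGB, and make the result contiguous.
    frame_rgb = np.ascontiguousarray(frame_bgr[:, :, ::-1])
    # Add a leading batch axis: (height, width, 3) -> (1, height, width, 3).
    return np.expand_dims(frame_rgb, axis=0)
```

Whether the swap noticeably changes the detections depends on the model, so treat it as something to try rather than a required step.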

Now to do the detection. We’ll run the detection graph we just loaded in a TF session on each frame and visualize the results:

In [None]:
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        while True:
            ret, image_np = cap.read()
            if not ret:
                # Stop if the camera did not return a frame.
                break
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Each box represents a part of the image where a particular object was detected.
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Each score represents the level of confidence for each of the objects.
            # Score is shown on the result image, together with the class label.
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            num_detections = detection_graph.get_tensor_by_name(
                'num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)

            cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))
            if cv2.waitKey(25) & 0xFF == ord('q'):
                # Release the camera before closing the window.
                cap.release()
                cv2.destroyAllWindows()
                break
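
If you want to use the detections yourself rather than only draw them, note that the boxes come back in normalized [ymin, xmin, ymax, xmax] form (which is why use_normalized_coordinates=True is passed above). The sketch below is our own illustration, with a hypothetical helper name and a score threshold of 0.5 chosen as an assumption: it keeps confident detections and converts them to pixel coordinates.

```python
def to_pixel_boxes(boxes, scores, image_width, image_height, min_score=0.5):
    """Keep detections scoring at least `min_score` and convert their
    normalized [ymin, xmin, ymax, xmax] boxes to (left, top, right, bottom)."""
    results = []
    for (ymin, xmin, ymax, xmax), score in zip(boxes, scores):
        if score < min_score:
            continue
        results.append((
            round(xmin * image_width),
            round(ymin * image_height),
            round(xmax * image_width),
            round(ymax * image_height),
        ))
    return results


# Example with made-up values: only the first detection is confident enough.
print(to_pixel_boxes([[0.25, 0.5, 0.75, 1.0], [0.0, 0.0, 1.0, 1.0]],
                     [0.9, 0.3], image_width=800, image_height=600))
```

Inside the loop, np.squeeze(boxes) and np.squeeze(scores) could be passed in together with the frame’s width and height.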

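That’s it on the Python side. Now we need to change the backend for matplotlib so it will run on the Jetson. To open the file where the backend is specified, run the following (in a terminal, without the leading “!”, since vi needs an interactive session):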

In [None]:
!sudo vi `python3 -c "import matplotlib; print(matplotlib.matplotlib_fname())"`

This command runs a Python one-liner that prints the path to the matplotlibrc file, which contains the backend specification, and then opens that file in vi for you to edit. In the editor, find the <span style="background-color: #000000"><font color="white"><b>backend : gtk3agg</b></font></span> line and replace <span style="background-color: #000000"><font color="white"><b>gtk3agg</b></font></span> with <span style="background-color: #000000"><font color="white"><b>qt4agg</b></font></span> (remember: enter insert mode by pressing “i” and return to command mode by pressing ESC. Save by typing “:w” and quit with “:q”).
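
If you would rather not edit the file by hand, the same change can be expressed as a small text transformation. This is a sketch only: set_backend is a hypothetical helper that works on the file’s contents as a string, and reading and writing the actual matplotlibrc file (which may need root permissions) is left to you.

```python
def set_backend(rc_text, backend="qt4agg"):
    """Return matplotlibrc contents with the active backend line replaced."""
    lines = []
    for line in rc_text.splitlines():
        # Only touch the uncommented "backend : ..." setting itself,
        # not comments or keys such as "backend.qt4".
        if line.split(":")[0].strip() == "backend":
            line = "backend : " + backend
        lines.append(line)
    return "\n".join(lines) + "\n"


sample = "# example matplotlibrc\nbackend : gtk3agg\nlines.linewidth : 1.5\n"
print(set_backend(sample))
```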

Finally, we need to install the qt4agg backend we want it to use:

In [None]:
!sudo apt-get install python3-pyqt4

And that’s it! Though the script seems shorter than the much simpler MNIST example, this is a much higher-level project: much more is going on under the hood, especially within the downloaded model. Run the script (<font color="green">python3 <.py name></font>). It will take a short while to initialize the pipeline and load the model, but you should then see a window pop up showing a stream from the CSI camera. It won’t be buttery smooth; keep in mind that the framerate set in the pipeline is only the rate at which frames are fed into the model. The calculations between the input and the output use a lot of system resources, which lowers the output framerate. Now you can point the camera around the room, even at yourself, and watch as it draws labeled boxes around various objects!
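
To see how far the output framerate falls below the pipeline’s input framerate, you could count the frames you actually display. The class below is a minimal sketch of our own (FpsCounter is not part of OpenCV or TensorFlow); the clock parameter exists only so the counter can be tested without waiting.

```python
import time


class FpsCounter:
    """Track the average number of frames displayed per second."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self._frames = 0

    def tick(self):
        """Call once for every frame that is displayed."""
        self._frames += 1

    def fps(self):
        """Average frames per second since the counter was created."""
        elapsed = self._clock() - self._start
        return self._frames / elapsed if elapsed > 0 else 0.0
```

Create the counter just before the while loop, call tick() after cv2.imshow, and print fps() when you quit.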
