# Explore the dataset


In this notebook, we will perform an EDA (Exploratory Data Analysis) on the processed Waymo dataset (data in the `processed` folder). In the first part, you will create a function to display an image and its bounding boxes.

In [None]:
from utils import get_dataset
import glob
import copy
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
import tensorflow as tf

%matplotlib inline

In [None]:
dataset_glob_path = "/home/workspace/data/train/*.tfrecord"
dataset = get_dataset(dataset_glob_path)

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays each image with its corresponding bounding boxes. The only requirement is that the classes should be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [None]:
def display_instances(batch):
    """
    This function takes a batch from the dataset and displays the image with 
    the associated bounding boxes.
    """
    # Waymo class ids: 1 = vehicle, 2 = pedestrian, 4 = cyclist
    rgb_mapping = { 1: 'red',   # Vehicles
                    2: 'blue',  # Pedestrians
                    4: 'green'  # Cyclists
                  }

    # Lay out a grid of 5 rows by 2 columns in a single 36 x 36 inch figure
    # @note: Had trouble getting the workspace notebook to allow for better sizing
    rows, cols = 5, 2
    _, ax = plt.subplots(rows, cols, figsize=(36, 36))
    for n, batch_data in enumerate(batch):
        # Get the row, col position of the current image (row-major order)
        r, c = divmod(n, cols)
        # Parse out display information
        bboxes = batch_data['groundtruth_boxes'].numpy()
        classes = batch_data['groundtruth_classes'].numpy()
        img = batch_data['image'].numpy()

        # Display the batch image
        ax[r, c].imshow(img)

        # Scale the normalized bounding boxes to pixel coordinates
        img_height, img_width, _ = img.shape
        pixel_bboxes = bboxes.copy()
        pixel_bboxes[:, (0, 2)] = bboxes[:, (0, 2)] * img_height
        pixel_bboxes[:, (1, 3)] = bboxes[:, (1, 3)] * img_width

        # Draw each bounding box in the color assigned to its class
        # (i.e. vehicle, pedestrian, cyclist)
        for bb, cl in zip(pixel_bboxes, classes):
            y1, x1, y2, x2 = bb
            anchor_point = (x1, y1)
            bb_w = x2 - x1
            bb_h = y2 - y1
            rec = patches.Rectangle(anchor_point, bb_w, bb_h, facecolor='none',
                                    edgecolor=rgb_mapping[cl])
            ax[r, c].add_patch(rec)
        ax[r, c].axis('off')
    plt.tight_layout()
    plt.show()
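
The coordinate handling in `display_instances` can be checked separately from the plotting. The following is a minimal sketch, assuming the boxes are stored as normalized `[y1, x1, y2, x2]` values in the range 0 to 1, as the unpacking in the function implies. The helper name `to_rectangle` is hypothetical and not part of the project code.

```python
def to_rectangle(box, img_height, img_width):
    """Convert a normalized [y1, x1, y2, x2] box into the arguments that
    matplotlib.patches.Rectangle expects: anchor (x, y), width, height.

    Assumption: box values are fractions of the image size (0..1).
    """
    y1, x1, y2, x2 = box
    # Scale vertical coordinates by the height, horizontal ones by the width
    top, bottom = y1 * img_height, y2 * img_height
    left, right = x1 * img_width, x2 * img_width
    anchor = (left, top)  # Rectangle is anchored at (x, y), not (y, x)
    return anchor, right - left, bottom - top


# Toy example on a 640 x 640 image
anchor, width, height = to_rectangle((0.25, 0.5, 0.75, 1.0), 640, 640)
print(anchor, width, height)
```

Keeping this step in a small pure function makes it easy to confirm that the x and y axes are not swapped before any boxes are drawn.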

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.

In [None]:
# Limit the shuffle buffer_size to 1000 to avoid using too much memory
rand_dataset = dataset.shuffle(1000)
display_instances(rand_dataset.take(10))
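
A buffer size of 1000 also limits how random the 10 images are: `shuffle` fills a buffer with `buffer_size` elements and samples from it, so the first few outputs can only come from the start of the dataset. The sketch below is a simplified pure-Python model of that behavior, not TensorFlow's implementation. The function name `buffered_shuffle` and the toy sizes are assumptions made for illustration.

```python
import itertools
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a bounded shuffle buffer (illustration only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and replace it with the new item
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


# The first 10 outputs are all drawn from the first ~1000 inputs
first_ten = list(itertools.islice(buffered_shuffle(range(10_000), 1000, seed=0), 10))
print(first_ten)
```

If the records are ordered by source file, a small buffer mostly samples from the first files, which is worth keeping in mind when judging how representative the displayed images are.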

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...

In [None]:
The bounding boxes seem a bit off for blurrier images (due to weather). Additionally, when an image contains many objects, the boxes obscure it, which makes it hard to tell whether there is a false positive.
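
One way to look at the data distribution is to count how often each class id appears across many samples. The following is a minimal sketch, assuming each sample's classes can be read as a sequence of integer ids (for example via `batch['groundtruth_classes'].numpy()`). The helper `count_classes` and the toy input are hypothetical.

```python
from collections import Counter


def count_classes(class_lists):
    """Count how many objects of each class id appear across samples.

    class_lists: iterable of per-image sequences of integer class ids.
    """
    counts = Counter()
    for classes in class_lists:
        # int() also accepts NumPy integer scalars
        counts.update(int(c) for c in classes)
    return counts


# Toy input standing in for three images (the last one has no objects)
toy_classes = [[1, 1, 2], [1, 4], []]
counts = count_classes(toy_classes)
for class_id, n in sorted(counts.items()):
    print(f"class {class_id}: {n} objects")
```

In the notebook, the input could be a generator such as `(b['groundtruth_classes'].numpy() for b in dataset.take(1000))`, where the sample count of 1000 is an arbitrary choice that keeps the loop bounded.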