# Explore the dataset


In this notebook, we will perform an EDA (Exploratory Data Analysis) on the processed Waymo dataset (data in the `processed` folder). In the first part, you will create a function to display images and their bounding boxes.

In [1]:
from utils import get_dataset
import tensorflow as tf
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np

%matplotlib inline

In [2]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [None]:
dataset = get_dataset("/home/workspace/data/train/*.tfrecord")

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays each image with its corresponding bounding boxes. The only requirement is that the classes should be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [None]:
def display_instances(batch):
    """
    Take a batch from the dataset and display each image with
    its associated bounding boxes.
    """
    # color mapping of classes: 1 = vehicle (red), 2 = pedestrian (blue), 4 = cyclist (green)
    colormap = {1: [1, 0, 0], 2: [0, 0, 1], 4: [0, 1, 0]}

    # loop through images. Plot image and its corresponding bounding boxes
    for sample in batch:
        
        # retrieve image data, boxes data and class data for each bounding box 
        img = sample['image'].numpy()
        boxes = sample['groundtruth_boxes'].numpy()
        classes = sample['groundtruth_classes'].numpy()
        
        # create the figure; the image itself is drawn after the boxes are added
        fig, ax = plt.subplots(1, 1, figsize=(8, 8))

        # add bounding boxes
        for cl, bb in zip(classes, boxes):
            # boxes are treated as normalized [ymin, xmin, ymax, xmax];
            # scale them by the image height and width to get pixel coordinates
            y1, x1, y2, x2 = bb
            y1 = y1 * img.shape[0]
            x1 = x1 * img.shape[1]
            y2 = y2 * img.shape[0]
            x2 = x2 * img.shape[1]

            # create bounding box variable
            rec = Rectangle((x1,y1), x2-x1, y2-y1, facecolor = 'none', edgecolor = colormap[cl])
            
            # add rec to image
            ax.add_patch(rec)
            
        # set image property
        ax.imshow(img)
        ax.axis('off')
        plt.show()

    # ADD CODE HERE
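
The coordinate conversion inside `display_instances` can be checked in isolation. The sketch below assumes the boxes are stored as normalized `[ymin, xmin, ymax, xmax]` values in the range 0 to 1, as the unpacking order in the function suggests. The helper name `box_to_rectangle` and the 640x640 image size are illustrative assumptions, not part of the project code.

```python
def box_to_rectangle(box, img_height, img_width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to the
    (x, y, width, height) values that matplotlib's Rectangle expects."""
    ymin, xmin, ymax, xmax = box
    x = xmin * img_width                # left edge in pixels
    y = ymin * img_height               # top edge in pixels
    width = (xmax - xmin) * img_width   # box width in pixels
    height = (ymax - ymin) * img_height # box height in pixels
    return x, y, width, height


# hypothetical box covering the lower-right part of a 640x640 image
rect = box_to_rectangle((0.25, 0.5, 0.75, 1.0), img_height=640, img_width=640)
print(rect)
```

The returned tuple maps directly onto `Rectangle((x, y), width, height, ...)`, so no extra scaling factor is needed as long as the boxes are normalized to the image that is being displayed.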

## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.
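
It helps to know how random `shuffle` followed by `take` really is. `tf.data` shuffles through a fixed-size buffer: it fills the buffer, emits a random element from it, and refills the freed slot with the next record. The pure-Python simulation below is a simplified sketch of that behavior, not TensorFlow's implementation; `buffered_shuffle` is a hypothetical helper and the buffer size of 75 is only an example value.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed=None):
    """Simulate buffer-based shuffling over an iterable."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # still filling the buffer, nothing is emitted yet
            buffer.append(item)
            continue
        # buffer is full: emit a random element, replace it with the new item
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # input exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


# take 10 elements from a shuffled stream of 20000 record indices
sampled = list(islice(buffered_shuffle(range(20000), buffer_size=75, seed=0), 10))
print(sorted(sampled))
```

With a buffer of 75, the 10 sampled indices all come from the first 84 records of the stream. If the images should be drawn from across the whole dataset, a larger buffer is needed.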

In [None]:
## STUDENT SOLUTION HERE

# display 10 random images with their bounding boxes
batch = dataset.shuffle(75).take(10)
display_instances(batch)

# calculate class distribution in training dataset
n_data = 20000
batch = dataset.take(n_data)
labels = {1: 0, 2: 0, 4: 0}

for sample in batch:
    label = sample['groundtruth_classes'].numpy()
    for l in label:
        labels[l] += 1

# number of boxes per class id across the sampled images
print(labels)

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...
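
One way to look at the data distribution is to count how often each class appears across many images. The sketch below works on plain lists of class ids so it can run without the dataset; in the notebook, each inner list would come from `sample['groundtruth_classes'].numpy()`. The `class_distribution` helper is hypothetical, and the id-to-name mapping is taken from the color coding used in `display_instances`.

```python
from collections import Counter

# class ids as used by the color mapping earlier in this notebook
CLASS_NAMES = {1: "vehicle", 2: "pedestrian", 4: "cyclist"}


def class_distribution(per_image_classes):
    """Return {class name: (count, fraction)} over all boxes in all images."""
    counts = Counter()
    for classes in per_image_classes:
        # int() makes the keys plain Python ints even if numpy ints are passed
        counts.update(int(c) for c in classes)
    total = sum(counts.values())
    return {
        CLASS_NAMES.get(cid, f"class_{cid}"): (n, n / total if total else 0.0)
        for cid, n in sorted(counts.items())
    }


# toy labels for three images (made-up values, not real dataset statistics)
toy_labels = [[1, 1, 2], [1, 4], [1, 1, 1, 2]]
distribution = class_distribution(toy_labels)
for name, (count, fraction) in distribution.items():
    print(f"{name}: {count} boxes ({fraction:.1%})")
```

The resulting counts can be passed to `plt.bar` to visualize the class imbalance.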

In [None]:
## STUDENT SOLUTION HERE
batch = dataset.shuffle(75).take(50)
display_instances(batch)

## Observations

1. Most images do not contain any cyclists.
2. Most images are taken during day time with sunny da