# Explore the dataset


In this notebook, we will perform an exploratory data analysis (EDA) on the processed Waymo dataset (data in the `processed` folder). In the first part, you will create a function to display an image and its bounding boxes.

In [1]:
import glob
import os
from utils import get_dataset
from matplotlib import use as pltuse
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image


In [2]:
import numpy

In [3]:
dataset = get_dataset("/home/workspace/data/train/*.tfrecord")

INFO:tensorflow:Reading unweighted datasets: ['/home/workspace/data/train/*.tfrecord']
INFO:tensorflow:Reading record datasets for input file: ['/home/workspace/data/train/*.tfrecord']
INFO:tensorflow:Number of filenames to read: 86
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
Instructions for updating:
Use `tf.data.Dataset.map()


## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays each image with its corresponding bounding boxes. The only requirement is that the classes be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [4]:
def display_instances(batch):
    """
    Take a batch from the dataset and display each image with
    its associated bounding boxes, color coded by class.
    """
    # Color mapping of classes: 1 = vehicle (red), 2 = pedestrian (green), 4 = cyclist (blue)
    colormap = {1: [1, 0, 0], 2: [0, 1, 0], 4: [0, 0, 1]}
    # Select the TkAgg backend once, before any figure is created
    pltuse('TkAgg')
    for image in batch:
        fig, ax = plt.subplots()
        img = image['image'].numpy()
        ax.imshow(img)
        # Boxes are normalized [y1, x1, y2, x2]: scale y by the height and x by the width
        height, width = img.shape[:2]
        bboxes = image['groundtruth_boxes'].numpy()
        classes = image['groundtruth_classes'].numpy()
        for cl, bb in zip(classes, bboxes):
            y1, x1, y2, x2 = bb * [height, width, height, width]
            rec = patches.Rectangle((x1, y1), x2 - x1, y2 - y1, facecolor='none',
                                    edgecolor=colormap[cl])
            ax.add_patch(rec)
        plt.show()
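
The coordinate handling in `display_instances` can be checked without loading the dataset. The following is a minimal sketch, assuming the boxes are stored as normalized `[y1, x1, y2, x2]` values between 0 and 1, as the function above implies. `box_to_rect` is a hypothetical helper name used for illustration; it is not part of `utils`.

```python
def box_to_rect(box, height, width):
    """Convert a normalized [y1, x1, y2, x2] box into (x, y, w, h) in pixels.

    Assumption: coordinates are normalized to [0, 1]. Hypothetical helper,
    not part of the project's utils module.
    """
    y1, x1, y2, x2 = box
    # Scale horizontal coordinates by the width and vertical ones by the height
    left, top = x1 * width, y1 * height
    return left, top, x2 * width - left, y2 * height - top


# A box covering the right half of an image 640 pixels wide and 480 pixels high
print(box_to_rect([0.0, 0.5, 1.0, 1.0], height=480, width=640))
```

The returned tuple lines up with the `(x, y)`, `width` and `height` arguments of `patches.Rectangle`, and it stays correct for non-square images because each axis is scaled separately.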



## Display 10 images 

Using the dataset created above and the function you just coded, display 10 random images with their associated bounding boxes. You can use the `take` and `shuffle` methods on the dataset.

In [None]:
## STUDENT SOLUTION HERE
dataset_shuffle = dataset.shuffle(200)
display_instances(dataset_shuffle.take(10))


## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...

In [6]:
# Environmental lighting analysis
# Based on https://natespilman.com/blog/making-a-histogram-image-light-with-matplotlib/
Brightness = []
for image in dataset.take(5000):
    ImageArray = numpy.array(image['image'])
    # Mean pixel intensity over height, width and channels
    Brightness.append(ImageArray.mean())
fig, ax = plt.subplots()
ax.hist(Brightness, bins=5, edgecolor='white')
ax.set_xlabel("Brightness Classification")
ax.set_ylabel("Image density")
ax.set_title('Image Density vs Brightness')
# Label the five equal-width bins from darkest to brightest
bar_labels = ['Very Dark', 'Dark', 'Normal', 'Bright', 'Very Bright']
for bar, label in zip(ax.patches, bar_labels):
    height = bar.get_height()
    # Place the label centered just above the bar
    ax.text(bar.get_x() + bar.get_width() / 2, height + 1, label, ha='center', va='bottom')
plt.show()
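
Note that `ax.hist` with `bins=5` splits the observed brightness range into five equal-width bins, so the labels are relative to the darkest and brightest images in the sample. As an alternative, here is a minimal sketch that assumes 8-bit pixel values (0 to 255) and assigns labels from fixed bins instead. `mean_brightness` and `brightness_label` are hypothetical helpers introduced for illustration only.

```python
LABELS = ['Very Dark', 'Dark', 'Normal', 'Bright', 'Very Bright']


def mean_brightness(pixels):
    """Average intensity of a flat sequence of pixel values."""
    return sum(pixels) / len(pixels)


def brightness_label(value, max_value=255):
    """Map a mean intensity to one of five equal-width bins over [0, max_value].

    Assumption: pixels are 8-bit, so max_value defaults to 255.
    """
    # Clamp the index so that value == max_value falls into the last bin
    index = min(int(value / max_value * len(LABELS)), len(LABELS) - 1)
    return LABELS[index]


pixels = [0, 64, 128, 255]  # toy "image" flattened to a list
value = mean_brightness(pixels)
print(value, brightness_label(value))
```

With fixed bins, a label such as "Dark" means the same thing across different samples of the dataset, at the cost of possibly leaving some bins empty.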
    


In [7]:
# Find the ratio of vehicles, cyclists and pedestrians
Cyclists = 0
Pedestrians = 0
Vehicles = 0
for image in dataset.take(10000):
    # Convert the class tensor once per image, then count each class ID
    classes = image['groundtruth_classes'].numpy()
    Vehicles += int((classes == 1).sum())
    Pedestrians += int((classes == 2).sum())
    Cyclists += int((classes == 4).sum())
labels = ['Vehicles', 'Pedestrians', 'Cyclists']
TotalDensity = [Vehicles, Pedestrians, Cyclists]
fig, ax = plt.subplots()
ax.pie(TotalDensity, labels=labels, autopct='%1.1f%%')
plt.show()
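
The same ratios can be computed on plain Python lists, which makes the counting logic easy to verify before running it on 10000 images. This is a minimal sketch using the class IDs from the cell above (1, 2 and 4); `class_ratios` and the sample labels are made up for illustration.

```python
from collections import Counter

# Class IDs as used in the counting loop above
CLASS_NAMES = {1: 'Vehicles', 2: 'Pedestrians', 4: 'Cyclists'}


def class_ratios(per_image_classes):
    """Return the share of each class over all annotated objects."""
    counts = Counter()
    for classes in per_image_classes:
        counts.update(classes)
    total = sum(counts.values())
    # Guard against a sample with no annotations at all
    if total == 0:
        return {name: 0.0 for name in CLASS_NAMES.values()}
    return {name: counts[cid] / total for cid, name in CLASS_NAMES.items()}


sample = [[1, 1, 2], [1], [], [1, 4, 2, 1]]  # made-up labels for illustration
print(class_ratios(sample))
```

On the real dataset, each inner list would presumably come from `image['groundtruth_classes'].numpy().tolist()`.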


In [8]:
# Sort images with no classes, only two classes and all classes
ImagesEmpty = 0
ImagesVehCy = 0
ImagesVehPed = 0
ImagesPedCy = 0
ImagesVehPedCy = 0
for image in dataset.take(10000):
    # Count objects per image (not cumulatively) so each image is classified on its own
    classes = image['groundtruth_classes'].numpy()
    Vehicles = int((classes == 1).sum())
    Pedestrians = int((classes == 2).sum())
    Cyclists = int((classes == 4).sum())
    if Vehicles == 0 and Pedestrians == 0 and Cyclists == 0:
        ImagesEmpty += 1
    elif Vehicles > 0 and Pedestrians > 0 and Cyclists > 0:
        ImagesVehPedCy += 1
    elif Vehicles > 0 and Cyclists > 0:
        ImagesVehCy += 1
    elif Vehicles > 0 and Pedestrians > 0:
        ImagesVehPed += 1
    elif Pedestrians > 0 and Cyclists > 0:
        ImagesPedCy += 1
    # Images containing a single class are not counted in any of the groups above
print(ImagesEmpty, ImagesVehCy, ImagesVehPed, ImagesPedCy, ImagesVehPedCy)
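
The grouping in the cell above tracks five named combinations. A more general tally, keyed by whichever classes are present in each image, also covers single-class images. The following is a minimal sketch on plain Python lists, assuming the class IDs 1, 2 and 4 used earlier; `combination_key` and `tally_combinations` are hypothetical helpers and the sample labels are made up.

```python
from collections import Counter

# Short names matching the counter names in the cell above
CLASS_NAMES = {1: 'Veh', 2: 'Ped', 4: 'Cy'}


def combination_key(classes):
    """Name the set of classes present in one image, e.g. 'Veh+Ped'."""
    present = [name for cid, name in CLASS_NAMES.items() if cid in classes]
    return '+'.join(present) if present else 'Empty'


def tally_combinations(per_image_classes):
    """Count how many images contain each combination of classes."""
    return Counter(combination_key(classes) for classes in per_image_classes)


sample = [[1, 1, 2], [], [1], [4, 2, 1], [2, 4]]  # made-up labels for illustration
print(tally_combinations(sample))
```

Because each image is keyed by presence rather than by object count, an image with ten vehicles and one pedestrian lands in the same group as an image with one of each.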

    