# Analysis - Object Detection

This notebook presents the exploratory data analysis for the Object Detection project.

In [1]:
import os
import random

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

import json

import project.download_content as content

from collections import Counter

import notebooks_utils.analysis as utils

from IPython.display import display




%matplotlib inline

This analysis starts by downloading all the data. To do this, you can use the project's Makefile: open a terminal, go to the directory where you cloned this project, run `make download-content`, and follow the instructions to download the files.

**To run and reproduce this analysis, you must download the METADATA files**. In addition, some cells need the images themselves (the TRAIN, TEST, and VALIDATION image files), so reproducing the analysis in full requires downloading those files too (around 550 GB). If you have not downloaded them, those cells will not run completely, but they will notify you about it and the rest of the notebook will continue.

The analysis run with all the image files downloaded is available as an HTML file in the project, called "analysis.html". You do not need to download the images to see it; just open the file in your browser.

## Wrangling

### Gather

In [2]:
# Gathering all files
if not content.does_metadata_exist():
    raise OSError('Some metadata files have not been downloaded yet...')
print('All Metadata files exist...')

METAPATH = os.path.join(content.DATAPATH, 'METADATA')

# metadata general files
print('Gathering all metadata files...', end='')

df_classes_raw = pd.read_csv(METAPATH + "/class-descriptions-boxable.csv",
                             names=['class_encode', 'class_name'])
with open(METAPATH + "/bbox_labels_600_hierarchy.json") as f:
    dict_hierarchy_raw = json.load(f)
print('OK!')


# train files
print('Gathering all train files...', end='')
df_train_bbox_raw = pd.read_csv(METAPATH + "/train-annotations-bbox.csv")
df_train_labels_raw = pd.read_csv(
    METAPATH + "/train-annotations-human-imagelabels-boxable.csv")
print('OK!')


# validation files
print('Gathering all validation files...', end='')
df_val_bbox_raw = pd.read_csv(METAPATH + "/validation-annotations-bbox.csv")
df_val_labels_raw = pd.read_csv(
    METAPATH + "/validation-annotations-human-imagelabels-boxable.csv")
print('OK!')


# test files
print('Gathering all test files...', end='')
df_test_bbox_raw = pd.read_csv(METAPATH + "/test-annotations-bbox.csv")
df_test_labels_raw = pd.read_csv(
    METAPATH + "/test-annotations-human-imagelabels-boxable.csv")
print('OK!')

# downloaded images
if utils.all_images_downloaded():
    print("Gathering labels of all images downloaded...", end='')
    df_images_train = utils.images_downloaded('TRAIN')
    df_images_val = utils.images_downloaded('VALIDATION')
    df_images_test = utils.images_downloaded('TEST')
    print('Ok!')
else:
    print("""
Unfortunately you did not download the files...
That's ok, you can follow, but some cells will not run properly.
    """)

All Metadata files exist...
Gathering all metadata files...OK!
Gathering all train files...OK!
Gathering all validation files...OK!
Gathering all test files...OK!
Gathering labels of all images downloaded...Ok!


### Assess

#### Explaining data

The analysis has to start by understanding all the downloaded files.

When you run the command `make download-content`, you will notice that the files are divided into four blocks: METADATA, TRAIN, VALIDATION, and TEST.

The TRAIN, VALIDATION, and TEST blocks contain the image files and nothing else.

**The METADATA block is mandatory**. It contains eight files.

**The first six files are essentially two datasets, each split into three parts**: the bounding box annotations *([type]-annotations-bbox.csv)* and the image-level annotations *([type]-annotations-human-imagelabels-boxable.csv)*.

Image-level annotations are labels assigned manually by humans (from Google and from a crowdsourced system). These labels describe what the image contains. More about this dataset can be found [here](https://storage.googleapis.com/openimages/web/factsfigures.html) in the section **Image-level Labels**.

The bounding box annotations define, for each image, where the objects previously identified by the image-level annotations are located. These annotations focus on the most specific labels in the image-level annotations. More about this dataset can be found [here](https://storage.googleapis.com/openimages/web/factsfigures.html) in the section **Bounding Boxes**.
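The box coordinates (`XMin`, `XMax`, `YMin`, `YMax`) appear in the samples further down as values between 0 and 1. A minimal sketch of turning such a box into pixel coordinates, assuming the values are fractions of the image width and height; the helper name and the image size are hypothetical:

```python
def to_pixel_box(x_min, x_max, y_min, y_max, width, height):
    """Scale a normalised box (values in [0, 1]) to integer pixel coordinates."""
    left = round(x_min * width)
    right = round(x_max * width)
    top = round(y_min * height)
    bottom = round(y_max * height)
    return left, top, right, bottom


# Hypothetical 1024x768 image with a box covering its central region
box = to_pixel_box(0.25, 0.75, 0.25, 0.75, width=1024, height=768)
print(box)  # (256, 192, 768, 576)
```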

Google divided these two datasets into three sets: Train, used to train the model; Validation, used to compare models; and Test, used for the final evaluation of the chosen model.

The final two files are about the classes themselves.

All the labels described in the paragraphs above are encoded representations of the classes. The CSV *class-descriptions-boxable.csv* maps each **encoded class name** (machine-understandable) **to a semantic class name** (human-understandable).
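As a small illustration of this mapping, the sketch below builds a lookup dictionary from two rows written in the same header-less, two-column layout as the class file. The inline string stands in for the real file, and the variable names are hypothetical:

```python
import csv
import io

# Two rows in the same two-column, header-less layout as the class file
raw = "/m/07dd4,Torch\n/m/01dws,Bear\n"

# Map each encoded class name to its human-readable name
encode_to_name = {encode: name for encode, name in csv.reader(io.StringIO(raw))}
print(encode_to_name["/m/01dws"])  # Bear
```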

Lastly, the JSON file *bbox_labels_600_hierarchy.json* **describes the hierarchical tree of all classes**. This file shows us, for example, that Woman derives from Person.
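A tree like this can be walked recursively. The sketch below is only an illustration on a toy tree that reuses the `LabelName` and `Subcategory` keys of the JSON file; `collect_labels` is a hypothetical helper, not the project's own utility, and the labels are made up:

```python
from collections import Counter

# Toy hierarchy using the same 'LabelName' / 'Subcategory' keys as the JSON file;
# the labels below are made up for illustration.
toy_hierarchy = {
    'LabelName': 'root',
    'Subcategory': [
        {'LabelName': 'person', 'Subcategory': [{'LabelName': 'woman'}]},
        {'LabelName': 'clothing', 'Subcategory': [{'LabelName': 'hat'}]},
        {'LabelName': 'sports', 'Subcategory': [{'LabelName': 'hat'}]},
    ],
}


def collect_labels(node):
    """Return every LabelName in the subtree, in depth-first order."""
    labels = [node['LabelName']]
    for child in node.get('Subcategory', []):
        labels.extend(collect_labels(child))
    return labels


labels = collect_labels(toy_hierarchy)
print(len(labels), len(set(labels)))  # nodes visited vs. distinct labels
print([k for k, c in Counter(labels).items() if c > 1])  # labels seen more than once
```

Because a label may hang under more than one parent, the number of nodes visited can be larger than the number of distinct classes.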

#### Assessing Data

In [3]:
# number of annotations by dataset
print(f"""Number of bounding boxes: {(df_train_bbox_raw.shape[0]
                                      + df_val_bbox_raw.shape[0]
                                      + df_test_bbox_raw.shape[0]):,}""", end="\n"*2)

print(f"labels in train images: {df_train_labels_raw.shape[0]:,}")
print(f"labels in validation images: {df_val_labels_raw.shape[0]:,}")
print(f"labels in test images: {df_test_labels_raw.shape[0]:,}", end="\n"*2)

print(f"bounding boxes labeled in train: {df_train_bbox_raw.shape[0]:,}")
print(f"bounding boxes labeled in validation: {df_val_bbox_raw.shape[0]:,}")
print(f"bounding boxes labeled in test: {df_test_bbox_raw.shape[0]:,}")

Number of bounding boxes: 15,851,536

labels in train images: 8,996,795
labels in validation images: 256,707
labels in test images: 772,776

bounding boxes labeled in train: 14,610,229
bounding boxes labeled in validation: 303,980
bounding boxes labeled in test: 937,327
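To put these counts in proportion, the following sketch computes the share of bounding boxes in each split. The numbers are copied from the output above; the variable names are hypothetical:

```python
# Row counts copied from the output above
bbox_rows = {'train': 14_610_229, 'validation': 303_980, 'test': 937_327}
total = sum(bbox_rows.values())

# Print each split's share of all bounding boxes as a percentage
for split, rows in bbox_rows.items():
    print(f"{split:<12}{rows / total:.2%}")
```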


In [4]:
utils.check_images_download()

print(f"images downloaded in train: {df_images_train.shape[0]:,}")
print(f"images downloaded in validation: {df_images_val.shape[0]:,}")
print(f"images downloaded in test: {df_images_test.shape[0]:,}")

images downloaded in train: 1,743,042
images downloaded in validation: 41,620
images downloaded in test: 125,436


#### Explaining more about the data

###### Classes Mapping

In [5]:
# display classes and their encodes
print(f"total classes: {df_classes_raw.shape[0]}")
display(df_classes_raw.sample(2, random_state=17))

total classes: 601


Unnamed: 0,class_encode,class_name
386,/m/07dd4,Torch
58,/m/01dws,Bear


In the cell below, all datasets are described:

In [6]:
# show info about all dfs
for k, df in {'Train Bounding Boxes': df_train_bbox_raw,
              'Train Labels': df_train_labels_raw,
              'Validation Bounding Boxes': df_val_bbox_raw,
              'Validation Labels': df_val_labels_raw,
              'Test Bounding Boxes': df_test_bbox_raw,
              'Test Labels': df_test_labels_raw}.items():
    print(f"####### {k.upper()} #######", end="\n"*2)
    print(f"shape: {df.shape[0]:,} rows, {df.shape[1]} columns")
    print(f"duplicated values: {df[df.duplicated(keep='first')].shape[0]} records",
          end="\n"*2)

    print("Unique Values:")
    for col in df.columns:
        print(
            f"   {str(col)+' ':-<15} {str(df[col].dtype).upper()+' ':-<10} Nulls = {df[col].isna().sum():,} | Uniques = {df[col].nunique():,}")
    display(df.sample(2, random_state=37))
    print('_'*80, end="\n"*2)

####### TRAIN BOUNDING BOXES #######

shape: 14,610,229 rows, 13 columns
duplicated values: 558 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 1,743,042
   Source -------- OBJECT --- Nulls = 0 | Uniques = 2
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 599
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 1
   XMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 446,369
   XMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 446,839
   YMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 496,455
   YMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 476,443
   IsOccluded ---- INT64 ---- Nulls = 0 | Uniques = 3
   IsTruncated --- INT64 ---- Nulls = 0 | Uniques = 3
   IsGroupOf ----- INT64 ---- Nulls = 0 | Uniques = 3
   IsDepiction --- INT64 ---- Nulls = 0 | Uniques = 3
   IsInside ------ INT64 ---- Nulls = 0 | Uniques = 3


Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
2219686,2360d8f563dbb953,xclick,/m/01g317,1,0.378125,0.529688,0.375,0.804167,1,0,0,0,0
9909796,a96113e785b31ec2,xclick,/m/07j7r,1,0.43625,0.49875,0.636961,0.77955,1,0,0,0,0


________________________________________________________________________________

####### TRAIN LABELS #######

shape: 8,996,795 rows, 4 columns
duplicated values: 0 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 1,743,042
   Source -------- OBJECT --- Nulls = 0 | Uniques = 2
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 601
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 2


Unnamed: 0,ImageID,Source,LabelName,Confidence
7673221,d52ec8ea8095ceda,verification,/m/05r655,0
1702253,2793756b766d2b66,verification,/m/01bl7v,0


________________________________________________________________________________

####### VALIDATION BOUNDING BOXES #######

shape: 303,980 rows, 13 columns
duplicated values: 0 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 37,306
   Source -------- OBJECT --- Nulls = 0 | Uniques = 1
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 570
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 1
   XMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 52,066
   XMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 52,182
   YMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 45,696
   YMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 45,471
   IsOccluded ---- INT64 ---- Nulls = 0 | Uniques = 2
   IsTruncated --- INT64 ---- Nulls = 0 | Uniques = 2
   IsGroupOf ----- INT64 ---- Nulls = 0 | Uniques = 2
   IsDepiction --- INT64 ---- Nulls = 0 | Uniques = 2
   IsInside ------ INT64 ---- Nulls = 0 | Uniques = 2


Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
265694,df011533a62a094a,xclick,/m/07mhn,1,0.137787,0.48643,0.607812,0.998437,1,1,0,0,0
158384,864d6c8df3d06b74,xclick,/m/01g317,1,0.159057,0.210604,0.232816,0.421286,1,0,0,0,0


________________________________________________________________________________

####### VALIDATION LABELS #######

shape: 256,707 rows, 4 columns
duplicated values: 0 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 41,151
   Source -------- OBJECT --- Nulls = 0 | Uniques = 2
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 577
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 2


Unnamed: 0,ImageID,Source,LabelName,Confidence
167771,a621ee5f23c3a6e3,verification,/m/05s2s,1
226733,e1a88f17eef86a15,verification,/m/026qbn5,0


________________________________________________________________________________

####### TEST BOUNDING BOXES #######

shape: 937,327 rows, 13 columns
duplicated values: 0 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 112,194
   Source -------- OBJECT --- Nulls = 0 | Uniques = 1
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 583
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 1
   XMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 101,542
   XMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 101,871
   YMin ---------- FLOAT64 -- Nulls = 0 | Uniques = 82,929
   YMax ---------- FLOAT64 -- Nulls = 0 | Uniques = 83,316
   IsOccluded ---- INT64 ---- Nulls = 0 | Uniques = 2
   IsTruncated --- INT64 ---- Nulls = 0 | Uniques = 2
   IsGroupOf ----- INT64 ---- Nulls = 0 | Uniques = 2
   IsDepiction --- INT64 ---- Nulls = 0 | Uniques = 2
   IsInside ------ INT64 ---- Nulls = 0 | Uniques = 2


Unnamed: 0,ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside
391592,6991ac65c79d307e,xclick,/m/0k65p,1,0.618037,0.854111,0.405,0.5275,0,0,0,0,0
96435,1928189562dc619a,xclick,/m/03q69,1,0.0,0.457227,0.0,0.951327,0,1,0,0,0


________________________________________________________________________________

####### TEST LABELS #######

shape: 772,776 rows, 4 columns
duplicated values: 0 records

Unique Values:
   ImageID ------- OBJECT --- Nulls = 0 | Uniques = 123,926
   Source -------- OBJECT --- Nulls = 0 | Uniques = 2
   LabelName ----- OBJECT --- Nulls = 0 | Uniques = 589
   Confidence ---- INT64 ---- Nulls = 0 | Uniques = 2


Unnamed: 0,ImageID,Source,LabelName,Confidence
144973,2f1eb82880cafec3,verification,/m/0f4s2w,0
237070,4d2e86d6498d365b,verification,/m/01prls,1


________________________________________________________________________________



###### Classes Hierarchy

In [7]:
# classes hierarchy

num_classes, classes = utils.count_recursive(dict_hierarchy_raw)

print(f"There are {num_classes} classes in the JSON hierarchy",
      end="\n"*2)

print("The first node class encode is:",
      dict_hierarchy_raw['LabelName'], end="\n"*2)

# defining a node to consult
i=17

print(f"the {i}th son encode of the first node:",
      dict_hierarchy_raw['Subcategory'][i]['LabelName'])
print(f"The sons of the {i}th node:")
for subcat in dict_hierarchy_raw['Subcategory'][i]['Subcategory']:
    print(f"   {subcat}")

There are 671 classes in the JSON hierarchy

The first node class encode is: /m/0bl9f

the 17th son encode of the first node: /m/01x3z
The sons of the 17th node:
   {'LabelName': '/m/046dlr'}
   {'LabelName': '/m/06_72j'}
   {'LabelName': '/m/0h8mzrc'}


In [8]:
# finding whether some class in the hierarchy has no human representation
for encoded_name in classes:
    try:
        semantic_name = utils.semantic_name(encoded_name)
    except KeyError:
        print(f"{encoded_name} - 'Undefined Human Representation'")

/m/0bl9f - 'Undefined Human Representation'


It is important to notice that only the first node ("the father of all") has no semantic representation!
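The same check can also be expressed as a set difference between the encodes found in the hierarchy and the encodes that have a name. A minimal sketch with stand-in sets (three encodes taken from this notebook, not the full data):

```python
# Stand-in sets: encodes found in the hierarchy vs. encodes that have a name
hierarchy_encodes = {'/m/0bl9f', '/m/01dws', '/m/07dd4'}
named_encodes = {'/m/01dws', '/m/07dd4'}

# Encodes present in the hierarchy but missing from the name table
unnamed = hierarchy_encodes - named_encodes
print(unnamed)  # {'/m/0bl9f'}
```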

###### About labels and image downloads

In [9]:
# assess the images downloaded
utils.check_images_download()

print('### Number of unique images in each dataframe ###', end='\n'*2)
train_labels   = pd.DataFrame(df_train_labels_raw.ImageID.unique(), columns=['ImageID'])
train_bbox     = pd.DataFrame(df_train_bbox_raw.ImageID.unique(), columns=['ImageID'])
train_download = pd.DataFrame(df_images_train.ImageID.unique(), columns=['ImageID'])

print(f"{'images in train images-level ':-<40}> {train_labels.shape[0]:,}")
print(f"{'images in train bbox ':-<40}> {train_bbox.shape[0]:,}")
print(f"{'images in train download folder ':-<40}> {train_download.shape[0]:,}",
      end='\n'*2)

val_labels   = pd.DataFrame(df_val_labels_raw.ImageID.unique(), columns=['ImageID'])
val_bbox     = pd.DataFrame(df_val_bbox_raw.ImageID.unique(), columns=['ImageID'])
val_download = pd.DataFrame(df_images_val.ImageID.unique(), columns=['ImageID'])

print(f"{'images in validation images-level ':-<40}> {val_labels.shape[0]:,}")
print(f"{'images in validation bbox ':-<40}> {val_bbox.shape[0]:,}")
print(f"{'images in validation download folder ':-<40}> {val_download.shape[0]:,}",
      end='\n'*2)

test_labels   = pd.DataFrame(df_test_labels_raw.ImageID.unique(), columns=['ImageID'])
test_bbox     = pd.DataFrame(df_test_bbox_raw.ImageID.unique(), columns=['ImageID'])
test_download = pd.DataFrame(df_images_test.ImageID.unique(), columns=['ImageID'])

print(f"{'images in test images-level ':-<40}> {test_labels.shape[0]:,}")
print(f"{'images in test bbox ':-<40}> {test_bbox.shape[0]:,}")
print(f"{'images in test download folder ':-<40}> {test_download.shape[0]:,}",
      end='\n'*2)

### Number of unique images in each dataframe ###

images in train images-level -----------> 1,743,042
images in train bbox -------------------> 1,743,042
images in train download folder --------> 1,743,042

images in validation images-level ------> 41,151
images in validation bbox --------------> 37,306
images in validation download folder ---> 41,620

images in test images-level ------------> 123,926
images in test bbox --------------------> 112,194
images in test download folder ---------> 125,436



In [10]:
# joining train dfs
train_labels['label'] = True
train_bbox['bbox'] = True
train_download['download'] = True

df_train_imgs = (train_download.merge(train_labels, on='ImageID', how='outer')
                               .merge(train_bbox, on='ImageID', how='outer'))
df_train_imgs.fillna(False, inplace=True)

# joining validation dfs
val_labels['label'] = True
val_bbox['bbox'] = True
val_download['download'] = True

df_val_imgs = (val_download.merge(val_labels, on='ImageID', how='outer')
                           .merge(val_bbox, on='ImageID', how='outer'))
df_val_imgs.fillna(False, inplace=True)

# joining test dfs
test_labels['label'] = True
test_bbox['bbox'] = True
test_download['download'] = True

df_test_imgs = (test_download.merge(test_labels, on='ImageID', how='outer')
                             .merge(test_bbox, on='ImageID', how='outer'))
df_test_imgs.fillna(False, inplace=True)

In [11]:
ref = {'Train': df_train_imgs, 'Validation': df_val_imgs, 'Test': df_test_imgs}

for label, df in ref.items():
    not_down = df[df.download == False].shape[0]
    not_label = df[(df.download == True) & (df.label == False)].shape[0]
    not_bbox = df[(df.download == True) & (df.bbox == False)].shape[0]
    label_not_bbox = df[(df.download == True) 
                        & (df.label == True)
                        & (df.bbox == False)].shape[0]
    bbox_not_label = df[(df.download == True) 
                        & (df.label == False)
                        & (df.bbox == True)].shape[0]
    
    print(f"About {label} data:")
    print(f"   There are {not_down} images not downloaded")
    print(f"   There are {not_label} images without label")
    print(f"   There are {not_bbox} images without bounding boxes identified")
    print(f"   There are {label_not_bbox} images with labels in image level, but without bounding boxes")
    print(f"   There are {bbox_not_label} images without labels in image level, but having bounding boxes", end='\n'*2)

About Train data:
   There are 0 images not downloaded
   There are 0 images without label
   There are 0 images without bounding boxes identified
   There are 0 images with labels in image level, but without bounding boxes
   There are 0 images without labels in image level, but having bounding boxes

About Validation data:
   There are 0 images not downloaded
   There are 469 images without label
   There are 4314 images without bounding boxes identified
   There are 3845 images with labels in image level, but without bounding boxes
   There are 0 images without labels in image level, but having bounding boxes

About Test data:
   There are 0 images not downloaded
   There are 1510 images without label
   There are 13242 images without bounding boxes identified
   There are 11732 images with labels in image level, but without bounding boxes
   There are 0 images without labels in image level, but having bounding boxes
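The cells above flag each source with a boolean column before merging. As an alternative sketch (not the approach used in this notebook), pandas can report where each row came from through the `indicator` argument of `merge`. The image IDs below are made up:

```python
import pandas as pd

# Made-up image IDs standing in for the downloaded and annotated image lists
downloaded = pd.DataFrame({'ImageID': ['a', 'b', 'c']})
with_bbox = pd.DataFrame({'ImageID': ['a', 'b']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = downloaded.merge(with_bbox, on='ImageID', how='outer', indicator=True)
missing_bbox = merged.loc[merged['_merge'] == 'left_only', 'ImageID'].tolist()
print(missing_bbox)  # ['c']
```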



### Clean

#### Removing duplicate data

In [12]:
df_train_bbox_raw = df_train_bbox_raw.drop_duplicates()
df_train_bbox_raw.shape

(14609671, 13)
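The new row count matches the 558 duplicated records reported earlier (14,610,229 - 558 = 14,609,671). A toy sketch of the same two calls, on made-up rows, shows how `duplicated` and `drop_duplicates` relate:

```python
import pandas as pd

# Made-up annotation rows; the last row repeats the first one exactly
df_demo = pd.DataFrame({'ImageID': ['a', 'a', 'b', 'a'],
                        'LabelName': ['/m/01dws', '/m/07dd4', '/m/01dws', '/m/01dws'],
                        'XMin': [0.1, 0.2, 0.3, 0.1]})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(df_demo.duplicated(keep='first').sum())
df_clean = df_demo.drop_duplicates()
print(n_duplicates, len(df_clean))  # 1 3
```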

#### Adding semantic class label

In [13]:
def semantic_label(df):
    return (df.merge(df_classes_raw,
                     left_on='LabelName',
                     right_on='class_encode',
                     how='inner')
              .drop(['class_encode'], axis=1)
              .rename(columns={'class_name': 'LabelSemantic'}))

df_train_bbox   = semantic_label(df_train_bbox_raw)
df_train_labels = semantic_label(df_train_labels_raw)
df_val_bbox     = semantic_label(df_val_bbox_raw)
df_val_labels   = semantic_label(df_val_labels_raw)
df_test_bbox    = semantic_label(df_test_bbox_raw)
df_test_labels  = semantic_label(df_test_labels_raw)
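An inner merge silently drops annotation rows whose `LabelName` has no match in the class table. The sketch below, on made-up data, shows one way to measure that loss, and uses the `validate` argument of `merge` to assert that each encode appears only once in the class table. The names are hypothetical:

```python
import pandas as pd

# Made-up annotations and class table; '/m/unknown' has no class entry on purpose
annotations = pd.DataFrame({'ImageID': ['a', 'b', 'c'],
                            'LabelName': ['/m/01dws', '/m/07dd4', '/m/unknown']})
class_table = pd.DataFrame({'class_encode': ['/m/01dws', '/m/07dd4'],
                            'class_name': ['Bear', 'Torch']})

# validate='many_to_one' raises MergeError if class_encode is not unique
labelled = annotations.merge(class_table, left_on='LabelName', right_on='class_encode',
                             how='inner', validate='many_to_one')
rows_lost = len(annotations) - len(labelled)
print(rows_lost)  # 1
```

Comparing the row counts before and after `semantic_label` in the same way would confirm whether any annotation was dropped.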

#### Naming the first class in the hierarchy

As shown above, the first node in the JSON hierarchy has no name. I am going to define it as "Entity" in `df_classes`.

In [18]:
# concat keeps each frame's original index, so reset_index() exposes it as an 'index' column
df_classes = pd.concat([pd.DataFrame([['/m/0bl9f', 'Entity']],
                                     columns=['class_encode', 'class_name']),
                        df_classes_raw]).reset_index()
df_classes

Unnamed: 0,index,class_encode,class_name
0,0,/m/0bl9f,Entity
1,0,/m/011k07,Tortoise
2,1,/m/011q46kg,Container
3,2,/m/012074,Magpie
4,3,/m/0120dh,Sea turtle
...,...,...,...
597,596,/m/0qmmr,Wheelchair
598,597,/m/0wdt60w,Rugby ball
599,598,/m/0xfy,Armadillo
600,599,/m/0xzly,Maracas


## EDA

**Some classes can be in more than one subtree**, as shown below.

In [15]:
# duplicated paths
ambiguous_paths = [k for k, c in Counter(classes).items() if c > 1]

random.seed(57)
for cat in random.choices(ambiguous_paths, k=3):
    print(f"{utils.semantic_name(cat)}")
    paths = utils.node_path(dict_hierarchy_raw, cat)
    for path in paths:
        print(f"    " + " -> ".join([utils.semantic_name(k) for k in path]))

Oven


KeyError: '/m/0bl9f'
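The `KeyError` above is raised for `/m/0bl9f`, the root encode that has no entry in the class description file. Assuming the name lookup is the cause, one possible way around it is a lookup that falls back to the raw encode. The sketch below illustrates this together with a path search on a toy tree; `find_paths`, the tree, and the name table are all hypothetical and are not the project's `utils` functions:

```python
def find_paths(node, target, prefix=()):
    """Return every root-to-target path as a tuple of LabelName values."""
    path = prefix + (node['LabelName'],)
    paths = [path] if node['LabelName'] == target else []
    for child in node.get('Subcategory', []):
        paths.extend(find_paths(child, target, path))
    return paths


# Toy tree and name table (made up); the root deliberately has no name entry
toy_tree = {'LabelName': 'root',
            'Subcategory': [
                {'LabelName': 'kitchen', 'Subcategory': [{'LabelName': 'oven'}]},
                {'LabelName': 'appliance', 'Subcategory': [{'LabelName': 'oven'}]}]}
names = {'kitchen': 'Kitchenware', 'appliance': 'Home appliance', 'oven': 'Oven'}

for path in find_paths(toy_tree, 'oven'):
    # dict.get falls back to the raw encode when no name is known
    print(" -> ".join(names.get(k, k) for k in path))
```

Using the `df_classes` table built in the Clean step, which includes the "Entity" row, would be another option for the lookup.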