# Analysing the COCO dataset

The COCO dataset [1] is widely used for training visual ML models. We will produce some statistics that will be used later in the definition of the AttentionNET model, which separates identifying the focus in a picture from identifying the object in that focus.

First we load the modules we will use:

In [None]:
%matplotlib inline
from pycocotools.coco import COCO
import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import pandas as pd

Based on the examples from the COCO GitHub repository, we will set up a directory from which to load the annotation data. We will use the 2017 training set, since it is the largest, and focus on the "instances" annotations. The annotations are pre-downloaded in the `annotations` directory:

In [None]:
dataDir='.'
dataType='train2017'
annFile='{}/annotations/instances_{}.json'.format(dataDir,dataType)

We now set up a `COCO` object to load and parse the annotation data:

In [None]:
# initialize COCO api for instance annotations
coco=COCO(annFile)

We can analyse the categories from the annotations file. They are available in the `cats` member of the `COCO` object, a dictionary keyed by category ID whose values have the following attributes:
- `name`: the name of the category
- `supercategory`: the name of the supercategory that groups several categories together
- `id`: the ID of the category (repeated here, although it is already the dictionary key)

In [None]:
coco.cats

The supercategories can be shown with:

In [None]:
nms = set([cat['supercategory'] for cat in coco.cats.values()])
print('COCO supercategories: \n{}'.format(' '.join(nms)))
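To see which categories fall under each supercategory, the same dictionary can be grouped by its `supercategory` attribute. The following is a minimal sketch on a hand-made, three-entry stand-in for `coco.cats`; the helper name `group_by_supercategory` is hypothetical and not part of pycocotools:

```python
from collections import defaultdict

# Hypothetical three-entry stand-in for `coco.cats` (same key/attribute layout).
cats = {
    1: {"supercategory": "person", "id": 1, "name": "person"},
    2: {"supercategory": "vehicle", "id": 2, "name": "bicycle"},
    3: {"supercategory": "vehicle", "id": 3, "name": "car"},
}


def group_by_supercategory(cats):
    """Map each supercategory name to the sorted list of its category names."""
    groups = defaultdict(list)
    for cat in cats.values():
        groups[cat["supercategory"]].append(cat["name"])
    return {sup: sorted(names) for sup, names in sorted(groups.items())}


by_super = group_by_supercategory(cats)
for sup, names in by_super.items():
    print(f"{sup}: {', '.join(names)}")
```

With the real data, passing `coco.cats` instead of the toy dictionary should give the full grouping.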

## Number of images

Let's now see how many images this training set has. We can use the `imgs` member of the `COCO` object:

In [None]:
len(coco.imgs)

So we have 118,287 images in this dataset. We are interested in how many annotations (individual objects) each image has and how they are distributed.

`imgs` is a dictionary keyed by picture ID that contains the following information about each image:

In [None]:
coco.imgs[391895]

From this list, for now we are interested in the size of the picture (`height` and `width`). Later, when training the models, we will also have to download the images using the `coco_url` provided, so that we can make multiple passes over the dataset during training.
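One possible way to avoid downloading the same picture on every pass is to keep a local copy. The sketch below is only an illustration: the function name `fetch_image` and the `images` cache folder are hypothetical, and it assumes each image entry also carries a `file_name` key alongside `coco_url`:

```python
import urllib.request
from pathlib import Path


def fetch_image(img, cache_dir="images"):
    """Download img['coco_url'] into cache_dir once and return the local path.

    `img` is assumed to be an entry of `coco.imgs`; `cache_dir` is a
    hypothetical folder name chosen for this sketch.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / img["file_name"]
    if not target.exists():  # later passes reuse the cached copy
        urllib.request.urlretrieve(img["coco_url"], str(target))
    return target
```

A call such as `fetch_image(coco.imgs[391895])` would then hit the network only the first time and return the cached file afterwards.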

## Annotations

Additionally, the `anns` attribute contains information about the annotated objects in each picture. It is a dictionary keyed by the annotation `id`, with several pieces of information about each annotation:

In [None]:
coco.anns[156]

The interesting attributes of an annotation are the `image_id` of the image for which the annotation was created, the `area` of the `segmentation` in pixels, and the `bbox`, the bounding box of the item.
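The difference between `area` and `bbox` can be illustrated with a small sketch. COCO stores `bbox` as `[x, y, width, height]`, so the box area is the product of the last two values, while `area` counts only the pixels of the segmented object. The annotation below is made up for illustration, and the helper names are hypothetical:

```python
# Hypothetical annotation with the same keys as an entry of `coco.anns`.
anno = {
    "image_id": 1,
    "category_id": 3,
    "bbox": [10.0, 20.0, 40.0, 30.0],  # [x, y, width, height] in pixels
    "area": 700.0,                     # pixels covered by the segmentation
}


def bbox_area(anno):
    """Area of the bounding box, computed from its width and height."""
    _, _, width, height = anno["bbox"]
    return width * height


def bbox_fill_ratio(anno):
    """Fraction of the bounding box covered by the segmented object."""
    return anno["area"] / bbox_area(anno)


print(f"bbox area: {bbox_area(anno):.1f} px")
print(f"fill ratio: {bbox_fill_ratio(anno):.2f}")
```

Since the segmentation lies inside its bounding box, the fill ratio is expected to stay at or below 1.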

To get all the annotations for a given image we can use:

In [None]:
an_image_id = 475546

annIds = coco.getAnnIds(imgIds=an_image_id, iscrowd=None)
print(annIds)

And if we want to see the details:

In [None]:
for anno in annIds:
    print(coco.anns[anno])

Let's get a better understanding of the annotations across the whole dataset. We will count the number of annotations for each image and then display a histogram:

In [None]:
no_anno = []
for img in coco.imgs.values():
    ann = coco.getAnnIds(imgIds=img['id'], iscrowd=None)
    no_anno.append(len(ann))

plt.hist(no_anno, bins=50, range=(0,50), density=True)
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.show()
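The same per-image counts can also be obtained in a single pass over the annotations instead of one `getAnnIds` call per image. This is a minimal sketch on toy stand-ins for `coco.imgs` and `coco.anns`; the IDs and the helper name `annotations_per_image` are hypothetical:

```python
from collections import Counter

# Toy stand-ins for `coco.imgs` and `coco.anns` (hypothetical IDs).
imgs = {10: {"id": 10}, 11: {"id": 11}, 12: {"id": 12}}
anns = {
    1: {"id": 1, "image_id": 10},
    2: {"id": 2, "image_id": 10},
    3: {"id": 3, "image_id": 12},
}


def annotations_per_image(imgs, anns):
    """Return {image_id: number of annotations}, including images with none."""
    counts = Counter(ann["image_id"] for ann in anns.values())
    return {img_id: counts.get(img_id, 0) for img_id in imgs}


per_image = annotations_per_image(imgs, anns)
print(per_image)  # {10: 2, 11: 0, 12: 1}
```

Iterating over `imgs` rather than over the counter keeps images without any annotation in the result, with a count of zero.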

Or in a cumulative fashion:

In [None]:
plt.hist(no_anno, bins=50, range=(0,50), density=True, cumulative=True)
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.show()
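Individual points of the cumulative curve can be checked numerically. The sketch below uses a made-up list in place of `no_anno`, and the helper name `share_with_at_most` is hypothetical. Note that a histogram drawn with `range=(0,50)` ignores images with more than 50 annotations, so its curve is normalised over the remaining images, whereas this helper divides by all images:

```python
def share_with_at_most(counts, k):
    """Fraction of images that have k annotations or fewer."""
    return sum(1 for c in counts if c <= k) / len(counts)


# Made-up annotation counts standing in for `no_anno`.
toy_counts = [0, 1, 1, 2, 3, 7, 12, 30]
for k in (1, 5, 10):
    print(f"<= {k:2d} annotations: {share_with_at_most(toy_counts, k):.1%}")
```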

## Items in Images (annotations)

One of the characteristics of the annotated items in a picture is that they tend to cover only a small portion of the whole image. For example, for the `an_image_id` we selected earlier, if we extract all the annotations with their associated `area` and compare that with the overall area of the image, we get the following results:

In [None]:
an_img = coco.imgs[an_image_id]
image_area = an_img['height'] * an_img['width']

for anno in annIds:
    item_categ_id = coco.anns[anno]['category_id']
    item_area = coco.anns[anno]['area']
    item_area_perc = item_area / image_area * 100
    print(f"category: {coco.cats[item_categ_id]['name']}, area: {item_area:.1f} ({item_area_perc:.2f}%)")

As you can see, most of the items identified in the picture cover less than 3% of the whole image, with only one element covering 13.3%. This is clearly visible if we display the corresponding image and the annotations:

In [None]:
I = io.imread(coco.imgs[an_image_id]['coco_url'])
plt.imshow(I)
plt.axis('off')
coco.showAnns(coco.loadAnns(annIds))

We will perform the above calculations for each picture in the set and put the resulting data into a pandas DataFrame for easier manipulation.

In [None]:
df = pd.DataFrame(coco.anns.values(), index=coco.anns.keys())

In [None]:
df

Let's make sure that the numbers match the previously calculated number of annotations per image:

In [None]:
df[['image_id', 'id']].groupby(by="image_id").count().hist(bins=50, range=(0,50))
plt.show()

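They seem to be the same, keeping in mind that this plot shows raw counts rather than percentages and that images without any annotation do not appear in the DataFrame. Let's now add the image area and the percentage of the image covered by each annotated item: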

In [None]:
# look up the image size through the image_id column instead of a row-wise apply
df['image_height'] = df['image_id'].map(lambda i: coco.imgs[i]['height'])
df['image_width'] = df['image_id'].map(lambda i: coco.imgs[i]['width'])
df['image_area'] = df['image_height'] * df['image_width']
df['anno_area_perc'] = df['area'] / df['image_area'] * 100

Let's see the results:

In [None]:
df

And let's plot a histogram with these percentages:

In [None]:
df['anno_area_perc'].hist(bins=50, range=(0,100))
plt.gca().set_ylabel('Number of annotations')
plt.gca().set_xlabel('Annotation area as a % of total image area')
plt.show()

As expected, the vast majority is below 10%. Let's focus on that part:

In [None]:
df['anno_area_perc'].hist(bins=50, range=(0,10))
plt.gca().set_ylabel('Number of annotations')
plt.gca().set_xlabel('Annotation area as a % of total image area')
plt.title('Number of annotations by % of total image area\nclipped to 10%')
plt.show()

A cumulative histogram showing the percentage of all annotations gives an even better perspective:

In [None]:
df['anno_area_perc'].hist(bins=50, range=(0,100), density=True, cumulative=True)
plt.gca().set_ylabel('Cumulative share of annotations')
plt.gca().set_xlabel('Annotation area as a % of total image area')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.xlim(left=0, right=40)
plt.title('Cumulative annotations by % of total image area')
plt.show()
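The values read off the cumulative plot can be cross-checked with a few summary numbers. This is a sketch on made-up percentages standing in for `df['anno_area_perc']`; the helper name `share_below` is hypothetical:

```python
import statistics


def share_below(percentages, threshold):
    """Fraction of annotations whose area percentage is below threshold."""
    return sum(1 for p in percentages if p < threshold) / len(percentages)


# Made-up area percentages standing in for `df['anno_area_perc']`.
toy_perc = [0.2, 0.8, 1.5, 2.9, 4.0, 9.5, 13.3, 42.0]

median_perc = statistics.median(toy_perc)
print(f"median area: {median_perc:.2f}% of the image")
print(f"share below 10%: {share_below(toy_perc, 10):.1%}")
```

On the real DataFrame, `(df['anno_area_perc'] < 10).mean()` and `df['anno_area_perc'].median()` should give the corresponding figures directly.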

In [None]:
dir(coco)

## Bibliography

[1] T.-Y. Lin et al., ‘Microsoft COCO: Common Objects in Context’, arXiv:1405.0312 [cs], Feb. 2015. Accessed: May 05, 2021. [Online]. Available: http://arxiv.org/abs/1405.0312
