## How to deal with multi-label data

This notebook helped me understand the data transformations needed to get the data into a useful format for training and evaluating a neural network in TensorFlow. In the `COCO_MLC` class, I'll try to use more TensorFlow operations.

In [1]:
#Common imports
import numpy as np
import time, os
import pathlib
import random
from collections import defaultdict

#Reproducibility
random.seed(23)

In [2]:
ROOT_PATH = "."
DATA_DIR = os.path.join(ROOT_PATH, "data")
coco_year = 2017

In [3]:
data_dir = pathlib.Path(DATA_DIR)
data_dir

WindowsPath('data')

In [10]:
class_names = np.array(sorted([item.name for item in data_dir.glob('train/*')]))
class_names

array(['car', 'negative', 'person'], dtype='<U8')

In [11]:
def to_img_name(s):
    # Keep only the file name (last path component)
    return pathlib.Path(s).name

split = "train"
img_filen = []
for i, class_n in enumerate(class_names):
    pattern = "{}/{}/*.jpg".format(split, class_n)
    img_filen += [(filen, i) for filen in map(to_img_name, data_dir.glob(pattern))]  

img_filen[:5]

[('000000003711.jpg', 0),
 ('000000005205.jpg', 0),
 ('000000009801.jpg', 0),
 ('000000016977.jpg', 0),
 ('000000020671.jpg', 0)]

In [12]:
# Group class indices by file name: an image stored in several
# class folders ends up with several labels.
merged_dict = defaultdict(list)

for filen, label in img_filen:
    merged_dict[filen].append(label)

In [13]:
list_ds_raw = list(merged_dict.items())
list_ds_raw[:5]

[('000000003711.jpg', [0]),
 ('000000005205.jpg', [0, 2]),
 ('000000009801.jpg', [0, 2]),
 ('000000016977.jpg', [0]),
 ('000000020671.jpg', [0, 2])]
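Before going further, it can help to check how often each label combination occurs in the grouped list. The sketch below is only an illustration: `toy_grouped`, `toy_class_names`, `combo_counts` and `multi_label_count` are hypothetical names, and the file names are made up; the data has the same `(filename, [class indices])` shape as `list_ds_raw`.

```python
from collections import Counter

# Hypothetical grouped samples, same shape as list_ds_raw at this point.
toy_grouped = [("a.jpg", [0]), ("b.jpg", [0, 2]), ("c.jpg", [0, 2]), ("d.jpg", [1])]
toy_class_names = ["car", "negative", "person"]

# Count each combination of class names (tuples are hashable, lists are not).
combo_counts = Counter(
    tuple(toy_class_names[i] for i in labels) for _, labels in toy_grouped
)

# Number of images that carry more than one label.
multi_label_count = sum(1 for _, labels in toy_grouped if len(labels) > 1)

print(combo_counts)
print(multi_label_count)
```

Running the same two expressions on `list_ds_raw` and `class_names` should give the counts for the real split.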

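The next two transformations (resolving the full file path, then encoding the labels as a multi-hot vector) could be done in a single pass; they are kept separate here because this is just some testing.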

In [14]:
for ix, elem in enumerate(list_ds_raw):
    filen, labels = elem
    # Images with multiple labels have multiple possible file paths (i.e.
    # they exist in several class folders), so we simply take the first one.
    class_n = class_names[labels[0]]
    full_path = data_dir / split / class_n / filen
    list_ds_raw[ix] = (str(full_path), labels)


In [15]:
list_ds_raw[:5]

[('data\\train\\car\\000000003711.jpg', [0]),
 ('data\\train\\car\\000000005205.jpg', [0, 2]),
 ('data\\train\\car\\000000009801.jpg', [0, 2]),
 ('data\\train\\car\\000000016977.jpg', [0]),
 ('data\\train\\car\\000000020671.jpg', [0, 2])]

In [17]:
for ix, (filep, labels) in enumerate(list_ds_raw):
    # Multi-hot encoding: one slot per class, set to 1 for every class present
    label = np.zeros(len(class_names))
    label[labels] = 1

    list_ds_raw[ix] = (filep, label)

In [18]:
list_ds_raw[:5]

[('data\\train\\car\\000000003711.jpg', array([1., 0., 0.])),
 ('data\\train\\car\\000000005205.jpg', array([1., 0., 1.])),
 ('data\\train\\car\\000000009801.jpg', array([1., 0., 1.])),
 ('data\\train\\car\\000000016977.jpg', array([1., 0., 0.])),
 ('data\\train\\car\\000000020671.jpg', array([1., 0., 1.]))]
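The grouping, path-resolution and encoding steps above can be folded into one function. This is a minimal sketch under some assumptions: `build_multi_hot_dataset` and the toy inputs are hypothetical names, not part of the notebook, and the `float32` dtype is my choice (the cells above use NumPy's default `float64`).

```python
from collections import defaultdict
import pathlib

import numpy as np


def build_multi_hot_dataset(samples, class_names, data_dir, split="train"):
    """Turn (filename, class_index) pairs into (path, multi-hot vector) tuples."""
    # Step 1: collect every class index seen for each file name.
    merged = defaultdict(list)
    for filen, label in samples:
        merged[filen].append(label)

    dataset = []
    for filen, labels in merged.items():
        # Step 2: any class folder containing the image works; use the first.
        full_path = pathlib.Path(data_dir) / split / class_names[labels[0]] / filen
        # Step 3: multi-hot vector; float32 is an assumption, not from the notebook.
        multi_hot = np.zeros(len(class_names), dtype=np.float32)
        multi_hot[labels] = 1.0  # index with the list to set all classes at once
        dataset.append((str(full_path), multi_hot))
    return dataset


# Hypothetical toy input; these file names are made up.
toy_samples = [("a.jpg", 0), ("b.jpg", 0), ("a.jpg", 2), ("c.jpg", 1)]
toy_classes = ["car", "negative", "person"]
toy_dataset = build_multi_hot_dataset(toy_samples, toy_classes, "data")

for path, vector in toy_dataset:
    print(path, vector)
```

With the notebook's variables, the call would be `build_multi_hot_dataset(img_filen, class_names, data_dir, split)`.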