# Toy Generation Script

The input is a CSV with two columns:
- class label
- relative image path

There are 500000 images across 18 different classes. The toy dataset is the data for the basic experiments, used to test whether the model is working properly. It is a reduced dataset made up of 2000-3000 images of each class. If a class doesn't have enough instances, all of them are taken. The total should be approximately 30000-40000 images.
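
The size estimate above follows from capping every class at the same limit: each class contributes the smaller of its own count and the cap. A minimal sketch of that arithmetic, where `expected_toy_size` and the example counts are hypothetical and only for illustration (they are not the real per-class counts of this dataset):

```python
def expected_toy_size(class_counts, images_per_class):
    """Total images kept when every class is capped at images_per_class."""
    return sum(min(count, images_per_class) for count in class_counts.values())


# Made-up class sizes for illustration only (not the real dataset).
example_counts = {"class_a": 120_000, "class_b": 2_500, "class_c": 800}

# class_a is capped at 3000; class_b and class_c are taken whole.
toy_size = expected_toy_size(example_counts, images_per_class=3000)
```

With 18 classes and a cap of 3000, the upper bound is 54000 images, so a total of 30000-40000 implies that several classes have fewer images than the cap.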

For future experiments, the toy dataset could be reduced further.

Be careful not to take consecutive images, as they will come from the same photo shoot; shuffle before selecting them.

## Libraries

In [1]:
import numpy as np

import os
import random

## Constants

In [2]:
## Source Path and CSV
SOURCE_DIR = "../../CSVs/"

SOURCE_CSV = "preDatasetAlba_01_15.csv"

SOURCE = SOURCE_DIR + SOURCE_CSV

assert os.path.isfile(SOURCE)

## Destination Path and CSV
DEST_DIR   = "../../CSVs/"

DESTINATION_CSV = "TOY.csv"

DESTINATION = DEST_DIR + DESTINATION_CSV

### Variable Constants

In [None]:
## Number of images per class
IMAGES_PER_CLASS = 3000

## Functions

In [3]:
# Extract data from the CSV

# Returns:
#   - labels
#   - paths

def readCSV(inputcsv = SOURCE):
    # read the first two columns of the CSV as strings
    labels, paths = np.loadtxt(inputcsv, dtype=str,
                               delimiter=',', usecols=(0, 1), unpack=True)

    # the fields are strings; remove the double-quote symbol (") so the values are usable
    labels = [label.replace("\"","") for label in labels]
    paths = [path.replace("\"","") for path in paths]

    return labels, paths
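
As far as I can tell, `np.loadtxt` as called in `readCSV` splits on every comma and does not treat quotes specially, so a quoted path that itself contains a comma would be cut in two. A minimal alternative sketch using the standard `csv` module, which strips the quotes and respects them while splitting. The function name `read_csv_quoted` and the second sample row are hypothetical, made up for illustration:

```python
import csv
import io


def read_csv_quoted(file_obj):
    """Read (label, path) pairs from an open CSV file, honouring double quotes."""
    labels, paths = [], []
    for row in csv.reader(file_obj):
        # skip blank or malformed lines with fewer than two fields
        if len(row) < 2:
            continue
        labels.append(row[0])
        paths.append(row[1])
    return labels, paths


# In-memory sample; the second path is invented to show a comma inside quotes.
sample = io.StringIO(
    '"vacia","rev01/10/10_20201021 (1000).JPG"\n'
    '"gamo","dir, with comma/img.JPG"\n'
)
sample_labels, sample_paths = read_csv_quoted(sample)
```

For a file on disk, the same function can be called as `read_csv_quoted(open(SOURCE, newline=""))`.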

In [5]:
# Takes the data and builds a dictionary (with multiple values per key) where the keys are the classes
def create_class_dictionary(data):
    # create empty dictionary
    dictionary = {}

    # for each row of data [label, path]
    for label, path in data:
        # if the key doesn't exist in the dictionary yet, add it
        if label not in dictionary:
            dictionary[label] = [path]
        # if it already exists, append the new value to the list of values
        else:
            dictionary[label].append(path)

    return dictionary
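
The same grouping step can be written more compactly with `collections.defaultdict`, which creates the empty list the first time a class is seen. This is only a sketch of an equivalent approach; `group_paths_by_class` and the sample rows are hypothetical names and data:

```python
from collections import defaultdict


def group_paths_by_class(rows):
    """Map each class label to the list of its image paths, in input order."""
    grouped = defaultdict(list)
    for label, path in rows:
        grouped[label].append(path)
    # return a plain dict so missing keys raise KeyError as usual
    return dict(grouped)


# Tiny made-up example rows: [label, path]
example_rows = [["vacia", "a.JPG"], ["gamo", "b.JPG"], ["vacia", "c.JPG"]]
grouped_example = group_paths_by_class(example_rows)
```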

In [6]:
# Keeps at most IMAGES_PER_CLASS images per class, chosen at random, and stores them back under the class key
def get_number_of_imgs_each_class(dictionary):
    classes = dictionary.keys()

    for clas in classes:
        # take the images of the current class as a list
        class_imgs = dictionary[clas]
        # shuffle the images (in place) so they come from different shots
        random.shuffle(class_imgs)
        # take the first IMAGES_PER_CLASS (or all of them if the class has fewer)
        class_imgs =  class_imgs[0:IMAGES_PER_CLASS]

        dictionary[clas] = class_imgs

    return dictionary
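
`get_number_of_imgs_each_class` shuffles with the global `random` state and modifies the dictionary it receives, so each run produces a different toy set. If a reproducible selection is wanted, one option is a seeded generator that returns a new dictionary. This is a sketch under that assumption; `sample_per_class` and its `seed` parameter are hypothetical and not part of the notebook:

```python
import random


def sample_per_class(dictionary, images_per_class, seed=0):
    """Return a new dict with at most images_per_class random images per class."""
    rng = random.Random(seed)  # private generator: same seed, same selection
    sampled = {}
    for clas, imgs in dictionary.items():
        # take everything when the class has fewer images than the cap
        k = min(images_per_class, len(imgs))
        # sample without replacement; the input list is left untouched
        sampled[clas] = rng.sample(imgs, k)
    return sampled


# Made-up data for illustration only.
demo = {"a": [f"a_{i}.JPG" for i in range(10)], "b": ["b_0.JPG", "b_1.JPG"]}
first = sample_per_class(demo, 3, seed=42)
second = sample_per_class(demo, 3, seed=42)
```

Sampling without replacement is equivalent to shuffling and taking a prefix, so consecutive images from the same shoot are still unlikely to end up together.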

In [7]:
# Given a dictionary, returns a 2-column np array with its data: the key (class) is the first column and each value (image path) is the second
def create_new_nparray(dictionary):
    # collect [class, path] rows in a list; calling np.vstack once per row recopies the whole array each time
    rows = []

    for clas, class_imgs in dictionary.items():
        for img in class_imgs:
            rows.append([clas, img])

    # reshape keeps the (0, 2) shape when the dictionary is empty
    new_data = np.array(rows, dtype=str).reshape(-1, 2)

    return new_data

In [17]:
# write a csv with the data given
def write_csv(data, new_csv):
    # assert data, "Empty data list, careful!"

    if os.path.exists(new_csv):
        os.remove(new_csv)

    np.savetxt(new_csv, data, delimiter=",",fmt='%s')

Get data from CSV

In [8]:
labels, paths = readCSV()
data = np.column_stack([labels, paths])
data

array([['vacia', 'rev01/10/10_20201021 (1000).JPG'],
       ['vacia', 'rev01/10/10_20201023 (10041).JPG'],
       ['vacia', 'rev01/10/10_20201023 (10047).JPG'],
       ...,
       ['jabali', 'rev15/9/9_20220110 (936).JPG'],
       ['gamo', 'rev15/9/9_20211217 (96).JPG'],
       ['jabali', 'rev15/9/9_20220110 (975).JPG']], dtype='<U32')

Create a dictionary from the data, with the classes as keys

In [9]:
dictionary = create_class_dictionary(data)
dictionary

{'vacia': ['rev01/10/10_20201021 (1000).JPG',
  'rev01/10/10_20201023 (10041).JPG',
  'rev01/10/10_20201023 (10047).JPG',
  'rev01/10/10_20201023 (10053).JPG',
  'rev01/10/10_20201023 (10056).JPG',
  'rev01/10/10_20201023 (10058).JPG',
  'rev01/10/10_20201023 (10068).JPG',
  'rev01/10/10_20201023 (10070).JPG',
  'rev01/10/10_20201023 (10082).JPG',
  'rev01/10/10_20201021 (1010).JPG',
  'rev01/10/10_20201021 (1013).JPG',
  'rev01/10/10_20201021 (1015).JPG',
  'rev01/10/10_20201023 (10158).JPG',
  'rev01/10/10_20201023 (10159).JPG',
  'rev01/10/10_20201021 (1016).JPG',
  'rev01/10/10_20201021 (1021).JPG',
  'rev01/10/10_20201023 (10253).JPG',
  'rev01/10/10_20201021 (1032).JPG',
  'rev01/10/10_20201021 (1034).JPG',
  'rev01/10/10_20201023 (10346).JPG',
  'rev01/10/10_20201023 (10347).JPG',
  'rev01/10/10_20201023 (10348).JPG',
  'rev01/10/10_20201021 (1038).JPG',
  'rev01/10/10_20201023 (10384).JPG',
  'rev01/10/10_20201023 (10387).JPG',
  'rev01/10/10_20201023 (10392).JPG',
  'rev01/10/

Replace each class's list in the dictionary with a random selection of its images

In [10]:
dictionary = get_number_of_imgs_each_class(dictionary)
dictionary

{'vacia': ['rev09/36/36_20210619 (4898).JPG',
  'rev09/49/49_20210704 (24833).JPG',
  'rev08/20/20_20210529 (16059).JPG',
  'rev13/18/18_20211024 (3094).JPG',
  'rev09/5/5_20210709 (6191).JPG',
  'rev13/32/32_20211026 (4008).JPG',
  'rev05/2/2_20210303 (2136).JPG',
  'rev07/2/2_20210501 (20959).JPG',
  'rev07/2/2_20210502 (23065).JPG',
  'rev14/4/4_20211217 (13350).JPG',
  'rev08/24/24_20210517 (6173).JPG',
  'rev08/2/2_20210516 (9780).JPG',
  'rev08/49/49_20210523 (18205).JPG',
  'rev01/15/15_20201108 (16586).JPG',
  'rev11/49/49_20210816 (353).JPG',
  'rev11/49/49_20210816 (1998).JPG',
  'rev10/25/25_20210713 (283).JPG',
  'rev11/49/49_20210819 (9550).JPG',
  'rev13/4/4_20211106 (15343).JPG',
  'rev14/36/36_20211201 (1206).JPG',
  'rev15/25/25_20211228 (58).JPG',
  'rev08/49/49_20210525 (21192).JPG',
  'rev05/24/24_20210317 (3991).JPG',
  'rev04/34/34_20210125 (2396).JPG',
  'rev06/25/25_20210323 (3767).JPG',
  'rev11/10/10_20210904 (18301).JPG',
  'rev06/20/20_20210410 (12422).JPG',

Create a new numpy array with the data we want in the new CSV, which is stored in the dictionary

In [13]:
new_data = create_new_nparray(dictionary)

Write the CSV

In [18]:
write_csv(new_data, DESTINATION)
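
Before using the toy CSV, it may be worth checking that no class exceeds the cap and that the total falls in the expected range. A minimal sketch of such a check; `count_per_class` and `classes_over_cap` are hypothetical helpers, and the rows below are made up rather than taken from the real output:

```python
from collections import Counter


def count_per_class(rows):
    """Count how many [label, path] rows each class has."""
    return Counter(label for label, _path in rows)


def classes_over_cap(rows, images_per_class):
    """Return the classes (with their counts) that exceed images_per_class."""
    counts = count_per_class(rows)
    return {clas: n for clas, n in counts.items() if n > images_per_class}


# Made-up rows for illustration; with the notebook data, pass new_data instead.
check_rows = [["vacia", "a.JPG"], ["vacia", "b.JPG"], ["gamo", "c.JPG"]]
check_counts = count_per_class(check_rows)
check_over = classes_over_cap(check_rows, images_per_class=1)
```

In the notebook this would be called as `classes_over_cap(new_data, IMAGES_PER_CLASS)`, which should return an empty dictionary, and `sum(count_per_class(new_data).values())` gives the total number of images.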