## Building training and validation image sets for a data-driven contest

### Author/ML Engineer: Leon Hamnett - [linkedIn](https://www.linkedin.com/in/leon-hamnett/)



### Introduction:

As part of a team of machine learning engineers, I took part in a [data-driven contest](https://https-deeplearning-ai.github.io/data-centric-comp/) organised by Andrew Ng (a well-known machine learning teacher and researcher). The aim of this competition was to improve the quality of the dataset rather than the machine learning model itself.

During this contest we created a number of different image datasets using such methods as cleaning and relabelling the existing dataset, creating synthetic data and applying a number of different image transforms and augmentations on the images. 

As each submission could only contain a maximum of 10,000 images, it was necessary to shuffle all of our image datasets together and select 10,000 images to be included in the training and validation image datasets.

This notebook was used to randomly choose 10,000 images from a number of different folders containing different image datasets, aiming for a balanced number of images in each class (10 classes total).
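
As a rough illustration of the budget arithmetic, the sketch below splits the 10,000-image cap evenly over the ten classes and then divides each class quota into training and validation images. The function name `per_class_budget` and its parameters are hypothetical and used only for illustration; the 90/10 ratio is taken from the copying cell further down in this notebook.

```python
def per_class_budget(max_images=10_000, num_classes=10, train_fraction=0.9):
    """Return (train_per_class, val_per_class) under an overall image cap.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    per_class = max_images // num_classes        # images allowed per class
    train = round(per_class * train_fraction)    # training share of the quota
    val = per_class - train                      # the remainder goes to validation
    return train, val


train_per_class, val_per_class = per_class_budget()
print(train_per_class, val_per_class)
```

With the default values this gives 900 training and 100 validation images per class, which is the cap applied later when a class folder holds more than 1,000 images.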

### Importing libraries:

In this notebook, we rely heavily on the shutil and os libraries for file and folder manipulation. The code is also written with efficiency in mind, so that the shuffling and copying of images into new datasets can be completed as quickly as possible.


In [None]:
#install pyfastcopy, then import the libraries
!pip3 install pyfastcopy
import os
from os import getcwd
import shutil
import numpy as np
import pyfastcopy #importing this patches shutil's file copying to speed it up



In [None]:
#mount google drive so we can access the files and folders with all our images
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#check where the current working directory is located
print(getcwd())
image_dir = '/content/drive/MyDrive/Datadriven_contest/Image_datasets'
shuffled_dir = os.path.join(image_dir,'shuffled_temp')
numerals = ['i', 'ii', 'iii', 'iv','v', 'vi', 'vii', 'viii', 'ix', 'x'] #list of numerals to create folders, each numeral is an image class

/content


### Creating the folder structure:

Firstly we will create the folder structure to copy the images into for our newly shuffled and created dataset. If the folder structure already exists from previous iterations, we delete it and we start with empty folders again.

In [None]:
#make a new temp directory
if os.path.exists(shuffled_dir): #check if folder already exists
  print('Deleting existing shuffled_temp directory')
  shutil.rmtree(shuffled_dir) #delete folder
print('making new shuffled temp directory')
os.mkdir(shuffled_dir) #create folder

#make new train and valid directories
new_train_dir = os.path.join(shuffled_dir,'train')
os.mkdir(new_train_dir)
new_valid_dir = os.path.join(shuffled_dir,'val')
os.mkdir(new_valid_dir)

#make numeral/class sub folders for both train and validation sets
for folder in os.listdir(shuffled_dir):
  for x in numerals:
    folder_to_make_temp = os.path.join(shuffled_dir,folder,x)
    os.mkdir(folder_to_make_temp)
    print('Created {} numeral: {}'.format(folder,x))

Deleting existing shuffled_temp directory
making new shuffled temp directory
Created train numeral: i
Created train numeral: ii
Created train numeral: iii
Created train numeral: iv
Created train numeral: v
Created train numeral: vi
Created train numeral: vii
Created train numeral: viii
Created train numeral: ix
Created train numeral: x
Created val numeral: i
Created val numeral: ii
Created val numeral: iii
Created val numeral: iv
Created val numeral: v
Created val numeral: vi
Created val numeral: vii
Created val numeral: viii
Created val numeral: ix
Created val numeral: x
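
The folder setup above can also be written more compactly with `os.makedirs`, which creates intermediate folders in one call. This is only a sketch: the helper name `reset_split_dirs` is hypothetical, and the demo runs in a temporary directory instead of the Google Drive path used in this notebook.

```python
import os
import shutil
import tempfile

numerals = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x']


def reset_split_dirs(root, splits=('train', 'val'), classes=numerals):
    """Delete root if it exists, then recreate root/<split>/<class> folders."""
    if os.path.exists(root):
        shutil.rmtree(root)  # start again from empty folders
    for split in splits:
        for cls in classes:
            # makedirs also creates root and root/<split> when they are missing
            os.makedirs(os.path.join(root, split, cls))


# Demo in a throwaway location (assumption: not the real dataset folder)
demo_root = os.path.join(tempfile.mkdtemp(), 'shuffled_temp')
reset_split_dirs(demo_root)
print(sorted(os.listdir(demo_root)))
```

Calling `reset_split_dirs` a second time on the same path wipes and rebuilds the structure, which mirrors the delete-and-recreate behaviour of the cell above.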


### Check how many images we have:

Now we count how many images we have in all of the different folders we want to shuffle. If an image dataset has been uploaded to Google Drive as a .zip file, we unzip it first.

Next we go through all of the folders and count the images in each class. We keep a running total of the images per class, as well as the total number of images.

This also serves as a sanity check that the code is looking through the folders correctly and will be able to copy the images later. It also flags any empty folders or missing classes, which we can then rectify as needed.

In [None]:
# #unzip folders if needed. Syntax: !unzip zipped_file -d folder_to_unzip_to
# os.chdir(image_dir)
# !unzip "augmented_shift_rotate.zip" -d "aug_shift_rotate"

Archive:  augmented_shift_rotate.zip
replace aug_shift_rotate/data_modified/rotation/ii/rot_0_5733.png? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
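
The prompt in the output above appears because the files had already been extracted once. As one possible alternative, the standard library `zipfile` module can extract an archive from Python and overwrites existing files without asking. The sketch below is self-contained: the helper name `unzip_to`, the archive `demo.zip` and the file inside it are all made up for the demonstration.

```python
import os
import tempfile
import zipfile


def unzip_to(zip_path, target_dir):
    """Extract every member of zip_path into target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)


# Build a tiny demo archive in a temporary folder (assumption: stand-in data)
work = tempfile.mkdtemp()
zip_path = os.path.join(work, 'demo.zip')
with zipfile.ZipFile(zip_path, 'w') as archive:
    archive.writestr('ii/rot_0_demo.png', b'not a real image')

target = os.path.join(work, 'unzipped')
unzip_to(zip_path, target)
unzip_to(zip_path, target)  # extracting again overwrites without prompting
print(os.listdir(os.path.join(target, 'ii')))
```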

Next we choose from all of our image folders the images that we would like to shuffle into our training and validation sets.

In [None]:
#Choose folders to include (reference directory with i,ii,iii etc folders)
baseline = os.path.join(image_dir,'baseline/baseline/combined_train_val_baseline')
image_gen = os.path.join(image_dir,'image_gen/image_gen')
augmented_zoom = os.path.join(image_dir,'augmented_zoom_flip/Zoom')
augmented_flip = os.path.join(image_dir,'augmented_zoom_flip/Flip')
augmented_thresh = os.path.join(image_dir,'augmented_thresh')
augmented_blur = os.path.join(image_dir,'augmented_blur')
image_gen2 = os.path.join(image_dir,'image_gen2_curls_aug_transl')
image_gen3 = os.path.join(image_dir,'image_gen3_no_aug')
image_gen4 = os.path.join(image_dir,'image_gen4_no_aug')
test = os.path.join(image_dir,'baseline/baseline/test/label_book')
final = os.path.join(image_dir,'processed_images_01_train_val')
final2 = os.path.join(image_dir,'processed_images_02')
shuf_temp_train = os.path.join(image_dir,'shuffled_temp/train')
shuf_temp_val = os.path.join(image_dir,'shuffled_temp/val')
shuffled_temp = os.path.join(image_dir,'processed_images_02_train_val/val')
processed3 = os.path.join(image_dir,'processed_images_03')
augmented_shift = os.path.join(image_dir,'aug_shift_rotate/data_modified/shift')
augmented_rotate = os.path.join(image_dir,'aug_shift_rotate/data_modified/rotation')
processed4 = os.path.join(image_dir,'processed_images_04/train')

#folders_to_shuffle = [processed3,augmented_zoom,augmented_flip,augmented_thresh,augmented_blur,augmented_shift,augmented_rotate] 
folders_to_shuffle = [processed4]

#init counters: totals[0] is the overall total, totals[1] to totals[10] are the per-class counts
totals = [0]*11

#for each folder loop through numeral folder and count images inside
for folder in folders_to_shuffle:
  for numeral in range(len(numerals)):
    try:
      temp_folder = os.path.join(folder,numerals[numeral]) #set folder location as the one we are looking at
      temp_count = len(os.listdir(temp_folder)) #count number of images
      totals[numeral+1]+=temp_count #add image number to relevant class counter
      totals[0]+=temp_count #add image number to totals
    except OSError as e: #missing or unreadable class folder: report it and move on
      print('skipping {}: {}'.format(temp_folder,e))
      continue
    print('folder {} numeral {} count: {} total: {}'.format(folder,numeral+1,temp_count,totals[0]))

#print totals once counting is finished
print(totals)

folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 1 count: 2390 total: 2390
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 2 count: 2005 total: 4395
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 3 count: 1700 total: 6095
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 4 count: 1885 total: 7980
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 5 count: 1911 total: 9891
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 6 count: 1637 total: 11528
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 7 count: 1599 total: 13127
folder /content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train numeral 8 count: 1628 total: 14755
folde
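
The same count can be kept in a dictionary keyed by class name, which makes it easy to see at a glance whether the classes are balanced. This is a minimal sketch under stated assumptions: `count_images` is a hypothetical helper, and the demo builds a small throwaway folder with empty placeholder files instead of reading the real datasets.

```python
import os
import tempfile

numerals = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x']


def count_images(folders, classes=numerals):
    """Return {class_name: number of files} summed over all folders."""
    counts = {cls: 0 for cls in classes}
    for folder in folders:
        for cls in classes:
            class_dir = os.path.join(folder, cls)
            if not os.path.isdir(class_dir):
                print('missing class folder:', class_dir)  # flag, then skip
                continue
            counts[cls] += len(os.listdir(class_dir))
    return counts


# Demo data: 3 placeholder files for class 'i' and 2 for class 'ii'
demo = tempfile.mkdtemp()
for cls, n in (('i', 3), ('ii', 2)):
    os.makedirs(os.path.join(demo, cls))
    for k in range(n):
        open(os.path.join(demo, cls, 'img_{}.png'.format(k)), 'w').close()

counts = count_images([demo])
print(counts['i'], counts['ii'], sum(counts.values()))
```

Classes with no folder are reported and left at zero, so a missing class shows up both in the printed messages and in the final counts.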

### Shuffling images and copying images to make new dataset:

Now we randomly choose an appropriate number of training and validation images from all of the above folders and copy them into our empty folders.

In [None]:
#choose number of images from each folder and shuffle into train valid
for folder in folders_to_shuffle:
  temp_folder_path = os.path.join(image_dir,folder) #folder is already an absolute path, so the join returns it unchanged
  print(temp_folder_path)
  for numeral in numerals: #loop through numerals/classes
      print('starting numeral {}'.format(numeral))
      try: # catch errors such as a missing class folder
        temp_num_path = os.path.join(temp_folder_path,numeral) #set path to each folder containing a set of numerals
        temp_image_list = os.listdir(temp_num_path) #get list of images
        total_images_in_folder = len(temp_image_list) #get number of images
        
        #get indexes from image list and assign to train/valid sets
        if total_images_in_folder > 1000: #more than 1000 images for this numeral: cap the selection from this folder at 1000
          num_train_images = 900 #set 90% of images as train, rest validation
          num_val_images = 100
        else: #1000 or fewer images for this numeral: split what we have 90/10
          num_train_images = int(total_images_in_folder * 0.9)
          num_val_images = int(total_images_in_folder * 0.1)

        #create list of indexes corresponding to an image, choose indices as validation and train
        val_indexes_to_copy = np.random.choice(np.arange(len(temp_image_list)),size=num_val_images,replace=False) #randomly choose the validation images
        train_all = [x for x in np.arange(len(temp_image_list)) if x not in val_indexes_to_copy] #choose indices not in validation set
        train_indexes_to_copy = np.random.choice(train_all,size=num_train_images,replace=False) # from indices not in val set, choose appropriate number of train images

        #process/copy train_images
        for index in train_indexes_to_copy: #loop through the training indices
          try:
                temp_img_path = os.path.join(temp_folder_path,numeral,temp_image_list[index]) #get image path
                copy_path = os.path.join(new_train_dir,numeral,temp_image_list[index]) #make path to copy image
                shutil.copy2(temp_img_path,copy_path) #copy into relevant train folder
                #!cp $temp_img_path $copy_path
                # with open(temp_img_path, 'rb') as fin:
                #   with open(copy_path, 'wb') as fout:
                #     shutil.copyfileobj(fin, fout, 128*1024)
          except Exception as e: 
              print(e)
              continue

        print('numeral {} train images finished copying'.format(numeral))

        #process valid_images
        for index in val_indexes_to_copy:
          try:
                temp_img_path = os.path.join(temp_folder_path,numeral,temp_image_list[index]) #get image path
                copy_path = os.path.join(new_valid_dir,numeral,temp_image_list[index])
                shutil.copy2(temp_img_path,copy_path) #copy into relevant valid folder
          except Exception as e: 
              print(e)
              continue
        print('numeral {} valid images finished copying'.format(numeral))

      except Exception as e:
          print(e)
          continue
      
      print(numeral, ' numeral has finished shuffling')
print('all numerals and folders finished')

/content/drive/MyDrive/Datadriven_contest/Image_datasets/processed_images_04/train
starting numeral i
numeral i train images finished copying
numeral i valid images finished copying
i  numeral has finished shuffling
starting numeral ii
numeral ii train images finished copying
numeral ii valid images finished copying
ii  numeral has finished shuffling
starting numeral iii
numeral iii train images finished copying
numeral iii valid images finished copying
iii  numeral has finished shuffling
starting numeral iv
numeral iv train images finished copying
numeral iv valid images finished copying
iv  numeral has finished shuffling
starting numeral v
numeral v train images finished copying
numeral v valid images finished copying
v  numeral has finished shuffling
starting numeral vi
numeral vi train images finished copying
numeral vi valid images finished copying
vi  numeral has finished shuffling
starting numeral vii
numeral vii train images finished copying
numeral vii valid images finished co

Now we have obtained our new shuffled and copied dataset.

It should be noted that the above code has been optimised to copy images as efficiently as possible.

Firstly, instead of shuffling a list of filenames and choosing from it, we work with a list of integer indices. Shuffling and choosing from a list of integers is much quicker than doing the same with a list of long filename strings. Each chosen integer is then used as an index into the list of image filenames to retrieve the corresponding image.
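
The index-based selection can be sketched with a single random permutation, as below. This is not the exact code from the cell above: `split_indices` is a hypothetical helper, it uses NumPy's `default_rng` generator, and it puts any rounding remainder into the training set, whereas the cell above rounds both shares down.

```python
import numpy as np


def split_indices(num_images, max_per_class=1000, val_fraction=0.1, seed=None):
    """Return (train_idx, val_idx): disjoint random indices into an image list."""
    rng = np.random.default_rng(seed)
    keep = min(num_images, max_per_class)       # cap the number of images used
    order = rng.permutation(num_images)[:keep]  # shuffle integers, not filenames
    num_val = int(keep * val_fraction)
    return order[num_val:], order[:num_val]


# 2390 is the class 'i' count reported earlier in this notebook
train_idx, val_idx = split_indices(2390, seed=0)
print(len(train_idx), len(val_idx))
```

Because both sets are slices of the same permutation, no image can end up in both training and validation. A filename is then looked up with `temp_image_list[index]`, as in the copying loops above.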

Secondly, the copying operation itself has been optimised. Three different approaches were compared:
1. The shutil copy2 function: ```shutil.copy2(temp_img_path,copy_path)```
2. The command line cp command: ```!cp $temp_img_path $copy_path```
3. Writing the images directly:
```
with open(temp_img_path, 'rb') as fin:
    with open(copy_path, 'wb') as fout:
        shutil.copyfileobj(fin, fout, 128*1024)
```
         
After some experimentation and iteration, we found that the shutil copy2 function, with the pyfastcopy package installed, was the quickest and most efficient way to copy a large number of images.
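
A comparison of this kind can be approximated with a small timing harness such as the one below. It is a sketch under several assumptions: the helper names `time_copy` and `copy_with_fileobj` are hypothetical, the files are random bytes written to a local temporary folder, and neither pyfastcopy nor the `!cp` shell command is included. Timings measured this way will not necessarily match those seen when copying real images on Google Drive.

```python
import os
import shutil
import tempfile
import time


def copy_with_fileobj(src, dst, buffer_size=128 * 1024):
    """Copy file contents in chunks, like option 3 in the list above."""
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout, buffer_size)


def time_copy(copy_fn, sources, dest_dir):
    """Copy every source file into dest_dir and return the elapsed seconds."""
    os.makedirs(dest_dir, exist_ok=True)
    start = time.perf_counter()
    for src in sources:
        copy_fn(src, os.path.join(dest_dir, os.path.basename(src)))
    return time.perf_counter() - start


# Create 50 small stand-in files (assumption: random bytes, not real images)
work = tempfile.mkdtemp()
sources = []
for k in range(50):
    path = os.path.join(work, 'img_{}.bin'.format(k))
    with open(path, 'wb') as f:
        f.write(os.urandom(32 * 1024))
    sources.append(path)

timings = {
    'shutil.copy2': time_copy(shutil.copy2, sources, os.path.join(work, 'out_copy2')),
    'copyfileobj': time_copy(copy_with_fileobj, sources, os.path.join(work, 'out_fileobj')),
}
for name, seconds in timings.items():
    print('{}: {:.4f} s'.format(name, seconds))
```

Running each method several times and on the real storage location would give a fairer picture, since a single pass over small local files is dominated by caching effects.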

In [None]:
#if needed zip the new dataset
os.chdir(image_dir)
!zip -r processed_images_01_train_val.zip processed_images_01_train_val