# Venice Boat Classification - Prepare Dataset

The data is split into two folders, both located inside the 'Dataset' folder: "sc5" for training and "sc5-2013-Mar-Apr-Test-20130412" for testing. The two folders do not share the same structure:
* The training set separates categories by folder, with each folder containing the images of that category.
* The testing set has a single folder with all the files merged, plus a ground truth text file that maps each file name to its category.

# Prepare testing set
To work with the same structure in both sets, the testing set is reorganized into a new folder called 'sc5-test', separated by categories in the same way as the training set.
However, some category names differ between the two sets, namely:

<table style="text-align:center">
    <thead>
        <tr><th colspan="2" style="text-align:center">Category names</th></tr>
        <tr><th style="text-align:center">Train</th><th style="text-align:center">Test</th></tr>
    </thead>
    <tbody>
        <tr><td>Water</td><td>Snapshot Acqua</td></tr>
        <tr><td>lanciafino10mbianca</td><td>Lancia: fino 10 m Bianca</td></tr>
        <tr><td>VaporettoACTV</td><td>Vaporetto ACTV</td></tr>
    </tbody>
</table>

The code below therefore also standardizes the category folder names. It expects the ground truth file, named 'sc5-2013-GroundTruth.txt', to have been moved out of the original test folder and into the 'Dataset' folder.
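The renaming rules implied by the table can be isolated in a small function, which makes them easy to check before any file is touched. The sketch below is only an illustration: `normalize_label` is a hypothetical helper name that does not appear in the original notebook, and it simply mirrors the replacements applied in the copy loop further down.

```python
def normalize_label(label):
    """Map a ground-truth label to the training folder naming (hypothetical helper)."""
    # Lowercase, then drop spaces, colons and the word 'snapshot'
    cleaned = label.lower().replace(' ', '').replace(':', '').replace('snapshot', '')
    # The Italian 'acqua' corresponds to the 'water' training folder
    return 'water' if cleaned == 'acqua' else cleaned


# Test-set names taken from the table and the ground truth file
for name in ['Snapshot Acqua', 'Lancia: fino 10 m Bianca', 'Vaporetto ACTV']:
    print(name, '->', normalize_label(name))
```

Keeping the rule in one place means the same function could be reused if further naming mismatches turn up.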

In [39]:
import os 
import matplotlib.pyplot as plt
import numpy as np
import shutil

#Define paths
base_path = 'Dataset'
test_path = base_path + '/' + 'sc5-2013-Mar-Apr-Test-20130412'
train_path = base_path + '/' + 'sc5'
test_path_ordered = base_path + '/' + 'sc5-test'

In [27]:
#Create new folder for testing set
if not os.path.exists(test_path_ordered):
    os.mkdir(test_path_ordered)

#Get the list of training labels in lowercase
labels_train = [name.lower() for name in os.listdir(train_path)]
print("Total labels: " + str(len(labels_train)))
labels_train

Total labels: 24


['alilaguna',
 'ambulanza',
 'barchino',
 'cacciapesca',
 'caorlina',
 'gondola',
 'lanciafino10m',
 'lanciafino10mbianca',
 'lanciafino10mmarrone',
 'lanciamaggioredi10mbianca',
 'lanciamaggioredi10mmarrone',
 'motobarca',
 'motopontonerettangolare',
 'motoscafoactv',
 'mototopo',
 'patanella',
 'polizia',
 'raccoltarifiuti',
 'sandoloaremi',
 'sanpierota',
 'topa',
 'vaporettoactv',
 'vigilidelfuoco',
 'water']

In [3]:
# Read the ground truth file; the context manager closes it after reading
with open(os.path.join(base_path, 'sc5-2013-GroundTruth.txt'), 'r') as f:
    rows = f.read().split('\n')
rows[:5]

['20130412_043104_54559.jpg;Snapshot Acqua',
 '20130412_043117_54573.jpg;Motobarca',
 '20130412_043148_54819.jpg;Vaporetto ACTV',
 '20130412_043218_54895.jpg;Mototopo',
 '20130412_043335_55056.jpg;Mototopo']
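Each row has the form `filename;label`, so the class distribution of the test set can be inspected before copying anything. The following is a minimal sketch under that assumption; `count_labels` is a hypothetical helper, and the sample rows are the five shown in the output above plus an empty string standing in for a trailing blank line.

```python
from collections import Counter


def count_labels(rows):
    """Count images per raw label in 'filename;label' rows (hypothetical helper)."""
    counts = Counter()
    for row in rows:
        if not row.strip():
            continue  # skip blank lines, such as the one after a final newline
        _, label = row.split(';', 1)
        counts[label.strip()] += 1
    return counts


sample_rows = [
    '20130412_043104_54559.jpg;Snapshot Acqua',
    '20130412_043117_54573.jpg;Motobarca',
    '20130412_043148_54819.jpg;Vaporetto ACTV',
    '20130412_043218_54895.jpg;Mototopo',
    '20130412_043335_55056.jpg;Mototopo',
    '',
]
print(count_labels(sample_rows).most_common())
```

In the notebook, `count_labels(rows)` would give the counts for the full ground truth file.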

In [40]:
#Loop over the ground truth rows and copy each image from the original folder into the folder of its normalized category in the new test folder.
i = 0
num_rows = len(rows)
print("[INFO] Preprocessing begins..." )
for row in rows:
    if row == "": continue
    [file, label] = row.split(';')
    
    #Removes spaces, colons and word 'snapshot'. Replaces word 'acqua' by 'water'
    label = label.lower().replace(' ', '').replace(':','').replace('snapshot','') 
    if label == 'acqua':
        label = 'water'
    folder = os.path.join(test_path_ordered,label)
    if not os.path.exists(folder):
        os.mkdir(folder)
    shutil.copyfile(os.path.join(test_path, file), os.path.join(folder,file))    
    i += 1
    if i%100 == 0:
        print("[INFO] " + str(i) + " of " + str(num_rows) + " elements processed." )
print("[INFO] Test folder preprocessing finished.")

[INFO] Preprocessing begins...
[INFO] 100 of 1970 elements processed.
[INFO] 200 of 1970 elements processed.
[INFO] 300 of 1970 elements processed.
[INFO] 400 of 1970 elements processed.
[INFO] 500 of 1970 elements processed.
[INFO] 600 of 1970 elements processed.
[INFO] 700 of 1970 elements processed.
[INFO] 800 of 1970 elements processed.
[INFO] 900 of 1970 elements processed.
[INFO] 1000 of 1970 elements processed.
[INFO] 1100 of 1970 elements processed.
[INFO] 1200 of 1970 elements processed.
[INFO] 1300 of 1970 elements processed.
[INFO] 1400 of 1970 elements processed.
[INFO] 1500 of 1970 elements processed.
[INFO] 1600 of 1970 elements processed.
[INFO] 1700 of 1970 elements processed.
[INFO] 1800 of 1970 elements processed.
[INFO] 1900 of 1970 elements processed.
[INFO] Test folder preprocessing finished.
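Once the copy has finished, one way to confirm that the renaming worked is to compare the category folders of both sets. The sketch below assumes the folder layout described earlier ('Dataset/sc5' and 'Dataset/sc5-test'); `compare_categories` is a hypothetical helper, and training names are lowercased on the fly because the training folders have not been renamed at this point.

```python
import os


def compare_categories(train_dir, test_dir):
    """Return (test-only, train-only) category names (hypothetical helper)."""
    train = {d.lower() for d in os.listdir(train_dir)
             if os.path.isdir(os.path.join(train_dir, d))}
    test = {d.lower() for d in os.listdir(test_dir)
            if os.path.isdir(os.path.join(test_dir, d))}
    # Categories in the test set with no training counterpart, and vice versa
    return sorted(test - train), sorted(train - test)


# Only run the check if both folders exist (paths assumed from the cells above)
if os.path.isdir('Dataset/sc5') and os.path.isdir('Dataset/sc5-test'):
    unknown, missing = compare_categories('Dataset/sc5', 'Dataset/sc5-test')
    print("In test but not in train:", unknown)
    print("In train but not in test:", missing)
```

An empty first list would suggest that every test label maps onto a training folder; a non-empty second list only means that some training categories have no test images.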


The folder names in the training folder are converted to lowercase too, to match the names in the testing folder.

In [41]:
for folder in os.listdir(train_path):
    os.rename(train_path +'/'+ folder,train_path  +'/'+ folder.lower())

Optionally, the original test folder can be deleted, since it is no longer needed.
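Because the deletion cannot be undone, it may be worth confirming first that the reorganized folder holds as many files as the ground truth lists. This is a rough sketch, not part of the original notebook: `count_files` and `copy_is_complete` are hypothetical helpers, and the check assumes each file name appears only once in the ground truth and that the new folder contains nothing but the copied images.

```python
import os


def count_files(root):
    """Count regular files under root, recursively."""
    return sum(len(files) for _, _, files in os.walk(root))


def copy_is_complete(rows, ordered_dir):
    """Compare the number of copied files with the unique file names in rows."""
    expected = len({row.split(';', 1)[0] for row in rows if row.strip()})
    return count_files(ordered_dir) == expected


# Example use, assuming `rows`, `test_path` and `test_path_ordered` from the cells above:
# if copy_is_complete(rows, test_path_ordered):
#     shutil.rmtree(test_path)
```

A mismatch does not say which files are missing, only that the folder should be inspected before the original is removed.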

In [44]:
shutil.rmtree(test_path)

In [46]:
#Final folder structure
!tree /a

Folder PATH listing for volume Datos
Volume serial number is 4A52-21EF
D:.
+---.ipynb_checkpoints
+---.vs
|   \---HW2
|       \---v15
+---Dataset
|   +---.ipynb_checkpoints
|   +---dstest
|   |   +---alilaguna
|   |   +---ambulanza
|   |   +---barchino
|   |   +---gondola
|   |   +---lanciafino10m
|   |   +---lanciafino10mbianca
|   |   +---lanciafino10mmarrone
|   |   +---lanciamaggioredi10mbianca
|   |   +---motobarca
|   |   +---motopontonerettangolare
|   |   +---motoscafoactv
|   |   +---mototopo
|   |   +---patanella
|   |   +---polizia
|   |   +---raccoltarifiuti
|   |   +---sandoloaremi
|   |   +---topa
|   |   \---vaporettoactv
|   +---dstrain
|   |   +---alilaguna
|   |   +---ambulanza
|   |   +---barchino
|   |   +---gondola
|   |   +---lanciafino10m
|   |   +---lanciafino10mbianca
|   |   +---lanciafino10mmarrone
|   |   +---lanciamaggioredi10mbianca
|   |   +---motobarca
|   |   +---motopontonerettangolare
|   |   +---motoscafoactv
|   |   +---mototopo
|   |   +---patanell