# Dataset analysis and preparation

Analysis of the GTSRB dataset and creation of an enhanced dataset

## Objectives:

- Understand the complexity associated with data, even when it is only images
- Learn how to build up a simple and usable image dataset

The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset of more than 50,000 photos of road signs from about 40 classes. The final aim is to recognise them!

## Step 1 - Import and Init

In [23]:
import os, time, sys
import csv
import math, random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py
from tqdm import tqdm

from skimage.morphology import disk
from skimage.util import img_as_ubyte
from skimage.filters import rank
from skimage import io, color, exposure, transform

from sklearn.utils import shuffle

from importlib import reload


## Step 2 - Parameters
Generating the datasets may take significant time and disk space: about **10 minutes and 10 GB**.  

You can either run a quick test or generate the whole enhanced dataset by setting the following parameters:  
`scale`: 1 means 100% of the dataset; set 0.2 for tests (about 2 minutes with `scale = 0.2`)  
`progress_verbosity`: verbosity of the progress bar: 0=silent, 1=progress bar, 2=one line  
`output_dir`: where to write the enhanced dataset, which can be:
 - `./data`, for testing purposes
 - `<datasets_dir>/GTSRB/enhanced`, to write the enhanced dataset into your datasets directory.  

In [24]:
# ---- For quick tests :
#
scale      = 0.2
output_dir = './data' 

# ---- For a Full dataset generation :
#
# scale      = 1
# output_dir = f'{datasets_dir}/GTSRB/enhanced'

# ---- Verbosity
#
progress_verbosity = 2
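
As an illustration of what `scale` could control, here is a minimal sketch that keeps only a fraction of a dataset. The helper name `rescale_dataset` and the toy arrays are assumptions made for this example; the notebook itself does not show how `scale` is applied.

```python
import numpy as np

def rescale_dataset(x, y, scale=1.0):
    """Keep the first `scale` fraction of x and y (hypothetical helper)."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    n = int(len(x) * scale)
    return x[:n], y[:n]

# Toy data standing in for images and labels
x_demo = np.arange(10)
y_demo = np.arange(10) % 3

# With scale=0.2, only 20% of the samples are kept
x_small, y_small = rescale_dataset(x_demo, y_demo, scale=0.2)
print(len(x_small), len(y_small))
```

Because this sketch simply takes the first samples, it should only be applied to data that has already been shuffled; otherwise some classes may be missing from the reduced set.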

## Step 3 - Read the dataset
A description is available here: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
 - Each directory contains the training images and one CSV annotation file: `GT-<ClassID>.csv`
 - The first line holds the field names: `Filename ; Width ; Height ; Roi.X1 ; Roi.Y1 ; Roi.X2 ; Roi.Y2 ; ClassId`
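
The `Roi.*` fields describe a bounding box around the sign inside each image. As a rough sketch, cropping to that box could look like the code below. The helper `crop_to_roi` is hypothetical, the numeric values are illustrative, and treating both corners as inclusive is an assumption about the annotation convention.

```python
import numpy as np

def crop_to_roi(image, x1, y1, x2, y2):
    """Return the sub-image inside the ROI box (corners assumed inclusive)."""
    # Image arrays are indexed [row, column], i.e. [y, x]
    return image[y1:y2 + 1, x1:x2 + 1]

# Blank stand-in image: height 54, width 53, 3 colour channels
image = np.zeros((54, 53, 3), dtype=np.uint8)

sign = crop_to_roi(image, x1=6, y1=5, x2=48, y2=49)
print(sign.shape)
```

Note the axis order: `Width` and the X coordinates map to the second array axis, `Height` and the Y coordinates to the first.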
    
### 3.1 - Understanding the dataset

In [25]:
df = pd.read_csv('data/GT-Test.csv', header=0)
df.head(10)

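| Unnamed: 0 | Width | Height | Roi.X1 | Roi.Y1 | Roi.X2 | Roi.Y2 | ClassId | Path |
|---|---|---|---|---|---|---|---|---|
| 0 | 53 | 54 | 6 | 5 | 48 | 49 | 16 | Test/00000.png |
| 1 | 42 | 45 | 5 | 5 | 36 | 40 | 1 | Test/00001.png |
| 2 | 48 | 52 | 6 | 6 | 43 | 47 | 38 | Test/00002.png |
| 3 | 27 | 29 | 5 | 5 | 22 | 24 | 33 | Test/00003.png |
| 4 | 60 | 57 | 5 | 5 | 55 | 52 | 11 | Test/00004.png |
| 5 | 52 | 56 | 5 | 5 | 47 | 51 | 38 | Test/00005.png |
| 6 | 147 | 130 | 12 | 12 | 135 | 119 | 18 | Test/00006.png |
| 7 | 32 | 33 | 5 | 5 | 26 | 28 | 12 | Test/00007.png |
| 8 | 45 | 50 | 6 | 5 | 40 | 45 | 25 | Test/00008.png |
| 9 | 81 | 86 | 7 | 7 | 74 | 79 | 35 | Test/00009.png |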
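
The ten rows above already show that image sizes vary widely, which is one reason the images cannot be stacked into a regular array without resizing. A small sketch of that check, using only the `Width`, `Height` and `ClassId` values copied from the rows shown (the variable names are chosen for this example):

```python
import pandas as pd

# (Width, Height, ClassId) copied from the ten rows displayed above
rows = [
    (53, 54, 16), (42, 45, 1), (48, 52, 38), (27, 29, 33), (60, 57, 11),
    (52, 56, 38), (147, 130, 18), (32, 33, 12), (45, 50, 25), (81, 86, 35),
]
sample = pd.DataFrame(rows, columns=["Width", "Height", "ClassId"])

# Smallest and largest dimensions in the sample
size_range = sample[["Width", "Height"]].agg(["min", "max"])

# Width / height ratio: images are close to square but not exactly
aspect_ratio = sample["Width"] / sample["Height"]

print(size_range)
print(aspect_ratio.round(2).tolist())
```

The same two lines can be run on the full `df` to get the size range over the whole subset.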


### 3.2 - Useful functions
A convenient function for reading a dataset from an index.csv file.\
Input: an index.csv file\
Output: an array of images and an array of corresponding labels

In [27]:
from tqdm import tqdm  # Make sure this import is present at the top of your script/notebook
import pandas as pd
from skimage import io
import numpy as np
import os

def read_csv_dataset(csv_file, progress_verbosity=1):
    '''
    Reads traffic sign data from German Traffic Sign Recognition Benchmark dataset.
    Arguments:  
        csv_file (str): Description file, Example /data/GT-Train.csv
        progress_verbosity (int): Verbosity level of progress update, default is 1 (on)
    Returns:
        x, y (tuple): np array of images, np array of corresponding labels
    '''

    path = os.path.dirname(csv_file)

    # ---- Read csv file
    df = pd.read_csv(csv_file, header=0)
    
    # ---- Get filenames and ClassIds
    filenames = df['Path'].to_list()
    y = df['ClassId'].to_list()
    x = []
    
    # ---- Read images
    for filename in tqdm(filenames, disable=(progress_verbosity == 0), desc="Loading Images"):
        image = io.imread(os.path.join(path, filename))
        x.append(image)
    
    # ---- Return
    return np.array(x, dtype=object), np.array(y)


x, y = read_csv_dataset('data/GT-Test.csv', progress_verbosity=progress_verbosity)


Loading Images:   0%|          | 0/12630 [00:00<?, ?it/s]




FileNotFoundError: No such file: '/Users/jules/Documents/Apprentissage-Panneaux/data/Test/00000.png'
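
The `FileNotFoundError` above indicates that an image listed in the CSV is not at the location the loader builds, which is the CSV file's directory joined with the `Path` column. One way to diagnose this before loading anything is to list the missing files. The sketch below uses a hypothetical helper, `find_missing_images`, and a temporary directory with made-up files for the demonstration.

```python
import csv
import os
import tempfile

def find_missing_images(csv_file, path_column="Path"):
    """List image paths named in csv_file that do not exist on disk."""
    base_dir = os.path.dirname(csv_file)
    missing = []
    with open(csv_file, newline="") as f:
        for row in csv.DictReader(f):
            full_path = os.path.join(base_dir, row[path_column])
            if not os.path.isfile(full_path):
                missing.append(full_path)
    return missing

# Demonstration on a throwaway directory: one image present, one absent
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "Test"))
    open(os.path.join(tmp, "Test", "00000.png"), "wb").close()
    index = os.path.join(tmp, "GT-Test.csv")
    with open(index, "w", newline="") as f:
        f.write("ClassId,Path\n16,Test/00000.png\n1,Test/00001.png\n")
    demo_missing = [os.path.basename(p) for p in find_missing_images(index)]

print(demo_missing)
```

On the real data, calling `find_missing_images('data/GT-Test.csv')` would return an empty list once the `Test` image folder sits next to the CSV file.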

### 3.3 - Read the data
We will read the following datasets:
 - **Train** subset, for learning data as: `x_train, y_train`
 - **Test** subset, for validation data as: `x_test, y_test`
 - **Meta** subset, for visualisation as: `x_meta, y_meta`
 
The learning data will be randomly shuffled and the illustration data (Meta) sorted by class.  
This takes about 1 min 30 s on an HPC cluster, or 45 s on my laptop.
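
To make the shuffle-and-sort idea concrete, here is a toy sketch on four samples. It uses a NumPy permutation as a stand-in for `sklearn.utils.shuffle`, and the arrays are invented for the example. The key point is that images and labels must be reordered with the same indices so that pairs stay aligned.

```python
import numpy as np

# Toy "images" and their labels
x_demo = np.array(["a", "b", "c", "d"])
y_demo = np.array([3, 1, 2, 0])

# Shuffle: apply one random permutation to both arrays
rng = np.random.default_rng(0)
perm = rng.permutation(len(x_demo))
x_shuffled, y_shuffled = x_demo[perm], y_demo[perm]

# Sort by label: argsort gives the index order that sorts y
order = np.argsort(y_demo, kind="stable")
x_sorted, y_sorted = x_demo[order], y_demo[order]

print(x_sorted.tolist(), y_sorted.tolist())
```

Indexing with `order` keeps the result as NumPy arrays, whereas sorting a zipped list and unzipping it yields tuples.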

In [21]:
# Start the timer
start_time = time.time()

# Path to the datasets; replace 'data' with the actual path
datasets_dir = 'data'

# ---- Read the datasets
(x_train, y_train) = read_csv_dataset(f'{datasets_dir}/GT-Train.csv')
(x_test, y_test) = read_csv_dataset(f'{datasets_dir}/GT-Test.csv')
(x_meta, y_meta) = read_csv_dataset(f'{datasets_dir}/GT-Meta.csv')
    
# ---- Shuffle the training set
x_train, y_train = shuffle(x_train, y_train, random_state=0)

# ---- Sort Meta by class id
combined = list(zip(x_meta, y_meta))
combined.sort(key=lambda pair: pair[1])
x_meta, y_meta = zip(*combined)

# Print the elapsed time
end_time = time.time()
print(f'Elapsed time: {end_time - start_time:.2f} seconds')


NameError: name 'tqdm' is not defined