## Dataset folder structure 

- rear_signal_dataset
   - Footage_name
      - Footage_name_XXX
         - Footage_name_XXX_DDD (sequence of class XXX starting from frame number DDD)
            - light_mask
               - frameDDDDDDDD.png (frames with an 8-digit number indicating the frame number)
               - ...
         - Footage_name_XXX_DDD
            - light_mask
               - frameDDDDDDDD.png
               - ...
      - Footage_name_XXX
         - Footage_name_XXX_DDD
            - ...
         - Footage_name_XXX_DDD
            - ...
         - Footage_name_XXX_DDD
            - ...
   - Footage_name
      - ...

*XXX is the label of the signal (for example, BOO or BLO).

The dataset is first divided by Footage_name (e.g. "20160809_route8-08-09-2016_09-50-36_idx99"). Each footage directory contains one sub-directory per class label (brake lights on, lights off, etc.), and each class directory contains one sub-directory per brief burst of shots. Each burst shows a single vehicle, so its images are nearly identical to each other.
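
The naming scheme above can be made concrete with a small parser. This is a minimal sketch, not the `get_label_sequence_from_name` helper from `utils` (which is not shown here): `parse_sequence_dir`, `SequenceName` and the regular expression are hypothetical, and the pattern assumes the label is always three upper-case letters followed by an underscore and the starting frame number.

```python
import re
from typing import NamedTuple, Optional

# Assumed naming scheme: <footage_name>_<XXX>_<DDD>, where XXX is a
# three-letter upper-case label and DDD is the starting frame number.
SEQUENCE_DIR_PATTERN = re.compile(
    r"^(?P<footage>.+)_(?P<label>[A-Z]{3})_(?P<frame>\d+)$"
)


class SequenceName(NamedTuple):
    footage: str
    label: str
    start_frame: int


def parse_sequence_dir(name: str) -> Optional[SequenceName]:
    """Split a sequence directory name into its parts, or return None."""
    match = SEQUENCE_DIR_PATTERN.match(name)
    if match is None:
        return None
    return SequenceName(match["footage"], match["label"], int(match["frame"]))
```

Names that do not follow the assumed pattern, such as one with no underscore between the label and the frame number, return `None` instead of raising, so the caller can decide whether to skip or report them.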

In [1]:
import os

from utils import get_immediate_directories, get_label_sequence_from_name, get_split, image_is_valid, get_immediate_images

## Read in the lists of easy, moderate, and hard sequences

In [2]:
from difficulty_levels import DifficultyLevels, write_difficulty_levels

difficulties = DifficultyLevels('./data/Easy.txt', './data/Moderate.txt', './data/Hard.txt')  


E: 569, M: 72, H: 26


Check the total number of .png files. We can use this to verify that the final number of detected .png files is correct.


In [17]:
import glob

paths = glob.glob('data/**/*.png', recursive=True)
print("Total pngs: {}".format(len(paths)))

Total pngs: 63918
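
When this total needs to be compared with the number of images found later, it can help to see which directories the PNG files sit in, since the folder structure places frames inside `light_mask`. The helper below is a minimal sketch; `count_by_parent_dir` is a hypothetical name and is not part of `utils`.

```python
from collections import Counter
from pathlib import PurePath


def count_by_parent_dir(paths):
    """Count image paths by the name of the directory that directly holds them."""
    return Counter(PurePath(p).parent.name for p in paths)
```

Calling `count_by_parent_dir(paths)` on the list built above returns a `Counter`; any key other than `light_mask` points to PNG files stored outside the expected layout.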


## Get the paths to the images
The path to each image is collected per class.

In [27]:
from collections import defaultdict
import random

# Seed the random image selector so the output is reproducible
random.seed(0)

def get_paths_for_image_sequences():
    """
    Gets the path for the images per sequence
    
    Will return a dictionary per_class_image_sequences of form:
        {
            <label>: {
                <sequence>: {
                    difficulty: 'E', // 'E', 'M', 'H'
                    image_paths: [...],
                    split: "train", // "train", "test", "val"
                },
                <sequence>: {
                    difficulty: 'E', // 'E', 'M', 'H'
                    image_paths: [...],
                    split: "val",
                },
            },
            <label>: {
                // ...
            },
        }
    """
    
    per_class_image_sequences = defaultdict(lambda: defaultdict(dict))

    # Footage directories
    footage_dirs = get_immediate_directories('./data')

    for f_dir in footage_dirs:
        # These are folders corresponding to each class
        path_1 = os.path.join('./data', f_dir)
        f_class_dirs = get_immediate_directories(path_1)

        # Loop through all class dirs of form Footage_name_XXX
        # ex. 20160805_g1k17-08-05-2016_15-57-59_idx99_BOO
        for f_class_dir in f_class_dirs:
            # The label of the class ex. "BLO"
            class_label, _ = get_label_sequence_from_name(f_class_dir)
            
            path_2 = os.path.join(path_1, f_class_dir)
            footage_sequence_dirs = get_immediate_directories(path_2)
            
            # Loop through the individual sequences
            # footage_sequence_dir can be of form 20160805_g1k17-08-05-2016_15-57-59_idx99_BOO_00023411
            for footage_sequence_dir in footage_sequence_dirs:
                path_3 = os.path.join(path_2, footage_sequence_dir, "light_mask")

                difficulty_level = difficulties.get_difficulty_level(footage_sequence_dir)
                
                # Save the difficulty level
                per_class_image_sequences[class_label][footage_sequence_dir]["difficulty"] = difficulty_level
                per_class_image_sequences[class_label][footage_sequence_dir]["split"] = get_split()

                # Get the images
                image_names = get_immediate_images(path_3)
                image_names = [os.path.join(path_3, i) for i in image_names]

                # Save all the valid image paths
                per_class_image_sequences[class_label][footage_sequence_dir]["image_paths"] = image_names
               
    return per_class_image_sequences
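
The function above calls `get_split()` once per sequence, so all near-identical frames of a burst land in the same split. The implementation of `get_split` lives in `utils` and is not shown; the following is only a sketch of how such a helper could look. The name `get_split_sketch` and the 70/15/15 weights are assumptions, not the ratios actually used.

```python
import random


def get_split_sketch(rng=random, weights=(0.7, 0.15, 0.15)):
    """Pick 'train', 'val' or 'test' at random with the given weights.

    The weights are an assumption, not the ratios used by utils.get_split.
    """
    return rng.choices(["train", "val", "test"], weights=weights, k=1)[0]
```

Passing a dedicated `random.Random(seed)` instance as `rng` keeps the split assignment reproducible without touching the global random state.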

## Re-arrange images + save original sequence
The following copies all the images into one directory per split and class. A shallower folder structure greatly improves performance on Colab. The original organization, including which images belong to which sequence, is saved in a pickled dictionary.

In [29]:
from shutil import copyfile, rmtree
from utils import rm_mkdir

per_class_image_sequences = get_paths_for_image_sequences()

splits = ['train', 'val', 'test']

output_dir = "./lstm_output"

rm_mkdir(output_dir)

for s in splits:
    # Make sub-directory
    split_directory = os.path.join(output_dir, s)
    os.mkdir(split_directory)
    
    # Create directories for each label
    for label in per_class_image_sequences:
        os.mkdir(os.path.join(split_directory, label))

# Collect the new file names per difficulty level
new_easy = []
new_medium = []
new_hard = []

frame_number = 0

# Copy the images into the new paths
for label in per_class_image_sequences:
    for sequence in per_class_image_sequences[label]:
        split = per_class_image_sequences[label][sequence]["split"]
        difficulty = per_class_image_sequences[label][sequence]["difficulty"]
        per_class_image_sequences[label][sequence]["updated_name"] = []
        output_dir_path = os.path.join(output_dir, split, label)

        for image_path in per_class_image_sequences[label][sequence]["image_paths"]:
            # Create a new name for the image
            output_image_name = f'frame_{frame_number:05}.png'
            frame_number += 1

            # Store a parallel list with the updated name
            per_class_image_sequences[label][sequence]["updated_name"].append(output_image_name)  

            output_path = os.path.join(output_dir_path, output_image_name)
            
            # Copy the file over
            copyfile(image_path, output_path)
            
            # Add the new name to the difficulty level path
            if difficulty == "E":
                new_easy.append(output_image_name)
            elif difficulty == "M":
                new_medium.append(output_image_name)
            elif difficulty == "H":
                new_hard.append(output_image_name)

print("Total frames: {}".format(frame_number))

Unrecognized sequence dir name: 20161013_demo_test-10-13-2016_15-51-02_OLO_00001274
Unrecognized sequence dir name: 20161013_demo_test-10-13-2016_15-51-02_BOO_00001274
Unrecognized sequence dir name: 20161007_demo_surface-10-07-2016_16-14-04_2_BOR_00005877
Unrecognized sequence dir name: 20161007_demo_surface-10-07-2016_16-14-04_2_BOO_00005877
Unrecognized sequence dir name: 20160915_route_demo2-09-15-2016_18-49-23_OOR_00000215
Unrecognized sequence dir name: route-02-23-2016_17-17-51_BOO_9125
Unrecognized sequence dir name: 20160920_route_demo-09-20-2016_18-47-39_BLO00001405
Total frames: 63783
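
Because the split is assigned per sequence and sequences differ in length, the number of frames per split and class is worth checking after the copy. A minimal sketch, assuming the dictionary layout described in the docstring of `get_paths_for_image_sequences`; `count_frames_per_split` is a hypothetical helper.

```python
from collections import Counter


def count_frames_per_split(sequences):
    """Return a Counter mapping (split, label) to the number of frames."""
    counts = Counter()
    for label, per_sequence in sequences.items():
        for info in per_sequence.values():
            counts[(info["split"], label)] += len(info["image_paths"])
    return counts
```

`sum(count_frames_per_split(per_class_image_sequences).values())` should then equal the total number of frames printed above.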


In [30]:
write_difficulty_levels(output_dir, new_easy, new_medium, new_hard)

In [31]:
import pickle

dictionary_pickle_name = os.path.join(output_dir, 'lstm_sequence_dict.pickle')

# Reset the outer defaultdict's lambda factory, since lambdas cannot be pickled.
# After this, missing keys raise KeyError instead of being created.
per_class_image_sequences.default_factory = None

with open(dictionary_pickle_name, 'wb') as handle:
    pickle.dump(per_class_image_sequences, handle)

In [40]:
# Check that the pickle was successful
with open(dictionary_pickle_name, 'rb') as handle:
    test_dict = pickle.load(handle)

test_dict['BLO']['20160915_route_demo1-09-15-2016_18-28-53_BLO_00000492']['updated_name'][:5]

['frame_00000.png',
 'frame_00001.png',
 'frame_00002.png',
 'frame_00003.png',
 'frame_00004.png']
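
Since `updated_name` is stored as a list parallel to `image_paths`, and the frame counter never resets, the dictionary can also be inverted to trace a renamed frame back to its source. A minimal sketch; `build_reverse_index` is a hypothetical helper and not part of the notebook's modules.

```python
def build_reverse_index(sequences):
    """Map each new frame name to (label, sequence name, original path)."""
    index = {}
    for label, per_sequence in sequences.items():
        for sequence, info in per_sequence.items():
            # updated_name and image_paths are parallel lists
            for new_name, old_path in zip(info["updated_name"], info["image_paths"]):
                index[new_name] = (label, sequence, old_path)
    return index
```

For example, `build_reverse_index(test_dict)['frame_00000.png']` returns the label, the sequence directory name, and the original path of the first copied frame.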