## Dataset folder structure 

- rear_signal_dataset
   - Footage_name
      - Footage_name_XXX
         - Footage_name_XXX_DDD (sequence of class XXX starting from frame number DDD)
            - light_mask
               - frameDDDDDDDD.png (frames named with an 8-digit number indicating the frame number)
               - ...
         - Footage_name_XXX_DDD
            - light_mask
               - frameDDDDDDDD.png
               - ...
      - Footage_name_XXX
         - Footage_name_XXX_DDD
            - ...
         - Footage_name_XXX_DDD
            - ...
         - Footage_name_XXX_DDD
            - ...
   - Footage_name
      - ...

*XXX indicates the label of the signal
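
As a rough illustration of the naming convention above, the sketch below splits a sequence directory name into its footage name, label, and starting frame. The function name `parse_sequence_dir` and the regular expression are assumptions made for this example; they are not taken from `utils`, and the three-uppercase-letter label pattern is inferred from names such as `..._BLO_00000492`.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for <Footage_name>_<XXX>_<DDD>:
# a three-letter uppercase label followed by a numeric start frame.
SEQUENCE_DIR_RE = re.compile(
    r"^(?P<footage>.+)_(?P<label>[A-Z]{3})_(?P<start>\d+)$"
)


def parse_sequence_dir(name: str) -> Optional[Tuple[str, str, int]]:
    """Return (footage_name, label, start_frame), or None if the name does not fit."""
    match = SEQUENCE_DIR_RE.match(name)
    if match is None:
        return None
    return match.group("footage"), match.group("label"), int(match.group("start"))


parsed = parse_sequence_dir("20160915_route_demo1-09-15-2016_18-28-53_BLO_00000492")
malformed = parse_sequence_dir("20160920_route_demo-09-20-2016_18-47-39_BLO00001405")
```

Here `parsed` holds the footage name, the label `BLO`, and the start frame as an integer, while `malformed` is `None` because the underscore between label and frame number is missing. The validation performed by the project's own helpers may be stricter than this pattern.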


## Output Dataset folder structure 

```
|_ frames
|  |_ [video name 0]
|  |  |_ [video name 0]_000001.jpg
|  |  |_ [video name 0]_000002.jpg
|  |  |_ ...
|  |_ [video name 1]
|     |_ [video name 1]_000001.jpg
|     |_ [video name 1]_000002.jpg
|     |_ ...
|_ frame_lists
|  |_ train.csv
|  |_ test.csv
|  |_ val.csv
```

The dataset is first divided by footage name (e.g. "20160809_route8-08-09-2016_09-50-36_idx99"). Each footage directory contains one sub-directory per label class (brake lights on, lights off, etc.), and each of those contains one sub-directory per brief burst of shots. Each burst shows a single vehicle, so its images are nearly identical to each other.

In [1]:
from utils import get_immediate_directories, get_label_sequence_from_name, get_split, image_is_valid, get_immediate_images

## Read in the lists of easy, moderate, and hard sequences

In [2]:
from difficulty_levels import DifficultyLevels, write_difficulty_levels

difficulties = DifficultyLevels('./data/Easy.txt', './data/Moderate.txt', './data/Hard.txt')  


E: 569, M: 72, H: 26


Check the total number of .png files. We can use this later to verify that the final number of processed .png files is correct.


In [3]:
import glob

paths = glob.glob('data/**/*.png', recursive=True)
print("Total pngs: {}".format(len(paths)))

Total pngs: 63918


## Get the paths to the images
The path to each image is collected per class.

`per_class_image_sequences` is a default dictionary of the form:

```
{
    <label>: {
        <sequence>: {
            difficulty: 'E', // 'E', 'M', 'H'
            image_paths: [...],
            split: "train", // "train", "test", "val"
        },
        <sequence>: {
            difficulty: 'E', // 'E', 'M', 'H'
            image_paths: [...],
            split: "val",
        },
    },
    <label>: {
        // ...
    },
}
```
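
A minimal sketch of how a nested dictionary of this shape could be built with `collections.defaultdict`. The helper names `new_sequence_entry` and `add_image` and the sample paths are hypothetical; the real `get_paths_for_image_sequences` may construct the dictionary differently.

```python
from collections import defaultdict


def new_sequence_entry():
    """Default record for a sequence that has not been seen yet."""
    return {"difficulty": None, "image_paths": [], "split": None}


# label -> sequence -> record; missing keys are created on first access
per_class = defaultdict(lambda: defaultdict(new_sequence_entry))


def add_image(store, label, sequence, image_path, difficulty, split):
    """Append one image path to the record for (label, sequence)."""
    entry = store[label][sequence]
    entry["difficulty"] = difficulty
    entry["split"] = split
    entry["image_paths"].append(image_path)


# Toy values for illustration only
add_image(per_class, "BLO", "seq_a", "seq_a/light_mask/frame00000001.png", "E", "train")
add_image(per_class, "BLO", "seq_a", "seq_a/light_mask/frame00000002.png", "E", "train")
```

Because missing keys are created on access, simply reading `per_class["XYZ"]` adds an empty entry, which is worth keeping in mind when iterating over labels afterwards.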

In [4]:
from utils import get_paths_for_image_sequences

per_class_image_sequences = get_paths_for_image_sequences(difficulties)

Unrecognized sequence dir name: 20161013_demo_test-10-13-2016_15-51-02_OLO_00001274
Unrecognized sequence dir name: 20161013_demo_test-10-13-2016_15-51-02_BOO_00001274
Unrecognized sequence dir name: 20161007_demo_surface-10-07-2016_16-14-04_2_BOR_00005877
Unrecognized sequence dir name: 20161007_demo_surface-10-07-2016_16-14-04_2_BOO_00005877
Unrecognized sequence dir name: 20160915_route_demo2-09-15-2016_18-49-23_OOR_00000215
Unrecognized sequence dir name: route-02-23-2016_17-17-51_BOO_9125
Unrecognized sequence dir name: 20160920_route_demo-09-20-2016_18-47-39_BLO00001405


## Re-arrange images + save original sequence
The following copies all the images into one directory per sequence under `frames`. A shallower folder structure greatly improves performance on Colab. The original organization, including which images belonged to which sequence, is saved in a pickled dictionary.

In [7]:
import os
from shutil import copyfile
from utils import rm_mkdir

splits = ['train', 'val', 'test']

output_dir = "./slowfast_output"
output_frame_dir = os.path.join(output_dir, "frames")

rm_mkdir(output_dir)
rm_mkdir(output_frame_dir)

# Per-split list of [sequence name, class, difficulty level] rows
metadata = {key: [] for key in splits}
total_frames_processed = 0

# Copy the images into the new paths
for label in per_class_image_sequences:
    for sequence in per_class_image_sequences[label]:
        split = per_class_image_sequences[label][sequence]["split"]
        difficulty = per_class_image_sequences[label][sequence]["difficulty"]
        output_dir_path = os.path.join(output_frame_dir, sequence)

        # Make the directory for the sequence folder
        rm_mkdir(output_dir_path)

        metadata[split].append([sequence, label, difficulty])

        image_paths = per_class_image_sequences[label][sequence]["image_paths"]

        # Number the frames from 0 within each sequence directory
        for frame_number, image_path in enumerate(image_paths):
            # Create a new name for the image
            output_image_name = f'frame_{frame_number:05}.png'
            output_path = os.path.join(output_dir_path, output_image_name)

            # Copy the file over
            copyfile(image_path, output_path)

        total_frames_processed += len(image_paths)

print("Total frames: {}".format(total_frames_processed))

Total frames: 63783
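
The copy loop reports 63783 frames, while the earlier glob found 63918 .png files. One way to see which source files were not copied is to compare the two sets of paths. The sketch below is an assumption-laden illustration: `find_uncopied` is a hypothetical helper, and it assumes both path lists point at the same files and differ at most in notation (e.g. `./data/...` vs. `data/...`), which it normalizes with `os.path.normpath`.

```python
import os


def find_uncopied(all_png_paths, per_class_image_sequences):
    """Return the sorted source paths that appear in no sequence's image_paths."""
    copied = {
        os.path.normpath(path)
        for sequences in per_class_image_sequences.values()
        for info in sequences.values()
        for path in info["image_paths"]
    }
    found = {os.path.normpath(path) for path in all_png_paths}
    return sorted(found - copied)


# Toy data for illustration only
toy_paths = ["data/a/1.png", "data/a/2.png", "data/b/1.png"]
toy_sequences = {"BLO": {"a": {"image_paths": ["./data/a/1.png", "data/a/2.png"]}}}
leftover = find_uncopied(toy_paths, toy_sequences)
```

In the notebook this would be called as `find_uncopied(paths, per_class_image_sequences)`, and the length of the result could then be compared with the gap between the two totals.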


## Save Metadata
Create train.csv, test.csv, and val.csv from `metadata`.

In [18]:
import pandas as pd

# Create data frames
dfs = {key: pd.DataFrame(metadata[key], columns=['Sequence ID', 'Label', 'Difficulty']) for key in splits}

csv_dir = os.path.join(output_dir, "frame_lists")
rm_mkdir(csv_dir)

# save to CSV
for key in dfs:
    # Drop the difficulty level, since we don't need it
    dfs[key] = dfs[key].drop(columns=['Difficulty'])

    # save as a CSV
    dfs[key].to_csv(os.path.join(csv_dir, "{}.csv".format(key)), index=False)

In [30]:
write_difficulty_levels(output_dir, new_easy, new_medium, new_hard)

In [31]:
import pickle

dictionary_pickle_name = os.path.join(output_dir, 'lstm_sequence_dict.pickle')

# Clear the defaultdict's default_factory (the lambda that creates missing keys).
# Lambdas cannot be pickled, so this is required before dumping.
per_class_image_sequences.default_factory = None

with open(dictionary_pickle_name, 'wb') as handle:
    pickle.dump(per_class_image_sequences, handle)
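
To illustrate why the `default_factory` has to be cleared, here is a small self-contained sketch. The nested structure used here (an outer `defaultdict` with a lambda factory and inner `defaultdict(dict)` values) is an assumption for demonstration and may not match the notebook's actual dictionary.

```python
import pickle
from collections import defaultdict

# Outer factory is a lambda, which pickle cannot serialize by name
nested = defaultdict(lambda: defaultdict(dict))
nested["BLO"]["seq_a"]["split"] = "train"

try:
    pickle.dumps(nested)
    pickled_with_factory = True
except (pickle.PicklingError, AttributeError, TypeError):
    # The exact exception type depends on where the lambda was defined
    pickled_with_factory = False

# Clearing the factory leaves the stored data intact and makes it picklable
nested.default_factory = None
restored = pickle.loads(pickle.dumps(nested))
```

After the round trip, `restored` still contains the stored entries, but looking up a missing label raises `KeyError` instead of silently creating a new entry.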

In [40]:
# Check that the pickle was successful
with open(dictionary_pickle_name, 'rb') as handle:
    test_dict = pickle.load(handle)

test_dict['BLO']['20160915_route_demo1-09-15-2016_18-28-53_BLO_00000492']['updated_name'][:5]

['frame_00000.png',
 'frame_00001.png',
 'frame_00002.png',
 'frame_00003.png',
 'frame_00004.png']