M5 Project: Scene Understanding for Autonomous Vehicles
Team 9 (also known as למלם, pronounced "Lam Lam")
Lorenzo Betto: smemo23@gmail.com
Noa Mor: noamor87@gmail.com
Ana Caballero Cano: ana.caballero.cano@gmail.com
Ivan Caminal Colell: ivancaminal72@gmail.com
The goal of the project is to perform object detection, segmentation and recognition using Deep Learning in the context of scene understanding for autonomous vehicles. The network's goal is to perform these tasks successfully on classes such as pedestrians, vehicles, road, etc.
Link to the Overleaf article, i.e. the report of the project.
We read and understood the slides and the Overleaf document.
Team members are those named above.
We installed the necessary software.
Task A: Bash file to output the number of samples of each folder.
Task B: We ran the code on the KITTI dataset for training and validation.
Task Cii: We implemented a new CNN (LamLam) with two parallel sequential processes of convolutional layers.
Task E: We wrote the report
Task A: We created a bash script that writes three text files (train, test, val), each containing a list of "subfolder_name; number_of_images" entries.
Task B: We ran the code on the KITTI dataset for training and validation, but not for testing.
Task Cii: Our own CNN implementation, which we named LamLam (after our team).
It has two parallel sequential processes of convolutional layers of different sizes, which allows it to capture two different types of information.
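The per-folder count from Task A can also be sketched in Python. This is not the team's bash script, only a minimal equivalent under stated assumptions: the function names and the list of image extensions are hypothetical, and each split directory is assumed to contain one subfolder per class.

```python
import os

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".ppm")


def count_images(split_dir):
    """Return {subfolder_name: number_of_images} for one split directory."""
    counts = {}
    for name in sorted(os.listdir(split_dir)):
        sub = os.path.join(split_dir, name)
        if os.path.isdir(sub):
            counts[name] = sum(
                1 for f in os.listdir(sub) if f.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


def write_counts(split_dir, out_path):
    """Write one 'subfolder_name; number_of_images' line per subfolder."""
    with open(out_path, "w") as out:
        for name, n in count_images(split_dir).items():
            out.write(f"{name}; {n}\n")
```

Calling `write_counts` once per split (train, test, val) would produce the three text files described above.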
CUDA_VISIBLE_DEVICES=0 python train.py -c config/dataset.py -e expName
100%
- Analyze the dataset: The images are 64x64 pixels and differ in point of view, background, illumination and hue. Furthermore, some images are slightly blurred.
- Count the number of samples per class: 16527 for training, 644 for validation and 8190 for testing. To know the number of samples per class follow the link: Google Sheets
- Accuracy of train/test: Accuracy Train: 97.7 %; Accuracy Test: 95.2 %. The accuracy on the training set is higher than on the test set, as expected.
- For this case which one provides better results, crop or resize? On this dataset cropping is useless because the images are already cropped, so resizing is better.
- Where does the mean subtraction take place? The mean subtraction takes place in the ImageDataGenerator, when norm_featurewise_center is set to ‘True’.
- Fine-tune the classification for the Belgium traffic signs dataset.
Custom accuracy:
Accuracy with Belgium traffic signs dataset:
Custom loss:
Loss with Belgium traffic signs dataset:
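The feature-wise mean subtraction mentioned above can be illustrated with a small NumPy sketch. It assumes a per-channel mean computed over the training set; whether the framework behind `norm_featurewise_center` uses a per-channel or a per-pixel mean is not stated here, so the function below is a hypothetical stand-in and not the framework's code.

```python
import numpy as np


def featurewise_center(train_images, images):
    """Subtract the per-channel mean of the training set from `images`.

    Both arrays are assumed to have shape (N, H, W, C).
    """
    # Assumption: one mean per colour channel, estimated on the training set only.
    mean = train_images.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
    return images - mean


rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(8, 64, 64, 3))  # synthetic stand-in for 64x64 images
centered = featurewise_center(train, train)
```

The same training-set mean would be reused for validation and test images so that all splits are shifted identically.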
We ran the KITTI dataset for the training and the validation datasets since the test set is private and we can't access it.
We used a CNN that had been tested in the Machine Learning course of the same Master's program. Such architecture is shown in
and it performed well on a classification problem that involved scenery images.
The idea behind a network with two parallel sequential processes of convolutional layers of different sizes was to capture two different types of information: the first process captures small details and texture, and the second captures the composition and details of the bigger picture.
The model's parameters were optimized using a random search when the model was first used, i.e. in the Machine Learning course.
We boosted the performance of the network by using an SPP (Spatial Pyramid Pooling) layer instead of a custom pooling layer at the end of each tower (to concatenate the two towers, their shapes must agree).
In addition, this layer makes the model independent of the image size.
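To illustrate why an SPP layer lets the two towers be concatenated and removes the dependence on image size, here is a minimal NumPy sketch of spatial pyramid max pooling. The pyramid levels (1, 2, 4), the function name and the tower shapes are assumptions made for the example; they are not taken from the LamLam model.

```python
import math

import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an (H, W, C) feature map over an n x n grid for each n in `levels`.

    The pooled bins are concatenated into one vector of length C * sum(n * n),
    which does not depend on H or W.
    """
    h, w, _ = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor/ceil bin borders so that the bins cover the whole map.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(feature_map[r0:r1, c0:c1, :].max(axis=(0, 1)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
tower_small = rng.normal(size=(13, 13, 8))  # hypothetical output of one tower
tower_large = rng.normal(size=(7, 9, 8))    # hypothetical output of the other tower
merged = np.concatenate(
    [spatial_pyramid_pool(tower_small), spatial_pyramid_pool(tower_large)]
)
```

Both towers yield vectors of the same length even though their spatial sizes differ, so they can be concatenated directly.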
Training is done on the TT100K dataset and testing on the Belgium dataset, as a step towards a generic model.
Task A: We fixed some errors to be able to run the code.
Task B: We read two articles and summarized them.
Task C: SSD object detector, using Keras
Task D: Evaluation on the Udacity dataset
Task E: Boost the performance through data augmentation
Task A: YOLO object detector
We fixed some errors to be able to run the code. We got these results:
The dataset was also analyzed:
The annotation files do not always include all the traffic signs that exist in the image.
The images differ in the number of traffic signs, orientation and illumination.
Task B: -
Task C: SSD object detector, Keras implementation
Task D: Train on the Udacity dataset for 40 epochs and tune the colors (saturation) to address the challenges of the dataset.
Task E: Data augmentation
CUDA_VISIBLE_DEVICES=0 python train.py -c config/dataset.py -e expName
100%
We fixed some errors to be able to run the initial code.
From the dataset analysis we can conclude that:
TT100k dataset: The annotation files do not always include all the traffic signs that exist in the image, and the selection criteria are not clear. The images differ in the number of traffic signs, orientation and illumination.
Udacity dataset: consists of urban images in which the camera always faces the road and the dashboard is visible. There is a big difference between the train and test images.
Training images are taken in strong midday light, with mostly saturated colors, shadows and reflections (e.g. light reflected from the windshield), while test images have more vivid colors and are taken at different times of day.
The luminance varies widely across the photos and is often unbalanced or disturbed (for example, by light reflected from the windshield).
Propose (and implement) solutions.
One solution is to preprocess the images by tuning the colors (saturation).
Another is to train with data augmentation on the color channels, creating more variance in the color spectrum.
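As an illustration of the saturation preprocessing proposed above, the sketch below scales saturation by blending each pixel with its grey-level value. This is one common way to change saturation and is an assumption: the exact transform and factor used in the experiments are not specified here, and `adjust_saturation` is a hypothetical helper.

```python
import numpy as np


def adjust_saturation(image, factor):
    """Scale the colour saturation of an RGB uint8 image of shape (H, W, 3).

    factor < 1 desaturates, factor > 1 saturates, factor == 1 leaves the image unchanged.
    """
    rgb = image.astype(np.float32)
    # Rec. 601 luma weights give the grey-level version of each pixel.
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Move every channel away from (or towards) the grey level.
    blended = gray[..., None] + factor * (rgb - gray[..., None])
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

Applying a fixed factor to every image is a preprocessing step, while drawing the factor at random per image would turn the same function into a color augmentation.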
Summaries:
YOLO: You Only Look Once
SSD: Single Shot MultiBox Detector
Implementation found in this Github repo.
Set up new experiment files to detect cars, pedestrians and trucks on the Udacity dataset, then train and evaluate.
Increment the number of epochs to 40.
We ran YOLO on the plain Udacity dataset (for comparison with TT100K).
We boosted the performance of the network by implementing the previous solution (preprocessing the images by changing the color saturation) and applying data augmentation.
Task (a): Run the provided code (YOLO-Traffic Sign)
Task (B): Read two papers: FCN for Semantic Segmentation and U-Net.
Task (c): Implement a new network by modifying an existing Keras implementation
Task (d): Train the network on a different dataset
Task (a): We used the preconfigured experiment file (camvid_segmentation.py) to segment objects with the FCN8 model.
Then we analysed the dataset and evaluated the model on the train, val and test sets.
Task (B): -
Task (c): We implemented a new network, U-Net, by modifying an existing Keras implementation.
Task (d): Train the network on different datasets: Kitti, Camvid, Cityscapes and Synthia_rand_cityscapes
CUDA_VISIBLE_DEVICES=0 python train.py -c config/dataset.py -e expName
100%
We fixed some errors to be able to run the initial code
From the dataset analysis we can conclude that:
Kitti and Camvid have the same classes.
Kitti dataset: the images are taken at the same time of day and there is an important contrast between sun and shadow, making it difficult to segment them.
Camvid dataset: the streets are more similar, but the time of day differs, as can be seen in the lighting.
Cityscapes and Synthia_rand_cityscapes have the same classes.
Cityscapes dataset: the images are similar in terms of illumination and color even though they come from different German cities. They differ more in content (pedestrians, bicycles, vehicles, etc.).
Synthia_rand_cityscapes: the images differ in the scene. There are groups of images of the same scene that differ in the time of day, which shows in the illumination, the shadows or even the rain.
The U-Net network was implemented based on zhixuhao's GitHub repository, adding an image mirror-padding layer to obtain a full-image segmentation network.
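One plausible reading of the mirror-padding step is sketched below in NumPy: the image is reflect-padded so that its height and width become multiples of 16, which a U-Net with four 2x2 pooling stages needs in order to return a mask of the input size. The multiple of 16, the symmetric split of the padding and the helper names are assumptions, not details taken from the repository.

```python
import numpy as np


def mirror_pad(image, multiple=16):
    """Reflect-pad an (H, W, C) image so that H and W become multiples of `multiple`.

    Returns the padded image and the pad widths needed to crop the prediction back.
    """
    h, w = image.shape[:2]
    pad_h, pad_w = (-h) % multiple, (-w) % multiple
    top, left = pad_h // 2, pad_w // 2
    pads = ((top, pad_h - top), (left, pad_w - left), (0, 0))
    return np.pad(image, pads, mode="reflect"), pads


def crop_back(padded, pads):
    """Remove the padding added by mirror_pad from an image or a predicted mask."""
    (top, bottom), (left, right), _ = pads
    h, w = padded.shape[:2]
    return padded[top:h - bottom, left:w - right]
```

After prediction, `crop_back` restores the original resolution so the mask can be compared with the ground truth pixel by pixel.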
We trained the network with four datasets: Kitti, Camvid, Cityscapes and Synthia Cityscapes.
As Kitti and Camvid contain the same number of classes, we tried to use the FCN8 weights fine-tuned on Camvid to test the performance on Kitti, but we did not obtain good results.
Synthia and Cityscapes also have the same number of classes. The former contains synthetic images, the latter contains real ones.
We tried data augmentation on the color channels by adding to each pixel value in every channel a random intensity drawn from a uniform distribution with a margin of ±20%, but unfortunately this did not improve the Jaccard metric. Additionally, we implemented a weighted cross-entropy loss function that takes into account the class frequencies in the unbalanced datasets.
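Both ideas can be sketched in NumPy under explicit assumptions. The ±20% margin is read here as ±0.2 on images scaled to [0, 1], and the class weights use median-frequency balancing; the text does not say which weighting scheme was actually implemented, so the functions below are hypothetical illustrations and not the project's code.

```python
import numpy as np


def color_jitter(image, rng, margin=0.2):
    """Add an independent uniform offset in [-margin, margin] to every pixel value.

    `image` is assumed to be a float array scaled to [0, 1].
    """
    noise = rng.uniform(-margin, margin, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)


def class_weights(labels, num_classes):
    """Median-frequency balancing (assumed scheme): weight_c = median(freq) / freq_c."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    present = freq > 0  # classes absent from `labels` keep a weight of zero
    weights = np.zeros(num_classes)
    weights[present] = np.median(freq[present]) / freq[present]
    return weights


def weighted_cross_entropy(probs, labels, weights, eps=1e-7):
    """Mean of -weights[y] * log(p[y]) over pixels.

    `probs` has shape (H, W, K) and `labels` has shape (H, W) with integer class ids.
    """
    h, w = labels.shape
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(np.mean(-weights[labels] * np.log(np.clip(p_true, eps, 1.0))))
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on small objects contribute more to the loss than errors on large regions such as road or sky.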