# Semantic segmentation and depth estimation with hardware-restricted FCNs in TensorFlow

Dense prediction tasks like semantic segmentation and depth estimation based on monocular
images are challenging fields of research. Here we evaluate different deep learning models
for real-world scene understanding for autonomous driving under the constraint of limited
hardware.

## What are we going to do?

* Build different CNNs in TensorFlow based on pretrained backbone models (Inception ResNet V2 and NASnet Mobile).
* Evaluate semantic segmentation and depth estimation models as single-task and as multi-task learning topologies.
* Compare our models to the current state of the art, with a sweet surprise.
* Do everything in relation to autonomous driving (Cityscapes benchmark suite and limited hardware).
* Provide a baseline for everyone who doesn't own a high-end PC and still wants usable results.

<img src="images/Cityscapes_Inference.png" /> 

<img src="images/Cityscapes_Inference_D.png" /> 

## Requirements

* TensorFlow 1.4+ with Python 3.x and GPU support, plus an installed [TF-Slim](https://github.com/tensorflow/models/tree/master/research/slim)
* 16 GB of RAM and a GPU with at least 4 GB of memory, such as the NVIDIA GeForce GTX 1050 Ti
* Some background on TensorFlow and ConvNets (plus access to [Cityscapes](https://www.cityscapes-dataset.com/) if you want to rebuild it)

## Models

To get the basic idea of transfer learning and dense-prediction tasks, we evaluate six different models.

NASnet Mobile and Inception ResNet V2 serve as pretrained backbones. Each of them is trained at its default input image size (224 pixels for NASnet Mobile, 299 pixels for Inception ResNet V2) as:
* "single task semantic" (SS)
* "single task semantic with multiple endpoints" (SM)
* "multi-task depth and semantic" (MT)

Below is a heavily compressed overview of the respective model structures.

### Basic structure
<img src="images/General_Structure.png" /> 
### CME
<img src="images/CME.png" />

### SS
<img src="images/SS.png" /> 
### SM
<img src="images/SM.png" /> 
### MT
<img src="images/MT.png" />
### Legend
<img src="images/Legend.png" /> 

The implementations in the "Networks" folder require 4 GB of GPU memory for the NASnet models and 6 GB for the IncRes models.

## Topology information

Training procedure:

* Coarse training: 75 epochs on the coarse dataset spanning 22,975 images, with an initial learning rate of 1e-4 (polynomial decay to 1e-5 within 75% of the training time).
* Fine-tuning: 150 epochs on the finely annotated dataset of 2,975 images, with an initial learning rate of 1e-4 (polynomial decay to 1e-6 within 75% of the training time).
* The batch size is 6 for all training phases. No gradient aggregation is used.
* The evaluation result, mIoU (mean intersection over union), is based on the validation dataset of 500 images and the official [evaluation script](https://github.com/mcordts/cityscapesScripts).
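The learning-rate schedule of the coarse training phase can be sketched in plain Python as follows. This is only an illustration under assumptions: the function name `polynomial_decay`, the decay power of 1.0 (a linear ramp), and the rounding of steps per epoch are not taken from this repository; only the start and end rates, the 75% decay window, the epoch count, the dataset size and the batch size come from the description above.

```python
def polynomial_decay(step, total_steps, initial_lr=1e-4, end_lr=1e-5,
                     decay_fraction=0.75, power=1.0):
    """Learning rate at `step` for a polynomial decay schedule.

    The rate falls from `initial_lr` to `end_lr` during the first
    `decay_fraction` of training and stays at `end_lr` afterwards.
    `power=1.0` (linear decay) is an assumption, not a documented setting.
    """
    decay_steps = max(1, int(total_steps * decay_fraction))
    step = min(step, decay_steps)          # hold the end value after the decay window
    remaining = 1.0 - step / decay_steps   # goes from 1.0 down to 0.0
    return (initial_lr - end_lr) * remaining ** power + end_lr


# Coarse phase as described above: 75 epochs, 22,975 images, batch size 6.
steps_per_epoch = 22975 // 6               # assumption: incomplete last batch is dropped
total_steps = 75 * steps_per_epoch

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    step = int(total_steps * fraction)
    print(f"{fraction:.0%} of training: lr = {polynomial_decay(step, total_steps):.2e}")
```

For the fine-tuning phase, the same function would be called with `end_lr=1e-6` and the step count of 150 epochs on 2,975 images.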

<img src="images/Training_Information.PNG" /> 
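For readers unfamiliar with the mIoU metric used for the evaluation, the following minimal sketch shows how it can be computed from flattened label lists. It is not the official Cityscapes evaluation script; the function name `mean_iou`, the ignore label value of 255 and the choice to skip classes that never occur are assumptions made for illustration.

```python
def mean_iou(y_true, y_pred, num_classes, ignore_label=255):
    """Mean intersection over union for flat lists of class ids.

    Pixels whose ground truth equals `ignore_label` are skipped.
    Classes that appear neither in the ground truth nor in the
    prediction are left out of the mean (an assumption of this sketch).
    """
    # confusion[t][p] counts pixels of true class t predicted as class p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        if t == ignore_label:
            continue
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in confusion) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else float("nan")


# Tiny example with two classes and four pixels.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In this example class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is about 0.583.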

To further evaluate the importance of the input image size, we increased the input sizes for two models. Due to the limited hardware, gradient aggregation is used for the Inception ResNet model.
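The idea behind gradient aggregation can be illustrated with a toy example: gradients of several small micro-batches are summed and averaged before a single weight update, which reproduces the gradient of the larger batch without holding it in GPU memory at once. The sketch below uses a hypothetical one-parameter linear model in plain Python; it does not show how aggregation is implemented in this repository, and the micro-batch size is an arbitrary choice.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def aggregated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches.

    Only one micro-batch has to be processed at a time, which is what
    makes the technique attractive on GPUs with little memory.
    """
    accumulated, n_micro = 0.0, 0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        accumulated += grad_mse(w, xs[start:stop], ys[start:stop])
        n_micro += 1
    return accumulated / n_micro


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
w = 1.5

full = grad_mse(w, xs, ys)                                # one batch of 6 samples
aggregated = aggregated_grad(w, xs, ys, micro_batch_size=2)  # three micro-batches of 2
print(full, aggregated)
```

The two values agree up to floating-point rounding as long as all micro-batches have the same size. Layers that depend on batch statistics, such as batch normalization, would still see the smaller micro-batch.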

<img src="images/Increased_Input.PNG" /> 

## Comparison with State-of-the-Art

Compared to the current [leaderboard leader](http://research.mapillary.com/publications/arXive17.html), our best model scores surprisingly well, given that it is limited to 1/8 of the GPU memory. The categories shown in the radar chart are the official categories of the Cityscapes benchmark.

* around 3 percent difference in category flat
* around 15 percent difference in categories: construction, vehicle, sky, nature
* a factor of 2 to 3 in the categories human and object

<img src="images/Comparison_Boardleader.png" /> 

## Detailed hardware relevant model description

Since deployment and use on mobile platforms are heavily bound by the computational power of the device, it is useful to have a basic idea of the floating point operations (FLOPs, FP) and model parameters (PAR) needed. All numbers given are valid for the trained models during "productive" inference. BS = batch size, t = average time per image (includes fetching and resizing).
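As a rough guide to where such numbers come from, the parameters and FLOPs of a single standard convolution layer can be estimated by hand. The sketch below is an illustration only: the helper names are hypothetical, the layer dimensions are made up, and counting one multiply-accumulate as two FLOPs is one common convention that may differ from the one used for the tables in this section.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Trainable parameters of a standard 2D convolution layer."""
    weights = k_h * k_w * c_in * c_out
    return weights + (c_out if bias else 0)


def conv_flops(k_h, k_w, c_in, c_out, h_out, w_out):
    """Approximate FLOPs of one forward pass for one image.

    Each output value needs k_h * k_w * c_in multiply-accumulates,
    counted here as two floating point operations each (assumed convention).
    The bias addition is ignored.
    """
    macs_per_output = k_h * k_w * c_in
    return 2 * macs_per_output * c_out * h_out * w_out


# Hypothetical first layer: 3x3 kernel, RGB input, 32 filters, 112x112 output map.
print("parameters:", conv_params(3, 3, 3, 32))
print("FLOPs:", conv_flops(3, 3, 3, 32, 112, 112))
```

Because the FLOPs scale with the output resolution while the parameter count does not, increasing the input image size raises the compute cost but leaves the model size unchanged.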

### Backbone overview
<img src="images/Backbone_params.PNG" /> 

### FCN-Extension overview
<img src="images/Model_params.PNG" />