The core contribution of this work is the combination of monocular depth estimation and semantic segmentation in a single multi-task network to improve depth estimation accuracy. We extend the existing Monodepth network with a second decoder dedicated to the semantic segmentation task. The dual-decoder, multi-task network is trained with a hybrid loss function that combines the original depth-disparity losses from the depth decoder with the softmax cross-entropy loss of the semantic decoder. We also experiment with adding a Dice loss to the semantic decoder loss, which yields further improvements.
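For illustration, the sketch below shows one way such a hybrid loss could be assembled in TensorFlow 1.x. The tensor names (`depth_loss`, `seg_logits`, `seg_labels`) and the loss weights are illustrative assumptions, not the exact code of this repository:

```python
import tensorflow as tf

def dice_loss(logits, labels, num_classes, eps=1e-7):
    """Soft Dice loss averaged over classes (hypothetical helper)."""
    probs = tf.nn.softmax(logits)                      # [B, H, W, C]
    onehot = tf.one_hot(labels, num_classes)           # [B, H, W, C]
    intersection = tf.reduce_sum(probs * onehot, axis=[1, 2])
    union = tf.reduce_sum(probs + onehot, axis=[1, 2])
    dice = (2.0 * intersection + eps) / (union + eps)  # [B, C]
    return 1.0 - tf.reduce_mean(dice)

def hybrid_loss(depth_loss, seg_logits, seg_labels, num_classes,
                seg_weight=1.0, dice_weight=1.0):
    # Per-pixel softmax cross-entropy for the semantic decoder.
    xent = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=seg_labels, logits=seg_logits))
    # depth_loss carries the original Monodepth disparity losses unchanged;
    # the weights that balance the two tasks are assumptions here.
    return depth_loss + seg_weight * xent + dice_weight * dice_loss(
        seg_logits, seg_labels, num_classes)
```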
- Download the pretrained weights from here.
- You can download our pre-trained models to an existing directory by running:

```shell
sh ./utils/get_model.sh model_name output_directory
```
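For example, to fetch the Cityscapes model used in the commands below (assuming `model_cityscapes` is one of the available model names):

```shell
sh ./utils/get_model.sh model_cityscapes ~/models/
```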
- TensorFlow 1.0
- Python
- CUDA 8.0
[Cityscapes](https://www.cityscapes-dataset.com)
Cityscapes \cite{Cityscapes} is a large-scale database that focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories: flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void. The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones, captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames in the dataset were manually selected to have a large number of dynamic objects, varying scene layouts, and varying backgrounds.
You will need to register in order to download the data, which already comes with a refined train/val/test split of 5,000 images. We used 3,750 images for training.
We used `leftImg8bit_trainvaltest.zip` and `rightImg8bit_trainvaltest.zip` together with the fine labels for training.
There is a simple mode, `monodepth_simple.py`, which lets you quickly run our model on a test image. Download the pretrained weights from above and place them in the `models` folder.
```shell
python monodepth_simple.py --image_path ~/my_image.jpg --checkpoint_path ~/models/model_cityscapes
```
**Warning:** the input sizes need to be multiples of 128 for `vgg` or 64 for `resnet50`.
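If your image does not meet this constraint, the sketch below snaps it to the nearest valid size before running the model. The use of PIL and the helper name are illustrative assumptions, not part of this repository:

```python
from PIL import Image

def resize_to_multiple(path, multiple=128):
    """Resize an image down to the nearest multiple of `multiple` per side."""
    img = Image.open(path)
    w = max(multiple, (img.width // multiple) * multiple)
    h = max(multiple, (img.height // multiple) * multiple)
    return img.resize((w, h), Image.BILINEAR)

# e.g. prepare an input for the vgg encoder (multiples of 128):
resize_to_multiple('my_image.jpg', multiple=128).save('my_image_resized.jpg')
```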
The model's dataloader expects a data folder path as well as a list of filenames (relative to the root data folder):
```shell
python3 monodepth_main.py --mode train --data_path /datasets/cityscapes/ --filenames_file \
utils/cityscapes_fine_train.txt --input_height 256 --input_width 512 --batch_size 4 --log_dir \
~/tmp/ --checkpoint_path models/model_cityscapes
```
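Each line of the filenames file holds one left/right image pair, with paths relative to `--data_path`. The example line below follows the standard Cityscapes directory layout but is only illustrative; check `utils/cityscapes_fine_train.txt` for the actual format:

```text
leftImg8bit/train/aachen/aachen_000000_000019_leftImg8bit.png rightImg8bit/train/aachen/aachen_000000_000019_rightImg8bit.png
```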
You can also fine-tune from a checkpoint using `--retrain`.
You can monitor the learning process using `tensorboard` by pointing it to your chosen `log_directory`.
By default the model only saves a reduced summary to save disk space; you can disable this with `--full_summary`.
Please look at the main file for all the available options.
To test, change the `--mode` flag to `test`; the network will output the disparities in the model folder or in any other folder you specify with `--output_directory`. You will also need to load the checkpoint you want to test, which can be done with `--checkpoint_path`:
```shell
python monodepth_main.py --mode test --data_path ~/data/cityscapes/test/ \
--filenames_file ~/code/monodepth/utils/filenames/kitti_stereo_2015_test_files.txt --log_directory ~/tmp/ \
--checkpoint_path ~/tmp/my_model/model-181250
```
Please note that there is NO extension after the checkpoint name.
If your test filenames contain two files per line, the model will ignore the second one unless you use the `--do_stereo` flag.
The network will output two files, `disparities.npy` and `disparities_pp.npy`, containing the raw and post-processed disparities respectively.
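To inspect the output, a minimal sketch is shown below. It assumes one disparity map per test image stacked along the first axis, following the original Monodepth convention:

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the post-processed disparities produced by the test run.
disps = np.load('disparities_pp.npy')   # assumed shape: [num_images, H, W]
print(disps.shape)

# Visualize the disparity map of the first test image.
plt.imshow(disps[0], cmap='plasma')
plt.colorbar(label='disparity')
plt.savefig('disp_0.png')
```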
Please find the trained model for UNet here.