
This is the code base for the paper "Geometric Pretraining for Monocular Depth Estimation". The paper is currently under review, and the preprint will be made available when it is ready.




This is the project page of the paper "Geometric Pretraining for Monocular Depth Estimation".

GeometricPretraining is a pretraining task designed specifically for depth estimation. The task requires only uncalibrated images from existing datasets or the internet. After the pretraining stage, the backbone network can be transferred to depth estimation tasks. Using images from the internet, which are available in effectively unlimited quantity, we demonstrate that geometric-pretrained networks outperform ImageNet-pretrained networks by a large margin. At the time of submission, the proposed geometric-pretrained networks achieve new state-of-the-art performance using only existing training methods.

This project page contains:

  • the implementation of the pretraining task,

  • the scripts to extract training images from internet videos,

  • the geometric-pretrained networks and the corresponding transferred depth estimation networks.

All components will be open-sourced once the paper is accepted.


2020-10-02: Released some of the pretrained backbones and monodepth weights.

2020-06-22: The open-source release has been delayed by the unexpected COVID-19 situation. We will gradually release the models over the following week.

2019-11-02: We pretrained encoder networks using the new dataset and achieved new state-of-the-art results. These pretrained networks will be released shortly.

2019-11-02: We built a new dataset, using the methods described in the paper, that contains only in-the-wild YouTube frames. The dataset contains 151k image pairs for training and 16k images for testing. We are willing to release the extracted frames as an open-source dataset for the community. However, due to the YouTube license, we are afraid that the dataset cannot be released without the creators' approval. We are now contacting the creators.

The proposed method

The proposed method uses a conditional encoder-decoder network to reconstruct the optical flow between two images. With a narrow bottleneck, the encoder network is forced to learn motion-invariant structure information. After pretraining, the encoder can be transferred to depth estimation using existing methods (e.g., Monodepth2). The system is illustrated in the following figure.


After pretraining, the transferred networks show better accuracy, generalization ability, and few-shot learning performance.

  • The KITTI-trained network tested on the KITTI dataset:

GeoPt denotes the geometric-pretrained backbone. Click each photo for full resolution.

  • The KITTI-trained network tested on CityScapes:
  • The KITTI-trained network tested on YouTube videos:


The proposed dataset DrivingVideos

There are countless images on the internet. In this project, we build the pretraining dataset, DrivingVideos, using videos from the internet. Because of the current size of DrivingVideos, we pretrain on a mixture of KITTI, CityScapes, and DrivingVideos. We are still enlarging the dataset. In the future, we will use only DrivingVideos for pretraining, as experiments show that this leads to the best transfer performance. For more details, please check the paper.

A typical sample from DrivingVideos is a sequence of 3 consecutive frames with no calibration information:
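Such 3-frame samples could be assembled from an ordered list of extracted frames roughly as follows. This is a minimal sketch and not the project's extraction script: the function name `make_triplets`, the `stride` parameter, and the file names are hypothetical, and the frame spacing actually used for DrivingVideos is not stated here.

```python
from typing import Iterator, List, Tuple


def make_triplets(frames: List[str], stride: int = 1) -> Iterator[Tuple[str, str, str]]:
    """Yield (previous, current, next) triplets from one ordered clip.

    `stride` is a hypothetical parameter: the index gap between
    neighbouring frames inside a triplet.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    # Only centre frames that have a neighbour on both sides are used.
    for t in range(stride, len(frames) - stride):
        yield frames[t - stride], frames[t], frames[t + stride]


# Hypothetical file names for five frames extracted from one video clip.
clip = [f"frame_{i:05d}.jpg" for i in range(5)]
triplets = list(make_triplets(clip))
```

No camera intrinsics or poses are attached to a triplet, which matches the uncalibrated setting described above.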


The pretrained networks and transferred depth networks

Here, we provide the pretrained networks and the transferred depth networks (Monodepth2 is used for the transfer). For details, please check the paper. We provide some of the weights on Dropbox; they can be downloaded via the link.

The naming style is as follows:

  • k: KITTI RAW,
  • c: CityScapes,
  • d: driving videos from YouTube,
  • youtube: a large dataset collected from YouTube after the paper submission,
  • m: the monodepth network is trained using only monocular sequences,
  • ms: the monodepth network is trained using both stereo and monocular sequences.
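As an illustration of the naming scheme, the dataset letters in a backbone name can be decoded mechanically. The helper below is a hypothetical sketch, not part of the released code. It assumes that a suffix such as `_hd` and a trailing number such as `18` or `50` carry no dataset information, which is an inference from the model names listed on this page rather than a documented rule.

```python
import re
from typing import Dict, List

# Dataset codes taken from the naming description above.
DATASET_CODES: Dict[str, str] = {
    "k": "KITTI RAW",
    "c": "CityScapes",
    "d": "driving videos from YouTube",
}
YOUTUBE_CODE = "youtube"


def decode_backbone_name(name: str) -> List[str]:
    """Return the pretraining datasets encoded in a backbone name."""
    stem = name.split("_")[0]         # assumption: drop suffixes such as "_hd"
    stem = re.sub(r"\d+$", "", stem)  # assumption: drop a trailing number such as "18"
    if stem == YOUTUBE_CODE:
        return ["large YouTube dataset"]
    if not stem or any(ch not in DATASET_CODES for ch in stem):
        raise ValueError(f"unrecognised backbone name: {name!r}")
    return [DATASET_CODES[ch] for ch in stem]


datasets = decode_backbone_name("kcd_hd")
```

For example, `kcd_hd` decodes to KITTI RAW, CityScapes, and the driving videos, while `youtube18` decodes to the large YouTube dataset only.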

Backbone Networks:

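| Model | Layers | Resolution | KITTI | CityScapes | DrivingVideos_small | DrivingVideos_big | YouTube (new) |
|---|---|---|---|---|---|---|---|
| kcd | 18 | 640x192 | Yes | Yes | Yes | No | No |
| kc | 18 | 640x192 | Yes | Yes | No | No | No |
| d | 18 | 640x192 | No | No | No | Yes | No |
| kcd_hd | 50 | 1024x320 | Yes | Yes | Yes | No | No |
| youtube18 | 18 | 640x192 | No | No | No | No | Yes |
| youtube50 | 50 | 1024x320 | No | No | No | No | Yes |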

Transferred Networks (evaluated using the code from Monodepth2):

| Backbone | Training Mode | Abs Rel | Sq Rel | RMSE | RMSE log | delta < 1.25 |
|---|---|---|---|---|---|---|
| kcd_hd | MS (1024x320) | 0.093 | 0.704 | 4.367 | 0.183 | 0.896 |
| kcd_hd | MS (640x192) | 0.099 | 0.757 | 4.547 | 0.187 | 0.888 |
| kcd | MS (640x192) | 0.105 | 0.804 | 4.693 | 0.193 | 0.874 |
| d | M (640x192) | 0.112 | 0.820 | 4.707 | 0.189 | 0.879 |
| d | S (640x192) | 0.105 | 0.816 | 4.820 | 0.204 | 0.869 |
| youtube18 | S (640x192) | 0.105 | 0.840 | 4.785 | 0.202 | 0.873 |
| youtube18 | MS (640x192) | 0.103 | 0.819 | 4.668 | 0.190 | 0.881 |
| youtube18 | M (640x192) | 0.111 | 0.879 | 4.735 | 0.188 | 0.883 |
| youtube50 | S (1024x320) | 0.097 | 0.735 | 4.548 | 0.194 | 0.886 |
| youtube50 | MS (1024x320) | 0.094 | 0.707 | 4.335 | 0.182 | 0.897 |
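For readers unfamiliar with the column names, the sketch below computes the five metrics from paired ground-truth and predicted depths, assuming the definitions commonly used in the monocular depth literature. It is an illustration only: the function name `depth_errors` is hypothetical, and any masking, depth capping, or scaling applied by the actual evaluation code is left out. Lower is better for the first four metrics; higher is better for delta < 1.25.

```python
import math
from typing import Dict, Sequence


def depth_errors(gt: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    """Compute common depth-error metrics over paired, strictly positive depths."""
    if len(gt) != len(pred) or len(gt) == 0:
        raise ValueError("gt and pred must be non-empty and equally long")
    if min(gt) <= 0 or min(pred) <= 0:
        raise ValueError("depths must be strictly positive")
    n = len(gt)
    pairs = list(zip(gt, pred))
    # Mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    # Mean squared error relative to the ground-truth depth.
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    # Root mean squared error in depth units.
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    # Root mean squared error of the log depths.
    rmse_log = math.sqrt(sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n)
    # Fraction of points whose prediction is within a factor of 1.25 of the truth.
    delta = sum(max(g / p, p / g) < 1.25 for g, p in pairs) / n
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "delta_1.25": delta,
    }


# Toy example with two depth values (not real evaluation data).
errors = depth_errors(gt=[2.0, 4.0], pred=[1.0, 4.0])
```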







