

Annotations of a sample image. Labels are shown for a subset of 15 tasks.


This repository shares a multi-annotated dataset from the following paper:

Taskonomy: Disentangling Task Transfer Learning, CVPR 2018. Amir R. Zamir, Alexander Sax*, William B. Shen*, Leonidas Guibas, Jitendra Malik, Silvio Savarese.

The dataset includes over 4.5 million images from over 500 buildings. Each image has annotations for every one of the 2D, 3D, and semantic tasks in Taskonomy's dictionary (see below). The total size of the dataset is 11.16 TB. For more details, please see the CVPR 2018 paper.

Downloading the Dataset

To access the full dataset and the terms of use, please email the authors to receive the download links. Below you can browse a fraction of the data (a single building out of >500 buildings) as a sample.

Sample building

See the sample building (Cauthron) on the dataset website.

Sample building website front page


Data Statistics

The dataset consists of over 4.6 million images from 537 different buildings. The images are of indoor scenes. Images with people visible were excluded, and camera roll is zero for all images (pitch and yaw vary). Below are some statistics about the images that comprise the dataset.

Image-level statistics

| Property | Mean | Distribution |
| --- | --- | --- |
| Camera pitch | 0.24° | Distribution of camera pitches |
| Camera roll | 0.0° | Distribution of camera rolls |
| Camera field of view | 61.2° | Distribution of camera fields of view |
| Distance (from camera to scene content) | 5.3 m | Distribution of camera-to-point distances |
| 3D obliqueness of scene content (w.r.t. camera) | 52.9° | Distribution of point obliquenesses |
| Points in view (for point correspondences) | 55 (median) | Distribution of points in camera view |

Point-level statistics

| Property | Mean | Distribution |
| --- | --- | --- |
| Cameras per point | 5 (median) | Distribution of camera counts |

Camera-level statistics

| Property | Mean | Distribution |
| --- | --- | --- |
| Points per camera | 20.8 | Distribution of points per camera |

Model-level Statistics

| Property | Distribution |
| --- | --- |
| Image count | Distribution of image counts per model |
| Point count | Distribution of point counts per model |
| Camera count | Distribution of camera counts per model |

Data structure

A model, selected at random from the paper's training set, is shared in this repository. The folder structure is described below:

- Object classification (ImageNet 1000) annotations distilled from ResNet-152.
- Scene classification annotations distilled from PlaceNet.
- Euclidean distance images. Units of 1/512 m, with a max range of 128 m.
- Z-buffer depth images. Units of 1/512 m, with a max range of 128 m.
- Occlusion (3D) edge images.
- 2D texture edge images.
- 2D keypoint heatmaps.
- 3D keypoint heatmaps.
- All (point', view') pairs which have line of sight and a view of "point" within the camera frustum.
- Surface normal images.
- Metadata about each (point, view) pair. For each image, we keep track of its optical center; each image is uniquely identified by its (point, view) pair. Contains annotations for:
    - Room layout
    - Vanishing points
    - Point matching
    - Relative camera pose estimation (fixated)
    - and other low-dimensional geometry tasks.
- Curvature images. Principal curvatures are encoded in the first two channels; zero curvature is encoded as the pixel value 127.
- Images of the mesh rendered with new lighting.
- RGB images at 512×512 resolution.
- RGB images at 1024×1024 resolution.
- Semantic segmentation annotations distilled from FCIS. The class "0" marks "uncertain" pixels, so they should be masked out in learning.
- Pixel-level unsupervised superpixel annotations based on RGB.
- Pixel-level unsupervised superpixel annotations based on RGB + normals + depth + curvature.
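The encodings described above are simple to invert. Below is a minimal NumPy sketch for decoding depth (units of 1/512 m, max range 128 m), curvature (zero curvature at pixel value 127), and the semantic "uncertain" mask (class 0). The function names and the assumption of 16-bit single-channel depth images are illustrative, not the dataset's official API; any curvature scale factor beyond the zero offset is likewise left out.

```python
import numpy as np

def decode_depth(raw):
    """Convert a raw depth image (assumed uint16, units of 1/512 m) to meters.

    Values are clipped to the stated max range of 128 m.
    """
    meters = raw.astype(np.float32) / 512.0
    return np.clip(meters, 0.0, 128.0)

def decode_curvature(raw_rgb):
    """Recover the two principal-curvature channels.

    Zero curvature is encoded as pixel value 127, so subtract that offset.
    (Any further scale factor mapping pixel units to 1/m is an assumption
    not specified here.)
    """
    return raw_rgb[..., :2].astype(np.float32) - 127.0

def valid_semantic_mask(seg):
    """Class 0 marks 'uncertain' pixels; mask them out during learning."""
    return seg != 0

# Example on synthetic data:
raw = np.array([[512, 1024]], dtype=np.uint16)
print(decode_depth(raw))  # [[1. 2.]]  (meters)
```

The same clipping applies to both the Euclidean-distance and Z-buffer images, since both share the 1/512 m encoding.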

Dataset Splits

We provide standard train/validation/test splits for the dataset to standardize future benchmarking. The split files can be accessed here. Given the large size of the full dataset, we provide the standard splits for 4 partitions (Tiny, Medium, Full, Full+) of increasing size (see below), which users can choose based on their storage and computation resources. Full+ is inclusive of Full, Full is inclusive of Medium, and Medium is inclusive of Tiny. The table below shows the number of buildings in each partition.

| Split Name | Train | Val | Test | Total |
| --- | --- | --- | --- | --- |
| Tiny | 25 | 5 | 5 | 35 |
| Medium | 98 | 20 | 20 | 138 |
| Full | 344 | 67 | 71 | 482 |
| Full+ | 381 | 75 | 81 | 537 |
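Since each partition contains the smaller ones, the building counts in the table can be sanity-checked programmatically. A small sketch encoding the table above (the dictionary layout is just for illustration, not a format used by the split files):

```python
# Building counts per split, taken from the table above.
SPLITS = {
    "Tiny":   {"train": 25,  "val": 5,  "test": 5},
    "Medium": {"train": 98,  "val": 20, "test": 20},
    "Full":   {"train": 344, "val": 67, "test": 71},
    "Full+":  {"train": 381, "val": 75, "test": 81},
}

def total(name):
    """Total number of buildings in a partition."""
    return sum(SPLITS[name].values())

# Each partition includes the previous one, so totals must be non-decreasing.
order = ["Tiny", "Medium", "Full", "Full+"]
totals = [total(n) for n in order]
assert totals == sorted(totals)
print(dict(zip(order, totals)))
# {'Tiny': 35, 'Medium': 138, 'Full': 482, 'Full+': 537}
```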


If you find the code, data, or the models useful, please cite this paper:

  @inproceedings{zamir2018taskonomy,
    title={Taskonomy: Disentangling Task Transfer Learning},
    author={Zamir, Amir R and Sax, Alexander and Shen, William B and Guibas, Leonidas and Malik, Jitendra and Savarese, Silvio},
    booktitle={2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2018}
  }