CUMULO is a benchmark dataset for training and evaluating global cloud classification models. It merges two satellite products from the A-train constellation: the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Aqua satellite and the 2B-CLDCLASS-LIDAR product, which is derived by combining the CloudSat Cloud Profiling Radar (CPR) and the CALIPSO Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP).



The dataset is hosted here. It contains over 300k annotated multispectral images at 1 km x 1 km resolution, providing full daily coverage of the Earth for 2008, 2009 and 2016.


Option 1: syncing with your Dropbox account

  1. add CUMULO to your Dropbox account
  2. use rclone to sync it to your machine

Option 2: direct download

  1. use one of these download scripts

File Format

Data is stored in Network Common Data Form (NetCDF) following this convention.

There is 1 NetCDF file per swath of 1354x2030 pixels, 1 every 5 minutes, named:

filename =

YYYY => year
DDD => day of year, counted from 01.01.YYYY
HH => hour of day
MM => minutes
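
The timestamp fields above map directly onto a Python datetime. The sketch below assumes that DDD is the 1-based day of year (so 001 is 1 January), which is the usual MODIS convention but is not stated explicitly here. `swath_time` and `swath_slots_for_day` are hypothetical helper names, and the fields are passed in separately because the exact filename pattern is not reproduced above.

```python
from datetime import datetime, timedelta


def swath_time(yyyy: str, ddd: str, hh: str, mm: str) -> datetime:
    """Convert the YYYY, DDD, HH, MM filename fields to a datetime.

    Assumption: DDD is the 1-based day of year (001 = 1 January).
    """
    return datetime.strptime(f"{yyyy} {ddd} {hh} {mm}", "%Y %j %H %M")


def swath_slots_for_day(yyyy: str, ddd: str) -> list:
    """List the nominal 5-minute swath slots of one day.

    These are only the possible time slots; a file does not
    necessarily exist for every one of them.
    """
    start = swath_time(yyyy, ddd, "00", "00")
    return [start + timedelta(minutes=5 * i) for i in range(24 * 60 // 5)]
```

With one swath every 5 minutes, a day has at most 288 slots, which gives a quick way to check how complete a downloaded day is.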

File Content

To list the variables available in a NetCDF file, together with their descriptions, run:

ncdump -h netcdf/
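
The same header information can be read from Python. This is a minimal sketch, assuming the third-party netCDF4 package is installed and that variable descriptions are stored in the conventional `long_name` attribute, which is not confirmed for CUMULO files. `summarize_variables` is a hypothetical helper name.

```python
def summarize_variables(dataset) -> dict:
    """Map each variable name to its dimensions, shape and description.

    `dataset` is expected to behave like an open netCDF4.Dataset:
    a `.variables` mapping whose values expose `.dimensions`, `.shape`
    and, optionally, a `long_name` attribute.
    """
    summary = {}
    for name, var in dataset.variables.items():
        summary[name] = {
            "dimensions": tuple(var.dimensions),
            "shape": tuple(var.shape),
            # Assumption: descriptions live in `long_name`; fall back to "".
            "description": getattr(var, "long_name", ""),
        }
    return summary


# Usage with the netCDF4 package (the path is a placeholder):
#   from netCDF4 import Dataset
#   with Dataset("<path-to-swath-file>") as nc:
#       for name, info in summarize_variables(nc).items():
#           print(name, info["dimensions"], info["shape"], info["description"])
```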

Code Source

  1. The script extracts one CUMULO swath (as a NetCDF file) from the corresponding MODIS MYD02, MYD03, MYD06 and MYD35 files, and from CloudSat's CS_2B-CLDCLASS and/or CS_2B-CLDCLASS-LIDAR files:
python3 pipeline <save-dir> <myd02-filename>
  2. src/ contains the source code for extracting the different CUMULO features, aligning them, and filling in missing values where possible.
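
Since the pipeline command takes one MYD02 file at a time, processing many swaths means calling it in a loop. The following sketch builds one command per file and runs them in sequence. It assumes that all MYD02 files sit in a single directory and match the glob `MYD02*.hdf`, and that the command is run from the repository root. `pipeline_commands` and `run_all` are hypothetical helper names, not part of the repository.

```python
import subprocess
from pathlib import Path


def pipeline_commands(save_dir: str, myd02_dir: str, pattern: str = "MYD02*.hdf") -> list:
    """Build one `python3 pipeline <save-dir> <myd02-filename>` command per file.

    Assumption: the MYD02 files are in `myd02_dir` and match `pattern`.
    """
    files = sorted(Path(myd02_dir).glob(pattern))
    return [["python3", "pipeline", save_dir, str(path)] for path in files]


def run_all(save_dir: str, myd02_dir: str) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for cmd in pipeline_commands(save_dir, myd02_dir):
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to inspect what would be executed before starting a long batch.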


pip install gcsfs
conda install -c conda-forge pyhdf  # the pip wheels for pyhdf were broken at the time of writing
pip install satpy
pip install satpy[modis_l1b]
pip install -r requirements.txt

Machine Learning Baselines

Examples for training models on CUMULO are provided here.


If you find this work useful, please cite the original paper:

Zantedeschi, Valentina; Falasca, Fabrizio; Douglas, Alyson; Strange, Richard; Kusner, Matt J; Watson-Parris, Duncan. "Cumulo: A Dataset for Learning Cloud Classes." arXiv preprint arXiv:1911.04227.


This work is the result of the 2019 ESA Frontier Development Lab Atmospheric Phenomena and Climate Variability challenge. We are grateful to all organisers, mentors and sponsors for giving us this opportunity. We thank Google Cloud for providing the computing and storage resources used to complete this work.