<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

How to use the Radiant MLHub API to browse and download the BigEarthNet dataset
=====

This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API to download labels and source imagery for the BigEarthNet dataset. Full documentation for the API is available at [docs.mlhub.earth](http://docs.mlhub.earth).

We'll show you how to set up your authorization, see the list of available collections and datasets, and retrieve the items (the data contained within them) from those collections. 

Each item in our collection is described in JSON format compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/radiantearth/stac-spec/tree/master/extensions/label) definition.

Citation Requirements and Contact Information
----

The BigEarthNet archive was constructed by the Remote Sensing Image Analysis [(RSiM)](https://www.rsim.tu-berlin.de/menue/remote_sensing_image_analysis_group/) Group and the Database Systems and Information Management [(DIMA)](https://www.dima.tu-berlin.de/menue/database_systems_and_information_management_group/?no_cache=1) Group at the Technische Universität Berlin [(TU Berlin)](https://www.tu-berlin.de/menue/home/parameter/en/). This work is supported by the European Research Council under the ERC Starting Grant BigEarth and by the German Ministry for Education and Research as Berlin Big Data Center [(BBDC)](http://www.bbdc.berlin/home/).

The BigEarthNet archive *requires* a citation of the BigEarthNet paper whenever the archive is used. The citation for this paper is listed below, along with contact information for inquiries about the archive and a PDF manual detailing the structure of the archive.

Citation
--
G. Sumbul, M. Charfuelan, B. Demir, V. Markl, "[BigEarthNet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding](http://bigearth.net/static/documents/BigEarthNet_IGARSS_2019.pdf)", IEEE International Geoscience and Remote Sensing Symposium, pp. 5901-5904, Yokohama, Japan, 2019.



Contact Information
--
* Website: [www.bigearth.net](http://www.bigearth.net)
* Email: contact@bigearth.net
* Manual: [http://bigearth.net/static/documents/BigEarthNetManual.pdf](http://bigearth.net/static/documents/BigEarthNetManual.pdf)

Authentication
-----

Access to the Radiant MLHub API requires an API key. To get your API key, go to [dashboard.mlhub.earth](https://dashboard.mlhub.earth). If you have not used Radiant MLHub before, you will need to sign up and create a new account. Otherwise, sign in. In the **API Keys** tab, you can create the API key you will need. *Do not share* your API key with others: your usage may be limited, and sharing your API key is a security risk.

Copy the API key and paste it in the box below.

Click **Run** or press `SHIFT` + `ENTER` before moving on to run this first piece of code.

In [None]:
# only the requests module is required to access the API
import requests

# copy your API key from dashboard.mlhub.earth and paste it in the following
API_KEY = 'PASTE_YOUR_API_KEY_HERE'
API_BASE = 'https://api.radiant.earth/mlhub/v1'
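
Because the API key should not be shared, one option is to keep it out of the notebook entirely. The following is a minimal sketch, assuming you have stored the key in an environment variable; the variable name `MLHUB_API_KEY` and the helper `load_api_key` are hypothetical and not part of the Radiant MLHub API.

```python
import os

def load_api_key(env_var='MLHUB_API_KEY'):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable to your API key')
    return key
```

With this in place, `API_KEY = load_api_key()` could replace the pasted string in the cell above, so the key never appears in a shared copy of the notebook.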

Search for data collections
-----

To see what training data is available, you will want to see the collections available through the API.

A collection represents the top-most data level. Typically this means the data comes from the same source for the same geography. It might include different years or sub-geographies.

To find data with specific parameters, see the [API documentation](http://docs.mlhub.earth/?python#the-feature-collections-in-the-dataset).

To see the list, simply run the following cell. The returned list shows the collection id values, collection license, and data source citation (if available).

In [None]:
# get list of all collections
r = requests.get(f'{API_BASE}/collections?key={API_KEY}')
h = r.json()
collections = h['collections']

# print the list of collections 
for c in collections:
    print(f'ID:       {c["id"]}\nLicense:  {c.get("license", "N/A")}\nCitation: {c.get("sci:citation", "N/A")}\n')
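
The list of collections can be long, so it may help to narrow it down by name. The sketch below filters a list shaped like `h['collections']` by a substring of the collection ID. The helper `find_collections` and the sample entries are illustrative only; in the notebook you would pass the `collections` list returned by the API.

```python
def find_collections(collections, text):
    """Return the collections whose ID contains `text` (case-insensitive)."""
    needle = text.lower()
    return [c for c in collections if needle in c.get('id', '').lower()]

# Illustrative entries shaped like the items of h['collections']
sample = [
    {'id': 'bigearthnet_v1_labels'},
    {'id': 'bigearthnet_v1_source'},
    {'id': 'some_other_collection'},
]

for c in find_collections(sample, 'BigEarthNet'):
    print(c['id'])
```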

Setting the query properties
----

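The BigEarthNet dataset is split into two collections: one contains the labels and the other contains the source imagery. Labels link to their respective source imagery items, so we set our collection ID to `bigearthnet_v1_labels`. We also set `limit`, the number of items requested per page of results, which the download cells later in this notebook use.

In [None]:
collectionId = 'bigearthnet_v1_labels'
# Number of items requested per page; 100 is an assumed value, adjust as needed
limit = 100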

Downloading Items
----

The next cell contains the functions which page through the results and download the labels and source imagery.

In [None]:
import boto3 # Required to download assets hosted on S3
import os
from urllib.parse import urlparse

s3 = boto3.client('s3')

def download_s3(uri, path):
    parsed = urlparse(uri)
    bucket = parsed.netloc
    key = parsed.path[1:]
    s3.download_file(bucket, key, os.path.join(path, key.split('/')[-1]))
    print(f'Downloaded s3://{bucket}/{key}')
    
def download_http(uri, path):
    parsed = urlparse(uri)
    # Stream the response so large files are written chunk by chunk
    # instead of being loaded into memory all at once
    r = requests.get(uri, stream=True)
    with open(os.path.join(path, parsed.path.split('/')[-1]), 'wb') as f:
        for chunk in r.iter_content(chunk_size=512 * 1024):
            if chunk:
                f.write(chunk)
    print(f'Downloaded {uri}')

def get_download_uri(uri):
    r = requests.get(uri, allow_redirects=False)
    return r.headers['Location']

def download(href, path):
    download_uri = get_download_uri(href)
    parsed = urlparse(download_uri)
    
    if parsed.scheme in ['s3']:
        download_s3(download_uri, path)
    elif parsed.scheme in ['http', 'https']:
        download_http(download_uri, path)

def download_source_and_labels(item):
    labels = item.get('assets').get('labels')
    links = item.get('links')
    
    # Make the directory to download the files to
    path = f'bigearthnet/{item["id"]}/'
    os.makedirs(path, exist_ok=True)

    # Download the labels
    download(labels['href'], path)

    # Download the source imagery
    for link in links:
        if link['rel'] != 'source':
            continue
        
        r = requests.get(link['href'], params={'key': API_KEY})
        for key, asset in r.json()['assets'].items():
            download(asset['href'], path)
            
def get_items(uri, classes=None, cloud_and_shadow=None, seasonal_snow=None, max_items_downloaded=None, items_downloaded=0):
    r = requests.get(uri, params={'key': API_KEY})
    collection = r.json()
    for feature in collection.get('features', []):
        # Check if the item has one of the label classes we're interested in
        matches_class = True
        if classes is not None:
            matches_class = False
            for label_class in feature['properties'].get('labels', []):
                if label_class in classes:
                    matches_class = True
                    break
        
        # Check if the item matches the cloud and shadows filter we specify
        matches_clouds = True
        if cloud_and_shadow is not None:
            matches_clouds = feature['properties'].get('cloud_and_shadow', False) == cloud_and_shadow
            
        
        # Check if the item matches the seasonal snow filter we specify
        matches_snow = True
        if seasonal_snow is not None:
            matches_snow = feature['properties'].get('seasonal_snow', False) == seasonal_snow
            
        # If the item does not match all of the criteria we specify, skip it
        if not matches_class or not matches_clouds or not matches_snow:
            continue
            
        # Download the label and source imagery for the item
        download_source_and_labels(feature)
        
        # Stop downloading items once we reach the maximum we specify
        items_downloaded += 1
        if max_items_downloaded is not None and items_downloaded >= max_items_downloaded:
            return
        
    # Get the next page of results, if available
    for link in collection['links']:
        if link['rel'] == 'next' and link['href'] is not None:
            get_items(link['href'], classes=classes, cloud_and_shadow=cloud_and_shadow, seasonal_snow=seasonal_snow, max_items_downloaded=max_items_downloaded, items_downloaded=items_downloaded)
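
The `get_items` function above follows `next` links by calling itself, so every page adds a stack frame. For a download that spans many pages this could exceed CPython's default recursion limit of 1000. As an alternative, here is a minimal sketch of the same paging idea written as a loop. The generator `iter_features` and its `fetch` parameter are hypothetical names introduced for this example; `fetch` is any callable that takes a URI and returns the decoded JSON page.

```python
def iter_features(uri, fetch):
    """Yield every feature from a paged result, following 'next' links in a loop."""
    while uri:
        page = fetch(uri)
        yield from page.get('features', [])
        # Pick the href of the first usable 'next' link; None ends the loop
        uri = next(
            (link['href'] for link in page.get('links', [])
             if link.get('rel') == 'next' and link.get('href')),
            None,
        )

# Two fake pages stand in for API responses so the sketch runs offline
pages = {
    'page-1': {'features': [{'id': 'a'}, {'id': 'b'}],
               'links': [{'rel': 'next', 'href': 'page-2'}]},
    'page-2': {'features': [{'id': 'c'}], 'links': []},
}

print([f['id'] for f in iter_features('page-1', pages.__getitem__)])
# prints ['a', 'b', 'c']
```

Against the real API, `fetch` could be `lambda u: requests.get(u, params={'key': API_KEY}).json()`, and `itertools.islice` could cap the number of features consumed.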

Downloading all labels
----

This next cell downloads labels and source imagery from the BigEarthNet dataset. For demonstration purposes in this notebook, we limit the number of items downloaded to 1. If you remove the `max_items_downloaded` argument, the function will download all 590,326 labels along with their source imagery.

In [None]:
get_items(f'{API_BASE}/collections/{collectionId}/items?limit={limit}', max_items_downloaded=1)

Filtering downloads based on labels
----

A likely scenario is that you only want to download tiles that contain certain land cover classes, or tiles that are not cloudy. First, you need to know which land cover classes the dataset contains so you know which ones to filter on. The next cell queries the API and prints the possible values for each label.

In [None]:
r = requests.get(f'{API_BASE}/collections/{collectionId}/items?limit=1&key={API_KEY}')
label_classes = r.json()['features'][0]['properties']['label:classes']
for label_class in label_classes:
    print(f'\nClasses for {label_class["name"]}')
    for c in sorted(label_class['classes']):
        print(f'- {c}')

Filtering downloads on label classes, cloud and shadows, and seasonal snow
----

The labels in BigEarthNet have three properties:
1) An array of land cover type classes contained in the tile
2) Whether the tile contains cloud and cloud shadows
3) Whether the tile has seasonal snow

We can filter our download based on one or more of these properties.

In this next cell we will download the first 5 tiles which contain either the `Coniferous forest` or `Rice fields` classes, do not contain clouds and cloud shadows, and do not contain seasonal snow.

In [None]:
get_items(
    f'{API_BASE}/collections/{collectionId}/items?limit={limit}',
    classes=['Coniferous forest', 'Rice fields'],
    cloud_and_shadow=False,
    seasonal_snow=False,
    max_items_downloaded=5)
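
To make the filter semantics explicit, here is a minimal sketch of the per-item check that `get_items` performs, pulled out into a standalone function. The name `matches_filters` and the sample properties are hypothetical; the property keys (`labels`, `cloud_and_shadow`, `seasonal_snow`) are the ones read by `get_items` above. An item passes if it has at least one of the requested classes and matches every flag that is not `None`.

```python
def matches_filters(properties, classes=None, cloud_and_shadow=None, seasonal_snow=None):
    """Return True if an item's properties satisfy every filter that is set."""
    # Class filter: at least one requested class must be present
    if classes is not None and not set(properties.get('labels', [])) & set(classes):
        return False
    # Flag filters: a missing flag is treated as False, as in get_items
    if cloud_and_shadow is not None and properties.get('cloud_and_shadow', False) != cloud_and_shadow:
        return False
    if seasonal_snow is not None and properties.get('seasonal_snow', False) != seasonal_snow:
        return False
    return True

# Hypothetical item properties, for illustration only
sample = {
    'labels': ['Coniferous forest', 'Pastures'],
    'cloud_and_shadow': False,
    'seasonal_snow': False,
}

print(matches_filters(sample, classes=['Coniferous forest', 'Rice fields'],
                      cloud_and_shadow=False, seasonal_snow=False))
# prints True
```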