# Goose inventory bulk load

This notebook describes how WV04 images from the P2020 inventory service were bulk-loaded into the Goose database.

There are about 195,000 WV04 images in the inventory database.

The goose-python-client repo contains the Python package dgloader, which has code for loading images from the P2020 inventory service into the Goose database.

The dgloader package uses these standard P2020 environment variables to obtain a token for P2020 services:

* P2020_IDENTITY_TOKEN_SERVER
* P2020_IDENTITY_CLIENT_ID
* P2020_IDENTITY_CLIENT_SECRET

Set these environment variables before running any function in the dgloader package and before executing this notebook.  If you start Jupyter Lab from a shell, set the variables in that shell before running jupyter lab.

In [1]:
from dgloader import inventory

Make sure the P2020 environment variables are set before continuing.  The check below only tests that each name is present and never displays a value.  In particular, the value of P2020_IDENTITY_CLIENT_SECRET must not be displayed.

In [2]:
import os

# Check that each variable is set without ever printing its value
for name in ('P2020_IDENTITY_TOKEN_SERVER',
             'P2020_IDENTITY_CLIENT_ID',
             'P2020_IDENTITY_CLIENT_SECRET'):
    assert name in os.environ

AssertionError: 
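A bare AssertionError does not say which variable is missing.  As a minimal sketch, the helper below reports the names of unset variables without reading their values.  The name missing_env_vars is hypothetical and is not part of dgloader.

```python
import os

REQUIRED_VARS = (
    'P2020_IDENTITY_TOKEN_SERVER',
    'P2020_IDENTITY_CLIENT_ID',
    'P2020_IDENTITY_CLIENT_SECRET',
)

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names (never the values) of variables that are not set.

    Hypothetical helper for illustration; not part of dgloader.
    """
    if environ is None:
        environ = os.environ
    return [name for name in names if name not in environ]

missing = missing_env_vars()
if missing:
    print('Missing environment variables: {}'.format(', '.join(missing)))
else:
    print('All P2020 environment variables are set.')
```

Passing a plain dict as environ makes the helper easy to try without changing the real environment.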

## The dg catalog

In the Goose database, a catalog named "dg" holds DigitalGlobe imagery.  Before bulk loading images, make sure this catalog exists and has the correct JSON schema associated with it.  DigitalGlobe STAC items are validated against the JSON schema in the file dgloader/schemas/dg-stac-item-schema.json.

Start by connecting to the Goose service:

In [2]:
from dgcatalog import Stac

# Pick a service URL depending on the environment you're using
# service_url = 'https://api-test-2.discover.digitalglobe.com/v2/stac'
# service_url = 'https://api-dev-2.discover.digitalglobe.com/v2/stac'
service_url = 'https://api-2.discover.digitalglobe.com/v2/stac'

stac = Stac(url=service_url, username='super_tester@mailinator.com', verbose=True)

Password:  ············


Requesting token from https://geobigdata.io/auth/v1/oauth/token
Token successfully received.


The following code checks whether the "dg" catalog exists.  If it does, the catalog is updated to use the current version of the DG JSON schema.  If it does not, the catalog is created.

In [4]:
# Read the DG JSON schema file
import json
import urllib.parse
dg_schema_filename = 'dgloader/schemas/dg-stac-item-schema.json'
print('Reading DG JSON schema file {}'.format(dg_schema_filename))
with open(dg_schema_filename, 'r') as f:
    dg_schema = json.load(f)

# Try reading the dg catalog, and update its schema if it exists
try:
    dg_catalog = stac.get_catalog(catalog_id='dg')
    print('Catalog dg exists.  Updating its schema.')
    dg_catalog['stac_item_schema'] = dg_schema
    stac.update_catalog(dg_catalog)
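# Any failure (such as the HTTP 404 returned for a missing catalog) is treated
# as "catalog doesn't exist", so create it.  Exception, unlike a bare except,
# lets KeyboardInterrupt and SystemExit propagate.
except Exception:
    print('Catalog dg does not exist, creating it')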
    dg_catalog = {
        'id': 'dg',
        'stac_version': '0.6.0',
        'title': 'DigitalGlobe',
        'description': 'DigitalGlobe STAC catalog',
        'links': [
            {
                'rel': 'self',
                # urljoin drops the last path segment ("stac") unless the base ends with "/"
                'href': urllib.parse.urljoin(service_url.rstrip('/') + '/', 'catalog/dg')
            }
        ],
        'stac_item_schema': dg_schema
    }
    stac.insert_catalog(dg_catalog)

Reading DG JSON schema file dgloader/schemas/dg-stac-item-schema.json
Get catalog catalog_id=dg
GET: https://api-2.discover.digitalglobe.com/v2/stac/catalog/dg
HTTP Status: 404
Request ID: b0b83757-afaf-487a-89a3-a60ad6e3b6f0
Catalog dg does not exist, creating it
POST: https://api-2.discover.digitalglobe.com/v2/stac/catalog
HTTP Status: 201
Request ID: 9c022f4f-a3d5-4c1e-a42b-c550bb3120e1


## Goose loader

The Goose loader is a small system in AWS for bulk loading STAC items.  It consists of an SQS queue and a Lambda function that processes the items in the queue.  To load a large number of STAC items into the Goose database, insert them into the queue.  The Lambda function will eventually insert them into the database.

There is also a dead-letter queue to handle items that the Lambda function failed to insert.

The items you put in the queue may be individual STAC items or JSON feature collections in which each feature is a STAC item.
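To illustrate the two accepted message shapes, here is a minimal sketch of how a consumer might turn either one into a flat list of STAC items.  It assumes a feature collection follows the GeoJSON layout (a "type" of "FeatureCollection" and a "features" list).  The function name and the sample item ids are made up, and the actual Lambda function's logic is not shown in this notebook.

```python
import json

def stac_items_from_body(body):
    """Return the list of STAC items contained in a queue message body.

    Assumption: a feature collection is a JSON object whose "type" is
    "FeatureCollection" and whose "features" list holds the STAC items.
    Anything else is treated as a single STAC item.
    """
    doc = json.loads(body)
    if doc.get('type') == 'FeatureCollection':
        return list(doc.get('features', []))
    return [doc]

# Sample message bodies with made-up ids
single = json.dumps({'type': 'Feature', 'id': 'item-1'})
collection = json.dumps({
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'id': 'item-2'},
        {'type': 'Feature', 'id': 'item-3'},
    ],
})

print([item['id'] for item in stac_items_from_body(single)])
print([item['id'] for item in stac_items_from_body(collection)])
```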

When inserting items into the queue you must use an SQS message attribute to indicate which catalog to insert the item into, because neither a STAC item nor a JSON feature collection has a place to specify the catalog.  Use a single attribute named "dg", of type String, whose value is the name of the catalog.
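As a sketch of what such a message could look like, the code below builds the arguments for the boto3 SQS client's send_message call, with the catalog name carried in the "dg" message attribute described above.  The helper names are hypothetical and the queue URL is a placeholder, since the real queue URL is not given in this notebook.

```python
import json

def build_send_message_kwargs(queue_url, item, catalog_name):
    """Build keyword arguments for an SQS send_message call.

    The catalog name travels in a String message attribute named "dg";
    the message body is the STAC item (or feature collection) as JSON.
    """
    return {
        'QueueUrl': queue_url,
        'MessageBody': json.dumps(item),
        'MessageAttributes': {
            'dg': {'DataType': 'String', 'StringValue': catalog_name},
        },
    }

def enqueue_item(sqs_client, queue_url, item, catalog_name):
    """Send one item to the queue using any client with a send_message method."""
    kwargs = build_send_message_kwargs(queue_url, item, catalog_name)
    return sqs_client.send_message(**kwargs)

# Placeholder queue URL and a made-up item, for illustration only
example = build_send_message_kwargs(
    'https://sqs.example.invalid/goose-loader-queue',
    {'type': 'Feature', 'id': 'item-1'},
    'dg',
)
print(json.dumps(example['MessageAttributes'], indent=2))

# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   sqs = boto3.client('sqs')
#   enqueue_item(sqs, queue_url, stac_item, 'dg')
```

Keeping the argument building separate from the send makes it possible to inspect a message before anything is sent to AWS.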