<img src="img/supporting_fair_data_header.png">

In this notebook we demonstrate how the Globus platform can be used to build automated pipelines that make arbitrary data more Findable, Accessible, Interoperable, and Reusable (FAIR). We demonstrate flexible access control, descriptive metadata, and the use of persistent identifiers, as well as various ways to search and discover data based on these attributes.

We will walk through the following data flow:
1. Authenticate with Globus and get tokens for accessing various services
1. Assemble a dataset and move the data to a remote, immutable endpoint, with restricted access
1. Define some metadata for our dataset
1. Mint a persistent identifier for the data
1. Index descriptive metadata in Globus Search so that it is discoverable by other users

The basic tutorial flow is illustrated below.  

<img src="img/publication_flow.png" alt="Automated data publication flow" align="CENTER" style="width: 85%;"/>

## Prerequisites

To complete this tutorial you will need to make sure you are in the [Tutorial Users Group](https://app.globus.org/groups/50b6a29c-63ac-11e4-8062-22000ab68755).

In [None]:
import json
# Globus SDK, for interacting with Globus Services (pip install globus-sdk)
import globus_sdk
# Minid, for minting identifiers (pip install minid)
import minid

# Globus Endpoint for storing published data (petrel#testbed)
publication_endpoint = "e56c36e4-1063-11e6-a747-22000bf2d559"
http_hostname = 'testbed.petrel.host'

# Globus Group which can view published datasets
access_group = "50b6a29c-63ac-11e4-8062-22000ab68755"

# URL for the endpoint
http_base_url = "https://testbed.petrel.host/"

# search index ID to store metadata
search_index = "f702761b-3a05-4ba1-af2b-c0e07850c6f1"

# ID of this tutorial notebook as registered with Globus Auth
CLIENT_ID = 'd61ed2e0-b4f9-4fe9-9433-41e2528a807d'

# 1. Authentication

Before implementing the automated data flow we must authenticate with Globus and request access tokens to use the transfer, search, and identifier services. Here we get the tokens available in JupyterHub, and create clients for interacting with Globus services.

In [None]:
import pickle, base64, os, pprint

if os.getenv('GLOBUS_DATA'):
    # get Globus Auth token data from the JupyterHub environment
    data = pickle.loads(base64.b64decode(os.getenv('GLOBUS_DATA')))
    
    # extract access token for each service
    transfer_token = data['tokens']['transfer.api.globus.org']['access_token']
    search_token = data['tokens']['search.api.globus.org']['access_token']
    minid_token = data['tokens']['85114005-42e6-4671-a73a-0a40150c2b88']['access_token']

else:
    # not running in JupyterHub environment; need to authenticate user
    native_auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)

    # start a flow with a specific set of requested scopes (levels of access to Globus apps/services)
    # after login, you will be prompted to grant this notebook access to these services
    transfer_scope = 'urn:globus:auth:scope:transfer.api.globus.org:all'
    search_scope = 'urn:globus:auth:scope:search.api.globus.org:all'
    minid_scope = 'https://auth.globus.org/scopes/identifiers.fair-research.org/writer'
    native_auth_client.oauth2_start_flow(
        requested_scopes=[
            transfer_scope,
            search_scope,
            minid_scope
        ]
    )
    print("Login Here:\n\n{0}".format(native_auth_client.oauth2_get_authorize_url()))
    print("\nIMPORTANT NOTE: the link above can only be used once!")
    print("If login or a later step in the flow fails, you must execute this cell again to generate a new link.")
    
    # add the code that you got from Globus below
    auth_code = input('PASTE YOUR AUTH CODE HERE> ')

    # and exchange it for a response object containing your token(s)
    tokens = native_auth_client.oauth2_exchange_code_for_tokens(auth_code)

    # extract access token for each service
    transfer_token = tokens.by_scopes[transfer_scope]['access_token']
    search_token = tokens.by_scopes[search_scope]['access_token']
    minid_token = tokens.by_scopes[minid_scope]['access_token']
    
# see what the tokens look like
print("Retrieved tokens:")
print("Transfer: %s" % transfer_token)
print("Search: %s" % search_token)
print("Minid: %s" % minid_token)

In [None]:
# create clients to access each of the services
# to pass tokens to clients, wrap them in GlobusAuthorizers and pass the results to client objects
# these are generic objects which support multiple authentication methods - access Tokens are just one
transfer = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))
search = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(search_token))
minid_client = minid.MinidClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(minid_token))

# 2. Assemble Dataset

In the first stage of the flow we move the data to a location that is immutable, accessible only to authorized users (i.e. those in the Tutorial Users group), and able to scale as needed. We use a Globus shared endpoint for this purpose, as it allows us to dynamically manage access to data.

To isolate users' datasets from each other we create a unique directory on our shared endpoint. To avoid name conflicts, we will name the directory using a UUID.

In [None]:
import uuid

# use Globus Transfer to create a new directory
share_dir = '/' + str(uuid.uuid4()) + '/'
r = transfer.operation_mkdir(publication_endpoint, path=share_dir)

print("Dataset path: %s" % share_dir)
print("https://app.globus.org/file-manager?origin_id=%s&origin_path=%s" % (publication_endpoint, share_dir))
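
The link printed above interpolates the path directly into the query string. The following is a minimal sketch, assuming you want the parameters percent-encoded (useful if a directory name ever contains spaces or other reserved characters). It uses only the standard library; the helper name `file_manager_url` and the example path are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def file_manager_url(endpoint_id, path):
    """Build a Globus web app File Manager link with encoded query parameters."""
    query = urlencode({"origin_id": endpoint_id, "origin_path": path})
    return "https://app.globus.org/file-manager?" + query

# The endpoint ID is this tutorial's publication endpoint; the path is made up.
url = file_manager_url("e56c36e4-1063-11e6-a747-22000bf2d559", "/my dataset/")
print(url)
```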

Having created the directory we now need to populate it with our dataset. For simplicity, we will move sample Globus data from the "Globus Tutorial Endpoint." You are welcome to use any data you like, just update the `source_endpoint` and `source_path`.

In [None]:
# define the source endpoint and directory containing data to be published
# (Globus Tutorial Endpoint 1):/share/godata/
source_endpoint = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec'
source_path = '/share/godata/file1.txt'
share_path = share_dir + os.path.basename(source_path)

# TransferData is a helper class for building well-formed Transfer Task documents for the Globus Transfer Service
tdata = globus_sdk.TransferData(
    transfer, source_endpoint, publication_endpoint,
    label='Tutorial copy data', sync_level='checksum')

# you can add multiple files and directories to transfer -- for our case, just add one
tdata.add_item(source_path, share_path)

# submit the transfer and get a task document to describe it
task_description = transfer.submit_transfer(tdata)

We now wait for the transfer to complete using the Globus SDK `task_wait` method. To confirm that the data was transferred correctly, we perform an `ls` operation on the shared endpoint. Note: in this example we also record the names of the files in the publication directory so that we can associate metadata later in the tutorial.

In [None]:
# NOTE: It's technically possible for the task to terminate with a failure. This code does not handle this condition.

# wait up to 100s, checking every 1s
completed = transfer.task_wait(
    task_description['task_id'], timeout=100, polling_interval=1)

transferred_files = ''

if not completed:
    print('Transfer still running...')
else:
    for f in transfer.operation_ls(publication_endpoint, path=share_dir):
        transferred_files = f['name'] + ';' + transferred_files

print(transferred_files)
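
`task_wait` only reports whether the task finished within the timeout, not whether it succeeded. The following is a hedged sketch of an outcome check, assuming the task document returned by `transfer.get_task(task_id)` carries a `status` field with the values `ACTIVE`, `INACTIVE`, `SUCCEEDED`, or `FAILED`, and a `files_transferred` count. `summarize_task` is a hypothetical helper, and the sample documents are made up so that the block runs without a Globus connection.

```python
def summarize_task(task_doc):
    """Return a short, human-readable outcome for a Transfer task document."""
    status = task_doc.get("status")
    if status == "SUCCEEDED":
        return "ok: %s file(s) transferred" % task_doc.get("files_transferred", "?")
    if status == "FAILED":
        return "failed: inspect the task in the Globus web app for details"
    if status in ("ACTIVE", "INACTIVE"):
        return "not finished (status %s)" % status
    return "unknown status: %r" % (status,)

# Made-up documents standing in for the data of transfer.get_task(task_id)
done = summarize_task({"status": "SUCCEEDED", "files_transferred": 1})
running = summarize_task({"status": "ACTIVE"})
print(done)
print(running)
```

In the notebook, passing `transfer.get_task(task_description['task_id']).data` to this helper after `task_wait` returns would apply it to the real task.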

Now that the data are placed on a shared endpoint, and in a unique directory, we can share the data with individuals or groups of users. Below we share the data with the "Tutorial Users Group" so that other tutorial participants will be able to view and download files. 

In [None]:
# this is a rule which
# - grants Read access, permissions="r"
# - to the Tutorial Users Group, access_group
# - on the directory we generated above, share_dir
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'group', 
    'principal': access_group,
    'path': share_dir,
    'permissions': 'r'
}

# add the access control rule to the shared endpoint
result = transfer.add_endpoint_acl_rule(publication_endpoint, rule_data)
print(result['message'])
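
Because the access rule is a plain dictionary, a small helper can build it and catch common mistakes before the request is sent. This is a sketch, assuming that directory paths in access rules begin and end with `/` (which matches how `share_dir` is constructed above) and that the permissions are either `r` or `rw`. `make_acl_rule` is a hypothetical name, not part of the Globus SDK, and the example path is made up.

```python
def make_acl_rule(principal, path, principal_type="group", permissions="r"):
    """Build an access rule document for a directory on a shared endpoint."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError("directory path must start and end with '/'")
    return {
        "DATA_TYPE": "access",
        "principal_type": principal_type,
        "principal": principal,
        "path": path,
        "permissions": permissions,
    }

# Read-only rule for the Tutorial Users Group on a made-up directory
rule = make_acl_rule("50b6a29c-63ac-11e4-8062-22000ab68755", "/example-dir/")
print(rule)
```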

# 3. Create Metadata to Describe Dataset

We will define simple metadata which describes our dataset. This metadata will be used for registering the identifier and also for loading into our search index to enable discovery of the published dataset.

You should update the metadata below to reflect your publication. Add your name as a contributor and update the title, date, and keywords. 

In [None]:
metadata = {
    'title': 'My Globus Tutorial Dataset',
    'contributors': ['John Smith', 'FrobozzCo', 'Zaphod Beeblebrox'],
    'date': '2019-01-01',
    'keywords': ['FCD#3', 'Blanket', 'Panic', ]
}
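
Since this metadata is reused for both the identifier and the search index, it can help to check it before going further. The sketch below assumes the four fields used in this tutorial are required and that dates are written as `YYYY-MM-DD`; both are conventions of this example rather than rules imposed by Globus. `check_metadata` is a hypothetical helper.

```python
from datetime import datetime

REQUIRED_FIELDS = ("title", "contributors", "date", "keywords")

def check_metadata(md):
    """Return a list of problems found in a metadata dictionary (empty if none)."""
    problems = ["missing field: %s" % f for f in REQUIRED_FIELDS if f not in md]
    if "date" in md:
        try:
            datetime.strptime(md["date"], "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append("date is not in YYYY-MM-DD form: %r" % (md["date"],))
    for field in ("contributors", "keywords"):
        if field in md and not isinstance(md[field], list):
            problems.append("%s should be a list" % field)
    return problems

sample = {
    "title": "My Globus Tutorial Dataset",
    "contributors": ["John Smith"],
    "date": "2019-01-01",
    "keywords": ["Blanket", "Panic"],
}
good_problems = check_metadata(sample)
bad_problems = check_metadata(dict(sample, date="Jan 1 2019"))
print(good_problems)
print(bad_problems)
```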

# 4. Associate an Identifier

Next we associate a persistent and unambiguous identifier with the dataset. This allows others to refer to a permanent name rather than a potentially volatile storage location reference.

When minting an identifier the following information must be provided:
* One or more locations to access the data, such as a URL representing a particular path on a Globus endpoint
* Metadata describing a mixture of publication-specific attributes (e.g., creator, checksum) and optionally extensible, user-defined attributes
* Access policies governing which users can access the identifier

Minids are public, simple, and lightweight identifiers that we can use for this example. We will also provide the checksum in this case.

In [None]:
# define a location for accessing the data
dataset_location = "https://%s%s" % (http_hostname, share_path)

dataset_identifier = minid_client.register(
    locations=[dataset_location],
    title=metadata['title'],
    checksums=[{
        'function': 'sha256',
        'value': '2c8b08da5ce60398e1f19af0e5dccc744df274b826abe585eaba68c525434806'
    }],
    metadata={
        'date': metadata['date'],
        'contributors': metadata['contributors']
    },
    test=True,
)

metadata['identifier'] = dataset_identifier.data['identifier']

print("Identifier %s" % dataset_identifier.data['identifier'])
print("location %s" % dataset_identifier.data['location'])
print("Metadata %s" % dataset_identifier.data['metadata'])
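
The checksum passed to `register` above is hard-coded for the sample file. If you publish different data, the value has to be computed from the file itself. The following is a minimal sketch using the standard library, assuming you have a local copy of the file; `sha256_of_file` is a hypothetical helper and the temporary file stands in for your data.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a temporary file standing in for the dataset
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example file contents\n")
checksum_entry = {"function": "sha256", "value": sha256_of_file(tmp.name)}
os.remove(tmp.name)
print(checksum_entry)
```

The resulting dictionary has the same shape as the entries in the `checksums` list used in the cell above.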

Now that we have minted the identifier, we can resolve it to look up its metadata and retrieve a link to the data. Irrespective of the service used to mint an identifier, you should ensure that the scheme is also registered with other resolvers, such as [n2t.net](https://n2t.net), the Name-to-Thing resolver.

Note: Registration takes a few moments to propagate. If the identifier doesn't resolve, please wait a few seconds and try again.

In [None]:
print('https://n2t.net/{}'.format(metadata['identifier']))
print('https://identifiers.fair-research.org/{}'.format(metadata['identifier']))

# 5. Index Descriptive Metadata

In this stage of the flow we aim to index the metadata that describes our published dataset. For this purpose we use Globus Search, a flexible, schema-agnostic search platform with fine grained access control on data and metadata. Globus Search provides powerful, free-text search capabilities via which others can discover our published dataset.

Globus Search supports user-managed indexes in which an administrator may create an index and define policies regarding its use, including who can manage the index, ingest metadata, and query the index.

Complete documentation for using Globus Search is available at https://docs.globus.org/api/search/.

We have created an index for this tutorial. You can use the Globus SDK to retrieve information about the index as follows:

In [None]:
tutorial_index = search.get_index(index_id=search_index)
print(tutorial_index['display_name'])
print(tutorial_index['description'])

## Indexing Data

Globus Search supports scalable indexing of arbitrary entries into a selected index. An entry is comprised of three types of information:
1. A subject, which represents a name or target for the entry (e.g., a URL for a Globus-accessible file or directory)
1. Arbitrary metadata represented as a collection of attributes in nested JSON structure
1. A visibility policy that defines which users or groups are able to view and query the subject and its metadata

To index metadata we construct a JSON object that includes this information, and use the `ingest` function to add it to the index:

In [None]:
subject =  "https://%s%s" % (http_hostname, share_path)
ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": subject,
        "visible_to": ["public"],
        "content": metadata
    }
}
result = search.ingest(search_index, ingest_data)
print("Documents indexed: %s" % result['num_documents_ingested'])
print("Subject: %s" % subject)
print(metadata)
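
The cell above ingests a single entry. When several files need to be indexed, the entries can be batched into one document. This is a sketch, assuming the `GMetaList` ingest type, which wraps individual entries in a `gmeta` list; please check the Globus Search documentation before relying on it. `make_gmeta_list` is a hypothetical helper and the subjects are made-up URLs.

```python
def make_gmeta_list(entries, visible_to=("public",)):
    """Wrap (subject, content) pairs in a single GMetaList ingest document."""
    return {
        "ingest_type": "GMetaList",
        "ingest_data": {
            "gmeta": [
                {
                    "subject": subject,
                    "visible_to": list(visible_to),
                    "content": content,
                }
                for subject, content in entries
            ]
        },
    }

batch = make_gmeta_list([
    ("https://testbed.petrel.host/example/a.txt", {"title": "File A"}),
    ("https://testbed.petrel.host/example/b.txt", {"title": "File B"}),
])
print(len(batch["ingest_data"]["gmeta"]))
```

The resulting document would be passed to `search.ingest` in the same way as the single-entry document.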

# Search

Globus Search implements a flexible query model that supports two types of queries: simple, free-text queries and complex, structured queries.

Simple queries perform basic sub-string matching against any metadata fields that are visible to the querying user.
As with web search, the results of a simple search are ordered based on the computed "best match" for the query. 

A simple query is as easy as passing a string to the `search` function.  The results are an ordered list of result objects. 

Update the following free text query to discover your dataset. 

In [None]:
query='john'

search_results = search.search(index_id=search_index, q=query)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Entries: %s" % json.dumps(i['entries']))
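
The raw results are nested: each item in `gmeta` has a subject and a list of entries. The sketch below flattens them into `(subject, content)` pairs, assuming each entry stores its metadata under a `content` key. `iter_contents` is a hypothetical helper, and the sample response is made up so that the block runs offline.

```python
def iter_contents(search_results):
    """Yield (subject, content) pairs from a search response."""
    for result in search_results.get("gmeta", []):
        for entry in result.get("entries", []):
            yield result["subject"], entry.get("content", {})

# Made-up response with the same overall shape as the results printed above
sample_results = {
    "count": 1,
    "gmeta": [
        {
            "subject": "https://testbed.petrel.host/example/file1.txt",
            "entries": [{"content": {"title": "My Globus Tutorial Dataset"}}],
        }
    ],
}
pairs = list(iter_contents(sample_results))
for subject, content in pairs:
    print(subject, content.get("title"))
```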

Globus Search also supports an advanced query mode in which more precise queries can be expressed, for example queries that target specific attributes, range expressions, or exact matches.

First we search for your published dataset using the minted identifier; we then query for all publications with a specific contributor.

In [None]:
search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['entries'])  

In [None]:
search_results = search.search(search_index, 'contributors: "John Smith"', advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % json.dumps(i['entries']))

## Complex queries

Complex queries take the form of a structured JSON document, and are more commonly used when the query is created programmatically. They may reference specific metadata fields, and may apply criteria such as value ranges, wildcards, and regular expressions.

For example, to run the same free-text search as above but limit results to publications dated between 2000 and 2020, we can add a range filter to the query.

Note: We use the Globus SDK SearchQuery to construct complex queries. We also show the resulting JSON query object used to execute the query. 

In [None]:
structured_query = (globus_sdk.SearchQuery(q=query)
                    .add_filter('date', [{'from': 2000, 'to': 2020}], type='range'))
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))
print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s\n" % json.dumps(i['entries']))
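
For reference, the same kind of query can be written as a plain dictionary. This is a sketch of the document that `SearchQuery.add_filter` is expected to build, assuming filters are stored as a list of objects with `type`, `field_name`, and `values` keys; compare it against the "Structured Query Object" printed by the cell above. `range_query` is a hypothetical helper.

```python
import json

def range_query(q, field, low, high):
    """Build a free-text query restricted by a range filter on one field."""
    return {
        "q": q,
        "filters": [
            {
                "type": "range",
                "field_name": field,
                "values": [{"from": low, "to": high}],
            }
        ],
    }

query_doc = range_query("john", "date", 2000, 2020)
print(json.dumps(query_doc, indent=2))
```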

Complex queries may also specify facets&mdash;a method for generating categories and associated frequencies for particular metadata fields. For example, here is a query to produce keyword facets:

In [None]:
structured_query = (globus_sdk.SearchQuery(q='*').add_facet('Publication Keywords', 'keywords'))
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))
print("Results\nCount: %s" % search_results['count'])
print("\nFacets")
for i in search_results['facet_results']:
    for j in i['buckets']:
        print ("%s (%s)" % (j['value'], j['count']))
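
Facet results are easier to reuse once they are collected into a dictionary. The following is a minimal sketch, assuming each facet result has a `name` and a list of `buckets` with `value` and `count` keys, as the loop above implies. `facets_to_dict` is a hypothetical helper and the sample counts are made up.

```python
def facets_to_dict(search_results):
    """Map each facet name to a {value: count} dictionary."""
    return {
        facet["name"]: {b["value"]: b["count"] for b in facet["buckets"]}
        for facet in search_results.get("facet_results", [])
    }

# Made-up response fragment with the same shape as 'facet_results' above
sample_facets = {
    "facet_results": [
        {
            "name": "Publication Keywords",
            "buckets": [
                {"value": "Panic", "count": 3},
                {"value": "Blanket", "count": 1},
            ],
        }
    ]
}
keyword_counts = facets_to_dict(sample_facets)
print(keyword_counts)
```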

# Advanced indexing

One of the benefits of the Globus Search model is that you can associate visibility policies with records and metadata. Here we demonstrate how you can add a new metadata entry to a record and make it visible only to a particular group of users. 

Update the metadata added below, and confirm that the queries now show the updated metadata. Note: When querying over these entities the results will collapse metadata for the same root subject. 

In [None]:
import time

ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "https://%s%s" % (http_hostname, share_path),
        "id": "rating",
        "visible_to": ['urn:globus:groups:id:%s' % access_group],
        "content": {
            "rating": "good",
        }
    }
}
result = search.ingest(search_index, ingest_data)
while search.get_task(result['task_id'])['state'] in ['PENDING', 'PROGRESS']:
    print('Ingesting...')
    time.sleep(1)
print('Done.')

search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['entries'])                                   
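
The polling loop in the cell above runs until the ingest task leaves the `PENDING` or `PROGRESS` state, with no upper bound on the wait. A sketch of a bounded version follows. It takes any callable that returns the current state, so it can be tried without a Globus connection; `wait_for_task` and the fake state sequence are hypothetical.

```python
import time

def wait_for_task(get_state, timeout=30.0, interval=1.0):
    """Poll get_state() until it leaves PENDING/PROGRESS or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state not in ("PENDING", "PROGRESS"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("task still %s after %s seconds" % (state, timeout))
        time.sleep(interval)

# Fake state source standing in for search.get_task(task_id)['state']
states = iter(["PENDING", "PROGRESS", "SUCCESS"])
final_state = wait_for_task(lambda: next(states), interval=0)
print(final_state)
```

In the notebook, `wait_for_task(lambda: search.get_task(result['task_id'])['state'])` would poll the real ingest task.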