# Research Object Composer tutorial

This is a [Jupyter Notebook](https://jupyter.org/) demonstrating how a client can use the [Research Object Composer](https://github.com/researchobject/research-object-composer) REST API.

For requirements to run this notebook interactively, see the [README](https://github.com/ResearchObject/research-object-composer/blob/master/README.md). 

The [RO Composer API](https://researchobject.github.io/research-object-composer/api/) is documented using [Swagger OpenAPI](https://swagger.io/docs/specification/about/) 2.0, so the REST API can be integrated from many programming languages. This notebook uses plain [Python](https://www.python.org/) so as not to hide too much of the HTTP details.

To execute each cell when running this notebook, select each in order, then click the **▶️Run** button above.

## Python requirements

For the examples below we'll use the Python library [requests](https://pypi.org/project/requests/) to show the HTTP interactions. The rest of this notebook assumes a basic knowledge of [REST services](https://en.wikipedia.org/wiki/Representational_state_transfer).

If the below `import` does not work, try on the command line where you started Jupyter Notebook: `pip install requests`

In [141]:
import requests

RO Composer is meant to be installed on local infrastructure or as a cloud service. The examples below use a demo service hosted by The University of Manchester, which is not supported and may become unavailable in the future.

If you are testing the service locally using _Docker Compose_ (see [README](https://github.com/ResearchObject/research-object-composer/blob/master/README.md)), change the value below to `http://localhost:8080`, or use the equivalent server name if you are hosting it as a cloud service.

In [142]:
host = "http://openphacts.cs.man.ac.uk:8080"

## Profiles

The RO Composer supports creating research objects for multiple **profiles**. Each profile is [defined internally](https://github.com/ResearchObject/research-object-composer/tree/master/src/main/resources/public/schemas) using [JSON Schema](https://json-schema.org/), but we can query the `/profiles` service to see which profiles are installed:


In [144]:
r = requests.get(host + "/profiles")
r.status_code

200

HTTP status code `200` means **OK**, so let's see what the _content type_ of the result is:

In [145]:
r.headers["Content-Type"]

'application/hal+json;charset=UTF-8'
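
A client that wants to check for HAL before parsing the body could compare just the media type and ignore parameters such as `charset`. The following is a minimal sketch; the helper names `media_type` and `is_hal_json` are illustrative and not part of RO Composer or `requests`.

```python
def media_type(content_type: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def is_hal_json(content_type: str) -> bool:
    """True if the Content-Type header value announces HAL-flavoured JSON."""
    return media_type(content_type) == "application/hal+json"


# The header value returned by the demo service above
header = "application/hal+json;charset=UTF-8"
print(media_type(header))
print(is_hal_json(header))
```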

The API results from RO Composer are JSON documents that follow the Hypertext Application Language ([HAL](http://stateless.co/hal_specification.html)) patterns for RESTful services. Let's look at the content:

In [146]:
r.json()

{'_embedded': {'researchObjectProfileList': [{'id': 1,
    'name': 'data_bundle',
    'fields': ['data', '_metadata'],
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
     'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'},
     'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}}},
   {'id': 2,
    'name': 'draft_task',
    'fields': ['input', 'workflow', 'workflow_params', '_metadata'],
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task'},
     'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/draft_task.schema.json'},
     'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task/research_objects'}}}]},
 '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles'}}}

In HAL, the `_links` section contains related REST resources, in this case only `self`, whose `href` refers back to the HTTP resource we just requested.

The `_embedded` section contains additional REST resources whose properties are partially embedded. Within `researchObjectProfileList` we therefore find the different profiles supported by this service. Let's look at their `name` fields:

In [147]:
profiles = r.json()["_embedded"]['researchObjectProfileList']
[p["name"] for p in profiles]

['data_bundle', 'draft_task']
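
Since every HAL resource uses the same `_links` and `_embedded` layout, a client could wrap the lookups in two small helpers. This is only a sketch: `hal_link` and `hal_embedded` are hypothetical names, and `sample` is an abbreviated copy of the response shown above so that the example runs without contacting the demo service.

```python
def hal_link(resource: dict, rel: str) -> str:
    """Return the href of the named link relation of a HAL resource."""
    links = resource.get("_links", {})
    if rel not in links:
        available = sorted(links)
        raise KeyError(f"no link relation {rel!r}; available: {available}")
    return links[rel]["href"]


def hal_embedded(resource: dict, name: str) -> list:
    """Return the list of embedded resources under the given name, or []."""
    return resource.get("_embedded", {}).get(name, [])


# Abbreviated copy of the /profiles response above (one profile only)
sample = {
    "_embedded": {"researchObjectProfileList": [{
        "id": 1,
        "name": "data_bundle",
        "fields": ["data", "_metadata"],
        "_links": {
            "self": {"href": "http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle"},
            "schema": {"href": "http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json"},
            "researchObjects": {"href": "http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects"},
        },
    }]},
    "_links": {"self": {"href": "http://openphacts.cs.man.ac.uk:8080/profiles"}},
}

for profile in hal_embedded(sample, "researchObjectProfileList"):
    print(profile["name"], hal_link(profile, "schema"))
```

Following links by relation name rather than building URLs by hand keeps the client working if the server changes its URL layout.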

In this installation, the profile `data_bundle` is for Research Objects containing arbitrary datasets, while `draft_task` is for more specific ROs describing workflow executions. We'll look at the first in detail and see it only expects the fields `data` and `_metadata`:

In [148]:
bundle_profile = profiles[0]
bundle_profile["fields"]

['data', '_metadata']

We can request the underlying [JSON Schema](https://json-schema.org/) to see details of these fields at `/schemas/{name}` - linked to from `schema` under our profile's `_links`.

In [149]:
links = bundle_profile["_links"]
links

{'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
 'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'},
 'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}}

In [150]:
schema = links["schema"]
schema

{'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'}

In [151]:
schema_response = requests.get(schema["href"])
schema_response.json()

{'$schema': 'http://json-schema.org/draft-07/schema',
 'type': 'object',
 '$baggable': {'data': '/'},
 'properties': {'_metadata': {'$ref': '/schemas/_base.schema.json#/definitions/Metadata'},
  'data': {'type': 'array',
   'items': {'$ref': '/schemas/_base.schema.json#/definitions/RemoteItem'}}}}

You may notice that the JSON Schema defines the `_metadata` and `data` keys by referencing a [base schema](https://github.com/ResearchObject/research-object-composer/blob/master/src/main/resources/public/schemas/_base.schema.json) that is common to all Research Objects. However, we do not need to learn the details of the profile's JSON Schema, as the RO Composer will make individual REST resources for each field.
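
If you do want to inspect a referenced definition, the `$ref` values are relative URLs with a JSON Pointer as fragment, so they can be resolved against the URL the schema was fetched from. The sketch below uses only the standard library; `resolve_ref` and `follow_pointer` are hypothetical helpers, and `stand_in` is a made-up placeholder document, not the real base schema.

```python
from urllib.parse import urldefrag, urljoin


def resolve_ref(schema_url: str, ref: str) -> tuple:
    """Split a $ref into (absolute document URL, JSON Pointer fragment)."""
    absolute = urljoin(schema_url, ref)
    document_url, pointer = urldefrag(absolute)
    return document_url, pointer


def follow_pointer(document: dict, pointer: str):
    """Walk a JSON Pointer such as '/definitions/RemoteItem' through a dict."""
    node = document
    for part in pointer.split("/"):
        if part:
            # Undo the JSON Pointer escapes for '/' and '~'
            node = node[part.replace("~1", "/").replace("~0", "~")]
    return node


schema_url = "http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json"
ref = "/schemas/_base.schema.json#/definitions/RemoteItem"
document_url, pointer = resolve_ref(schema_url, ref)
print(document_url)
print(pointer)

# Placeholder standing in for the JSON fetched from document_url
stand_in = {"definitions": {"RemoteItem": {"type": "object"}}}
print(follow_pointer(stand_in, pointer))
```

In the notebook, `stand_in` would be replaced by `requests.get(document_url).json()`.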

The REST resource that collects [research objects for the given profile](https://researchobject.github.io/research-object-composer/api/#operation/listResearchObjectsForProfile) is at `/profiles/{name}/research_objects` and is linked to from the `researchObjects` link of the profile:



In [152]:
researchObjects = links["researchObjects"]
researchObjects

{'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}

This resource is a collection that supports creation using [POST](https://researchobject.github.io/research-object-composer/api/#operation/createResearchObject).

In [153]:
created = requests.post(researchObjects["href"])
created

<Response [201]>

In HTTP, **201 Created** means a new HTTP resource was made. We can find out where from the `Location` header:

In [154]:
ro_uri = created.headers["Location"]
ro_uri

'http://openphacts.cs.man.ac.uk:8080/research_objects/19'

The response also includes a preview of the created Research Object resource, where we'll find the same URI under the `self` link.

In [155]:
created.json()

{'id': 19,
 'content': {'data': [], '_metadata': None},
 'contentSha256': None,
 'profileName': 'data_bundle',
 '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19'},
  'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
  'content': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content'},
  'data': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content/data'},
  '_metadata': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content/_metadata'}}}

Remember before we had the fields `data` and `_metadata`? We now see them under `content`, but they are not yet populated:

In [157]:
created.json()["content"]

{'data': [], '_metadata': None}

We have a corresponding REST resource to populate each, which we find under `_links`. 

In [158]:
links = created.json()["_links"]
links

{'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19'},
 'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
 'content': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content'},
 'data': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content/data'},
 '_metadata': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19/content/_metadata'}}

Let's fill `data` first. We saw in `content` it was a `[]` array, which we also saw as `type: array` in the JSON Schema. RO Composer exposes this as a REST collection we can `POST` to add to. 

In [159]:
data = requests.post(links["data"]["href"], json={})
data

<Response [400]>

Oops, **400 Bad Request**. Perhaps `{}` was not sufficient? Perhaps we should have read that JSON Schema after all...

In [162]:
data.json()


{'pointerToViolation': '#',
 'message': '#: 4 schema violations found',
 'causingExceptions': [{'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [length] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [filename] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [url] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [checksums] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'}],
 'schemaLocation': '/schemas/_base.schema.json#/definiti

We see here that we are missing 4 required fields from the [RemoteItem](https://github.com/ResearchObject/research-object-composer/blob/master/src/main/resources/public/schemas/_base.schema.json#L4) type: `length`, `filename`, `url` and `checksums`. Each item in `data` references a remote file.
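
The nested `causingExceptions` structure can be flattened into a plain list of missing keys. The sketch below assumes the message wording seen in the response above (`required key [...] not found`) stays stable, which is not guaranteed by the API; `missing_keys` is a hypothetical helper and `error` is an abbreviated copy of that response.

```python
import re

# Pattern matching the validator message format observed above (an assumption)
REQUIRED_KEY = re.compile(r"required key \[(.+?)\] not found")


def missing_keys(error: dict) -> list:
    """Collect the names of missing required keys, recursing into causes."""
    found = []
    match = REQUIRED_KEY.search(error.get("message", ""))
    if match:
        found.append(match.group(1))
    for cause in error.get("causingExceptions", []):
        found.extend(missing_keys(cause))
    return found


# Abbreviated copy of the 400 response body shown above
error = {
    "pointerToViolation": "#",
    "message": "#: 4 schema violations found",
    "causingExceptions": [
        {"keyword": "required", "message": "#: required key [length] not found", "causingExceptions": []},
        {"keyword": "required", "message": "#: required key [filename] not found", "causingExceptions": []},
        {"keyword": "required", "message": "#: required key [url] not found", "causingExceptions": []},
        {"keyword": "required", "message": "#: required key [checksums] not found", "causingExceptions": []},
    ],
}

print(missing_keys(error))
```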

For the purpose of this demonstration we'll create a [simple dataset](https://github.com/ResearchObject/ro-lite/tree/master/examples/simple-dataset-0.1.0/data) containing a [TSV file](https://github.com/ResearchObject/ro-lite/blob/master/examples/simple-dataset-0.1.0/data/repository-sizes.tsv) and a [PNG image](https://github.com/ResearchObject/ro-lite/blob/master/examples/simple-dataset-0.1.0/data/repository-sizes-chart.png).

In [163]:
tsv = { "url": "https://raw.githubusercontent.com/ResearchObject/ro-lite/master/examples/simple-dataset-0.1.0/data/repository-sizes.tsv",
        "length": 1982,
        "filename": "repository-sizes.tsv",
        "checksums": [{"type": "sha256", 
                       "checksum": "c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a"}]
      } 
png = { "url": "https://raw.githubusercontent.com/ResearchObject/ro-lite/master/examples/simple-dataset-0.1.0/data/repository-sizes-chart.png",
        "length": 23803,
        "filename": "repository-sizes-chart.png",
        "checksums": [{"type": "sha256", 
                       "checksum": "c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a"}]
      } 
tsv_uploaded = requests.post(links["data"]["href"], json=tsv)
tsv_uploaded.status_code

200

In [164]:
# Add the PNG image as the second item of the data collection
png_uploaded = requests.post(links["data"]["href"], json=png)
png_uploaded.status_code

200

**200 OK** here means we complied with the JSON Schema. You may get an error if you get any of the keys wrong.
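
Rather than typing `length` and `checksum` by hand, they could be computed from the file content. This is a minimal sketch using the standard library's `hashlib`; `remote_item` is a hypothetical helper, and the `example.org` URL and `b"hello"` content are placeholders rather than part of the dataset above.

```python
import hashlib


def remote_item(url: str, filename: str, content: bytes) -> dict:
    """Build a dict with the four keys the RemoteItem schema required above."""
    return {
        "url": url,
        "length": len(content),  # size in bytes
        "filename": filename,
        "checksums": [{
            "type": "sha256",
            "checksum": hashlib.sha256(content).hexdigest(),
        }],
    }


# Placeholder content; in practice this would be the bytes of the remote file,
# for example requests.get(url).content
item = remote_item("https://example.org/hello.txt", "hello.txt", b"hello")
print(item["length"])
print(item["checksums"][0]["checksum"])
```

The resulting dictionary can be passed as `json=item` when POSTing to the `data` resource.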

Now let's reload the RO and see if we have populated `content`.

In [166]:
ro = requests.get(ro_uri)
ro.json()

{'id': 19,
 'content': {'data': [{'url': 'https://raw.githubusercontent.com/ResearchObject/ro-lite/master/examples/simple-dataset-0.1.0/data/repository-sizes.tsv',
    'length': 1982,
    'filename': 'repository-sizes.tsv',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
   {'url': 'https://raw.githubusercontent.com/ResearchObject/ro-lite/master/examples/simple-dataset-0.1.0/data/repository-sizes-chart.png',
    'length': 23803,
    'filename': 'repository-sizes-chart.png',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]}],
  '_metadata': None},
 'contentSha256': None,
 'profileName': 'data_bundle',
 '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/19'},
  'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
  'content': {'href': 'http://openphacts.cs.man.ac.uk:8080/research

Now we can download the Research Object as a [BagIt archive](https://tools.ietf.org/html/rfc8493) (RFC8493) from the `/research_objects/{id}/bag` resource (currently this REST resource is not listed under `_links`).

As this is a binary (ZIP file) we'll use a slightly different Python method to save it to a file.

In [176]:
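import shutil

# Stream the response body straight to disk instead of loading it into memory
with requests.post(ro_uri + "/bag", stream=True) as bag:
    # Fail early on an HTTP error rather than saving an error page as bag.zip
    bag.raise_for_status()
    with open("bag.zip", "wb") as zip_file:
        shutil.copyfileobj(bag.raw, zip_file)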

If you have `unzip` installed we can also check the content.

In [177]:
! unzip -t bag.zip

Archive:  bag.zip
    testing: data_bundle-19/fetch.txt   OK
    testing: data_bundle-19/bag-info.txt   OK
    testing: data_bundle-19/metadata/manifest.json   OK
    testing: data_bundle-19/tagmanifest-md5.txt   OK
    testing: data_bundle-19/manifest-sha256.txt   OK
    testing: data_bundle-19/tagmanifest-sha256.txt   OK
    testing: data_bundle-19/manifest-sha512.txt   OK
    testing: data_bundle-19/data/content.json   OK
    testing: data_bundle-19/tagmanifest-sha512.txt   OK
    testing: data_bundle-19/bagit.txt   OK
    testing: data_bundle-19/manifest-md5.txt   OK
No errors detected in compressed data of bag.zip.


You may notice that the two remote files are **not** present in the ZIP file. Instead they are referenced from `fetch.txt` and `manifest-sha256.txt`. This means that even if large files are added to the Research Object, its downloaded ZIP remains small (until the bag is _completed_ using BagIt tools).
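
In BagIt (RFC 8493), each line of `fetch.txt` lists a URL, a length in octets (or `-` if unknown) and the file path within the bag, separated by whitespace. A minimal parser could look like the sketch below. `parse_fetch` is a hypothetical helper, and `sample_fetch` is an illustrative guess at what the file might contain for this Research Object, not output captured from the service.

```python
def parse_fetch(text: str) -> list:
    """Parse BagIt fetch.txt content into a list of url/length/path dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first two runs of whitespace: URL, length, path
        url, length, path = line.split(None, 2)
        entries.append({
            "url": url,
            "length": None if length == "-" else int(length),
            "path": path,
        })
    return entries


# Illustrative content only; the real file may differ
sample_fetch = (
    "https://raw.githubusercontent.com/ResearchObject/ro-lite/master/"
    "examples/simple-dataset-0.1.0/data/repository-sizes.tsv 1982 data/repository-sizes.tsv\n"
    "https://example.org/unknown-size.bin - data/unknown-size.bin\n"
)

for entry in parse_fetch(sample_fetch):
    print(entry["path"], entry["length"])
```

With the bag downloaded above, the real text could be read using `zipfile.ZipFile("bag.zip").read("data_bundle-19/fetch.txt").decode("utf-8")`.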