# Research Object Composer tutorial

This is a [Jupyter Notebook](https://jupyter.org/) demonstrating how a client can use the [Research Object Composer](https://github.com/researchobject/research-object-composer) REST API.

For requirements to run this notebook interactively, see the [README](https://github.com/ResearchObject/research-object-composer/blob/master/README.md). 

The [RO Composer API](https://researchobject.github.io/research-object-composer/api/) is documented using [Swagger OpenAPI](https://swagger.io/docs/specification/about/) 2.0, which means the REST API can be integrated into many programming languages. This notebook uses [Python](https://www.python.org/) so as not to hide too much of the HTTP detail.

To execute this notebook, select each cell in order, then click the **▶️Run** button above.

## Python requirements

For the examples below we'll use the Python library [requests](https://pypi.org/project/requests/) to show the HTTP interactions. The examples assume a basic knowledge of [REST services](https://en.wikipedia.org/wiki/Representational_state_transfer).

If the below `import` does not work, try on the command line where you started Jupyter Notebook: `pip install requests`

In [147]:
import requests
true,false = (True,False) # for JSON example

RO Composer is meant to be installed on a local infrastructure or as a cloud service. The below uses a **demonstration** service hosted by The University of Manchester which is not supported and *may become unavailable* in the future.

If you are testing the service locally using _Docker Compose_ (see [README](https://github.com/ResearchObject/research-object-composer/blob/master/README.md)) - change below to `http://localhost:8080` or use equivalent server name if you are hosting it as a cloud service.

## Billboard Document

As a starting point we'll retrieve the **billboard resource** - a kind of programmatic homepage that tells us what we can do at the RO Composer.

In [196]:
host = "http://openphacts.cs.man.ac.uk:8080/"
r = requests.get(host)
r.status_code

200

[HTTP status code](https://tools.ietf.org/html/rfc7231#section-6.3.1) `200` means **OK**, so let's check the _content type_ of the result:

In [197]:
r.headers["Content-Type"]

'application/hal+json;charset=UTF-8'

The API results from RO Composer are regular [JSON](https://tools.ietf.org/html/rfc7159) that also follow the _Hypertext Application Language_ ([HAL](http://stateless.co/hal_specification.html)) patterns for RESTful services. Let's look at the content:

In [198]:
index = r.json()
index

{'_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/'},
  'profiles': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles'},
  'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects'}}}

In HAL, the `_links` section contains links to related REST resources; in this case `self` refers back to the HTTP resource we just requested. Hyperlinks are given with `href`, almost like in HTML.

This way of navigating REST resources means we do not have to commit to fixed URI patterns. We will see later why this style of [Hypermedia as the Engine of Application State](https://restfulapi.net/hateoas/) is an important aspect of RO Composer.
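Following a link by its relation name can be wrapped in a small helper. The sketch below is illustrative only: `follow` is a hypothetical name rather than part of the RO Composer API, and `index` is a copy of the billboard document shown above so that the example runs without network access.

```python
# Copy of the billboard document returned above (no network access needed).
index = {"_links": {
    "self": {"href": "http://openphacts.cs.man.ac.uk:8080/"},
    "profiles": {"href": "http://openphacts.cs.man.ac.uk:8080/profiles"},
    "researchObjects": {"href": "http://openphacts.cs.man.ac.uk:8080/research_objects"}}}

def follow(doc, rel):
    """Return the href of link relation `rel` in a HAL document (hypothetical helper)."""
    try:
        return doc["_links"][rel]["href"]
    except KeyError:
        raise KeyError("HAL document has no link relation %r" % rel) from None

profiles_url = follow(index, "profiles")
print(profiles_url)
```

The helper only handles relations that hold a single link object; a relation that holds a list of links would need separate handling.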

The other two links are `profiles` and `researchObjects`. Let's explore:

## List known Research Objects

In [248]:
links = index["_links"]
ro = links["researchObjects"]["href"]
researchObjects = requests.get(ro).json()
researchObjects

{'_embedded': {'researchObjectSummaryList': [{'id': 27,
    'profileName': 'data_bundle',
    'depositionUrl': None,
    'createdAt': '2019-06-25T12:55:17.829+0000',
    'modifiedAt': '2019-06-25T13:00:01.996+0000',
    'depositedAt': None,
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/27'},
     'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'}}},
   {'id': 28,
    'profileName': 'data_bundle',
    'depositionUrl': None,
    'createdAt': '2019-06-25T23:30:21.750+0000',
    'modifiedAt': None,
    'depositedAt': None,
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/28'},
     'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'}}},
   {'id': 29,
    'profileName': 'data_bundle',
    'depositionUrl': None,
    'createdAt': '2019-06-25T23:30:44.621+0000',
    'modifiedAt': None,
    'depositedAt': None,
    '_links': {'self': {'href': 'http://openphacts.cs

The top-level `/research_objects` resource gives us a *paged* listing of [Research Objects](http://www.researchobject.org/) known to the RO Composer. 

(**Tip**: If the `researchObjectSummaryList` listing is blank, come back to this section later after we have [constructed an RO](#Profiles).)

### Paging

Here under `_links` we find `self` with the canonical URL for the *paged result* (including `?page=1&size=10`). If there are many research objects there will also be links like `next`, `prev`, `first` and `last` with the expected semantics.
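A client that wants every research object, not just the first page, can keep following `next` until it is absent. This is a minimal sketch under assumptions: `iter_summaries` and the `fetch` callable are hypothetical names, and the two pages are made-up stand-ins for server responses so the loop can be run offline.

```python
def iter_summaries(first_page_url, fetch):
    """Yield every research object summary, following HAL `next` links.

    `fetch` is any callable mapping a URL to the parsed JSON page,
    for example: lambda url: requests.get(url).json()
    """
    url = first_page_url
    while url:
        page = fetch(url)
        embedded = page.get("_embedded", {})
        yield from embedded.get("researchObjectSummaryList", [])
        # Stop when the page has no `next` link.
        url = page.get("_links", {}).get("next", {}).get("href")

# Two made-up pages standing in for the server (hypothetical data and URLs).
pages = {
    "http://example.org/research_objects?page=1&size=2": {
        "_embedded": {"researchObjectSummaryList": [{"id": 27}, {"id": 28}]},
        "_links": {"next": {"href": "http://example.org/research_objects?page=2&size=2"}}},
    "http://example.org/research_objects?page=2&size=2": {
        "_embedded": {"researchObjectSummaryList": [{"id": 29}]},
        "_links": {}},
}
first = "http://example.org/research_objects?page=1&size=2"
ids = [ro["id"] for ro in iter_summaries(first, pages.__getitem__)]
print(ids)
```

Passing `fetch` in as a parameter keeps the paging logic independent of the HTTP library.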

### Purpose of RO Composer

Note that the RO Composer is **not** a registry of research objects, but it can list research objects _currently under construction_. 

The intended purpose of the composer is to be a temporary construction site that can be completed by multiple services (e.g. a data management system, a workflow system, a user interface). These clients will be jointly building a Research Object that can then be _validated_ according to a pre-defined schema, before the RO is _downloaded_ or _deposited_ into an archive like [Zenodo](http://zenodo.org/) or [Mendeley Data](https://data.mendeley.com/).


### Embedded previews

In the above listing we saw only a subset of the research object properties, here embedded as a preview under the `_embedded` section.

For each item in the `researchObjectSummaryList` we can find nested `_links` for the research object `self` or the `profile` it is intended to comply with. Let's just pick any RO to have a look:

In [258]:
ro = researchObjects["_embedded"]["researchObjectSummaryList"].pop()
links = ro["_links"]
links

{'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/27'},
 'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'}}

(_if this does not work, come back to this section later after [creating an RO](#Profiles)_)
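Instead of picking an arbitrary entry, a client might prefer the most recently created RO. The sketch below assumes the `createdAt` timestamp format seen in the listing above; `summaries` is a trimmed copy of that listing and `created_at` is a hypothetical helper.

```python
from datetime import datetime

# Trimmed copies of the summaries listed above.
summaries = [
    {"id": 27, "createdAt": "2019-06-25T12:55:17.829+0000"},
    {"id": 28, "createdAt": "2019-06-25T23:30:21.750+0000"},
    {"id": 29, "createdAt": "2019-06-25T23:30:44.621+0000"},
]

def created_at(summary):
    """Parse the `createdAt` timestamp format seen in the listing."""
    return datetime.strptime(summary["createdAt"], "%Y-%m-%dT%H:%M:%S.%f%z")

newest = max(summaries, key=created_at)
print(newest["id"])
```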

### Viewing a Research Object

We'll have a brief look at an existing research object.

In [259]:
full_ro = requests.get(ro["_links"]["self"]["href"]).json()
full_ro

{'id': 27,
 'content': {'data': [{'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv',
    'length': 1982,
    'filename': 'repository-sizes.tsv',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
   {'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png',
    'length': 23803,
    'filename': 'repository-sizes-chart.png',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]}],
  '_metadata': {'title': 'A good start',
   'creators': [{'name': 'Alice W Land',
     'orcid': 'https://orcid.org/0000-0002-1825-0097'}],
   'description': 'A test dataset of not much interest',
   'access_right': 'open'}},
 'createdAt': '2019-06-25T12:55:17.829+0000',
 'modifiedAt': '2019-06-25T13:00:01.996+0000',
 'depositedAt': 

Phew! We see quite a few additional fields this time.

There are two parts to an RO while it is in the composer. Most of the fields above relate to its construction phase (e.g. `createdAt`, `profileName`, `_links`) and are filled in automatically.

The elements that are important for deposition are under `content` - these are the fields we will need to fill in using the RO Composer API.

In [261]:
full_ro["content"]

{'data': [{'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv',
   'length': 1982,
   'filename': 'repository-sizes.tsv',
   'checksums': [{'type': 'sha256',
     'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
  {'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png',
   'length': 23803,
   'filename': 'repository-sizes-chart.png',
   'checksums': [{'type': 'sha256',
     'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]}],
 '_metadata': {'title': 'A good start',
  'creators': [{'name': 'Alice W Land',
    'orcid': 'https://orcid.org/0000-0002-1825-0097'}],
  'description': 'A test dataset of not much interest',
  'access_right': 'open'}}

Note that if you access the RO Composer using a browser (rather than, say, `curl`) you will get a debug web interface that uses the same JSON API. See for instance http://openphacts.cs.man.ac.uk:8080/ for an equivalent listing of ROs.

Next we'll see how we can **create** such a research object. But before we can get that far, we need to know what should go into `content`.

## Profiles

Research Objects can be used for different purposes depending on domain- and application-specific expectations.

[Profiles](http://www.researchobject.org/scopes/) help define the shape and form of a class of research objects, for instance a _dataset research object_ can contain a couple of vaguely related data files, while a _workflow-centric research object_ may keep more structured workflow definitions, example inputs, execution provenance, etc.

Loosely, a profile defines an expectation of what **kind of resources** should be expected, and what **metadata** is required. In a way, a profile defines the general **purpose** of that type of Research Objects, and documents what assumptions a consumer can rely on when processing such Research Objects beyond "some files".

The RO Composer supports creating Research Objects for multiple **profiles**. Each profile is [defined internally](https://github.com/ResearchObject/research-object-composer/tree/master/src/main/resources/public/schemas) using [JSON Schema](https://json-schema.org/), to create different kinds of Research Objects.

As a kind of simplification for how research objects are eventually serialized, the choice of _profile_ effectively determines which objects must appear under `content` when constructing that type of RO.

We can query the `/profiles` service to see which profiles are installed:


In [262]:
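# Follow the `profiles` link from the billboard document rather than
# concatenating strings (host already ends with "/", which would give "//profiles")
p = requests.get(index["_links"]["profiles"]["href"]).json()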
p

{'_embedded': {'researchObjectProfileList': [{'name': 'data_bundle',
    'fields': ['data', '_metadata'],
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
     'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'},
     'template': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/template'},
     'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}}},
   {'name': 'draft_task',
    'fields': ['input', 'workflow', 'workflow_params', '_metadata'],
    '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task'},
     'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/draft_task.schema.json'},
     'template': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task/template'},
     'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task/research_objects'}}}]},
 '_li

### Profile links

Similar to the listing of research objects we get an `_embedded` section of the `researchObjectProfileList`, as each profile itself is a separate REST resource with multiple related resources.

Let's look at their `name` fields:

In [263]:
profiles = p["_embedded"]['researchObjectProfileList']
[profile["name"] for profile in profiles]

['data_bundle', 'draft_task']

In this installation, the profile `data_bundle` is for Research Objects containing arbitrary datasets, while `draft_task` is for more specific ROs describing workflow executions. 

The intention is that if a client knows a particular profile in advance, it can build a richer interface or provide additional information from underlying data sources, while still dealing with the remaining profiles in a more generic way.


We'll look at the `data_bundle` in detail. We see that the `content` we can create with this profile only expects the fields `data` and `_metadata`:

In [264]:
bundle_profile = next(p for p in profiles if p["name"]=="data_bundle")
bundle_profile["fields"]

['data', '_metadata']

For this particular profile, `_links` refers to several resources: `self`, `schema`, `template` and `researchObjects`.

In [265]:
links = bundle_profile["_links"]
links

{'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
 'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'},
 'template': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/template'},
 'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}}

### JSON Schema

If we want we can request the underlying [JSON Schema](https://json-schema.org/) to see details of these fields at `/schemas/{name}` - linked to from `schema` above.

In [266]:
schema = links["schema"]
schema

{'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/data_bundle.schema.json'}

In [267]:
schema_response = requests.get(schema["href"])
schema_response.headers["Content-Type"]

'application/json'

Note that we are no longer navigating HAL resources at the RO Composer; this is the native representation of the JSON Schema, which of course is also defined in JSON, but using different kinds of keys.

In [268]:
schema_response.json()

{'$schema': 'http://json-schema.org/draft-07/schema',
 'type': 'object',
 '$baggable': {'data': '/'},
 'properties': {'_metadata': {'$ref': '/schemas/_base.schema.json#/definitions/Metadata'},
  'data': {'type': 'array',
   'items': {'$ref': '/schemas/_base.schema.json#/definitions/RemoteItem'}}},
 'required': ['data']}

This schema will be used when _validating_ the Research Object `content` created under this profile. You may recognize that `properties` here lists the fields `_metadata` and `data`, with further type definitions given by reference.

This introduction does not go into detail on JSON Schema. To a large degree the role of the RO Composer is to hide these implementation details and present a simplified picture by exposing an individual REST resource for each field.
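Even without a full JSON Schema validator, a client can read the top level of the schema to see which fields are mandatory. The sketch below is only that: `summarise` is a hypothetical helper, `schema_doc` is a copy of the schema shown above, and `$ref` targets are deliberately not resolved.

```python
# Copy of the data_bundle schema shown above.
schema_doc = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "$baggable": {"data": "/"},
    "properties": {
        "_metadata": {"$ref": "/schemas/_base.schema.json#/definitions/Metadata"},
        "data": {"type": "array",
                 "items": {"$ref": "/schemas/_base.schema.json#/definitions/RemoteItem"}}},
    "required": ["data"]}

def summarise(schema_doc):
    """Map each top-level property to 'required' or 'optional'."""
    required = set(schema_doc.get("required", []))
    return {name: "required" if name in required else "optional"
            for name in schema_doc.get("properties", {})}

summary = summarise(schema_doc)
print(summary)
```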

### Future profile work 

The JSON Schemas are discovered during deployment and cannot be modified through the RO Composer APIs or without restarting the service.

_Versioning_ and careful updating of schemas will be important for longer-term use, as older Research Objects may no longer be valid (e.g. adding a new required field).

Documentation fields could be lifted from the JSON Schema definitions to be embedded in individual profile `field` listing and error messages.

## Creating a Research Object

The REST resource that collects [Research Objects for the given profile](https://researchobject.github.io/research-object-composer/api/#operation/listResearchObjectsForProfile) is at `/profiles/{name}/research_objects` and is linked to from the `researchObjects` link:

In [269]:
researchObjects = links["researchObjects"]
researchObjects

{'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle/research_objects'}

As this resource is a _collection_ it supports RO creation using [POST](https://researchobject.github.io/research-object-composer/api/#operation/createResearchObject) to start building a research object of that type.

In [270]:
created = requests.post(researchObjects["href"])
created

<Response [201]>

In HTTP, **201 Created** means a new HTTP resource was made. We can find out _where_ from the `Location` header:

In [271]:
ro_uri = created.headers["Location"]
ro_uri

'http://openphacts.cs.man.ac.uk:8080/research_objects/31'

(If you previously got an empty listing of `/research_objects` now is a good time to check back on the [previous section](#List-known-Research-Objects) - this RO is definitely under construction as we have not provided any data yet)
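The status check and the `Location` lookup can be combined so that a failed creation is noticed immediately. This is a sketch under assumptions: `created_location` is a hypothetical helper, and `fake` stands in for a `requests.Response` using the values shown above.

```python
from types import SimpleNamespace

def created_location(response):
    """Return the URI of a newly created resource, or raise if creation failed."""
    if response.status_code != 201:
        raise RuntimeError("Expected 201 Created, got %s" % response.status_code)
    return response.headers["Location"]

# Stand-in for a requests.Response, using the values shown above.
fake = SimpleNamespace(
    status_code=201,
    headers={"Location": "http://openphacts.cs.man.ac.uk:8080/research_objects/31"})
location = created_location(fake)
print(location)
```

With a live server, the call would be `created_location(requests.post(researchObjects["href"]))`.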

### Completing the Research Object

The response also includes a preview of the created Research Object resource (we don't need to `GET` it), where we'll find the same URI under the `self` link.

In [272]:
created.json()

{'id': 31,
 'content': {'data': [],
  '_metadata': {'title': None, 'description': None, 'creators': []}},
 'createdAt': '2019-06-26T00:29:47.509+0000',
 'modifiedAt': None,
 'depositedAt': None,
 'mutable': True,
 'profileName': 'data_bundle',
 'checksum': None,
 '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31'},
  'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
  'content': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content'},
  'fields': [{'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/data',
    'name': 'data'},
   {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/_metadata',
    'name': '_metadata'}]}}

Remember before we heard about the fields `data` and `_metadata`? We now see them under `content`, however they are partially populated:

In [273]:
created.json()["content"]

{'data': [], '_metadata': {'title': None, 'description': None, 'creators': []}}

### Research Object links

We have a corresponding REST resource to populate each field, which we find under `_links`. 

In [274]:
links = created.json()["_links"]
links

{'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31'},
 'profile': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/data_bundle'},
 'content': {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content'},
 'fields': [{'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/data',
   'name': 'data'},
  {'href': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/_metadata',
   'name': '_metadata'}]}

While each field might have additional properties (e.g. documentation), we are here only interested in their `name` and the corresponding `href` HTTP resource to fill the field. This Python code converts this to a `name`->`href` dictionary:

In [275]:
fields = dict((f["name"], f["href"] ) for f in links["fields"])
fields

{'data': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/data',
 '_metadata': 'http://openphacts.cs.man.ac.uk:8080/research_objects/31/content/_metadata'}

### Adding data to a Research Object

Let's fill `data` first. We saw in `content` that it was an array `[]`. RO Composer exposes this as a REST _collection_ to which we can `POST` to add items.

In [276]:
data = requests.post(fields["data"], json={})
data

<Response [400]>

Oops, **400 Bad Request**. Perhaps `{}` was a bit minimal? Maybe we should have read that JSON Schema after all... Let's look at the returned **schema violations**:

In [277]:
data.json()


{'pointerToViolation': '#',
 'message': '#: 4 schema violations found',
 'causingExceptions': [{'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [length] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [filename] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [url] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'},
  {'keyword': 'required',
   'pointerToViolation': '#',
   'message': '#: required key [checksums] not found',
   'causingExceptions': [],
   'schemaLocation': '/schemas/_base.schema.json#/definitions/RemoteItem'}],
 'schemaLocation': '/schemas/_base.schema.json#/definiti

This detailed error message is coming straight from the JSON Schema validator. The RO Composer will not allow us to change the resource into an invalid state. 

Probably the most useful information is under `message`:

In [278]:
[ex["message"] for ex in data.json()["causingExceptions"]]

['#: required key [length] not found',
 '#: required key [filename] not found',
 '#: required key [url] not found',
 '#: required key [checksums] not found']

We see here that we are missing 4 fields from the [RemoteItem](https://github.com/ResearchObject/research-object-composer/blob/master/src/main/resources/public/schemas/_base.schema.json#L4) type, `length`, `filename`, `url` and `checksums`; these properties are used in `data` to reference remote files.
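A user interface could turn such a report into a list of fields still to be filled in. The sketch below walks the nested `causingExceptions` and pulls key names out of the message text; `leaf_messages` and `missing_keys` are hypothetical helpers, the message wording is assumed to match what is shown above, and `report` is an abbreviated copy of that response.

```python
import re

def leaf_messages(violation):
    """Collect messages from the innermost violations of a (possibly nested) report."""
    causes = violation.get("causingExceptions", [])
    if not causes:
        return [violation["message"]]
    messages = []
    for cause in causes:
        messages.extend(leaf_messages(cause))
    return messages

def missing_keys(violation):
    """Extract key names from 'required key [name] not found' messages."""
    pattern = re.compile(r"required key \[(.+?)\] not found")
    found = (pattern.search(msg) for msg in leaf_messages(violation))
    return [m.group(1) for m in found if m]

# Abbreviated copy of the report shown above.
report = {"message": "#: 4 schema violations found",
          "causingExceptions": [
              {"message": "#: required key [length] not found", "causingExceptions": []},
              {"message": "#: required key [filename] not found", "causingExceptions": []},
              {"message": "#: required key [url] not found", "causingExceptions": []},
              {"message": "#: required key [checksums] not found", "causingExceptions": []}]}
print(missing_keys(report))
```

The same function also copes with a report that carries a single violation and an empty `causingExceptions` list.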

### Adding remote resources

For the purpose of this demonstration we'll create a [simple dataset](https://github.com/ResearchObject/ro-lite/tree/master/examples/simple-dataset-0.1.0/data) containing a [TSV file](https://github.com/ResearchObject/ro-lite/blob/master/examples/simple-dataset-0.1.0/data/repository-sizes.tsv) and a [PNG image](https://github.com/ResearchObject/ro-lite/blob/master/examples/simple-dataset-0.1.0/data/repository-sizes-chart.png). 

For simplicity the resources are hosted by GitHub and we have already calculated the `length` and `checksums`.

In [306]:
tsv = { "url": "https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv",
        "length": 1982,
        "filename": "repository-sizes.tsv",
        "checksums": [{"type": "sha256", 
                       "checksum": "c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a"}]
      } 
png = { "url": "https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png",
        "length": 23803,
        "filename": "repository-sizes-chart.png",
        "checksums": [{"type": "sha256", 
                       "checksum": "e8bf79ca6fbe83aa0c34ec12705e34d70c348d53e0795504210e13982725300c"}]
      } 
tsv_uploaded = requests.post(fields["data"], json=tsv)
tsv_uploaded.status_code

200

In [307]:
png_uploaded = requests.post(fields["data"], json=png)
png_uploaded.status_code

200

**200 OK** here means we complied with the JSON Schema. You may get an error if you get any of the keys wrong, or if the checksum value has the wrong length for its type (supported checksum types: `md5`, `sha1`, `sha256`, `sha512`).
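If the `length` and checksum are not known in advance, they can be computed from the file content with Python's standard `hashlib`. This is a minimal sketch: `remote_item` is a hypothetical helper, and the URL and payload are made up for illustration.

```python
import hashlib

# Hex digest lengths for the checksum types listed above.
HEX_LENGTH = {"md5": 32, "sha1": 40, "sha256": 64, "sha512": 128}

def remote_item(url, filename, payload, algorithm="sha256"):
    """Build a RemoteItem-style dict for `payload` (the file's bytes)."""
    if algorithm not in HEX_LENGTH:
        raise ValueError("Unsupported checksum type: %s" % algorithm)
    digest = hashlib.new(algorithm, payload).hexdigest()
    return {"url": url,
            "length": len(payload),
            "filename": filename,
            "checksums": [{"type": algorithm, "checksum": digest}]}

# Hypothetical URL and content, for illustration only.
item = remote_item("http://example.org/hello.txt", "hello.txt", b"abc")
print(item["length"], item["checksums"][0]["checksum"])
```

For large files it would be better to read and hash in chunks rather than hold the whole payload in memory.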

### Future RO creation work

In this implementation, the RO Composer assumes the client already has stable http/https URLs for any files that are to be included in the RO. For instance, the [Seven Bridges Platform](https://www.sevenbridges.com/platform/) has an underlying data store used during workflow execution, where each file has a corresponding URL and a pre-calculated checksum.

As these are web references, such URLs could change or disappear, outside the control of the RO Composer or the repository. While RO Composer can record the checksum at the time of recording, for long-term storage such files could be archived in an immutable data store, potentially also minting a [MinID](http://minid.bd2k.org/) to use as an indirect reference.


A firewalled, desktop or web client would not be able to expose files on the web, and would need a different mechanism for including local files. It was decided that the RO Composer itself should not hold such (potentially large or incriminating) files, but that it could be adapted to facilitate a pass-through upload that stages them directly at the destination repository (both Mendeley Data and Zenodo support such a "draft" status).


### Inspecting the Research Object

Now let's reload the RO and see that we have populated `content` with the two items.

In [320]:
ro = requests.get(ro_uri)
ro.json()

{'id': 31,
 'content': {'data': [{'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv',
    'length': 1982,
    'filename': 'repository-sizes.tsv',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
   {'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png',
    'length': 23803,
    'filename': 'repository-sizes-chart.png',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
   {'url': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv',
    'length': 1982,
    'filename': 'repository-sizes.tsv',
    'checksums': [{'type': 'sha256',
      'checksum': 'c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a'}]},
   {'url': 'https://raw.

### Filling in metadata


Next we'll add sufficient *metadata* so that the RO can later be published and assigned a DOI. Let's look at what fields we already have:

In [321]:
metadata = requests.get(fields["_metadata"]).json()
metadata

{'title': 'A good start',
 'creators': [{'name': 'Alice W Land',
   'orcid': 'https://orcid.org/0000-0002-1825-0097'}],
 'description': 'A test dataset of not much interest',
 'access_right': 'open'}

The RO Composer has filled in some default values based on the schema. A `title` is a good start. As this is a single resource we can use `PUT` to replace its value.

In [322]:
metadata["title"] = "A good start"
updated = requests.put(fields["_metadata"], json=metadata)
updated.json()

{'title': 'A good start',
 'creators': [{'name': 'Alice W Land',
   'orcid': 'https://orcid.org/0000-0002-1825-0097'}],
 'description': 'A test dataset of not much interest',
 'access_right': 'open'}

Again the RO Composer will not allow us to push the RO into an invalid state; we see we also need `description` and at least one element of the `creators` array.

These attributes correspond to fields in the [DataCite schema](https://schema.datacite.org/), but in JSON instead of XML. From the name and error message we may guess that `creators` is an array, but we don't know its fields yet.

Let's fill in our `description` and for now try with `{}` as the creator:

In [323]:
metadata["description"] =  "A test dataset of not much interest"
metadata["creators"] = [{}]
updated = requests.put(fields["_metadata"], json=metadata)
updated.json()

{'keyword': 'required',
 'pointerToViolation': '#/creators/0',
 'message': '#/creators/0: required key [name] not found',
 'causingExceptions': [],
 'schemaLocation': '/schemas/data_bundle.schema.json#/definitions/Author'}

In a sense this highlights one of the advantages of this staged approach to filling the Research Object: errors are triggered as each item is set, rather than being buried deep inside a large schema validation report at a later stage.

Under the special key `_metadata`, any fields from [DataCite schema](https://schema.datacite.org/) can be added, so we can for instance include `affiliation` or `orcid` - although not required we recommend always including [ORCID](http://orcid.org/) to uniquely identify the creator.

In [324]:
updated = requests.put(fields["_metadata"], json=
                        { "title": "A good start", 
                          "description": "A test dataset of not much interest",
                          "access_right": "open",
                          "creators": [{"name": "Alice W Land", 
                                        "orcid": "https://orcid.org/0000-0002-1825-0097"}] })
updated

<Response [200]>

**Tip:** You may try to deliberately break the `orcid` value above to show that schema validation can also be done on optional fields. 

As a RESTful resource it is safe to use `PUT` multiple times in case we change our mind, e.g. if the user was editing equivalent forms in the UI. [Other metadata fields](https://developers.zenodo.org/#representation) from Zenodo can also be added to `_metadata` to be passed on to the archive at deposit time, e.g. `keywords` or `related_identifiers`.
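A client can also catch a mistyped ORCID before sending the `PUT`. The sketch below checks the URI shape and the final check character using the ISO 7064 MOD 11-2 algorithm that ORCID identifiers use. It is a client-side convenience under assumptions: `orcid_is_well_formed` is a hypothetical helper, and it says nothing about what the RO Composer schema itself checks.

```python
import re

def orcid_is_well_formed(orcid_uri):
    """Check the shape and ISO 7064 MOD 11-2 check character of an ORCID URI."""
    match = re.fullmatch(
        r"https?://orcid\.org/(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])", orcid_uri)
    if not match:
        return False
    digits = "".join(match.groups())
    total = 0
    for ch in digits[:-1]:          # first 15 digits
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return digits[-1] == ("X" if check == 10 else str(check))

print(orcid_is_well_formed("https://orcid.org/0000-0002-1825-0097"))  # True
print(orcid_is_well_formed("https://orcid.org/0000-0002-1825-0098"))  # False
```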

### Download Research Object

Now we can **download** the Research Object as a [BagIt archive](https://tools.ietf.org/html/rfc8493) (RFC8493) from the `/research_objects/{id}/bag` resource (currently this REST resource is not listed under `_links`).

As this is a binary (ZIP file) we'll use a slightly different Python method to save it to a file.

In [325]:
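import shutil
with requests.post(ro_uri + "/bag", stream=True) as bag:
    bag.raise_for_status()  # fail early instead of saving an error message as a ZIP
    # named bag_file so it does not clash with the zipfile module imported below
    with open("bag.zip", "wb") as bag_file:
        shutil.copyfileobj(bag.raw, bag_file)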

Note that RO Composer will not let you create the BagIt archive if its `content` is not valid, which is why we call `bag.raise_for_status()` to fail on any `400` errors (we don't want the error message written to the ZIP file).

Let's check the content of the downloaded zip file.

In [326]:
import zipfile
zip = zipfile.ZipFile('bag.zip')
files = zip.namelist()
files.sort() # list in alphabetical order
files

['bag-info.txt',
 'bagit.txt',
 'data/content.json',
 'fetch.txt',
 'manifest-md5.txt',
 'manifest-sha256.txt',
 'manifest-sha512.txt',
 'metadata/manifest.json',
 'tagmanifest-md5.txt',
 'tagmanifest-sha256.txt',
 'tagmanifest-sha512.txt']

These paths follow the [BagIt structure](https://tools.ietf.org/html/rfc8493#section-2) where the [Research Object manifest](https://github.com/ResearchObject/bagit-ro) is under `metadata/manifest.json` and the _payload_ is under `data`.
 
You may notice that the two remote files we added are **not** present in the zip file; they are only referenced from `fetch.txt` and `manifest-sha256.txt`. This means that even if large files are added to the Research Object, its download ZIP remains small (until the bag is _completed_ using BagIt tools like [BDBag](http://bd2k.ini.usc.edu/tools/bdbag/)).


In [327]:
with zip.open("fetch.txt") as f:
    fetch = f.read()
print(str(fetch, "utf-8"))

https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv 1982 data/repository-sizes.tsv
https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png 23803 data/repository-sizes-chart.png
https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv 1982 data/repository-sizes.tsv
https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv 1982 data/repository-sizes.tsv
https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png 23803 data/repository-sizes-chart.png
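Each `fetch.txt` line holds a URL, a length in bytes (or `-` when unknown) and the path inside the bag, separated by whitespace. The sketch below parses such lines and drops repeated entries; `parse_fetch` is a hypothetical helper and the sample uses shortened, made-up URLs rather than the GitHub URLs shown above.

```python
def parse_fetch(text):
    """Parse BagIt fetch.txt lines into unique (url, length, path) tuples, keeping order."""
    seen, entries = set(), []
    for line in text.splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(None, 2)
        entry = (url, None if length == "-" else int(length), path)
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries

# Shortened, hypothetical URLs; the real file uses the GitHub URLs shown above.
sample = """\
http://example.org/repository-sizes.tsv 1982 data/repository-sizes.tsv
http://example.org/repository-sizes-chart.png 23803 data/repository-sizes-chart.png
http://example.org/repository-sizes.tsv 1982 data/repository-sizes.tsv
"""
entries = parse_fetch(sample)
print(entries)
```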



Similarly, the checksums have been propagated to the BagIt manifest, so that the integrity of the completed bag archive can be _validated_.

In [328]:
with zip.open("manifest-sha256.txt") as f:
    manifest = f.read()
print(str(manifest, "utf-8"))

9e13c9dc4fe38d44b8ec518154cbfda05e7ca12728df5babe5e7054d841d47d6  data/content.json
e8bf79ca6fbe83aa0c34ec12705e34d70c348d53e0795504210e13982725300c  data/repository-sizes-chart.png
c2160e931a6ddb8cddb451190816196fc667c5f25020a89a356a69e75ec8dc0a  data/repository-sizes.tsv
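Once the payload files have been fetched, each can be checked against the manifest. This is only a sketch of the idea, not a replacement for a BagIt tool: `parse_manifest` and `mismatches` are hypothetical helpers, and the one-file manifest is made up (its checksum is the SHA-256 of `b"abc"`).

```python
import hashlib

def parse_manifest(text):
    """Map each payload path to its expected checksum."""
    expected = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.split(None, 1)
            expected[path.strip()] = checksum.lower()
    return expected

def mismatches(manifest_text, payloads):
    """Return paths whose SHA-256 differs from the manifest, or which are missing."""
    bad = []
    for path, checksum in parse_manifest(manifest_text).items():
        data = payloads.get(path)
        if data is None or hashlib.sha256(data).hexdigest() != checksum:
            bad.append(path)
    return bad

# Hypothetical one-file bag: the checksum is the SHA-256 of b"abc".
manifest_text = ("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
                 "  data/abc.txt\n")
print(mismatches(manifest_text, {"data/abc.txt": b"abc"}))
print(mismatches(manifest_text, {"data/abc.txt": b"abd"}))
```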



### Research Object manifest

Let's have a look at `metadata/manifest.json` - the manifest of the Research Object.

In [329]:
import json
with zip.open("metadata/manifest.json") as f:
    manifest = json.load(f)
manifest

{'@context': ['https://w3id.org/bundle/context'],
 'id': '../',
 'manifest': ['manifest.json'],
 'createdOn': '2019-06-26T01:10:52.009Z',
 'aggregates': [{'uri': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes.tsv',
   'bundledAs': {'uri': 'urn:uuid:77947c40-fb2f-4c44-89cf-eead6d61e90e',
    'folder': '../data/',
    'filename': 'repository-sizes.tsv'}},
  {'uri': 'https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/repository-sizes-chart.png',
   'bundledAs': {'uri': 'urn:uuid:8654d7f3-1068-4d0d-911e-efbde6a2da58',
    'folder': '../data/',
    'filename': 'repository-sizes-chart.png'}}]}

In the [Research Object manifest](https://github.com/ResearchObject/bagit-ro) we see the two external files have been aggregated and given local paths and identifiers. Future work will explore propagating additional metadata about individual files (e.g. _creator_) into the RO manifest.
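
To see how the `aggregates` entries relate remote URIs to local paths, here is a minimal sketch that resolves each `bundledAs` entry to a path inside the bag. It assumes that `folder` is relative to the `metadata/` directory holding `manifest.json`; `local_paths` is a hypothetical helper, and the sample dictionary is a trimmed copy of the output above.

```python
import posixpath

MANIFEST_DIR = "metadata"  # assumption: manifest.json lives in metadata/

def local_paths(manifest):
    """Map each aggregated remote URI to its path inside the bag."""
    mapping = {}
    for entry in manifest.get("aggregates", []):
        bundled = entry.get("bundledAs")
        if not bundled:
            continue  # aggregated resources need not have a local location
        relative = posixpath.join(MANIFEST_DIR, bundled["folder"], bundled["filename"])
        mapping[entry["uri"]] = posixpath.normpath(relative)
    return mapping

base = "https://raw.githubusercontent.com/ResearchObject/research-object-composer/master/examples/"
sample_manifest = {
    "aggregates": [
        {"uri": base + "repository-sizes.tsv",
         "bundledAs": {"folder": "../data/", "filename": "repository-sizes.tsv"}},
        {"uri": base + "repository-sizes-chart.png",
         "bundledAs": {"folder": "../data/", "filename": "repository-sizes-chart.png"}},
    ]
}
paths = local_paths(sample_manifest)
for uri, path in paths.items():
    print(path, "<-", uri)
```

The resulting paths match the third column of `fetch.txt`, which is what ties the Research Object manifest to the BagIt structure.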

### Future metadata work

The RO Composer will currently propagate some of the `content` and `_metadata` properties to higher-level constructs in the archived Research Object. Future work could include:

* Generating `datacite.xml` based on `_metadata` (as proposed by [RDA data packaging](https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWyP8QggkE/edit))
* Aligning RO Composer with the community effort [RO-Crate](https://researchobject.github.io/ro-crate/)
  * Mapping `_metadata` to http://schema.org/ terms to include in `manifest.json`
  * Formalizing RO-Crate requirements in RO Composer schemas

## Publishing the Research Object

Now we can **publish** the Research Object to [Zenodo](https://zenodo.org/) to assign a DOI. (**Note**: the demo server is configured to use [https://sandbox.zenodo.org/](https://sandbox.zenodo.org/), which does not actually issue DOIs.)

To deposit the RO into the archive we `POST` to the `/research_object/{id}/deposit/zenodo` resource.

In [350]:
published = requests.post(ro_uri + "/deposit/zenodo")
published

<Response [200]>

Any problem in `metadata` above that the [Zenodo API](https://developers.zenodo.org/#quickstart-upload) rejected would have resulted in an error response, but we got **200 OK**, so let's have a look.

In [351]:
print(published.text)

https://sandbox.zenodo.org/api/records/313548
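
In a script, rather than inspecting the response by eye, the deposit step could be wrapped so that an error status raises an exception. This is a minimal sketch, not part of RO Composer: `deposit_to_zenodo` is a hypothetical helper, and `post` is assumed to be `requests.post` or any callable that returns an object with `raise_for_status()` and `text`, as `requests` responses do.

```python
def deposit_to_zenodo(ro_uri, post):
    """POST to the RO's Zenodo deposit resource and return the record URI.

    `post` is a callable with the calling convention of requests.post.
    """
    response = post(ro_uri.rstrip("/") + "/deposit/zenodo")
    # With requests, this raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # The response body is the URI of the deposited record
    return response.text.strip()

# Usage in this notebook (requires the demo server to be reachable):
#   record_uri = deposit_to_zenodo(ro_uri, requests.post)
```

Passing `post` as a parameter keeps the sketch independent of the network, so the same function can be exercised with a stand-in during testing.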


### Inspecting the deposited RO

In a browser that is logged in to [https://sandbox.zenodo.org/](https://sandbox.zenodo.org/) you can access the above API resource, and will find something similar to:

In [99]:
{"conceptdoi":"10.5072/zenodo.275054","conceptrecid":"275054","created":"2019-04-23T19:07:43.524952+00:00","doi":"10.5072/zenodo.275055","doi_url":"https://doi.org/10.5072/zenodo.275055","files":[{"checksum":"1f555cdc3d5e5d5e50b5fb4dfef4b99e","filename":"data_bundle-20.zip","filesize":4000,"id":"e67803c5-95be-443a-b059-40039bfa9daf","links":{"download":"https://sandbox.zenodo.org/api/files/86ece719-5338-47af-b23f-ff4854e6df9d/data_bundle-20.zip","self":"https://sandbox.zenodo.org/api/deposit/depositions/274851/files/e67803c5-95be-443a-b059-40039bfa9daf"}}],"id":275055,"links":{"badge":"https://sandbox.zenodo.org/badge/doi/10.5072/zenodo.275055.svg","bucket":"https://sandbox.zenodo.org/api/files/86ece719-5338-47af-b23f-ff4854e6df9d","conceptbadge":"https://sandbox.zenodo.org/badge/doi/10.5072/zenodo.275054.svg","conceptdoi":"https://doi.org/10.5072/zenodo.275054","discard":"https://sandbox.zenodo.org/api/deposit/depositions/275055/actions/discard","doi":"https://doi.org/10.5072/zenodo.275055","edit":"https://sandbox.zenodo.org/api/deposit/depositions/275055/actions/edit","files":"https://sandbox.zenodo.org/api/deposit/depositions/275055/files","html":"https://sandbox.zenodo.org/deposit/275055","latest":"https://sandbox.zenodo.org/api/records/275055","latest_html":"https://sandbox.zenodo.org/record/275055","newversion":"https://sandbox.zenodo.org/api/deposit/depositions/275055/actions/newversion","publish":"https://sandbox.zenodo.org/api/deposit/depositions/275055/actions/publish","record":"https://sandbox.zenodo.org/api/records/275055","record_html":"https://sandbox.zenodo.org/record/275055","registerconceptdoi":"https://sandbox.zenodo.org/api/deposit/depositions/275055/actions/registerconceptdoi","self":"https://sandbox.zenodo.org/api/deposit/depositions/275055"},"metadata":{"access_right":"open","communities":[{"identifier":"zenodo"}],"creators":[{"name":"Alice W Land","orcid":"0000-0002-1825-0097"}],"description":"A test dataset of not much 
interest","doi":"10.5072/zenodo.275055","license":"CC0-1.0","prereserve_doi":{"doi":"10.5072/zenodo.275055","recid":275055},"publication_date":"2019-04-23","title":"A good start","upload_type":"dataset","version":"7484274FFDD99BC6822AF2F1CF805A5ED5F614C504D57FB6B960A2AD16575931"},"modified":"2019-04-23T19:07:45.060077+00:00","owner":25426,"record_id":275055,"state":"done","submitted":true,"title":"A good start"}

{'conceptdoi': '10.5072/zenodo.275054',
 'conceptrecid': '275054',
 'created': '2019-04-23T19:07:43.524952+00:00',
 'doi': '10.5072/zenodo.275055',
 'doi_url': 'https://doi.org/10.5072/zenodo.275055',
 'files': [{'checksum': '1f555cdc3d5e5d5e50b5fb4dfef4b99e',
   'filename': 'data_bundle-20.zip',
   'filesize': 4000,
   'id': 'e67803c5-95be-443a-b059-40039bfa9daf',
   'links': {'download': 'https://sandbox.zenodo.org/api/files/86ece719-5338-47af-b23f-ff4854e6df9d/data_bundle-20.zip',
    'self': 'https://sandbox.zenodo.org/api/deposit/depositions/274851/files/e67803c5-95be-443a-b059-40039bfa9daf'}}],
 'id': 275055,
 'links': {'badge': 'https://sandbox.zenodo.org/badge/doi/10.5072/zenodo.275055.svg',
  'bucket': 'https://sandbox.zenodo.org/api/files/86ece719-5338-47af-b23f-ff4854e6df9d',
  'conceptbadge': 'https://sandbox.zenodo.org/badge/doi/10.5072/zenodo.275054.svg',
  'conceptdoi': 'https://doi.org/10.5072/zenodo.275054',
  'discard': 'https://sandbox.zenodo.org/api/deposit/depositi

The `latest_html` key under `links` gives the more human-readable Zenodo record for browsers, e.g. [https://sandbox.zenodo.org/record/275055](https://sandbox.zenodo.org/record/275055), while `doi` gives the DOI (which would resolve if the RO had been deposited to the production Zenodo).

We recognize our `metadata` properties, which have been augmented with an `upload_type` of `dataset` and a `publication_date`.
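
The fields discussed above can also be pulled out programmatically. The following is a minimal sketch using a trimmed copy of the record shown earlier; `summarize_record` is a hypothetical helper, and it uses `.get()` throughout because this tutorial does not establish which keys are guaranteed to be present.

```python
def summarize_record(record):
    """Collect the fields of a Zenodo record that are discussed in this section."""
    links = record.get("links", {})
    metadata = record.get("metadata", {})
    return {
        "doi": record.get("doi"),
        "html": links.get("latest_html"),
        "title": metadata.get("title"),
        "upload_type": metadata.get("upload_type"),
        "publication_date": metadata.get("publication_date"),
    }

# Trimmed copy of the record shown above
record = {
    "doi": "10.5072/zenodo.275055",
    "links": {"latest_html": "https://sandbox.zenodo.org/record/275055"},
    "metadata": {
        "title": "A good start",
        "upload_type": "dataset",
        "publication_date": "2019-04-23",
    },
}
summary = summarize_record(record)
for key, value in summary.items():
    print(key, "=", value)
```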

We can also see that the Research Object appears in the [most recent datasets](https://sandbox.zenodo.org/search?page=1&size=20&type=dataset&sort=mostrecent) on Zenodo.

## Depositing in other archives

It is possible to change the [deposition configuration](https://github.com/ResearchObject/research-object-composer/blob/deposition/src/main/resources/depositor.properties) of the RO Composer to support depositing to other archives, e.g. [Mendeley Data](https://data.mendeley.com/), although a corresponding [implementation](https://github.com/ResearchObject/research-object-composer/tree/deposition/src/main/java/uk/org/esciencelab/researchobjectservice/deposition) must be added to the code.

Current depositors include Zenodo and plain HTTP `POST`, but we are also planning a [SWORD](http://swordapp.org/) depositor to support multiple repositories. A remaining challenge here is how to unify the minimum metadata across repositories.

# Workflow Research Objects

Let's now look at the more detailed Research Object profile for capturing [scientific workflows](http://slides.com/soilandreyes/2019-03-11-reproducibility-pistoia). 

In [8]:
wf_profile = profiles[1]
wf_profile

{'id': 2,
 'name': 'draft_task',
 'fields': ['input', 'workflow', 'workflow_params', '_metadata'],
 '_links': {'self': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task'},
  'schema': {'href': 'http://openphacts.cs.man.ac.uk:8080/schemas/draft_task.schema.json'},
  'researchObjects': {'href': 'http://openphacts.cs.man.ac.uk:8080/profiles/draft_task/research_objects'}}}

These research objects correspond to a (potential) workflow run: the field `workflow` holds the workflow definition, `input` the workflow input data, and `workflow_params` the workflow configuration.
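
Selecting the profile by position (`profiles[1]`) depends on the order in which the server lists its profiles. A slightly more defensive sketch looks the profile up by its `name` instead. `find_profile` is a hypothetical helper, and in the sample list below the first entry (`example_profile`) is a made-up placeholder, while the `draft_task` entry is trimmed from the output above.

```python
def find_profile(profiles, name):
    """Return the first profile whose 'name' matches, or raise LookupError."""
    for profile in profiles:
        if profile.get("name") == name:
            return profile
    raise LookupError("No profile named %r" % name)

sample_profiles = [
    {"id": 1, "name": "example_profile", "fields": []},  # hypothetical placeholder
    {"id": 2,
     "name": "draft_task",
     "fields": ["input", "workflow", "workflow_params", "_metadata"],
     "_links": {"researchObjects": {
         "href": "http://openphacts.cs.man.ac.uk:8080/profiles/draft_task/research_objects"}}},
]

wf = find_profile(sample_profiles, "draft_task")
print(wf["fields"])
print(wf["_links"]["researchObjects"]["href"])
```

In the notebook, `find_profile(profiles, "draft_task")` would select the same profile as `profiles[1]` does here.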

For the purpose of this demonstration, assume we are going to describe an execution of 