# CanDIGv2 Services Demo
- - -

This notebook outlines how to use the demonstration server and client to build a simple Data Object service that serves data from a few different sources using CanDIGv2. The CanDIGv2 project is a collection of heterogeneous services designed to work together to facilitate end-to-end dataflow for genomic data.

## Demo Topology

```
+---------------------------------------------------------------------------------------------+
|                                                                                             |
|                                    +--------------+                                         |
|                                    | candig.local |                                         |
|                                    +--------------+                                         |
|                                            |                                                |
|                                            |                                                |
|                                            |                 +------------------------+     |
|                                 +--------------------+       | consul:8300-8310 (tcp) |     |
|   +-----------------------+     | traefik:8000 (tcp) |       |       :8400      (tcp) |     |
|   | weavescope:4040 (tcp) |-----|        :80   (tcp) |-------|       :8500      (tcp) |     |
|   +-----------------------+     |        :443  (tcp) |       |       :8301-8302 (udp) |     |
|                                 +--------------------+       |       :8600      (udp) |     |
|                                            |                 +------------------------+     |
|                                            |                                                |
|                                            |                                                |
|       +-------------------+     +--------------------+       +----------------------+       |
|       | htsget:4844 (tcp) |-----| jupyter:8888 (tcp) |-------| ga4gh-dos:8080 (tcp) |       |
|       +-------------------+     +--------------------+       +----------------------+       |
|                                            |                                                |
|                                            |                                                |
|                                            |                                                |
|                                            |                                                |
|        +----------------------+            |          +----------------------------------+  |
|        | +------------------+ |            |          | +------------------------------+ |  |
|        | | minio:9000 (tcp) | |            |          | |  toil-master/wes:5050 (tcp)  | |  |
|        | +------------------+ |            |          | +------------------------------+ |  |
|        | +------------------+ |------------+----------| +------------------------------+ |  |
|        | |minio-client (mc) | |                       | |   toil-worker:5051 (tcp)     | |  |
|        | +------------------+ |                       | +------------------------------+ |  |
|        +----------------------+                       +----------------------------------+  |
|                                                                                             |
+---------------------------------------------------------------------------------------------+
```

## Prerequisites

If you are starting from a blank jupyter-lab instance, you won't have the ipython kernel needed for CanDIGv2. So let's create it. Copy/paste the following code into your **bash** shell:

In [None]:
cd /notebooks/demo/CanDIGv2

conda env create -f etc/conda/environment.yml
conda activate candig
python -m ipykernel install --user --name candig

It helps to keep a `bash` shell running alongside our `iPyKernel`...

### Populate Minio Object Store with Buckets and Data

We will need minio server/client with some demo data before we continue. Normally this is something we can do easily when using CanDIGv2 in `Swarm` or `Kubernetes` mode, but for a local `Compose` demo we need to take some extra steps.

First, we need to copy over the `minio-access-key` and `minio-secret-key` files from our host. Either paste the keys into the corresponding files, or run the following docker commands from the **host machine** that holds the keys:


In [None]:
jp_container=$(docker ps | grep jupyter | awk '{print $1}')
docker cp minio-access-key $jp_container:/notebooks/demo/CanDIGv2/
docker cp minio-secret-key $jp_container:/notebooks/demo/CanDIGv2/

Now, let's get the latest minio client in our env:

In [1]:
# check we are in the project root directory
!pwd

/notebooks/demo/CanDIGv2


In [8]:
# copy over a site.env for compose testing
!cp $(pwd)/etc/site/compose.env $(pwd)/site.env

!make minio

mkdir -p /notebooks/demo/CanDIGv2/bin
curl -Lo /notebooks/demo/CanDIGv2/bin/minio \
	https://dl.minio.io/server/minio/release/linux-amd64/minio
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.7M  100 40.7M    0     0  2518k      0  0:00:16  0:00:16 --:--:-- 3223k
curl -Lo /notebooks/demo/CanDIGv2/bin/mc \
	https://dl.minio.io/client/mc/release/linux-amd64/mc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M  100 16.0M    0     0  1835k      0  0:00:08  0:00:08 --:--:-- 2046k
chmod 755 /notebooks/demo/CanDIGv2/bin/minio
chmod 755 /notebooks/demo/CanDIGv2/bin/mc


Finally, we can run this script to populate our Minio Object Store:

In [16]:
!bin/mc config host add minio http://minio:9000 $(cat minio-access-key) $(cat minio-secret-key)

Added `minio` successfully.

In [11]:
!wget -nc --continue https://s3.amazonaws.com/1000genomes/phase3/data/NA20276/alignment/NA20276.mapped.ILLUMINA.bwa.ASW.low_coverage.20120522.bam
!wget -nc --continue http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
#!wget -nc --continue https://dl.dnanex.us/F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam

!bin/mc mb minio/candig/hg38/chromosomes/
!bin/mc mb minio/candig/1000genomes/phase3/data/NA20276/alignment/
!bin/mc mb minio/candig/reads/BroadHiSeqX_b37/NA12878/

!bin/mc cp NA20276.mapped.ILLUMINA.bwa.ASW.low_coverage.20120522.bam minio/candig/1000genomes/phase3/data/NA20276/alignment/
!bin/mc cp chr22.fa.gz minio/candig/hg38/chromosomes/
#!bin/mc cp NA12878_recompressed.bam minio/candig/reads/BroadHiSeqX_b37/NA12878/

File ‘NA20276.mapped.ILLUMINA.bwa.ASW.low_coverage.20120522.bam’ already there; not retrieving.

File ‘chr22.fa.gz’ already there; not retrieving.

Bucket created successfully `minio/candig/hg38/chromosomes/`.
Bucket created successfully `minio/candig/1000genomes/phase3/data/NA20276/alignment/`.
Bucket created successfully `minio/candig/reads/BroadHiSeqX_b37/NA12878/`.
chr22.fa.gz:   11.69 MiB / 11.69 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 100.00% 37.89 MiB/s 0s

## Working with DOS/DRS API

A useful Data Object Service might present a list of available reference FASTAs for performing downstream alignment and analysis.

We'll index the UCSC human reference FASTAs into DOS as an example.


In [17]:
import requests, json
import datetime

d = datetime.datetime.utcnow()

headers = {'content-type': 'application/json'}
url = 'http://ga4gh-dos:8080/ga4gh/dos/v1/dataobjects'
data = json.dumps({"data_object": {"id": "hg38-chr22",
     "name": "Human Reference Chromosome 22",
     "created": d.isoformat("T") + "Z",
     "checksums": [{"checksum": "41b47ce1cc21b558409c19b892e1c0d1", "type": "md5"}],
     "urls": [{"url": "http://minio:9000/candig/hg38/chromosomes/chr22.fa.gz"}],
     "size": "12255678"}})

requests.post(url, data, headers=headers)

<Response [200]>
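
The `checksums` and `size` fields in the request above are hard-coded. A minimal sketch of how they could be derived from the local file instead, assuming Python 3 and that the file (for example `chr22.fa.gz`) is present in the working directory. The helper name `describe_file` is hypothetical and not part of any CanDIGv2 or DOS library:

```python
import hashlib
import os


def describe_file(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes as a string) for a local file.

    Hypothetical helper for illustration; not part of the DOS client.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FASTA/BAM files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # The notebook sends "size" as a string, so we return a string here too.
    return digest.hexdigest(), str(os.path.getsize(path))
```

In the notebook this might be called as `checksum, size = describe_file("chr22.fa.gz")`, with the two values substituted into the `data_object` payload.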

In [84]:
r = requests.get(url)

print(r.text)

{
  "data_objects": [
    {
      "checksums": [
        {
          "checksum": "41b47ce1cc21b558409c19b892e1c0d1",
          "type": "md5"
        }
      ],
      "created": "2019-05-28T18:38:16.960271Z",
      "id": "hg38-chr22",
      "name": "Human Reference Chromosome 22",
      "size": "12255678",
      "updated": "2019-05-28T18:38:16.960285Z",
      "urls": [
        {
          "url": "http://minio:9000/candig/hg38/chromosomes/chr22.fa.gz"
        }
      ],
      "version": "2019-05-28T18:38:16.960291Z"
    }
  ]
}



### Using the Client to Access the Demo Server

We can now use the Python client to create a simple Data Object. The same could be done using cURL or wget but we want to be more programmatic.


In [2]:
from ga4gh.dos.client import Client

client = Client("http://ga4gh-dos:8080/ga4gh/dos/v1")
c = client.client
models = client.models

### Listing Data Objects

To list the existing Data Objects, we send a `ListDataObjectsRequest` to the ListDataObjects method!

**NOTE:** If you install ga4gh-dos-schema via pip, there is a dependency error with the `ga4gh.dos.client` library. Run the following command to fix:

In [3]:
# Dependency mapping issues - https://github.com/ga4gh/data-repository-service-schemas/issues/147
!pip install jsonschema==2.6.0

Collecting jsonschema==2.6.0
  Using cached https://files.pythonhosted.org/packages/77/de/47e35a97b2b05c2fadbec67d44cfcdcd09b8086951b331d82de90d2912da/jsonschema-2.6.0-py2.py3-none-any.whl
Installing collected packages: jsonschema
  Found existing installation: jsonschema 3.0.1
    Uninstalling jsonschema-3.0.1:
      Successfully uninstalled jsonschema-3.0.1
Successfully installed jsonschema-2.6.0


In [10]:
# FIXED :)
ListDataObjectsRequest = models.get_model('ListDataObjectsRequest')
list_request = c.ListDataObjects(page_size=10000000)
list_response = list_request.result()
print("Number of Data Objects: {} ".format(len(list_response.data_objects)))

Number of Data Objects: 3 


We can filter the DataObject in order to retrieve just the URL:

In [6]:
# FIXED :)
data_objects = list_response.data_objects
data_object = data_objects[1]
print('url: {}, file_size (B): {}'.format(data_object.urls[0].url, data_object.size))

url: http://minio:9000/candig/hg38/chromosomes/chr22.fa.gz, file_size (B): 12255678
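
Picking a Data Object by list position depends on the order in which the server returns results. A small sketch of selecting by `id` instead, assuming Python 3 and that each object exposes an `id` attribute as the DOS client models do. The `SimpleNamespace` objects below are stand-ins for illustration, and `find_data_object` is a hypothetical helper:

```python
from types import SimpleNamespace


def find_data_object(data_objects, object_id):
    """Return the first data object whose id matches, or None if absent."""
    return next((obj for obj in data_objects if obj.id == object_id), None)


# Stand-in objects for illustration only; in the notebook these would come
# from list_response.data_objects.
sample = [
    SimpleNamespace(id="na12878_2", size="555749"),
    SimpleNamespace(id="hg38-chr22", size="12255678"),
]
match = find_data_object(sample, "hg38-chr22")
print(match.size)
```

Returning `None` for a missing id lets the caller decide whether an absent object is an error.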


There are additional features of the DOS client, such as querying against public DOS/DRS instances and working with DataBundles, but these are out of scope for our demo. You can review these topics [here](https://github.com/ga4gh/data-repository-service-schemas/blob/master/python/examples/gdc_notebook.ipynb).

## Using DOS with htsget

Data Objects are meant to represent versioned artifacts and can represent an API resource. For example, we could use DOS as a way of exposing htsget resources.

In the [htsget Quickstart documentation](https://htsget.readthedocs.io/en/stable/quickstart.html) a link is made to the following snippet, which will stream the BAM results to the client.

In this example, we will take a subset of the Illumina Platinum Genomes NA12878 and create a DataObject with metadata corresponding to the genomic range of interest. Using `htsget` is useful in cases such as this, since BAM files can be VERY large...

In [13]:
!wget --spider https://dl.dnanex.us/F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam

Spider mode enabled. Check if remote file exists.
--2019-05-29 07:27:25--  https://dl.dnanex.us/F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam
Resolving dl.dnanex.us (dl.dnanex.us)... 3.89.79.27, 54.164.161.183
Connecting to dl.dnanex.us (dl.dnanex.us)|3.89.79.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 140786687556 (131G) [application/x-gzip]
Remote file exists.



This particular sequence file is **131GB *compressed***, way too large to download over hotel/conference wi-fi. Furthermore, we do not need the entire dataset for our use case. So let's slice it with `htsget` and create a DataObject that we can use for downstream analysis:

In [15]:
from ga4gh.dos.client import Client
import htsget
import datetime
import pytz

# https://stackoverflow.com/questions/8556398/generate-rfc-3339-timestamp-in-python#8556555
d = datetime.datetime.utcnow()
d_with_timezone = d.replace(tzinfo=pytz.UTC)

# htsget docs - https://htsget.readthedocs.io/en/latest/quickstart.html
url = "http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878"
with open("NA12878_2.bam", "wb") as output:
    htsget.get(url, output, reference_name="2", start=1000, end=20000)

# https://github.com/ga4gh/data-repository-service-schemas/blob/master/python/examples/object-type-examples.ipynb
client = Client("http://ga4gh-dos:8080/ga4gh/dos/v1/")
c = client.client
models = client.models

DataObject = models.get_model('DataObject')
Checksum = models.get_model('Checksum')
URL = models.get_model('URL')

na12878_2 = DataObject()
na12878_2.id = 'na12878_2'
na12878_2.name = 'NA12878_2.bam'
na12878_2.checksums = [
    Checksum(checksum='eaf80af5e9e54db5936578bed06ffcdc', type='md5')]
na12878_2.urls = [
    URL(
        url="http://minio:9000/candig/reads/BroadHiSeqX_b37/NA12878",
        system_metadata={'reference_name': 2, 'start': 1000, 'end': 20000})]
na12878_2.aliases = ['NA12878 chr 2 subset']
na12878_2.size = '555749'
na12878_2.created = d_with_timezone

c.CreateDataObject(body={'data_object': na12878_2}).result()

response = c.GetDataObject(data_object_id='na12878_2').result()

print(response)


GetDataObjectResponse(data_object=DataObject(aliases=[u'NA12878 chr 2 subset'], checksums=[Checksum(checksum=u'eaf80af5e9e54db5936578bed06ffcdc', type=u'md5')], created=datetime.datetime(2019, 5, 28, 22, 12, 4, 831951, tzinfo=tzlocal()), description=None, id=u'na12878_2', mime_type=None, name=u'NA12878_2.bam', size=555749, updated=datetime.datetime(2019, 5, 28, 22, 12, 4, 831960, tzinfo=tzlocal()), urls=[URL(system_metadata=SystemMetadata(end=20000, reference_name=2, start=1000), url=u'http://minio:9000/candig/reads/BroadHiSeqX_b37/NA12878', user_metadata=None)], version=u'2019-05-28T22:12:04.831964Z'))
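
The cell above uses `pytz` to attach a timezone to the `created` timestamp. As a sketch, and assuming a Python 3 kernel, the standard library alone can produce a timezone-aware UTC time and an RFC 3339 string. The function name `rfc3339_now` is hypothetical:

```python
import datetime


def rfc3339_now():
    """Return the current UTC time as an RFC 3339 string ending in 'Z'."""
    now = datetime.datetime.now(datetime.timezone.utc)
    # isoformat() renders the UTC offset as '+00:00'; RFC 3339 also allows 'Z'.
    return now.isoformat().replace("+00:00", "Z")


print(rfc3339_now())
```

The string form matches what the earlier `requests` example sends; the client models in the cell above accept a `datetime` object directly.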


We will place this file back into our minio-server so that others can retrieve it:

In [14]:
!bin/mc cp NA12878_2.bam minio/candig/reads/BroadHiSeqX_b37/NA12878/

...878_2.bam:  542.72 KiB / 542.72 KiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 100.00% 27.02 MiB/s 0s

## Using DOS with WES

The Workflow Execution Service (WES) presents an interface for workflow execution over HTTP. Simple JSON requests describing the inputs and outputs of a workflow are sent to the service. This allows us to "ship code to the data," which matters when data privacy and egress costs mean the data itself cannot be shared.
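
To make the shape of such a request concrete, here is a minimal sketch that assembles the form fields for a run request without sending anything. The field names follow my reading of the GA4GH WES 1.0 specification and should be checked against the deployed service. The URL and input values are placeholders, and `build_run_request` is a hypothetical helper:

```python
import json


def build_run_request(workflow_url, workflow_params,
                      workflow_type="CWL", workflow_type_version="v1.0"):
    """Assemble the form fields for a WES 'run workflow' request.

    Field names are assumed from the WES 1.0 specification.
    """
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # The workflow inputs travel as a JSON-encoded string.
        "workflow_params": json.dumps(workflow_params),
    }


# Hypothetical URL and input, for illustration only.
fields = build_run_request(
    "https://example.org/md5sum.cwl",
    {"input_file": {"class": "File", "path": "md5sum.input"}},
)
print(fields["workflow_params"])
```

Only the workflow description and its parameters cross the network in this request; the data stays wherever the service can already reach it.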

UCSC Toil is software for executing workflows. It presents a native Python API, which will not be demonstrated here, as well as a CWL-compliant CLI. Because of that CLI, Toil can be exposed through the workflow-service like any other CWL runner, as demonstrated here.

In [None]:
!wes-server --backend=wes_service.cwl_runner --opt runner=cwltoil --opt extra=--logLevel=CRITICAL

Background processes aren't supported directly in notebooks, so we close it here. But you can paste this command in a terminal and it will bring up your very own Workflow Execution Service!


### Using the Client

The server is now running, but Toil hasn't started yet as we haven't issued any Workflow Execution requests. Here, using the provided CLI client, we demonstrate a simple workflow which calculates an md5sum.

### Accessing a workflow via Dockstore Tool Registry Service

We will start by accessing the metadata for a workflow from dockstore.

In [22]:
import requests
response = requests.get('https://dockstore.org:8443/api/ga4gh/v2/tools/', params={"name": "md5sum"})
print(response.json()[0]['versions'][7])
md5sum_url = response.json()[0]['versions'][7]['url'] + '/plain-CWL/descriptor/%2FDockstore.cwl'

{u'descriptor_type': [u'CWL', u'WDL'], u'verified': True, u'name': u'master', u'url': u'https://dockstore.org/api/api/ga4gh/v2/tools/quay.io%2Fbriandoconnor%2Fdockstore-tool-md5sum/versions/master', u'image': u'7f82fc51fa35d36bbd61297ee0c05170ab4ba67c969a9a66b28e5ed3c100034b', u'verified_source': u'Phase 1 GA4GH Tool Execution Challenge', u'image_name': u'quay.io/briandoconnor/dockstore-tool-md5sum', u'meta_version': u'2017-07-23 15:45:37.0', u'id': u'quay.io/briandoconnor/dockstore-tool-md5sum:master', u'containerfile': True, u'registry_url': u'quay.io'}


We now have a URL we can pass to WES for execution!

In [23]:
print(md5sum_url)

https://dockstore.org/api/api/ga4gh/v2/tools/quay.io%2Fbriandoconnor%2Fdockstore-tool-md5sum/versions/master/plain-CWL/descriptor/%2FDockstore.cwl
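
The lookup above relies on the version of interest sitting at index 7, which may change as the tool gains versions. A sketch of selecting the version by name and percent-encoding the descriptor path, assuming Python 3. The `versions` list below is a trimmed-down stand-in with placeholder URLs, and `descriptor_url` is a hypothetical helper:

```python
from urllib.parse import quote


def descriptor_url(versions, version_name, descriptor_path="/Dockstore.cwl"):
    """Build the plain-CWL descriptor URL for the named tool version."""
    # Raises StopIteration if no version carries the requested name.
    version = next(v for v in versions if v["name"] == version_name)
    # safe="" forces '/' to be percent-encoded as %2F.
    return "{}/plain-CWL/descriptor/{}".format(
        version["url"], quote(descriptor_path, safe=""))


# Stand-in for response.json()[0]['versions']; URLs are placeholders.
versions = [
    {"name": "develop", "url": "https://example.org/tools/md5sum/versions/develop"},
    {"name": "master", "url": "https://example.org/tools/md5sum/versions/master"},
]
print(descriptor_url(versions, "master"))
```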


### Using the WES CLI client to Execute

In [10]:
!git clone https://github.com/common-workflow-language/workflow-service.git

Cloning into 'workflow-service'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 1105 (delta 55), reused 67 (delta 33), pack-reused 1004
Receiving objects: 100% (1105/1105), 260.72 KiB | 970.00 KiB/s, done.
Resolving deltas: 100% (629/629), done.


In [None]:
!wes-client --host 172.16.47.156:8181 --proto http $md5sum_url workflow-service/testdata/md5sum.cwl.json

As you can see, the wes-client submitted a request and polled the service until its state was `COMPLETE`. It then shows us the location of the outputs, so we can read them.
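
The polling behaviour can be sketched as a simple loop. This is not the wes-client implementation, only an illustration under stated assumptions: Python 3, the terminal state names as I understand them from the WES specification, and a simulated state sequence in place of real status requests. `wait_for_run` is a hypothetical helper:

```python
import time

# Terminal states assumed from the WES specification's State enumeration.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_state, interval=1.0, max_polls=60):
    """Call get_state() until it returns a terminal state or polls run out."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError("run did not reach a terminal state")


# Simulated state sequence standing in for repeated status requests.
states = iter(["QUEUED", "RUNNING", "RUNNING", "COMPLETE"])
print(wait_for_run(lambda: next(states), interval=0))
```

Bounding the number of polls keeps a stuck run from blocking the notebook indefinitely.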

## Running Workflows Against Remote WES Systems

In [8]:
!mkdir -p demo
!bin/mc cp --recursive minio/candig/wes/workflows/demo/ demo/

...kflow.cwl:  76.96 MiB / 76.96 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 100.00% 397.92 MiB/s 0s

In [25]:
!toil-cwl-runner --clean onError --cleanWorkDir onError --batchSystem mesos --mesosMaster toil-master:5050 demo/workflow.cwl demo/inputs.yml

INFO:cwltool:Resolved 'demo/workflow.cwl' to 'file:///notebooks/demo/CanDIGv2/demo/workflow.cwl'
ERROR:pymesos.process:Failed to process event
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pymesos/process.py", line 194, in read
    self._callback.process_event(event)
  File "/opt/conda/lib/python3.6/site-packages/pymesos/process.py", line 274, in process_event
    self.on_event(event)
  File "/opt/conda/lib/python3.6/site-packages/pymesos/scheduler.py", line 641, in on_event
    func(event)
  File "/opt/conda/lib/python3.6/site-packages/pymesos/scheduler.py", line 559, in on_offers
    self, [self._dict_cls(offer) for offer in offers]
  File "/opt/conda/lib/python3.6/site-packages/toil/batchSystems/mesos/batchSystem.py", line 423, in resourceOffers
    self._trackOfferedNodes(offers)
  File "/opt/conda/lib/python3.6/site-packages/toil/batchSystems/mesos/batchSystem.py", line 508, in _trackOfferedNodes
    assert(offer.agent_id.has_key('value'))
TypeE