# CANDLE and DLHub

This notebook shows how DLHub can be used to work with ECP-CANDLE models. We first use the DLHubClient to discover existing models. Then we use the client to initiate a publication request of a pre-trained P1B1 model. Finally, we perform on-demand inference of both the P1B1 and Combo models that are published in DLHub.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import json
import os

Use the DLHub SDK to create a DLHubClient. The DLHubClient wraps both our REST API and search catalog. You can use the client to publish, discover, and use models.

In [2]:
import dlhub_sdk
dl = dlhub_sdk.DLHubClient()

DLHub uses a custom metadata schema to encode models. This metadata describes the model's inputs, outputs, type, and author information. While we provide helper functions to aid in its creation, you can see the structure in the search result below.

In [3]:
df_serv = dl.search_by_servable(servable_name="candle*")
df_serv[0]

{'datacite': {'alternateIdentifiers': [{'alternateIdentifier': 'https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/Combo',
    'alternateIdentifierType': 'URL'}],
  'creators': [{'affiliations': 'CANDLE',
    'familyName': 'Team',
    'givenName': 'CANDLE'}],
  'descriptions': [{'description': 'CANDLE pilot 1 combo model.',
    'descriptionType': 'Abstract'}],
  'identifier': {'identifier': '10.YET/UNASSIGNED', 'identifierType': 'DOI'},
  'publicationYear': '2019',
  'publisher': 'DLHub',
  'resourceType': {'resourceTypeGeneral': 'InteractiveResource'},
  'titles': [{'title': 'CANDLE Pilot1 Combo Demo1'}]},
 'dlhub': {'domains': ['genomics', 'cancer research'],
  'ecr_uri': '039706667969.dkr.ecr.us-east-1.amazonaws.com/b82c8643-ffd5-4a55-a02c-5b12982ac2ca',
  'files': {'arch': 'saved.model.h5',
   'model': 'saved.weights.h5',
   'other': ['dropout_layer.py']},
  'funcx_id': '0a30913d-31be-4c5e-a36b-38da78a5abfc',
  'id': 'b82c8643-ffd5-4a55-a02c-5b12982ac2ca',
  'name': 'candl
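Each search result is a plain nested dictionary, so individual fields can be read with ordinary indexing. The sketch below works on a hand-copied fragment of the record shown above rather than a live query, and `summarize_record` is a hypothetical helper written for illustration; it is not part of the DLHub SDK.

```python
# A hand-copied fragment of the search result shown above (not a live query).
record = {
    "datacite": {
        "titles": [{"title": "CANDLE Pilot1 Combo Demo1"}],
        "publicationYear": "2019",
    },
    "dlhub": {
        "domains": ["genomics", "cancer research"],
        "files": {
            "arch": "saved.model.h5",
            "model": "saved.weights.h5",
            "other": ["dropout_layer.py"],
        },
    },
}


def summarize_record(rec):
    """Hypothetical helper: pull the title, domains, and file names out of a record."""
    datacite = rec.get("datacite", {})
    dlhub = rec.get("dlhub", {})
    files = dlhub.get("files", {})
    titles = datacite.get("titles", [])
    # Collect the architecture file, the weights file, and any extra files
    named = [files.get("arch"), files.get("model")] + files.get("other", [])
    return {
        "title": titles[0]["title"] if titles else None,
        "domains": dlhub.get("domains", []),
        "files": [name for name in named if name],
    }


summary = summarize_record(record)
print(summary["title"])
print(summary["files"])
```

Using `.get` with defaults keeps the helper from failing on records that lack one of the blocks, which seems a reasonable precaution when results come from a search index.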

# Publishing Models

To publish a model with DLHub we first gather some metadata about the model itself. Our SDK is designed to assist the user in generating this metadata.

This example shows how to use the DLHub SDK to:
- Automatically extract metadata from a Keras model
- Describe additional metadata about the model
- Publish the model

## Publishing a P1B1 model.

I've trained a simple version of the P1B1 model using the code found here:

https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/P1B1

The resulting model has been exported as "p1b1.h5" and is in the current working directory:

In [4]:
!ls -tho

total 220224
-rw-r--r--  1 marcus    53M Jul 25 10:09 saved.weights.h5
-rw-r--r--  1 marcus    53M Jul 25 10:09 saved.model.h5
-rwxr-xr-x  1 marcus   2.3M Jul 25 10:09 pilot1.npy
-rw-r--r--  1 marcus   497B Jul 25 10:09 dropout_layer.py
-rw-r--r--  1 marcus   836B Jul 25 10:09 README.md
-rw-r--r--  1 marcus    59K Jul 25 10:09 CANDLE-DLHub-demo.ipynb


The first step in describing the model is to use the SDK to create a model object. In the case of Keras, the model object is able to bootstrap the metadata by loading the trained model and automatically extracting metadata about its structure.

In [5]:
from dlhub_sdk.models.servables.keras import KerasModel
import pickle as pkl
import json

# Describe the keras model
model_info = KerasModel.create_model('p1b1.h5', list(map(str, range(10))))

Using TensorFlow backend.


OSError: Unable to open file (unable to open file: name = 'p1b1.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Now we use the SDK to append other metadata to the model. Below we set the name of the model, dependencies, and describe additional metadata for search purposes.

In [None]:
# Describe the model
model_info.set_title("CANDLE P1B1 Demo2")
model_info.set_name("candle_p1b1_demo2")
model_info.set_domains(["genomics","cancer research"])

# Add dependencies
model_info.add_requirement('keras', 'detect')
model_info.add_requirement('numpy', 'detect')

# Describe the outputs in more detail
model_info['servable']['methods']['run']['output']['description'] = 'Output'
model_info['servable']['methods']['run']['input']['description'] = 'Input'

# Add provenance information
model_info.set_authors(["Team, CANDLE"], ["CANDLE"])
model_info.set_abstract("CANDLE pilot 1 benchmark 1 model.")
model_info.add_alternate_identifier("https://github.com/ECP-CANDLE/Benchmarks/tree/release_01/Pilot1/P1B1", "URL")

Now that the metadata is created, we can use it to publish the model. First, we print the result to review it.

In [None]:
# Print out the result
print(json.dumps(model_info.to_dict(), indent=2))

Here we use the SDK to publish the model directly from the model object. The files listed in the "files" block are packaged and sent to the DLHub service, and the JSON document is used to guide a publication pipeline.

The publication process includes:
- Creating a temporary tar file of the files specified in the above JSON
- Transmitting the tar file to the DLHub service using a multipart POST request
- Starting a server-side flow to:
    - Use the specified dependencies to create a docker container of the model
    - Push a copy of the container to AWS ECR
    - Ingest the metadata into the search index
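The first step of the list above, bundling the listed files into a tar archive, can be sketched with the Python standard library. This is only an illustration and not the SDK's actual implementation: `bundle_files` and `list_bundle` are hypothetical helpers, gzip compression and the archive name `servable.tar.gz` are assumptions, and the demo uses placeholder files in a temporary directory instead of the real model files.

```python
import os
import tarfile
import tempfile


def bundle_files(file_paths, tar_path):
    """Hypothetical helper: write the given files into a gzip-compressed tar archive."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in file_paths:
            # Store each file under its base name so the archive has a flat layout
            tar.add(path, arcname=os.path.basename(path))
    return tar_path


def list_bundle(tar_path):
    """Return the sorted member names of an archive written by bundle_files."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with placeholder files standing in for the real model files
with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for name in ["saved.model.h5", "saved.weights.h5", "dropout_layer.py"]:
        path = os.path.join(workdir, name)
        with open(path, "w") as fh:
            fh.write("placeholder")
        paths.append(path)
    archive = bundle_files(paths, os.path.join(workdir, "servable.tar.gz"))
    print(list_bundle(archive))
```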

We could also save the above JSON document and use it to publish the model via our CLI or through our GitHub-based repo2docker pipeline.

In [None]:
task_id = dl.publish_servable(model_info)

In [None]:
task_id

# Running models

The following cells show how to use the DLHub SDK to invoke the P1B1 model published in DLHub.

I have taken a subset of the data available on the CANDLE FTP site and placed it in a local file called "pilot1.npy".

In [None]:
test_data = np.load("pilot1.npy")

In [None]:
print(test_data)
print("There are {0} entries in the dataset. Each entry has {1} values.".format(len(test_data), len(test_data[0])))

Now we need to find the model's name. I previously published one called "candle_p1b1_demo1".

In [None]:
df_serv = dl.search_by_servable(servable_name="candle_p1b1_demo1")
servable_name = df_serv[0]['dlhub']['shorthand_name']
servable_name

In [None]:
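p1b1_preds = []
for data in test_data.tolist():
    # Send one entry at a time, wrapped in a list to form a batch of size 1
    pred = dl.run(servable_name, [data], input_type='python')
    p1b1_preds.append(np.array(pred))
    break  # Only the first entry is run here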

In [None]:
p1b1_preds

In [None]:
len(p1b1_preds[0][0])

# Publishing and Using Pilot 1: Combo

Here is another example that uses the SDK to mark up, publish, and use the Combo model. This example is a little different because the trained model is provided as two files: a set of weights and an architecture file.

The Combo model also requires a custom dropout layer. We have extended the Keras model loader to support this. However, the metadata describing the model must indicate that the custom layer is necessary and the layer needs to be shipped along with the model itself to create the servable.

In [None]:
from dlhub_sdk.models.servables.keras import KerasModel
from dropout_layer import PermanentDropout
import pickle as pkl
import json
# Describe the keras model
model_info = KerasModel.create_model('saved.weights.h5', list(map(str, range(10))), 
                                     arch_path="saved.model.h5", 
                                     custom_objects={"PermanentDropout": PermanentDropout})

In [None]:
# Describe the model
model_info.set_title("CANDLE Pilot1 Combo Demo2")
model_info.set_name("candle_p1_combo_demo2")
model_info.set_domains(["genomics","cancer research"])

# Add dependencies
model_info.add_requirement('keras', 'detect')
model_info.add_requirement('numpy', 'detect')

# Add dropout layer file
model_info.add_file("dropout_layer.py")

# Describe the outputs in more detail
model_info['servable']['methods']['run']['output']['description'] = 'Output'
model_info['servable']['methods']['run']['input']['description'] = 'Input'

# Add provenance information
model_info.set_authors(["Team, CANDLE"], ["CANDLE"])
model_info.set_abstract("CANDLE pilot 1 combo model.")
model_info.add_alternate_identifier("https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/Combo", "URL")

In [None]:
print(json.dumps(model_info.to_dict(), indent=2))

In [None]:
task_id = dl.publish_servable(model_info)

In [None]:
task_id

## Running the Combo model

I am not sure exactly what data the Combo model expects, but the model summary states that there are three inputs:

input.cell.expression (InputLay (None, 942)<br>
input.drug1.descriptors (InputL (None, 3820)<br>
input.drug2.descriptors (InputL (None, 3820)<br>

Therefore we can create example input to ensure the model is running correctly.

In [None]:
servable_desc = dl.describe_servable('ryan_globusid', 'candle_p1_combo_demo1')
print(servable_desc['servable']['model_summary'])

The servable description also contains a more succinct description of the inputs.

In [None]:
servable_desc['servable']['methods']['run']['input']

The client also provides a shortcut for accessing the input descriptions, as we anticipate that being a common need.

In [None]:
dl.describe_methods('ryan_globusid', 'candle_p1_combo_demo1', 'run')

Given this information, we can create inputs in the proper format and use them to run the model.

In [None]:
combo_input = [np.zeros((200, 942)).tolist(), 
               np.zeros((200, 3820)).tolist(), 
               np.zeros((200, 3820)).tolist()]
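Before sending a large payload to the service, it may help to confirm locally that the nested lists match the three input widths from the model summary (942, 3820, and 3820). The sketch below is one way to do that with plain Python; `nested_shape` and `check_combo_input` are hypothetical helpers, not part of the DLHub SDK, and the small two-row example stands in for the 200-row input above.

```python
# Feature widths taken from the model summary quoted above
EXPECTED_WIDTHS = (942, 3820, 3820)


def nested_shape(rows):
    """Return (n_rows, n_cols) for a rectangular list of lists; raise if ragged."""
    n_cols = len(rows[0]) if rows else 0
    for i, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError("row {0} has {1} values, expected {2}".format(i, len(row), n_cols))
    return (len(rows), n_cols)


def check_combo_input(inputs, expected_widths=EXPECTED_WIDTHS):
    """Hypothetical helper: verify the number of inputs, their widths, and a shared batch size."""
    if len(inputs) != len(expected_widths):
        raise ValueError("expected {0} inputs, got {1}".format(len(expected_widths), len(inputs)))
    shapes = [nested_shape(block) for block in inputs]
    if len({shape[0] for shape in shapes}) != 1:
        raise ValueError("inputs have different batch sizes: {0}".format(shapes))
    for shape, width in zip(shapes, expected_widths):
        if shape[1] != width:
            raise ValueError("expected width {0}, got {1}".format(width, shape[1]))
    return shapes


# Small stand-in for combo_input: two rows per input instead of 200
example = [[[0.0] * width for _ in range(2)] for width in EXPECTED_WIDTHS]
print(check_combo_input(example))
```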

In [None]:
res = dl.run('ryan_globusid/candle_p1_combo_demo1', combo_input, input_type='json')

In [None]:
fig, ax = plt.subplots()

ax.hist(np.ravel(res), density=True)

ax.set_xlabel('Output')
ax.set_ylabel('Probability density')

The distribution of the outputs has a non-zero variance, as expected given that the "p1_combo" model contains a dropout layer that remains active at prediction time.
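To see why active dropout spreads out the predictions, consider a toy example that is unrelated to the actual Combo architecture. It applies standard inverted dropout to a fixed input and then takes a weighted sum. The functions `dropout` and `toy_predict`, the rate of 0.2, and the input and weight values are all assumptions chosen for illustration.

```python
import random
import statistics


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest by 1/(1-rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def toy_predict(inputs, weights, rate, rng):
    """Toy stand-in for a model: dropout on the inputs followed by a weighted sum."""
    hidden = dropout(inputs, rate, rng)
    return sum(h * w for h, w in zip(hidden, weights))


rng = random.Random(0)  # Fixed seed so the run is repeatable
inputs = [1.0] * 50
weights = [0.1] * 50

# The same input is evaluated 200 times with and without dropout
with_dropout = [toy_predict(inputs, weights, 0.2, rng) for _ in range(200)]
without_dropout = [toy_predict(inputs, weights, 0.0, rng) for _ in range(200)]

print("variance with dropout:   ", statistics.pvariance(with_dropout))
print("variance without dropout:", statistics.pvariance(without_dropout))
```

With dropout disabled, every call returns the same value and the variance is zero; with dropout enabled, each call masks a different random subset of inputs, so identical inputs yield a spread of outputs.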

In [None]:
res_test = dl.run('ryan_globusid/noop', [1,2,3], input_type='json')

In [None]:
res_test