# DLHub: A Data and Learning Hub for Science

DLHub is a self-service platform for publishing, applying, and creating machine learning (ML) models, including deep learning (DL) models, and associated data transformations. It is:

1. A **model serving infrastructure**: Users can easily run or test models (and also other related services, such as data transformations) via simple Web calls.

2. A **model registry**: Model developers can easily publish models, along with associated descriptive metadata and training data, so that they can then be discovered, cited, and reused by others.

3. A **model development system**: Developers of new models can easily access the data and computing infrastructure needed to re-train models for new applications.

DLHub benefits users in many ways. Data scientists can publish models (i.e., architectures and weights) and methods. Other scientists can apply existing models to new data with ease (e.g., by querying a prediction API for a deployed model). They can easily create new models with state-of-the-art techniques. Together, these capabilities lower barriers to employing ML/DL, making it easier for researchers to benefit from advances in ML/DL technologies.


# Publishing a Scikit-learn model

The example below covers how to publish a scikit-learn model in DLHub. This includes:
* Model dataset description (note: publishing dataset descriptions in DLHub is future work)
* Model metadata description
* Model publishing

As a simple example, we will show how to submit an SVM model trained on the [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/Iris).

To publish a model with DLHub we first gather some metadata about the model itself. Our SDK is designed to assist the user in generating this metadata.

### Describing the training dataset

The first step is to describe the training data, which in this example we assume is in a CSV file named `iris.csv`.

To make the training dataset usable for others, we want to tell them how to read it and what the columns are. Also, to make sure the authors of the data can be properly recognized, we need to provide provenance information. `dlhub_sdk` provides a simple tool for specifying this information: `TabularDataset`.

With the `TabularDataset` class, the DLHub SDK supports any data **format** readable by Pandas.
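
The `header=1` argument used in the next cell is an ordinary `pandas.read_csv` option: it takes the column names from the second line of the file and discards anything above it. The sketch below illustrates this on a small in-memory sample. The leading comment line and the sample rows are assumptions made for illustration, not the actual contents of `iris.csv`.

```python
import io
import pandas as pd

# Hypothetical file contents: one comment line, then the header, then data rows
raw = (
    "# Example comment line above the header\n"
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

# header=1 uses line index 1 (the second line) as the column names
sample = pd.read_csv(io.StringIO(raw), header=1)
print(sample.columns.tolist())
print(sample.dtypes)
```

If a file has no extra line above the header, the default `header=0` would be the appropriate choice instead.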



In [None]:
from dlhub_sdk.models.datasets import TabularDataset
import pandas as pd
import json

# Read in the dataset
data = pd.read_csv('scikit_learn_model/iris.csv', header=1)

# Make the dataset information
dataset_info = TabularDataset.create_model('scikit_learn_model/iris.csv', read_kwargs=dict(header=1))

Now we can append other metadata to the dataset model, including an alternate identifier, a link to the related paper, the domain, and column descriptions.

In [None]:
# Add link to where this data was downloaded from
dataset_info.add_alternate_identifier(identifier="https://archive.ics.uci.edu/ml/datasets/Iris", 
                                      identifier_type="URL")

# Add link to paper describing the dataset
dataset_info.add_related_identifier(identifier="10.1111/j.1469-1809.1936.tb02137.x",
                                    identifier_type="DOI", 
                                    relation_type="IsDescribedBy")

# Mark the domain of the dataset
dataset_info.set_domains(["biology"])

# Describe the columns
dataset_info.annotate_column("sepal_length", description="Length of sepal", units="cm")
dataset_info.annotate_column("sepal_width", description="Width of sepal", units="cm")
dataset_info.annotate_column("petal_length", description="Length of petal", units="cm")
dataset_info.annotate_column("petal_width", description="Width of petal", units="cm")
dataset_info.annotate_column("species", description="Species", data_type='string')

# Mark which columns are inputs and outputs
dataset_info.mark_inputs(data.columns[:-1])
dataset_info.mark_labels(data.columns[-1:])

# Describe the data provenance
dataset_info.set_title("Iris Dataset")
dataset_info.set_name("iris_dataset")
dataset_info.set_authors(["Fisher, R.A."])
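
The `mark_inputs` and `mark_labels` calls above depend on column order: every column except the last is treated as an input, and the last column is treated as the label. The following sketch shows what the two slices evaluate to on a small stand-in DataFrame. The frame and its two rows are illustrative assumptions; only the column names come from the annotations above.

```python
import pandas as pd

# Small stand-in frame with the same column layout as the annotated dataset
frame = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.7],
    "petal_width": [0.2, 1.4],
    "species": ["setosa", "versicolor"],
})

input_columns = frame.columns[:-1].tolist()  # every column except the last
label_columns = frame.columns[-1:].tolist()  # only the last column, kept as a list
print(input_columns)
print(label_columns)
```

Slicing with `[-1:]` rather than indexing with `[-1]` keeps the label as a one-element sequence, so both calls receive the same kind of argument.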

After running these cells, the dataset model can produce a simple JSON description of the dataset that we will send to DLHub.
Note that the SDK automatically puts the metadata in **DataCite** format and includes information pulled automatically from the dataset (e.g., that the inputs are floats).

In [None]:
# Print out the result
print(json.dumps(dataset_info.to_dict(), indent=2))

### Make a model using the Iris dataset

Now we create a simple SVM model using scikit-learn based on the Iris dataset.

In [None]:
from sklearn.svm import SVC
import pickle as pkl
import pandas as pd

# Load the data
data = pd.read_csv('scikit_learn_model/iris.csv', header=1)
print('Loaded {} rows with {} columns:'.format(len(data), len(data.columns)),
      data.columns.tolist())

# Make the model
model = SVC(kernel='linear', C=1, probability=True)
model.fit(data.values[:, :-1], data.values[:, -1])
print('Trained a SVC model')

# Save the model using pickle
with open('scikit_learn_model/model.pkl', 'wb') as fp:
    pkl.dump(model, fp)
print('Saved model to disk')
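
Before describing the model, it can be worth checking that a pickled model survives a save and load round trip. The sketch below does this in memory. It uses the copy of the Iris data bundled with scikit-learn (`load_iris`) instead of `iris.csv` so that it runs on its own; the names `clf` and `restored` are illustrative.

```python
import pickle as pkl

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Use the Iris data bundled with scikit-learn so the sketch is self-contained
X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1, probability=True).fit(X, y)

# Serialize to bytes and restore, mirroring the dump to model.pkl
restored = pkl.loads(pkl.dumps(clf))

# The restored model should reproduce the original predictions
same = (restored.predict(X) == clf.predict(X)).all()
print('Predictions match after round trip:', same)
print('Class probabilities for the first row:', restored.predict_proba(X[:1]))
```

Loading a pickle can execute arbitrary code, so only unpickle files from sources you trust.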

### Describe the model

For brevity, we will provide much less metadata about the Scikit-Learn model than we did for the dataset.

We simply load in a Scikit-Learn model from a pickle file, and then provide a minimal amount of information about it.

The SDK will inspect the pickle file to determine the type of the model and the version of scikit-learn that was used to create it.
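
As a rough illustration of the kind of information that can be read from a pickled model, the sketch below loads a small stand-in model and reports its class name together with the installed scikit-learn version. This is not the SDK's implementation, only an approximation under stated assumptions; in particular, `sklearn.__version__` is the version installed in the current environment, which matches the version that created the pickle only when both steps run in the same environment.

```python
import pickle as pkl

import sklearn
from sklearn.svm import SVC

# Train a tiny stand-in model and pickle it in memory
model_bytes = pkl.dumps(SVC(kernel='linear').fit([[0.0], [1.0]], [0, 1]))

# Load the pickle and read off the model type and the library version
loaded = pkl.loads(model_bytes)
model_type = type(loaded).__name__
library_version = sklearn.__version__
print(model_type, library_version)
```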

In [None]:
from dlhub_sdk.models.servables.sklearn import ScikitLearnModel

model_info = ScikitLearnModel.create_model('scikit_learn_model/model.pkl', n_input_columns=len(data.columns) - 1,
                                           classes=data['species'].unique())

Now we use the SDK to append other metadata to the model. Below we set the name, title and domain of the model.

In [None]:
# Describe the model
model_info.set_title("Example Scikit-Learn Model")
model_info.set_name("iris_svm")
model_info.set_domains(["biology"])

Now that the metadata is created, we can inspect it before using it to publish the model.

In [None]:
print(json.dumps(model_info.to_dict(), indent=2))

### Publishing the model to DLHub

We can use the DLHub SDK to create a DLHubClient. The DLHubClient wraps both our REST API and Search catalog. You can use the client to publish, discover, and use models.

Publishing the model to DLHub may take about 10 minutes.

In [None]:
import dlhub_sdk
dl = dlhub_sdk.DLHubClient()

# Publish the model to DLHub
task_id = dl.publish_servable(model_info)
print(task_id)