# Summary

This example shows basic metadata tracking using Kubeflow Metadata.  Tracking metadata lets you browse artifacts (datasets, models, metrics) and executions (training runs, deployments) in the Lineage Tracker, and makes it easier to trace how models and data relate.

![title](images/artifact_explorer_lineage_explorer.png)

More details are available [here](https://www.kubeflow.org/docs/components/metadata/#learn-more-about-the-metadata-sdk).  This notebook is based on [this demo](https://github.com/kubeflow/metadata/blob/master/sdk/python/sample/demo.ipynb).

Note: If you get errors like `ModuleNotFoundError: No module named 'tensorflow_core.estimator'`, try this in a terminal and then restart your notebook:
```
pip uninstall kubeflow-metadata tensorflow-estimator
pip install tensorflow-estimator==2.3.0 kubeflow-metadata
```

In [1]:
import pandas
from kubeflow.metadata import metadata
from datetime import datetime
from uuid import uuid4

# Settings

(you do not need to change these)

In [2]:
# TODO: Should we include this in all environments as an environment variable?
# (these are the defaults of the Kubeflow Metadata gRPC service)
METADATA_STORE_HOST = "metadata-grpc-service.kubeflow"
METADATA_STORE_PORT = 8080
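
Regarding the TODO above, one option is to read these settings from environment variables and fall back to the defaults when they are not set. This is only a minimal sketch: the variable names `KF_METADATA_HOST` and `KF_METADATA_PORT` are hypothetical and are not names that Kubeflow is known to define.

```python
import os

# Defaults match the settings cell above.
DEFAULT_HOST = "metadata-grpc-service.kubeflow"
DEFAULT_PORT = 8080


def load_metadata_store_settings(env=None):
    """Return (host, port), preferring environment variables over defaults.

    KF_METADATA_HOST and KF_METADATA_PORT are hypothetical names chosen
    for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("KF_METADATA_HOST", DEFAULT_HOST)
    # Environment values are strings, so convert the port explicitly.
    port = int(env.get("KF_METADATA_PORT", DEFAULT_PORT))
    return host, port


METADATA_STORE_HOST, METADATA_STORE_PORT = load_metadata_store_settings()
```

Passing a plain `dict` as `env` makes the function easy to try out without touching the real environment.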

# Storing metadata in kubeflow.metadata

Kubeflow Metadata is organized around [these five key objects](https://www.kubeflow.org/docs/components/metadata/#learn-more-about-the-metadata-sdk), three describing data and two describing contexts:

* Data
    * DataSet to capture metadata for a dataset that forms the input into or the output of a component in your workflow
    * Metrics to capture metadata for the metrics used to evaluate an ML model
    * Model to capture metadata for an ML model that your workflow produces

* Context
    * Execution to capture metadata for an execution (run) of your ML workflow, which might use or produce one or more Data objects
    * Workspace to group many objects related to a specific task
    
For example, an ML training `Execution` might be run using two input `DataSet`s (say ds_train and ds_validate) and produce a `Model` (model), some scoring `Metrics`, and a `DataSet` of results for the validation data.  A second deployment `Execution` might then be run using an input `DataSet` (ds_test) and the previous `Model`, producing some deployment `Metrics` (execution time, score on ds_test, etc.).  Both of these might be part of the same train-and-deploy-myModel `Workspace`.  Samples of how to achieve this are shown below.

Behind the scenes, kubeflow.metadata takes advantage of [ml-metadata](https://www.tensorflow.org/tfx/guide/mlmd) (mlmd), a generic implementation of a metadata store.  kubeflow.metadata classes wrap mlmd for convenience and to apply some conventions (for example, defining a `type` for each artifact type, like `kubeflow.org/alpha/data_set` for `DataSet`).  As seen below, sometimes we need to interact directly with the mlmd store when kubeflow.metadata does not offer convenience functions for what we want to do.  

kubeflow.metadata uses the `Workspace` object as its manager of the mlmd store, so we generally interact with the store through a `Workspace`.

## Create a workspace

In [3]:
# Note this will by default either return an existing workspace of this name,
# or create a new one if it does not exist
ws = metadata.Workspace(
    store=metadata.Store(grpc_host=METADATA_STORE_HOST,
                         grpc_port=METADATA_STORE_PORT),
    name="kubeflow-metadata-demo-workspace",
    description="a workspace for testing and demos",
)

## Create a run in a workspace

(Runs are optional, and we are not entirely sure what they are needed for.  They seem to be an additional sub-context that can be used within workspaces for more organization.)

In [4]:
r = metadata.Run(
    workspace=ws,
    name="run-" + datetime.utcnow().isoformat("T"),
    description="a run in our test workspace",
)

## Create an execution in a run

Unlike `Workspace`s, executions created like this are always unique, even if they share a name.  To see this, try creating one twice with the same name (do not just rerun this block, because the timestamp in the name will change).

In [5]:
name = "execution-training-" + datetime.utcnow().isoformat("T")
ex = metadata.Execution(
    name=name,
    workspace=ws,
    run=r,  # Optional
    description="an example training execution"
)
print(f"An execution {ex.name} was created with id {ex.id}")

An execution execution-training-2020-08-31T11:28:16.700096 was created with id 83


# Create a DataSet and log it to an Execution

The kubeflow.metadata wrappers provide some convenience functionality around committing and reusing `DataSet`s.  To log a `DataSet`, we instantiate a `metadata.DataSet` object and then log it to an `Execution` with `log_input()` or `log_output()`.  These functions check whether the `DataSet` already exists in the store before committing: if it does, the existing `DataSet` is returned; otherwise the newly committed one is returned.  `DataSet`s are keyed by (`name`, `uri`, `version`): if all three match another object in the store, we reuse that object.  This lets us use the same `DataSet` for multiple executions and track lineage.

Note that a fresh `metadata.DataSet` does not have an `id` until it has been logged into the `Workspace.store`.
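
The reuse rule can be pictured as a get-or-create lookup keyed on (`name`, `uri`, `version`). The sketch below is a toy in-memory model of that idea only; `ToyStore` and `get_or_create` are hypothetical names and do not reflect how kubeflow.metadata or mlmd implement it.

```python
class ToyStore:
    """In-memory stand-in for the metadata store (not the real mlmd API)."""

    def __init__(self):
        self._ids_by_key = {}
        self._next_id = 1

    def get_or_create(self, name, uri, version):
        """Return (id, created) for the artifact keyed by (name, uri, version)."""
        key = (name, uri, version)
        if key in self._ids_by_key:
            # All three fields match: reuse the existing artifact.
            return self._ids_by_key[key], False
        artifact_id = self._next_id
        self._next_id += 1
        self._ids_by_key[key] = artifact_id
        return artifact_id, True


store = ToyStore()
uri = "my-minio-path/to-my/data.csv"
first = store.get_or_create("my-input-data", uri, "12345.67890")
again = store.get_or_create("my-input-data", uri, "12345.67890")
new_version = store.get_or_create("my-input-data", uri, "2")
print(first, again, new_version)  # (1, True) (1, False) (2, True)
```

Fields outside the key, such as `description` or `query`, play no part in the lookup, so changing them alone would not create a new artifact in this model.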

In [6]:
ds_input = metadata.DataSet(
    description="example input data",
    name="my-input-data",
    uri="my-minio-path/to-my/data.csv",
    version="12345.67890",  # This could be a hash of the data, ensuring data
                            # is exactly as expected
    query="SELECT * from data",  # Optional documentation of query
)
print(f"DataSet {ds_input.name} with id {ds_input.id}")

DataSet my-input-data with id None
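
The comment on `version` in the cell above suggests using a hash of the data. One way to do that, assuming a local copy of the file is available, is to compute a SHA-256 digest of its contents. The helper name `content_version` is hypothetical and not part of kubeflow.metadata.

```python
import hashlib


def content_version(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so that large datasets do not need to fit
    in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could then be passed as `version=content_version(local_path)` when constructing the `DataSet`, so that any change to the file contents yields a different version string.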


In [7]:
ds_returned = ex.log_input(ds_input)
print(f"(via ds_input): DataSet {ds_input.name} with id {ds_input.id}")
print(f"(via ds_returned): DataSet {ds_returned.name} with id {ds_returned.id}")

(via ds_input): DataSet my-input-data with id 125
(via ds_returned): DataSet my-input-data with id 125


If we create a new `DataSet` with the same details as above and log it to `ex` (or any other `Execution`), we will reuse the existing `DataSet` (as seen by the ID).

In [8]:
ds_input_duplicated = metadata.DataSet(
    name="my-input-data",
    description="this_changed_but_doesn't_matter_for_duplication",
    uri="my-minio-path/to-my/data.csv",
    version="12345.67890",
    query="this_changed_but_doesn't_matter_for_duplication",
)
print(f"(before .log_input()): DataSet {ds_input_duplicated.name} with id "
      f" {ds_input_duplicated.id}")
ds_returned = ex.log_input(ds_input_duplicated)
print(f"(after .log_input()):  DataSet {ds_input_duplicated.name} with id "
      f" {ds_input_duplicated.id}")

(before .log_input()): DataSet my-input-data with id  None
(after .log_input()):  DataSet my-input-data with id  125


TODO: While we are reusing the same `DataSet` when logging, it might still be linking the dataset multiple times to the `Execution` (in the Kubeflow Artifacts UI we can see multiple associations between the same `DataSet` and `Execution`).  Maybe a bug?  This is only a problem if there is a possibility of logging the same `DataSet` to the same `Execution` multiple times.  That is easy to do in our example here, but not common in practice, since each `Execution` is probably a pipeline run or training event.

## Log a model as output

`Model`s are logged similarly to `DataSet`s.  `Model`s also have the same logic for duplication: logging a model with the same (`uri`, `name`, `version`) as another in the store will **reuse** the existing model rather than create a new one.  `Model` objects have additional optional metadata that can be stored on them, such as `model_type` or hyperparameter values.

In [9]:
# uuid4 just gives us a unique model version.  This could be a kubeflow
# pipeline run_id, a timestamp (if you know it will be unique), etc
model_version = "model_version_" + str(uuid4())
model = metadata.Model(
    name="myModel",
    uri="minio/path/to/my/model.pkl.gz",
    version=model_version,
    description="a sample model",
    model_type="neural network",
    training_framework={
        "name": "tensorflow",
        "version": "v1.0",
    },
    hyperparameters={
        "learning_rate": 0.5,
        "layers": [10, 3, 1],
        "early_stop": True,
    },
)
print(f"(before .log_output()): Model {model.name} with id {model.id}")
model = ex.log_output(model)
print(f"(after .log_output()):  Model {model.name} with id {model.id}")

(before .log_output()): Model myModel with id None
(after .log_output()):  Model myModel with id 131


## Log metrics as output

In [10]:
ds_scoring = metadata.DataSet(
    name="my-input-scoring-data",
    description="example scoring data",
    uri="my-minio-path/to-my/scoring-data.csv",
    version="12345.67890",  # This could be a hash of the data, ensuring data
                            # is exactly as expected
    query="SELECT * from scoring_data",  # Optional documentation of query
)

# Commit the DataSet to the execution that created the metrics so it gets an
# `id` in the database
ex.log_input(ds_scoring)

print(f"DataSet {ds_scoring.name} with id {ds_scoring.id}")

DataSet my-input-scoring-data with id 129


Because `ds_scoring` was logged to the execution above, it now has an `id` in the database.  We use that `id` below to link the metrics to the data they were computed on.

In [17]:
metrics_values = {
    'f1': 0.85,
    'accuracy': 0.91,
}
metrics = metadata.Metrics(
    name="myModel-metrics",
    uri="minio/path/to/my/metrics.yaml",
    data_set_id=str(ds_scoring.id),  # Link the data for later recovery
    model_id=str(model.id),  # Note that IDs must be str
    metrics_type=metadata.Metrics.TESTING,
    value=metrics_values,
)

If you get a `ModuleNotFoundError` here, see the note in the Summary above.

In [18]:
ex.log_output(metrics)

Note: passing an `int` ID (for example `data_set_id=123`) instead of a `str` makes this call fail during serialization with:

TypeError: 123 has type int, but expected one of: bytes, unicode
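
Since IDs must be strings, and an artifact has no `id` until it has been logged, a small guard can catch both mistakes before a `Metrics` object is built. The helper below is a hypothetical convenience for illustration, not part of kubeflow.metadata.

```python
def id_as_str(artifact_id):
    """Return an artifact id as a string, rejecting artifacts not yet logged.

    Hypothetical helper: a fresh artifact has id None until it is logged
    to an Execution, and the Metrics fields expect string ids.
    """
    if artifact_id is None:
        raise ValueError("artifact has no id yet; log it to an Execution first")
    return str(artifact_id)


print(id_as_str(129))  # 129
```

With this in place, the `Metrics` arguments could be written as `data_set_id=id_as_str(ds_scoring.id)` and `model_id=id_as_str(model.id)`.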

### Add Metadata for serving the model

This is an example of how you could later make an execution for serving, and refer to an existing model that's already in the metadata.

In [13]:
serving_application = metadata.Execution(
    name="serving model",
    workspace=ws,
    description="an execution to represent model serving component",
)
# Notice that we use the model name, version, and uri to uniquely identify
# the existing model.
served_model = metadata.Model(
    name="myModel",
    uri="minio/path/to/my/model.pkl.gz",
    version=model.version,  # Reusing the model version from above
)
m = serving_application.log_input(served_model)
print("Found the model with id {0.id} and version '{0.version}'.".format(m))

Found the model with id 131 and version 'model_version_83f55dfb-1275-446d-8168-77ece8ceb2e5'.


# Lineage tracking

## By UI

Now that we've logged some artifacts and executions into the store, go to the [Artifacts](https://kubeflow.covid.cloud.statcan.ca/_/pipeline/#/artifacts) or [Executions](https://kubeflow.covid.cloud.statcan.ca/_/pipeline/#/executions) UIs and check them out.  From there you can see all the

TODO: `Execution`s created by kubeflow.metadata don't show up well in the Executions page.  See [this issue](https://github.com/StatCan/daaas/issues/215) for details.  They're still accessible via the Lineage Explorer or the API.

## By API

An example of accessing metadata by API is shown below.  This requires us to interact directly with the `mlmd.Store` object.  Note that this `Store` has other access methods, such as `get_artifacts_by_context()`, etc. 

TODO: Build convenience functions to make this more direct?

Find all executions (e.g., training runs or deployments) that use a given model:

In [None]:
print("Model id is %s\n" % model.id)
model_events = ws.store.get_events_by_artifact_ids([model.id])

execution_ids = set(e.execution_id for e in model_events)
print("All executions related to the model are {}".format(execution_ids))

Find all artifacts associated with a particular training event:

In [None]:
trainer_events = ws.store.get_events_by_execution_ids([ex.id])
artifact_ids = set(e.artifact_id for e in trainer_events)
print(f"All artifacts related to the training event are {artifact_ids}")
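
The two lookups above can be chained to walk the lineage graph: from a model to the executions that touched it, and from those executions to every artifact they used or produced. The sketch below shows that traversal on toy data, without a metadata store. `Event` is a stand-in that models only the `execution_id` and `artifact_id` fields used above, and the event list (including execution id 84 for serving) is made up for illustration.

```python
from collections import namedtuple

# Toy stand-in for mlmd events; only the two fields used above are modeled.
Event = namedtuple("Event", ["execution_id", "artifact_id"])

# Hypothetical events: execution 83 (training) touched artifacts 125, 129
# and 131; execution 84 (serving) used artifact 131 (the model).
events = [
    Event(83, 125), Event(83, 129), Event(83, 131),
    Event(84, 131),
]


def executions_for_artifact(events, artifact_id):
    """Ids of all executions that used or produced the given artifact."""
    return {e.execution_id for e in events if e.artifact_id == artifact_id}


def artifacts_for_executions(events, execution_ids):
    """Ids of all artifacts used or produced by the given executions."""
    return {e.artifact_id for e in events if e.execution_id in execution_ids}


model_executions = executions_for_artifact(events, 131)
related_artifacts = artifacts_for_executions(events, model_executions)
print(sorted(model_executions))   # [83, 84]
print(sorted(related_artifacts))  # [125, 129, 131]
```

Against a real store, the filtering done by the two helper functions would instead be done by `ws.store.get_events_by_artifact_ids()` and `ws.store.get_events_by_execution_ids()`, as in the cells above.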