A philosopher's Nachlass is the set of notes he or she leaves behind unpublished. This project is a tool to make it easier to publish your Jupyter notebooks. It started with some experiments in treating notebooks as software artifacts and currently includes a source-to-image builder that translates scikit-learn pipelines into scoring services, which publish prediction metrics for classifiers and regressors. Our goal with this work is to streamline the model lifecycle, shorten feedback loops, and make models more reproducible.
Sophie Watson and Will Benton gave a talk at OSCON 2019 that includes a demonstration of an early version of Nachlass, including pipeline service generation and metrics publishing.
First, download the `s2i` command-line tool. Then build an image from the example notebooks:

```shell
s2i build -e S2I_SOURCE_NOTEBOOK_LIST=03-feature-engineering-tfidf.ipynb,04-model-logistic-regression.ipynb quay.io/willbenton/nachlass-s2i:latest https://github.com/willb/ml-workflows-notebook
```
This will generate a container image for a model service. Deploy this service in your favorite Kubernetes environment and post text data to it with cURL:
```shell
curl --request POST \
  --url http://pipeline:8080/predict \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data-urlencode 'json_args=["example text goes here"]'
```
or with Python's `requests` library:
```python
import json
from urllib.parse import urlencode

import requests

DEFAULT_BASE_URL = "http://pipeline:8080/%s"

def score_text(text, url=None):
    """Posts one document (or a list of documents) to the scoring service."""
    url = url or (DEFAULT_BASE_URL % "predict")
    if isinstance(text, str):
        text = [text]
    # the service expects a form-encoded payload whose json_args field
    # is a JSON-encoded list of documents
    payload = urlencode({"json_args": json.dumps(text)})
    headers = {"content-type": "application/x-www-form-urlencoded"}
    response = requests.post(url, data=payload, headers=headers)
    return json.loads(response.text)
```
(In either case, substitute your actual service name for `pipeline`.)
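As a quick sanity check -- assuming the service is reachable at the default URL above -- you can score a single document and inspect the parsed response:

```python
# smoke test against a deployed pipeline service
result = score_text("example text goes here")
print(result)
```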
This service will classify text as either `spam` or `legitimate` -- in the context of the deployed model, "spam" means that the supplied text resembles customer-supplied food reviews from a retail website and "legitimate" means that the supplied text resembles Jane Austen novels. (See this repository for background on this model.)
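These services also publish the prediction metrics mentioned above. Assuming the `/metrics` endpoint is exposed on the same port as the prediction endpoint, a minimal sketch for inspecting them looks like this:

```python
import requests

# fetch the published prediction metrics from the service
# (substitute your actual service name for "pipeline")
metrics = requests.get("http://pipeline:8080/metrics")
print(metrics.text)
```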
One of our design goals for Nachlass was to require as little as possible of machine learning practitioners -- a data scientist does not need to know that a notebook or set of notebooks will run on Kubernetes, as a service, or under Nachlass so long as she follows some basic principles:
- notebooks must be committed to a `git` repository,
- notebook requirements must be specified in a `requirements.txt` file,
- notebooks must produce the expected output when cells are executed in order from start to finish, and
- each notebook that produces a pipeline stage must save it to a file whose location is specified by the environment variable `S2I_PIPELINE_STAGE_SAVE_FILE`.
The last of these is the only requirement that should change the contents of a disciplined notebook, but we can make it less of a burden by supplying a simple helper function:
```python
import os
import cloudpickle as cp  # `cp` here is cloudpickle; the standard-library pickle would also work

def serialize_to(obj, default_filename):
    # Nachlass builds set S2I_PIPELINE_STAGE_SAVE_FILE to redirect the output;
    # interactive runs fall back to the supplied default filename
    filename = os.getenv("S2I_PIPELINE_STAGE_SAVE_FILE", default_filename)
    with open(filename, "wb") as f:
        cp.dump(obj, f)
```
If practitioners save pipeline stages with `serialize_to`, they can specify filenames that hold for interactive use but are overridden by Nachlass builds.
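For example, a feature-engineering notebook might end with something like the following (the vectorizer, corpus, and filename here are illustrative, not part of Nachlass):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# a toy corpus standing in for real training data
training_texts = ["the quick brown fox", "jumped over the lazy dog"]

# fit an illustrative pipeline stage and save it; interactive runs write
# to "feature_pipeline.sav", while Nachlass builds redirect the output
# via S2I_PIPELINE_STAGE_SAVE_FILE
vectorizer = TfidfVectorizer().fit(training_texts)
serialize_to(vectorizer, "feature_pipeline.sav")
```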
To build a model service, use `s2i` (or a source build strategy in OpenShift) and set `S2I_SOURCE_NOTEBOOK_LIST` to a comma-separated list of the notebooks corresponding to each pipeline stage. The general form of the command line is this:
```shell
s2i build -e S2I_SOURCE_NOTEBOOK_LIST=notebook1.ipynb,notebook2.ipynb quay.io/willbenton/nachlass-s2i:latest https://github.com/your-username/your-repo
```
If your notebooks require access to nonpublic data, you will need to pass credentials and service endpoints in the build environment or in Kubernetes secrets. (Small, nonsensitive data sets may be stored in the `git` repo alongside models.)
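A sketch of how a notebook might pick up such credentials, assuming they are exposed as environment variables (the variable names here are hypothetical):

```python
import os

# hypothetical variable names; set them with `s2i build -e ...`
# or project them into the build from a Kubernetes secret
data_url = os.environ["TRAINING_DATA_URL"]
data_token = os.environ.get("TRAINING_DATA_TOKEN")
```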
Our work in progress includes:
- a Tekton backend for builds,
- a Kubernetes operator and custom resources to make it simpler to deploy pipelines, and
- more flexible training and publishing options.