
Model Deployment Pipeline #9

Closed · Simsso opened this issue Jun 11, 2018 · 13 comments
Labels: code (Software is relevant or involved), infrastructure (Cloud services, infrastructure, CI, and deployment), work-item (Tasks)
@Simsso (Owner) commented Jun 11, 2018

Let's use this issue as a thread to discuss different possible deployment pipelines for ML models. Once we have decided on something, we can go ahead and create a wiki page.

@Simsso added the code and infrastructure labels on Jun 11, 2018
@Simsso assigned Simsso and ghost on Jun 11, 2018
@ghost commented Jun 11, 2018

Yesterday, while I was specifying our general requirements and defining the processes of training, testing and generating scores, I also had a few thoughts on our future deployment pipeline. Below you'll see my initial draft.
nips-2018-adversarial-vision-challenge deployment pipeline v1

  1. The researcher develops the model.
  2. The model is finished and tested against a local environment (which should reflect the base image).
  3. The researcher can tag their latest push for this model as 'release'. Note that every model lives in its own branch.
  4. The private Docker registry detects the 'release' tag via webhooks, uses the branch name as the image tag, and builds the model. The model is built using a general Dockerfile (a minimal webhook sketch follows the list).
  5. Once the build is finished, the image is deployed on our chosen cloud provider, i.e. the container is generated and started.
  6. The container fetches all data required to start training, testing and score generation.
  7. Training, testing and score generation are performed.
  8. All process information, such as training rate and accuracy, is logged so that deployment jobs can be tracked in real time.
  9. A report on the evaluation of the given model is saved.
  10. The frozen model is persisted.
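To make step 4 a bit more concrete, here is a minimal sketch of a webhook receiver. It assumes GitHub 'create' events and a hypothetical trigger_build() call; in practice the trigger would most likely live inside the registry or CI service rather than in our own code:

```python
# Hypothetical webhook receiver that starts an image build when a 'release'
# tag is created. trigger_build() and the image tag are placeholders; in
# practice the registry / CI service would provide this functionality.
from flask import Flask, request

app = Flask(__name__)


def trigger_build(image_tag):
    # Placeholder for calling the registry's build API with the general Dockerfile.
    print("Would build the model image and tag it as:", image_tag)


@app.route("/webhook", methods=["POST"])
def on_github_event():
    payload = request.get_json(silent=True) or {}
    # GitHub 'create' events carry the tag name in 'ref' and its type in 'ref_type'.
    if payload.get("ref_type") == "tag" and payload.get("ref") == "release":
        trigger_build(image_tag="name-of-the-model-branch")  # placeholder value
    return "", 204
```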

Regarding the general Dockerfile:
The idea behind the general Dockerfile is to prevent errors caused by manual human interference. It produces a container with a full-stack environment into which only the given model has to be injected while the image is being built, as sketched below.
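As a rough illustration of that injection idea (module and function names are placeholders, nothing is decided yet), the general image could ship a fixed entry point that imports whatever model code was copied in at build time:

```python
# Hypothetical entry point baked into the general image.
# Assumption: the image build copies the researcher's branch into the working
# directory so that it exposes a module named `model` with a `build()`
# function. Both names are illustrative placeholders.
import importlib


def main():
    model_module = importlib.import_module("model")  # injected at image build time
    net = model_module.build()
    # Training, testing and score generation would follow here.
    print("Loaded model:", net)


if __name__ == "__main__":
    main()
```

That way the Dockerfile itself never changes from model to model; only the injected module differs.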

You mentioned that you have a few change requests. I would be glad if you could specify them here, so I can think about them before our discussion.

@Simsso (Owner, Author) commented Jun 11, 2018

The overall concept looks really good. There are a few things that I would change a bit / where I see a need for discussion:

  • Committing for every single hyperparameter change is over-engineered. I think trying out a model must be as simple as hitting play in an IDE. One idea would be to have a console application that pushes the model to our cloud service provider (or performs the commit?!).
    I've got to say, though, that I'm not entirely sure about that point. Maybe prototyping should happen locally, and deployment to the cloud should happen only after extensive local evaluation.
  • The entire evaluation stack is more complicated. We need a way of deploying new evaluation algorithms, i.e. attacks (in this case, deploying means adding them to the pipeline).
  • Everything that we run on the cloud should also be executable locally. I think this is important because it allows for local prototyping with new attack algorithms and so on. Python is not good at checking types, so it must be possible to execute everything locally for quick testing before pushing.
  • In the current setup, how would a researcher cancel a model execution? We have to discuss how much user-friendliness we want. Logging on to the cloud is always an option, but it might be too tedious.

My points are a bit vague, I'm aware of that. The reason is that I don't have a clear picture of the development workflow just yet. We have to iterate and optimize.

@Simsso (Owner, Author) commented Jun 11, 2018

Dockerization

Just throwing in a few links...

@ghost commented Jun 11, 2018

Some answers from our recent discussion:

  • As we dockerize our models and all parts necessary to train and monitor them, you shouldn't have a problem just building your image locally and doing e2e testing. After some iterations, you can push everything with a tag and let the pipeline deploy it to the cloud, where it runs just as it would on your local computer. It is your decision when to deploy the model.

  • At this point we have to wait for further clarification regarding the interfaces which the attacks deliver and which we could implement. My idea is to store the attacks (which should obviously just be .py files) centrally and load them dynamically into the container during the evaluation process (see the sketch at the end of this comment).

  • See the first point.

  • As we save all kinds of information about our deployments, we should use this data and simply visualize it on our own custom frontend. Furthermore, we could extend the frontend to start and stop whole containers, since we have direct access to the k8s API through the gcloud API.

We should get some hands-on experience with dockerized TensorFlow and with deploying Docker containers on GCP/AWS. Once we know our tools, we can define a proper pipeline.
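To make the second point a bit more concrete, here is a minimal sketch of what dynamically loading attack .py files could look like (the directory and the run_attack interface are made up for illustration; the real attack interface is still open):

```python
# Minimal sketch of dynamically loading attack .py files at evaluation time.
# Assumptions (all placeholders): attacks are synced into /attacks inside the
# container, and each attack file exposes a run_attack(model) function.
import importlib.util
import pathlib


def load_attacks(attack_dir="/attacks"):
    attacks = []
    for path in pathlib.Path(attack_dir).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the attack file as a module
        attacks.append(module.run_attack)
    return attacks
```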

@ghost commented Jun 18, 2018

  • We should compare GCE and GKE with regard to deploying Docker containers, as discussed in the 2nd Working Group Meeting (#7). GKE might be overkill for us, as we are not going to use capabilities such as load balancing, high availability and more. Furthermore, deleting VM instances on GCE might be easier in terms of reducing our infrastructure costs.

  • We have to think about how we manage multiple Dockerfiles in our repository, how we listen for the specific tags that initiate the CD pipeline, and how we automate image builds.

@ghost commented Jun 19, 2018

Today we had our first successful test on GCP. We managed to trigger an image build using the build functionality of Google Container Registry. As previously intended, the build was triggered by a simple tag. The final image was automatically pushed to the Google Container Registry, ready for deployment.

I will document all the steps I took to set everything up. Please note that this is only for testing; branches such as GCP will be deleted.

Catch up here: GCP Building the Pipeline

@ghost commented Jun 22, 2018

I tried to access Google Cloud Storage from Python in order to save our models, logs and more in persistent storage on GCP. I used Timo's linear-combination project as a base. Unfortunately, after creating service accounts on GCP, importing the access keys and the required Google Cloud library, I ran into access problems (403).
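For reference, this is roughly the access pattern I am trying to get working; the bucket name, object path and key-file location below are placeholders (the client comes from the google-cloud-storage package):

```python
# Rough sketch of the intended GCS access (all names are placeholders).
from google.cloud import storage  # pip install google-cloud-storage

# Authenticate explicitly with the service-account key file.
client = storage.Client.from_service_account_json("/path/to/service-account.json")

bucket = client.bucket("nips-2018-avc-artifacts")  # placeholder bucket name
blob = bucket.blob("logs/run-001/train.log")       # placeholder object name
blob.upload_from_filename("train.log")             # the 403 is raised here in my case
```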

@ghost commented Jun 22, 2018

The issue was resolved without changing anything in the GCP or local configuration. The one and only difference was access from a different network (the university network then, a private network now).

The following screenshots show that TensorFlow automatically reads the path to the key file from the environment variable and successfully saves the trained model. We are also able to save other files.

[Screenshot: gcs-pycharm-succesful-model-upload_1]

[Screenshot: gcs-pycharm-succesful-model-upload_2]
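For anyone reproducing this: the environment variable in question should be the standard GOOGLE_APPLICATION_CREDENTIALS. A minimal sketch (key-file path and bucket name are placeholders) of writing a checkpoint straight to GCS with the current TF 1.x API:

```python
# Minimal sketch: saving a TensorFlow (1.x) checkpoint directly to a gs:// path.
# The key-file path and bucket name are placeholders.
import os
import tensorflow as tf

# Standard variable read by the Google client libraries and TF's GCS filesystem.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

x = tf.Variable(42.0, name="x")  # toy variable, just so there is something to save
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # tf.train.Saver writes to gs:// targets once the credentials resolve.
    saver.save(sess, "gs://your-bucket/models/toy/model.ckpt")
```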

@ghost commented Jun 22, 2018

Next steps to take with regard to our deployment pipeline:

  1. Finish Compare GCP and AWS regarding Deployment and Training of Models #3.
  2. Decide whether to build everything on top of AWS or GCP (objective comparison and discussion in this issue or an upcoming meeting).
  3. Design and write down the final pipeline, answering the following open questions:
  • multiple Dockerfiles in the repository
  • versioning of models and thus of Docker images
  • generating unique image names using tags(?)
  • automatically killing VMs/nodes after successful training, testing and evaluation
  4. Build the basic pipeline, which includes:
  • setting up all needed products on the cloud provider
  • finishing our reference model (implementing the defined interfaces for full pipeline integration)
  • e2e testing (tracking naming conventions, training, uploading of models, TensorBoard access, etc.)

@Simsso (Owner, Author) commented Jun 23, 2018

Within GCP, have you made a decision on whether to use (1) VMs, (2) Kubernetes, or (3) ML Engine? We can also discuss that at #12.

@ghost commented Jun 25, 2018

@Simsso I am sorry for the delayed answer, as I am very limited in time these days.

I have not yet made a decision on whether we are going to use VMs, K8s or the ML Engine in GCP, as I am still familiarizing myself with ML Engine.

@ghost commented Jul 1, 2018

Today we made a decision on whether we are going to deploy on GCP or on AWS.

The decision is that we will use GCP, for the reasons already mentioned in #3. Within GCP, we are going to use the ML Engine, due to its high abstraction level and its design for ML research.

All previous tasks are extended / partly replaced by the tasks defined for @doktorgibson in #18. All results regarding these tasks can be found in #11.

This issue should only contain discussion about the abstract design of our pipeline.

@ghost commented Aug 26, 2018

The deployment pipeline has been successfully created. The next steps are to build tools that better enable research.

This issue was closed.