
Model Deployment Pipeline #9

Closed · Simsso opened this issue Jun 11, 2018 · 13 comments
Labels: code (Software is relevant or involved), infrastructure (Cloud services, infrastructure, CI, and deployment), work-item (Tasks)
@Simsso (Owner) commented Jun 11, 2018

Let's use this issue as a thread to discuss different possible deployment pipelines for ML models. Once we have decided on something, we can go ahead and create a wiki page.

@Simsso added the code and infrastructure labels on Jun 11, 2018
@Simsso assigned Simsso and ghost on Jun 11, 2018
@ghost commented Jun 11, 2018

Yesterday, while I was specifying our general requirements and defining the processes of training, testing and generating scores, I also had a few thoughts on our future deployment pipeline. Below you'll see my initial draft.
nips-2018-adversarial-vision-challenge deployment pipeline v1

  1. The researcher develops the model.
  2. The model is finished and tested against a local environment (which should reflect the base image).
  3. The researcher can tag their latest push for this model as 'release'. Note that every model lives in its own branch.
  4. The private Docker registry detects the 'release' tag via webhooks, uses the branch name as the image tag, and builds the model. The model is built using a general Dockerfile (a minimal webhook sketch follows the list).
  5. Once the build is finished, the image is deployed on our chosen cloud provider, i.e. the container is generated and started.
  6. The container fetches all data required to start training, testing and score generation.
  7. Training, testing and score generation are performed.
  8. All process information, such as training rate and accuracy, is logged so that deployment jobs can be tracked in real time.
  9. A report on the evaluation of the given model is saved.
  10. The frozen model is persisted.
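To make step 4 a bit more concrete, here is a minimal sketch of a webhook receiver. It assumes GitHub 'create' events and a hypothetical trigger_build() call; in practice the trigger would most likely live inside the registry or CI service rather than in our own code:

```python
# Hypothetical webhook receiver that starts an image build when a 'release'
# tag is created. trigger_build() and the image tag are placeholders; in
# practice the registry / CI service would provide this functionality.
from flask import Flask, request

app = Flask(__name__)


def trigger_build(image_tag):
    # Placeholder for calling the registry's build API with the general Dockerfile.
    print("Would build the model image and tag it as:", image_tag)


@app.route("/webhook", methods=["POST"])
def on_github_event():
    payload = request.get_json(silent=True) or {}
    # GitHub 'create' events carry the tag name in 'ref' and its type in 'ref_type'.
    if payload.get("ref_type") == "tag" and payload.get("ref") == "release":
        trigger_build(image_tag="name-of-the-model-branch")  # placeholder value
    return "", 204
```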

Regarding the general Dockerfile:
The idea behind the general Dockerfile is to prevent errors caused by manual human interference. It produces a container with a full-stack environment into which only the given model has to be injected while the image is being built, as sketched below.
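As a rough illustration of that injection idea (module and function names are placeholders, nothing is decided yet), the general image could ship a fixed entry point that imports whatever model code was copied in at build time:

```python
# Hypothetical entry point baked into the general image.
# Assumption: the image build copies the researcher's branch into the working
# directory so that it exposes a module named `model` with a `build()`
# function. Both names are illustrative placeholders.
import importlib


def main():
    model_module = importlib.import_module("model")  # injected at image build time
    net = model_module.build()
    # Training, testing and score generation would follow here.
    print("Loaded model:", net)


if __name__ == "__main__":
    main()
```

That way the Dockerfile itself never changes from model to model; only the injected module differs.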

You mentioned that you have a few change requests. I would be glad if you could specify them here, so I can think about them before our discussion.

@Simsso (Owner, Author) commented Jun 11, 2018

The overall concept looks really good. There are a few things that I would change a bit / where I see a need for discussion:

  • Committing for every single hyperparameter change is over-engineered. I think trying out a model must be as simple as hitting play in an IDE. One idea would be to have a console application that pushes the model to our cloud service provider (or performs the commit?!).
    I've got to say, though, that I'm not entirely sure about that point. Maybe prototyping should happen locally, and deployment to the cloud should happen only after extensive local evaluation.
  • The entire evaluation stack is more complicated. We need a way of deploying new evaluation algorithms, i.e. attacks (in this case, deploying means adding them to the pipeline).
  • Everything that we run on the cloud should also be executable locally. I think this is important because it allows for local prototyping with new attack algorithms and so on. Python is not good at checking types, so it must be possible to execute everything locally for quick testing before pushing.
  • In the current setup, how would a researcher cancel a model execution? We have to discuss how much user-friendliness we want. Logging on to the cloud is always an option, but it might be too tedious.

My points are a bit vague, I'm aware of that. The reason is that I don't have a clear picture of the development workflow just yet. We have to iterate and optimize.

@Simsso (Owner, Author) commented Jun 11, 2018

Dockerization

Just throwing in a few links...

@ghost commented Jun 11, 2018

Some answers from our recent discussion:

  • As we dockerize our models and all parts necessary to train and monitor them, you shouldn't have a problem just building your image locally and doing e2e testing. After some iterations, you can push everything with a tag and let the pipeline deploy it to the cloud, where it runs just as it would on your local computer. It is your decision when to deploy the model.

  • At this point we have to wait for further clarification regarding the interfaces which the attacks deliver and which we could implement. My idea is to store the attacks (which should obviously just be .py files) centrally and load them dynamically into the container during the evaluation process (see the sketch at the end of this comment).

  • See the first point.

  • As we save all kinds of information about our deployments, we should use this data and simply visualize it on our own custom frontend. Furthermore, we could extend the frontend to start and stop whole containers, since we have direct access to the k8s API through the gcloud API.

We should get some hands-on experience with dockerized TensorFlow and with deploying Docker containers on GCP/AWS. Once we know our tools, we can define a proper pipeline.
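To make the second point a bit more concrete, here is a minimal sketch of what dynamically loading attack .py files could look like (the directory and the run_attack interface are made up for illustration; the real attack interface is still open):

```python
# Minimal sketch of dynamically loading attack .py files at evaluation time.
# Assumptions (all placeholders): attacks are synced into /attacks inside the
# container, and each attack file exposes a run_attack(model) function.
import importlib.util
import pathlib


def load_attacks(attack_dir="/attacks"):
    attacks = []
    for path in pathlib.Path(attack_dir).glob("*.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the attack file as a module
        attacks.append(module.run_attack)
    return attacks
```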

@ghost commented Jun 18, 2018

  • We should compare GCE and GKE with regard to deploying Docker containers, as discussed in the 2nd Working Group Meeting (#7). GKE might be overkill for us, as we are not going to use capabilities such as load balancing, high availability and more. Furthermore, deleting VM instances on GCE might be easier in terms of reducing our infrastructure costs.

  • We have to think about how we manage multiple Dockerfiles in our repository, how we listen for the specific tags that initiate the CD pipeline, and how we automate image builds.

@ghost commented Jun 19, 2018

Today we had our first successful test on GCP. We managed to trigger an image build using the build functionality of Google Container Registry. As previously intended, the build was triggered by a simple tag. The final image was automatically pushed to the Google Container Registry, ready for deployment.

I will document all the steps I took to set everything up. Please note that this is only for testing; branches such as GCP will be deleted.

Catch up here: GCP Building the Pipeline

@ghost commented Jun 22, 2018

I tried to access Google Cloud Storage from Python in order to save our models, logs and more in persistent storage on GCP. I used Timo's linear-combination project as a base. Unfortunately, after creating service accounts on GCP, importing the access keys and the required Google Cloud library, I ran into access problems (403).
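For reference, this is roughly the access pattern I am trying to get working; the bucket name, object path and key-file location below are placeholders (the client comes from the google-cloud-storage package):

```python
# Rough sketch of the intended GCS access (all names are placeholders).
from google.cloud import storage  # pip install google-cloud-storage

# Authenticate explicitly with the service-account key file.
client = storage.Client.from_service_account_json("/path/to/service-account.json")

bucket = client.bucket("nips-2018-avc-artifacts")  # placeholder bucket name
blob = bucket.blob("logs/run-001/train.log")       # placeholder object name
blob.upload_from_filename("train.log")             # the 403 is raised here in my case
```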

@ghost commented Jun 22, 2018

The issue was resolved without changing anything in the GCP or local configuration. The one and only difference was access from a different network (the university network then, a private network now).

The following screenshots show that TensorFlow automatically reads the path to the key file from the environment variable and successfully saves the trained model. We are also able to save other files.

[Screenshot: gcs-pycharm-succesful-model-upload_1]

[Screenshot: gcs-pycharm-succesful-model-upload_2]
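For anyone reproducing this: the environment variable in question should be the standard GOOGLE_APPLICATION_CREDENTIALS. A minimal sketch (key-file path and bucket name are placeholders) of writing a checkpoint straight to GCS with the current TF 1.x API:

```python
# Minimal sketch: saving a TensorFlow (1.x) checkpoint directly to a gs:// path.
# The key-file path and bucket name are placeholders.
import os
import tensorflow as tf

# Standard variable read by the Google client libraries and TF's GCS filesystem.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

x = tf.Variable(42.0, name="x")  # toy variable, just so there is something to save
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # tf.train.Saver writes to gs:// targets once the credentials resolve.
    saver.save(sess, "gs://your-bucket/models/toy/model.ckpt")
```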

@ghost commented Jun 22, 2018

Next steps to take with regard to our deployment pipeline:

  1. Finish Compare GCP and AWS regarding Deployment and Training of Models #3.
  2. Decide whether to build everything on top of AWS or GCP (objective comparison and discussion in this issue or an upcoming meeting).
  3. Design and write down the final pipeline, answering the following open questions:
  • multiple Dockerfiles in the repository
  • versioning of models and thus of Docker images
  • generating unique image names using tags(?)
  • automatically killing VMs/nodes after successful training, testing and evaluation
  4. Build the basic pipeline, which includes:
  • setting up all needed products on the cloud provider
  • finishing our reference model (implementing the defined interfaces for full pipeline integration)
  • e2e testing (tracking naming conventions, training, uploading of models, TensorBoard access, etc.)

@Simsso (Owner, Author) commented Jun 23, 2018

Within GCP, have you made a decision on whether to use (1) VMs, (2) Kubernetes, or (3) ML Engine? We can also discuss that at #12.

@ghost commented Jun 25, 2018

@Simsso I am sorry for the delayed answer, as I am very limited in time these days.

I have not yet made a decision on whether we are going to use VMs, K8s or the ML Engine in GCP, as I am still familiarizing myself with ML Engine.

@ghost commented Jul 1, 2018

Today we made a decision on whether we are going to deploy on GCP or on AWS.

The decision is that we will use GCP, for the reasons already mentioned in #3. Within GCP, we are going to use the ML Engine, due to its high abstraction level and its design for ML research.

All previous tasks are extended / partly replaced by the tasks defined for @doktorgibson in #18. All results regarding these tasks can be found in #11.

This issue should only contain discussion about the abstract design of our pipeline.

@ghost commented Aug 26, 2018

The deployment pipeline has been successfully created. The next steps are to build tools that better enable research.

This issue was closed.