Development Branch of Training Enablement Tools #35

ghost · 2018-08-28T14:37:25Z

Current status of this branch: the Docker image of the given model is built successfully, the GCE instances are created and configured, the training of the model is started, and the results of the training are served using TensorBoard.

The deployment pipeline was successfully built. The next steps are to develop the training enablement tools and integrate them into the deployment pipeline. These tools are mandatory to solve the issue 'Create Cloud Testing CNN' (#11) in an end-to-end manner.

The enablement tools are:

- Tools that report the current status/lifecycle of the training process
- Tools that allow training jobs to be submitted in a simple manner
- Tools that provide a centralized TensorBoard that stays available even after a training model container has died, giving an overview of all past training jobs
- Tools that show the currently running jobs and allow stopping them
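
To make the feature list more concrete, here is a minimal in-memory sketch of how job submission, lifecycle tracking, listing of running jobs, and stopping could fit together. Everything in it (`JobState`, `TrainingJob`, `JobRegistry` and their methods) is hypothetical and not part of this branch; a real tool would have to talk to the GCE instances and Docker containers instead of a dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    """Hypothetical lifecycle states of a training job."""
    SUBMITTED = "submitted"
    RUNNING = "running"
    FINISHED = "finished"
    STOPPED = "stopped"


@dataclass
class TrainingJob:
    job_id: int
    model_name: str
    state: JobState = JobState.SUBMITTED
    # Every state the job has been in, oldest first.
    history: List[JobState] = field(
        default_factory=lambda: [JobState.SUBMITTED]
    )


class JobRegistry:
    """In-memory stand-in for a training job manager (assumption)."""

    def __init__(self) -> None:
        self._jobs: Dict[int, TrainingJob] = {}
        self._next_id = 1

    def submit(self, model_name: str) -> TrainingJob:
        """Register a new job in the SUBMITTED state."""
        job = TrainingJob(job_id=self._next_id, model_name=model_name)
        self._jobs[job.job_id] = job
        self._next_id += 1
        return job

    def set_state(self, job_id: int, state: JobState) -> None:
        """Record a lifecycle change, e.g. reported by the training container."""
        job = self._jobs[job_id]
        job.state = state
        job.history.append(state)

    def running(self) -> List[TrainingJob]:
        """Return all jobs that are currently running."""
        return [j for j in self._jobs.values() if j.state is JobState.RUNNING]

    def stop(self, job_id: int) -> None:
        """Stop a running job; other states cannot be stopped."""
        if self._jobs[job_id].state is not JobState.RUNNING:
            raise ValueError(f"job {job_id} is not running")
        self.set_state(job_id, JobState.STOPPED)

    def all_jobs(self) -> List[TrainingJob]:
        """Overview of all jobs, including past ones."""
        return list(self._jobs.values())
```

Because finished and stopped jobs stay in the registry, `all_jobs()` keeps the overview of past training jobs that the centralized TensorBoard item asks for, at least at the metadata level.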

Currently there are no plans to execute attacks against the trained model as the official website is used for this.

The branch 'training-enablement-tools' marks the start of the development. The branch 'tiny-imagenet-classifier-resnet' should not be merged into master as it is still at an early stage. In 'training-enablement-tools', the given model will be developed further in terms of infrastructure and deployment pipeline integration. If you develop the given ResNet model further, please contact me for rebasing (in both directions).

In my opinion this merge is mandatory, as a new branch is needed for the development of the enablement tools and, furthermore, we have to split the infrastructure from the ML part.

ghost · 2018-08-28T14:45:00Z

@Simsso @FlorianPfisterer I totally forgot that on Sunday we created a new branch with the suffix '-branch' for testing purposes (its base branch was tiny-imagenet-classifier-resnet).

Would you like to continue developing on the new one and discontinue the one without the suffix? That would make future rebasing much simpler!

Simsso · 2018-08-29T20:24:50Z

- Agree with the feature list for the enablement tools. Is there an issue for it, or do you think it makes sense to open one?
- "Currently there are no plans to execute attacks against the trained model as the official website is used for this." So far that's fine; in the long run (i.e. as soon as our classifiers work) we might have to integrate attacks into our pipeline.
- Agree with merging. So far the tiny-imagenet-classifier-xxx branches are experiments, and as soon as we achieve good accuracies we'll rewrite the Python code in a cleaner fashion.
- I did not read all the code but will browse it real quick; I will of course do a full review as soon as we merge into master.
- We continue with the branch tiny-imagenet-classifier-resnet, i.e. the one without the suffix.

Simsso · 2018-08-29T20:29:54Z

deployment/configure_gce_instance.sh

+echo "Increase connection failure retries .."
+export ANSIBLE_SSH_RETRIES=99
+
+echo "Configure Instance at $1 and start model training .."


Just to get a better understanding... Where could one see those logs?

You can see those logs on GCP during the build, just as you would in Jenkins.

Simsso · 2018-08-29T20:30:06Z

deployment/deployment.json

+                "--plaintext-file=deployment/cloudbuild-service-account.json",
+                "--location=global",
+                "--keyring=nips-2018-challenge-keyring",
+                "--key=nips-2018-challenge-key"


Is that a key?

Those are keys on GCP that are downloaded and used during the build to encrypt the access keys to our service accounts, e.g. the service account for GCE.

Simsso · 2018-08-29T20:30:52Z

deployment/nips-tensorflow-base-image/Dockerfile

@@ -0,0 +1,10 @@
+FROM tensorflow/tensorflow:latest-gpu


I think we should not use latest, but rather go with one version throughout the challenge. Open to discuss that though.

Do you know which version of TensorFlow is recommended to use?

No I do not. Maybe the latest?

To be honest, I would stick with latest-gpu, as it should be backward-compatible and include serious fixes, performance improvements and so on. I think we will find out very quickly if something is broken.

Okay, that's fine with me.
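
For reference, the trade-off discussed in this thread (floating `latest` tags versus one fixed version) can be checked mechanically. The following is a minimal sketch, not part of this PR: the helper name `floating_base_images` is hypothetical, and it treats a missing tag, `latest`, and any `latest-*` tag as floating.

```python
import re
from typing import List

# Matches "FROM [--flag=value ...] image[:tag]" and captures the image reference.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--\S+\s+)*(\S+)", re.IGNORECASE)


def floating_base_images(dockerfile_text: str) -> List[str]:
    """Return base images whose tag can change between builds.

    Hypothetical helper: a missing tag, "latest", or a tag starting
    with "latest-" counts as floating; digest-pinned images do not.
    """
    floating = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        # "scratch" is the empty base image and has no tag at all.
        if image.lower() == "scratch":
            continue
        # An image pinned by digest (image@sha256:...) never changes.
        if "@" in image:
            continue
        # Only look for the tag after the last "/", so that a registry
        # port such as "localhost:5000/img" is not mistaken for a tag.
        name = image.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag == "" or tag == "latest" or tag.startswith("latest-"):
            floating.append(image)
    return floating
```

Applied to the two Dockerfiles in this PR, the helper would report both `tensorflow/tensorflow:latest-gpu` and `tensorflow/tensorflow:latest`. A line with a version tag (for illustration, `FROM tensorflow/tensorflow:1.4.0-gpu`) would pass. Note that this sketch does not distinguish references to earlier build stages in multi-stage Dockerfiles.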

Simsso · 2018-08-29T20:31:58Z

models/reference-cnn/Dockerfile

@@ -0,0 +1,19 @@
+FROM tensorflow/tensorflow:latest


Is tensorflow correct or rather tensorflow-gpu?

Again, tag latest?

That is an old version of the Dockerfile. I was working on the tiny-imagenet-classifier and completely forgot to update this one.

ghost · 2018-08-30T06:53:38Z

@Simsso I don't think it makes any sense to open a new one. I will merge the list into #11 as it is part of that.

FlorianPfisterer and others added 30 commits June 27, 2018 20:23

Setup project structure

3e4d594

Add code to load the Tiny-ImageNet training set

9aee1dd

Add minimal README.md contents

c34f992

Define simple CNN graph

c013b2c

Fix: separate optimizer from graph definition

abf10d9

Update import statements of files, define training code in adam.py (still gives errors)

2d9d28e

Fix imports, paths and data types - now functional

73bd17f

Implement pressing fixes from @Simsso

7fae960

Add Dockerfile contents and start.sh script

1dcb1dd

Add GCS Data Helper

92f0794

Finalize GCS integration

bdf1e8b

Project setup

f7293bc

Add data set loader script

f69a743

Implement multinomial logistic regression

fc97e18

Copy data set loader

5dfa2a9

Implement SGD training

23c1fe3

Add deep CNN model

7227a81

Add batch normalization

427565d

Add deployment.json for GCP Build Trigger

53e4c2a

- This is not the final version, changes will follow

Changes in folder structure

6d3eb8b

- Models should be places in models/ for now

Add initial Terraform configs

6f6819a

Implement variable validation cycle

ae5da20

Customize kernel initializiation

3a74795

More Terraform configs

ae6a867

Initial Ansible Playbook

95e99b2

Update model path for Docker build

5db7ec8

Change Terraform buildstep

08cf501

Update path to create_gce_instance.sh

5ba300a

Install curl in Build Container

9fa7d76

Update deployment scripts

de5315d

samedguener and others added 16 commits August 26, 2018 11:05

Merge pull request #32 from Simsso/reference-cnn-branch

317eef6

First attempt for training model on GCP

Create tiny-imagnet-classifier model folder

76126e1

Make unzip process of unzip quiet

50a53ae

Fix quiet unzip process

f1c25a1

Remove interactive mode from Docker run

c8e6b29

Custom Machine Time and TF-GPU Enablement

2602ed5

Update base-image to larger hard-disk

db3a9ab

- fix Ansible commands

Increas build timeouts

b697105

- downgrade to tf-gpu 1.4, due CUDA mismatch - try new Ansible command

Add CUDA deps to Dockerfile

2489712

Switch to tf-gpu base-image

3c7ea91

Remove specific tf-gpu version

0146532

Add TensorFlow base-image for Docker

a5a2b88

- includes TF-GPU and tiny-imagenet

Fix path for TensorBoard

e24d27b

- increase to 2 cores on GCP - use start.sh

Increase connection retries in Ansible

202695d

- enable detached mode for pipeline to finish

Update tf_log path in SGD

72491d9

Expose Docker ports

c08fc45

ghost requested review from Simsso and FlorianPfisterer August 28, 2018 14:37

Simsso approved these changes Aug 29, 2018

Add TrainingService and TrainingManager

7e33a50

- early versions of client (TS) and server (TM) - define gRPC protobuf for communcation - missing docu on files other than the proto

ghost merged commit 2658dce into training-enablement-tools Aug 30, 2018

ghost deleted the tiny-imagenet-classifier-resnet-branch branch August 30, 2018 06:53

FlorianPfisterer mentioned this pull request Sep 11, 2018

12. Working Group Meeting #44

Closed

Simsso added the code (Software is relevant or involved) and infrastructure (Cloud services, infrastructure, CI, and deployment) labels Sep 12, 2018

Simsso assigned ghost Sep 14, 2018

ghost mentioned this pull request Sep 21, 2018

13. Working Group Meeting #56

Closed

11 tasks

This pull request was closed.
