Development Branch of Training Enablement Tools #35

Merged
103 commits merged into from
Aug 30, 2018

@ghost ghost commented Aug 28, 2018

Current status of this branch: the Docker image of the given model builds successfully, the GCE instances are created and configured, model training starts, and the training results are served via TensorBoard.

The deployment pipeline was built successfully. The next steps are to develop the training enablement tools and integrate them into the deployment pipeline. These tools are required to solve the issue 'Create Cloud Testing CNN' (#11) in an end-to-end manner.

Enablement Tools are:

  • Tools that report the current status/lifecycle of the training process
  • Tools that make it simple to submit training jobs
  • Tools that provide a centralized TensorBoard that stays available after a training container has died, giving an overview of all past training jobs
  • Tools that show the currently running jobs and allow stopping them
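
One way the centralized TensorBoard item could work is sketched below: each job's event files are copied into a shared directory that outlives the training container. This is only an illustration under stated assumptions. The function names, the job id, and the directory layout are hypothetical and do not come from this PR.

```shell
#!/usr/bin/env bash
# Sketch: keep one shared log directory that TensorBoard can still read after
# a training container is gone. All names here are hypothetical.

# Copy a job's event files into <central_dir>/<job_id>.
# TensorBoard treats each subdirectory of its --logdir as a separate run.
archive_run_logs() {
  local job_id="$1" src_dir="$2" central_dir="$3"
  mkdir -p "${central_dir}/${job_id}"
  cp -R "${src_dir}/." "${central_dir}/${job_id}/"
}

# List the jobs that have archived logs, one per line.
list_archived_jobs() {
  local central_dir="$1"
  ls -1 "$central_dir"
}
```

A single long-running `tensorboard --logdir <central_dir>` process could then serve every archived job, independent of whether the container that produced the logs still exists.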

Currently there are no plans to execute attacks against the trained model as the official website is used for this.

The branch 'training-enablement-tools' marks the start of this development. The 'tiny-imagenet-classifier-resnet' branch should not be merged into master, as it is still at an early stage. The given model will likely be developed further in 'training-enablement-tools' in terms of infrastructure and deployment pipeline integration. If you develop the given ResNet model further, please contact me so we can rebase (in both directions).

In my opinion this merge is necessary: a new branch is needed for developing the enablement tools, and we also have to separate the infrastructure from the ML part.

@ghost ghost requested review from Simsso and FlorianPfisterer August 28, 2018 14:37
Author

ghost commented Aug 28, 2018

@Simsso @FlorianPfisterer I totally forgot that on Sunday we created a new branch with the suffix '-branch' for testing purposes (the base branch was tiny-imagenet-classifier-resnet).

Would you like to continue development on the new one and discontinue the one without the suffix? That would make future rebasing much simpler!

Owner

Simsso commented Aug 29, 2018

  • Agree with the feature list for enablement tools, is there an issue for it / do you think it makes sense to open one?

  • Currently there are no plans to execute attacks against the trained model as the official website is used for this.

    So far that's fine, in the long run (i.e. as soon as our classifiers work) we might have to integrate attacks into our pipeline

  • Agree with merging, so far the tiny-imagenet-classifier-xxx branches are experiments and as soon as we achieve good accuracies, we'll re-write the Python code in a cleaner fashion

  • Did not read all the code but will browse it real quick; will do a full review, sure enough, as soon as we merge into master

  • We continue the branch tiny-imagenet-classifier-resnet, i.e. the one without the suffix

echo "Increase connection failure retries .."
export ANSIBLE_SSH_RETRIES=99

echo "Configure Instance at $1 and start model training .."
Owner

Just to get a better understanding... Where could one see those logs?

Author

You can see those logs on GCP during the build, just as you would in Jenkins.

"--plaintext-file=deployment/cloudbuild-service-account.json",
"--location=global",
"--keyring=nips-2018-challenge-keyring",
"--key=nips-2018-challenge-key"
Owner

Is that a key?

Author

Those are keys on GCP, which are downloaded and used during the build to encrypt the access keys for our service accounts, e.g. the service account for GCE.
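
For orientation, the four flags in the hunk above match the options of `gcloud kms decrypt`, which is the usual way a Cloud Build step recovers a committed, encrypted credential file. The sketch below assembles such a command. It assumes the step decrypts rather than encrypts, and the ciphertext file name is hypothetical because the diff does not show it. The command is printed rather than executed, so it can be tried without GCP credentials.

```shell
#!/usr/bin/env bash
# Sketch: assemble the `gcloud kms decrypt` arguments for the build step.

# ASSUMPTION: this encrypted file name is hypothetical; only the four flags
# after it appear in the reviewed diff.
CIPHERTEXT_FILE="deployment/cloudbuild-service-account.json.enc"

# Emit one argument per line.
kms_decrypt_args() {
  printf '%s\n' \
    "kms" "decrypt" \
    "--ciphertext-file=${CIPHERTEXT_FILE}" \
    "--plaintext-file=deployment/cloudbuild-service-account.json" \
    "--location=global" \
    "--keyring=nips-2018-challenge-keyring" \
    "--key=nips-2018-challenge-key"
}

# Print the full command instead of running it.
echo "gcloud $(kms_decrypt_args | tr '\n' ' ')"
```

Note that the keyring and key names identify the KMS key; they are not secret material themselves. The decrypted JSON file is the sensitive artifact and should exist only inside the build workspace.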

@@ -0,0 +1,10 @@
FROM tensorflow/tensorflow:latest-gpu
Owner

I think we should not use latest, but rather go with one version throughout the challenge. Open to discuss that though.

Author

Do you know which version of TensorFlow is recommended to use?

Owner

No I do not. Maybe the latest?

Author

To be honest, I would stick with latest-gpu, as it should be backward compatible and include important fixes, performance improvements, and so on. I think we will find out very quickly if something breaks.

Owner

Okay, that's fine with me.
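
If the team later decides to pin a version after all, a small check like the following could flag floating tags before a build. It is a sketch, not part of this PR: the function name is hypothetical, and it only detects explicit `latest` or `latest-<variant>` tags, not images written without any tag.

```shell
#!/usr/bin/env bash
# Sketch: report FROM lines that use a floating "latest" tag.

# Prints offending lines (with line numbers) from the Dockerfile given as $1
# and returns non-zero if any are found.
find_floating_tags() {
  local dockerfile="$1"
  if grep -nE '^[[:space:]]*FROM[[:space:]]+[^[:space:]]+:latest(-[[:alnum:]]+)?([[:space:]]|$)' "$dockerfile"; then
    return 1
  fi
  return 0
}
```

Running `find_floating_tags Dockerfile` on either Dockerfile in this PR would report its `FROM` line, while a version-pinned tag would pass silently.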

@@ -0,0 +1,19 @@
FROM tensorflow/tensorflow:latest
Owner

Is tensorflow correct or rather tensorflow-gpu?

Again, tag latest?

Author

That is an old version of the Dockerfile. I was working on the tiny-imagenet-classifier and completely forgot to update this one.

Author

ghost commented Aug 30, 2018

@Simsso I don't think it makes any sense to open a new one. I will merge the list into #11 as it is part of that.

- early versions of client (TS) and server (TM)
- define gRPC protobuf for communication
- missing documentation on files other than the proto
@ghost ghost merged commit 2658dce into training-enablement-tools Aug 30, 2018
@ghost ghost deleted the tiny-imagenet-classifier-resnet-branch branch August 30, 2018 06:53
@Simsso Simsso added code Software is relevant or involved infrastructure Cloud services, infrastructure, CI, and deployment labels Sep 12, 2018
@Simsso Simsso assigned ghost Sep 14, 2018
@ghost ghost mentioned this pull request Sep 21, 2018
11 tasks
This pull request was closed.

3 participants