Development Branch of Training Enablement Tools #35
…till gives errors)
- This is not the final version, changes will follow
- Models should be placed in models/ for now
First attempt for training model on GCP
- fix Ansible commands
- downgrade to tf-gpu 1.4, due to CUDA mismatch - try new Ansible command
- includes TF-GPU and tiny-imagenet
- increase to 2 cores on GCP - use start.sh
- enable detached mode for pipeline to finish
@Simsso @FlorianPfisterer I totally forgot that on Sunday we created a new branch with the suffix '-branch' for testing purposes (base branch was tiny-imagenet-classifier-resnet). Would you like to continue development on the new one and discontinue the one without the suffix? That would make future rebasing much simpler!
echo "Increase connection failure retries .."
export ANSIBLE_SSH_RETRIES=99

echo "Configure Instance at $1 and start model training .."
Just to get a better understanding... Where could one see those logs?
You can see those logs on GCP during the build, just as you would in Jenkins.
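For readers following the script under review, here is a minimal sketch of how its opening steps could be wrapped with a basic argument check. Only the two echo messages and the `ANSIBLE_SSH_RETRIES=99` export come from the diff; the function names, the environment override, and the usage message are assumptions for illustration, not the PR's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only the echo texts and ANSIBLE_SSH_RETRIES=99
# are taken from the reviewed diff; everything else is an assumption.

configure_retries() {
    # Keep a value already set by the caller, otherwise default to 99.
    export ANSIBLE_SSH_RETRIES="${ANSIBLE_SSH_RETRIES:-99}"
    echo "Increase connection failure retries .."
}

announce_target() {
    # $1: address of the instance to configure.
    # The empty-argument check is an assumption, not part of the diff.
    if [ -z "${1:-}" ]; then
        echo "usage: $0 <instance-address>" >&2
        return 1
    fi
    echo "Configure Instance at $1 and start model training .."
}
```

Failing early on a missing address would keep a misconfigured build from waiting through all 99 connection retries before reporting an error in the build logs.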
"--plaintext-file=deployment/cloudbuild-service-account.json",
"--location=global",
"--keyring=nips-2018-challenge-keyring",
"--key=nips-2018-challenge-key"
Is that a key?
Those are keys on GCP, which are downloaded and used during the build to encrypt the access keys for our service accounts, e.g. the service account for GCE.
@@ -0,0 +1,10 @@
FROM tensorflow/tensorflow:latest-gpu
I think we should not use latest, but rather go with one version throughout the challenge. Open to discussing that, though.
Do you know which version of TensorFlow is recommended to use?
No, I do not. Maybe the latest?
To be honest, I would stick with latest-gpu, as it should be backward-compatible and include serious fixes, performance improvements and so on. I think we will find out very quickly if something is broken.
Okay, that's fine with me.
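Should the team later revisit the pinning question, a small guard along these lines could flag floating tags before a build. This is only a sketch: the helper name is hypothetical, and the tag `1.4.0-gpu` in the usage note is an assumed example chosen to match the tf-gpu 1.4 downgrade in the commit list, not a version agreed on in this thread.

```shell
# Hypothetical helper: succeeds only if the image reference carries an
# explicit tag that is not "latest" or a "latest-*" variant.
is_pinned_image() {
    pin_ref="$1"
    pin_last="${pin_ref##*/}"      # last path component, e.g. "tensorflow:latest-gpu"
    case "$pin_last" in
        *:*) ;;                    # an explicit tag is present
        *) return 1 ;;             # no tag means an implicit "latest"
    esac
    pin_tag="${pin_last##*:}"      # text after the last colon
    case "$pin_tag" in
        latest|latest-*) return 1 ;;   # floating tag
        *) return 0 ;;                 # treated as pinned
    esac
}
```

For example, `is_pinned_image "tensorflow/tensorflow:1.4.0-gpu"` succeeds, while `is_pinned_image "tensorflow/tensorflow:latest-gpu"` fails, so the check could gate a build step without changing the Dockerfile itself.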
@@ -0,0 +1,19 @@
FROM tensorflow/tensorflow:latest
Is tensorflow correct, or rather tensorflow-gpu?
Again, tag latest?
That is an old version of the Dockerfile. I had been working on the tiny-imagenet-classifier and, because of that, completely forgot to update this one.
- early versions of client (TS) and server (TM) - define gRPC protobuf for communication - missing documentation on files other than the proto
The current status of this branch: the Docker image of the given model is built successfully, GCE instances are created and configured, the training of the model is started, and the training results are served using TensorBoard.
The deployment pipeline was built successfully. The next steps are to develop the training enablement tools and integrate them into the deployment pipeline. These tools are mandatory to solve the issue 'Create Cloud Testing CNN' (#11) in an end-to-end manner.
Enablement Tools are:
Currently there are no plans to execute attacks against the trained model as the official website is used for this.
The branch 'training-enablement-tools' marks the start of the development. The branch 'tiny-imagenet-classifier-resnet' should not be merged into master, as it is still at an early stage. In 'training-enablement-tools', the given model will be developed further in terms of infrastructure and deployment pipeline integration. If you develop the given ResNet model further, please contact me for rebasing (in both directions).
In my opinion, this merge is mandatory: a new branch is needed for developing the enablement tools, and furthermore we have to separate the infrastructure from the ML part.