Pre-0.1 release #79

Merged on Jun 26, 2018 (147 commits)

Commits
d6fd7a8
Merge from upstream
fplk May 14, 2018
9540160
Manual modifications before test.
fplk May 14, 2018
8ad82d2
Updated test-submit. Now need to specify AWS environment variables fo…
fplk May 14, 2018
dd5a31b
added a test comment
May 16, 2018
74f4baa
Merge pull request #2 from sboagibm/my_flpk_test_2
fplk May 16, 2018
cf15bcd
Fixes for bad build in travis
May 18, 2018
2f04453
Merge pull request #3 from sboagibm/flpk_travis_fix_1
fplk May 18, 2018
cb47092
Removed Scott's test comment and temporarily disabled python-distutil…
May 18, 2018
f8b0d31
Added debug statements.
fplk May 18, 2018
5ab5f9c
Another Travis debug commit.
fplk May 18, 2018
f61d70b
Another Travis debug commit.
fplk May 18, 2018
003ba8c
Another Travis debug commit.
fplk May 18, 2018
a5cd4db
Extended test-submit.
fplk May 19, 2018
2f5e7ce
Pointed travis testing to do hostmount minikube
May 19, 2018
3ab475e
Merge pull request #4 from sboagibm/flpk_travis_fix_2
fplk May 19, 2018
5461cd3
Debugging permissions error.
fplk May 19, 2018
ff8f58c
Fix to mkdir problems.
fplk May 19, 2018
20f34b3
Fixed Makefile syntax.
fplk May 19, 2018
b658805
Added Minikube CI make target to travis.yml
fplk May 19, 2018
a72d19a
Printing debugging information about pods.
fplk May 19, 2018
fba5126
Printing debugging information about pods.
fplk May 19, 2018
cd4163d
Printing debugging information about pods.
fplk May 19, 2018
fe67660
Printing debugging information incl kubectl get pod.
fplk May 19, 2018
586d590
Dumping tons of debug including statefulsets.
fplk May 19, 2018
c431a52
Added more debug statements.
fplk May 19, 2018
b87c19f
Fixed debug code.
fplk May 20, 2018
a50de4a
Enabled debug mode.
fplk May 20, 2018
8817b4a
Again.
fplk May 20, 2018
4d0e02a
Another one.
fplk May 20, 2018
9652b29
Set debug as default.
fplk May 20, 2018
fdd8c16
Added debug notification.
fplk May 20, 2018
8b7b85e
Added trainer logging.
fplk May 20, 2018
c8bf2de
tracing from the trainer to lcm
May 20, 2018
2b5360a
Merge pull request #5 from sboagibm/fplk_add_lcm_trace
fplk May 20, 2018
f6ff626
Logging result.
fplk May 20, 2018
7b00125
tweaked coord debugging
May 20, 2018
9da2fee
more debugging
May 20, 2018
e8a656a
Merge pull request #6 from sboagibm/fplk_add_lcm_dump_kube_objects
fplk May 20, 2018
c9efcd3
fixed bad continuation line in makefile
May 20, 2018
ea94aec
Merge pull request #7 from sboagibm/fplk_try_to_fix_build_error
fplk May 20, 2018
6263f60
hard-coded disable push metrics
May 20, 2018
b193515
Merge pull request #8 from sboagibm/disable_metrics
fplk May 20, 2018
6a1dbd6
added lower level logging
May 20, 2018
850d650
dist: xenial
May 20, 2018
6cceb2d
Merge pull request #9 from sboagibm/fplk_low_level_logging
fplk May 20, 2018
c2f0a11
Update .travis.yml
fplk May 20, 2018
98c1ff3
fix typo
May 20, 2018
90af5c3
Merge pull request #10 from sboagibm/fix_yet_another_build_problem
fplk May 20, 2018
ef5092b
try dumb stuff to get logging working
May 20, 2018
8144e7a
Merge pull request #11 from sboagibm/try_to_get_lcm_logging_working
fplk May 21, 2018
d99cec3
do lcm describe
May 21, 2018
483a425
Merge pull request #12 from sboagibm/try_lcm_describe
fplk May 21, 2018
112b3c0
add a -p to lcm logs
May 21, 2018
142b54e
Merge pull request #13 from sboagibm/try_previous_lcm_logs
fplk May 21, 2018
f1f0f68
set shared_volume_storage_class etc
May 21, 2018
57c9b6c
Merge pull request #14 from sboagibm/set_shared_volume_storage_class_etc
fplk May 21, 2018
c0e6b4b
Trying to fix resource constraint.
fplk May 21, 2018
cb6df62
monkeying with cpu allocation
May 21, 2018
72139c6
Merge pull request #15 from sboagibm/lower_cpu
fplk May 21, 2018
1eb37f4
Added DIND-specific setup scripts.
fplk May 27, 2018
f0fc73f
Improved DIND scripts.
fplk May 27, 2018
3d7c9f2
Fix in DIND scripts.
fplk May 27, 2018
fe711c2
Added compile script for s3fs, since open source S3 driver needs manu…
fplk May 27, 2018
3093f0d
Minor DIND script update
fplk May 27, 2018
e80dfb2
Final DIND commit.
fplk May 27, 2018
3fb6509
Removed information about proprietary S3 driver, reverted tf-summary-…
fplk May 29, 2018
41c601d
Merge branch 'master' into merge_20180514_1536
fplk May 29, 2018
8e53d72
Added concrete example to get from fresh SL node to running a FfDL job.
fplk May 29, 2018
c2beaa7
Merge branch 'merge_20180514_1536' of https://github.com/fplk/FfDL in…
fplk May 29, 2018
eb400d3
Trying to fix Travis issue.
fplk May 29, 2018
8d4a6ee
Fixed Travis issue.
fplk May 29, 2018
61ad832
Removed hardcoded user name.
fplk May 29, 2018
a21146b
Moved S3 driver compilation out of s3_driver.sh, so it can be used fo…
fplk May 29, 2018
ddc403a
Followed Tommy's request and increased resource limits to values from…
fplk May 30, 2018
ba6f57f
Parameterized memory values like Tommy requested.
fplk May 31, 2018
4dc6c82
Removed remaining placeholder values for parameterization.
fplk May 31, 2018
b2b9524
Changed lcm-deployment.yml indentation in attempt to fix CI with para…
fplk Jun 1, 2018
5b1cf1b
Attempt to fix CI.
fplk Jun 1, 2018
59f9fdf
Removed excessive debug statements and cleaned comments. Probably bre…
fplk Jun 4, 2018
b5ba13a
Removed unused declaration
fplk Jun 4, 2018
ec540ab
Migrated S3FS compilation into Docker environment for DIND setup.
fplk Jun 11, 2018
e3c3c33
Added Keras example.
fplk Jun 11, 2018
da0762b
Merge branch 'master' into merge_20180514_1536
fplk Jun 11, 2018
4b52602
Added changed requested by Tommy.
fplk Jun 12, 2018
49b5862
Merge branch 'master' into merge_20180514_1536
fplk Jun 16, 2018
50666f5
Added script for IBM Cloud deployment with open source S3 driver.
fplk Jun 17, 2018
2881b84
Tried to add missing copyright headers in semi-automated fashion. Wil…
fplk Jun 19, 2018
0f8393b
Merge branch 'master' into merge_20180514_1536
animeshsingh Jun 19, 2018
cebf747
update draft instructions
Tomcli Jun 20, 2018
a81b451
update draft instructions
Tomcli Jun 20, 2018
f4ae06e
update helm chart
Tomcli Jun 21, 2018
f0db780
fix script path
Tomcli Jun 21, 2018
413f847
update instructions
Tomcli Jun 21, 2018
d9de742
update instructions
Tomcli Jun 21, 2018
fd9682e
update instructions
Tomcli Jun 21, 2018
d8fa8d2
update instructions
Tomcli Jun 21, 2018
28b86c2
update instructions
Tomcli Jun 21, 2018
e1e9bf6
update helm chart
Tomcli Jun 21, 2018
613310e
update instructions
Tomcli Jun 21, 2018
794ef85
update instructions
Tomcli Jun 21, 2018
bb950c8
fix script
Tomcli Jun 21, 2018
76a9c89
update instructions
Tomcli Jun 21, 2018
2877755
update instructions
Tomcli Jun 21, 2018
41eb377
update instructions
Tomcli Jun 21, 2018
d1a6f9a
update instructions
Tomcli Jun 21, 2018
083a3a6
Merge branch 'master' into merge_20180514_1536
fplk Jun 21, 2018
09608c4
update makefile
Tomcli Jun 22, 2018
c3838e1
update makefile
Tomcli Jun 22, 2018
601b8f4
update makefile
Tomcli Jun 22, 2018
4003d7f
update makefile
Tomcli Jun 22, 2018
52cc7cb
Merge branch 'merge_20180514_1536' into merge_20180514_1536
Tomcli Jun 22, 2018
f6a9174
Merge pull request #19 from Tomcli/merge_20180514_1536
fplk Jun 22, 2018
d83e1f8
Merge branch 'master' into merge_20180514_1536
animeshsingh Jun 22, 2018
b2601c5
fix makefile
Tomcli Jun 22, 2018
cfe6450
update makefile
Tomcli Jun 22, 2018
8111c31
Merge pull request #20 from Tomcli/merge_20180514_1536
fplk Jun 22, 2018
669af51
replace instructions with script
Tomcli Jun 22, 2018
823b0d3
update cleanup steps
Tomcli Jun 22, 2018
19b4b09
update readme
Tomcli Jun 22, 2018
ae57494
commented out minikube references
Tomcli Jun 22, 2018
68e0bd0
add cleanup command
Tomcli Jun 22, 2018
0019df6
add cleanup command
Tomcli Jun 22, 2018
73434d8
add cleanup command
Tomcli Jun 22, 2018
cbc8939
update troubleshooting and makefile
Tomcli Jun 22, 2018
ae4fa05
update troubleshooting;
Tomcli Jun 22, 2018
8233d3c
update developer guide
Tomcli Jun 22, 2018
f5b5ab5
update developer guide
Tomcli Jun 22, 2018
ff3b6c5
Merge pull request #21 from Tomcli/merge_20180514_1536
fplk Jun 22, 2018
3949214
Removed quick start guide that is obsolete with usability improvements.
fplk Jun 25, 2018
31e0194
Extended developer guide a little (fixed typos, added comments regard…
fplk Jun 25, 2018
84109df
Cleaned up bin subdirectory.
fplk Jun 25, 2018
c44d224
fix docker push with docker.io
Tomcli Jun 25, 2018
a262d9d
fix docker push with docker.io
Tomcli Jun 25, 2018
a3b2c88
Merge pull request #22 from Tomcli/merge_20180514_1536
fplk Jun 25, 2018
d5f0780
update h2o-3 entrypoint script to be compatible with FfDL v0.1
Tomcli Jun 25, 2018
3f9c445
update minimun requirement
Tomcli Jun 25, 2018
2e9d5e5
fix minor bug
Tomcli Jun 25, 2018
d9597f8
fix bugs for christian comments
Tomcli Jun 25, 2018
53b96ff
fix minor bug
Tomcli Jun 25, 2018
f3f63b4
Merge pull request #23 from Tomcli/merge_20180514_1536
fplk Jun 25, 2018
50491b7
Quote docker username and passowrd
ckadner Jun 26, 2018
5cf1800
Extended developer guide, added troubleshooting document, give hint a…
fplk Jun 26, 2018
d9b43e5
Merge pull request #24 from ckadner/patch-1
fplk Jun 26, 2018
368f74b
Switched user repo in command to official one.
fplk Jun 26, 2018
8e6c2bf
Minor syntax changes.
fplk Jun 26, 2018
20b08b0
convert all examples to use mount_cos
Tomcli Jun 26, 2018
562ea5f
Merge pull request #25 from Tomcli/merge_20180514_1536
fplk Jun 26, 2018
8 changes: 8 additions & 0 deletions .gitignore
@@ -13,7 +13,15 @@ bin/copy.sh
 .vagrant/
 .DS_Store
 *.log
+*.csr
+*.crt
+*.key
 lcm/coord/tmp.etcd
 tmp
 .venv/
 .idea/
+venv/
+bin/dind_scripts/config.json
+bin/dind_scripts/go1.10.1.linux-amd64.tar.gz
+bin/dind_scripts/helm-v2.8.2-linux-amd64.tar.gz
+bin/dind_scripts/linux-amd64/
2 changes: 2 additions & 0 deletions .helmignore
@@ -21,3 +21,5 @@ demos/
 samples/
 cli/
 community/
+persistentvol/
+design/
10 changes: 9 additions & 1 deletion .travis.yml
@@ -31,14 +31,22 @@ script:
 # fail fast
 - set -e
 - export MAKE_ARGS=--no-print-directory
+# Open SSH
+# - echo travis:$sshpassword | sudo chpasswd
+# - sudo sed -i 's/ChallengeResponseAuthentication no/ChallengeResponseAuthentication yes/' /etc/ssh/sshd_config
+# - sudo service ssh restart
+# - sudo apt-get install sshpass
+# - sshpass -p $sshpassword ssh -R 9999:localhost:22 -o StrictHostKeyChecking=no travis@$bouncehostip
 # compile and build Docker images
 - glide -q install
+- make $MAKE_ARGS docker-build-base
+- make $MAKE_ARGS gen-certs
 - make $MAKE_ARGS build
 - make $MAKE_ARGS docker-build
 # deploy services
 - make $MAKE_ARGS deploy
 # submit a test job
-- make $MAKE_ARGS test-submit
+- make $MAKE_ARGS test-submit-minikube-ci
 
 notifications:
 email: false
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -28,6 +28,7 @@ via code, tests, or documentation:
 * Michael Kalantar
 * Tommy Li
 * Vinod Muthusamy
+* Falk Pollok
 * Florian Rosenberg
 * Animesh Singh
 * Atin Sood
Empty file modified Chart.yaml
Whitespace-only changes.
1 change: 1 addition & 0 deletions MAINTAINERS.md
@@ -7,6 +7,7 @@ The maintainers are listed in alphabetical order.
 - Scott Boag ([sboagibm](https://github.com/sboagibm))
 - Waldemar Hummer ([whummer](https://github.com/whummer))
 - Tommy Li ([Tomcli](https://github.com/Tomcli))
+- Falk Pollok ([fplk](https://github.com/fplk))
 - Animesh Singh ([animeshsingh](https://github.com/animeshsingh))
 
 ## Methodology
486 changes: 450 additions & 36 deletions Makefile

Large diffs are not rendered by default.

170 changes: 117 additions & 53 deletions README.md
@@ -14,7 +14,7 @@ FfDL is a collaboration platform for:
 
 ![ffdl-architecture](docs/images/ffdl-architecture.png)
 
-To know more about the architectural details, please read [here](design/design_docs.md)
+To know more about the architectural details, please read the [design document](design/design_docs.md).
 
 ## Prerequisites
 
@@ -26,12 +26,12 @@ To know more about the architectural details, please read [here](design/design_d
 
 * `S3 CLI`: The [command-line interface](https://aws.amazon.com/cli/) to configure your Object Storage
 
-* An existing Kubernetes cluster (e.g., [Minikube](https://github.com/kubernetes/minikube) for local testing).
-For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI.
+* An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing).
+<!-- For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI. -->
 
 * Follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md)
 
-* The minimum recommended capacity for FfDL is 4GB Memory and 3 CPUs.
+* The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs.
 
 ## Usage Scenarios
 
@@ -46,9 +46,8 @@ To know more about the architectural details, please read [here](design/design_d
 ## Steps
 
 1. [Quick Start](#1-quick-start)
-- 1.1 [Installation using Minikube](#11-installation-using-minikube)
+- 1.1 [Installation using Kubeadm-DIND](#11-installation-using-kubeadm-dind)
 - 1.2 [Installation using Kubernetes Cluster](#12-installation-using-kubernetes-cluster)
-- 1.3 [Installation using IBM Cloud Kubernetes Cluster](#13-installation-using-ibm-cloud-kubernetes-cluster)
 2. [Test](#2-test)
 3. [Monitoring](#3-monitoring)
 4. [Development](#4-development)
@@ -62,47 +61,42 @@ To know more about the architectural details, please read [here](design/design_d
 
 ## 1. Quick Start
 
-There are multiple installation paths for installing FfDL locally ("1-click-install") or into an existing Kubernetes cluster.
+There are multiple installation paths for installing FfDL locally ("1-click-install") or into an existing Kubernetes cluster. You can visit [Step 5](#5-detailed-installation-instructions) for more details on the deployment instructions.
 
-> Note: If your Kubernetes Cluster version is 1.7 or below, please go to the [values.yaml](values.yaml) and change `k8s_1dot8_or_above` to **false**.
+### 1.1 Installation using Kubeadm-DIND
 
-### 1.1 Installation using Minikube
+If you have [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) installed on your machine, use these commands to deploy the FfDL platform:
-
-If you have Minikube installed on your machine, use these commands to deploy the FfDL platform:
 ``` shell
-export VM_TYPE=minikube
-make minikube
-make deploy
+export VM_TYPE=dind
+export PUBLIC_IP=localhost
+export SHARED_VOLUME_STORAGE_CLASS="";
+make deploy-plugin
+make quickstart-deploy
 ```
 
 ### 1.2 Installation using Kubernetes Cluster
 
-To install FfDL to a proper Kubernetes cluster, make sure `kubectl` points to the right namespace,
+To install FfDL to any proper Kubernetes cluster, make sure `kubectl` points to the right namespace,
 then deploy the platform services:
-> Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts.
+> Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts. For IBM Cloud, you can get your Public IP with `bx cs workers <cluster_name>`.
 
 ``` shell
 export VM_TYPE=none
 export PUBLIC_IP=<Cluster Public IP>
-make deploy
-```
-
-### 1.3 Installation using IBM Cloud Kubernetes Cluster
-
-To install FfDL to a proper IBM Cloud Kubernetes cluster, make sure `kubectl` points to the right namespace
+
+# Change the storage class to what's available on your Cloud Kubernetes Cluster.
+export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold";
+make deploy-plugin
Review comment (Contributor):

are we missing a helm init before make deploy-plugin here? without that I get Error: could not find a ready tiller pod:

[ffdl]$ make deploy-plugin

Deploying services to Kubernetes. This may take a while.
Installing helm/tiller
Error: could not find a ready tiller pod
Wait while kubectl get pvc shows static-volume-1 in state Pending
Creating persistent volume claim 1
Error: could not find a ready tiller pod
persistentvolumeclaim "static-volume-1" created
Error from server (NotFound): configmaps "static-volumes" not found

Using volumes with label type=dlaas-static-volume:
NAME              STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS     AGE
static-volume-1   Pending                                       ibmc-file-gold   1s

configmap "static-volumes" created

Review comment (Contributor):

Oh, sorry we did add helm init, but we forgot to put a sleep loop to wait for tiller to be ready.
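
A minimal sketch of the kind of wait loop mentioned above, assuming tiller readiness can be read from the `kubectl get pods --all-namespaces | grep tiller-deploy` output quoted in the README (`1/1 Running`). The helper names `wait_until` and `tiller_ready` are hypothetical and are not taken from this PR's Makefile:

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND...
# Retry COMMAND until it succeeds or MAX_ATTEMPTS tries have failed.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Hypothetical readiness check: succeeds once a tiller-deploy pod
# is listed as 1/1 Running.
tiller_ready() {
  kubectl get pods --all-namespaces 2>/dev/null \
    | grep tiller-deploy \
    | grep -Eq '1/1[[:space:]]+Running'
}

# Example (needs a cluster):
#   helm init
#   wait_until 60 5 tiller_ready && make deploy-plugin
```

With 60 attempts and a 5 second delay the loop gives up after roughly five minutes, so a cluster that never schedules tiller fails loudly instead of producing the follow-on errors shown in the log above.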

-and your machine is logged in with `bx login`, then deploy the platform services:
+make quickstart-deploy
Review comment (Contributor):

I'm using the Mac OS Terminal App (Bash 3.2) and I get this error message:

/bin/sh: @set: command not found

although seemingly without negative consequences. The make target succeeds.

$ make quickstart-deploy

collecting existing pods
calling big command
/bin/sh: @set: command not found
Deploying the stack via Helm. This will take a while.
...
STATUS: DEPLOYED
...
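
One possible explanation, offered as an assumption because the Makefile diff is not rendered on this page: `make` only treats `@` as its echo-suppression prefix at the very start of a recipe line. If `@set` appears later in the line, for example after a `;` or on a backslash-continued line, it is passed to `/bin/sh` unchanged and the shell looks for a command literally named `@set`. The sketch below reproduces the message without `make`; `demo_at_prefix` is a hypothetical name:

```shell
# demo_at_prefix: hand "@set -e" to sh the way a recipe would if make
# did not strip the "@". The shell has no "@" prefix, so the lookup
# fails with status 127 and the commands after it still run.
demo_at_prefix() {
  sh -c '@set -e; echo "status of @set: $?"; echo "later commands still run"'
}

demo_at_prefix
```

Because other commands follow on the same shell line, the failure status of `@set` is discarded, which would be consistent with the target still succeeding. If this is what happens in the Makefile, it also means `set -e` never takes effect for that recipe line.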

-``` shell
-export VM_TYPE=ibmcloud
-export CLUSTER_NAME=<Your Cluster Name> # Replace <Your Cluster Name> with your IBM Cloud Cluster Name
-make deploy
 ```
 
 ## 2. Test
 
 To submit a simple example training job that is included in this repo (see `etc/examples` folder):
 
-```
-make test-submit
+``` shell
+make test-push-data-s3
+make test-job-submit
 ```
 
 ## 3. Monitoring
@@ -115,6 +109,9 @@ Please refer to the [developer guide](docs/developer-guide.md) for more details.
 
 ## 5. Detailed Installation Instructions
 
+0. If you don't have a Kubernetes Cluster, you can create a [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) Kubernetes Cluster on your local machine. We recommend you give at least 4 CPUs and 8GB of memory to your Docker.
+> For Mac users, visit the instructions on the [Docker website](https://docs.docker.com/docker-for-mac/#advanced) and learn how to give more memory to your Docker.
+
 1. First, clone this repository and install the helm tiller on your Kubernetes cluster.
 ``` shell
 helm init
@@ -124,11 +121,46 @@ kubectl get pods --all-namespaces | grep tiller-deploy
 # kube-system tiller-deploy-fb8d7b69c-pcvc2 1/1 Running
 ```
 
-2. Now let's install all the necessary FfDL components using helm install.
-> Note: If your Kubernetes Cluster version is 1.7 or below, please go to the [values.yaml](values.yaml) and change `k8s_1dot8_or_above` to **false**.
+2. Define the necessary environment variables.
+* 2.a. For Kubeadm-DIND Cluster only
+```shell
+export FFDL_PATH=$(pwd)
+export SHARED_VOLUME_STORAGE_CLASS=""
+```
+
+* 2.b. For Cloud Kubernetes Cluster
+```shell
+# Change the storage class to what's available on your Cloud Kubernetes Cluster.
+export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold"
+```
+
+3. Install the Object Storage driver using helm install.
+* 3.a. For Kubeadm-DIND Cluster only
+```shell
+export FFDL_PATH=$(pwd)
+./bin/s3_driver.sh
+helm install storage-plugin --set dind=true,cloud=false
+```
+
+* 3.b. For Cloud Kubernetes Cluster
+```shell
+helm install storage-plugin
+```
+
+4. Create a static volume to store any metadata from FfDL.
+
+```shell
+pushd bin
+./create_static_volumes.sh
+./create_static_volumes_config.sh
+# Wait while kubectl get pvc shows static-volume-1 in state Pending
+popd
+```
+
+5. Now let's install all the necessary FfDL components using helm install.
 
 ``` shell
-helm install .
+helm install . --set lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS
 ```
 > Note: If you want to upgrade an older version of FfDL, run
 > `helm upgrade $(helm list | grep ffdl | awk '{print $1}' | head -n 1) .`
@@ -153,32 +185,44 @@ helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | grep STATU
 # STATUS: DEPLOYED
 ```
 
-3. Run the following script to configure Grafana for monitoring FfDL using the logging information from prometheus.
-> Note: If you are using a IBM Cloud Cluster, make sure you are logged in with `bx login`.
+6. Obtain the necessary port for Grafana, FfDL Web UI, local object storage, and FfDL restapi.
+```shell
+grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}')
+ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}')
+restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
+s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')
+```
+
+* For Kubeadm-DIND Cluster, we need to run the below script to forward the port to the localhost machine since we don't want to exec into the docker image and install various dependencies.
+```shell
+./bin/dind-port-forward.sh
+```
+
+
+7. Run the following commands to configure Grafana for monitoring FfDL using the logging information from prometheus.
+* 7.a. For Kubeadm-DIND Cluster only
 ```shell
-# If your Cluster is running on Minikube, replace "ibmcloud" to "minikube"
-# If your Cluster is not running on Minikube or IBM Cloud, replace "ibmcloud" to "none"
-export VM_TYPE=ibmcloud
+export VM_TYPE=none
+export PUBLIC_IP=localhost
+./bin/grafana.init.sh
+```
 
-# Replace <Your Cluster Name> with your IBM Cloud Cluster Name if your cluster is on IBM Cloud.
-# Use export PUBLIC_IP if you are using a none VM_TYPE. A Cluster Public IP that can access your Cluster's NodePorts.
-export CLUSTER_NAME=<Your Cluster Name>
+* 7.b. For Cloud Kubernetes Cluster.
+> Note: If you are using IBM Cloud Cluster, you can obtain your k8s public ip using `bx cs workers <cluster-name>`.
+
+``` shell
+export VM_TYPE=none
 export PUBLIC_IP=<Cluster Public IP>
 
 ./bin/grafana.init.sh
 ```
 
-4. Lastly, run the following commands to obtain your Grafana, FfDL Web UI, and FfDL restapi endpoints.
+8. Lastly, run the following commands to obtain your Grafana, FfDL Web UI, and FfDL restapi endpoints.
 ``` shell
 # Note: $(make --no-print-directory kubernetes-ip) simply gets the Public IP for your cluster.
 node_ip=$(make --no-print-directory kubernetes-ip)
 
-# Obtain all the necessary NodePorts for Grafana, Web UI, and RestAPI.
-grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}')
-ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}')
-restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
-
 # Echo statements to print out Grafana and Web UI URLs.
 echo "Monitoring dashboard: http://$node_ip:$grafana_port/ (login: admin/admin)"
 echo "Web UI: http://$node_ip:$ui_port/#/login?endpoint=$node_ip:$restapi_port&username=test-user"
@@ -191,8 +235,6 @@ Congratulation, FfDL is now running on your Cluster. Now you can go to [Step 6](
 In this example, we will run some simple jobs to train a convolutional network model using TensorFlow and Caffe. We will download a set of
 MNIST handwritten digit images, store them with Object Storage, and use the FfDL CLI to train a handwritten digit classification model.
 
-> Note: For Minikube, make sure you have the latest TensorFlow Docker image by running `docker pull tensorflow/tensorflow`
-
 ### 6.1. Using FfDL Local S3 Based Object Storage
 
 1. Run the following commands to obtain the object storage endpoint from your cluster.
@@ -232,8 +274,19 @@ binary).
 ```shell
 restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
 export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
+```
 
-# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
+Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url.
+```shell
+if [ "$(uname)" = "Darwin" ]; then
+sed -i '' s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml
+else
+sed -i s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml
+fi
+```
+
+Define the FfDL command line interface and run the training job with our default TensorFlow model
+```shell
 CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
 $CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
 ```
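
The `uname` branch in the hunk above is needed because GNU `sed` accepts `-i` with an optional suffix attached, while BSD `sed` (macOS) requires an explicit suffix argument, which may be empty. As a sketch of an alternative, not what the README does: a backup suffix attached directly to `-i` is accepted by both implementations, so a single command covers both platforms. `replace_s3_url` is a hypothetical helper name, and the sketch assumes the new URL contains no `#`, `&`, or backslash, all of which `sed` would interpret:

```shell
# replace_s3_url FILE NEW_URL
# Swap the default in-cluster endpoint for NEW_URL, editing FILE in place.
replace_s3_url() {
  file=$1
  new_url=$2
  # "-i.bak" (suffix attached, no space) works with GNU and BSD sed alike;
  # the backup copy is removed once the edit succeeds.
  sed -i.bak "s#http://s3.default.svc.cluster.local#${new_url}#" "$file" \
    && rm -f "${file}.bak"
}

# Example, using the manifest path from the README:
#   replace_s3_url etc/examples/tf-model/manifest.yml "$s3_url"
```

Running the helper a second time is harmless: once the default endpoint is gone the pattern no longer matches, which mirrors the README's remark that the step can be skipped if the path was already modified.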
@@ -284,8 +337,6 @@ you can simply run `$CLI_CMD logs <MODEL_ID>`
 
 In this section we will demonstrate how to run a TensorFlow job with training data stored in Cloud Object Storage.
 
-> Note: This also can be done with other Cloud providers' Object Storage, but we will demonstrate how to use IBM Cloud Object Storage in this instructions.
-
 1. Provision an S3 based Object Storage from your Cloud provider. Take note of your Authentication Endpoints, Access Key ID and Secret.
 
 > For IBM Cloud, you can provision an Object Storage from [IBM Cloud Dashboard](https://console.bluemix.net/catalog/infrastructure/cloud-object-storage?taxonomyNavigation=apps) or from [SoftLayer Portal](https://control.softlayer.com/storage/objectstorage).
@@ -351,16 +402,27 @@ $CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
 ```
 
 ## 7. Clean Up
-If you want to remove FfDL from your cluster, simply use the command below or run `helm delete <your FfDL release name>`
+If you want to remove FfDL from your cluster, simply use the following commands.
 ```shell
 helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
 ```
 
+If you want to remove the storage driver and pvc from your cluster, run:
+```shell
+kubectl delete pvc static-volume-1
+helm delete $(helm list | grep ibmcloud-object-storage-plugin | awk '{print $1}' | head -n 1)
+```
+
+For Kubeadm-DIND, you need to kill your forwarded ports. Note that the below command will kill all the ports that are created with `kubectl`.
+```shell
+kill $(lsof -i | grep kubectl | awk '{printf $2 " " }')
+```
+
 ## 8. Troubleshooting
 
 * FfDL has only been tested under Mac OS and Linux
 
-* The default Minikube driver under Mac OS is VirtualBox, which is known for having issues with networking.
+<!-- * The default Minikube driver under Mac OS is VirtualBox, which is known for having issues with networking.
 We generally recommend Mac OS users to install Minikube using the xhyve driver.
 
 * Also, when testing locally with Minikube, make sure to point the `docker` CLI to Minikube's Docker daemon:
@@ -369,7 +431,7 @@ helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
 eval $(minikube docker-env)
 ```
 * If you run into DNS name resolution issues using Minikube, make sure that the system uses only `10.0.0.10`
-as the single nameserver. Using multiple nameservers can result in problems, in particular under Mac OS.
+as the single nameserver. Using multiple nameservers can result in problems, in particular under Mac OS. -->
 
 * If `glide install` fails with an error complaining about non-existing paths (e.g., "Without src, cannot continue"),
 make sure to follow the standard Go directory layout (see [Prerequisites section]{#Prerequisites}).
@@ -378,6 +440,8 @@ helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1)
 
 * When using the FfDL CLI to train a model, make sure your directory path doesn't have slashes `/` at the end.
 
+* If your job is stuck in pending stage, you can try to redeploy the plugin with `helm install storage-plugin --set dind=true,cloud=false` for Kubeadm-DIND and `helm install storage-plugin` for general Kubernetes Cluster. Also, double check your training job manifest file to make sure you have the correct object storage credentials.
+
 ## 9. References
 
 Based on IBM Research work in Deep Learning.
17 changes: 17 additions & 0 deletions bin/addnew_march_15_2018.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+
+git add lcm/lcmconfig/lcmconfig_test.go
+git add lcm/service/lcm/helper/
+git add lcm/service/lcm/learner/
+git add lcm/service/lcm/nonSplitTraining.go
+git add lcm/service/lcm/policies/
+git add lcm/service/lcm/splitTraining.go
+git add metrics/log_collectors/tensorboard/Dockerfile
+git add metrics/log_collectors/training_data_service_client/match_log_file.py
+git add metrics/log_collectors/training_data_service_client/scan_log_dirs.py
+git add metrics/log_collectors/training_data_service_client/states.py
+git add metrics/log_collectors/training_data_service_client/training_data_buffered.py
+git add trainer/glide.yaml
+git add trainer/trainer/mongo.go
+git add trainer/trainer/queue.go
+git add trainer/trainer/queue_test.go