Pre-0.1 release #79
Merged
147 commits
d6fd7a8
Merge from upstream
fplk 9540160
Manual modifications before test.
fplk 8ad82d2
Updated test-submit. Now need to specify AWS environment variables fo…
fplk dd5a31b
added a test comment
74f4baa
Merge pull request #2 from sboagibm/my_flpk_test_2
fplk cf15bcd
Fixes for bad build in travis
2f04453
Merge pull request #3 from sboagibm/flpk_travis_fix_1
fplk cb47092
Removed Scott's test comment and temporarily disabled python-distutil…
f8b0d31
Added debug statements.
fplk 5ab5f9c
Another Travis debug commit.
fplk f61d70b
Another Travis debug commit.
fplk 003ba8c
Another Travis debug commit.
fplk a5cd4db
Extended test-submit.
fplk 2f5e7ce
Pointed travis testing to do hostmount minikube
3ab475e
Merge pull request #4 from sboagibm/flpk_travis_fix_2
fplk 5461cd3
Debugging permissions error.
fplk ff8f58c
Fix to mkdir problems.
fplk 20f34b3
Fixed Makefile syntax.
fplk b658805
Added Minikube CI make target to travis.yml
fplk a72d19a
Printing debugging information about pods.
fplk fba5126
Printing debugging information about pods.
fplk cd4163d
Printing debugging information about pods.
fplk fe67660
Printing debugging information incl kubectl get pod.
fplk 586d590
Dumping tons of debug including statefulsets.
fplk c431a52
Added more debug statements.
fplk b87c19f
Fixed debug code.
fplk a50de4a
Enabled debug mode.
fplk 8817b4a
Again.
fplk 4d0e02a
Another one.
fplk 9652b29
Set debug as default.
fplk fdd8c16
Added debug notification.
fplk 8b7b85e
Added trainer logging.
fplk c8bf2de
tracing from the trainer to lcm
2b5360a
Merge pull request #5 from sboagibm/fplk_add_lcm_trace
fplk f6ff626
Logging result.
fplk 7b00125
tweaked coord debugging
9da2fee
more debugging
e8a656a
Merge pull request #6 from sboagibm/fplk_add_lcm_dump_kube_objects
fplk c9efcd3
fixed bad continuation line in makefile
ea94aec
Merge pull request #7 from sboagibm/fplk_try_to_fix_build_error
fplk 6263f60
hard-coded disable push metrics
b193515
Merge pull request #8 from sboagibm/disable_metrics
fplk 6a1dbd6
added lower level logging
850d650
dist: xenial
6cceb2d
Merge pull request #9 from sboagibm/fplk_low_level_logging
fplk c2f0a11
Update .travis.yml
fplk 98c1ff3
fix typo
90af5c3
Merge pull request #10 from sboagibm/fix_yet_another_build_problem
fplk ef5092b
try dumb stuff to get logging working
8144e7a
Merge pull request #11 from sboagibm/try_to_get_lcm_logging_working
fplk d99cec3
do lcm describe
483a425
Merge pull request #12 from sboagibm/try_lcm_describe
fplk 112b3c0
add a -p to lcm logs
142b54e
Merge pull request #13 from sboagibm/try_previous_lcm_logs
fplk f1f0f68
set shared_volume_storage_class etc
57c9b6c
Merge pull request #14 from sboagibm/set_shared_volume_storage_class_etc
fplk c0e6b4b
Trying to fix resource constraint.
fplk cb6df62
monkeying with cpu allocation
72139c6
Merge pull request #15 from sboagibm/lower_cpu
fplk 1eb37f4
Added DIND-specific setup scripts.
fplk f0fc73f
Improved DIND scripts.
fplk 3d7c9f2
Fix in DIND scripts.
fplk fe711c2
Added compile script for s3fs, since open source S3 driver needs manu…
fplk 3093f0d
Minor DIND script update
fplk e80dfb2
Final DIND commit.
fplk 3fb6509
Removed information about proprietary S3 driver, reverted tf-summary-…
fplk 41c601d
Merge branch 'master' into merge_20180514_1536
fplk 8e53d72
Added concrete example to get from fresh SL node to running a FfDL job.
fplk c2beaa7
Merge branch 'merge_20180514_1536' of https://github.com/fplk/FfDL in…
fplk eb400d3
Trying to fix Travis issue.
fplk 8d4a6ee
Fixed Travis issue.
fplk 61ad832
Removed hardcoded user name.
fplk a21146b
Moved S3 driver compilation out of s3_driver.sh, so it can be used fo…
fplk ddc403a
Followed Tommy's request and increased resource limits to values from…
fplk ba6f57f
Parameterized memory values like Tommy requested.
fplk 4dc6c82
Removed remaining placeholder values for parameterization.
fplk b2b9524
Changed lcm-deployment.yml indentation in attempt to fix CI with para…
fplk 5b1cf1b
Attempt to fix CI.
fplk 59f9fdf
Removed excessive debug statements and cleaned comments. Probably bre…
fplk b5ba13a
Removed unused declaration
fplk ec540ab
Migrated S3FS compilation into Docker environment for DIND setup.
fplk e3c3c33
Added Keras example.
fplk da0762b
Merge branch 'master' into merge_20180514_1536
fplk 4b52602
Added changed requested by Tommy.
fplk 49b5862
Merge branch 'master' into merge_20180514_1536
fplk 50666f5
Added script for IBM Cloud deployment with open source S3 driver.
fplk 2881b84
Tried to add missing copyright headers in semi-automated fashion. Wil…
fplk 0f8393b
Merge branch 'master' into merge_20180514_1536
animeshsingh cebf747
update draft instructions
Tomcli a81b451
update draft instructions
Tomcli f4ae06e
update helm chart
Tomcli f0db780
fix script path
Tomcli 413f847
update instructions
Tomcli d9de742
update instructions
Tomcli fd9682e
update instructions
Tomcli d8fa8d2
update instructions
Tomcli 28b86c2
update instructions
Tomcli e1e9bf6
update helm chart
Tomcli 613310e
update instructions
Tomcli 794ef85
update instructions
Tomcli bb950c8
fix script
Tomcli 76a9c89
update instructions
Tomcli 2877755
update instructions
Tomcli 41eb377
update instructions
Tomcli d1a6f9a
update instructions
Tomcli 083a3a6
Merge branch 'master' into merge_20180514_1536
fplk 09608c4
update makefile
Tomcli c3838e1
update makefile
Tomcli 601b8f4
update makefile
Tomcli 4003d7f
update makefile
Tomcli 52cc7cb
Merge branch 'merge_20180514_1536' into merge_20180514_1536
Tomcli f6a9174
Merge pull request #19 from Tomcli/merge_20180514_1536
fplk d83e1f8
Merge branch 'master' into merge_20180514_1536
animeshsingh b2601c5
fix makefile
Tomcli cfe6450
update makefile
Tomcli 8111c31
Merge pull request #20 from Tomcli/merge_20180514_1536
fplk 669af51
replace instructions with script
Tomcli 823b0d3
update cleanup steps
Tomcli 19b4b09
update readme
Tomcli ae57494
commented out minikube references
Tomcli 68e0bd0
add cleanup command
Tomcli 0019df6
add cleanup command
Tomcli 73434d8
add cleanup command
Tomcli cbc8939
update troubleshooting and makefile
Tomcli ae4fa05
update troubleshooting;
Tomcli 8233d3c
update developer guide
Tomcli f5b5ab5
update developer guide
Tomcli ff3b6c5
Merge pull request #21 from Tomcli/merge_20180514_1536
fplk 3949214
Removed quick start guide that is obsolete with usability improvements.
fplk 31e0194
Extended developer guide a little (fixed typos, added comments regard…
fplk 84109df
Cleaned up bin subdirectory.
fplk c44d224
fix docker push with docker.io
Tomcli a262d9d
fix docker push with docker.io
Tomcli a3b2c88
Merge pull request #22 from Tomcli/merge_20180514_1536
fplk d5f0780
update h2o-3 entrypoint script to be compatible with FfDL v0.1
Tomcli 3f9c445
update minimun requirement
Tomcli 2e9d5e5
fix minor bug
Tomcli d9597f8
fix bugs for christian comments
Tomcli 53b96ff
fix minor bug
Tomcli f3f63b4
Merge pull request #23 from Tomcli/merge_20180514_1536
fplk 50491b7
Quote docker username and passowrd
ckadner 5cf1800
Extended developer guide, added troubleshooting document, give hint a…
fplk d9b43e5
Merge pull request #24 from ckadner/patch-1
fplk 368f74b
Switched user repo in command to official one.
fplk 8e6c2bf
Minor syntax changes.
fplk 20b08b0
convert all examples to use mount_cos
Tomcli 562ea5f
Merge pull request #25 from Tomcli/merge_20180514_1536
fplk
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
|
@@ -21,3 +21,5 @@ demos/ | ||
samples/ | samples/ | ||
cli/ | cli/ | ||
community/ | community/ | ||
persistentvol/ | |||
design/ |
Whitespace-only changes.
Large diffs are not rendered by default.
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
|
@@ -14,7 +14,7 @@ FfDL is a collaboration platform for: | ||
|
|
||
![ffdl-architecture](docs/images/ffdl-architecture.png) | ![ffdl-architecture](docs/images/ffdl-architecture.png) | ||
|
|
||
To know more about the architectural details, please read [here](design/design_docs.md) | To know more about the architectural details, please read the [design document](design/design_docs.md). | ||
|
|
||
## Prerequisites | ## Prerequisites | ||
|
|
||
|
@@ -26,12 +26,12 @@ To know more about the architectural details, please read [here](design/design_d | ||
|
|
||
* `S3 CLI`: The [command-line interface](https://aws.amazon.com/cli/) to configure your Object Storage | * `S3 CLI`: The [command-line interface](https://aws.amazon.com/cli/) to configure your Object Storage | ||
|
|
||
* An existing Kubernetes cluster (e.g., [Minikube](https://github.com/kubernetes/minikube) for local testing). | * An existing Kubernetes cluster (e.g., [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) for local testing). | ||
For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI. | <!-- For Minikube, use the command `make minikube` to start Minikube and set up local network routes. Minikube **v0.25.1** is tested with Travis CI. --> | ||
|
|
||
* Follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md) | * Follow the appropriate instructions for standing up your Kubernetes cluster using [IBM Cloud Public](https://github.com/IBM/container-journey-template/blob/master/README.md) or [IBM Cloud Private](https://github.com/IBM/deploy-ibm-cloud-private/blob/master/README.md) | ||
|
|
||
* The minimum recommended capacity for FfDL is 4GB Memory and 3 CPUs. | * The minimum capacity requirement for FfDL is 4GB Memory and 3 CPUs. | ||
|
|
||
## Usage Scenarios | ## Usage Scenarios | ||
|
|
||
|
@@ -46,9 +46,8 @@ To know more about the architectural details, please read [here](design/design_d | ||
## Steps | ## Steps | ||
|
|
||
1. [Quick Start](#1-quick-start) | 1. [Quick Start](#1-quick-start) | ||
- 1.1 [Installation using Minikube](#11-installation-using-minikube) | - 1.1 [Installation using Kubeadm-DIND](#11-installation-using-kubeadm-dind) | ||
- 1.2 [Installation using Kubernetes Cluster](#12-installation-using-kubernetes-cluster) | - 1.2 [Installation using Kubernetes Cluster](#12-installation-using-kubernetes-cluster) | ||
- 1.3 [Installation using IBM Cloud Kubernetes Cluster](#13-installation-using-ibm-cloud-kubernetes-cluster) | |||
2. [Test](#2-test) | 2. [Test](#2-test) | ||
3. [Monitoring](#3-monitoring) | 3. [Monitoring](#3-monitoring) | ||
4. [Development](#4-development) | 4. [Development](#4-development) | ||
|
@@ -62,47 +61,42 @@ To know more about the architectural details, please read [here](design/design_d | ||
|
|
||
## 1. Quick Start | ## 1. Quick Start | ||
|
|
||
There are multiple installation paths for installing FfDL locally ("1-click-install") or into an existing Kubernetes cluster. | There are multiple installation paths for installing FfDL locally ("1-click-install") or into an existing Kubernetes cluster. You can visit [Step 5](#5-detailed-installation-instructions) for more details on the deployment instructions. | ||
|
|
||
> Note: If your Kubernetes Cluster version is 1.7 or below, please go to the [values.yaml](values.yaml) and change `k8s_1dot8_or_above` to **false**. | ### 1.1 Installation using Kubeadm-DIND | ||
|
|
||
### 1.1 Installation using Minikube | If you have [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) installed on your machine, use these commands to deploy the FfDL platform: | ||
|
|||
If you have Minikube installed on your machine, use these commands to deploy the FfDL platform: | |||
``` shell | ``` shell | ||
export VM_TYPE=minikube | export VM_TYPE=dind | ||
make minikube | export PUBLIC_IP=localhost | ||
make deploy | export SHARED_VOLUME_STORAGE_CLASS=""; | ||
make deploy-plugin | |||
make quickstart-deploy | |||
``` | ``` | ||
|
|
||
### 1.2 Installation using Kubernetes Cluster | ### 1.2 Installation using Kubernetes Cluster | ||
|
|
||
To install FfDL to a proper Kubernetes cluster, make sure `kubectl` points to the right namespace, | To install FfDL to any proper Kubernetes cluster, make sure `kubectl` points to the right namespace, | ||
then deploy the platform services: | then deploy the platform services: | ||
> Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts. | > Note: For PUBLIC_IP, put down one of your Cluster Public IP that can access your Cluster's NodePorts. For IBM Cloud, you can get your Public IP with `bx cs workers <cluster_name>`. | ||
|
|
||
``` shell | ``` shell | ||
export VM_TYPE=none | export VM_TYPE=none | ||
export PUBLIC_IP=<Cluster Public IP> | export PUBLIC_IP=<Cluster Public IP> | ||
make deploy | |||
``` | |||
|
|
||
### 1.3 Installation using IBM Cloud Kubernetes Cluster | # Change the storage class to what's available on your Cloud Kubernetes Cluster. | ||
|
export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold"; | ||
To install FfDL to a proper IBM Cloud Kubernetes cluster, make sure `kubectl` points to the right namespace | make deploy-plugin | ||
and your machine is logged in with `bx login`, then deploy the platform services: | make quickstart-deploy | ||
I'm using the Mac OS Terminal App (Bash 3.2) and I get this error message:
although seemingly without negative consequences. The
|
|||
``` shell | |||
export VM_TYPE=ibmcloud | |||
export CLUSTER_NAME=<Your Cluster Name> # Replace <Your Cluster Name> with your IBM Cloud Cluster Name | |||
make deploy | |||
``` | ``` | ||
|
|
||
## 2. Test | ## 2. Test | ||
|
|
||
To submit a simple example training job that is included in this repo (see `etc/examples` folder): | To submit a simple example training job that is included in this repo (see `etc/examples` folder): | ||
|
|
||
``` | ``` shell | ||
make test-submit | make test-push-data-s3 | ||
make test-job-submit | |||
``` | ``` | ||
|
|
||
## 3. Monitoring | ## 3. Monitoring | ||
|
@@ -115,6 +109,9 @@ Please refer to the [developer guide](docs/developer-guide.md) for more details. | ||
|
|
||
## 5. Detailed Installation Instructions | ## 5. Detailed Installation Instructions | ||
|
|
||
0. If you don't have a Kubernetes Cluster, you can create a [Kubeadm-DIND](https://github.com/kubernetes-sigs/kubeadm-dind-cluster#using-preconfigured-scripts) Kubernetes Cluster on your local machine. We recommend you give at least 4 CPUs and 8GB of memory to your Docker. | |||
> For Mac users, visit the instructions on the [Docker website](https://docs.docker.com/docker-for-mac/#advanced) and learn how to give more memory to your Docker. | |||
|
|||
1. First, clone this repository and install the helm tiller on your Kubernetes cluster. | 1. First, clone this repository and install the helm tiller on your Kubernetes cluster. | ||
``` shell | ``` shell | ||
helm init | helm init | ||
|
@@ -124,11 +121,46 @@ kubectl get pods --all-namespaces | grep tiller-deploy | ||
# kube-system tiller-deploy-fb8d7b69c-pcvc2 1/1 Running | # kube-system tiller-deploy-fb8d7b69c-pcvc2 1/1 Running | ||
``` | ``` | ||
|
|
||
2. Now let's install all the necessary FfDL components using helm install. | 2. Define the necessary environment variables. | ||
> Note: If your Kubernetes Cluster version is 1.7 or below, please go to the [values.yaml](values.yaml) and change `k8s_1dot8_or_above` to **false**. | * 2.a. For Kubeadm-DIND Cluster only | ||
```shell | |||
export FFDL_PATH=$(pwd) | |||
export SHARED_VOLUME_STORAGE_CLASS="" | |||
``` | |||
|
|||
* 2.b. For Cloud Kubernetes Cluster | |||
```shell | |||
# Change the storage class to what's available on your Cloud Kubernetes Cluster. | |||
export SHARED_VOLUME_STORAGE_CLASS="ibmc-file-gold" | |||
``` | |||
|
|||
3. Install the Object Storage driver using helm install. | |||
* 3.a. For Kubeadm-DIND Cluster only | |||
```shell | |||
export FFDL_PATH=$(pwd) | |||
./bin/s3_driver.sh | |||
helm install storage-plugin --set dind=true,cloud=false | |||
``` | |||
|
|||
* 3.b. For Cloud Kubernetes Cluster | |||
```shell | |||
helm install storage-plugin | |||
``` | |||
|
|||
4. Create a static volume to store any metadata from FfDL. | |||
|
|||
```shell | |||
pushd bin | |||
./create_static_volumes.sh | |||
./create_static_volumes_config.sh | |||
# Wait while kubectl get pvc shows static-volume-1 in state Pending | |||
popd | |||
``` | |||
|
|||
5. Now let's install all the necessary FfDL components using helm install. | |||
|
|
||
``` shell | ``` shell | ||
helm install . | helm install . --set lcm.shared_volume_storage_class=$SHARED_VOLUME_STORAGE_CLASS | ||
``` | ``` | ||
> Note: If you want to upgrade an older version of FfDL, run | > Note: If you want to upgrade an older version of FfDL, run | ||
> `helm upgrade $(helm list | grep ffdl | awk '{print $1}' | head -n 1) .` | > `helm upgrade $(helm list | grep ffdl | awk '{print $1}' | head -n 1) .` | ||
|
@@ -153,32 +185,44 @@ helm status $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | grep STATU | ||
# STATUS: DEPLOYED | # STATUS: DEPLOYED | ||
``` | ``` | ||
|
|
||
3. Run the following script to configure Grafana for monitoring FfDL using the logging information from prometheus. | 6. Obtain the necessary port for Grafana, FfDL Web UI, local object storage, and FfDL restapi. | ||
> Note: If you are using a IBM Cloud Cluster, make sure you are logged in with `bx login`. | ```shell | ||
grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}') | |||
ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}') | |||
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}') | |||
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}') | |||
``` | |||
|
|||
* For Kubeadm-DIND Cluster, we need to run the below script to forward the port to the localhost machine since we don't want to exec into the docker image and install various dependencies. | |||
```shell | |||
./bin/dind-port-forward.sh | |||
``` | |||
|
|
||
7. Run the following commands to configure Grafana for monitoring FfDL using the logging information from prometheus. | |||
* 7.a. For Kubeadm-DIND Cluster only | |||
```shell | ```shell | ||
# If your Cluster is running on Minikube, replace "ibmcloud" to "minikube" | export VM_TYPE=none | ||
# If your Cluster is not running on Minikube or IBM Cloud, replace "ibmcloud" to "none" | export PUBLIC_IP=localhost | ||
export VM_TYPE=ibmcloud |
|
||
./bin/grafana.init.sh | |||
``` | |||
|
|
||
# Replace <Your Cluster Name> with your IBM Cloud Cluster Name if your cluster is on IBM Cloud. |
|
||
# Use export PUBLIC_IP if you are using a none VM_TYPE. A Cluster Public IP that can access your Cluster's NodePorts. | * 7.b. For Cloud Kubernetes Cluster. | ||
export CLUSTER_NAME=<Your Cluster Name> | > Note: If you are using IBM Cloud Cluster, you can obtain your k8s public ip using `bx cs workers <cluster-name>`. | ||
|
|||
``` shell | |||
export VM_TYPE=none | |||
export PUBLIC_IP=<Cluster Public IP> | export PUBLIC_IP=<Cluster Public IP> | ||
|
|
||
./bin/grafana.init.sh | ./bin/grafana.init.sh | ||
``` | ``` | ||
|
|
||
4. Lastly, run the following commands to obtain your Grafana, FfDL Web UI, and FfDL restapi endpoints. | 8. Lastly, run the following commands to obtain your Grafana, FfDL Web UI, and FfDL restapi endpoints. | ||
``` shell | ``` shell | ||
# Note: $(make --no-print-directory kubernetes-ip) simply gets the Public IP for your cluster. | # Note: $(make --no-print-directory kubernetes-ip) simply gets the Public IP for your cluster. | ||
node_ip=$(make --no-print-directory kubernetes-ip) | node_ip=$(make --no-print-directory kubernetes-ip) | ||
|
|
||
# Obtain all the necessary NodePorts for Grafana, Web UI, and RestAPI. | |||
grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}') | |||
ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}') | |||
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}') | |||
|
|||
# Echo statements to print out Grafana and Web UI URLs. | # Echo statements to print out Grafana and Web UI URLs. | ||
echo "Monitoring dashboard: http://$node_ip:$grafana_port/ (login: admin/admin)" | echo "Monitoring dashboard: http://$node_ip:$grafana_port/ (login: admin/admin)" | ||
echo "Web UI: http://$node_ip:$ui_port/#/login?endpoint=$node_ip:$restapi_port&username=test-user" | echo "Web UI: http://$node_ip:$ui_port/#/login?endpoint=$node_ip:$restapi_port&username=test-user" | ||
|
@@ -191,8 +235,6 @@ Congratulation, FfDL is now running on your Cluster. Now you can go to [Step 6]( | ||
In this example, we will run some simple jobs to train a convolutional network model using TensorFlow and Caffe. We will download a set of | In this example, we will run some simple jobs to train a convolutional network model using TensorFlow and Caffe. We will download a set of | ||
MNIST handwritten digit images, store them with Object Storage, and use the FfDL CLI to train a handwritten digit classification model. | MNIST handwritten digit images, store them with Object Storage, and use the FfDL CLI to train a handwritten digit classification model. | ||
|
|
||
> Note: For Minikube, make sure you have the latest TensorFlow Docker image by running `docker pull tensorflow/tensorflow` | |||
|
|||
### 6.1. Using FfDL Local S3 Based Object Storage | ### 6.1. Using FfDL Local S3 Based Object Storage | ||
|
|
||
1. Run the following commands to obtain the object storage endpoint from your cluster. | 1. Run the following commands to obtain the object storage endpoint from your cluster. | ||
|
@@ -232,8 +274,19 @@ binary). | ||
```shell | ```shell | ||
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}') | restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}') | ||
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test; | export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test; | ||
``` | |||
|
|
||
# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model | Replace the default object storage path with your s3_url. You can skip this step if your already modified the object storage path with your s3_url. | ||
```shell | |||
if [ "$(uname)" = "Darwin" ]; then | |||
sed -i '' s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml | |||
else | |||
sed -i s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml | |||
fi | |||
``` | |||
|
|||
Define the FfDL command line interface and run the training job with our default TensorFlow model | |||
```shell | |||
CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi) | CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi) | ||
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model | $CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model | ||
``` | ``` | ||
|
@@ -284,8 +337,6 @@ you can simply run `$CLI_CMD logs <MODEL_ID>` | ||
|
|
||
In this section we will demonstrate how to run a TensorFlow job with training data stored in Cloud Object Storage. | In this section we will demonstrate how to run a TensorFlow job with training data stored in Cloud Object Storage. | ||
|
|
||
> Note: This also can be done with other Cloud providers' Object Storage, but we will demonstrate how to use IBM Cloud Object Storage in this instructions. | |||
|
|||
1. Provision an S3 based Object Storage from your Cloud provider. Take note of your Authentication Endpoints, Access Key ID and Secret. | 1. Provision an S3 based Object Storage from your Cloud provider. Take note of your Authentication Endpoints, Access Key ID and Secret. | ||
|
|
||
> For IBM Cloud, you can provision an Object Storage from [IBM Cloud Dashboard](https://console.bluemix.net/catalog/infrastructure/cloud-object-storage?taxonomyNavigation=apps) or from [SoftLayer Portal](https://control.softlayer.com/storage/objectstorage). | > For IBM Cloud, you can provision an Object Storage from [IBM Cloud Dashboard](https://console.bluemix.net/catalog/infrastructure/cloud-object-storage?taxonomyNavigation=apps) or from [SoftLayer Portal](https://control.softlayer.com/storage/objectstorage). | ||
|
@@ -351,16 +402,27 @@ $CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model | ||
``` | ``` | ||
|
|
||
## 7. Clean Up | ## 7. Clean Up | ||
If you want to remove FfDL from your cluster, simply use the command below or run `helm delete <your FfDL release name>` | If you want to remove FfDL from your cluster, simply use the following commands. | ||
```shell | ```shell | ||
helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | ||
``` | ``` | ||
|
|
||
If you want to remove the storage driver and pvc from your cluster, run: | |||
```shell | |||
kubectl delete pvc static-volume-1 | |||
helm delete $(helm list | grep ibmcloud-object-storage-plugin | awk '{print $1}' | head -n 1) | |||
``` | |||
|
|||
For Kubeadm-DIND, you need to kill your forwarded ports. Note that the below command will kill all the ports that are created with `kubectl`. | |||
```shell | |||
kill $(lsof -i | grep kubectl | awk '{printf $2 " " }') | |||
``` | |||
|
|||
## 8. Troubleshooting | ## 8. Troubleshooting | ||
|
|
||
* FfDL has only been tested under Mac OS and Linux | * FfDL has only been tested under Mac OS and Linux | ||
|
|
||
* The default Minikube driver under Mac OS is VirtualBox, which is known for having issues with networking. | <!-- * The default Minikube driver under Mac OS is VirtualBox, which is known for having issues with networking. | ||
We generally recommend Mac OS users to install Minikube using the xhyve driver. | We generally recommend Mac OS users to install Minikube using the xhyve driver. | ||
|
|
||
* Also, when testing locally with Minikube, make sure to point the `docker` CLI to Minikube's Docker daemon: | * Also, when testing locally with Minikube, make sure to point the `docker` CLI to Minikube's Docker daemon: | ||
|
@@ -369,7 +431,7 @@ helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | ||
eval $(minikube docker-env) | eval $(minikube docker-env) | ||
``` | ``` | ||
* If you run into DNS name resolution issues using Minikube, make sure that the system uses only `10.0.0.10` | * If you run into DNS name resolution issues using Minikube, make sure that the system uses only `10.0.0.10` | ||
as the single nameserver. Using multiple nameservers can result in problems, in particular under Mac OS. | as the single nameserver. Using multiple nameservers can result in problems, in particular under Mac OS. --> | ||
|
|
||
* If `glide install` fails with an error complaining about non-existing paths (e.g., "Without src, cannot continue"), | * If `glide install` fails with an error complaining about non-existing paths (e.g., "Without src, cannot continue"), | ||
make sure to follow the standard Go directory layout (see [Prerequisites section]{#Prerequisites}). | make sure to follow the standard Go directory layout (see [Prerequisites section]{#Prerequisites}). | ||
|
@@ -378,6 +440,8 @@ helm delete $(helm list | grep ffdl | awk '{print $1}' | head -n 1) | ||
|
|
||
* When using the FfDL CLI to train a model, make sure your directory path doesn't have slashes `/` at the end. | * When using the FfDL CLI to train a model, make sure your directory path doesn't have slashes `/` at the end. | ||
|
|
||
* If your job is stuck in pending stage, you can try to redeploy the plugin with `helm install storage-plugin --set dind=true,cloud=false` for Kubeadm-DIND and `helm install storage-plugin` for general Kubernetes Cluster. Also, double check your training job manifest file to make sure you have the correct object storage credentials. | |||
|
|||
## 9. References | ## 9. References | ||
|
|
||
Based on IBM Research work in Deep Learning. | Based on IBM Research work in Deep Learning. | ||
|
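The `uname` branch in section 6.1 of the README is needed because BSD `sed` on Mac OS requires an argument after `-i`, while GNU `sed` does not. One way to sidestep the difference is to avoid `-i` and write through a temporary file. The following is only a sketch under stated assumptions: `replace_in_file` is a hypothetical helper name that does not exist in the FfDL repository, and it assumes neither string contains `#`, `&`, or backslashes.

```shell
# replace_in_file OLD NEW FILE
# Hypothetical helper (not part of FfDL): replace the first OLD on each line
# of FILE with NEW, without relying on the non-portable `sed -i`.
# Assumes OLD and NEW contain no '#' (the delimiter), '&', or backslashes.
replace_in_file() {
  tmp=$(mktemp) || return 1
  # Write the edited text to a temp file, then copy it back over the
  # original path so the file keeps its existing permissions.
  if sed "s#$1#$2#" "$3" > "$tmp" && cat "$tmp" > "$3"; then
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With this helper, the manifest update would read the same on both platforms: `replace_in_file "http://s3.default.svc.cluster.local" "$s3_url" etc/examples/tf-model/manifest.yml`.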
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -0,0 +1,17 @@ | |||
#!/usr/bin/env bash | |||
|
|||
git add lcm/lcmconfig/lcmconfig_test.go | |||
git add lcm/service/lcm/helper/ | |||
git add lcm/service/lcm/learner/ | |||
git add lcm/service/lcm/nonSplitTraining.go | |||
git add lcm/service/lcm/policies/ | |||
git add lcm/service/lcm/splitTraining.go | |||
git add metrics/log_collectors/tensorboard/Dockerfile | |||
git add metrics/log_collectors/training_data_service_client/match_log_file.py | |||
git add metrics/log_collectors/training_data_service_client/scan_log_dirs.py | |||
git add metrics/log_collectors/training_data_service_client/states.py | |||
git add metrics/log_collectors/training_data_service_client/training_data_buffered.py | |||
git add trainer/glide.yaml | |||
git add trainer/trainer/mongo.go | |||
git add trainer/trainer/queue.go | |||
git add trainer/trainer/queue_test.go |
Are we missing a `helm init` before `make deploy-plugin` here? Without that I get `Error: could not find a ready tiller pod`.
Oh, sorry we did add helm init, but we forgot to put a sleep loop to wait for tiller to be ready.
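A minimal sketch of the kind of sleep loop mentioned above. `wait_for` and `tiller_ready` are hypothetical helper names, not targets or scripts in the FfDL repository. The readiness check is an assumption built from the `kubectl get pods --all-namespaces | grep tiller-deploy` command already shown in the README, and the attempt count and delay are illustrative.

```shell
# wait_for ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper: re-run CMD until it succeeds, sleeping DELAY seconds
# between tries. Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Assumed readiness check: a tiller-deploy pod is listed as Running.
tiller_ready() {
  kubectl get pods --all-namespaces 2>/dev/null | grep tiller-deploy | grep -q Running
}
```

A deploy step could then be chained as `helm init && wait_for 30 5 tiller_ready && make deploy-plugin`, so the plugin install only starts once tiller reports Running or the loop gives up.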