This repository contains the core services of the *FfDL* (Fabric for Deep Learning) platform.

FfDL is a collaboration platform for:
- Framework-independent training of Deep Learning models on distributed hardware
- Open Deep Learning APIs
- Common instrumentation
- Inferencing in the cloud
- Running Deep Learning hosting in the user's private or public cloud
4. [Development](#4-development)
5. [Detailed Installation Instructions](#5-detailed-installation-instructions)
6. [Detailed Testing Instructions](#6-detailed-testing-instructions)
   - 6.1 [Using FfDL Local S3 Based Object Storage](#61-using-ffdl-local-s3-based-object-storage)
   - 6.2 [Using Cloud Object Storage](#62-using-cloud-object-storage)
7. [Clean Up](#7-clean-up)
8. [Troubleshooting](#8-troubleshooting)
9. [References](#9-references)
Congratulations, FfDL is now running on your cluster.
## 6. Detailed Testing Instructions

In this example, we will run some simple jobs to train a convolutional network model using TensorFlow and Caffe. We will download a set of
MNIST handwritten digit images, store them with Object Storage, and use the FfDL CLI to train a handwritten digit classification model.

> Note: For Minikube, make sure you have the latest TensorFlow Docker image by running `docker pull tensorflow/tensorflow`

### 6.1. Using FfDL Local S3 Based Object Storage

1. Run the following commands to obtain the object storage endpoint from your cluster.
```shell
node_ip=$(make --no-print-directory kubernetes-ip)
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')
s3_url=http://$node_ip:$s3_port
```
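
Optionally, you can verify that the endpoint responds before moving on; the check below is a generic `curl` probe, not a command from the FfDL tooling.

```shell
# Prints the HTTP status code returned by the endpoint (000 if it could not be reached)
curl -s -o /dev/null -w '%{http_code}\n' "$s3_url"
```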

2. Next, set up the default object storage access key ID and secret key. Then create buckets for all the necessary training data and models.
```shell
export AWS_ACCESS_KEY_ID=test; export AWS_SECRET_ACCESS_KEY=test; export AWS_DEFAULT_REGION=us-east-1;

$s3cmd mb s3://dlaas-trained-models
```
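
If the `$s3cmd` helper is not yet defined in your shell, the sketch below shows one way to set it up and create the remaining buckets. The alias is assumed to match the one used in section 6.2, and the bucket names are the ones referenced later in this walkthrough.

```shell
# Assumed alias, matching the one defined in section 6.2 below
s3cmd="aws --endpoint-url=$s3_url s3"

# Buckets referenced in the following steps
$s3cmd mb s3://tf_training_data
$s3cmd mb s3://tf_trained_model
$s3cmd mb s3://mnist_lmdb_data
```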

3. Now, create a temporary directory, download the necessary images for training and labeling our TensorFlow model, and upload those images
to your tf_training_data bucket.

```shell
# Download each MNIST file once and copy it into the tf_training_data bucket
mkdir tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
$s3cmd cp tmp/$file s3://tf_training_data/$file
done
```

4. Next, let's download all the necessary training and testing images in [LMDB format](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) for our Caffe model
and upload those images to your mnist_lmdb_data bucket.

```shell
for phase in train test;
do
done
```

5. Now you should have all the necessary training data in your object storage. Let's go ahead and set up your restapi endpoint
and default credentials for Deep Learning as a Service. Once you have done that, you can start running jobs using the FfDL CLI (executable
binary), as sketched below.
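
A sketch of these setup commands, mirroring step 6 of section 6.2 below and assuming the default test credentials used throughout this walkthrough:

```shell
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

# Pick the CLI binary for your platform and train the default TensorFlow model
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
```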

Alternatively, you can submit the job from the FfDL UI: click `Submit Training Job` to run your job.

![ui-example](docs/images/ui-example.png)

### 6.2. Using Cloud Object Storage

In this section we will demonstrate how to run a TensorFlow job with training data stored in Cloud Object Storage.

> Note: This can also be done with other cloud providers' Object Storage, but these instructions demonstrate IBM Cloud Object Storage.

1. Provision an S3-based Object Storage instance from your cloud provider. Take note of your authentication endpoint, access key ID, and secret access key.

> For IBM Cloud, you can provision an Object Storage instance from the [IBM Cloud Dashboard](https://console.bluemix.net/catalog/infrastructure/cloud-object-storage?taxonomyNavigation=apps) or from the [SoftLayer Portal](https://control.softlayer.com/storage/objectstorage).

2. Set up your S3 command with the Object Storage credentials you just obtained.

```shell
s3_url=http://<Your object storage authentication endpoint>
export AWS_ACCESS_KEY_ID=<Your object storage Access Key ID>
export AWS_SECRET_ACCESS_KEY=<Your object storage Access Key Secret>

s3cmd="aws --endpoint-url=$s3_url s3"
```
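
To verify that the endpoint and credentials work, you can list the buckets in your account; this is an optional check and should succeed even if no buckets exist yet.

```shell
# List all buckets visible with these credentials
$s3cmd ls
```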

3. Next, let's create two buckets: one for storing the training data and another for storing the training results.
```shell
trainingDataBucket=<unique bucket name for training data storage>
trainingResultBucket=<unique bucket name for training result storage>

$s3cmd mb s3://$trainingDataBucket
$s3cmd mb s3://$trainingResultBucket
```

4. Now, create a temporary directory, download the necessary images for training and labeling our TensorFlow model, and upload those images to your training data bucket.

```shell
mkdir tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
$s3cmd cp tmp/$file s3://$trainingDataBucket/$file
done
```
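
Optionally, confirm that the four MNIST files landed in your training data bucket:

```shell
# List the objects uploaded in the previous step
$s3cmd ls s3://$trainingDataBucket/
```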

5. Next, we need to modify our example job manifest to use your Cloud Object Storage, using the following `sed` commands.
```shell
if [ "$(uname)" = "Darwin" ]; then
sed -i '' s#"tf_training_data"#"$trainingDataBucket"# etc/examples/tf-model/manifest.yml
sed -i '' s#"tf_trained_model"#"$trainingResultBucket"# etc/examples/tf-model/manifest.yml
sed -i '' s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml
sed -i '' s#"user_name: test"#"user_name: $AWS_ACCESS_KEY_ID"# etc/examples/tf-model/manifest.yml
sed -i '' s#"password: test"#"password: $AWS_SECRET_ACCESS_KEY"# etc/examples/tf-model/manifest.yml
else
sed -i s#"tf_training_data"#"$trainingDataBucket"# etc/examples/tf-model/manifest.yml
sed -i s#"tf_trained_model"#"$trainingResultBucket"# etc/examples/tf-model/manifest.yml
sed -i s#"http://s3.default.svc.cluster.local"#"$s3_url"# etc/examples/tf-model/manifest.yml
sed -i s#"user_name: test"#"user_name: $AWS_ACCESS_KEY_ID"# etc/examples/tf-model/manifest.yml
sed -i s#"password: test"#"password: $AWS_SECRET_ACCESS_KEY"# etc/examples/tf-model/manifest.yml
fi
```
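
As a quick sanity check, you can confirm that the manifest now references your buckets:

```shell
# Both bucket names should appear in the updated manifest
grep -E "$trainingDataBucket|$trainingResultBucket" etc/examples/tf-model/manifest.yml
```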

6. Now you should have all the necessary training data in your training data bucket. Let's go ahead and set up your restapi endpoint
and default credentials for Deep Learning as a Service. Once you have done that, you can start running jobs using the FfDL CLI (executable
binary).

```shell
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
# node_ip was obtained in section 6.1, step 1
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
```

## 7. Clean Up
If you want to remove FfDL from your cluster, simply use the command below or run `helm delete <your FfDL release name>`.
```shell