# Getting started

First, we scale the cluster up. Note that on an auto-scaling cluster, Google will suspend nodes that are not in use,
so make sure to have the experiment prepared before running the commands.

The following is assumed ready:
* GKE/Kubernetes cluster (see also `terraform/terraform_notebook.ipynb`)
    * 2 node pools (a default pool for system & dependencies, and an experiment pool)
* Docker image (including dataset, to speed up starting experiments).
    * Within a bash shell
        * Make sure to have `requirements-cpu.txt` (or `requirements-gpu.txt`) installed in a virtual environment (venv/conda). You can run `pip3 install -r requirements-cpu.txt`.
    * First run the extractor (locally) `python3 -m fltk extractor configs/example_cloud_experiment.json`
        *  This downloads datasets to be included in the docker image.
    * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/$PROJECT_ID/fltk`
    * Push to your gcr.io repository `docker push gcr.io/$PROJECT_ID/fltk`


With that setup, first set some variables used throughout the experiment.


In [1]:
##################
### CHANGE ME! ###
##################
PROJECT_ID="test-bed-fltk-group16-mb"
CLUSTER_NAME="fltk-testbed-cluster"
DEFAULT_POOL="default-node-pool"
EXPERIMENT_POOL="medium-fltk-pool-1"
REGION="us-central1-c"

# In case we do not yet have the credentials/kubeconfig
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID

Fetching cluster endpoint and auth data.
kubeconfig entry generated for fltk-testbed-cluster.
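
Every later command depends on the variables set above, so it can help to check them before calling `gcloud`. The following is a minimal sketch; `require_vars` is a hypothetical helper and not part of FLTK.

```shell
# Hypothetical helper (not part of FLTK): fail early if a required variable is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Read the variable whose name is stored in $name (POSIX-compatible indirection).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, once the variables have been set:
# require_vars PROJECT_ID CLUSTER_NAME DEFAULT_POOL EXPERIMENT_POOL REGION
```

The function returns a non-zero status when at least one variable is missing, so it can guard the commands that follow.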


Scale the default-node-pool up.

In [2]:
# These commands might take a while to complete.
gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
     --num-nodes 1 --region $REGION --quiet

Resizing fltk-testbed-cluster...

Updated [https://container.googleapis.com/v1/projects/test-bed-fltk-group16-mb/zones/us-central1-c/clusters/fltk-testbed-cluster].
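
The same resize command works for any node pool. As a sketch, the hypothetical wrapper below builds the command from the variables defined earlier and can print it instead of running it. The `DRY_RUN` switch and the function name are assumptions for illustration, and the node count you pass for the experiment pool depends on your own setup.

```shell
# Hypothetical wrapper around the resize command used above.
# Set DRY_RUN=1 to print the command instead of executing it.
resize_pool() {
    pool="$1"
    nodes="$2"
    # Rebuild the positional parameters as the full gcloud command line.
    set -- gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$pool" \
        --num-nodes "$nodes" --region "$REGION" --quiet
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage:
# DRY_RUN=1 resize_pool "$EXPERIMENT_POOL" 1   # inspect the command first
# resize_pool "$EXPERIMENT_POOL" 1             # then actually resize
```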


## Preparation
If you have already tested something or run another experiment, first remove the Orchestrator deployment. This will not delete any experiment data, as it persists on one of the ReadWriteMany PVCs.


Currently, the Orchestrator is deployed using a `Deployment` definition; a future version will replace this with a `Job` definition, to make this step unnecessary. For experiments this means the following:

1. Only a single deployment can exist at a time in a given namespace. This includes 'completed' experiments.
2. For running batches of experiments, a BatchOrchestrator is provided.


ℹ️ This will not remove any data, but if your Orchestrator is still (or already) running experiments, this will stop the deployment. Running training jobs will not be stopped; use `kubectl` for that. ConfigMaps created by the Orchestrator (to provide experiment configurations) will not be removed either. See the commented commands in the cell below.

In [3]:
# If you want to delete all pytorch trainjobs, uncomment the command below.
# kubectl delete pytorchjobs.kubeflow.org --all --namespace test

# If you want to delete all existing ConfigMap objects in a namespace, uncomment the command below.
# kubectl delete configmaps --all --namespace test

helm uninstall -n test flearner

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error: uninstall: Release not loaded: flearner: release: not found
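
The error above is harmless: it only means there was no `flearner` release to remove. To avoid the failing exit status, one option is to check for the release first. This is a sketch; `uninstall_if_present` is a hypothetical helper that relies on `helm status` exiting non-zero when the release does not exist.

```shell
# Hypothetical helper: uninstall a Helm release only when it exists in the namespace.
uninstall_if_present() {
    namespace="$1"
    release="$2"
    # `helm status` exits non-zero when the release cannot be found.
    if helm status -n "$namespace" "$release" >/dev/null 2>&1; then
        helm uninstall -n "$namespace" "$release"
    else
        echo "Release $release not found in namespace $namespace; nothing to uninstall."
    fi
}

# Usage:
# uninstall_if_present test flearner
```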



## Install extractor

Deploy the TensorBoard service and persistent volumes, required for deployment of the orchestrator's chart.

In [4]:
helm upgrade --install -n test extractor ../charts/extractor -f ../charts/fltk-values.yaml \
    --set provider.projectName=$PROJECT_ID

Release "extractor" does not exist. Installing it now.
NAME: extractor
LAST DEPLOYED: Mon Oct 17 09:48:15 2022
NAMESPACE: test
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the FLTK extractors Tensorboard URL by running:

export POD_NAME=$(kubectl get pods -n test -l "app.kubernetes.io/name=fltk.extractor" -o jsonpath="{.items[0].metadata.name}")
echo http://localhost:6006/
kubectl -n test port-forward $POD_NAME 6006:6006


## Define experiment configuration files

Experiments are currently deployed through a Helm Deployment. A future release (™️) will rework this into a Job definition, which makes the template easier to reuse.


> The `EXPERIMENT_FILE` will contain the description of the experiments.
>
> The `CLUSTER_CONFIG` will contain shared configurations for logging, Orchestrator configuration and replication information.

In [35]:
# HOMEWORKS
EXPERIMENT_FILE="/home/mattiacs/Documents/Developer/Edu/TUDelft/Year_4/Quarter_1/CS4215-QPECS/homeworks_2/ex6_configs/distributed_4_node.json"

# PROJECT
# EXPERIMENT_FILE="../configs/federated_tasks/example_arrival_config.json"
# EXPERIMENT_FILE="../configs/distributed_tasks/example_arrival_config.json"
CLUSTER_CONFIG="../configs/example_cloud_experiment.json"

# EXPERIMENT_FILE="../project/configs/project_arrival_config_comb_6.json"
# CLUSTER_CONFIG="../project/configs/project_cloud_experiment.json"
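
Both paths are later handed to `helm` as files, so a wrong path only surfaces at install time. The sketch below checks them up front; `check_config_file` is a hypothetical helper, and the JSON syntax check only runs when `python3` happens to be on the `PATH`.

```shell
# Hypothetical helper: verify that a config path points to a readable, non-empty file.
check_config_file() {
    path="$1"
    if [ ! -f "$path" ] || [ ! -r "$path" ] || [ ! -s "$path" ]; then
        echo "Config file missing, unreadable or empty: $path" >&2
        return 1
    fi
    # Optional JSON syntax check, skipped when python3 is not installed.
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$path" >/dev/null || return 1
    fi
}

# Usage:
# check_config_file "$EXPERIMENT_FILE" && check_config_file "$CLUSTER_CONFIG"
```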

## Setup experiment variables
Next, we will deploy the experiments.


We provide a configuration file, `charts/fltk-values.yaml`. In this file, change the values under the `provider` block: set `projectName` to your Google Cloud Project ID.

```yaml
provider:
    domain: gcr.io
    projectName: CHANGE_ME!
    imageName: fltk:latest
```

We use the `--set-file` flag for `helm` because Helm currently does not support using files outside of the chart root directory (in this case `charts/orchestrator`). With `--set-file` we can provide these files dynamically. See also [this Helm issue](https://github.com/helm/helm/issues/3276).


In [36]:
helm uninstall -n test experiment-orchestrator
helm install -n test experiment-orchestrator ../charts/orchestrator -f ../charts/fltk-values.yaml \
    --set-file orchestrator.experiment=$EXPERIMENT_FILE,orchestrator.configuration=$CLUSTER_CONFIG \
    --set provider.projectName=$PROJECT_ID

release "experiment-orchestrator" uninstalled
NAME: experiment-orchestrator
LAST DEPLOYED: Mon Oct 17 12:24:59 2022
NAMESPACE: test
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You successfully launched an experiment configuration on your cluster in test.

N.B. Make sure to collect all data after completing your experiment!
N.B. Re-installing the orchestrator WILL RESULT IN DELETION OF ALL TRAINJOBS and PODS!
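
Right after the install, the orchestrator pod may still be starting, in which case fetching its logs fails. A small sketch for waiting on it follows. `wait_for_orchestrator` is a hypothetical helper, and the pod name `fl-server` is taken from the `kubectl logs` command used in this notebook. If the pod has already run to completion, the wait times out instead of succeeding.

```shell
# Hypothetical helper: block until the orchestrator pod reports Ready.
wait_for_orchestrator() {
    namespace="${1:-test}"
    timeout="${2:-300s}"
    # The pod name fl-server matches the one used with `kubectl logs` in this notebook.
    kubectl wait --for=condition=Ready pod/fl-server -n "$namespace" --timeout="$timeout"
}

# Usage:
# wait_for_orchestrator test 120s && kubectl logs -n test fl-server
```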


In [37]:
# To get logs from the orchestrator
kubectl logs -n test fl-server

10-17-2022 10:25:08 root         INFO     Loading file config/configuration.fltk.json
10-17-2022 10:25:08 root         INFO     Starting in cluster mode.
10-17-2022 10:25:08 root         INFO     Starting with experiment replication: 0 with seed: 42
No argument path is provided.
10-17-2022 10:25:08 root         INFO     Starting as Orchestrator
10-17-2022 10:25:08 root         INFO     Starting Orchestrator, initializing resources....
10-17-2022 10:25:08 root         INFO     Loading in cluster configuration file
10-17-2022 10:25:08 root         INFO     Pointing configuration to in cluster configuration.
10-17-2022 10:25:08 root         INFO     Starting cluster manager
10-17-2022 10:25:08 ClusterManager INFO     Spinning up cluster manager...
10-17-2022 10:25:08 ResourceWatchDog INFO     Starting resource watchdog
10-17-2022 10:25:08 ResourceWatchDog INFO     Fetching node 
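
The learner logs fetched in the next cell are long and report the running loss as `[epoch, batch] loss: value`. To keep only those values, the lines can be filtered into CSV rows. This is a sketch that assumes the log format shown in this notebook; `extract_loss` and `TRAINJOB_POD` are hypothetical names.

```shell
# Hypothetical helper: turn "[epoch, batch] loss: value" log lines into "epoch,batch,loss" rows.
extract_loss() {
    # -n suppresses lines that do not match; the p flag prints only rewritten matches.
    sed -n 's/.*\[ *\([0-9][0-9]*\), *\([0-9][0-9]*\)\] loss: \([0-9.][0-9.]*\).*/\1,\2,\3/p'
}

# Usage (TRAINJOB_POD is a placeholder for one of your own trainjob pods):
# kubectl logs -n test "$TRAINJOB_POD" | extract_loss > loss.csv
```

For a line ending in `[24,    50] loss: 0.402`, this prints `24,50,0.402`. Lines without a loss entry are dropped.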

In [62]:
# To get logs from learners (example)
kubectl logs -n test trainjob-3a0b2d2c-864c-4f4c-8647-248465c441c4-master-0

# To get logs from learners (federated learning)
kubectl logs -n test trainjob-3a0b2d2c-864c-4f4c-8647-248465c441c4-worker-0

10-17-2022 10:25:16 root         INFO     Loading file config/configuration.fltk.json
10-17-2022 10:25:16 root         INFO     Starting in client mode
10-17-2022 10:25:16 root         INFO     Starting with host=trainjob-3a0b2d2c-864c-4f4c-8647-248465c441c4-master-0 and port=23456
10-17-2022 10:25:16 root         INFO     Initializing backend for training process: gloo
10-17-2022 10:25:17 torch.distributed.distributed_c10d INFO     Added key: store_based_barrier_key:1 to store for rank: 0
10-17-2022 10:25:17 torch.distributed.distributed_c10d INFO     Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
10-17-2022 10:25:17 root         INFO     Starting Creating client with 0
10-17-2022 10:25:17 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     Initializing learning client
10-17-2022 10:25:18 root         INFO     Getting net: Nets.fashi

10-17-2022 11:07:46 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [24,    50] loss: 0.402
10-17-2022 11:08:28 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [24,   100] loss: 0.400
10-17-2022 11:08:50 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [25,     0] loss: 0.010
10-17-2022 11:09:31 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [25,    50] loss: 0.397
10-17-2022 11:10:12 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [25,   100] loss: 0.395
10-17-2022 11:10:35 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [26,     0] loss: 0.010
10-17-2022 11:11:16 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [26,    50] loss: 0.392
10-17-2022 11:11:56 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [26,   100] loss: 0.390
10-17-2022 11:12:18 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [27,     0] loss: 0.010
10-17-2022 11:12:59 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [27,    50] loss: 0.388
10-17-2022

10-17-2022 11:56:14 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [52,     0] loss: 0.008
10-17-2022 11:56:56 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [52,    50] loss: 0.318
10-17-2022 11:57:38 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [52,   100] loss: 0.315
10-17-2022 11:58:00 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [53,     0] loss: 0.008
10-17-2022 11:58:43 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [53,    50] loss: 0.316
10-17-2022 11:59:26 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [53,   100] loss: 0.314
10-17-2022 11:59:48 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [54,     0] loss: 0.008
10-17-2022 12:00:29 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [54,    50] loss: 0.315
10-17-2022 12:01:11 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [54,   100] loss: 0.313
10-17-2022 12:01:33 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [55,     0] loss: 0.008
10-17-2022

10-17-2022 12:45:20 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [79,   100] loss: 0.290
10-17-2022 12:45:43 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [80,     0] loss: 0.008
10-17-2022 12:46:24 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [80,    50] loss: 0.292
10-17-2022 12:47:05 Client-0-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [80,   100] loss: 0.289
No argument path is provided.
[EpochData(epoch_id=1, num_epochs=81, duration_train=98596, duration_test=7019, loss_train=1.2657363176345826, accuracy=71.64, loss=20.952548563480377, class_precision=array([0.69230769, 0.85958904, 0.58113208, 0.64261168, 0.55932203,
       0.79674797, 0.30882353, 0.83333333, 0.90134529, 0.84528302]), class_recall=array([0.63333333, 0.93656716, 0.57249071, 0.69259259, 0.64705882,
       0.85217391, 0.19004525, 0.81967213, 0.85169492, 0.94514768]), confusion_mat=array([[171,   4,   6,  41,   2,   5,  37,   1,   3,   0],
       [  1, 251,   0,   8,   6,   1,   0,   0, 

       0.93043478, 0.45701357, 0.8647541 , 0.93220339, 0.9535865 ]), confusion_mat=array([[200,   0,   4,  25,   2,   3,  31,   0,   5,   0],
       [  0, 257,   2,   6,   2,   0,   0,   0,   1,   0],
       [  4,   0, 188,   2,  37,   1,  36,   0,   1,   0],
       [  5,   3,   3, 228,   9,   1,  21,   0,   0,   0],
       [  1,   0,  33,  10, 188,   0,  22,   0,   1,   0],
       [  0,   0,   0,   1,   0, 214,   1,  10,   0,   4],
       [ 44,   0,  34,  10,  26,   1, 101,   0,   5,   0],
       [  0,   0,   0,   0,   0,  13,   0, 211,   0,  20],
       [  1,   1,   0,   7,   2,   3,   1,   1, 220,   0],
       [  0,   0,   0,   1,   0,   3,   0,   7,   0, 226]])), EpochData(epoch_id=9, num_epochs=81, duration_train=980008, duration_test=8889, loss_train=0.5321717250347138, accuracy=81.76, loss=10.959771275520325, class_precision=array([0.78378378, 0.98473282, 0.71923077, 0.79513889, 0.70260223,
       0.89915966, 0.4952381 , 0.92173913, 0.94444444, 0.908     ]), class_recall=array([

       0.93103448, 0.54166667, 0.93191489, 0.94092827, 0.92307692]), class_recall=array([0.77777778, 0.9738806 , 0.73605948, 0.87407407, 0.76078431,
       0.93913043, 0.52941176, 0.89754098, 0.94491525, 0.96202532]), confusion_mat=array([[210,   0,   4,  16,   1,   2,  32,   0,   5,   0],
       [  0, 261,   1,   5,   0,   0,   0,   0,   1,   0],
       [  5,   0, 198,   2,  31,   1,  31,   0,   1,   0],
       [  6,   1,   3, 236,   9,   0,  15,   0,   0,   0],
       [  1,   0,  28,  12, 194,   0,  19,   0,   1,   0],
       [  0,   0,   0,   1,   0, 216,   0,   9,   0,   4],
       [ 38,   0,  27,   8,  25,   0, 117,   0,   6,   0],
       [  0,   0,   0,   0,   0,  10,   0, 219,   0,  15],
       [  1,   1,   1,   6,   0,   2,   2,   0, 223,   0],
       [  0,   0,   0,   1,   0,   1,   0,   7,   0, 228]])), EpochData(epoch_id=17, num_epochs=81, duration_train=1820904, duration_test=6894, loss_train=0.44218942523002625, accuracy=84.36, loss=9.476663172245026, class_precision=array

       0.93991416, 0.55458515, 0.94468085, 0.9535865 , 0.93061224]), class_recall=array([0.77407407, 0.9738806 , 0.76579926, 0.88518519, 0.76470588,
       0.95217391, 0.57466063, 0.90983607, 0.95762712, 0.96202532]), confusion_mat=array([[209,   0,   4,  14,   1,   2,  36,   0,   4,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 206,   2,  24,   1,  29,   0,   1,   0],
       [  6,   1,   3, 239,   8,   0,  13,   0,   0,   0],
       [  1,   0,  26,  11, 195,   0,  21,   0,   1,   0],
       [  0,   0,   0,   1,   0, 219,   0,   6,   0,   4],
       [ 35,   0,  26,   7,  22,   0, 127,   0,   4,   0],
       [  0,   0,   0,   0,   0,   9,   0, 222,   0,  13],
       [  0,   1,   1,   5,   0,   1,   2,   0, 226,   0],
       [  0,   0,   0,   1,   0,   1,   0,   7,   0, 228]])), EpochData(epoch_id=25, num_epochs=81, duration_train=2708009, duration_test=6895, loss_train=0.3949971377849579, accuracy=85.32, loss=8.719396591186523, class_precision=array(

       0.94420601, 0.57264957, 0.94491525, 0.95416667, 0.93442623]), class_recall=array([0.77037037, 0.9738806 , 0.76951673, 0.88888889, 0.77647059,
       0.95652174, 0.60633484, 0.91393443, 0.97033898, 0.96202532]), confusion_mat=array([[208,   0,   5,  13,   1,   2,  37,   0,   4,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 207,   2,  23,   0,  30,   0,   1,   0],
       [  6,   1,   3, 240,   9,   0,  11,   0,   0,   0],
       [  1,   0,  26,  10, 198,   0,  19,   0,   1,   0],
       [  0,   0,   0,   0,   0, 220,   1,   6,   0,   3],
       [ 33,   0,  23,   6,  21,   0, 134,   0,   4,   0],
       [  0,   0,   0,   0,   0,   8,   0, 223,   0,  13],
       [  0,   1,   0,   3,   0,   2,   1,   0, 229,   0],
       [  0,   0,   0,   1,   0,   1,   0,   7,   0, 228]])), EpochData(epoch_id=33, num_epochs=81, duration_train=3544307, duration_test=6701, loss_train=0.3635659044981003, accuracy=86.0, loss=8.236785501241684, class_precision=array([

       0.94827586, 0.58723404, 0.94537815, 0.95435685, 0.9382716 ]), class_recall=array([0.77407407, 0.9738806 , 0.78438662, 0.8962963 , 0.77647059,
       0.95652174, 0.62443439, 0.92213115, 0.97457627, 0.96202532]), confusion_mat=array([[209,   0,   5,  12,   1,   2,  37,   0,   4,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 211,   2,  20,   0,  29,   0,   1,   0],
       [  6,   1,   3, 242,   9,   0,   9,   0,   0,   0],
       [  1,   0,  28,   8, 198,   0,  19,   0,   1,   0],
       [  0,   0,   0,   0,   0, 220,   1,   6,   0,   3],
       [ 29,   0,  23,   6,  21,   0, 138,   0,   4,   0],
       [  0,   0,   0,   0,   0,   7,   0, 225,   0,  12],
       [  0,   1,   0,   2,   0,   2,   1,   0, 230,   0],
       [  0,   0,   0,   1,   0,   1,   0,   7,   0, 228]])), EpochData(epoch_id=41, num_epochs=81, duration_train=4393996, duration_test=6902, loss_train=0.3398309960961342, accuracy=86.56, loss=7.88749635219574, class_precision=array([

       0.94805195, 0.60606061, 0.93801653, 0.95833333, 0.94605809]), class_recall=array([0.78148148, 0.9738806 , 0.79182156, 0.89259259, 0.8       ,
       0.95217391, 0.63348416, 0.93032787, 0.97457627, 0.96202532]), confusion_mat=array([[211,   0,   5,  10,   1,   2,  38,   0,   3,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 213,   2,  19,   0,  28,   0,   1,   0],
       [  6,   1,   5, 241,   9,   0,   8,   0,   0,   0],
       [  1,   0,  28,   6, 204,   0,  15,   0,   1,   0],
       [  0,   0,   0,   0,   0, 219,   1,   7,   0,   3],
       [ 28,   0,  22,   6,  21,   0, 140,   0,   4,   0],
       [  0,   0,   0,   0,   0,   7,   0, 227,   0,  10],
       [  0,   1,   0,   2,   1,   2,   0,   0, 230,   0],
       [  0,   0,   0,   0,   0,   1,   0,   8,   0, 228]])), EpochData(epoch_id=49, num_epochs=81, duration_train=5237004, duration_test=7405, loss_train=0.3206969606876373, accuracy=86.96, loss=7.621096581220627, class_precision=array(

       [  0,   0,   0,   0,   0,   1,   0,   8,   0, 228]])), EpochData(epoch_id=56, num_epochs=81, duration_train=5978406, duration_test=6810, loss_train=0.310810475051403, accuracy=87.28, loss=7.494878500699997, class_precision=array([0.83076923, 0.99239544, 0.78228782, 0.89219331, 0.7953668 ,
       0.95633188, 0.62666667, 0.93852459, 0.95833333, 0.95      ]), class_recall=array([0.8       , 0.9738806 , 0.78810409, 0.88888889, 0.80784314,
       0.95217391, 0.63800905, 0.93852459, 0.97457627, 0.96202532]), confusion_mat=array([[216,   0,   5,  10,   1,   1,  33,   0,   4,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 212,   3,  20,   0,  27,   0,   1,   0],
       [  7,   1,   5, 240,   9,   0,   8,   0,   0,   0],
       [  1,   0,  27,   6, 206,   0,  14,   0,   1,   0],
       [  0,   0,   0,   0,   0, 219,   1,   7,   0,   3],
       [ 29,   0,  21,   5,  22,   0, 141,   0,   3,   0],
       [  0,   0,   0,   0,   0,   6,   0, 229,   0,   9],

       [  1,   0,  27,   6, 206,   0,  14,   0,   1,   0],
       [  0,   0,   0,   0,   0, 219,   1,   7,   0,   3],
       [ 30,   0,  22,   4,  21,   0, 141,   0,   3,   0],
       [  0,   0,   0,   0,   0,   6,   0, 229,   0,   9],
       [  0,   1,   0,   2,   1,   2,   0,   0, 230,   0],
       [  0,   0,   0,   0,   0,   2,   0,   8,   0, 227]])), EpochData(epoch_id=64, num_epochs=81, duration_train=6821600, duration_test=6706, loss_train=0.30310182482004167, accuracy=87.28, loss=7.395003229379654, class_precision=array([0.82824427, 0.99239544, 0.78148148, 0.89219331, 0.79844961,
       0.95217391, 0.63111111, 0.93852459, 0.95833333, 0.94979079]), class_recall=array([0.8037037 , 0.9738806 , 0.78438662, 0.88888889, 0.80784314,
       0.95217391, 0.64253394, 0.93852459, 0.97457627, 0.95780591]), confusion_mat=array([[217,   0,   5,  11,   1,   1,  31,   0,   4,   0],
       [  1, 261,   1,   3,   0,   0,   1,   0,   1,   0],
       [  6,   0, 211,   3,  20,   0,  28,   0,   1,   0

       [  1, 262,   1,   2,   0,   0,   1,   0,   1,   0],
       [  6,   0, 212,   3,  20,   0,  27,   0,   1,   0],
       [  7,   1,   5, 240,   9,   0,   8,   0,   0,   0],
       [  1,   0,  27,   6, 207,   0,  13,   0,   1,   0],
       [  0,   0,   0,   0,   0, 219,   1,   7,   0,   3],
       [ 29,   0,  21,   4,  21,   0, 143,   0,   3,   0],
       [  0,   0,   0,   0,   0,   6,   0, 229,   0,   9],
       [  0,   1,   0,   2,   1,   2,   0,   0, 230,   0],
       [  0,   0,   0,   0,   0,   2,   0,   8,   0, 227]])), EpochData(epoch_id=72, num_epochs=81, duration_train=7671296, duration_test=7012, loss_train=0.2959287521243095, accuracy=87.56, loss=7.305462539196014, class_precision=array([0.83524904, 0.99242424, 0.78308824, 0.8988764 , 0.7992278 ,
       0.95217391, 0.64285714, 0.93852459, 0.95833333, 0.94979079]), class_recall=array([0.80740741, 0.97761194, 0.79182156, 0.88888889, 0.81176471,
       0.95217391, 0.65158371, 0.93852459, 0.97457627, 0.95780591]), confusion_ma

       0.95633188, 0.64864865, 0.9382716 , 0.95473251, 0.94583333]), class_recall=array([0.80740741, 0.97761194, 0.79553903, 0.88888889, 0.81568627,
       0.95217391, 0.65158371, 0.93442623, 0.98305085, 0.95780591]), confusion_mat=array([[218,   0,   6,  10,   1,   1,  30,   0,   4,   0],
       [  1, 262,   1,   2,   0,   0,   1,   0,   1,   0],
       [  6,   0, 214,   3,  20,   0,  25,   0,   1,   0],
       [  7,   1,   5, 240,   9,   0,   8,   0,   0,   0],
       [  1,   0,  26,   6, 208,   0,  13,   0,   1,   0],
       [  0,   0,   0,   0,   0, 219,   1,   7,   0,   3],
       [ 27,   0,  21,   4,  21,   0, 144,   0,   4,   0],
       [  0,   0,   0,   0,   0,   6,   0, 228,   0,  10],
       [  0,   1,   0,   1,   1,   1,   0,   0, 232,   0],
       [  0,   0,   0,   0,   0,   2,   0,   8,   0, 227]])), EpochData(epoch_id=80, num_epochs=81, duration_train=8520000, duration_test=6798, loss_train=0.2892460376024246, accuracy=87.64, loss=7.223932355642319, class_precision=array(

10-17-2022 10:54:43 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [17,    50] loss: 0.444
10-17-2022 10:55:25 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [17,   100] loss: 0.442
10-17-2022 10:55:48 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [18,     0] loss: 0.011
10-17-2022 10:56:32 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [18,    50] loss: 0.437
10-17-2022 10:57:16 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [18,   100] loss: 0.435
10-17-2022 10:57:40 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [19,     0] loss: 0.011
10-17-2022 10:58:26 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [19,    50] loss: 0.430
10-17-2022 10:59:11 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [19,   100] loss: 0.428
10-17-2022 10:59:35 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [20,     0] loss: 0.011
10-17-2022 11:00:22 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [20,    50] loss: 0.424
10-17-2022

10-17-2022 11:43:58 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [45,     0] loss: 0.009
10-17-2022 11:44:39 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [45,    50] loss: 0.332
10-17-2022 11:45:19 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [45,   100] loss: 0.330
10-17-2022 11:45:41 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [46,     0] loss: 0.008
10-17-2022 11:46:23 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [46,    50] loss: 0.330
10-17-2022 11:47:04 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [46,   100] loss: 0.327
10-17-2022 11:47:27 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [47,     0] loss: 0.008
10-17-2022 11:48:09 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [47,    50] loss: 0.328
10-17-2022 11:48:49 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [47,   100] loss: 0.325
10-17-2022 11:49:12 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [48,     0] loss: 0.008
10-17-2022

10-17-2022 12:32:56 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [72,   100] loss: 0.296
10-17-2022 12:33:19 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [73,     0] loss: 0.008
10-17-2022 12:34:00 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [73,    50] loss: 0.298
10-17-2022 12:34:42 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [73,   100] loss: 0.295
10-17-2022 12:35:05 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [74,     0] loss: 0.008
10-17-2022 12:35:46 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [74,    50] loss: 0.297
10-17-2022 12:36:27 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [74,   100] loss: 0.294
10-17-2022 12:36:50 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [75,     0] loss: 0.008
10-17-2022 12:37:32 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [75,    50] loss: 0.296
10-17-2022 12:38:14 Client-1-3a0b2d2c-864c-4f4c-8647-248465c441c4 INFO     [75,   100] loss: 0.293
10-17-2022

       [  0, 245,   3,  14,   5,   0,   2,   0,   0,   0],
       [  2,   0, 169,   2,  31,   1,  31,   0,   3,   0],
       [  8,   9,   0, 214,   2,   2,  20,   0,   2,   0],
       [  0,   0,  37,  14, 153,   0,  24,   0,   2,   0],
       [  0,   0,   0,   0,   0, 246,   0,  16,   2,   4],
       [ 51,   1,  30,   9,  29,   1, 124,   0,   9,   0],
       [  0,   0,   0,   0,   0,  13,   0, 218,   1,  23],
       [  0,   0,   1,   4,   1,   5,   3,   2, 220,   0],
       [  0,   0,   0,   0,   0,   3,   1,  16,   1, 233]])), EpochData(epoch_id=7, num_epochs=81, duration_train=760877, duration_test=7395, loss_train=0.572578113079071, accuracy=81.32, loss=11.192662715911865, class_precision=array([0.76209677, 0.97244094, 0.7       , 0.77580071, 0.7       ,
       0.91544118, 0.59817352, 0.875     , 0.89430894, 0.88931298]), class_recall=array([0.79411765, 0.91821561, 0.73221757, 0.84824903, 0.66956522,
       0.92910448, 0.51574803, 0.85098039, 0.93220339, 0.91732283]), confusion_mat=

       0.94360902, 0.60085837, 0.87698413, 0.91769547, 0.90769231]), class_recall=array([0.78991597, 0.92936803, 0.76987448, 0.85992218, 0.71304348,
       0.93656716, 0.5511811 , 0.86666667, 0.94491525, 0.92913386]), confusion_mat=array([[188,   0,   5,  18,   1,   0,  20,   0,   6,   0],
       [  1, 250,   4,   9,   2,   0,   3,   0,   0,   0],
       [  2,   0, 184,   2,  28,   0,  21,   0,   2,   0],
       [  6,   5,   0, 221,   4,   0,  19,   0,   2,   0],
       [  0,   0,  27,  10, 164,   0,  28,   0,   1,   0],
       [  0,   0,   0,   0,   0, 251,   0,  12,   2,   3],
       [ 46,   0,  29,   8,  25,   1, 140,   0,   5,   0],
       [  0,   0,   0,   0,   0,  12,   0, 221,   1,  21],
       [  0,   0,   1,   4,   2,   2,   2,   2, 223,   0],
       [  0,   0,   0,   0,   0,   0,   0,  17,   1, 236]])), EpochData(epoch_id=15, num_epochs=81, duration_train=1611171, duration_test=7598, loss_train=0.4586551606655121, accuracy=83.44, loss=9.273550152778625, class_precision=array(

       0.95488722, 0.63374486, 0.88715953, 0.93360996, 0.921875  ]), class_recall=array([0.80672269, 0.9330855 , 0.79916318, 0.85603113, 0.7173913 ,
       0.94776119, 0.60629921, 0.89411765, 0.95338983, 0.92913386]), confusion_mat=array([[192,   0,   5,  13,   0,   0,  23,   0,   5,   0],
       [  1, 251,   3,   9,   2,   0,   3,   0,   0,   0],
       [  1,   0, 191,   2,  26,   0,  18,   0,   1,   0],
       [  4,   5,   4, 220,   4,   0,  19,   0,   1,   0],
       [  0,   0,  30,  10, 165,   0,  24,   0,   1,   0],
       [  0,   0,   0,   0,   0, 254,   0,  10,   1,   3],
       [ 41,   0,  24,   6,  23,   1, 154,   0,   5,   0],
       [  0,   0,   0,   0,   0,   9,   0, 228,   1,  17],
       [  0,   0,   1,   2,   2,   2,   2,   2, 225,   0],
       [  0,   0,   0,   0,   0,   0,   0,  17,   1, 236]])), EpochData(epoch_id=23, num_epochs=81, duration_train=2495766, duration_test=7507, loss_train=0.4047824585437775, accuracy=84.8, loss=8.405369460582733, class_precision=array([

       0.95849057, 0.65983607, 0.89803922, 0.93415638, 0.92635659]), class_recall=array([0.80252101, 0.93680297, 0.84100418, 0.86770428, 0.73478261,
       0.94776119, 0.63385827, 0.89803922, 0.96186441, 0.94094488]), confusion_mat=array([[191,   0,   4,  12,   0,   0,  26,   0,   5,   0],
       [  1, 252,   3,   9,   2,   0,   2,   0,   0,   0],
       [  1,   0, 201,   2,  20,   0,  14,   0,   1,   0],
       [  4,   4,   5, 223,   3,   0,  17,   0,   1,   0],
       [  0,   0,  27,  10, 169,   0,  23,   0,   1,   0],
       [  0,   0,   0,   0,   0, 254,   0,  10,   1,   3],
       [ 39,   0,  21,   7,  20,   1, 161,   0,   5,   0],
       [  0,   0,   0,   0,   0,   9,   0, 229,   1,  16],
       [  0,   0,   1,   2,   2,   1,   1,   2, 227,   0],
       [  0,   0,   0,   0,   0,   0,   0,  14,   1, 239]])), EpochData(epoch_id=31, num_epochs=81, duration_train=3333575, duration_test=6697, loss_train=0.37050835490226747, accuracy=85.92, loss=7.888957858085632, class_precision=array

       0.96603774, 0.66528926, 0.9       , 0.94190871, 0.93700787]), class_recall=array([0.81512605, 0.94423792, 0.84937238, 0.87159533, 0.75217391,
       0.95522388, 0.63385827, 0.91764706, 0.96186441, 0.93700787]), confusion_mat=array([[194,   0,   4,  10,   0,   0,  26,   0,   4,   0],
       [  1, 254,   2,   9,   1,   0,   2,   0,   0,   0],
       [  1,   0, 203,   2,  19,   0,  13,   0,   1,   0],
       [  4,   4,   5, 224,   3,   0,  16,   0,   1,   0],
       [  0,   0,  25,   8, 173,   0,  23,   0,   1,   0],
       [  0,   0,   0,   0,   0, 256,   0,   9,   0,   3],
       [ 40,   0,  19,   8,  20,   1, 161,   0,   5,   0],
       [  0,   0,   0,   0,   0,   7,   0, 234,   1,  13],
       [  0,   0,   1,   2,   2,   1,   1,   2, 227,   0],
       [  0,   0,   0,   0,   0,   0,   0,  15,   1, 238]])), EpochData(epoch_id=39, num_epochs=81, duration_train=4180767, duration_test=7010, loss_train=0.34524976074695585, accuracy=86.64, loss=7.532857418060303, class_precision=array

       [  0,   0,   0,   0,   0,   0,   0,  15,   1, 238]])), EpochData(epoch_id=46, num_epochs=81, duration_train=4919476, duration_test=7796, loss_train=0.3274501505494118, accuracy=87.08, loss=7.295472204685211, class_precision=array([0.80327869, 0.98832685, 0.79377432, 0.86538462, 0.78923767,
       0.98091603, 0.66666667, 0.90494297, 0.94628099, 0.94444444]), class_recall=array([0.82352941, 0.94423792, 0.85355649, 0.87548638, 0.76521739,
       0.95895522, 0.62992126, 0.93333333, 0.97033898, 0.93700787]), confusion_mat=array([[196,   0,   3,   9,   0,   0,  26,   0,   4,   0],
       [  1, 254,   2,   9,   1,   0,   2,   0,   0,   0],
       [  2,   0, 204,   1,  20,   0,  12,   0,   0,   0],
       [  4,   3,   5, 225,   3,   0,  16,   0,   1,   0],
       [  0,   0,  23,   7, 176,   0,  23,   0,   1,   0],
       [  0,   0,   0,   0,   0, 257,   0,   8,   0,   3],
       [ 41,   0,  19,   8,  21,   0, 160,   0,   5,   0],
       [  0,   0,   0,   0,   0,   5,   0, 238,   1,  11]

       [  0,   0,  23,   7, 180,   0,  19,   0,   1,   0],
       [  0,   0,   0,   0,   0, 257,   0,   8,   0,   3],
       [ 43,   0,  19,   7,  21,   0, 160,   0,   4,   0],
       [  0,   0,   0,   0,   0,   5,   0, 240,   1,   9],
       [  0,   1,   1,   1,   2,   0,   1,   2, 228,   0],
       [  0,   0,   0,   0,   0,   0,   0,  16,   1, 237]])), EpochData(epoch_id=54, num_epochs=81, duration_train=5766175, duration_test=7692, loss_train=0.3128339183330536, accuracy=87.16, loss=7.1195098310709, class_precision=array([0.79674797, 0.98449612, 0.79296875, 0.87209302, 0.79295154,
       0.98091603, 0.67226891, 0.8988764 , 0.95      , 0.9516129 ]), class_recall=array([0.82352941, 0.94423792, 0.84937238, 0.87548638, 0.7826087 ,
       0.95895522, 0.62992126, 0.94117647, 0.96610169, 0.92913386]), confusion_mat=array([[196,   0,   3,   8,   0,   0,  27,   0,   4,   0],
       [  1, 254,   2,   9,   1,   0,   2,   0,   0,   0],
       [  2,   0, 203,   1,  20,   0,  13,   0,   0,   0],


       [  1, 254,   2,   9,   1,   0,   2,   0,   0,   0],
       [  2,   0, 203,   1,  18,   0,  15,   0,   0,   0],
       [  4,   3,   5, 225,   3,   0,  16,   0,   1,   0],
       [  0,   0,  23,   7, 179,   0,  20,   0,   1,   0],
       [  0,   0,   0,   0,   0, 257,   0,   8,   0,   3],
       [ 43,   0,  18,   7,  20,   0, 162,   0,   4,   0],
       [  0,   0,   0,   0,   0,   5,   0, 240,   1,   9],
       [  0,   1,   1,   1,   2,   0,   1,   2, 228,   0],
       [  0,   0,   0,   0,   0,   0,   0,  17,   1, 236]])), EpochData(epoch_id=62, num_epochs=81, duration_train=6611974, duration_test=7198, loss_train=0.3049787789583206, accuracy=87.28, loss=7.024581044912338, class_precision=array([0.79757085, 0.98449612, 0.79607843, 0.87548638, 0.80630631,
       0.98091603, 0.66803279, 0.8988764 , 0.95      , 0.9516129 ]), class_recall=array([0.82773109, 0.94423792, 0.84937238, 0.87548638, 0.77826087,
       0.95895522, 0.64173228, 0.94117647, 0.96610169, 0.92913386]), confusion_ma

       0.98091603, 0.67489712, 0.8988764 , 0.9539749 , 0.9516129 ]), class_recall=array([0.82773109, 0.94795539, 0.85774059, 0.87548638, 0.77826087,
       0.95895522, 0.64566929, 0.94117647, 0.96610169, 0.92913386]), confusion_mat=array([[197,   0,   3,   7,   0,   0,  27,   0,   4,   0],
       [  1, 255,   2,   8,   1,   0,   2,   0,   0,   0],
       [  2,   0, 205,   1,  18,   0,  13,   0,   0,   0],
       [  4,   3,   5, 225,   3,   0,  16,   0,   1,   0],
       [  0,   0,  23,   7, 179,   0,  20,   0,   1,   0],
       [  0,   0,   0,   0,   0, 257,   0,   8,   0,   3],
       [ 43,   0,  18,   7,  19,   0, 164,   0,   3,   0],
       [  0,   0,   0,   0,   0,   5,   0, 240,   1,   9],
       [  0,   1,   1,   1,   2,   0,   1,   2, 228,   0],
       [  0,   0,   0,   0,   0,   0,   0,  17,   1, 236]])), EpochData(epoch_id=70, num_epochs=81, duration_train=7459467, duration_test=7404, loss_train=0.29766977280378343, accuracy=87.44, loss=6.9396992623806, class_precision=array([

       [  0,   0,   0,   0,   0,   0,   0,  17,   1, 236]])), EpochData(epoch_id=77, num_epochs=81, duration_train=8203177, duration_test=7294, loss_train=0.29170302242040635, accuracy=87.68, loss=6.871705874800682, class_precision=array([0.79757085, 0.98461538, 0.8046875 , 0.87937743, 0.82110092,
       0.98091603, 0.67886179, 0.8988764 , 0.9539749 , 0.9516129 ]), class_recall=array([0.82773109, 0.95167286, 0.86192469, 0.87937743, 0.77826087,
       0.95895522, 0.65748031, 0.94117647, 0.96610169, 0.92913386]), confusion_mat=array([[197,   0,   3,   7,   0,   0,  27,   0,   4,   0],
       [  1, 256,   1,   8,   1,   0,   2,   0,   0,   0],
       [  2,   0, 206,   2,  16,   0,  13,   0,   0,   0],
       [  4,   3,   5, 226,   2,   0,  16,   0,   1,   0],
       [  0,   0,  23,   7, 179,   0,  20,   0,   1,   0],
       [  0,   0,   0,   0,   0, 257,   0,   8,   0,   3],
       [ 43,   0,  17,   6,  18,   0, 167,   0,   3,   0],
       [  0,   0,   0,   0,   0,   5,   0, 240,   1,   9

## Copy experiment results from the extractor

The extractor holds the experiment results in a format that can be processed by TensorBoard.
To download them to the local machine, execute:

In [63]:
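# Look up the name of the extractor pod in the `test` namespace.
EXTRACTOR_POD_NAME=$(kubectl get pods -n test -l "app.kubernetes.io/name=fltk.extractor" -o jsonpath="{.items[0].metadata.name}")

# Copy the logging directory from the pod to the local machine.
kubectl cp -n test "$EXTRACTOR_POD_NAME:/opt/federation-lab/logging" ../logging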

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
tar: Removing leading `/' from member names
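
Because removing the extractor chart later deletes the results stored on the NFS, it may be worth confirming that the copy produced files locally before moving on. The sketch below is one way to do that. `results_present` is a hypothetical helper (not part of FLTK), and `../logging` is the destination directory used in the cell above.

```shell
# Hypothetical helper (not part of FLTK): succeeds only if the given
# directory exists and contains at least one regular file.
results_present() {
  dir="$1"
  [ -d "$dir" ] && [ -n "$(find "$dir" -type f | head -n 1)" ]
}
```

For example, `results_present ../logging || echo "No results found in ../logging"` prints a warning when the directory is missing or empty.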


# Cleanup

## Removing orchestrator

In [64]:
helm uninstall -n test experiment-orchestrator

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
release "experiment-orchestrator" uninstalled


## Removing extractor

IMPORTANT: Removing the extractor chart deletes the experiment results that were already collected and stored on the NFS!

In [65]:
helm uninstall extractor -n test

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
release "extractor" uninstalled


## Wrapping up

To scale down the cluster node pools, run the cell below. It scales both node pools down to zero nodes and removes all experiments deployed on the cluster. As a consequence:

1. Experiments cannot be restarted.
2. Experiment logs do not survive the deletion.
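
The resize commands in the cell below rely on the variables set at the start of this notebook (`CLUSTER_NAME`, `DEFAULT_POOL`, `EXPERIMENT_POOL`, `REGION`). If the shell session was restarted in the meantime, they may be empty. A minimal sketch of a check, assuming a POSIX-compatible shell; `require_vars` is a hypothetical helper and not part of FLTK or gcloud:

```shell
# Hypothetical helper: returns non-zero if any of the named
# variables is unset or empty, and reports each missing name.
require_vars() {
  missing=0
  for name in "$@"; do
    # Read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_vars CLUSTER_NAME DEFAULT_POOL EXPERIMENT_POOL REGION` before the cell lists any variable that still needs to be set.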


In [66]:
# This will remove all information and logs as well.
kubectl delete pytorchjobs.kubeflow.org --all-namespaces --all

gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
    --num-nodes 0 --region $REGION --quiet

gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \
    --num-nodes 0 --region $REGION --quiet

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
pytorchjob.kubeflow.org "trainjob-3a0b2d2c-864c-4f4c-8647-248465c441c4" deleted
pytorchjob.kubeflow.org "trainjob-76482789-4267-4bde-a48d-6c2adc0c85e5" deleted
Resizing fltk-testbed-cluster...done.                                          
Updated [https://container.googleapis.com/v1/projects/test-bed-fltk-group16-mb/zones/us-central1-c/clusters/fltk-testbed-cluster].
Resizing fltk-testbed-cluster...done.                                          
Updated [https://container.googleapis.com/v1/projects/test-bed-fltk-group16-mb/zones/us-central1-c/clusters/fltk-testbed-cluster].
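
The two resize commands above differ only in the node pool name, so they can be wrapped in a small loop. The sketch below does that and adds a dry-run mode that prints each command instead of executing it. `scale_pools_to_zero` and the `DRY_RUN` variable are hypothetical names introduced here for illustration; the `gcloud` flags are the ones already used in this notebook.

```shell
# Hypothetical helper: resize every node pool given as an argument to
# zero nodes. With DRY_RUN=1 the command is printed, not executed.
# Assumes CLUSTER_NAME and REGION are set as at the start of the notebook.
scale_pools_to_zero() {
  for pool in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "gcloud container clusters resize $CLUSTER_NAME --node-pool $pool --num-nodes 0 --region $REGION --quiet"
    else
      gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$pool" \
        --num-nodes 0 --region "$REGION" --quiet
    fi
  done
}
```

Setting `DRY_RUN=1` and then running `scale_pools_to_zero "$DEFAULT_POOL" "$EXPERIMENT_POOL"` shows what would be executed; unsetting `DRY_RUN` performs the actual resize.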
