# Getting started

First, we scale the cluster up. Note that if you run an auto-scaling cluster,
Google will scale down (suspend) idle nodes, so have the experiment prepared before running the commands.

The following is assumed ready:
* GKE/Kubernetes cluster (see also `terraform/terraform_notebook.ipynb`)
    * 2 node pools (default for system & dependencies, experiment pool)
* Docker image (including dataset, to speed up starting experiments).
    * First run the extractor (locally) `python3 -m extractor configs/example_cloud_experiment.json`
        * This downloads datasets to be included in the docker image.
    * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/eric-cs4215-fltk/fltk`
    * Push to your gcr.io repository `docker push gcr.io/eric-cs4215-fltk/fltk`


With that set up, first set some variables used throughout the experiment.


In [10]:
PROJECT_ID="eric-cs4215-fltk"
CLUSTER_NAME="fltk-testbed-cluster"
DEFAULT_POOL="default-node-pool"
EXPERIMENT_POOL="medium-fltk-pool-1"
REGION="us-central1-c"
alias gcloud=/home/yifan/.local/tools/google-cloud-sdk/bin/gcloud
# In case we do not yet have the credentials/kubeconfig
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID

Fetching cluster endpoint and auth data.
kubeconfig entry generated for fltk-testbed-cluster.


Scale up the default node pool and the experiment node pool.

In [13]:
# These commands might take a while to complete.
gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
    --num-nodes 2 --region $REGION --quiet

gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \
    --num-nodes 3 --region $REGION --quiet

Resizing fltk-testbed-cluster...done.                                          
Updated [https://container.googleapis.com/v1/projects/eric-cs4215-fltk/zones/us-central1-c/clusters/fltk-testbed-cluster].
ERROR: (gcloud.container.clusters.resize) PERMISSION_DENIED: Insufficient regional quota to satisfy request: resource "CPUS": request requires '24.0' and is short '2.0'. project has a quota of '24.0' with '22.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=eric-cs4215-fltk.
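
The quota error above comes down to simple arithmetic. The sketch below reproduces it, assuming the requested 24 CPUs are spread evenly over the 3 requested experiment nodes (8 vCPUs per node, which is inferred from the message and not stated anywhere in this notebook). The helper `max_nodes` is hypothetical.

```shell
# max_nodes AVAILABLE_CPUS CPUS_PER_NODE
# Prints the largest node count that fits in the remaining CPU quota.
max_nodes() {
    echo $(( $1 / $2 ))
}

AVAILABLE_CPUS=22    # "'22.0' available" in the error message
REQUESTED_CPUS=24    # "request requires '24.0'"
REQUESTED_NODES=3    # --num-nodes 3 for the experiment pool

# Assumption: every experiment node uses the same number of vCPUs.
CPUS_PER_NODE=$(( REQUESTED_CPUS / REQUESTED_NODES ))

echo "vCPUs per node: $CPUS_PER_NODE"
echo "nodes that fit: $(max_nodes "$AVAILABLE_CPUS" "$CPUS_PER_NODE")"
```

Under these assumptions only 2 experiment nodes fit, so either request a higher regional CPU quota or resize the experiment pool to fewer nodes.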



## Preparation
If you have already tested something or run another experiment, first remove the existing deployment of the Orchestrator. This will not delete any experiment data, as it persists on one of the ReadWriteMany PVCs.


Currently, the Orchestrator is deployed using a `Deployment` definition; a future version will replace this with a `Job` definition, to make this step unnecessary. For experiments this means the following:

1. Only one deployment can exist at a time in a given namespace. This includes 'completed' experiments.
2. For running batches of experiments, a BatchOrchestrator is provided.


ℹ️ This will not remove any data, but if your orchestrator is still/already running experiments, this will stop the deployment. Running training jobs will not be stopped; use `kubectl` for that. ConfigMaps created by the Orchestrator (to provide experiment configurations) will not be removed either. See the commented commands in the cell below.

In [14]:
# If you want to delete all pytorch trainjobs, uncomment the command below.
# kubectl delete pytorchjobs.kubeflow.org --all --namespace test

# If you want to delete all existing configuration map objects in a namespace, uncomment the command below.
# kubectl delete configmaps --all --namespace test

helm uninstall -n test flearner

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error: uninstall: Release not loaded: flearner: release: not found
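
The `release: not found` error above is harmless when nothing was deployed yet. A minimal sketch of an uninstall that checks for the release first, assuming `helm` is on the `PATH`. The helper name `uninstall_if_present` is hypothetical and not part of the repository.

```shell
# uninstall_if_present RELEASE NAMESPACE
# Uninstalls a Helm release only when `helm status` can find it.
uninstall_if_present() {
    release=$1
    namespace=$2
    if ! command -v helm >/dev/null 2>&1; then
        echo "helm not found on PATH" >&2
        return 1
    fi
    # `helm status` exits non-zero when the release does not exist.
    if helm status "$release" --namespace "$namespace" >/dev/null 2>&1; then
        helm uninstall "$release" --namespace "$namespace"
    else
        echo "release $release not found in namespace $namespace, nothing to uninstall"
    fi
}
```

For example, `uninstall_if_present flearner test` could stand in for the bare `helm uninstall` call in the cell above.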



## Define experiment configuration files

Deployment of experiments is currently done through a Helm Deployment. A future release (™️) will rework this to a Job definition, as this makes the template easier to reuse.

* `EXPERIMENT_FILE` contains the description of the experiments.
* `CLUSTER_CONFIG` contains shared configuration for logging, the Orchestrator, and replication.

In [15]:
EXPERIMENT_FILE="configs/federated_tasks/example_arrival_config.json"
CLUSTER_CONFIG="configs/example_cloud_experiment.json"

## Set up experiment variables
Next, we will deploy the experiments.


We provide a configuration file, `charts/fltk-values.yaml`. In this file, change the values under the `provider` block: set `projectName` to your Google Cloud Project ID.

```yaml
provider:
    domain: gcr.io
    projectName: CHANGE_ME!
    imageName: fltk:latest
```
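
The three `provider` values presumably combine into the image reference that the cluster pulls. The chart template is not shown in this notebook, so the composition below is an assumption, based on the `docker push` target used earlier (an image built without an explicit tag is tagged `latest` by Docker).

```shell
# Values as they would appear in charts/fltk-values.yaml,
# using the example project ID from the variables cell.
DOMAIN="gcr.io"
PROJECT_NAME="eric-cs4215-fltk"
IMAGE_NAME="fltk:latest"

# Assumed composition: <domain>/<projectName>/<imageName>
IMAGE_REF="$DOMAIN/$PROJECT_NAME/$IMAGE_NAME"
echo "$IMAGE_REF"
```

If the printed reference does not match the image you pushed, the pods will likely fail to pull it.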

We use the `--set-file` flag for `helm` because Helm currently does not support using files outside of the chart root directory (in this case `charts/orchestrator`). With `--set-file` we can provide these files dynamically. See also [Helm issue #3276](https://github.com/helm/helm/issues/3276).


In [16]:
helm uninstall experiment-orchestrator -n test
helm install experiment-orchestrator charts/orchestrator --namespace test -f charts/fltk-values.yaml \
  --set-file orchestrator.experiment=$EXPERIMENT_FILE,orchestrator.configuration=$CLUSTER_CONFIG


To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error: uninstall: Release not loaded: experiment-orchestrator: release: not found
Error: INSTALLATION FAILED: failed to download "charts/orchestrator"
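
The `failed to download "charts/orchestrator"` error above typically appears when Helm cannot find the chart at the given local path and then treats the argument as a remote chart reference. One likely cause is running the cell from a directory other than the repository root. A minimal pre-flight sketch, with a hypothetical helper `preflight`, that checks the chart directory and the files passed via `--set-file`:

```shell
# preflight CHART_DIR FILE...
# Returns non-zero when the chart directory or any listed file is missing.
preflight() {
    chart_dir=$1
    shift
    status=0
    if [ ! -f "$chart_dir/Chart.yaml" ]; then
        echo "no Chart.yaml in $chart_dir (wrong working directory?)" >&2
        status=1
    fi
    for path in "$@"; do
        if [ ! -r "$path" ]; then
            echo "cannot read $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths copied from the cells above.
EXPERIMENT_FILE="configs/federated_tasks/example_arrival_config.json"
CLUSTER_CONFIG="configs/example_cloud_experiment.json"

if preflight charts/orchestrator "$EXPERIMENT_FILE" "$CLUSTER_CONFIG"; then
    echo "ready to run helm install"
else
    echo "fix the paths listed above before running helm install"
fi
```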



In [17]:
# To get logs from the orchestrator
kubectl logs -n test fl-learner

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error from server (NotFound): pods "fl-learner" not found



In [18]:
# To get logs from learners (example)
kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-master-0

# To get logs from learners (federated learning)
kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-worker-0

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error from server (NotFound): pods "trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-master-0" not found
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Error from server (NotFound): pods "trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-worker-0" not found
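
The pod names in the cell above follow a visible pattern: `trainjob-<job id>-<role>-<index>`. The sketch below builds such names from a job ID. The pattern is inferred from the two commands above only, and the helper `trainjob_pod` is hypothetical. The job ID is the example one, so it will not exist in your cluster; `kubectl get pods -n test` lists the actual names.

```shell
# trainjob_pod JOB_ID ROLE INDEX
# Prints the pod name for a train job replica (ROLE is master or worker).
trainjob_pod() {
    echo "trainjob-$1-$2-$3"
}

JOB_ID="eb056010-7c33-4c46-9559-b197afc7cb84"   # example ID from the cell above

trainjob_pod "$JOB_ID" master 0
trainjob_pod "$JOB_ID" worker 0
```

The printed name can be passed to `kubectl logs -n test`, as in the cell above.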



# Wrapping up

To scale down the cluster node pools, run the cell below.


In [None]:
gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \
    --num-nodes 0 --region $REGION --quiet

gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \
    --num-nodes 0 --region $REGION --quiet
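
Before scaling down, it can help to preview the exact commands. The sketch below only prints them (a dry run) and does not call `gcloud`. The helper `resize_cmd` is hypothetical, and the variable values are copied from the first cell.

```shell
CLUSTER_NAME="fltk-testbed-cluster"
DEFAULT_POOL="default-node-pool"
EXPERIMENT_POOL="medium-fltk-pool-1"
REGION="us-central1-c"

# resize_cmd POOL NUM_NODES
# Prints the resize command for one node pool without executing it.
resize_cmd() {
    echo "gcloud container clusters resize $CLUSTER_NAME --node-pool $1 --num-nodes $2 --region $REGION --quiet"
}

for pool in "$DEFAULT_POOL" "$EXPERIMENT_POOL"; do
    resize_cmd "$pool" 0
done
```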