# README (Ignore if you are running on Mac/Linux)

If you are running on Windows, make sure you have started the Jupyter Notebook in a Bash shell.
Moreover, all the requirements below must be installed in this Bash (compatible) shell.

This can be achieved as follows:

1. Enable and install WSL(2) for Windows 10/11 (see the [official documentation](https://docs.microsoft.com/en-us/windows/wsl/install)).
    * On newer builds of Windows 10/11 you can install WSL by running the following command in an *administrator* PowerShell terminal. By default, this installs an Ubuntu instance of WSL.
    ```bash
    wsl --install
    ```
2. Start the Ubuntu Bash shell by searching for `Bash` under Start, or by running `bash` in a (normal) PowerShell terminal.

Using the Bash terminal started in step 2 above, you can install the requirements described below as if you were running on Linux (Ubuntu/Debian).
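
To confirm that the shell you ended up in is the one this notebook expects, a quick check along the following lines may help. It is a minimal sketch: the function name `describe_shell` is hypothetical, and detecting WSL by looking for `microsoft` in the kernel release string is a common heuristic rather than an official interface.

```bash
# Report whether we are running in Bash, and whether the kernel looks like WSL.
# Assumption: WSL kernels report "microsoft" in their release string.
describe_shell () {
  if [ -z "${BASH_VERSION:-}" ]; then
    echo "not-bash"
  elif uname -r | grep -qi "microsoft"; then
    echo "bash-on-wsl"
  else
    echo "bash"
  fi
}

describe_shell
```

On Windows, the result you are looking for is `bash-on-wsl`. On Linux or macOS, `bash` is the expected answer.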

## Requirements
These requirements may also be installed on Windows; however, development has only been tested on Linux/macOS.

Before we get started, make sure to install all the required tools. We provide two lists below: one for setting up the testbed, and one for developing code to use with the testbed. Feel free to skip the second list for now and return to it later.


### Deployment

 > ⚠️ All dependencies must be installed in a Bash-compatible shell. Windows users should also see the section [above](#read-me).

Make sure to install a recent version of each of the dependencies.


 * (Windows only) Install every dependency in a Windows Subsystem for Linux (WSL) Bash shell (see also the README above).
 * GCloud SDK
    - Follow the installation instructions [here](https://cloud.google.com/sdk/docs/install), using either the Linux instructions or the instructions specific to your OS/distribution.
    - Initialize the SDK with `gcloud init`. If prompted, you may skip setting or creating a default project.
    - ⚠️ Run the command `gcloud auth application-default login`
        - ℹ️ We need to run this command to use your login credentials programmatically with Terraform. Terraform uses these credentials to impersonate a service account during the creation and setup of the Kubernetes cluster.
    - ⚠️ Run the command `gcloud components install beta`
        - ℹ️ We need to run this command to list the billing account IDs and enable billing. Currently, these features fall under beta access.
    - ⚠️ Run the command `gcloud components install gke-gcloud-auth-plugin`
        - ℹ️ We need to run this command to retrieve cluster configurations (to be used by `kubectl` and `helm`)
 * Kubectl (>= 1.22.0)
 * Helm (>= 3.9.4)
 * Terraform (>= 1.2.8)
 * Python 3.9 or 3.10
   * jupyter, ipython, bash_kernel
```bash
pip3 install jupyter ipython bash_kernel
python3 -m bash_kernel.install
```
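
Before continuing, it can be useful to check that every tool from the list above is reachable from your Bash shell. The helper below is a small sketch (the name `missing_tools` is our own); it only checks that the commands exist, not that their versions satisfy the minimums listed above.

```bash
# Print every required command that cannot be found on the PATH.
# The tool list mirrors the Deployment requirements above.
missing_tools () {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing_tools gcloud kubectl helm terraform python3 jupyter
```

If the command prints nothing, all tools were found. Any name that is printed still needs to be installed in this shell.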

### Development
For development, the following tools are needed/recommended:

 * Docker (>= 18.09).
    - If you don't have experience with using Docker, we recommend following [this](https://docs.docker.com/get-started/) tutorial.
 * Python 3.9
 * pip3
 * JetBrains PyCharm

# Preparation

To make sure we can request resources on Google Cloud Platform (GCP), perform the following steps:

1. Create a GCP account on [https://cloud.google.com](https://cloud.google.com), using a Google account.
2. Redeem your academic coupon on GCP (see Brightspace for information on obtaining the \$50 academic coupon), or use the free \$300 credits that Google provides to new users.
3. Make sure to use the `Bash` kernel, not a Python or other kernel. On Windows machines, launch the `jupyter notebook` server from a Bash-compatible command line; we recommend Windows Subsystem for Linux.

⚠️ Make sure to run this Notebook within a cloned repository, not standalone/downloaded from GitHub.


# Deployment

⚠️ This notebook assumes that commands are executed in order. Executing the provided commands multiple times should not result in issues. However, re-running cells with `cd` commands, or altering cells (other than variables as instructed) may result in unexpected behaviour.

## Getting started

First, we will set a few variables used **throughout** the project. We set them in this notebook for convenience, but they are also set to example default values in the project's configuration files. If you change any of these, make sure to also change the corresponding variables in:

* [`../terraform/terraform-gke/variables.tf`](../terraform/terraform-gke/variables.tf)
* [`../terraform/terraform-dependencies/variables.tf`](../terraform/terraform-dependencies/variables.tf)


> ⚠️ After changing the `PROJECT_ID` parameter to a unique project name, also change the `project_id` variable in the files listed above. This allows you to run `terraform apply` without having to override the default value for the project.

> ℹ️ Any variable changed here can also be provided to `terraform` using the `-var` flag, e.g. `-var terraform_variable=$BASH_VARIABLE`. An example for setting the `project_id` variable is also provided later.
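
Besides the `-var` flag, Terraform also reads input variables from environment variables named `TF_VAR_<name>`. The helper below is a minimal sketch of that approach; the function name `export_tf_var` is hypothetical and not part of the testbed.

```bash
# Expose a value to Terraform as input variable $1 via the TF_VAR_<name> convention.
# Example: export_tf_var project_id "$PROJECT_ID" sets TF_VAR_project_id.
export_tf_var () {
  export "TF_VAR_$1=$2"
}
```

After the variables in the next cell have been set, calling `export_tf_var project_id "$PROJECT_ID"` makes the value available to every later `terraform` command in the same shell. A value passed with `-var` on the command line still takes precedence over the environment variable.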

In [4]:
# VARIABLES THAT NEED TO BE SET
PROJECT_ID="eric-cs4215-fltk"              # CHANGE ME!

# DEFAULT VARIABLES
ACCOUNT_ID="terraform-iam-service-account"
PRIVILEGED_ACCOUNT_ID="${ACCOUNT_ID}@${PROJECT_ID}.iam.gserviceaccount.com"
CLUSTER_NAME="fltk-testbed-cluster"
REGION="us-central1-c"                     # Note: this is a zone; the default cluster is zonal
echo "$PWD"

/home/yifan/Documents/fltk-testbed/jupyter
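
Project creation fails when the chosen ID does not follow GCP's format rules: 6 to 30 characters, only lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen. The sketch below checks a candidate ID locally; `is_valid_project_id` is a hypothetical helper, and it cannot tell whether the ID is already taken by another project.

```bash
# Check a candidate project ID against GCP's format rules:
# 6-30 characters, lowercase letters, digits and hyphens only,
# starting with a letter and not ending with a hyphen.
is_valid_project_id () {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

is_valid_project_id "eric-cs4215-fltk" && echo "looks valid"
```

To check your own value, call `is_valid_project_id "$PROJECT_ID"` and inspect the exit status.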


## Project creation

Next, we create a project using the `PROJECT_ID` variable and get all the billing account information.

⁉️ (Ignore if using a pre-existing GCP Project) If the command below does not complete successfully, make sure to change the `PROJECT_ID` variable in the previous cell and re-run it.

In [5]:
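# Only needed if `gcloud` is not on your PATH; adjust the path to your own installation.
alias gcloud=/home/yifan/.local/tools/google-cloud-sdk/bin/gcloud
gcloud projects create "$PROJECT_ID" --set-as-default
# For a pre-existing project, select it instead of creating it:
# gcloud config set project $PROJECT_ID
gcloud beta billing accounts list # Copy the ACCOUNT_ID of your billing account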

Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/eric-cs4215-fltk].
Waiting for [operations/cp.8719614993913160474] to finish...done.              
Enabling service [cloudapis.googleapis.com] on project [eric-cs4215-fltk]...
Operation "operations/acat.p2-357745681408-bca4138e-a10c-4c49-abc9-b4c7c29a92e4" finished successfully.
Updated property [core/project] to [eric-cs4215-fltk].
ACCOUNT_ID            NAME          OPEN  MASTER_ACCOUNT_ID
01A260-563FE5-2AAE64  我的结算帐号  True


Copy the billing account identifier, e.g. `015594-41687F-092941`, and assign it to the variable in the cell below.

In [6]:
BILLING_ACCOUNT="01A260-563FE5-2AAE64"      # CHANGE ME!

Set up billing and enable the required services. This allows us to create a GKE cluster (a Google-managed Kubernetes cluster), and to push and pull containers to and from our private container repository.

In [7]:
# Link the billing account to the project
gcloud beta billing projects link "$PROJECT_ID" --billing-account "$BILLING_ACCOUNT"
# Enable services now that billing is enabled
gcloud services enable compute container --project "$PROJECT_ID"

billingAccountName: billingAccounts/01A260-563FE5-2AAE64
billingEnabled: true
name: projects/eric-cs4215-fltk/billingInfo
projectId: eric-cs4215-fltk
Operation "operations/acf.p2-357745681408-8f13bc8d-5f5d-47b0-aa34-1f19850f285d" finished successfully.
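
To double-check the result of the cell above, the state of the project can be queried again. The helpers below are a sketch under a few assumptions: the names `billing_enabled` and `service_enabled` are our own, and the full service names (such as `container.googleapis.com`) are what `gcloud services list` is expected to report for the short names used above.

```bash
# Succeeds only if billing is enabled for project $1.
billing_enabled () {
  status="$(gcloud beta billing projects describe "$1" --format='value(billingEnabled)')"
  [ "$(printf '%s' "$status" | tr '[:upper:]' '[:lower:]')" = "true" ]
}

# Succeeds only if service $2 (full name) is enabled on project $1.
service_enabled () {
  gcloud services list --enabled --project "$1" --format='value(config.name)' \
    | grep -Fqx "$2"
}
```

For example, `billing_enabled "$PROJECT_ID" && service_enabled "$PROJECT_ID" container.googleapis.com && echo "ready"` prints `ready` only when both checks pass.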


## Creating a service-account

Create a service account that has the minimum set of permissions for creating and managing a cluster. This service account will be used to create the cluster and to deploy the dependencies that we use.

During the deployment we will make use of impersonation to let *your* account act as the service account. For more information about this practice, see also [this](https://cloud.google.com/blog/topics/developers-practitioners/using-google-cloud-service-account-impersonation-your-terraform-code) blog post by Google.

In [10]:
# Helper function to quickly enable GCP roles; assumes $PRIVILEGED_ACCOUNT_ID and $PROJECT_ID are set.
function enable_gcp_role () {
  local role="$1"
  gcloud projects add-iam-policy-binding \
    "$PROJECT_ID" \
    --member="serviceAccount:$PRIVILEGED_ACCOUNT_ID" \
    --role="roles/$role"
}

# Create the service account (comment this line out if the account already exists)
gcloud iam service-accounts create "$ACCOUNT_ID" --display-name="Terraform service account" --project "${PROJECT_ID}"

# Allow the service account to use the set of roles below.
enable_gcp_role "compute.viewer"                # Allow the service account to see active resources
enable_gcp_role "storage.objectViewer"          # Allow the service account/managed resources to pull from gcr.io (your code)
enable_gcp_role "compute.networkAdmin"          # Needed for setting up private network
enable_gcp_role "compute.securityAdmin"         # Needed for GKE
enable_gcp_role "container.clusterViewer"       # Needed for GKE
enable_gcp_role "container.clusterAdmin"        # Needed for GKE
enable_gcp_role "container.developer"           # Needed for GKE
enable_gcp_role "iam.serviceAccountAdmin"       # Needed for GKE
enable_gcp_role "iam.serviceAccountUser"        # Needed for GKE


Created service account [terraform-iam-service-account].
Updated IAM policy for project [eric-cs4215-fltk].
bindings:
- members:
  - serviceAccount:service-357745681408@compute-system.iam.gserviceaccount.com
  role: roles/compute.serviceAgent
- members:
  - serviceAccount:terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com
  role: roles/compute.viewer
- members:
  - serviceAccount:service-357745681408@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
- members:
  - serviceAccount:service-357745681408@containerregistry.iam.gserviceaccount.com
  role: roles/containerregistry.ServiceAgent
- members:
  - serviceAccount:357745681408-compute@developer.gserviceaccount.com
  - serviceAccount:357745681408@cloudservices.gserviceaccount.com
  role: roles/editor
- members:
  - user:ericyfsong@gmail.com
  role: roles/owner
- members:
  - serviceAccount:service-357745681408@gcp-sa-pubsub.iam.gserviceaccount.com
  role: roles/pubsub.serviceAgent
et
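
Because the full policy output is long, it can be easier to list only the roles bound to the new service account. The sketch below uses the `--flatten`, `--filter` and `--format` options of `gcloud projects get-iam-policy`; the function name `roles_for_service_account` is hypothetical.

```bash
# List the project-level roles bound to service account $2 in project $1.
roles_for_service_account () {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$2" \
    --format="value(bindings.role)"
}
```

Calling `roles_for_service_account "$PROJECT_ID" "$PRIVILEGED_ACCOUNT_ID"` should print one line per role enabled in the cell above.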

## Enable impersonation
With the service account created, we must enable impersonation, to allow the main account of the project to make use of the service account. For more information see also the [`add-iam-policy-binding`](https://cloud.google.com/sdk/gcloud/reference/iam/service-accounts/add-iam-policy-binding) reference.

Assign the email address of your Google account to the `OWNER_MAIL` variable, and run the cell below.

In [11]:
OWNER_MAIL="ericyfsong@gmail.com"           # CHANGE ME!
gcloud iam service-accounts add-iam-policy-binding "$PRIVILEGED_ACCOUNT_ID" \
 --member="user:$OWNER_MAIL" \
 --role=roles/iam.serviceAccountTokenCreator \
 --project "$PROJECT_ID"

Updated IAM policy for serviceAccount [terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com].
bindings:
- members:
  - user:ericyfsong@gmail.com
  role: roles/iam.serviceAccountTokenCreator
etag: BwXoyRYUPXs=
version: 1


To enable the use of your account's credentials, run the command below. This opens a new browser tab, or prints a link that you can open manually. Afterwards, you can use your own credentials to impersonate the service account.

In the same way, you can also allow other Google users (such as project members) to work with your cluster.

In [12]:
gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=PTOYUlOzaPv7wIBBRkpiH71eTHIeFP&access_type=offline&code_challenge=0G73vrz2HbceJYnxUay2qKuTdhCe7uJoQH6BmTJ_kkc&code_challenge_method=S256

Opening in existing browser session.

Credentials saved to file: [/home/yifan/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "eric-cs4215-fltk" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may 
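
A simple way to test the impersonation setup is to request a short-lived access token on behalf of the service account. The sketch below relies on the `--impersonate-service-account` flag of `gcloud`; the function name `can_impersonate` is our own. Newly added IAM bindings may need a short while before they take effect, so a failure directly after the previous cells is not necessarily a configuration error.

```bash
# Request a short-lived token while impersonating service account $1.
# Success indicates that the token-creator binding is in effect.
can_impersonate () {
  gcloud auth print-access-token --impersonate-service-account="$1" >/dev/null 2>&1
}
```

For example: `can_impersonate "$PRIVILEGED_ACCOUNT_ID" && echo "impersonation works"`.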

## Creating a Google managed cluster (GKE)
To create the cluster, first change the working directory to the `terraform-gke` directory.

⚠️ Creating a cluster will incur billing costs on your project. By default, the cluster is small to minimize costs during this tutorial. Forgetting to `destroy` or scale down the cluster may quickly use up your academic coupon.

In [23]:
cd ../terraform/terraform-gke
echo $PWD

/home/yifan/Documents/fltk-testbed/terraform/terraform-gke


Run `init` to initialize the Terraform module in this directory.

In [24]:
terraform init -reconfigure

Initializing modules...

Initializing the backend...

Initializing provider plugins...
- Reusing previous version of hashicorp/google-beta from the dependency lock file
- Reusing previous version of hashicorp/random from the dependency lock file
- Reusing previous version of hashicorp/google from the dependency lock file
- Reusing previous version of hashicorp/kubernetes from the dependency lock file
- Using previously-installed hashicorp/google-beta v4.36.0
- Using previously-installed hashicorp/random v3.4.3
- Using previously-installed hashicorp/google v4.36.0
- Using previously-installed hashicorp/kubernetes v2.13.1

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun 

Next, we can check whether we can create a cluster. No warnings or errors should occur during this process. It may take a while to complete.

> ⚠️ We pass the `project_id` variable of `terraform/terraform-gke` manually with `-var`. Alternatively, change its default value in `variables.tf`.

⁉️ If the command below does not complete successfully, e.g. after raising a `403` error, make sure that you have successfully created the project with `gcloud` earlier.


In [25]:
terraform plan -var project_id=$PROJECT_ID

module.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=zl6u]
data.google_service_account_access_token.default: Reading...
data.google_service_account_access_token.default: Read complete after 0s [id=projects/-/serviceAccounts/terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com]
data.google_client_config.default: Reading...
module.gke.data.google_compute_zones.available: Reading...
module.gke.data.google_container_engine_versions.region: Reading...
module.gcp-network.module.vpc.google_compute_network.network: Refreshing state... [id=projects/eric-cs4215-fltk/global/networks/gcp-private-network]
data.google_client_config.default: Read complete after 0s [id=projects/eric-cs4215-fltk/regions//zones/]
module.gke.data.google_container_engine_versions.region: Read complete after 1s [id=2022-09-16 11:41:14.880587865 +0000 UTC]


When the previous command completes successfully, we can start the deployment. Depending on any changes you may have made, this might take a while.

By default, this will create a private zonal cluster consisting of two node pools.

> ⚠️ A regional (multi-zonal) cluster will incur an additional fee of \$0.10 per hour per managed (GKE) cluster. The **first** zonal cluster is free of this charge.

> ⚠️ By default, spot/preemptible nodes are disabled. You can experiment by setting `spot` to true in the `tf` files. Note, however, that the default implementations provided in the testbed cannot recover from nodes being spun down and rescheduled. Moreover, using spot nodes may result in poor availability during busy hours in the region in which you deploy your cluster.
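
To get a feeling for what the management fee of a regional cluster adds up to, the hourly rate quoted above can be multiplied by the number of hours the cluster exists. The helper below is a small worked example (the name `cluster_fee_usd` is our own); it covers only this fee and not the cost of the nodes themselves.

```bash
# Estimate the management fee (in USD) of a regional cluster for $1 hours,
# using the $0.10 per hour figure quoted above. Node costs are not included.
cluster_fee_usd () {
  awk -v hours="$1" 'BEGIN { printf "%.2f\n", 0.10 * hours }'
}

cluster_fee_usd 24            # one day: prints 2.40
cluster_fee_usd $((24 * 30))  # a 30-day month: prints 72.00
```

A regional cluster that is left running for a month would therefore exceed the \$50 academic coupon on this fee alone.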


In [26]:
terraform apply -auto-approve -var project_id=$PROJECT_ID

module.gke.random_string.cluster_service_account_suffix: Refreshing state... [id=zl6u]
data.google_service_account_access_token.default: Reading...
data.google_service_account_access_token.default: Read complete after 0s [id=projects/-/serviceAccounts/terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com]
data.google_client_config.default: Reading...
module.gke.data.google_compute_zones.available: Reading...
data.google_client_config.default: Read complete after 0s [id=projects/eric-cs4215-fltk/regions//zones/]
module.gke.data.google_container_engine_versions.region: Reading...
module.gcp-network.module.vpc.google_compute_network.network: Refreshing state... [id=projects/eric-cs4215-fltk/global/networks/gcp-private-network]
module.gke.data.google_compute_zones.available: Read complete after 1s [id=projects/eric-cs4215-fltk/regions/us-central1]

Next, we add cluster credentials (so you can interact with the cluster through `kubectl` and `helm`).

In [27]:
# Add credentials for interacting with cluster via kubectl
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID

Fetching cluster endpoint and auth data.
kubeconfig entry generated for fltk-testbed-cluster.
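
With the credentials in place, a quick look at the active context and the cluster's nodes confirms that `kubectl` is talking to the new cluster. This is a minimal sketch; `check_cluster_access` is a hypothetical helper name.

```bash
# Show which cluster kubectl is pointed at and list its nodes.
check_cluster_access () {
  kubectl config current-context && kubectl get nodes
}
```

Run `check_cluster_access`; the context name is expected to contain the project and cluster name, followed by a list of the nodes in the cluster.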


### Changing deployment

To save cost, or run different experiments, you might want to change the configuration of your cluster. This can be achieved by modifying the cluster configuration in the [`terraform-gke/main.tf`](../terraform/terraform-gke/main.tf) configuration file. You can change the default node-pools, create additional node pools with taints (to allow for scheduling on specific nodes/pools) and much more.

After finishing your changes, run the following commands:

```bash
# Use `plan` to check your configuration
terraform plan
# Check to see if your changes are as expected, terraform will show what will be created/removed.

# If the changes are as you expect, apply the changes.
terraform apply #-auto-approve
```

Depending on the number of changes, this may take some time.
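
When changing a running cluster, it can be reassuring to apply exactly the plan that was reviewed. Terraform supports this by saving the plan to a file with `-out` and passing that file to `apply`. The helpers below sketch this workflow; the function names and the file name `tfplan` are arbitrary choices for this example.

```bash
# Save the plan to a file so that it can be reviewed before applying.
# Extra arguments (such as -var project_id=...) are passed on to terraform.
save_plan () {
  terraform plan -out="${PLAN_FILE:-tfplan}" "$@"
}

# Apply exactly the plan that was saved earlier; no further approval is asked.
apply_saved_plan () {
  terraform apply "${PLAN_FILE:-tfplan}"
}
```

A typical sequence is `save_plan -var project_id="$PROJECT_ID"`, a review of the printed changes, and then `apply_saved_plan`.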

## Installing dependencies
Lastly, we need to install the dependencies on our cluster. First change the directory, and then run the `init`, `plan` and `apply` commands as we did for creating the GKE cluster.

In [28]:
cd ../terraform-dependencies
echo $PWD

/home/yifan/Documents/fltk-testbed/terraform/terraform-dependencies


Run `init` to initialize the Terraform module in this directory.

In [29]:

terraform init -reconfigure


Initializing the backend...

Initializing provider plugins...
- Reusing previous version of kbst/kustomization from the dependency lock file
- Reusing previous version of gavinbunney/kubectl from the dependency lock file
- Reusing previous version of hashicorp/kubernetes from the dependency lock file
- Reusing previous version of hashicorp/google from the dependency lock file
- Reusing previous version of hashicorp/helm from the dependency lock file
- Using previously-installed kbst/kustomization v0.9.0
- Using previously-installed gavinbunney/kubectl v1.14.0
- Using previously-installed hashicorp/kubernetes v2.13.1
- Using previously-installed hashicorp/google v4.36.0
- Using previously-installed hashicorp/helm v2.6.0

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should n

Check to see if we can plan the deployment. This will set up the following:

* Kubeflow training operator (used to deploy and manage PyTorchTrainJobs programmatically)
* NFS-provisioner (used to enable logging on a persistent `ReadWriteMany` PVC in the cluster)


In [30]:
terraform plan -var project_id=$PROJECT_ID

data.kustomization_build.training_operator: Reading...
helm_release.nfs_client_provisioner: Refreshing state... [id=nfs-server]
data.google_service_account_access_token.default: Reading...
data.google_service_account_access_token.default: Read complete after 0s [id=projects/-/serviceAccounts/terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com]
data.google_client_config.default: Reading...
data.google_client_config.default: Read complete after 0s [id=projects/eric-cs4215-fltk/regions//zones/]
data.google_container_cluster.testbed_cluster: Reading...
data.kustomization_build.training_operator: Read complete after 2s [id=a294ea9a3d4f626ec1ec55aac66b4a486f682fe5dbec2eadf58d30baee14a8f66a3ec2674c0e2d9a40e7bc191f878f010393849f4581c6a4189bc41761abab89]
kustomization_resource.training_operator["_/ServiceAccount/kubeflow/training-operator"]: Refreshing state

When the previous command completes successfully, we can start the deployment. This will install the NFS provisioner and Kubeflow Training Operator dependencies.


In [32]:
terraform apply -auto-approve -var project_id=$PROJECT_ID

data.kustomization_build.training_operator: Reading...
helm_release.nfs_client_provisioner: Refreshing state... [id=nfs-server]
data.google_service_account_access_token.default: Reading...
data.google_service_account_access_token.default: Read complete after 1s [id=projects/-/serviceAccounts/terraform-iam-service-account@eric-cs4215-fltk.iam.gserviceaccount.com]
data.google_client_config.default: Reading...
data.google_client_config.default: Read complete after 0s [id=projects/eric-cs4215-fltk/regions//zones/]
data.google_container_cluster.testbed_cluster: Reading...
data.kustomization_build.training_operator: Read complete after 3s [id=a294ea9a3d4f626ec1ec55aac66b4a486f682fe5dbec2eadf58d30baee14a8f66a3ec2674c0e2d9a40e7bc191f878f010393849f4581c6a4189bc41761abab89]
kustomization_resource.training_operator["_/Namespace/_/kubeflow"]: Refreshing state... [id=ad40d91e-e00d


## Testing the deployment

To make sure that the deployment went OK, we can run the following command to test whether we can use the PyTorch training operator.

This will create a simple deployment using a Kubeflow PyTorch example job: a small (1 master, 1 client) training job on MNIST on your cluster. You can follow the deployment by navigating to your cluster on [cloud.google.com](https://cloud.google.com).

In [None]:
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
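
The job can also be followed from the command line. The sketch below assumes that the example manifest creates a `PyTorchJob` named `pytorch-simple` in the `kubeflow` namespace; check the manifest if the names differ. The helper names are hypothetical.

```bash
# Assumed from the example manifest: job "pytorch-simple" in namespace "kubeflow".
JOB_NAME="pytorch-simple"
JOB_NAMESPACE="kubeflow"

# Show the state of the training job and of the pods in its namespace.
show_job_status () {
  kubectl get pytorchjobs "$JOB_NAME" -n "$JOB_NAMESPACE"
  kubectl get pods -n "$JOB_NAMESPACE"
}

# Remove the test job once it has finished.
remove_test_job () {
  kubectl delete pytorchjobs "$JOB_NAME" -n "$JOB_NAMESPACE"
}
```

Run `show_job_status` a few times while the job is starting, and `remove_test_job` when you are done with the test.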


# Cleaning up

> ⚠️ THIS WILL REMOVE YOUR CLUSTER AND DATA STORED ON IT. For this tutorial's purpose destroying your cluster is not an issue. For testing/developing, we recommend manually scaling your cluster up and down instead.


To clean up/remove the cluster, we will use the `terraform destroy` command.

 * Running it in `terraform-dependencies` WILL REMOVE the Kubeflow Training-Operator from your cluster.
 * Running it in `terraform-gke` WILL REMOVE YOUR ENTIRE CLUSTER.

You can uncomment the commands below to remove the cluster, or run the command in a terminal in the [`../terraform/terraform-gke`](../terraform/terraform-gke) directory.
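
If you want to remove everything in one go, one possible approach is to destroy the dependencies first and the cluster second, so that Terraform can still reach the cluster while removing what runs on it. The sketch below is our own suggestion, not part of the testbed: `destroy_testbed` and `TERRAFORM_ROOT` are hypothetical names, and the default of `..` assumes the notebook is still in one of the two Terraform directories.

```bash
# Destroy the dependencies first, then the cluster they run on.
# Each step runs in a subshell, so the notebook's working directory is unchanged.
# TERRAFORM_ROOT (hypothetical) points at the repository's `terraform` directory.
destroy_testbed () {
  root="${TERRAFORM_ROOT:-..}"
  ( cd "$root/terraform-dependencies" && terraform destroy -auto-approve "$@" ) &&
  ( cd "$root/terraform-gke" && terraform destroy -auto-approve "$@" )
}
```

Calling `destroy_testbed -var project_id="$PROJECT_ID"` passes the project ID to both `destroy` commands. As with the cells below, this removes the cluster and all data stored on it.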


In [None]:
pwd

In [None]:
# cd ../terraform-gke

# terraform destroy -auto-approve

In [None]:
# Scale both node pools down to zero nodes


gcloud container clusters resize "$CLUSTER_NAME" --node-pool "default-node-pool" \
     --num-nodes 0 --region "$REGION" --quiet

gcloud container clusters resize "$CLUSTER_NAME" --node-pool "medium-fltk-pool-1" \
    --num-nodes 0 --region "$REGION" --quiet
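
Scaling back up works the same way with a different node count. The helpers below are a small sketch that wraps the commands above; the names `resize_pool` and `list_pools` are our own, and both assume that `$CLUSTER_NAME` and `$REGION` are set as in the first cell of this notebook.

```bash
# Resize node pool $1 of the cluster to $2 nodes.
resize_pool () {
  gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$1" \
    --num-nodes "$2" --region "$REGION" --quiet
}

# List the node pools of the cluster.
list_pools () {
  gcloud container node-pools list --cluster "$CLUSTER_NAME" --region "$REGION"
}
```

For example, `resize_pool "default-node-pool" 1` brings the default pool back to a single node, and `list_pools` shows the pools that exist in the cluster.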
