# Automating and Scaling Deep Learning on Kubernetes using Nautilus Job Launcher

This notebook walks through using the [Nautilus Job Launcher](https://github.com/MU-HPDI/Nautilus-Job-Launcher) library to launch multiple jobs from a dictionary configuration and from a YAML file configuration, and shows how to integrate the library into your own Python scripts.


## Nautilus Job Launcher Configuration

Each job passed to the Nautilus Job Launcher is configured with a set of key/value pairs, supplied either as YAML or as a Python dictionary, as we'll see in the sections below.

### Required Keys

| Key        | Description             |       Type        |
| :--------- | :---------------------- | :---------------: |
| `job_name` | Name of the job         |       `str`       |
| `image`    | Container image URI     |       `str`       |
| `command`  | Command to run on start | `str`/`List[str]` |

### Optional Keys

| Key          | Description                             |          Type          | Default Value |
| :----------- | :-------------------------------------- | :--------------------: | :-----------: |
| `workingDir` | Working directory when container starts |         `str`          |     None      |
| `env`        | Environment variables                   | `Dict`: `str` -> `str` |     None      |
| `volumes`    | PVC and SHM volumes                     | `Dict`: `str` -> `str` |     None      |
| `ports`      | Container ports to expose               |      `List[int]`       |     None      |
| `min_cpu`    | Min number of CPUs                      |         `int`          |       2       |
| `max_cpu`    | Max number of CPUs                      |         `int`          |       4       |
| `min_ram`    | Min GB of RAM                           |         `int`          |       4       |
| `max_ram`    | Max GB of RAM                           |         `int`          |       8       |
| `gpu`        | Number of GPUs                          |         `int`          |       0       |
| `gpu_types`  | Types of GPUs for Job                   |      `List[str]`       |     None      |
| `shm`        | When true, add shared memory            |         `bool`         |     False     |
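
To make the two tables concrete, here is a minimal sketch that checks a job dictionary against the required keys and fills in the documented defaults. The helper `normalize_job` and the constant names are hypothetical; this is not the library's own validation code, only an illustration of the tables above.

```python
# Hypothetical helper: mirrors the tables above, not the library's internals.
REQUIRED_KEYS = ("job_name", "image", "command")

# Default values copied from the "Optional Keys" table.
OPTIONAL_DEFAULTS = {
    "workingDir": None,
    "env": None,
    "volumes": None,
    "ports": None,
    "min_cpu": 2,
    "max_cpu": 4,
    "min_ram": 4,
    "max_ram": 8,
    "gpu": 0,
    "gpu_types": None,
    "shm": False,
}


def normalize_job(job):
    """Return a copy of `job` with the table defaults filled in.

    Raises ValueError if a required key is missing or a key is unknown.
    """
    missing = [k for k in REQUIRED_KEYS if k not in job]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")

    known = set(REQUIRED_KEYS) | set(OPTIONAL_DEFAULTS)
    unknown = [k for k in job if k not in known]
    if unknown:
        raise ValueError(f"Unknown keys: {unknown}")

    # Values given in the job override the defaults.
    return {**OPTIONAL_DEFAULTS, **job}


example = normalize_job(
    {"job_name": "myname-1", "image": "python:3.8", "command": ["python", "-V"]}
)
print(example["min_cpu"], example["max_ram"], example["gpu"])
```

Running a check like this before launching catches a misspelled key (for example `working_dir` instead of `workingDir`) early.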


## Part 1: Using a Dictionary

You can use a Python `dict` object to configure a set of jobs that will all be run.

Be sure to update the `namespace` and `job_prefix` variables below:


In [None]:
from nautiluslauncher import NautilusJobLauncher

namespace = None
job_prefix = "myname-"
command = ["python", "-c", "print('hello world')"]

jobs = {
    "namespace": namespace,
    "jobs": [
        {"image": "python:3.6", "command": command, "job_name": job_prefix + "1"},
        {"image": "python:3.7", "command": command, "job_name": job_prefix + "2"},
        {"image": "python:3.8", "command": command, "job_name": job_prefix + "3"},
    ],
}

jobs


Once we've built our dictionary, we can pass it to the constructor of the job launcher:


In [None]:
launcher = NautilusJobLauncher(jobs)

In [None]:
launcher.jobs

To launch the jobs, we can call the `run` method:


In [None]:
launcher.run()

We can now see all the jobs have been created:


In [None]:
! kubectl get pods

## Part 2: Templating Jobs with Dictionaries and YAML Files

As we've seen in the SKLearn and ViT exercises, we often want to start with a template of a job, be that a CPU ML task or a GPU ML task, and simply update the values. The Nautilus Job Launcher library allows us to do that with a YAML file or an equivalent dictionary.

This is done using a `defaults` key and then a set of jobs. Every job will inherit the values in `defaults` and then optionally override them.

Here's an example YAML file to do this:

```yaml
defaults:
  command: ["python", "-c", "print('hello world')"]

jobs:
  - job_name: myjob-1
    image: python:3.7
  - job_name: myjob-2
    image: python:3.8
```
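
The inheritance described above can be sketched in plain Python. This is a minimal illustration that assumes a shallow, key-by-key override; the function `resolve_jobs` is hypothetical, the library's actual merge logic may differ, and the dictionary uses the `image` key from the Required Keys table.

```python
from copy import deepcopy

# A dictionary with the same shape as the YAML example:
# one `defaults` mapping plus a list of jobs.
config = {
    "defaults": {"command": ["python", "-c", "print('hello world')"]},
    "jobs": [
        {"job_name": "myjob-1", "image": "python:3.7"},
        {"job_name": "myjob-2", "image": "python:3.8"},
    ],
}


def resolve_jobs(cfg):
    """Shallow merge (an assumption): each job value overrides the default."""
    defaults = cfg.get("defaults", {})
    # deepcopy keeps the jobs from sharing mutable default values.
    return [{**deepcopy(defaults), **job} for job in cfg["jobs"]]


resolved = resolve_jobs(config)
for job in resolved:
    print(job["job_name"], job["image"], job["command"])
```

Note that under a shallow merge, a nested value such as `env` in a job replaces the `env` from `defaults` entirely rather than combining with it. Whether the library combines nested dictionaries is not stated here, so it is worth inspecting the resolved configuration before launching.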


### Part 2A: Build Updated Script

To show the capabilities of this automation client in Python, let's update our script from the [ViT Exercise](./VisionTransformerCifar10.ipynb) to be able to dynamically select a model from an environment variable. First, we'll need a function that can take in the name of a model and return the fully configured PyTorch object:


In [None]:
from torchvision.models import (
    vit_b_16,
    ViT_B_16_Weights,
    resnet18,
    ResNet18_Weights,
    mobilenet_v2,
    MobileNet_V2_Weights,
)
from torch import nn


def build_model(model_name):
    if model_name == "ViT":
        model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        model.heads[0] = nn.Linear(768, 10)
        model.encoder.requires_grad_(False)
    elif model_name == "ResNet":
        model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, 10)
        for param in model.parameters():
            param.requires_grad = False
        model.fc.requires_grad_(True)
    elif model_name == "MobileNet":
        model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V2)
        model.classifier = nn.Linear(model.classifier[1].in_features, 10)
        model.features.requires_grad_(False)
    else:
        raise ValueError(f"Invalid Model: {model_name}")

    return model


Now we can update the model section of the script to:


In [None]:
import os

TORCH_MODEL_NAME = os.environ.get("TORCH_MODEL_NAME", "ViT")

model = build_model(TORCH_MODEL_NAME)
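
Environment variables always arrive as strings, so numeric settings need an explicit conversion. Below is a minimal sketch, assuming the script reads `TORCH_NUM_EPOCHS` in the same way as `TORCH_MODEL_NAME`; the helper `read_int_env` and the variable `EXAMPLE_UNSET_SETTING` are hypothetical, and the actual script may parse its settings differently.

```python
import os


def read_int_env(name, default):
    """Read an integer setting from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


# Simulate a value injected into the container's environment.
os.environ["TORCH_NUM_EPOCHS"] = "1"
# Make sure the hypothetical second variable is not set.
os.environ.pop("EXAMPLE_UNSET_SETTING", None)

num_epochs = read_int_env("TORCH_NUM_EPOCHS", default=5)
fallback = read_int_env("EXAMPLE_UNSET_SETTING", default=4)
print(num_epochs, fallback)
```

Failing fast with a clear message is helpful here, because a typo in a job's `env` would otherwise only surface deep inside the training loop.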

### Part 2B: Upload the Updated Script

We've placed the updated version of the script in the scripts directory as [MultiModelCifar10.py](../scripts/MultiModelCifar10.py). Let's copy this script to our PVC using a pod on the cluster. Replace `PODNAME` in the cells below with the name of the pod you create.


In [None]:
! kubectl create -f ../yaml/pod_pvc.yml

In [None]:
! kubectl cp ../scripts/MultiModelCifar10.py PODNAME:/data/MultiModelCifar10.py

In [None]:
! kubectl exec PODNAME -- cat /data/MultiModelCifar10.py

### Part 2C: Building the Configuration Dictionary

In previous exercises, we needed to build the Kubernetes YAML Spec by hand. In this exercise, we are instead going to build a configuration dictionary.

First, we define the defaults, i.e., what each job will start with. We can override these values for individual jobs as necessary. Be sure to set the `pvc_name` to your PVC name:


In [None]:
pvc_name = None


defaults = dict(
    image="gitlab-registry.nrp-nautilus.io/gp-engine/jupyter-stacks/bigdata-2023:latest",
    command=["python3", "/data/MultiModelCifar10.py"],
    workingDir="/data",
    volumes={pvc_name: "/data"},
    shm=True,
    min_cpu=8,
    max_cpu=8,
    min_ram=8,
    max_ram=8,
    gpu=1,
    env=dict(TORCH_NUM_JOBS="8", TORCH_NUM_EPOCHS="1"),  # env values are strings (`str` -> `str`)
)

defaults


Now we can define the jobs with what needs to be unique to each job. In this case, that is just the job name and the model name. Ensure you update the job names to be unique in the namespace:


In [None]:
jobs = [
    dict(job_name=f"jobname-{i+1}", env=dict(TORCH_MODEL_NAME=m))
    for i, m in enumerate(["ViT", "ResNet", "MobileNet"])
]

jobs
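
Since job names must be unique, it can help to check the list before launching. Below is a minimal sketch; the helper `check_job_names` is hypothetical, and it applies a deliberately conservative naming rule (a DNS label: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters). It only checks the list itself and cannot see jobs that already exist in the namespace.

```python
import re

# Conservative rule (an assumption): treat job names as DNS labels.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def check_job_names(jobs):
    """Raise ValueError on duplicate or invalid `job_name` values."""
    seen = set()
    for job in jobs:
        name = job["job_name"]
        if len(name) > 63 or not DNS_LABEL.fullmatch(name):
            raise ValueError(f"Invalid job name: {name!r}")
        if name in seen:
            raise ValueError(f"Duplicate job name: {name!r}")
        seen.add(name)


models = ["ViT", "ResNet", "MobileNet"]
# Naming jobs after the model makes the pods easier to tell apart later.
example_jobs = [
    dict(job_name=f"myname-{m.lower()}", env=dict(TORCH_MODEL_NAME=m)) for m in models
]
check_job_names(example_jobs)
print([j["job_name"] for j in example_jobs])
```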


### Part 2C (continued): Create Job Launcher Object

Be sure to update the `NAMESPACE` variable:


In [None]:
NAMESPACE = None

launcher = NautilusJobLauncher(
    cfg=dict(namespace=NAMESPACE, defaults=defaults, jobs=jobs)
)

launcher.jobs


We can now check the full configuration of each job (with the defaults merged in) using the `dryrun` flag:


In [None]:
launcher.run(dryrun=True)

Finally, we can kick off the jobs by removing the `dryrun` flag:


In [None]:
launcher.run()

And now we can watch the jobs run:


In [None]:
! kubectl get pods

### Part 2D: Comparing Results

By configuring our dictionary correctly and using templating, we have easily trained and tested 3 separate models on CIFAR-10. We can now check the outputs to compare the performance of the models.

Be sure to update the pod names in the cells below to match your pod names: (1) is ViT, (2) is ResNet, and (3) is MobileNet.


##### ViT Performance


In [None]:
! kubectl logs --tail=5 jobname-1-REPLACE

##### ResNet18 Performance


In [None]:
! kubectl logs --tail=5 jobname-2-REPLACE

##### MobileNetV2 Performance


In [None]:
! kubectl logs --tail=5 jobname-3-REPLACE

## Part 3: Integration with Python

Thus far, we have used the Nautilus Job Launcher as a pure runner: we hand it a Python dictionary (or YAML file) and it launches the jobs. Here, we will look at an object-oriented approach to job creation.

We will use a different class, the `NautilusAutomationClient`, which accepts `Job` objects that we can create on the fly.

Let's start by creating our client:


In [None]:
from nautiluslauncher import Job, NautilusAutomationClient

NAMESPACE = None

client = NautilusAutomationClient(NAMESPACE)

We can now create job objects, where the key/values from our dictionary become parameters:


In [None]:
job1 = Job(
    image="ubuntu:20.04", command=["echo", "'Hello World'"], job_name="my-ubuntu-job"
)
job2 = Job(image="python:3.8", command=["ls", "/etc"], job_name="my-python-job")


We can then submit these jobs through the client:


In [None]:
client.create_job(job1)

In [None]:
client.create_job(job2)

We can also get the details of all the pods in our namespace using the `list_pods` method:


In [None]:
client.list_pods()

## Part 4: Removing Resources

Please be sure to remove your completed jobs and running pods!
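
One way to script the cleanup is to build the `kubectl delete job` commands in Python. Below is a minimal sketch, assuming the jobs exist as Kubernetes Job objects under the names used earlier and that `kubectl` is configured for your cluster. The helper names and the `my-namespace` value are hypothetical, and nothing is deleted unless you call `run_commands` yourself.

```python
import subprocess


def build_delete_commands(job_names, namespace=None):
    """Return one `kubectl delete job` argument list per job name."""
    commands = []
    for name in job_names:
        cmd = ["kubectl", "delete", "job", name]
        if namespace:
            cmd += ["--namespace", namespace]
        commands.append(cmd)
    return commands


def run_commands(commands):
    """Execute each command, raising CalledProcessError on failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder names: replace with the job names you actually used.
job_names = [f"jobname-{i}" for i in range(1, 4)]
commands = build_delete_commands(job_names, namespace="my-namespace")

# Print the commands for review instead of running them.
for cmd in commands:
    print(" ".join(cmd))
```

After reviewing the printed commands, calling `run_commands(commands)` would execute them. The helper pod from Part 2B can be removed with `kubectl delete -f ../yaml/pod_pvc.yml`.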
