# Training a Vision Transformer on CIFAR-10 Using Kubernetes
This notebook walks through training a Vision Transformer (ViT) model on the CIFAR-10 dataset as a Kubernetes job.

# Step 0: Prerequisites

1. You must have an NRP account
2. You must have been added to a Nautilus namespace
3. You must have your NRP config in the `~/.kube` directory. There is a notebook to assist you [here](./NautilusConfigSetup.ipynb).
4. You must have a PVC on the Nautilus cluster in your assigned namespace



# Step 1: Creating a Train Script

Our first step is to build a script to train and test our Vision Transformer. Let's begin by importing modules:

In [None]:
from torchvision.datasets import CIFAR10
from torchvision.models import vit_b_16, ViT_B_16_Weights
from torch.utils.data import DataLoader
from torch import nn
import torch
import os

# read the number of DataLoader worker processes and training epochs from the environment
TORCH_NUM_JOBS = int(os.environ.get("TORCH_NUM_JOBS", "4"))
TORCH_NUM_EPOCHS = int(os.environ.get("TORCH_NUM_EPOCHS", "1"))
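
Environment variables always arrive as strings, so a typo in the job specification only surfaces when `int()` fails or the loop silently runs zero epochs. The following is a minimal sketch of a stricter reader. The helper name `read_positive_int` and its `minimum` parameter are assumptions made for illustration; they are not part of the training script.

```python
import os

def read_positive_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an environment variable as an integer >= minimum (hypothetical helper)."""
    raw = os.environ.get(name, str(default))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value

# Same variable names and defaults as the cell above.
TORCH_NUM_JOBS = read_positive_int("TORCH_NUM_JOBS", 4)
TORCH_NUM_EPOCHS = read_positive_int("TORCH_NUM_EPOCHS", 1)
```

Failing early with a named variable in the message makes a misconfigured job easier to diagnose from `kubectl logs` than a bare traceback.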


Now, let's download the train partition and create our data loader:

In [None]:
cifar10_train = CIFAR10(root="~/data", download=True, train=True, transform=ViT_B_16_Weights.IMAGENET1K_V1.transforms())

train_data_loader = DataLoader(cifar10_train,
                               batch_size=32,
                               shuffle=True,
                               num_workers=TORCH_NUM_JOBS)


Next, we set up the model:

In [None]:
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

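# replace the classification head: 768 hidden features -> 10 CIFAR-10 classes
model.heads[0] = nn.Linear(768, 10)

# freeze the transformer encoder so that only the remaining layers
# (including the new head) receive gradient updates
model.encoder.requires_grad_(False)

# create the optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
loss_function = nn.CrossEntropyLoss()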

We can now train our model. The job will run on a GPU, so we move the model and each batch to CUDA:

In [None]:
model.cuda()
model.train()

for epoch in range(TORCH_NUM_EPOCHS):
    print("*" * 20 + f"\nEpoch {epoch+1} / {TORCH_NUM_EPOCHS}")
    epoch_loss = 0
    epoch_correct = 0
    
    for i, (data, labels) in enumerate(train_data_loader):
        data = data.cuda()
        labels = labels.cuda()
        
        optimizer.zero_grad()
        model_outputs = model(data)
        
        loss = loss_function(model_outputs.float(), labels.long())
        
        if torch.isnan(loss):
            raise RuntimeError("Loss reached NaN!")
        
        loss.backward()
        optimizer.step()
            
            
        _, predictions = torch.max(model_outputs, 1)
        epoch_correct += torch.sum(predictions == labels).item()
        epoch_loss += loss.item()

        # report running averages about ten times per epoch; i is zero-based,
        # so i + 1 batches have been accumulated at this point
        seen = i + 1
        if i > 0 and (i % (len(train_data_loader) // 10) == 0 or i == 1):
            print(f"{seen} / {len(train_data_loader)}"
                  f"\tLoss = {epoch_loss / seen:.2f}"
                  f"\tAcc = {epoch_correct:d} / {seen * train_data_loader.batch_size} "
                  f"({epoch_correct / (seen * train_data_loader.batch_size) * 100:.1f}%)", flush=True)
        
    print(f"Loss = {epoch_loss / len(train_data_loader):.4f}")
    print(f"Train Acc = {epoch_correct / len(cifar10_train) * 100:.2f}%") 
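
The progress report in the loop fires roughly ten times per epoch because it divides the number of batches by 10. The arithmetic can be sketched in plain Python. The helper names `batches_per_epoch` and `log_interval` are hypothetical. The `max(1, ...)` guard is an addition that the loop above does not have; it would matter only for a loader with fewer than ten batches, where the modulo would otherwise divide by zero.

```python
import math

def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batches a DataLoader yields when the last partial batch is kept (drop_last=False)."""
    return math.ceil(num_samples / batch_size)

def log_interval(num_batches: int, reports: int = 10) -> int:
    """Batches between progress reports; never 0, so `i % interval` is always safe."""
    return max(1, num_batches // reports)

# The CIFAR-10 train partition has 50,000 images; the loader above uses batch_size=32.
n = batches_per_epoch(50_000, 32)
print(n, log_interval(n))  # 1563 156
```

Because 50,000 is not a multiple of 32, the final batch holds only 16 images, which is why the end-of-epoch accuracy divides by `len(cifar10_train)` rather than by a batch count times the batch size.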

Next, let's set up a test data loader and run predictions on the test partition of the dataset:

In [None]:
cifar10_test = CIFAR10(root="~/data", download=True, train=False, transform=ViT_B_16_Weights.IMAGENET1K_V1.transforms())

test_data_loader = DataLoader(cifar10_test,
                              batch_size=32,
                              shuffle=False,
                              num_workers=TORCH_NUM_JOBS)

In [None]:
model.eval()

predictions = []
labels = []
with torch.no_grad():
    print("*" * 20 + f"\nRunning Eval")
    for i, (data, lb) in enumerate(test_data_loader):
        
        model_outputs = model(data.cuda())

        _, preds = torch.max(model_outputs, 1)

        labels.extend(lb.numpy().tolist())
        predictions.extend(preds.cpu().numpy().tolist())
        if i > 0 and i % (len(test_data_loader) // 10) == 0:
            print(f"{i} / {len(test_data_loader)}", flush=True)

Finally, we can calculate statistics on the predictions:


In [None]:
from sklearn.metrics import precision_recall_fscore_support as evaluate

# scikit-learn expects the true labels first and the predictions second
prec, rec, fscore, _ = evaluate(labels, predictions, average="macro")

print("*" * 20 + f"""\n
Precision  \t{prec*100:.2f}%
Recall  \t{rec*100:.2f}%
F-1 Score \t{fscore*100:.2f}%
""")

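To make the `average="macro"` setting concrete, here is a small pure-Python sketch of macro-averaged precision and recall: each class gets its own score, and the scores are averaged with equal weight. The function `macro_precision_recall` is a hypothetical illustration, not scikit-learn's implementation, and it assumes a class with no predictions (or no true samples) scores 0.

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)  # samples predicted as class c
        actual = sum(1 for t in y_true if t == c)     # samples that really are class c
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(macro_precision_recall(y_true, y_pred))  # precision 0.75, recall about 0.83
print(macro_precision_recall(y_pred, y_true))  # swapped inputs exchange the two values
```

The second call shows why argument order matters: passing predictions in place of the true labels exchanges precision and recall, while the F-1 score of each class stays the same.
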
# Step 2: Copying Our Script to the Cluster

### Step 2A: Copy Code to a File
The code we have written above has been copied to a script in the scripts directory named [ViTCifar10.py](../scripts/ViTCifar10.py).

### Step 2B: Spawn Pod with PVC
You now need to spawn a pod on the cluster with your persistent volume attached. You should have already updated the [pod_pvc.yml](../yaml/pod_pvc.yml) during the [SKLearn](./SKLearn.ipynb) exercise.


Once you have updated the placeholder values, you can run the following cell to provision and start a pod on the cluster:

In [None]:
! kubectl create -f ../yaml/pod_pvc.yml

### Step 2C: Copy the File to the PVC

Run the following cell until your pod is `Running`:

In [None]:
! kubectl get pods
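
Re-running the cell by hand works, but the wait can also be scripted. The sketch below polls the pod's phase with `kubectl get pod ... -o jsonpath={.status.phase}` until it reaches a target phase. The function names, the timeout, and the polling interval are assumptions chosen for illustration. Note that the phase (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`) is coarser than the `STATUS` column of `kubectl get pods`; a pod shown as `ContainerCreating` is still in the `Pending` phase.

```python
import subprocess
import time

TERMINAL_PHASES = ("Succeeded", "Failed")

def kubectl_pod_phase(pod_name: str) -> str:
    """Ask kubectl for the pod's current phase."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def wait_for_phase(pod_name, target="Running", get_phase=kubectl_pod_phase,
                   timeout_s=600.0, interval_s=5.0):
    """Poll until the pod reaches `target`; raise if it ends elsewhere or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase(pod_name)
        if phase == target:
            return phase
        if phase in TERMINAL_PHASES:
            raise RuntimeError(f"pod {pod_name} ended in phase {phase}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pod {pod_name} still {phase} after {timeout_s}s")
        time.sleep(interval_s)

# Usage (replace PODNAME with your pod's name):
# wait_for_phase("PODNAME")
```

The `get_phase` parameter exists so the polling logic can be exercised without a cluster. `kubectl wait --for=condition=Ready pod/PODNAME` is a built-in alternative for the common case.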

Once your pod is running, we can copy our script to the PVC attached to the pod. Change `PODNAME` to your pod's name:

In [None]:
! kubectl cp ../scripts/ViTCifar10.py PODNAME:/data/ViTCifar10.py

We can check that our copy was successful with the `exec` subcommand of `kubectl`. Again, replace `PODNAME` with your pod's name:

In [None]:
! kubectl exec PODNAME -- cat /data/ViTCifar10.py

# Step 3: Building a Container Image

The next step would normally be to build a container image and push it to a container registry. Here we can skip that and reuse the image that is currently running this Jupyter instance, since it already has all the dependencies we need:
```
gitlab-registry.nrp-nautilus.io/gp-engine/jupyter-stacks/bigdata-2023:latest
```

The `Dockerfile` for this image is in the `docker` directory of this repository [here](../docker/Dockerfile). In other contexts, you may need to write your own `Dockerfile`, then build and push the image yourself.

# Step 4: Building the Job Specification YAML

We now have everything we need to run our training and evaluation job. The final to-do item is to create a YAML job specification file. There is a template file for this in the repository [here](../yaml/vit_cifar10_job_template.yml):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ job_name }}
spec:
  template:
    spec:
      containers:
          - name: vit-train-container
            image: gitlab-registry.nrp-nautilus.io/gp-engine/jupyter-stacks/bigdata-2023:latest
            workingDir: /data
            env:
                - name: TORCH_NUM_JOBS
                  value: "{{ num_jobs }}"
                - name: TORCH_NUM_EPOCHS
                  value: "{{ num_epochs }}"
            command: ["python3", "/data/ViTCifar10.py"]
            volumeMounts:
                - name: {{ pvc_name }}
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            resources:
                limits:
                  memory: 8Gi
                  cpu: "{{ num_jobs }}"
                  nvidia.com/gpu: 1
                requests:
                  memory: 8Gi
                  cpu: "{{ num_jobs }}"    
                  nvidia.com/gpu: 1
      volumes:
          - name: {{ pvc_name }}
            persistentVolumeClaim:
                claimName: {{ pvc_name }}
          - name: dshm
            emptyDir:
              medium: Memory
      affinity:
        nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 1
                    preference: 
                      matchExpressions:
                        - key: nvidia.com/gpu.product
                          operator: In
                          values:
                            - NVIDIA-A100-80GB-PCIe-MIG-1g.10gb
      restartPolicy: Never
  backoffLimit: 1
```

Let's use `jinja2` to fill in the placeholder values in our template:

In [None]:
from jinja2 import Template

# read in the template
with open('../yaml/vit_cifar10_job_template.yml') as file_:
    template = Template(file_.read())
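
By default, `jinja2` renders an undefined variable as an empty string, so a forgotten argument produces a job spec with blank fields rather than an error. As a rough sketch, the placeholder names can be listed from the raw template text before rendering. The helpers `find_placeholders` and `missing_values` are hypothetical, and the regular expression only understands plain `{{ name }}` placeholders like the ones in this template, not filters or expressions.

```python
import re

# Matches simple placeholders such as {{ job_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def find_placeholders(template_text: str) -> set:
    """Return the names of all simple {{ name }} placeholders in the text."""
    return set(PLACEHOLDER.findall(template_text))

def missing_values(template_text: str, **values) -> list:
    """Return placeholder names that have no corresponding keyword argument."""
    return sorted(find_placeholders(template_text) - set(values))

sample = 'name: {{ job_name }}\ncpu: "{{ num_jobs }}"\nclaimName: {{ pvc_name }}\n'
print(missing_values(sample, job_name="demo", num_jobs=8))  # ['pvc_name']
```

Another option is to construct the template with `undefined=jinja2.StrictUndefined`, which makes rendering raise an error when a variable is missing.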

Replace the arguments to the `render` function with the appropriate values:

In [None]:
# render the job spec
job_spec = template.render(
    job_name="JOBNAME",
    pvc_name="PVCNAME",
    num_jobs=8,
    num_epochs=1
)

# print the job spec
print(job_spec)
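
Kubernetes rejects object names that contain uppercase letters, so the literal placeholders `JOBNAME` and `PVCNAME` must be replaced with real, lowercase names. The sketch below checks a name against the DNS label rules (at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric). Treating those rules as the target is an assumption on the safe side: Job names may follow the looser DNS subdomain rules, but the template also reuses `pvc_name` as a volume name, which must be a DNS label. The helper `is_valid_dns_label` is hypothetical.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', alphanumeric at both ends.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")

def is_valid_dns_label(name: str) -> bool:
    """Return True if `name` is a valid DNS label of at most 63 characters."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None

for candidate in ("JOBNAME", "vit-cifar10-train", "-bad-start"):
    print(candidate, is_valid_dns_label(candidate))
```

Running a check like this before `kubectl create` gives a clearer message than the validation error returned by the API server.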

Now, let's save it to disk:

In [None]:
with open("./vit_job.yml", "w") as file:
    file.write(job_spec)

# Step 5: Start the Job

Run the cell below to start the job:

In [None]:
! kubectl create -f ./vit_job.yml

Run the cell below until your job's pod reaches the `Completed` status. Along the way it will pass through `Pending`, `ContainerCreating`, and `Running`:

In [None]:
! kubectl get pods

**Note**: Once the pod moves to `Running`, you can check its output while the job runs. Replace `PODNAME` with the name of the job's pod:

In [None]:
! kubectl logs PODNAME

# Step 6: Review the Output of the Job

As the `kubectl get pods` output from Step 5 shows, your job created a pod with a name of the form `job-ABCDE` (the job name followed by a random suffix). Let's check the last lines of that pod's output to see our precision, recall, and F-1 score. Change `PODNAME` below to the correct pod name:

In [None]:
! kubectl logs --tail=5 PODNAME
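
If the scores are needed as numbers, for example to compare runs, they can be pulled out of the log text. This is a minimal sketch that assumes the exact line format printed by the evaluation cell in Step 1 (`Precision`, `Recall`, and `F-1 Score`, each followed by whitespace and a percentage). The function `parse_metrics` is hypothetical, and the numbers in the sample text are made up for the demonstration.

```python
import re

# One metric per line, e.g. "Precision  \t95.10%".
METRIC_LINE = re.compile(
    r"^(Precision|Recall|F-1 Score)\s+([0-9]+(?:\.[0-9]+)?)%\s*$",
    re.MULTILINE,
)

def parse_metrics(log_text: str) -> dict:
    """Map each metric name found in the log text to its percentage as a float."""
    return {name: float(value) for name, value in METRIC_LINE.findall(log_text)}

# Made-up sample in the same layout as the script's final print statement.
sample_log = "********************\n\nPrecision  \t95.10%\nRecall  \t94.80%\nF-1 Score \t94.90%\n\n"
print(parse_metrics(sample_log))
```

In practice the input would be the captured output of `kubectl logs PODNAME`, for instance via `subprocess.run(..., capture_output=True, text=True)`.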

# Step 7: Delete the Job and the Pod

The final step is to delete the job we ran and the pod we spawned. Please change `JOBNAME` and `PODNAME` below to the appropriate names:

In [None]:
! kubectl delete job JOBNAME

In [None]:
! kubectl delete pod PODNAME