# Training SKLearn Models on Kubernetes

This notebook walks through training an SKLearn model in a Kubernetes Job.

# Step 0: Prerequisites

1. You must have an NRP account
2. You must have been added to a Nautilus namespace
3. You must have your NRP config in the `~/.kube` directory. There is a notebook to assist you [here](./NautilusConfigSetup.ipynb).
4. You must have a PVC on the Nautilus cluster in your assigned namespace



# Step 1: Building a Training Script 

The first step is to write a script that performs the training and evaluation.

Let's build this below:

In [None]:
from torchvision.datasets import MNIST
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from tqdm import tqdm
import numpy as np
import os

NUM_TREES = int(os.environ.get("SK_NUM_TREES", "3"))
NUM_JOBS = int(os.environ.get("SK_NUM_JOBS", "1"))

print(f"Running random forest with {NUM_TREES} trees and {NUM_JOBS} jobs")

######
# Download MNIST
######
train_dataset = MNIST(download=True, root="~/data", train=True)
test_dataset = MNIST(download=True, root="~/data", train=False)

##### 
# Generate Train Features
#####
print("Generating Train Features")
train_features = np.empty((len(train_dataset), 108))
train_labels = np.empty(len(train_dataset), np.int32)
for i, (img, label) in tqdm(enumerate(train_dataset), ncols=80, total=len(train_dataset)):
    train_features[i] = hog(np.asarray(img), orientations=12, cells_per_block=(3,3))
    train_labels[i] = label

#####
# Generate Test Features
#####
print("Generating Test Features")
test_features = np.empty((len(test_dataset), 108))
test_labels = np.empty(len(test_dataset), np.int32)
for i, (img, label) in tqdm(enumerate(test_dataset), ncols=80, total=len(test_dataset)):
    test_features[i] = hog(np.asarray(img), orientations=12, cells_per_block=(3,3))
    test_labels[i] = label

######
# Train Model
#######
print("Training the model")
model = RandomForestClassifier(n_estimators=NUM_TREES, n_jobs=NUM_JOBS, verbose=1)
model.fit(train_features, train_labels)

####
# Score Model
#####
print("Evaluating the model")
model_accuracy = model.score(test_features, test_labels)
print(f"Model Accuracy = {model_accuracy*100:.2f}%")
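
The feature arrays above are allocated with 108 columns, which has to match the length of the vector that `hog` returns. As a rough sanity check, the sketch below reproduces that number by hand. It assumes square cells and blocks, a block stride of one cell, and skimage's default `pixels_per_cell=(8, 8)`; the helper name `hog_feature_length` is hypothetical and not part of any library.

```python
def hog_feature_length(image_size, orientations, pixels_per_cell=8, cells_per_block=3):
    """Length of the flattened HOG vector for a square image.

    Assumes square cells and blocks and a block stride of one cell.
    """
    cells_per_side = image_size // pixels_per_cell          # 28 // 8 = 3
    blocks_per_side = cells_per_side - cells_per_block + 1  # 3 - 3 + 1 = 1
    # Each block holds cells_per_block^2 cells, each with one histogram.
    return blocks_per_side ** 2 * cells_per_block ** 2 * orientations


# MNIST images are 28x28, and the script uses 12 orientations and 3x3 blocks.
print(hog_feature_length(28, orientations=12))  # 108
```

If you change `orientations` or `cells_per_block` in the script, the second dimension of `train_features` and `test_features` has to change with it.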

# Step 2: Copying Our Script to the Cluster

### Step 2A: Copy Code to a File
The code we have written above has been copied to a script in the scripts directory named [RandomForestMNIST.py](../scripts/RandomForestMNIST.py).

### Step 2B: Spawn Pod with PVC
You now need to spawn a pod on the cluster with your persistent volume attached.

For a refresher, [here is a sample YAML file](../yaml/pod_pvc.yml). Be sure to change the `name` of the pod and the `persistentVolume-name`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-name-sso
spec:
  containers:
  - name: pod-name-sso
    image: ubuntu:20.04
    command: ["sh", "-c", "echo 'Im a new pod' && sleep infinity"]
    resources:
      limits:
        memory: 12Gi
        cpu: 2
      requests:
        memory: 10Gi
        cpu: 2
    volumeMounts:
    - mountPath: /data
      name: persistentVolume-name
  volumes:
    - name: persistentVolume-name
      persistentVolumeClaim:
        claimName: persistentVolume-name
```

Once you have updated those values, you can run the following cell:

In [None]:
! kubectl create -f ../yaml/pod_pvc.yml

### Step 2C: Copy the File to the PVC

Run the following cell until your pod is `Running`:

In [None]:
! kubectl get pods
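
Instead of re-running the cell by hand, the check can be polled from Python. The following is a minimal sketch, not part of the original workflow: the helper names `kubectl_pod_phase` and `wait_for_phase` are hypothetical, and it assumes `kubectl` is on your `PATH` and configured for your namespace. Note that it reads the pod's phase, which stays `Pending` while the `STATUS` column shows `ContainerCreating`.

```python
import subprocess
import time


def kubectl_pod_phase(pod_name):
    """Ask kubectl for the pod's phase (e.g. Pending, Running, Succeeded)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_phase(get_phase, target="Running", timeout_s=300, interval_s=5):
    """Poll get_phase() until it returns target; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_phase() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (replace PODNAME with your pod's name):
# wait_for_phase(lambda: kubectl_pod_phase("PODNAME"))
```

Passing the phase lookup in as a function keeps the polling loop independent of `kubectl`, so it can be tried out without a cluster.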

Once your pod is running, we can copy our script to the PVC attached to the pod. Change `PODNAME` to your pod's name:

In [None]:
! kubectl cp ../scripts/RandomForestMNIST.py PODNAME:/data/RandomForestMNIST.py

We can check that our copy was successful with the `exec` subcommand of `kubectl`. Again, replace `PODNAME` with your pod's name:

In [None]:
! kubectl exec PODNAME -- cat /data/RandomForestMNIST.py
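
Reading the file back with `cat` confirms that it exists, but comparing checksums is a stricter check that the copy is byte-for-byte identical. The sketch below is one possible way to do that. The helper names are hypothetical, and it assumes the pod's image provides `sha256sum`, which prints the digest followed by the file path.

```python
import hashlib


def local_sha256(path):
    """Hex SHA-256 digest of a local file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_match(local_digest, sha256sum_output):
    """Compare a local digest with one line of `sha256sum` output."""
    fields = sha256sum_output.split()
    # The first whitespace-separated field is the digest.
    return bool(fields) and fields[0].lower() == local_digest.lower()


# Usage: run the command below in a cell and pass its output as the second argument.
# ! kubectl exec PODNAME -- sha256sum /data/RandomForestMNIST.py
# digests_match(local_sha256("../scripts/RandomForestMNIST.py"), "<output of the command>")
```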

# Step 3: Building a Container Image

Normally, the next step would be to build a container image and push it to a container registry. Here, we can reuse the image that is currently running this Jupyter instance, since it has all the dependencies we need:
```
gitlab-registry.nrp-nautilus.io/gp-engine/jupyter-stacks/bigdata-2023:latest
```

The `Dockerfile` for this image is in the docker directory of this repository [here](../docker/Dockerfile). In other contexts, however, you may need to write your own `Dockerfile`, then build and push the image yourself.

# Step 4: Building the Job Specification YAML

We now have everything we need to run our training and evaluation job. The final to-do item is to create a YAML Job specification file. There is a template file for this in the repository [here](../yaml/sklearn_job_template.yml):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ job_name }}
spec:
  template:
    spec:
      containers:
      - name: sklearn-train-container
        image: gitlab-registry.nrp-nautilus.io/gp-engine/jupyter-stacks/bigdata-2023:latest
        workingDir: /data
        env:
            - name: SK_NUM_TREES
              value: "{{ num_trees }}"
            - name: SK_NUM_JOBS
              value: "{{ num_jobs }}"
        command: ["python3", "/data/RandomForestMNIST.py"]
        volumeMounts:
            - name: {{ pvc_name }}
              mountPath: /data
        resources:
            limits:
              memory: 1Gi
              cpu: "{{ num_jobs }}"
            requests:
              memory: 1Gi
              cpu: "{{ num_jobs }}"    
      volumes:
      - name: {{ pvc_name }}
        persistentVolumeClaim:
            claimName: {{ pvc_name }}
      restartPolicy: Never      
  backoffLimit: 1
```

Let's use `jinja2` to fill in the placeholders in our template:

In [None]:
from jinja2 import Template

# read in the template
with open('../yaml/sklearn_job_template.yml') as file_:
    template = Template(file_.read())

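Replace the arguments to the `render` function with the appropriate values. Use only lowercase letters, digits, and `-` in the job name, and set the PVC name to the name of your existing PVC: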

In [None]:
# render the job spec
job_spec = template.render(
    job_name="JOBNAME",
    pvc_name="PVCNAME",
    num_trees=1,
    num_jobs=1
)

# print the job spec
print(job_spec)
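
Because the template reuses the PVC name as the volume name, and Kubernetes requires volume names to be lowercase DNS labels, it can help to check the values before rendering. The following sketch is an optional pre-flight check under that assumption. The helper name `is_valid_dns_label` is hypothetical, and the rule it applies is deliberately conservative for job names, which may also be longer dot-separated names.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character, at most 63 characters long.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_dns_label(name):
    """Return True if name is usable as a Kubernetes DNS label name."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None


# The unedited placeholders are rejected; a lowercase name passes.
for candidate in ["JOBNAME", "PVC NAME", "sklearn-train-job"]:
    print(candidate, is_valid_dns_label(candidate))
```

Running this on your own values before calling `render` surfaces naming problems earlier than waiting for `kubectl create` to reject the file.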

Now, let's save it to disk:

In [None]:
with open("./sklearn_job.yml", "w") as file:
    file.write(job_spec)

# Step 5: Start the Job

Run the cell below to start the job:

In [None]:
! kubectl create -f ./sklearn_job.yml

Run the cell below until your job's pod reaches the `Completed` status. Along the way, it will pass through `Pending`, `ContainerCreating`, and `Running`:

In [None]:
! kubectl get pods
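
The pod list shows the pod's status, while the Job object records its own outcome in `status.conditions`. As a sketch of how that could be read programmatically, the hypothetical helper below interprets the JSON printed by `kubectl get job JOBNAME -o json`. It only looks at the `Complete` and `Failed` condition types and reports anything else as still in progress.

```python
import json


def job_outcome(job_json):
    """Return "Complete", "Failed", or "InProgress" for a Job's JSON description."""
    status = json.loads(job_json).get("status", {})
    for condition in status.get("conditions") or []:
        # A condition only applies when its status field is the string "True".
        if condition.get("status") == "True" and condition.get("type") in ("Complete", "Failed"):
            return condition["type"]
    return "InProgress"


# Example with a hand-written status fragment (not real cluster output):
sample = '{"status": {"conditions": [{"type": "Complete", "status": "True"}]}}'
print(job_outcome(sample))  # Complete
```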

# Step 6: Review the Output of the Job

As you can see in the output from Step 5, your job created a pod whose name is the job name followed by a random suffix (for example, `JOBNAME-abcde`). Let's check the output of that pod to see our accuracy. Change `PODNAME` below to the correct pod name:

In [None]:
! kubectl logs --tail=1 PODNAME
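
The training script prints its result in the form `Model Accuracy = NN.NN%`, so the last log line can be turned into a number if you want to compare runs with different `SK_NUM_TREES` values. The sketch below assumes that exact message format; the helper name `parse_accuracy` is hypothetical.

```python
import re

# Matches the final line printed by the training script.
ACCURACY_LINE = re.compile(r"Model Accuracy = (\d+(?:\.\d+)?)%")


def parse_accuracy(log_text):
    """Return the accuracy as a float percentage, or None if it is absent."""
    match = ACCURACY_LINE.search(log_text)
    return float(match.group(1)) if match else None


# Usage: paste the output of the logs command, or capture it with subprocess.
# parse_accuracy("<output of kubectl logs --tail=1 PODNAME>")
```

A `None` result usually means the job has not finished or failed before reaching the evaluation step, in which case the full log is worth reading.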

# Step 7: Delete the Job and the Pod

The final step is to delete the job we ran and the pod we spawned. Please change `JOBNAME` and `PODNAME` below to the appropriate names:

In [None]:
! kubectl delete job JOBNAME

In [None]:
! kubectl delete pod PODNAME