# 3. Neural Style Transfer on AKS

We tested the neural style transfer script locally in the previous notebook. In this notebook we package it as a Docker image, provision an AKS cluster, and check that the script still works as expected when it runs in parallel across multiple nodes.

1. Build AKS Docker image
2. Test style transfer in Docker locally
3. Push Docker image to Docker Hub
4. Provision AKS cluster
5. Test style transfer in parallel on AKS cluster

---

### Import packages and load .env

In [1]:
from dotenv import set_key, get_key, find_dotenv, load_dotenv
from pathlib import Path
import subprocess
import json
import os

In [2]:
env_path = find_dotenv(raise_error_if_not_found=True)
load_dotenv(env_path)

True
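
Because `load_dotenv` copies the `.env` values into the process environment, it can help to confirm up front that the keys used later in this notebook are actually set. The following is a minimal sketch; the `REQUIRED_KEYS` list and the `missing_keys` helper are assumptions made for illustration, collected from the `get_key` calls in the cells below, and are not part of the original notebook.

```python
import os

# Keys read via get_key() later in this notebook (assumed list, adjust as needed).
REQUIRED_KEYS = [
    "AKS_IMAGE", "AKS_CLUSTER", "DOCKER_LOGIN", "RESOURCE_GROUP", "REGION",
    "SB_QUEUE", "SB_NAMESPACE", "SP_CLIENT", "SP_SECRET",
]

def missing_keys(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [key for key in required if not environ.get(key)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing .env keys:", ", ".join(missing))
```

An empty result means every listed key has a non-empty value.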

### Build AKS Docker Image

In [3]:
%%writefile aks/requirements.txt
azure==4.0.0
torch==0.4.1
torchvision==0.2.1

Writing aks/requirements.txt


In [4]:
%%writefile aks/Dockerfile

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        cmake \
        curl \
        git \
        nginx \
        supervisor \
        wget && \
        rm -rf /var/lib/apt/lists/*

ENV PYTHON_VERSION=3.6
RUN curl -o ~/miniconda.sh -O  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    /opt/conda/bin/conda create -y --name py$PYTHON_VERSION python=$PYTHON_VERSION && \
    /opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/envs/py$PYTHON_VERSION/bin:$PATH
ENV LD_LIBRARY_PATH /opt/conda/envs/py$PYTHON_VERSION/lib:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH
ENV PYTHONPATH /code/:$PYTHONPATH

RUN mkdir /app
WORKDIR /app
ADD process_images_from_queue.py /app
ADD style_transfer.py /app
ADD main.py /app
ADD util.py /app
ADD requirements.txt /app

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "main.py"]

Writing aks/Dockerfile


In [5]:
!sudo docker build -t {get_key(env_path, "AKS_IMAGE")} aks

Sending build context to Docker daemon  46.59kB
Step 1/17 : FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
 ---> f4f6aaaaa057
Step 2/17 : RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
 ---> Using cache
 ---> 4196af2ba86e
Step 3/17 : RUN apt-get update && apt-get install -y --no-install-recommends         build-essential         ca-certificates         cmake         curl         git         nginx         supervisor         wget &&         rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 8ddcde9d280a
Step 4/17 : ENV PYTHON_VERSION=3.6
 ---> Using cache
 ---> 5a047de1f83a
Step 5/17 : RUN curl -o ~/miniconda.sh -O  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  &&     chmod +x ~/miniconda.sh &&     ~/miniconda.sh -b -p /opt/conda &&     rm ~/miniconda.sh &&     /opt/conda/bin/conda create -y --name py$PYTHON_VERSION python=$PYTHON_VERSION &&     /opt/conda/bin/conda c

### Test Docker image locally (before deploying on AKS)

Add images to queue

In [6]:
input_frames_dir = "orangutan_frames_test"
docker_output_frames_dir = "orangutan_frames_docker_test_processed"

In [7]:
!python aci/add_images_to_queue.py \
    --input-dir {input_frames_dir} \
    --output-dir {docker_output_frames_dir} \
    --queue-limit 10

Adding images from 'orangutan_frames_test' in storage to queue 'batchscoringdlqueue'
2018-12-07 20:37:18,080 [root:add_images_to_queue.py:48] DEBUG - Queue limit is reached. Exiting process...


In [8]:
!sed -e "s/=\"/=/g" -e "s/\"$//g" .env > .env.docker

In [9]:
!cat .env.docker
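
The `sed` call above exists because `docker run --env-file` reads values literally, so the quotes that `.env` places around each value would otherwise become part of the value. As a rough Python equivalent of the two substitutions (the `to_docker_env` name and the `eastus` sample value are placeholders chosen for illustration):

```python
import re

def to_docker_env(text):
    """Mimic the sed call: turn KEY="value" lines into KEY=value lines."""
    converted = []
    for line in text.splitlines():
        line = line.replace('="', '=')   # sed: s/="/=/g
        line = re.sub(r'"$', '', line)   # sed: s/"$//g
        converted.append(line)
    # Keep the trailing newline only if the input had one.
    return "\n".join(converted) + ("\n" if text.endswith("\n") else "")

sample = 'SB_QUEUE="batchscoringdlqueue"\nREGION="eastus"\n'
print(to_docker_env(sample), end="")
```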

Run docker locally

In [10]:
!sudo docker run --runtime=nvidia -e TERMINATE=True --env-file ".env.docker" {get_key(env_path, "AKS_IMAGE")}

2018-12-07 20:37:24,568 [root:process_images_from_queue.py:35] DEBUG - Downloading style model from directory models
2018-12-07 20:37:25,571 [root:process_images_from_queue.py:50] DEBUG - The following model were downloaded: ["candy.pth","model.pth","mosaic.pth","rain_princess.pth","udnie.pth"]
2018-12-07 20:37:25,571 [root:process_images_from_queue.py:88] DEBUG - It took 1.07 seconds to download style model.
2018-12-07 20:37:25,571 [root:process_images_from_queue.py:91] DEBUG - Start listening to queue 'batchscoringdlqueue' on service bus...
2018-12-07 20:37:25,572 [root:process_images_from_queue.py:96] DEBUG - Peek queue...
2018-12-07 20:37:25,637 [root:process_images_from_queue.py:123] DEBUG - Queue message body: {'input_frame': '000001_frame.jpg', 'input_dir': 'orangutan_frames_test', 'output_dir': 'orangutan_frames_docker_test_processed'}
2018-12-07 20:37:25,644 [root:process_images_from_queue.py:140] DEBUG - Starting style transfer on orangutan_frames_test/000001_frame.jpg
2018-1

2018-12-07 20:37:31,313 [root:process_images_from_queue.py:150] DEBUG - Finished style transfer on orangutan_frames_test/000009_frame.jpg
2018-12-07 20:37:31,339 [root:process_images_from_queue.py:162] DEBUG - Uploaded output file and log file to storage
2018-12-07 20:37:31,339 [root:process_images_from_queue.py:171] DEBUG - Deleting queue message...
2018-12-07 20:37:31,428 [root:process_images_from_queue.py:96] DEBUG - Peek queue...
2018-12-07 20:37:31,433 [root:process_images_from_queue.py:123] DEBUG - Queue message body: {'input_frame': '000010_frame.jpg', 'input_dir': 'orangutan_frames_test', 'output_dir': 'orangutan_frames_docker_test_processed'}
2018-12-07 20:37:31,439 [root:process_images_from_queue.py:140] DEBUG - Starting style transfer on orangutan_frames_test/000010_frame.jpg
2018-12-07 20:37:31,467 [root:style_transfer.py:157] DEBUG - Processing .aks/input/000010_frame.jpg
2018-12-07 20:37:31,560 [root:process_images_from_queue.py:150] DEBUG - Finished style transfer on ora
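
The log above follows a peek, process, delete cycle: a message is read from the queue, the frame is styled and uploaded, and only then is the message deleted. The sketch below imitates that cycle with an in-memory `deque`. It is an illustration of the pattern only, not the contents of `process_images_from_queue.py`; `drain_queue` and the `process` callback are hypothetical names.

```python
from collections import deque

def drain_queue(messages, process):
    """Handle messages in order; remove each one only after `process` succeeds."""
    processed = []
    while messages:
        message = messages[0]      # peek without removing
        process(message)           # style transfer and upload would happen here
        messages.popleft()         # delete the message only after success
        processed.append(message["input_frame"])
    return processed

queue = deque(
    {
        "input_frame": "%06d_frame.jpg" % i,
        "input_dir": "orangutan_frames_test",
        "output_dir": "orangutan_frames_docker_test_processed",
    }
    for i in range(1, 4)
)
print(drain_queue(queue, process=lambda message: None))
```

Deleting after processing means that a frame whose processing raises an error stays in the queue and can be picked up again.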

Check that queue is now empty

In [11]:
!az servicebus queue show \
    --name {get_key(env_path, "SB_QUEUE")} \
    --namespace-name {get_key(env_path, "SB_NAMESPACE")} \
    --resource-group {get_key(env_path, "RESOURCE_GROUP")} \
    --query 'countDetails.activeMessageCount'

0


Once the queue is empty, you can use Azure Storage Explorer to check that all the output images were saved correctly in the directory `orangutan_frames_docker_test_processed`.

Tag and push docker image

In [12]:
!sudo docker tag {get_key(env_path, "AKS_IMAGE")} {get_key(env_path, "DOCKER_LOGIN")}/{get_key(env_path, "AKS_IMAGE")}

In [13]:
!sudo docker push {get_key(env_path, "DOCKER_LOGIN")}/{get_key(env_path, "AKS_IMAGE")}

The push refers to repository [docker.io/jiata/batchscoringdl_aks_app]

latest: digest: sha256:4ad2d4c94c9a4ec7994590b436684f3585084c02686d430194260b42f0129520 size: 4507


### Provision AKS cluster

Set how many nodes you want to provision.

In [14]:
node_count = 5

Check that the region has enough available cores for the "Standard_NC6s_v3" size. If not, fall back to "Standard_D2s_v3" and check again. If that quota is also insufficient, raise an exception.

In [15]:
vm_dict = {
    "NCSv3": {
        "size": "Standard_NC6s_v3",
        "cores": 6
    },
    "DSv3": {
        "size": "Standard_D2s_v3",
        "cores": 2
    }
}

print("Checking quota for family size NCSv3...")
vm_family = "NCSv3"
requested_cores = node_count * vm_dict[vm_family]["cores"]

def check_quota(vm_family):
    """
    Return the number of cores still available for `vm_family` in the region
    (quota limit minus current usage).
    """
    results = subprocess.run([
        "az", "vm", "list-usage", 
        "--location", get_key(env_path, "REGION"), 
        "--query", "[?contains(localName, '%s')].{max:limit, current:currentValue}" % (vm_family)
    ], stdout=subprocess.PIPE, check=True)
    quota = json.loads(results.stdout.decode('utf-8'))
    return int(quota[0]['max']) - int(quota[0]['current'])

diff = check_quota(vm_family)
# The request fits as long as it does not exceed the remaining quota.
if diff < requested_cores:
    print("Not enough cores of NCSv3 in region, asking for {} but have {}".format(requested_cores, diff))
    
    print("Retrying with family size DSv3...")
    vm_family = "DSv3"
    requested_cores = node_count * vm_dict[vm_family]["cores"]
    
    diff = check_quota(vm_family)
    if diff < requested_cores:
        print("Not enough cores of DSv3 in region, asking for {} but have {}".format(requested_cores, diff))
        raise Exception("Core Limit", "Not enough cores to satisfy request")

print("There are enough cores, you may continue...") 

Checking quota for family size NCSv3...
There are enough cores, you may continue...
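
The fallback logic can also be written as a pure function, which makes it easy to try out without calling the Azure CLI. This is a sketch under stated assumptions: `pick_vm_family` is a hypothetical helper, the available-core numbers are made up, and a remaining quota exactly equal to the request is treated as sufficient.

```python
def pick_vm_family(node_count, vm_dict, available_cores, preference=("NCSv3", "DSv3")):
    """Return (family, requested_cores) for the first family whose quota fits."""
    for family in preference:
        needed = node_count * vm_dict[family]["cores"]
        if available_cores.get(family, 0) >= needed:
            return family, needed
    raise RuntimeError("Not enough cores in any candidate VM family")

vm_dict = {
    "NCSv3": {"size": "Standard_NC6s_v3", "cores": 6},
    "DSv3": {"size": "Standard_D2s_v3", "cores": 2},
}
# Made-up remaining quota per family, standing in for the `az vm list-usage` result.
available = {"NCSv3": 12, "DSv3": 40}
print(pick_vm_family(5, vm_dict, available))
```

With five nodes the GPU family would need 30 cores, so this example falls back to `DSv3`.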


Create the AKS cluster. This step may take a while. Note that it also creates a second resource group in your subscription, which contains the actual compute resources of the AKS cluster.

In [17]:
!az aks create \
    --resource-group {get_key(env_path, "RESOURCE_GROUP")} \
    --name {get_key(env_path, "AKS_CLUSTER")} \
    --node-count {node_count} \
    --node-vm-size {vm_dict[vm_family]["size"]} \
    --generate-ssh-keys \
    --service-principal {get_key(env_path, "SP_CLIENT")} \
    --client-secret {get_key(env_path, "SP_SECRET")}

Install kubectl, the command-line tool used to manage the Kubernetes cluster.

In [18]:
!sudo az aks install-cli

Downloading client to "/usr/local/bin/kubectl" from "https://storage.googleapis.com/kubernetes-release/release/v1.13.0/bin/linux/amd64/kubectl"
Please ensure that /usr/local/bin is in your search PATH, so the `kubectl` command can be found.


In [19]:
!az aks get-credentials \
    --resource-group {get_key(env_path, 'RESOURCE_GROUP')}\
    --name {get_key(env_path, 'AKS_CLUSTER')}

Merged "batchscoringdlcluster" as current context in /home/jiata/.kube/config


Check that our nodes are up and ready.

In [21]:
!kubectl get nodes

NAME                       STATUS   ROLES   AGE     VERSION
aks-nodepool1-40264992-0   Ready    agent   4m40s   v1.9.11
aks-nodepool1-40264992-1   Ready    agent   4m26s   v1.9.11
aks-nodepool1-40264992-2   Ready    agent   4m36s   v1.9.11
aks-nodepool1-40264992-3   Ready    agent   4m32s   v1.9.11
aks-nodepool1-40264992-4   Ready    agent   4m31s   v1.9.11
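
Instead of reading the table by eye, the readiness check can be scripted against the JSON form of the same command (`kubectl get nodes -o json`), where each node carries a `Ready` condition under `status.conditions`. A minimal sketch, with `not_ready_nodes` as a hypothetical helper and a hand-written sample standing in for real cluster output:

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    pending = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(node["metadata"]["name"])
    return pending

# Hand-written sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "aks-nodepool1-40264992-0"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "aks-nodepool1-40264992-1"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}
print(not_ready_nodes(sample))
```

In the notebook, the input could come from `json.loads` applied to the captured output of `kubectl get nodes -o json`.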


### Deploy docker image to AKS cluster

To deploy our neural style transfer script into our AKS cluster, we need to define what the deployment will look like:

In [23]:
aks_deployment_json = {
    "apiVersion": "apps/v1beta1",
    "kind": "Deployment",
    "metadata": {
        "name": "aks-app", 
        "labels": {
            "purpose": "dequeue_messages_and_apply_style_transfer"
        }
    },
    "spec": {
        "replicas": node_count,
        "template": {
            "metadata": {
                "labels": {
                    "app": "aks-app"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "aks-app",
                        "image": "{}/{}:latest".format(get_key(env_path, "DOCKER_LOGIN"), get_key(env_path, "AKS_IMAGE")),
                        "volumeMounts": [
                            {
                                "mountPath": "/usr/local/nvidia", 
                                "name": "nvidia"
                            }
                        ],
                        "resources": {
                            "requests": {
                                "alpha.kubernetes.io/nvidia-gpu": 1
                            },
                            "limits": {
                                "alpha.kubernetes.io/nvidia-gpu": 1
                            },
                        },
                        "ports": [{
                            "containerPort": 433
                        }],
                        "env": [
                            {
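                                # Kubernetes passes env values literally (no shell expansion of
                                # $LD_LIBRARY_PATH), so the paths are listed explicitly; the last
                                # two mirror the Dockerfile.
                                "name": "LD_LIBRARY_PATH",
                                "value": "/usr/local/nvidia/lib64:/opt/conda/envs/py3.6/lib:/usr/local/cuda/lib64",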
                            },
                            {
                                "name": "DP_DISABLE_HEALTHCHECKS", 
                                "value": "xids"
                            },
                            {
                                "name": "STORAGE_MODEL_DIR",
                                "value": get_key(env_path, "STORAGE_MODEL_DIR")
                            },
                            {
                                "name": "SUBSCRIPTION_ID",
                                "value": get_key(env_path, "SUBSCRIPTION_ID")
                            },
                            {
                                "name": "RESOURCE_GROUP",
                                "value": get_key(env_path, "RESOURCE_GROUP")
                            },
                            {
                                "name": "REGION",
                                "value": get_key(env_path, "REGION")
                            },
                            {
                                "name": "STORAGE_ACCOUNT_NAME", 
                                "value": get_key(env_path, "STORAGE_ACCOUNT_NAME")
                            },
                            {
                                "name": "STORAGE_ACCOUNT_KEY",
                                "value": get_key(env_path, "STORAGE_ACCOUNT_KEY")
                            },
                            {
                                "name": "STORAGE_CONTAINER_NAME",
                                "value": get_key(env_path, "STORAGE_CONTAINER_NAME")
                            },
                            {
                                "name": "SB_SHARED_ACCESS_KEY_NAME",
                                "value": get_key(env_path, "SB_SHARED_ACCESS_KEY_NAME")
                            },
                            {
                                "name": "SB_SHARED_ACCESS_KEY_VALUE",
                                "value": get_key(env_path, "SB_SHARED_ACCESS_KEY_VALUE")
                            },
                            {
                                "name": "SB_NAMESPACE",
                                "value": get_key(env_path, "SB_NAMESPACE")
                            },
                            {
                                "name": "SB_QUEUE", 
                                "value": get_key(env_path, "SB_QUEUE")
                            },
                        ],
                    }
                ],
                "volumes": [
                    {
                        "name": "nvidia", 
                        "hostPath": {
                            "path": "/usr/local/nvidia"
                        }
                    }
                ],
            },
        },
    },
}

In [24]:
with open("aks_deployment.json", "w") as outfile:
    json.dump(aks_deployment_json, outfile, indent=4, sort_keys=True)
    outfile.write('\n\n')
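
Most of the `env` list in the deployment follows one pattern: the container variable has the same name as the `.env` key it is read from. One possible way to shorten it is to generate those entries from a list of keys. The sketch below is not how the notebook builds the manifest; `env_entries` and `ENV_KEYS` are hypothetical names, and the lookup is passed in so the example runs without a `.env` file.

```python
# Keys that are copied one-to-one from .env into the container environment.
ENV_KEYS = [
    "STORAGE_MODEL_DIR", "SUBSCRIPTION_ID", "RESOURCE_GROUP", "REGION",
    "STORAGE_ACCOUNT_NAME", "STORAGE_ACCOUNT_KEY", "STORAGE_CONTAINER_NAME",
    "SB_SHARED_ACCESS_KEY_NAME", "SB_SHARED_ACCESS_KEY_VALUE",
    "SB_NAMESPACE", "SB_QUEUE",
]

def env_entries(keys, lookup):
    """Build Kubernetes env entries, failing early when a value is missing."""
    entries = []
    for key in keys:
        value = lookup(key)
        if value is None:
            raise KeyError("No value found for %s" % key)
        entries.append({"name": key, "value": value})
    return entries

# Placeholder values stand in for the real .env contents.
demo_values = {key: "placeholder" for key in ENV_KEYS}
print(len(env_entries(ENV_KEYS, demo_values.get)))
```

In the notebook, the lookup could be `lambda key: get_key(env_path, key)`. Failing on a missing value catches an incomplete `.env` before the manifest is written.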

### Run style transfer on AKS

Add 100 new messages to the queue so that we can test our newly created AKS cluster.

In [25]:
aks_output_frames_dir = "orangutan_frames_aks_test_processed"

In [26]:
!python aci/add_images_to_queue.py \
    --input-dir {input_frames_dir} \
    --output-dir {aks_output_frames_dir} \
    --queue-limit 100

Adding images from 'orangutan_frames_test' in storage to queue 'batchscoringdlqueue'
2018-12-07 20:50:20,196 [root:add_images_to_queue.py:48] DEBUG - Queue limit is reached. Exiting process...


Using the `aks_deployment.json` we created, create our deployment on AKS. This can take a few minutes...

In [27]:
!kubectl create -f aks_deployment.json

deployment.apps/aks-app created


Now that it's been deployed, check the pods to make sure that the deployment worked as expected.

In [28]:
!kubectl get pods

NAME                       READY   STATUS              RESTARTS   AGE
aks-app-864c65b9bb-46dnd   0/1     ContainerCreating   0          4m
aks-app-864c65b9bb-d54cc   0/1     ContainerCreating   0          4m
aks-app-864c65b9bb-d575x   0/1     ContainerCreating   0          4m
aks-app-864c65b9bb-dtd6x   0/1     ContainerCreating   0          4m
aks-app-864c65b9bb-nj8jh   0/1     ContainerCreating   0          4m


Print the logs of one of the pods to inspect the process running inside.

In [30]:
pod_json = !kubectl get pods -o json
pod_dict = json.loads(''.join(pod_json))
!kubectl logs {pod_dict['items'][0]['metadata']['name']}
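
The cell above reads the logs of the first pod in the list, which may still be starting. In the JSON output, a pod that `kubectl get pods` shows as `ContainerCreating` reports `status.phase` as `Pending`, so grouping pods by phase is one way to pick a pod that is already running. A minimal sketch, with `pods_by_phase` as a hypothetical helper and a hand-written sample in place of real output:

```python
def pods_by_phase(pod_list):
    """Group pod names by their status.phase (Pending, Running, ...)."""
    phases = {}
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        phases.setdefault(phase, []).append(pod["metadata"]["name"])
    return phases

# Hand-written sample in the shape of `kubectl get pods -o json`.
sample = {
    "items": [
        {"metadata": {"name": "aks-app-864c65b9bb-46dnd"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "aks-app-864c65b9bb-d54cc"}, "status": {"phase": "Running"}},
    ]
}
running = pods_by_phase(sample).get("Running", [])
print(running)
```

With the notebook's `pod_dict`, the first name in `running` could be passed to `kubectl logs`.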

Check that there are now no more messages in the queue.

In [31]:
!az servicebus queue show \
    --name {get_key(env_path, "SB_QUEUE")} \
    --namespace-name {get_key(env_path, "SB_NAMESPACE")} \
    --resource-group {get_key(env_path, "RESOURCE_GROUP")} \
    --query 'countDetails.activeMessageCount'

0


Once the queue is empty, you can use Azure Storage Explorer to check that all the output images were saved correctly in the directory `orangutan_frames_aks_test_processed`.
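
With 100 messages spread over several pods, the queue takes some time to drain, so a single check may still show a non-zero count. The sketch below polls until the count reaches zero or a timeout expires. `wait_until_empty` and its parameters are hypothetical; the count source is passed in as a function so the loop can be tried without Azure access.

```python
import time

def wait_until_empty(get_count, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll `get_count` until it returns 0; return False if the timeout expires."""
    waited = 0
    while True:
        if get_count() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds

# Simulated counts standing in for repeated queue checks.
counts = iter([42, 7, 0])
print(wait_until_empty(lambda: next(counts), sleep=lambda seconds: None))
```

In the notebook, `get_count` could wrap the `az servicebus queue show` call from the cell above and convert its output to an integer.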

### Monitor in Kubernetes dashboard
You can use the Kubernetes dashboard to monitor the cluster using the following commands:

```
# use the kube_dashboard_access.yaml to create a deployment
!kubectl create -f kube_dashboard_access.yaml

# use this command to browse
!az aks browse -n {get_key(env_path, "AKS_CLUSTER")} -g {get_key(env_path, "RESOURCE_GROUP")}
```

### Additional commands for AKS

Scale your AKS cluster:

```
!az aks scale \
    --name {get_key(env_path, "AKS_CLUSTER")} \
    --resource-group {get_key(env_path, "RESOURCE_GROUP")} \
    --node-count 10
```

Scale your deployment:
```
!kubectl scale deployment.apps/aks-app --replicas=10
```

---

Continue to the next [notebook](/notebooks/04_deploy_logic_app.ipynb).