# AKS Cookbook

## 🧪 Kubernetes AI Toolchain Operator (KAITO) lab

![visual](visual.png)

KAITO is an operator that automates AI/ML model inference and tuning workloads in a Kubernetes cluster. It targets popular open-source large models such as Falcon and Phi-3.  
In this lab we explore the workflow of onboarding large AI inference models in Kubernetes with KAITO.  
It's based on the official [AKS documentation](https://learn.microsoft.com/en-us/azure/aks/ai-toolchain-operator).  

▶️ Click on the `Run All` button to execute all the subsequent steps in sequence, or run each step individually by executing the cells one at a time.

### TOC

- [0️⃣ Initialize notebook variables](#0)
- [1️⃣ Verify the Azure CLI and connected Azure subscription](#1)
- [2️⃣ Create a new Azure Resource Group or reuse an existing one](#2)
- [3️⃣ Create an AKS cluster with the AI toolchain operator add-on enabled](#3)
- [4️⃣ Connect to the AKS cluster](#4)
- [5️⃣ Create role assignment for the service principal](#5)
- [6️⃣ Establish a federated identity credential](#6)
- [7️⃣ Deploy a default hosted AI model](#7)
- [8️⃣ Track the live resource changes in your workspace](#8)
- [9️⃣ Run the model with a sample input](#9)
- [🗑️ Clean up resources](#clean)


<a id='0'></a>
### 0️⃣ Initialize notebook variables
Adjust the location parameters according to your preferences and the [product availability by Azure region](https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?cdn=disable).

In [None]:
import os, time, json, requests, utils

create_resources = True # specify True if you want to create new resources, False to use existing ones

if create_resources:
    # create new resources with the following properties
    deployment_name = os.path.basename(os.path.dirname(globals()['__vsc_ipynb_file__']))
    resource_group_name = f"lab-{deployment_name}" # change the name to match your naming convention
    resource_group_location = "eastus2"
    aks_resource_name = "aks-cluster"
    aks_node_count = 1

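else:
    # or use the following existing resources
    resource_group_name = ""
    resource_group_location = "" # region of the existing resource group, needed by the VM availability check below
    aks_resource_name = ""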

workspace_instance_type = "Standard_NC12s_v3"
utils.print_ok('Notebook initialized')


<a id='1'></a>
### 1️⃣ Verify the Azure CLI and connected Azure subscription
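The following commands verify that the Azure CLI is connected to your Azure subscription, register the required resource providers and the `AIToolchainOperatorPreview` feature, install or update the `k8s-extension` and `aks-preview` extensions, and check that the selected workspace instance type is available in the target region.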

In [None]:
output = utils.run("az account show", "Retrieved az account", "Failed to get the current az account")
if output.success and output.json_data:
    current_user = output.json_data.get('user').get('name')
    subscription_id = output.json_data.get('id')
    tenant_id = output.json_data.get('tenantId')
output = utils.run("az provider register --namespace Microsoft.ContainerService --wait", "Microsoft.ContainerService registered in your subscription", "Failed to register Microsoft.ContainerService")
output = utils.run("az provider register --namespace Microsoft.KubernetesConfiguration --wait", "Microsoft.KubernetesConfiguration registered in your subscription", "Failed to register Microsoft.KubernetesConfiguration")
output = utils.run("az extension add --name k8s-extension", "az k8s-extension installed", "Failed to install az k8s-extension")
output = utils.run("az extension update --name k8s-extension", "az k8s-extension updated", "Failed to update az k8s-extension")
output = utils.run("az extension add --name aks-preview", "az aks-preview extension installed", "Failed to install az aks-preview extension")
output = utils.run("az extension update --name aks-preview", "az aks-preview extension updated", "Failed to update az aks-preview extension")

output = utils.run("az feature register --namespace Microsoft.ContainerService --name AIToolchainOperatorPreview", "AIToolchainOperatorPreview registered in your subscription", "Failed to register AIToolchainOperatorPreview")

available = utils.check_vm_availability(workspace_instance_type, resource_group_location, subscription_id)
if not available:
    utils.print_error(f"Instance type {workspace_instance_type} is not available in location {resource_group_location}")
    raise SystemExit("Stopping the notebook!")
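
Feature registration is asynchronous, so `az feature register` can return while the feature is still in the `Registering` state. The following is a minimal sketch, assuming the JSON shape returned by `az feature show --namespace Microsoft.ContainerService --name AIToolchainOperatorPreview` (a `properties.state` field), of how that state could be checked before creating the cluster. The helper name `is_feature_registered` and the sample payloads are illustrative and not part of `utils`.

```python
import json


def is_feature_registered(feature_json: str) -> bool:
    """Return True when the feature's properties.state equals 'Registered'."""
    data = json.loads(feature_json)
    # Tolerate a missing or null "properties" object.
    state = (data.get("properties") or {}).get("state", "")
    return state == "Registered"


# Illustrative payloads shaped like `az feature show` output (assumed shape).
pending = json.dumps({
    "name": "Microsoft.ContainerService/AIToolchainOperatorPreview",
    "properties": {"state": "Registering"},
})
done = json.dumps({
    "name": "Microsoft.ContainerService/AIToolchainOperatorPreview",
    "properties": {"state": "Registered"},
})

print(is_feature_registered(pending))
print(is_feature_registered(done))
```

In the notebook, the JSON text would come from running the `az feature show` command and the check could be repeated with a short `time.sleep` between attempts.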



<a id='2'></a>
### 2️⃣ Create a new Azure Resource Group or reuse an existing one
All resources deployed in this lab will be created within the designated resource group. 

In [None]:
if create_resources:
    utils.create_resource_group(create_resources, resource_group_name, resource_group_location)


<a id='3'></a>
### 3️⃣ Create an AKS cluster with the AI toolchain operator add-on enabled
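The following step creates an AKS cluster with the OIDC issuer and the AI toolchain operator add-on enabled, using the [az aks create](https://learn.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-create) command. If `create_resources` is `False`, the existing cluster is retrieved instead.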

In [None]:
if create_resources:
    output = utils.run(f"az aks create --resource-group {resource_group_name} --name {aks_resource_name} --node-count {aks_node_count} --enable-oidc-issuer --enable-ai-toolchain-operator --generate-ssh-keys --only-show-errors",
             f"AKS cluster '{aks_resource_name}' created",
             f"Failed to create AKS cluster '{aks_resource_name}'")
    if output.success and output.json_data:
        aks_resource_id = output.json_data['id']
        aks_node_resource_group_name = output.json_data['nodeResourceGroup']
        aks_oidc_issuer = output.json_data.get("oidcIssuerProfile").get("issuerUrl")
        utils.print_info(f"AKS OIDC Issuer: {aks_oidc_issuer}")
else:
    output = utils.run(f"az aks show --resource-group {resource_group_name} --name {aks_resource_name} --only-show-errors",
             f"AKS cluster '{aks_resource_name}' retrieved",
             f"Failed to retrieve AKS cluster '{aks_resource_name}'")
    if output.success and output.json_data:
        aks_resource_id = output.json_data['id']
        aks_node_resource_group_name = output.json_data['nodeResourceGroup']
        aks_oidc_issuer = output.json_data.get("oidcIssuerProfile").get("issuerUrl")
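
The cell above chains `.get("oidcIssuerProfile").get("issuerUrl")`, which raises `AttributeError` if the profile is missing or null, as may happen for an existing cluster created without the OIDC issuer. The sketch below shows one defensive way to pull the same three fields out of the cluster JSON. The helper name `extract_cluster_info` and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional


def extract_cluster_info(cluster: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Extract the fields used later in the lab, returning None for absent ones."""
    # `or {}` covers both a missing key and an explicit null value.
    oidc_profile = cluster.get("oidcIssuerProfile") or {}
    return {
        "id": cluster.get("id"),
        "node_resource_group": cluster.get("nodeResourceGroup"),
        "oidc_issuer": oidc_profile.get("issuerUrl"),
    }


# Illustrative cluster documents (values are placeholders, not real resources).
with_oidc = {
    "id": "/subscriptions/0000/resourceGroups/lab/providers/Microsoft.ContainerService/managedClusters/aks-cluster",
    "nodeResourceGroup": "MC_lab_aks-cluster_eastus2",
    "oidcIssuerProfile": {"enabled": True, "issuerUrl": "https://example.invalid/issuer/"},
}
without_oidc = {"id": "some-id", "nodeResourceGroup": "some-rg", "oidcIssuerProfile": None}

print(extract_cluster_info(with_oidc)["oidc_issuer"])
print(extract_cluster_info(without_oidc)["oidc_issuer"])
```

A `None` issuer at this point is a signal to stop early, because the federated identity credential step depends on it.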


<a id='4'></a>
### 4️⃣ Connect to the AKS cluster
Configure kubectl to connect to your Kubernetes cluster using the [az aks get-credentials](https://learn.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-get-credentials) command. This command downloads credentials and configures the Kubernetes CLI to use them.

In [None]:
output = utils.run(f"az aks get-credentials --resource-group {resource_group_name} --name {aks_resource_name} --overwrite-existing",
             f"Credentials for AKS cluster '{aks_resource_name}' configured",
             f"Failed to configure credentials for AKS cluster '{aks_resource_name}'")


<a id='5'></a>
### 5️⃣ Create role assignment for the service principal
Retrieve the add-on's managed identity (`ai-toolchain-operator-<cluster name>`) from the node resource group and grant it the Contributor role on the lab resource group.


In [None]:
kaito_identity_name = f"ai-toolchain-operator-{aks_resource_name}"
output = utils.run(f"az identity show -g {aks_node_resource_group_name} -n {kaito_identity_name} --only-show-errors", "Identity retrieved", "Failed to retrieve identity")
if output.success and output.json_data:
    aks_resource_principal_id = output.json_data['principalId']
    print(f"AKS Resource Principal Id: {aks_resource_principal_id}")

output = utils.run(f"az role assignment create --assignee {aks_resource_principal_id} --scope /subscriptions/{subscription_id}/resourcegroups/{resource_group_name}  --role Contributor", "Role assigned", "Failed to assign role")
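
The `--scope` value passed to `az role assignment create` is the Azure Resource Manager ID of the resource group. As a small sketch, the ID can be built and validated in one place before it is interpolated into the command. The helper name `resource_group_scope` is hypothetical; it uses the `resourceGroups` casing that Azure displays, which Azure Resource Manager treats the same as the lowercase form used in the cell above.

```python
def resource_group_scope(subscription_id: str, resource_group_name: str) -> str:
    """Build the ARM scope string for a resource group."""
    if not subscription_id or not resource_group_name:
        # Fail early instead of sending a malformed scope to the CLI.
        raise ValueError("subscription_id and resource_group_name are required")
    return f"/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}"


# Placeholder identifiers for illustration only.
scope = resource_group_scope("00000000-0000-0000-0000-000000000000", "lab-kaito")
print(scope)
```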


<a id='6'></a>
### 6️⃣ Establish a federated identity credential

Create the federated identity credential between the managed identity, AKS OIDC issuer, and subject using the `az identity federated-credential create` command.

In [None]:
output = utils.run(f"az identity federated-credential create --name kaito-federated-identity --identity-name {kaito_identity_name} -g {aks_node_resource_group_name} --issuer {aks_oidc_issuer} --subject system:serviceaccount:\"kube-system:kaito-gpu-provisioner\" --audience api://AzureADTokenExchange", "Federated credential created", "Failed to create federated credential")  
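
The subject of the federated credential identifies a Kubernetes service account in the form `system:serviceaccount:<namespace>:<name>`. The sketch below assembles that subject and the argument list for the command used above, which avoids hand-escaping quotes inside an f-string. The helper names and the placeholder values are hypothetical; the flags are the ones already used in this notebook.

```python
import shlex
from typing import List


def service_account_subject(namespace: str, name: str) -> str:
    """Return the token subject for a Kubernetes service account."""
    return f"system:serviceaccount:{namespace}:{name}"


def federated_credential_args(identity_name: str, resource_group: str, issuer: str) -> List[str]:
    """Build the az CLI arguments as a list, one element per token."""
    return [
        "az", "identity", "federated-credential", "create",
        "--name", "kaito-federated-identity",
        "--identity-name", identity_name,
        "-g", resource_group,
        "--issuer", issuer,
        "--subject", service_account_subject("kube-system", "kaito-gpu-provisioner"),
        "--audience", "api://AzureADTokenExchange",
    ]


# Placeholder values for illustration only.
args = federated_credential_args(
    "ai-toolchain-operator-aks-cluster",
    "MC_lab_aks-cluster_eastus2",
    "https://example.invalid/issuer/",
)
print(shlex.join(args))
```

A list like this can be passed directly to `subprocess.run` without a shell, or joined with `shlex.join` when a single command string is required.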



<a id='7'></a>
### 7️⃣ Deploy a default hosted AI model



In [None]:



#! kubectl apply -f kaito_workspace_falcon_7b-instruct.yaml
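
The contents of `kaito_workspace_falcon_7b-instruct.yaml` are not shown in this notebook. The sketch below builds what such a Workspace manifest might look like, modeled on the upstream KAITO examples. The `apiVersion`, the field layout, and the `apps` label are assumptions that should be checked against the KAITO version installed in the cluster. The helper name `build_workspace_manifest` is hypothetical. The manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json
from typing import Any, Dict


def build_workspace_manifest(name: str, instance_type: str, preset: str) -> Dict[str, Any]:
    """Build a KAITO Workspace manifest (assumed schema, verify before use)."""
    return {
        "apiVersion": "kaito.sh/v1alpha1",  # assumption: may differ by KAITO version
        "kind": "Workspace",
        "metadata": {"name": name},
        "resource": {
            "instanceType": instance_type,
            # assumption: label used to select the GPU nodes for this workspace
            "labelSelector": {"matchLabels": {"apps": preset}},
        },
        "inference": {"preset": {"name": preset}},
    }


manifest = build_workspace_manifest(
    "workspace-falcon-7b-instruct",  # name queried later in this notebook
    "Standard_NC12s_v3",             # same value as workspace_instance_type
    "falcon-7b-instruct",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from `workspace_instance_type` keeps the VM size checked earlier in the notebook and the size requested by the workspace in sync.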


<a id='8'></a>
### 8️⃣ Track the live resource changes in your workspace



In [None]:
output = utils.run("kubectl get workspace workspace-falcon-7b-instruct -o json", "Workspace retrieved", "Failed to retrieve workspace")
if output.success and output.json_data:
    print(output.text)


output = utils.run("kubectl get svc workspace-falcon-7b-instruct -o jsonpath='{.spec.clusterIP}'", "Service IP retrieved", "Failed to retrieve service IP")
if output.success:
    service_ip = output.text
    print(f"Service IP: {service_ip}")
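
Provisioning GPU nodes and loading the model can take a while, so the workspace is usually not ready on the first query. The sketch below summarizes the `status.conditions` list of the workspace JSON, following the common Kubernetes convention of `type` and `status` fields. The helper names and the condition type names in the sample are illustrative assumptions; the actual types depend on the KAITO version.

```python
import json
from typing import Dict


def condition_statuses(workspace_json: str) -> Dict[str, str]:
    """Map each condition type to its status string ('True', 'False', 'Unknown')."""
    workspace = json.loads(workspace_json)
    conditions = (workspace.get("status") or {}).get("conditions") or []
    return {c.get("type", ""): c.get("status", "Unknown") for c in conditions}


def all_conditions_true(workspace_json: str) -> bool:
    """Return True only when at least one condition exists and all are 'True'."""
    statuses = condition_statuses(workspace_json)
    return bool(statuses) and all(s == "True" for s in statuses.values())


# Illustrative workspace documents (condition type names are assumed).
in_progress = json.dumps({"status": {"conditions": [
    {"type": "ResourceReady", "status": "True"},
    {"type": "InferenceReady", "status": "False"},
]}})
ready = json.dumps({"status": {"conditions": [
    {"type": "ResourceReady", "status": "True"},
    {"type": "InferenceReady", "status": "True"},
]}})

print(condition_statuses(in_progress))
print(all_conditions_true(ready))
```

In the notebook, the JSON text would be the output of the `kubectl get workspace ... -o json` call above, polled with a delay until `all_conditions_true` returns `True`.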



<a id='9'></a>
### 9️⃣ Run the model with a sample input 



In [None]:
! kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://{service_ip}/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"what is AKS?\"}"
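
The request body in the command above is escaped by hand, which becomes error-prone once the prompt itself contains quotes. As a minimal sketch, the payload can be produced with `json.dumps`, and the URL assembled from the service IP and the `/chat` path used above. The helper names `build_chat_payload` and `build_chat_url` are hypothetical, and the IP address is a placeholder.

```python
import json


def build_chat_payload(prompt: str) -> str:
    """Serialize the request body; json.dumps handles quoting and escaping."""
    return json.dumps({"prompt": prompt})


def build_chat_url(service_ip: str) -> str:
    """Build the in-cluster endpoint URL for the workspace service."""
    return f"http://{service_ip}/chat"


payload = build_chat_payload("what is AKS?")
tricky = build_chat_payload('explain "pods" briefly')

print(build_chat_url("10.0.0.10"))  # placeholder cluster IP
print(payload)
print(tricky)
```

Because the service is exposed on a cluster IP, the request still has to be sent from inside the cluster, as the temporary `curl` pod above does.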


<a id='clean'></a>
### 🗑️ Clean up resources
When you're finished with the lab, you should remove all your deployed resources from Azure to avoid extra charges and keep your Azure subscription uncluttered. Use the [clean-up-resources notebook](clean-up-resources.ipynb) for that.