8 changes: 7 additions & 1 deletion .gitignore
@@ -14,4 +14,10 @@ startup.sh
nohup.out

venv/
z_local_saved/
z_local_saved/
/.idea/
/tools/.python-version
/.python-version
*.iml
*.xml

@@ -0,0 +1,93 @@
---
title: Spin up the GKE Cluster
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

Arm CPUs are widely used in traditional AI/ML use cases. In this Learning Path, you learn how to run [Ollama](https://ollama.com/) on Arm-based CPUs in a hybrid architecture (amd64 and arm64) K8s cluster.

To demonstrate this as a real-life scenario, you'll bring up an initial Kubernetes cluster (depicted as "*1. Initial Cluster (amd64)*" in the image below) with an amd64 node running an Ollama Deployment and Service.

Next, as depicted by "*2. Hybrid Cluster amd64/arm64*", you'll add an arm64 node and apply an arm64 Deployment and Service to it, so that you can test both architectures together and separately to investigate performance.

When you're satisfied with arm64 performance compared to amd64, it's easy to delete the amd64-specific node, deployment, and service to complete the migration, as depicted in "*3. Migrated Cluster (arm64)*".

![Project Overview](images/general_flow.png)

Once you've seen how easy it is to add an arm64 node to an existing cluster, you can apply this knowledge to experiment with the value arm64 brings to other workloads in your environment as you see fit.

### Create the cluster

1. From within the GCP Console, navigate to [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes/list/overview) and click *Create*.

2. Select *Standard*->*Configure*

![Select and Configure Cluster Type](images/select_standard.png)

The *Cluster basics* tab appears.

3. For *Name*, enter *ollama-on-multiarch*.
4. For *Region*, enter *us-central1*.

![Cluster basics](images/cluster_basics.png)

{{% notice Note %}}
Although this will work in any region and zone where the C4 and C4a instance types are supported, this demo uses the *us-central1* region and the *us-central1-a* zone. In addition, with simplicity and cost savings in mind, only one node per architecture is used.
{{% /notice %}}

5. Click on *NODE POOLS*->*default-pool*
6. For *Name*, enter *amd64-pool*
7. For *Size*, enter *1*
8. Select *Specify node locations*, and select *us-central1-a*

![Configure amd64 Node pool](images/x86-node-pool.png)


9. Click on *NODE POOLS*->*Nodes*
10. For *Series*, select *C4*
11. For *Machine Type*, select *c4-standard-4*

{{% notice Note %}}
We've chosen node types that support one pod per node. If you wish to run multiple pods per node, plan for each node to provide ~10GB per pod.
{{% /notice %}}

![Configure amd64 node type](images/configure-x86-note-type.png)

12. Click the *Create* button at the bottom of the screen.

Cluster creation takes a few moments. When the green checkmark appears next to the *ollama-on-multiarch* cluster, you're ready to test your connection to the cluster.
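
If you'd rather script this step, the following *gcloud* command is a minimal sketch of an equivalent cluster creation. It mirrors the console choices above, but it's an assumption rather than part of the verified console flow; in particular, *gcloud* names the initial node pool *default-pool*, so rename or recreate the pool if you want it called *amd64-pool*:

```bash
# Sketch: one-node regional cluster with a C4 machine type in us-central1-a
gcloud container clusters create ollama-on-multiarch \
    --region us-central1 \
    --node-locations us-central1-a \
    --num-nodes 1 \
    --machine-type c4-standard-4
```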

### Connect to the cluster

{{% notice Note %}}
The following assumes you have gcloud and kubectl already installed. If not, please follow the instructions on the first page under "Prerequisites".
{{% /notice %}}

You'll first set up credentials for your newly created K8s cluster using the gcloud utility. Enter the following at your command prompt (or in Cloud Shell), making sure to replace "YOUR_PROJECT_ID" with the ID of your GCP project:

```bash
export REGION=us-central1
export CLUSTER_NAME=ollama-on-multiarch
export PROJECT_ID=YOUR_PROJECT_ID
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
```
If you get the message:

```commandline
CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin
```
This command should help resolve it:

```bash
gcloud components install gke-gcloud-auth-plugin
```
Finally, test the connection to the cluster with this command:

```commandline
kubectl cluster-info
```
If you receive a non-error response, you're successfully connected to the K8s cluster!
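
You can also verify the node's CPU architecture at this point, since the rest of this Learning Path relies on the *kubernetes.io/arch* node label. For the single node created above, the extra column should show *amd64*:

```commandline
kubectl get nodes -L kubernetes.io/arch
```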
@@ -0,0 +1,264 @@
---
title: Deploy Ollama amd64 to the cluster
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

An easy way to experiment with arm64 nodes in your K8s cluster is to deploy arm64 nodes and pods alongside your existing amd64 nodes and pods. In this section of the tutorial, you'll bootstrap the cluster with Ollama on amd64 to simulate an "existing" K8s cluster running Ollama.

### Deployment and Service


1. Copy the following YAML, and save it to a file called *namespace.yaml*:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
```

When the above is applied, a new K8s namespace named *ollama* is created. This is where all the K8s objects created in this tutorial will live.

2. Copy the following YAML, and save it to a file called *amd64_ollama.yaml*:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-amd64-deployment
  labels:
    app: ollama-multiarch
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      arch: amd64
  template:
    metadata:
      labels:
        app: ollama-multiarch
        arch: amd64
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
      - image: ollama/ollama:0.6.1
        name: ollama-multiarch
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
      volumes:
      - emptyDir: {}
        name: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-amd64-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30668
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    arch: amd64
  type: LoadBalancer
```

When the above is applied:

* A new Deployment called *ollama-amd64-deployment* is created. This Deployment pulls a multi-architecture (amd64 and arm64) [Ollama image from Docker Hub](https://hub.docker.com/layers/ollama/ollama/0.6.1/images/sha256-28b909914d4e77c96b1c57dea199c60ec12c5050d08ed764d9c234ba2944be63).

Of particular interest is the *nodeSelector* *kubernetes.io/arch* with the value *amd64*. It ensures that this Deployment's pods are scheduled only on amd64-based nodes, so the amd64 version of the Ollama container image is used.

* A new load balancer Service called *ollama-amd64-svc* is created, targeting all pods with the *arch: amd64* label (the pods created by the amd64 Deployment).

Setting *sessionAffinity* to *None* on this Service removes sticky connections, so requests are not persistently routed to the same pod.

### Apply the amd64 Deployment and Service

1. Run the following command to apply the namespace, deployment, and service definitions:

```bash
kubectl apply -f namespace.yaml
kubectl apply -f amd64_ollama.yaml
```

You should get the following responses back:

```bash
namespace/ollama created
deployment.apps/ollama-amd64-deployment created
service/ollama-amd64-svc created
```
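
At this point the *ollama* namespace from *namespace.yaml* exists; you can confirm it directly with:

```commandline
kubectl get namespace ollama
```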
2. Optionally, set the default namespace of your current context to *ollama*, so you don't need to specify the namespace with each command:

```bash
kubectl config set-context --current --namespace=ollama
```
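
You can confirm the default namespace for your current context with:

```commandline
kubectl config view --minify --output 'jsonpath={..namespace}'
```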

3. Get the status of the nodes, pods, and services by running the following:

```commandline
kubectl get nodes,pods,svc -nollama
```

Your output should be similar to the following, showing one node, one pod, and one service:

```commandline
NAME STATUS ROLES AGE VERSION
node/gke-ollama-on-arm-amd64-pool-62c0835c-93ht Ready <none> 77m v1.31.6-gke.1020000

NAME READY STATUS RESTARTS AGE
pod/ollama-amd64-deployment-cbfc4b865-msftf 1/1 Running 0 16m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ollama-amd64-svc LoadBalancer 1.2.2.3 1.2.3.4 80:30668/TCP 16m
```
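
To additionally confirm that the pod was scheduled onto an amd64 node (via the *nodeSelector*), ask for the wide output, which includes the node name:

```commandline
kubectl get pods -nollama -o wide
```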

When the pods show *Running* and the service shows a valid *External IP*, you're ready to test the Ollama amd64 service!

### Test the Ollama on amd64 web service

{{% notice Note %}}
The following utility, *model_util.sh*, is provided as a convenience to accompany this learning path. It's simply a shell wrapper around kubectl, using the utilities [curl](https://curl.se/), [jq](https://jqlang.org/), [bc](https://www.gnu.org/software/bc/), and [stdbuf](https://www.gnu.org/software/coreutils/manual/html_node/stdbuf-invocation.html). Make sure you have these shell utilities installed before running it.
{{% /notice %}}
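
If any of these utilities are missing, they are available from the usual package managers. As an illustrative sketch for a Debian or Ubuntu machine (package names can differ on other distributions; *stdbuf* is part of *coreutils*):

```bash
sudo apt-get update
sudo apt-get install -y curl jq bc coreutils
```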


4. Copy the following shell script, and save it to a file called *model_util.sh*:

```bash
#!/bin/bash

echo

# Supported model names: https://ollama-operator.ayaka.io/pages/en/guide/supported-models
model_name="llama3.2"
#model_name="mistral"
#model_name="dolphin-phi"

#prompt="Name the two closest stars to earth"
prompt="Create a sentence that makes sense in the English language, with as many palindromes in it as possible"

echo "Server response:"

# Look up the external IP (or hostname) of the per-architecture LoadBalancer Service.
get_service_ip() {
    arch=$1
    svc_name="ollama-${arch}-svc"
    kubectl -nollama get svc $svc_name -o jsonpath="{.status.loadBalancer.ingress[*]['ip', 'hostname']}"
}

# Stream a generate request (stdbuf line-buffers curl so tokens appear as they
# arrive), then compute tokens per second from the final stats line.
infer_request() {
    svc_ip=$1
    temp=$(mktemp)
    stdbuf -oL curl -s http://$svc_ip/api/generate -d '{
      "model": "'"$model_name"'",
      "prompt": "'"$prompt"'"
    }' | tee $temp

    # The final JSON line reports eval_count and eval_duration (in nanoseconds).
    duration=$(grep eval_count $temp | jq -r '.eval_duration')
    count=$(grep eval_count $temp | jq -r '.eval_count')

    if [[ -n "$duration" && -n "$count" ]]; then
        # tokens/second = eval_count / (eval_duration / 1e9)
        quotient=$(echo "scale=2;1000000000*$count/$duration" | bc)
        echo "Tokens per second: $quotient"
    else
        echo "Error: eval_count or eval_duration not found in response."
    fi

    rm $temp
}

# Ask the Ollama server to download the selected model.
pull_model() {
    svc_ip=$1
    curl http://$svc_ip/api/pull -d '{
      "model": "'"$model_name"'"
    }'
}

# Simple liveness check; the server responds with "Ollama is running".
hello_request() {
    svc_ip=$1
    curl http://$svc_ip/
}

run_action() {
    arch=$1
    action=$2

    svc_ip=$(get_service_ip $arch)
    echo "Using service endpoint $svc_ip for $action on $arch"

    case $action in
        infer)
            infer_request $svc_ip
            ;;
        pull)
            pull_model $svc_ip
            ;;
        hello)
            hello_request $svc_ip
            ;;
        *)
            echo "Invalid second argument. Use 'infer', 'pull', or 'hello'."
            exit 1
            ;;
    esac
}

case $1 in
    arm64|amd64|multiarch)
        run_action $1 $2
        ;;
    *)
        echo "Invalid first argument. Use 'arm64', 'amd64', or 'multiarch'."
        exit 1
        ;;
esac

# Show the most recent timestamped log line across the ollama-multiarch pods.
echo -e "\n\nPod log output:"
echo;kubectl logs --timestamps -l app=ollama-multiarch -nollama --prefix | sort -k2 | cut -d " " -f 1,2 | tail -1
echo
```

5. Make it executable with the following command:

```bash
chmod 755 model_util.sh
```

This shell script conveniently bundles many test and logging commands into a single place, making it easy to test, troubleshoot, and view the services you expose in this tutorial.
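
Besides the *hello* action used in the next step, the script accepts *pull* and *infer* as its second argument, to download the selected model and run a timed inference against it. For example:

```commandline
./model_util.sh amd64 pull
./model_util.sh amd64 infer
```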

6. Run the following to make an HTTP request to the amd64 Ollama service on port 80:

```commandline
./model_util.sh amd64 hello
```

You should get back the HTTP response, as well as the log line from the pod that served it:

```commandline
Server response:
Using service endpoint 34.55.25.101 for hello on amd64
Ollama is running

Pod log output:

[pod/ollama-amd64-deployment-cbfc4b865-msftf/ollama-multiarch] 2025-03-25T21:13:49.022522588Z
```

Success is defined by seeing the words "Ollama is running". If you see this in your output, congratulations: you've successfully bootstrapped your GKE cluster with an amd64 node running a Deployment with the Ollama multi-architecture container image!

Next, you'll do the same thing, but with an arm64 node.