diff --git a/.gitignore b/.gitignore index 284d26700f..c1e1d2264d 100644 --- a/.gitignore +++ b/.gitignore @@ -14,4 +14,10 @@ startup.sh nohup.out venv/ -z_local_saved/ \ No newline at end of file +z_local_saved/ +/.idea/ +/tools/.python-version +/.python-version +*.iml +*.xml + diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md new file mode 100644 index 0000000000..726e624d8c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/0-spin_up_gke_cluster.md @@ -0,0 +1,93 @@ +--- +title: Spin up the GKE Cluster +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +---

## Overview

Arm CPUs are widely used for traditional AI/ML use cases. In this Learning Path, you learn how to run [Ollama](https://ollama.com/) on Arm-based CPUs in a hybrid-architecture (amd64 and arm64) K8s cluster.

To demonstrate this as a real-life scenario, you're going to bring up an initial Kubernetes cluster (depicted as "*1. Initial Cluster (amd64)*" in the image below) with an amd64 node running an Ollama Deployment and Service.

Next, as depicted by "*2. Hybrid Cluster amd64/arm64*", you'll add the arm64 node and apply an arm64 Deployment and Service to it, so that you can test both architectures together and separately to investigate performance.

When satisfied with the arm64 performance over amd64, it's easy to delete the amd64-specific node, deployment, and service to complete the migration, as depicted in "*3. Migrated Cluster (arm64)*".

![Project Overview](images/general_flow.png)

Once you've seen how easy it is to add an arm64 node to an existing cluster, you can apply the knowledge to experiment with the value arm64 brings to other workloads in your environment as you see fit.

### Create the cluster

1. From within the GCP Console, navigate to [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes/list/overview) and click *Create*.

2. Select *Standard*->*Configure*.

![Select and Configure Cluster Type](images/select_standard.png)

The *Cluster basics* tab appears.

3. For *Name*, enter *ollama-on-multiarch*
4. For *Region*, enter *us-central1*.

![Cluster basics](images/cluster_basics.png)

{{% notice Note %}}
Although this will work in all regions and zones where C4 and C4A instance types are supported, for this demo we use the *us-central1* region and the *us-central1-a* zone. In addition, with simplicity and cost savings in mind, only one node per architecture is used.
{{% /notice %}}

5. Click on *NODE POOLS*->*default-pool*
6. For *Name*, enter *amd64-pool*
7. For *Size*, enter *1*
8. Select *Specify node locations*, and select *us-central1-a*

![Configure amd64 Node pool](images/x86-node-pool.png)


9. Click on *NODE POOLS*->*Nodes*
10. For *Series*, select *C4*
11. For *Machine Type*, select *c4-standard-4*

{{% notice Note %}}
We've chosen node types that will support one pod per node. If you wish to run multiple pods per node, assume each node should provide ~10GB per pod.
{{% /notice %}}

![Configure amd64 node type](images/configure-x86-note-type.png)

12. Click the *Create* button at the bottom of the screen.

It will take a few minutes, but when the green checkmark shows next to the *ollama-on-multiarch* cluster, you're ready to test your connection to the cluster.
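As an alternative to the console flow above, you can create a similar cluster from the command line. The following gcloud sketch is an untested approximation of the console steps: flag values mirror the choices made above, and the initial node pool will be named *default-pool* rather than *amd64-pool* (you can add and remove node pools afterwards if you want the same names as this Learning Path).

```bash
# Approximate CLI equivalent of the console steps above (sketch only).
# Creates a regional cluster whose initial pool runs one c4-standard-4 (amd64)
# node in us-central1-a. Replace YOUR_PROJECT_ID with your GCP project ID.
gcloud container clusters create ollama-on-multiarch \
  --project YOUR_PROJECT_ID \
  --region us-central1 \
  --node-locations us-central1-a \
  --num-nodes 1 \
  --machine-type c4-standard-4
```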
### Connect to the cluster

{{% notice Note %}}
The following assumes you have gcloud and kubectl already installed. If not, please follow the instructions on the first page under "Prerequisites".
{{% /notice %}}

You'll first set up credentials for your newly created K8s cluster using the gcloud utility. Enter the following in your command prompt (or Cloud Shell), and make sure to replace "YOUR_PROJECT_ID" with the ID of your GCP project:

```bash
export REGION=us-central1
export CLUSTER_NAME=ollama-on-multiarch
export PROJECT_ID=YOUR_PROJECT_ID
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
```
If you get the message:

```commandline
CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin
```
This command should help resolve it:

```bash
gcloud components install gke-gcloud-auth-plugin
```
Finally, test the connection to the cluster with this command:

```commandline
kubectl cluster-info
```
If you receive a non-error response, you're successfully connected to the K8s cluster! \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md new file mode 100644 index 0000000000..dc675c4c1c --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/1-deploy-amd64.md @@ -0,0 +1,264 @@ +--- +title: Deploy Ollama amd64 to the cluster +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +---

## Overview

An easy way to experiment with arm64 nodes in your K8s cluster is to deploy arm64 nodes and pods alongside your existing amd64 nodes and pods. In this section of the tutorial, you'll bootstrap the cluster with Ollama on amd64, to simulate an "existing" K8s cluster running Ollama.

### Deployment and Service


1. Copy the following YAML, and save it to a file called *namespace.yaml*:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
```

When the above is applied, a new K8s namespace named *ollama* will be created. This is where all the K8s objects created in this tutorial will live.

2. Copy the following YAML, and save it to a file called *amd64_ollama.yaml*:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-amd64-deployment
  labels:
    app: ollama-multiarch
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      arch: amd64
  template:
    metadata:
      labels:
        app: ollama-multiarch
        arch: amd64
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
      - image: ollama/ollama:0.6.1
        name: ollama-multiarch
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
      volumes:
      - emptyDir: {}
        name: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-amd64-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30668
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    arch: amd64
  type: LoadBalancer
```

When the above is applied:

* A new Deployment called *ollama-amd64-deployment* is created.
This deployment pulls the multi-architecture (amd64 and arm64) [Ollama image from Docker Hub](https://hub.docker.com/layers/ollama/ollama/0.6.1/images/sha256-28b909914d4e77c96b1c57dea199c60ec12c5050d08ed764d9c234ba2944be63).

Of particular interest is the *nodeSelector* *kubernetes.io/arch*, with the value of *amd64*. This ensures the deployment only runs on amd64-based nodes, using the amd64 variant of the Ollama container image.

* A new load balancer Service *ollama-amd64-svc* is created, which targets all pods with the *arch: amd64* label (our amd64 deployment creates these pods).

The *sessionAffinity: None* setting is included in this Service to avoid sticky connections to the target pods, so requests aren't pinned to the same pod.

### Apply the amd64 Deployment and Service

1. Run the following commands to apply the namespace, deployment, and service definitions:

```bash
kubectl apply -f namespace.yaml
kubectl apply -f amd64_ollama.yaml
```

You should get the following responses back:

```bash
namespace/ollama created
deployment.apps/ollama-amd64-deployment created
service/ollama-amd64-svc created
```
2. Optionally, set the *default Namespace* to *ollama* so you don't need to specify the namespace each time, by entering the following:

```bash
kubectl config set-context --current --namespace=ollama
```

3. Get the status of the nodes, pods, and services by running the following:

```commandline
kubectl get nodes,pods,svc -nollama
```

Your output should be similar to the following, showing one node, one pod, and one service:

```commandline
NAME                                             STATUS   ROLES    AGE   VERSION
node/gke-ollama-on-arm-amd64-pool-62c0835c-93ht   Ready    <none>   77m   v1.31.6-gke.1020000

NAME                                          READY   STATUS    RESTARTS   AGE
pod/ollama-amd64-deployment-cbfc4b865-msftf   1/1     Running   0          16m

NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
service/ollama-amd64-svc   LoadBalancer   1.2.2.3      1.2.3.4       80:30668/TCP   16m
```

When the pod shows *Running* and the service shows a valid *External IP*, we're ready to test the Ollama amd64 service!

### Test the Ollama on amd64 web service

{{% notice Note %}}
The following script, model_util.sh, is provided as a convenience to accompany this Learning Path. It's simply a shell wrapper for kubectl, utilizing the utilities [curl](https://curl.se/), [jq](https://jqlang.org/), [bc](https://www.gnu.org/software/bc/), and [stdbuf](https://www.gnu.org/software/coreutils/manual/html_node/stdbuf-invocation.html). Make sure you have these shell utilities installed before running it.
{{% /notice %}}


4.
Copy the following shell script, and save it to a file called *model_util.sh*:

```bash
#!/bin/bash

echo

# https://ollama-operator.ayaka.io/pages/en/guide/supported-models
model_name="llama3.2"
#model_name="mistral"
#model_name="dolphin-phi"

#prompt="Name the two closest stars to earth"
prompt="Create a sentence that makes sense in the English language, with as many palindromes in it as possible"

echo "Server response:"

# Look up the external IP (or hostname) of the requested architecture's service
get_service_ip() {
    arch=$1
    svc_name="ollama-${arch}-svc"
    kubectl -nollama get svc $svc_name -o jsonpath="{.status.loadBalancer.ingress[*]['ip', 'hostname']}"
}

# Stream a generate request to the service, then compute tokens per second
# from the eval stats returned in the final JSON line of the response
infer_request() {
    svc_ip=$1
    temp=$(mktemp)
    stdbuf -oL curl -s http://$svc_ip/api/generate -d '{
      "model": "'"$model_name"'",
      "prompt": "'"$prompt"'"
    }' | tee $temp

    duration=$(grep eval_count $temp | jq -r '.eval_duration')
    count=$(grep eval_count $temp | jq -r '.eval_count')

    if [[ -n "$duration" && -n "$count" ]]; then
        quotient=$(echo "scale=2;1000000000*$count/$duration" | bc)
        echo "Tokens per second: $quotient"
    else
        echo "Error: eval_count or eval_duration not found in response."
    fi

    rm $temp
}

# Ask the service to pull the model so it's available for inference
pull_model() {
    svc_ip=$1
    curl http://$svc_ip/api/pull -d '{
      "model": "'"$model_name"'"
    }'
}

# Simple "hello" request to the Ollama root endpoint
hello_request() {
    svc_ip=$1
    curl http://$svc_ip/
}

run_action() {
    arch=$1
    action=$2

    svc_ip=$(get_service_ip $arch)
    echo "Using service endpoint $svc_ip for $action on $arch"

    case $action in
        infer)
            infer_request $svc_ip
            ;;
        pull)
            pull_model $svc_ip
            ;;
        hello)
            hello_request $svc_ip
            ;;
        *)
            echo "Invalid second argument. Use 'infer', 'pull', or 'hello'."
            exit 1
            ;;
    esac
}

case $1 in
    arm64|amd64|multiarch)
        run_action $1 $2
        ;;
    *)
        echo "Invalid first argument. Use 'arm64', 'amd64', or 'multiarch'."
        exit 1
        ;;
esac

echo -e "\n\nPod log output:"
echo;kubectl logs --timestamps -l app=ollama-multiarch -nollama --prefix | sort -k2 | cut -d " " -f 1,2 | tail -1
echo
```

5. Make it executable with the following command:

```bash
chmod 755 model_util.sh
```

This shell script conveniently bundles many test and logging commands into a single place, making it easy to test, troubleshoot, and view the services we expose in this tutorial.

6. Run the following to make an HTTP request to the amd64 Ollama service on port 80:

```commandline
./model_util.sh amd64 hello
```

You should get back the HTTP response, as well as the log line from the pod that served it:

```commandline
Server response:
Using service endpoint 34.55.25.101 for hello on amd64
Ollama is running

Pod log output:

[pod/ollama-amd64-deployment-cbfc4b865-msftf/ollama-multiarch] 2025-03-25T21:13:49.022522588Z
```

Success is defined specifically by seeing the words "Ollama is running". If you see this in your output, then congrats, you've successfully bootstrapped your GKE cluster with an amd64 node, running a Deployment with the Ollama multi-architecture container image!

Next, we'll do the same thing, but with an Arm node. Before moving on, you can optionally confirm exactly where the amd64 pod was scheduled; one quick way to do so is sketched below.
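The following commands (assuming the *ollama* namespace used throughout this Learning Path) list each pod with the node it runs on, and each node with its CPU architecture label:

```bash
# Show which node each pod is scheduled on
kubectl get pods -n ollama -o wide

# Show the CPU architecture label (amd64 or arm64) of each node
kubectl get nodes -L kubernetes.io/arch
```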
diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/2-deploy-arm64.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/2-deploy-arm64.md new file mode 100644 index 0000000000..db7e8d366d --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/2-deploy-arm64.md @@ -0,0 +1,197 @@ +--- +title: Deploy Ollama arm64 to the cluster +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +---

## Overview
At this point you have what many people in their K8s Arm journey start with: a workload running on an amd64 cluster. As mentioned earlier, the easiest way to experiment with Arm in your K8s cluster is to run both architectures simultaneously, not just for the sake of learning how to do it, but also to see first-hand the price/performance advantages of running Arm-based nodes.

Next, you'll add an Arm-based node pool to the cluster, and from there, apply an Ollama arm64 Deployment and Service to mirror what you did in the previous section.

### Adding the arm64-pool node pool

To add Arm nodes to the cluster:

1. From the Clusters menu, select *ollama-on-multiarch*
2. Select *Add node pool*
3. For *Name*, enter *arm64-pool*
4. For *Size*, enter *1*
5. Check *Specify node locations* and select *us-central1-a*

![Configure arm64 node pool](images/arm_node_config-1.png)

6. Select the *Nodes* tab to navigate to the *Configure node settings* screen
7. For *Series*, select *C4A*, and for *Machine Type*, select *c4a-standard-4*

{{% notice Note %}}
To make an apples-to-apples comparison of amd64 and arm64 performance, the c4a-standard-4 is spun up as the arm64 "equivalent" of the previously deployed c4-standard-4 in the amd64 node pool.
{{% /notice %}}

![Configure arm64 node type](images/arm_node_config-2.png)

8. Select *Create*
9. After provisioning completes, select the newly created *arm64-pool* from the *Clusters* screen to take you to the *Node pool details* page.

Note the *NoSchedule* taint that GKE applies by default to the Arm node when *arch=arm64*:

![arm node taint](images/taint_on_arm_node.png)

Without a toleration for this taint, we won't be able to schedule any workloads on it! But do not fear: the nodeSelector in the amd64 (and, as you will shortly see, the arm64) Deployment YAML not only defines which architecture to target, [but in the arm64 use case](https://cloud.google.com/kubernetes-engine/docs/how-to/prepare-arm-workloads-for-deployment#schedule-with-node-selector-arm), it also adds the required toleration automatically.

```yaml
nodeSelector:
  kubernetes.io/arch: arm64 # or amd64
```

### Deployment and Service
We can now apply the arm64-based deployment.

1.
Copy the following YAML, and save it to a file called *arm64_ollama.yaml*:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-arm64-deployment
  labels:
    app: ollama-multiarch
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      arch: arm64
  template:
    metadata:
      labels:
        app: ollama-multiarch
        arch: arm64
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64
      containers:
      - image: ollama/ollama:0.6.1
        name: ollama-multiarch
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
      volumes:
      - emptyDir: {}
        name: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-arm64-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30666
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    arch: arm64
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-multiarch-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30667
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    app: ollama-multiarch
  type: LoadBalancer
```

When the above is applied:

* A new Deployment called *ollama-arm64-deployment* is created. Like the amd64 deployment, it pulls the same multi-architecture (amd64 and arm64) [Ollama image from Docker Hub](https://hub.docker.com/layers/ollama/ollama/0.6.1/images/sha256-28b909914d4e77c96b1c57dea199c60ec12c5050d08ed764d9c234ba2944be63).

Of particular interest is the *nodeSelector* *kubernetes.io/arch*, with the value of *arm64*. This ensures the deployment only runs on arm64-based nodes, using the arm64 variant of the Ollama multi-architecture container image. As mentioned earlier, this *nodeSelector* also triggers the automatic creation of the toleration for the arm64 nodes.

* Two new load balancer Services are created. The first, *ollama-arm64-svc*, is analogous to the existing amd64 service and targets all pods with the *arch: arm64* label (our arm64 deployment creates these pods). The second, *ollama-multiarch-svc*, targets ALL pods with the *app: ollama-multiarch* label, regardless of the architecture they run on. This service shows how you can mix and match pods in production to serve the same app regardless of node/pod architecture.

You may also notice that *sessionAffinity: None* is set on these Services to avoid sticky connections to the target pods, so requests aren't pinned to the same pod.


### Apply the arm64 Deployment and Services

1. Run the following command to apply the arm64 deployment and service definitions:

```bash
kubectl apply -f arm64_ollama.yaml
```

You should get the following responses back:

```bash
deployment.apps/ollama-arm64-deployment created
service/ollama-arm64-svc created
service/ollama-multiarch-svc created
```

2.
Get the status of the nodes, pods, and services by running the following:

```commandline
kubectl get nodes,pods,svc -nollama
```

Your output should be similar to the following, showing two nodes, two pods, and three services:

```commandline
NAME                                             STATUS   ROLES    AGE     VERSION
node/gke-ollama-on-arm-amd64-pool-62c0835c-93ht   Ready    <none>   91m     v1.31.6-gke.1020000
node/gke-ollama-on-arm-arm64-pool-2ae0d1f0-pqrf   Ready    <none>   4m11s   v1.31.6-gke.1020000

NAME                                           READY   STATUS    RESTARTS   AGE
pod/ollama-amd64-deployment-cbfc4b865-msftf    1/1     Running   0          29m
pod/ollama-arm64-deployment-678dc8556f-956d6   1/1     Running   0          2m52s

NAME                           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
service/ollama-amd64-svc       LoadBalancer   1.2.3.4      1.2.3.4       80:30668/TCP   29m
service/ollama-arm64-svc       LoadBalancer   1.2.3.4      1.2.3.4       80:30666/TCP   2m52s
service/ollama-multiarch-svc   LoadBalancer   1.2.3.4      1.2.3.4       80:30667/TCP   2m52s
```

When the pods show *Running* and the services show valid *External IP* values, we're ready to test the Ollama arm64 service!

### Test the Ollama on arm64 web service

To test the service, use the model_util.sh script you created in the last section, replacing the *amd64* parameter with *arm64*:

3. Run the following to make an HTTP request to the arm64 Ollama service on port 80:

```commandline
./model_util.sh arm64 hello
```

You should get back the HTTP response, as well as the log line from the pod that served it:

```commandline
Server response:
Using service endpoint 34.44.135.90 for hello on arm64
Ollama is running

Pod log output:

[pod/ollama-arm64-deployment-678dc8556f-956d6/ollama-multiarch] 2025-03-25T21:25:21.547384356Z
```
Once again, we're looking for "Ollama is running". If you see that, congrats, you've successfully set up your GKE cluster with both amd64 and arm64 nodes and pods, running a Deployment with the Ollama multi-architecture container!

Next, let's do some simple analysis of the cluster's performance. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/3-perf-tests.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/3-perf-tests.md new file mode 100644 index 0000000000..95b0319c47 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/3-perf-tests.md @@ -0,0 +1,115 @@ +--- +title: Testing Functionality and Performance +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +---

## Overview
Now that you have a hybrid arm64/amd64 cluster running Ollama, you can kick the tires a bit and see the advantages of running on arm64 for yourself!

### Using the multiarch service to run the application on any platform
In a real-world scenario, you may wish to access Ollama without regard to architecture, only with regard to availability. As you are now running Ollama on both amd64 and arm64, both architectures' endpoints are available via the multiarch service.
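If you'd like to confirm this, one quick check (a sketch, assuming the service and namespace names used earlier) is to list the endpoints behind *ollama-multiarch-svc*; you should see two backing pod IPs, one per architecture:

```bash
# List the pod endpoints that back the multiarch service
kubectl get endpoints ollama-multiarch-svc -n ollama
```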
To send a request to either based on availability, try the following command:

```commandline
./model_util.sh multiarch hello
```

You should see a server response, and the pod that handled the request, prefixed with the deployment, node, pod, and timestamp:

```commandline
Server response:
Ollama is running

Pod log output:
[pod/ollama-amd64-deployment-cbfc4b865-rf4p9/ollama-multiarch] 06:25:48
```

Press the up arrow and Enter a few times, or write a loop; due to the *sessionAffinity: None* setting on the service, it should distribute the load to both architectures based on availability. You can see exactly which pod was hit by whether amd64 or arm64 appears in the Pod log output:

```commandline
[pod/ollama-amd64-... # amd64 pod was hit
[pod/ollama-arm64-... # arm64 pod was hit
```

Now you've seen both architectures responding to a "hello world" ping. But what about real-world price/performance stats? Load a model into the Ollama pods and see for yourself!

### Load the llama3.2 model into pods

{{% notice Note %}}
The llama3.2 model is used in this demonstration. As [Ollama supports many different models](https://ollama-operator.ayaka.io/pages/en/guide/supported-models), you are invited to modify the model_util.sh script to add or replace llama3.2 with the model(s) of your choice.
{{% /notice %}}

Ollama will host and run models, but you first need to load the model (llama3.2 in this case) before performing inference against it. To do this, run both commands shown below:

```commandline
# for amd64
./model_util.sh amd64 pull

# for arm64
./model_util.sh arm64 pull
```
If the output ends with `{"status":"success"}` for each command, the model was pulled successfully.

### Seeing is believing!

Once the models have loaded into both pods, you can perform inference regardless of node architecture (multiarch), or individually by architecture type (amd64 or arm64).

By default, the prompt hardcoded into the model_util.sh script is *Create a sentence that makes sense in the English language, with as many palindromes in it as possible*, but feel free to edit that if you wish.

To see the inference performance on the amd64 pod first, run:

```bash
./model_util.sh amd64 infer
```
This should output something similar to:

```commandline
...
1023,13],"total_duration":15341522988,"load_duration":16209080,"prompt_eval_count":32,"prompt_eval_duration":164000000,"eval_count":93,"eval_duration":15159000000}
Tokens per second: 6.13

Pod log output:
[pod/ollama-arm64-deployment-678dc8556f-mj7gm/ollama-multiarch] 06:29:14
```

Objectively, you can see the tokens-per-second rate measured at 6.13 in this example output (your actual value may vary a bit).

Next, run the same inference, but on the arm64 node, with the following command:

```bash
./model_util.sh arm64 infer
```

Visually, you should see the output stream much faster on arm64 than on amd64. To be more objective, look at the output stats to verify it was indeed faster on arm64 than on amd64:

```commandline
4202,72,426,13],"total_duration":13259950101,"load_duration":1257990283,"prompt_eval_count":32,"prompt_eval_duration":1431000000,"eval_count":153,"eval_duration":10570000000}
Tokens per second: 14.47

Pod log output:
[pod/ollama-arm64-deployment-678dc8556f-mj7gm/ollama-multiarch] 06:46:35
```
This shows more than a 2X performance increase for arm64 over amd64!
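A single run can be a noisy measurement. If you'd like a slightly more robust comparison, a small loop (a sketch that simply reuses the model_util.sh script above; adjust the run count as you see fit) collects the tokens-per-second line from several runs on each architecture:

```bash
# Run five inferences per architecture and print the tokens-per-second result from each
for arch in amd64 arm64; do
  echo "=== $arch ==="
  for i in 1 2 3 4 5; do
    ./model_util.sh $arch infer | grep "Tokens per second"
  done
done
```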
### Notes on Evaluating Price/Performance

We chose GKE amd64-based C4 and arm64-based C4A instances so we could compare apples to apples. Although they are advertised similarly for memory and vCPU performance, arm64 instance pricing is generally lower than that of comparable amd64 instances. If you're interested in learning more, browse your cloud provider's VM instance pricing pages for more information on the price/performance benefits of Arm CPUs for your workloads.

### Conclusion

In this Learning Path, you learned how to:

1. Bring up a GKE cluster with amd64 and arm64 nodes.
2. Use the same multi-architecture container image for both amd64 and arm64 Ollama deployments.
3. See how inference on arm64 is faster than on amd64.

What's next? You can:

* Evaluate price/performance by following the steps of this Learning Path on your own workloads.
* Decide whether a full migration to Arm nodes makes sense, or whether to continue in a hybrid configuration.
* Phase in Arm-specific DevOps tool/operations support.
* Shut the test cluster down if you aren't using it, to save money on unused cloud resources.
diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/_index.md b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/_index.md new file mode 100644 index 0000000000..f4e7e69832 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/_index.md @@ -0,0 +1,87 @@ +--- +title: Run Ollama on both arm64 and amd64 nodes, using the same multi-architecture container image on GKE.

minutes_to_complete: 30

who_is_this_for: This Learning Path shows you how easy it is to migrate from a homogeneous amd64 K8s cluster to a hybrid (arm64 and amd64) cluster with multi-architecture container images on GKE. Demonstrated with the Ollama application, you'll see for yourself the price/performance advantages of running on arm64. Although the tutorial is GKE-specific and uses Ollama, the provided YAML can act as a template for deploying any multi-architecture application on any cloud.


learning_objectives:
  - Spin up a GKE cluster with amd64 and arm64 nodes.
  - Apply Ollama amd64-based and arm64-based Deployments and Services using the same container image.
  - Ping, pull models, and make inferences to experience each architecture's performance first-hand.
  - Experiment further on your own by researching which existing and future workloads could benefit most from single- or multi-architecture clusters.

prerequisites:
  - A [Google Cloud account](https://console.cloud.google.com/).
  - A computer with [Google Cloud CLI](/install-guides/gcloud) and [kubectl](/install-guides/kubectl/) installed.
+ - The [GKE Cloud Plugin](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#gcloud) + + +author: + - Geremy Cohen + +### Tags +skilllevels: Introductory + +subjects: Containers and Virtualization +cloud_service_providers: Google Cloud + + +armips: + - Neoverse + +operatingsystems: + - Linux + - MacOs + +tools_software_languages: + - LLM + - ollama + - GenAI + +further_reading: + - resource: + title: ollama - Get up and running with large language models + link: https://ollama.com/ + type: documentation + - resource: + title: ollama API calls + link: https://github.com/ollama/ollama/blob/main/docs/api.md + type: documentation + - resource: + title: Dockerhub for Ollama + link: https://hub.docker.com/r/ollama/ollama + type: documentation + - resource: + title: ollama build docs + link: https://github.com/ollama/ollama/blob/main/docs/development.md + type: documentation + - resource: + title: Getting started with Llama + link: https://llama.meta.com/get-started + type: documentation + - resource: + title: Prepare to deploy an Arm workload in a Standard cluster + link: https://cloud.google.com/kubernetes-engine/docs/how-to/prepare-arm-workloads-for-deployment + type: documentation + - resource: + title: Create an External Load Balancer + link: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/ + type: documentation + - resource: + title: Install kubectl and configure cluster access on GKE + link: https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl + type: documentation + + + + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-1.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-1.png new file mode 100644 index 0000000000..e276018809 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-2.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-2.png new file mode 100644 index 0000000000..f49480b31e Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/arm_node_config-2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/cluster_basics.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/cluster_basics.png new file mode 100644 index 0000000000..cd28021707 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/cluster_basics.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/configure-x86-note-type.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/configure-x86-note-type.png new file mode 100644 index 0000000000..ff9e5ce4c6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/configure-x86-note-type.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/general_flow.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/general_flow.png new file mode 100644 index 0000000000..a341eda42d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/general_flow.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/select_standard.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/select_standard.png new file mode 100644 index 0000000000..b0c7f6328b Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/select_standard.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/taint_on_arm_node.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/taint_on_arm_node.png new file mode 100644 index 0000000000..4673c874ef Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/taint_on_arm_node.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/x86-node-pool.png b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/x86-node-pool.png new file mode 100644 index 0000000000..b7b6218f17 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/multiarch_ollama_on_gke/images/x86-node-pool.png differ