Merged
5 changes: 5 additions & 0 deletions .github/workflows/build_tests.yaml
@@ -131,6 +131,11 @@ jobs:
    uses: ./.github/workflows/reusable_lint_and_format.yml
    with:
      run-id: '${{needs.set-variables.outputs.run-id}}'
  verify-goldens:
    needs: [install-dependencies, set-variables]
    uses: ./.github/workflows/reusable_goldens.yaml
    with:
      run-id: '${{needs.set-variables.outputs.run-id}}'
  run-unit-tests:
    needs: [install-dependencies, set-variables]
    uses: ./.github/workflows/reusable_unit_tests.yaml
48 changes: 48 additions & 0 deletions .github/workflows/reusable_goldens.yaml
@@ -0,0 +1,48 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License

on:
  workflow_call:
    inputs:
      run-id:
        required: true
        type: string

permissions:
  contents: read

jobs:
  verify-goldens:
    runs-on: [ubuntu-22.04]
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Prepare directories
      run: mkdir -p ~/.cache/pip
    - name: Restore cached dependencies
      uses: actions/cache@v4
      with:
        path: |
          /usr/local/bin/kubectl-kueue
          /usr/local/bin/kubectl-kjob
          ~/.cache/pip
          ${{env.pythonLocation}}
        key: xpk-deps-3.10-${{github.run_id}}-${{github.run_attempt}}
        restore-keys: xpk-deps-3.10-
    - name: Verify goldens
      run: ./golden_buddy.sh verify goldens.yaml goldens
      env:
        UPDATE_GOLDEN_COMMAND: make goldens
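
Reviewer note: the `Verify goldens` step presumably regenerates each command's dry-run output and compares it with the stored files under `goldens/`. The implementation of `golden_buddy.sh` is not part of this diff, so the following is only a minimal Python sketch of that kind of comparison. `verify_golden` is a hypothetical helper introduced for illustration and does not reflect the script's actual logic.

```python
import difflib
from pathlib import Path


def verify_golden(golden_path: Path, actual_output: str) -> list[str]:
    """Return a unified diff between a stored golden file and fresh output.

    Hypothetical helper: an empty list means the golden is up to date.
    """
    expected = golden_path.read_text().splitlines(keepends=True)
    actual = actual_output.splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            expected, actual, fromfile=str(golden_path), tofile="actual"
        )
    )
```

A CI step built on such a helper would fail when the returned list is non-empty and would point the author at the update command (here, the value of `UPDATE_GOLDEN_COMMAND`, `make goldens`).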
97 changes: 81 additions & 16 deletions goldens/NAP_cluster-create.txt
@@ -29,19 +29,84 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] CoreDNS has successfully started and passed verification.
[XPK] CoreDNS deployment 'coredns' found in namespace 'kube-system'.
[XPK] Skipping CoreDNS deployment since it already exists.
[XPK] Working on golden-project and us-central1-a
[XPK] Try 1: get-credentials to cluster golden-cluster
[XPK] Task: `get-credentials to cluster golden-cluster` is implemented by the following command not running since it is a dry run.
gcloud container clusters get-credentials golden-cluster --region=us-central1 --project=golden-project && kubectl config view && kubectl config set-context --current --namespace=default
[XPK] Couldn't translate project id: golden-project to project number. Error: 403 Permission 'resourcemanager.projects.get' denied on resource '//cloudresourcemanager.googleapis.com/projects/golden-project' (or it may not exist). [reason: "IAM_PERMISSION_DENIED"
domain: "cloudresourcemanager.googleapis.com"
metadata {
key: "resource"
value: "projects/golden-project"
}
metadata {
key: "permission"
value: "resourcemanager.projects.get"
}
]
[XPK] XPK failed, error code 1
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --region us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of tpu7x-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=1, device_type='tpu7x-8')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --region=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of tpu7x-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=1, device_type='tpu7x-8')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --region=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
[XPK] Existing node pool names ['0']
[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --region=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --enable-gvnic --node-version=0 --num-nodes=1 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --placement-type=COMPACT --max-pods-per-node 15 --tpu-topology=2x2x1
[XPK] Breaking up a total of 1 commands into 1 batches
[XPK] Pretending all the jobs succeeded
[XPK] Create or delete node pool request complete.
[XPK] Enabling Autoprovisioning
[XPK] Default Chips quota is minimum: 0, maximum: 4.
[XPK] Chips quota is minimum: 0, maximum: 4. XPK will autoprovision 4 chips based on incoming workload requests, keeping at least 0 available at all times, and maximum of 4. If the difference (4 chips) is small, rescaling will not work well.
[XPK] Task: `Update cluster with autoprovisioning enabled` is implemented by the following command not running since it is a dry run.
gcloud container clusters update golden-cluster --project=golden-project --region=us-central1 --enable-autoprovisioning --autoprovisioning-config-file 6062bfee91f21efca86f2c3261129f06b1896ad9b68d2ecdba9589bea9e15ddf
[XPK] Task: `Update cluster with autoscaling-profile` is implemented by the following command not running since it is a dry run.
gcloud container clusters update golden-cluster --project=golden-project --region=us-central1 --autoscaling-profile=optimize-utilization
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --region=us-central1 --format="csv[no-heading](name)"
[XPK] Breaking up a total of 0 commands into 0 batches
[XPK] Pretending all the jobs succeeded
[XPK] Creating ConfigMap for cluster
[XPK] Breaking up a total of 2 commands into 1 batches
[XPK] Pretending all the jobs succeeded
[XPK] Enabling the jobset API on our cluster, to be deprecated when Jobset is globally available
[XPK] Try 1: Install Jobset on golden-cluster
[XPK] Task: `Install Jobset on golden-cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
kubectl get node --no-headers | wc -l
[XPK] Try 1: Updating jobset Controller Manager resources
[XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
[XPK] Try 1: Install PathwaysJob on golden-cluster
[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.2/install.yaml
[XPK] Enabling Kueue on the cluster
[XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
kubectl kueue version
[XPK] Try 1: Set Kueue On Cluster
[XPK] Task: `Set Kueue On Cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.12.2/manifests.yaml
[XPK] Wait for Kueue to be fully available
[XPK] Task: `Wait for Kueue to be available` is implemented by the following command not running since it is a dry run.
kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=10m
[XPK] Install Kueue Custom Resources
[XPK] Try 1: Applying Kueue Custom Resources
[XPK] Task: `Applying Kueue Custom Resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f b3843453fb19ae7105126245bac5b63930f46861462cd3a557aea44801a99280
[XPK] Update Kueue Controller Manager resources
[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
kubectl get node --no-headers | wc -l
[XPK] Try 1: Updating Kueue Controller Manager resources
[XPK] Task: `Updating Kueue Controller Manager resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f 012e1b15b6941e9d47cb2cdb35488d57c2f3ce0ef0b18093d2759f2e02ed81dc
[XPK] Verifying kjob installation
[XPK] Task: `Verify kjob installation ` is implemented by the following command not running since it is a dry run.
kubectl-kjob help
[XPK] kjob found
[XPK] Applying kjob CDRs
[XPK] Task: `Create kjob CRDs on cluster` is implemented by the following command not running since it is a dry run.
kubectl kjob printcrds | kubectl apply --server-side -f -
[XPK] Creating kjob CRDs succeeded
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
[XPK] Task: `Creating JobTemplate` is implemented by the following command not running since it is a dry run.
kubectl apply -f 4abb796ed6e7c9d7256a51f13124efd989fc12ee83839bed432fcf7d64f68e61
[XPK] Task: `Creating PodTemplate` is implemented by the following command not running since it is a dry run.
kubectl apply -f a63aa3c4593c38ad90671fd8b067d1886f6313ad558379b364b51791aa50f4e8
[XPK] Task: `Creating AppProfile` is implemented by the following command not running since it is a dry run.
kubectl apply -f 1d13ddebae3c90a05ba26b312df088982dd0df0edc4f4013b88384e476c20486
[XPK] GKE commands done! Resources are created.
[XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/golden-cluster/details?project=golden-project
[XPK] Exiting XPK cleanly
98 changes: 82 additions & 16 deletions goldens/NAP_cluster-create_with_pathways.txt
@@ -29,19 +29,85 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
[XPK] CoreDNS has successfully started and passed verification.
[XPK] CoreDNS deployment 'coredns' found in namespace 'kube-system'.
[XPK] Skipping CoreDNS deployment since it already exists.
[XPK] Working on golden-project and us-central1-a
[XPK] Try 1: get-credentials to cluster golden-cluster
[XPK] Task: `get-credentials to cluster golden-cluster` is implemented by the following command not running since it is a dry run.
gcloud container clusters get-credentials golden-cluster --region=us-central1 --project=golden-project && kubectl config view && kubectl config set-context --current --namespace=default
[XPK] Couldn't translate project id: golden-project to project number. Error: 403 Permission 'resourcemanager.projects.get' denied on resource '//cloudresourcemanager.googleapis.com/projects/golden-project' (or it may not exist). [reason: "IAM_PERMISSION_DENIED"
domain: "cloudresourcemanager.googleapis.com"
metadata {
key: "resource"
value: "projects/golden-project"
}
metadata {
key: "permission"
value: "resourcemanager.projects.get"
}
]
[XPK] XPK failed, error code 1
[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
gcloud beta container clusters describe golden-cluster --region us-central1 --project golden-project --format="value(currentMasterVersion)"
[XPK] Creating 1 node pool or pools of tpu7x-8
We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=1, device_type='tpu7x-8')
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --region=us-central1 --format="csv[no-heading](name)"
[XPK] Creating 1 node pool or pools of tpu7x-8
Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=1, device_type='tpu7x-8')
[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --region=us-central1 --format="value(locations)"
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
[XPK] Existing node pool names ['0']
[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --region=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --enable-gvnic --node-version=0 --num-nodes=1 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --placement-type=COMPACT --max-pods-per-node 15 --tpu-topology=2x2x1
[XPK] To complete NodepoolCreate-cpu-np we are executing gcloud beta container node-pools create cpu-np --node-version=0 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --region=us-central1 --num-nodes=1 --machine-type=n2-standard-64 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --enable-autoscaling --min-nodes=1 --max-nodes=20
[XPK] Breaking up a total of 2 commands into 1 batches
[XPK] Pretending all the jobs succeeded
[XPK] Create or delete node pool request complete.
[XPK] Enabling Autoprovisioning
[XPK] Default Chips quota is minimum: 0, maximum: 4.
[XPK] Chips quota is minimum: 0, maximum: 4. XPK will autoprovision 4 chips based on incoming workload requests, keeping at least 0 available at all times, and maximum of 4. If the difference (4 chips) is small, rescaling will not work well.
[XPK] Task: `Update cluster with autoprovisioning enabled` is implemented by the following command not running since it is a dry run.
gcloud container clusters update golden-cluster --project=golden-project --region=us-central1 --enable-autoprovisioning --autoprovisioning-config-file 6062bfee91f21efca86f2c3261129f06b1896ad9b68d2ecdba9589bea9e15ddf
[XPK] Task: `Update cluster with autoscaling-profile` is implemented by the following command not running since it is a dry run.
gcloud container clusters update golden-cluster --project=golden-project --region=us-central1 --autoscaling-profile=optimize-utilization
[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --region=us-central1 --format="csv[no-heading](name)"
[XPK] Breaking up a total of 0 commands into 0 batches
[XPK] Pretending all the jobs succeeded
[XPK] Creating ConfigMap for cluster
[XPK] Breaking up a total of 2 commands into 1 batches
[XPK] Pretending all the jobs succeeded
[XPK] Enabling the jobset API on our cluster, to be deprecated when Jobset is globally available
[XPK] Try 1: Install Jobset on golden-cluster
[XPK] Task: `Install Jobset on golden-cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
kubectl get node --no-headers | wc -l
[XPK] Try 1: Updating jobset Controller Manager resources
[XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
[XPK] Try 1: Install PathwaysJob on golden-cluster
[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.2/install.yaml
[XPK] Enabling Kueue on the cluster
[XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
kubectl kueue version
[XPK] Try 1: Set Kueue On Cluster
[XPK] Task: `Set Kueue On Cluster` is implemented by the following command not running since it is a dry run.
kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.12.2/manifests.yaml
[XPK] Wait for Kueue to be fully available
[XPK] Task: `Wait for Kueue to be available` is implemented by the following command not running since it is a dry run.
kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=10m
[XPK] Install Kueue Custom Resources
[XPK] Try 1: Applying Kueue Custom Resources
[XPK] Task: `Applying Kueue Custom Resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f 898c7686cc5ef7f026f74e55b73b5843767e3f1abb9639169f02ebc44d06af73
[XPK] Update Kueue Controller Manager resources
[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
kubectl get node --no-headers | wc -l
[XPK] Try 1: Updating Kueue Controller Manager resources
[XPK] Task: `Updating Kueue Controller Manager resources` is implemented by the following command not running since it is a dry run.
kubectl apply -f 012e1b15b6941e9d47cb2cdb35488d57c2f3ce0ef0b18093d2759f2e02ed81dc
[XPK] Verifying kjob installation
[XPK] Task: `Verify kjob installation ` is implemented by the following command not running since it is a dry run.
kubectl-kjob help
[XPK] kjob found
[XPK] Applying kjob CDRs
[XPK] Task: `Create kjob CRDs on cluster` is implemented by the following command not running since it is a dry run.
kubectl kjob printcrds | kubectl apply --server-side -f -
[XPK] Creating kjob CRDs succeeded
[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
[XPK] Task: `Creating JobTemplate` is implemented by the following command not running since it is a dry run.
kubectl apply -f 4abb796ed6e7c9d7256a51f13124efd989fc12ee83839bed432fcf7d64f68e61
[XPK] Task: `Creating PodTemplate` is implemented by the following command not running since it is a dry run.
kubectl apply -f a63aa3c4593c38ad90671fd8b067d1886f6313ad558379b364b51791aa50f4e8
[XPK] Task: `Creating AppProfile` is implemented by the following command not running since it is a dry run.
kubectl apply -f 1d13ddebae3c90a05ba26b312df088982dd0df0edc4f4013b88384e476c20486
[XPK] GKE commands done! Resources are created.
[XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/golden-cluster/details?project=golden-project
[XPK] Exiting XPK cleanly