-
Notifications
You must be signed in to change notification settings - Fork 63
A4X support #643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
A4X support #643
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
3144f4b
feat: add a4x support
scaliby 56236ac
feat: add goldens for a4x
scaliby 35d03e9
Merge branch 'develop' into scaliby/a4x-support
scaliby 691a55a
Merge branch 'develop' into scaliby/a4x-support
scaliby 66d84c6
Merge branch 'develop' into scaliby/a4x-support
scaliby 91ecea4
Merge branch 'scaliby/a4x-support' of github.com:AI-Hypercomputer/xpk…
scaliby b353196
Merge branch 'develop' into scaliby/a4x-support
scaliby c994f6b
feat: updates after develop changes
scaliby 7973b44
Merge branch 'develop' into scaliby/a4x-support
scaliby 49009a2
feat: update goldens
scaliby File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| $ python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --device-type=gb200-4 --reservation=golden-reservation --dry-run | ||
| [XPK] Starting xpk | ||
| [XPK] Starting cluster create for cluster golden-cluster: | ||
| [XPK] Working on golden-project and us-central1-a | ||
| [XPK] Task: `Determine server supported GKE versions for default rapid gke version` is implemented by the following command not running since it is a dry run. | ||
| gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)" | ||
| [XPK] Task: `Determine server supported GKE versions for valid versions` is implemented by the following command not running since it is a dry run. | ||
| gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.validVersions)" | ||
| [XPK] Task: `Find if Cluster Exists` is implemented by the following command not running since it is a dry run. | ||
| gcloud container clusters list --project=golden-project --region=us-central1 --format="csv[no-heading](name)" | ||
| [XPK] Task: `GKE Cluster Create` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta container clusters create golden-cluster --project=golden-project --region=us-central1 --node-locations=us-central1-a --cluster-version=0 --machine-type=e2-standard-16 --enable-autoscaling --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6 --enable-dns-access --autoscaling-profile=optimize-utilization --enable-dataplane-v2 --enable-multi-networking --no-enable-autoupgrade --enable-ip-alias | ||
| [XPK] Task: `Check if Private Nodes is enabled in cluster.` is implemented by the following command not running since it is a dry run. | ||
| gcloud container clusters describe golden-cluster --project=golden-project --region=us-central1 --format="value(privateClusterConfig.enablePrivateNodes)" | ||
| [XPK] Private Nodes is not enabled on the cluster. | ||
| [XPK] Cluster is public and no need to authorize networks. | ||
| [XPK] Try 1: get-credentials to cluster golden-cluster | ||
| [XPK] Task: `get-credentials to cluster golden-cluster` is implemented by the following command not running since it is a dry run. | ||
| gcloud container clusters get-credentials golden-cluster --region=us-central1 --project=golden-project && kubectl config view && kubectl config set-context --current --namespace=default | ||
| [XPK] Task: 'Checking CoreDNS deployment existence' in progress for namespace: kube-system | ||
| [XPK] Task: `Check CoreDNS deployment in kube-system` is implemented by the following command not running since it is a dry run. | ||
| kubectl get deployment coredns -n kube-system | ||
| [XPK] Now verifying CoreDNS readiness... | ||
| [XPK] Task: `Waiting for kubeDNS to be checked.` is implemented by the following command not running since it is a dry run. | ||
| kubectl get deployment kube-dns -n kube-system --ignore-not-found | ||
| [XPK] kube-dns deployment not found. | ||
| [XPK] Verifying if CoreDNS is available... | ||
| [XPK] Task: `Wait for coredns available` is implemented by the following command not running since it is a dry run. | ||
| kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-system --timeout=240s | ||
| [XPK] CoreDNS has successfully started and passed verification. | ||
| [XPK] CoreDNS deployment 'coredns' found in namespace 'kube-system'. | ||
| [XPK] Skipping CoreDNS deployment since it already exists. | ||
| [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta container clusters describe golden-cluster --region us-central1 --project golden-project --format="value(currentMasterVersion)" | ||
| [XPK] Creating 1 node pool or pools of gb200-4 | ||
| We assume that the underlying system is: SystemCharacteristics(topology='1x72', vms_per_slice=1, gke_accelerator='nvidia-gb200', gce_machine_type='a4x-highgpu-4g', chips_per_vm=4, accelerator_type=2, device_type='gb200-4') | ||
| [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --region=us-central1 --format="csv[no-heading](name)" | ||
| [XPK] Task: `Describe reservation` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta compute reservations describe golden-reservation --project=golden-project --zone=us-central1-a | ||
| [XPK] Creating 1 node pool with 2 nodes of gb200-4 | ||
| Underlyingly, we assume that means: SystemCharacteristics(topology='1x72', vms_per_slice=1, gke_accelerator='nvidia-gb200', gce_machine_type='a4x-highgpu-4g', chips_per_vm=4, accelerator_type=2, device_type='gb200-4') | ||
| [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --region=us-central1 --format="value(locations)" | ||
| [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run. | ||
| kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true | ||
| [XPK] Existing node pool names ['0'] | ||
| [XPK] Task: `Retrieve resource policy` is implemented by the following command not running since it is a dry run. | ||
| gcloud compute resource-policies describe golden-cluster-placement-policy --project=golden-project --region=us-central1 | ||
| [XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --region=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=a4x-highgpu-4g --host-maintenance-interval=AS_NEEDED --reservation-affinity=specific --reservation=golden-reservation --placement-policy=golden-cluster-placement-policy --enable-gvnic --num-nodes=2 --accelerator type=nvidia-gb200,count=4,gpu-driver-version=latest --no-enable-autoupgrade --scopes="https://www.googleapis.com/auth/cloud-platform" | ||
| [XPK] Breaking up a total of 1 commands into 1 batches | ||
| [XPK] Pretending all the jobs succeeded | ||
| [XPK] Create or delete node pool request complete. | ||
| [XPK] Creating ConfigMap for cluster | ||
| [XPK] Task: `Describe reservation` is implemented by the following command not running since it is a dry run. | ||
| gcloud beta compute reservations describe golden-reservation --project=golden-project --zone=us-central1-a | ||
| [XPK] Breaking up a total of 2 commands into 1 batches | ||
| [XPK] Pretending all the jobs succeeded | ||
| [XPK] Enabling the jobset API on our cluster, to be deprecated when Jobset is globally available | ||
| [XPK] Try 1: Install Jobset on golden-cluster | ||
| [XPK] Task: `Install Jobset on golden-cluster` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml | ||
| [XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run. | ||
| kubectl get node --no-headers | wc -l | ||
| [XPK] Try 1: Updating jobset Controller Manager resources | ||
| [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95 | ||
| [XPK] Try 1: Install PathwaysJob on golden-cluster | ||
| [XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.2/install.yaml | ||
| [XPK] Enabling Kueue on the cluster | ||
| [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run. | ||
| kubectl kueue version | ||
| [XPK] Try 1: Set Kueue On Cluster | ||
| [XPK] Task: `Set Kueue On Cluster` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.12.2/manifests.yaml | ||
| [XPK] Wait for Kueue to be fully available | ||
| [XPK] Task: `Wait for Kueue to be available` is implemented by the following command not running since it is a dry run. | ||
| kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=10m | ||
| [XPK] Install Kueue Custom Resources | ||
| [XPK] Try 1: Applying Kueue Custom Resources | ||
| [XPK] Task: `Applying Kueue Custom Resources` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f 7aee1635a549cbab3308e64e5f973f49f1b09f0ea7c3633a60b69828be981fc5 | ||
| [XPK] Update Kueue Controller Manager resources | ||
| [XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run. | ||
| kubectl get node --no-headers | wc -l | ||
| [XPK] Try 1: Updating Kueue Controller Manager resources | ||
| [XPK] Task: `Updating Kueue Controller Manager resources` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f 012e1b15b6941e9d47cb2cdb35488d57c2f3ce0ef0b18093d2759f2e02ed81dc | ||
| [XPK] Verifying kjob installation | ||
| [XPK] Task: `Verify kjob installation ` is implemented by the following command not running since it is a dry run. | ||
| kubectl-kjob help | ||
| [XPK] kjob found | ||
| [XPK] Applying kjob CDRs | ||
| [XPK] Task: `Create kjob CRDs on cluster` is implemented by the following command not running since it is a dry run. | ||
| kubectl kjob printcrds | kubectl apply --server-side -f - | ||
| [XPK] Creating kjob CRDs succeeded | ||
| [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run. | ||
| kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true | ||
| [XPK] Task: `Creating JobTemplate` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f 4abb796ed6e7c9d7256a51f13124efd989fc12ee83839bed432fcf7d64f68e61 | ||
| [XPK] Task: `Creating PodTemplate` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f a63aa3c4593c38ad90671fd8b067d1886f6313ad558379b364b51791aa50f4e8 | ||
| [XPK] Task: `Creating AppProfile` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f 1d13ddebae3c90a05ba26b312df088982dd0df0edc4f4013b88384e476c20486 | ||
| [XPK] Installing NCCL Plugin for cluster | ||
| [XPK] Task: `Install NCCL Plugin On Cluster` is implemented by the following command not running since it is a dry run. | ||
| kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-tcpxo-installer.yaml | ||
| [XPK] GKE commands done! Resources are created. | ||
| [XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/golden-cluster/details?project=golden-project | ||
| [XPK] Exiting XPK cleanly |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| """ | ||
| Copyright 2025 Google LLC | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| https://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| """ | ||
|
|
||
| from functools import reduce | ||
| from operator import mul | ||
|
|
||
|
|
||
| def is_topology_valid(topology: str) -> bool: | ||
| try: | ||
| parse_topology(topology) | ||
| return True | ||
| except ValueError: | ||
| return False | ||
|
|
||
|
|
||
| def get_topology_product(topology: str) -> int: | ||
| return reduce(mul, parse_topology(topology), 1) | ||
|
|
||
|
|
||
| def parse_topology(topology: str) -> list[int]: | ||
| if len(topology) <= 0: | ||
| raise ValueError("Topology is an empty string") | ||
|
|
||
| return [int(el) for el in topology.lower().split("x")] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| """ | ||
| Copyright 2025 Google LLC | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| https://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| """ | ||
|
|
||
| import pytest | ||
| from .topology import is_topology_valid, get_topology_product, parse_topology | ||
|
|
||
|
|
||
| def test_is_topology_valid_with_invalid_topology(): | ||
| result = is_topology_valid("N/A") | ||
| assert result is False | ||
|
|
||
|
|
||
| def test_is_topology_valid_with_valid_topology(): | ||
| result = is_topology_valid("1x1x1") | ||
| assert result is True | ||
|
|
||
|
|
||
| def test_parse_topology_with_valid_topology(): | ||
| result = parse_topology("1x2x3") | ||
| assert result == [1, 2, 3] | ||
|
|
||
|
|
||
| def test_parse_topology_with_empty_input(): | ||
| with pytest.raises(ValueError): | ||
| parse_topology("") | ||
|
|
||
|
|
||
| def test_get_topology_product(): | ||
| result = get_topology_product("1x2x3") | ||
| assert result == 6 |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.