Merged
57 commits
2e2745e
Remove args from run_command_for_value (#647)
scaliby Sep 23, 2025
ebaec5d
Golden buddy improvements (#645)
scaliby Sep 24, 2025
9f0d561
A4X support (#643)
scaliby Sep 24, 2025
d99f40f
Remove integration tests from feature branches (#641)
scaliby Sep 24, 2025
c9a99cb
feat: remove args from run_command_with_updates_retry (#652)
scaliby Sep 24, 2025
548e3de
build: remove build_tests from post submit (#655)
scaliby Sep 24, 2025
8464ce2
Merge main to develop (#657)
FIoannides Sep 25, 2025
b1637d4
Set gke_product_type:xpk cluster label (#659)
SikaGrr Sep 25, 2025
f0b70a6
feat: remove dry_run arg from run_commands (#660)
scaliby Sep 26, 2025
ad5d667
feat: remove args from run_command_with_full_controls (#661)
scaliby Sep 26, 2025
ad402fc
feat: check if yq is installed in golden buddy (#662)
scaliby Sep 26, 2025
504bf65
docs: update supported platforms (#663)
scaliby Sep 26, 2025
6e5ef00
Pass autoprovisioning_args to Pathways workload yaml (#664)
wstcliyu Sep 30, 2025
4e2543e
feat: Add credential test with DNS retry logic (#588)
DannyLiCom Oct 2, 2025
d6cc688
Feat: Add --skip-validation flag to bypass system dependency checks (…
RexBearIU Oct 2, 2025
2e3c566
Add Kueue_manager and use it to provision Kueue resources (#638)
FIoannides Oct 3, 2025
2c455a5
Use map NodeSelector instead of string for Pathways workload (#669)
wstcliyu Oct 6, 2025
a33aa55
refactor: Use named arguments in UserFacingNameToSystemCharacteristic…
jamOne- Oct 7, 2025
6965318
Tpu7x workload policy (#671)
FIoannides Oct 7, 2025
7e8413b
Upload nodepool creation error log (#674)
FIoannides Oct 7, 2025
968cf32
Stop using GPP if WP is used (#676)
FIoannides Oct 8, 2025
b29f717
fix: Don't check flex in is_TAS_possible (#675)
jamOne- Oct 9, 2025
d186567
Fix the call to Kueue manager to pass if the cluster is pathways (#680)
FIoannides Oct 10, 2025
d822dcf
fix: remove unused code (#679)
scaliby Oct 10, 2025
accd90a
Use shorter names in nightly tests (#681)
FIoannides Oct 10, 2025
6537525
Update PathwaysJob version to v0.1.3 (#683)
wstcliyu Oct 13, 2025
bad6640
Fix plain XPK command execution (#684)
scaliby Oct 14, 2025
c410444
No kueue admission checks for FSNQ (#686)
scaliby Oct 16, 2025
7aa8f6d
Allow use of zonal clusters by resolving actual cluster region (#682)
scaliby Oct 16, 2025
80e45be
feat: Add --sub-slicing flag to cluster create (#689)
jamOne- Oct 17, 2025
5b24a6b
Enable granular dependency validation (#688)
scaliby Oct 17, 2025
f6a9e27
Move deps validation into commands (#690)
scaliby Oct 20, 2025
4e0aad5
Workload commands granular dependencies validation (#693)
scaliby Oct 20, 2025
2f4a26b
Cluster commands granular dependencies validation (#694)
scaliby Oct 20, 2025
3a6bda6
feat: Add sub_slicing_support to SystemCharacteristics (#695)
jamOne- Oct 20, 2025
d91fe52
Storage commands granular dependencies validation (#703)
scaliby Oct 20, 2025
59a9cbc
Shell commands granular dependencies validation (#702)
scaliby Oct 20, 2025
9f3df11
Run commands granular dependencies validation (#701)
scaliby Oct 20, 2025
e3c3757
Job commands granular dependencies validation (#699)
scaliby Oct 20, 2025
201f4d8
Inspector commands granular dependencies validation (#698)
scaliby Oct 20, 2025
62a5b07
Info commands granular dependencies validation (#697)
scaliby Oct 20, 2025
3bc229a
Batch commands granular dependencies validation (#696)
scaliby Oct 20, 2025
d0b5dd9
Gcluster migration to Kueue manager (#692)
FIoannides Oct 20, 2025
ebf259e
Kind commands granular dependencies validation (#700)
scaliby Oct 20, 2025
da2a079
Install kueue resources if version is the same or lower (#706)
FIoannides Oct 21, 2025
d696d91
Remove need for mocking feature flags (#707)
scaliby Oct 21, 2025
9d1c9c1
Split workload command parser into separate functions (#708)
scaliby Oct 21, 2025
af3f031
Sublicing topology workload create (#709)
scaliby Oct 21, 2025
554bde4
Kind kueue (#704)
FIoannides Oct 21, 2025
556d912
feat: Bump up the default Kueue version to v0.14.1 (#691)
jamOne- Oct 21, 2025
097958e
Fix admission_checks test case in kueue_manager_test (#714)
jamOne- Oct 22, 2025
26abe68
style: Explicitly set AcceleratorCharacteristics fields (#716)
jamOne- Oct 22, 2025
2e7bc43
Use get_topology_product instead of calculating topology product (#713)
scaliby Oct 22, 2025
b1ccea4
Subslicing workload annotations (#712)
scaliby Oct 22, 2025
f6513ee
refactor: use helper functions in kueue_manager_test (#715)
jamOne- Oct 22, 2025
6cba442
Release v0.14.0
FIoannides Oct 22, 2025
93b5266
Merge branch 'main' into release-0.14
FIoannides Oct 22, 2025
120 changes: 0 additions & 120 deletions .github/workflows/build_tests.yaml
@@ -24,8 +24,6 @@ on:
type: choice
options:
- v4-8
push:
branches: ["main","develop"]
pull_request: # By default this runs for types assigned, opened and synchronize.

jobs:
@@ -35,13 +33,6 @@ jobs:
group: set-variables-${{ github.event.number}}
cancel-in-progress: true
outputs:
cluster-name: ${{ steps.set-cluster-name.outputs.cluster-name }}
cluster-name-dws: ${{ steps.set-cluster-name-dws.outputs.cluster-name-dws }}
group-name: ${{ steps.set-group-name.outputs.group-name }}
zone: ${{ steps.set-zone.outputs.zone }}
tpu-type: ${{ steps.set-tpu-type.outputs.tpu-type }}
tpu-type-topology: ${{ steps.set-tpu-type-topology.outputs.tpu-type-topology }}
location: ${{steps.set-location.outputs.location}}
run-id: ${{steps.set-run-id.outputs.run-id}}
steps:
- name: set run-id
@@ -57,34 +48,6 @@ jobs:
RUN_ID="${{ github.event.number }}"
fi
echo run-id=$RUN_ID >> $GITHUB_OUTPUT
- name: set cluster-name
id: set-cluster-name
run: |
echo cluster-name=build-xpk-2-nodepools-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set cluster-name-dws
id: set-cluster-name-dws
run: |
echo cluster-name-dws=build-xpk-2-nodepools-dws-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set group-name
id: set-group-name
run: |
echo group-name=xpk-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set zone
id: set-zone
run: |
echo zone=us-central2-b >> $GITHUB_OUTPUT
- name: set tpu-type
id: set-tpu-type
run: |
echo tpu-type=v4-8 >> $GITHUB_OUTPUT
- name: set tpu-type-topology
id: set-tpu-type-topology
run: |
echo tpu-type-topology=v4-2x2x1 >> $GITHUB_OUTPUT
- name: set location
id: set-location
run: |
echo location=us-central2 >> $GITHUB_OUTPUT
install-dependencies:
needs: [set-variables]
runs-on: ubuntu-22.04
@@ -144,86 +107,3 @@ jobs:
concurrency: # We support one build or nightly test to run at a time currently.
group: unit-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
run-integration-tests:
needs: [install-dependencies, set-variables]
uses: ./.github/workflows/reusable_integration_tests.yaml
with:
run-id: '${{needs.set-variables.outputs.run-id}}'
concurrency: # We support one build or nightly test to run at a time currently.
group: integration-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
secrets: inherit
cluster-private:
needs: [linter, run-unit-tests, run-integration-tests, set-variables]
uses: ./.github/workflows/reusable_cluster_private.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: cluster-private-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
run-id: '${{needs.set-variables.outputs.run-id}}'
cluster-name: '${{needs.set-variables.outputs.cluster-name}}'
tpu-type: '${{needs.set-variables.outputs.tpu-type-topology || inputs.tpu-type}}'
zone: '${{needs.set-variables.outputs.zone}}'
location: '${{needs.set-variables.outputs.location}}'
secrets: inherit
cluster-create:
needs: [linter, run-unit-tests, run-integration-tests, set-variables]
concurrency: # We support one build or nightly test to run at a time currently.
group: cluster-create-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
uses: ./.github/workflows/reusable_cluster_create.yaml
with:
cluster-name-dws: '${{needs.set-variables.outputs.cluster-name-dws}}'
cluster-name: '${{needs.set-variables.outputs.cluster-name}}'
tpu-type: '${{needs.set-variables.outputs.tpu-type-topology || inputs.tpu-type}}'
zone: '${{needs.set-variables.outputs.zone}}'
location: '${{needs.set-variables.outputs.location}}'
run-id: '${{needs.set-variables.outputs.run-id}}'
secrets: inherit
workloads-tests:
needs: [cluster-create, set-variables]
uses: ./.github/workflows/reusable_workload_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: workload-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
cluster-name-dws: '${{needs.set-variables.outputs.cluster-name-dws}}'
tpu-type: ${{needs.set-variables.outputs.tpu-type}}
tpu-type-topology: ${{needs.set-variables.outputs.tpu-type-topology}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: '${{needs.set-variables.outputs.run-id}}'
secrets: inherit
batch-tests:
needs: [cluster-create, set-variables]
uses: ./.github/workflows/reusable_batch_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: batch-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: ${{needs.set-variables.outputs.run-id}}
secrets: inherit
storage-tests:
needs: [cluster-create, set-variables, batch-tests, workloads-tests]
uses: ./.github/workflows/reusable_storage_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: storage-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
tpu-type: ${{needs.set-variables.outputs.tpu-type}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: ${{needs.set-variables.outputs.run-id}}
secrets: inherit
cluster-delete:
if: always()
needs: [set-variables, storage-tests]
uses: ./.github/workflows/reusable_cluster_delete.yaml
with:
cluster-name-dws: ${{needs.set-variables.outputs.cluster-name-dws}}
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
run-id: ${{needs.set-variables.outputs.run-id}}
zone: ${{needs.set-variables.outputs.zone}}
secrets: inherit
214 changes: 211 additions & 3 deletions .github/workflows/nightly_tests.yaml
@@ -15,12 +15,12 @@
name: Nightly Tests

on:
push:
branches: ["develop"]
workflow_dispatch:
schedule: # Schedule the job run at 12AM PST daily.
- cron: '0 8 * * *'

permissions:
contents: read

env:
CLUSTER_NETWORK_ARGUMENTS: "--network=${{secrets.NETWORK_NAME}} --subnetwork=${{secrets.SUBNETWORK_NAME}}"
@@ -58,7 +58,7 @@ jobs:
gcloud config get compute/zone
- name: Install xpk dependencies
run: |
make install
make install
echo $PWD/bin >> "$GITHUB_PATH"
- name: Check xpk installation
run: xpk --help
@@ -151,6 +151,24 @@ jobs:
- name: Delete the cluster created
if: always()
run: python xpk.py cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b --force
- name: Upload cluster nodepool creation log
if: always()
uses: actions/upload-artifact@v4
with:
name: cluster-nodepool-log-${{github.run_id}}
path: /tmp/NodepoolCreate-$TPU_CLUSTER_NAME-np-*
- name: Upload empty cluster nodepool creation log
if: always()
uses: actions/upload-artifact@v4
with:
name: empty-cluster-nodepool-log-${{github.run_id}}
path: /tmp/NodepoolCreate-$EMPTY_CLUSTER_NAME-np-*
- name: Upload private cluster nodepool creation log
if: always()
uses: actions/upload-artifact@v4
with:
name: empty-private-cluster-nodepool-log-${{github.run_id}}
path: /tmp/NodepoolCreate-$PRIVATE_CLUSTER_NAME-np-*

pw-cluster-and-workload:
runs-on: [ubuntu-22.04]
@@ -195,6 +213,12 @@ jobs:
- name: Delete the Pathways cluster created
if: always()
run: python xpk.py cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b --force
- name: Upload nodepool creation log
if: always()
uses: actions/upload-artifact@v4
with:
name: pw-nodepool-log-${{github.run_id}}
path: /tmp/NodepoolCreate-$TPU_CLUSTER_NAME-np-*

rc-cluster:
runs-on: [ubuntu-22.04]
@@ -230,3 +254,187 @@ jobs:
- name: Delete the RayCluster-enabled XPK cluster
if: always()
run: python xpk.py cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b
- name: Upload nodepool creation log
if: always()
uses: actions/upload-artifact@v4
with:
name: rc-nodepool-log-${{github.run_id}}
path: /tmp/NodepoolCreate-$TPU_CLUSTER_NAME-np-*

set-variables:
runs-on: [ubuntu-22.04]
concurrency:
group: set-variables-${{ github.event.number}}
cancel-in-progress: true
outputs:
cluster-name: ${{ steps.set-cluster-name.outputs.cluster-name }}
cluster-name-dws: ${{ steps.set-cluster-name-dws.outputs.cluster-name-dws }}
group-name: ${{ steps.set-group-name.outputs.group-name }}
zone: ${{ steps.set-zone.outputs.zone }}
tpu-type: ${{ steps.set-tpu-type.outputs.tpu-type }}
tpu-type-topology: ${{ steps.set-tpu-type-topology.outputs.tpu-type-topology }}
location: ${{steps.set-location.outputs.location}}
run-id: ${{steps.set-run-id.outputs.run-id}}
steps:
- name: set run-id
id: set-run-id
run: |
if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then
RUN_ID="dispatch"
elif [ "${{ github.ref }}" == "refs/heads/main" ]; then
RUN_ID="main"
elif [ "${{ github.ref }}" == "refs/heads/develop" ]; then
RUN_ID="develop"
else
RUN_ID="${{ github.event.number }}"
fi
echo run-id=$RUN_ID >> $GITHUB_OUTPUT
- name: set cluster-name
id: set-cluster-name
run: |
echo cluster-name=xpk-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set cluster-name-dws
id: set-cluster-name-dws
run: |
echo cluster-name-dws=xpk-dws-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set group-name
id: set-group-name
run: |
echo group-name=xpk-${{steps.set-run-id.outputs.run-id}} >> $GITHUB_OUTPUT
- name: set zone
id: set-zone
run: |
echo zone=us-central2-b >> $GITHUB_OUTPUT
- name: set tpu-type
id: set-tpu-type
run: |
echo tpu-type=v4-8 >> $GITHUB_OUTPUT
- name: set tpu-type-topology
id: set-tpu-type-topology
run: |
echo tpu-type-topology=v4-2x2x1 >> $GITHUB_OUTPUT
- name: set location
id: set-location
run: |
echo location=us-central2 >> $GITHUB_OUTPUT
install-dependencies:
needs: [set-variables]
runs-on: ubuntu-22.04
strategy:
matrix:
python-version: ["3.10", "3.11"]
steps:
- uses: actions/checkout@v4
- uses: google-github-actions/setup-gcloud@v2
with:
version: '>= 363.0.0'
install_components: 'beta, gke-gcloud-auth-plugin'
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Check if cache exists
id: check-cache
uses: actions/cache@v3
with:
path: |
usr/local/bin/
~/.cache/pip
${{env.pythonLocation}}
key: xpk-deps-${{ matrix.python-version }}-${{github.run_id}}-${{github.run_attempt}}
lookup-only: true
- name: install dependencies
if : steps.check-cache.outputs.cache-hit != 'true'
run: make install-dev && cp ./bin/kubectl-kueue /usr/local/bin/kubectl-kueue && cp ./bin/kubectl-kjob /usr/local/bin/kubectl-kjob
- name: Cache dependencies
if : steps.check-cache.outputs.cache-hit != 'true'
uses: actions/cache/save@v3
with:
path: |
/usr/local/bin/kubectl-kueue
/usr/local/bin/kubectl-kjob
~/.cache/pip
${{env.pythonLocation}}
key: xpk-deps-${{ matrix.python-version }}-${{github.run_id}}-${{github.run_attempt}}
run-integration-tests:
needs: [install-dependencies, set-variables]
uses: ./.github/workflows/reusable_integration_tests.yaml
with:
run-id: '${{needs.set-variables.outputs.run-id}}'
concurrency: # We support one build or nightly test to run at a time currently.
group: integration-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
secrets: inherit
cluster-private:
needs: [run-integration-tests, set-variables]
uses: ./.github/workflows/reusable_cluster_private.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: cluster-private-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
run-id: '${{needs.set-variables.outputs.run-id}}'
cluster-name: '${{needs.set-variables.outputs.cluster-name}}'
tpu-type: '${{needs.set-variables.outputs.tpu-type-topology || inputs.tpu-type}}'
zone: '${{needs.set-variables.outputs.zone}}'
location: '${{needs.set-variables.outputs.location}}'
secrets: inherit
cluster-create:
needs: [run-integration-tests, set-variables]
concurrency: # We support one build or nightly test to run at a time currently.
group: cluster-create-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
uses: ./.github/workflows/reusable_cluster_create.yaml
with:
cluster-name-dws: '${{needs.set-variables.outputs.cluster-name-dws}}'
cluster-name: '${{needs.set-variables.outputs.cluster-name}}'
tpu-type: '${{needs.set-variables.outputs.tpu-type-topology || inputs.tpu-type}}'
zone: '${{needs.set-variables.outputs.zone}}'
location: '${{needs.set-variables.outputs.location}}'
run-id: '${{needs.set-variables.outputs.run-id}}'
secrets: inherit
workloads-tests:
needs: [cluster-create, set-variables]
uses: ./.github/workflows/reusable_workload_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: workload-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
cluster-name-dws: '${{needs.set-variables.outputs.cluster-name-dws}}'
tpu-type: ${{needs.set-variables.outputs.tpu-type}}
tpu-type-topology: ${{needs.set-variables.outputs.tpu-type-topology}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: '${{needs.set-variables.outputs.run-id}}'
secrets: inherit
batch-tests:
needs: [cluster-create, set-variables]
uses: ./.github/workflows/reusable_batch_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: batch-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: ${{needs.set-variables.outputs.run-id}}
secrets: inherit
storage-tests:
needs: [cluster-create, set-variables, batch-tests, workloads-tests]
uses: ./.github/workflows/reusable_storage_tests.yaml
concurrency: # We support one build or nightly test to run at a time currently.
group: storage-tests-${{needs.set-variables.outputs.run-id}}
cancel-in-progress: true
with:
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
tpu-type: ${{needs.set-variables.outputs.tpu-type}}
zone: ${{needs.set-variables.outputs.zone}}
run-id: ${{needs.set-variables.outputs.run-id}}
secrets: inherit
cluster-delete:
if: always()
needs: [set-variables, storage-tests]
uses: ./.github/workflows/reusable_cluster_delete.yaml
with:
cluster-name-dws: ${{needs.set-variables.outputs.cluster-name-dws}}
cluster-name: ${{needs.set-variables.outputs.cluster-name}}
run-id: ${{needs.set-variables.outputs.run-id}}
zone: ${{needs.set-variables.outputs.zone}}
secrets: inherit
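
Reviewer note: the `set-variables` job moved into `nightly_tests.yaml` derives `run-id` from the trigger and then builds the cluster names from it. The sketch below mirrors that branching as a plain shell function so it can be checked locally. `derive_run_id` is a hypothetical helper name (the workflow inlines this logic in a `run:` step), and its three arguments stand in for `github.event_name`, `github.ref` and `github.event.number`.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the if/elif chain in the "set run-id" step.
# derive_run_id is a hypothetical name; it is not part of the workflow.
derive_run_id() {
  local event_name="$1" ref="$2" pr_number="$3"
  if [ "$event_name" = "workflow_dispatch" ]; then
    echo "dispatch"
  elif [ "$ref" = "refs/heads/main" ]; then
    echo "main"
  elif [ "$ref" = "refs/heads/develop" ]; then
    echo "develop"
  else
    # Falls back to the pull request number, which is empty for non-PR events.
    echo "$pr_number"
  fi
}

# Example: a push to develop, the new push trigger in the nightly workflow.
run_id="$(derive_run_id "push" "refs/heads/develop" "")"
echo "run-id=${run_id}"
echo "cluster-name=xpk-${run_id}"
echo "cluster-name-dws=xpk-dws-${run_id}"
```

For a scheduled run `github.event.number` is not set, so the resulting `run-id` depends on which branch `github.ref` resolves to. It may be worth confirming that a scheduled run cannot end up with an empty `run-id` and therefore a cluster name of just `xpk-`.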