v1.92.0
What's Changed
Key New Features 🎉
- feat: add ML Diagnostics module and integration for GKE TPU blueprints by @AdarshK15 in #5350
- NAP support on GKE Clusters (gke-cluster module) by @SwarnaBharathiMantena in #5420
- feat: optional infra setup for inference gateway by @jessicaochen in #5453
- feat(slurm): support compact placement with DWS Flex-Start for H4D, A3Ultra and A4 by @parulbajaj01 in #5579
Breaking Changes 🚨
- Transitioning to Slurm Native Auth with resilient workbench keys distribution by @arpit974 in #5695
- Default to sauth for newer deployments in h4d and a3mega-gcsfuse blueprints by @arpit974 in #5707
New Modules 🧱
- Add new dns-managed-zone module by @arpit974 in #5485
- Add new global static IP module by @arpit974 in #5559
- Add new module for Kubernetes namespace by @arpit974 in #5562
- Add new iap-policy module by @arpit974 in #5564
- Add new Cloud Run module by @arpit974 in #5567
- Add new Redis module by @arpit974 in #5569
- Add new kubernetes-secret module by @arpit974 in #5572
- Add new workload_identity_binding module by @arpit974 in #5574
- Add new scripting module gke-backend-fetcher under community folder by @arpit974 in #5593
- Add new helm-upgrade module under community folder by @arpit974 in #5595
- Add new spanner-migrations runner module under community folder by @arpit974 in #5597
Module Improvements 🔨
- Adding native K8s annotations and GKE cluster enhancements by @arpit974 in #5610
- Default Kueue config for Pathways by @scaliby in #5628
Improvements 🛠
- [Telemetry] Get blueprint even from deployment directory by @kadupoornima in #5656
- [Telemetry] Capture exit code upon fatal command failures by @kadupoornima in #5658
- (gke) Remove additional network settings from A3U blueprint by @agrawalkhushi18 in #5652
- (gke) Remove additional networks from A4 and A4X family blueprints by @agrawalkhushi18 in #5682
- (gke) Remove additional network settings from TPU v6e, 7x and g4 by @agrawalkhushi18 in #5692
- [Telemetry] Add support to merge vars from deployment files and CLI --vars by @kadupoornima in #5694
- [Telemetry] Add support for collection of CPU machines and Default machines when unset in module by @kadupoornima in #5696
- Make Managed Lustre the default in A3u and A3m series Slurm blueprints by @saara-tyagi27 in #5396
- [Telemetry] Add a retry mechanism to get the GCP Project information to eliminate transient issues by @kadupoornima in #5702
- [Telemetry] Add an atomic flag to ensure the telemetry event is not called repeatedly by @kadupoornima in #5705
- Pin DCGM to version 4.5.3 by @shubpal07 in #5721
- feat(gke): expose monitoring components as a parameter by @cboneti in #5722
- feat(job submission): Dynamic topology routing for gke jobs by @Neelabh94 in #5664
Deprecations 💤
- Remove hpc-slurm-static blueprint by @kadupoornima in #5672
Version Updates ⏫
- Fix A3 HighGPU test by pinning GKE version to 1.33 to resolve COS incompatibility by @kadupoornima in #5673
- Update minimum required Packer version to 1.15.3 by @AdarshK15 in #5701
Bug fixes 🐞
- fix: Add tpu_topology conditional logic for TPU flex start by @agrawalkhushi18 in #5655
- fix: Update the vpc module output name for additional network by @agrawalkhushi18 in #5690
- fix(slurm): correct vNUMA socket and SMT thread calculations in util.py by @kadupoornima in #5683
- Multi NIC support & cluster ID fix for Slurm controller by @rahimkhan19 in #5563
- [Telemetry] Collect the correct exit code when user intentionally stops deployment (0 instead of 1) by @kadupoornima in #5704
- Clean up custom spot VM variables during standard fallback by @rahimkhan19 in #5697
- fix: accelerator label auto resolution by @Neelabh94 in #5717
Full Changelog: v1.91.0...v1.92.0