v0.7.4
Changelog
Features
- 4defc33: feat(ci): add CNCF AI conformance validations to inference workflow (#162) (@dims)
- 06bccdf: feat(ci): add ClamAV malware scanning GitHub Action (#171) (@dims)
- 9da2501: feat(ci): add DRA GPU allocation test to H100 smoke test (#153) (@dims)
- e55f9a2: feat(ci): add HPA pod autoscaling validation to inference workflow (#163) (@dims)
- 0c435fb: feat(ci): add OSS community automation workflows (@mchmarny)
- 1f39bce: feat(ci): collect AI conformance evidence in H100 smoke test (#147) (@dims)
- 60023fd: feat(skyhook): temporarily remove skyhook tuning due to bugs (#154) (@ayuskauskas)
- 4ddf3b8: feat: add CNCF AI Conformance evidence collection (#158) (@yuanchen8911)
- 9a96d23: feat: add CUJ2 inference demo chat UI and update CUJ2 instructions (#151) (@yuanchen8911)
- 0c53d8e: feat: add DRA and gang scheduling test manifests for CNCF AI conformance (#150) (@yuanchen8911)
- f04d3e5: feat: add GPU training CI workflow with gang scheduling test (#155) (@dims)
- 0463e2d: feat: add expected-resources deployment check for validating Kubernetes resources exist (#149) (@xdu31)
- 2a922bc: feat: add support for workload-gate and workload-selector (#166) (@ayuskauskas)
- f176162: feat: add two-phase expected resource auto-discovery to validator (#164) (@xdu31)
Tasks
- 69c37d0: chore: improve consistency across GPU CI workflows (#160) (@dims)
- f9f1ec0: chore: update cuj1 (@mchmarny)
- 4962b9b: chore: update demos (@mchmarny)
- 84bf48c: chore: update demos (@mchmarny)
- 1701080: chore: update e2e demo (@mchmarny)
- 2a0f22c: chore: update e2e demo (@mchmarny)
- 0fa18e1: chore: update e2e demo (@mchmarny)
- 84d975e: chore: update e2e demo (@mchmarny)
- 54eceaa: chore: update s3c demo (@mchmarny)
Others
- 592e640: fix(ci): add pull_request trigger to vuln-scan workflow (@mchmarny)
- ca16886: fix(ci): break long lines in welcome workflow to pass yamllint (#148) (@dims)
- bcd26bd: fix(ci): combine path and size label workflows to prevent race condition (#161) (@yuanchen8911)
- 177e92e: fix(ci): harden workflows and improve CI/CD hygiene (@mchmarny)
- a40f754: fix(ci): lower vuln scan threshold to MEDIUM and add container image scanning (#172) (@dims)
- 60a2adc: fix(ci): re-enable CDI for H100 kind smoke test (#143) (@dims)
- 02e7c1c: fix(ci): run attestation and vuln scan concurrently in release workflow (#173) (@dims)
- 49f1333: fix(ci): use PR number in KWOK concurrency group (@mchmarny)
- 5be5a93: fix(ci): use pull_request_target for write-permission workflows (@mchmarny)
- f38d7b2: fix(docs): update bundle commands with correct tolerations in CUJ demos (#176) (@yuanchen8911)
- f93e618: fix: add kube-prometheus-stack as gpu-operator dependency (#170) (@yuanchen8911)
- 490aa0f: fix: add markdown rendering to chat UI and update CUJ2 documentation (#159) (@yuanchen8911)
- 20c3e4a: fix: enable DCGM exporter ServiceMonitor for Prometheus scraping (#157) (@yuanchen8911)
- ade7ff7: fix: move DRA controller nodeAffinity override to EKS overlay (#174) (@yuanchen8911)
- fa58fba: fix: remove admission.cdi from kai-scheduler values (#146) (@yuanchen8911)
- cbcc3e9: fix: remove nodeSelector from EBS CSI node DaemonSet scheduling (#175) (@yuanchen8911)
- f9cfa47: fix: remove trailing quote from skyhook no-op package version (#177) (@yuanchen8911)
- 4e15a0e: fix: skip --wait for KAI scheduler in deploy script (#169) (@yuanchen8911)
- 9a3cd3f: fix: update inference stack versions and enable Grove for dynamo workloads (#145) (@yuanchen8911)
- 03b5483: refactor: move examples/demos to project root demos directory (@mchmarny)
- ed4973b: refactor: move kai-scheduler and DRA driver to base overlay for CNCF AI conformance (#139) (@yuanchen8911)
- f15c6b3: refactor: rename PreDeployment to Readiness across codebase and docs (#156) (@xdu31)
- ea8f626: rename: eidos → aicr (AI Cluster Runtime) (@mchmarny)