Vlin/ACR 1000 nodes image pull test #1059

Merged
vlin-ms merged 2 commits into main from vlin/acr-1000n-perftest
Feb 18, 2026

Conversation

@vlin-ms vlin-ms commented Feb 12, 2026

Summary
Adds a 1000-node ACR image pull benchmark that measures concurrent image pull throughput at scale against the ACR dogfood environment, using anonymous pull. Also adds support for custom pod memory requests in the CRI module, which prevents pod scheduling failures on nodes with a low max_pods setting.

Changes

  1. New scenario: image-pull-n1000
    • Pipeline: New pipeline targeting australiaeast with 1000 user nodes, anonymous pull from acrperftestaue.azurecr-test.io
    • Terraform: 1004-node cluster (3 default + 1 Standard_D64_v3 Prometheus + 1000 Standard_D4ds_v5 user nodes)
  2. Custom memory request override (memory_request_override)
    • Problem: When max_pods is low, the auto-calculated memory request per pod becomes too large (allocatable memory ÷ few pods), causing Insufficient memory scheduling failures:
      FailedScheduling: 0/1004 nodes are available: 1000 Insufficient memory
    • Solution: Added memory_request_override parameter to cri.py and execute.yml, allowing explicit control over pod memory requests instead of relying on auto-calculation
  3. Shared topology: parameterized desired_nodes
    • Changed desired_nodes from type: number to type: string in validate.yml to support runtime matrix variables
    • Updated image-pull topology to use $(desired_nodes) from pipeline matrix, enabling n10 (14 nodes) and n1000 (1004 nodes) to share the same topology
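
To make item 2 concrete, here is a minimal Python sketch of the failure mode and the override. It assumes the auto-calculation is a plain division of allocatable memory by max_pods, as the Problem bullet describes; the function names and the Ki-based units are hypothetical and are not taken from cri.py.

```python
from typing import Optional


def auto_memory_request_ki(allocatable_ki: int, max_pods: int) -> int:
    """Hypothetical auto-calculation: split allocatable memory evenly per pod."""
    if max_pods <= 0:
        raise ValueError("max_pods must be positive")
    return allocatable_ki // max_pods


def resolve_memory_request(
    allocatable_ki: int, max_pods: int, override: Optional[str] = None
) -> str:
    """Return the per-pod memory request as a Kubernetes quantity string.

    A non-empty override wins; otherwise fall back to the auto-calculated value.
    """
    if override:
        return override
    return f"{auto_memory_request_ki(allocatable_ki, max_pods)}Ki"
```

With 16,000,000 Ki allocatable and max_pods of 4, the automatic request is 4,000,000 Ki per pod, which can exceed what the scheduler finds free on a node. An override such as 100Mi is passed through unchanged.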

Validation
Validated on a 1000-node cluster pulling ~5 GB and ~10 GB images.
Pipelines - Run 20260212.7

Copilot AI left a comment

Pull request overview

Adds a new large-scale ACR image pull benchmark (1000-node) and updates the ClusterLoader2 CRI module/topology plumbing to better support large clusters by allowing explicit per-pod memory request overrides and runtime-parameterized node validation.

Changes:

  • Parameterize image-pull topology resource validation to use a runtime desired_nodes value (and extend validation timeout).
  • Add memory_request_override support end-to-end (pipeline env → execute step → CRI override logic) to avoid scheduling failures when max_pods is low.
  • Introduce a new image-pull-n1000 perf-eval scenario (Terraform inputs + test inputs + README).
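
The env-to-CLI plumbing in the second bullet could look roughly like the sketch below. Only the MEMORY_REQUEST_OVERRIDE variable and the --memory_request_override flag come from this PR; the helper names, the empty-string default, and the rule that an empty variable means "no override" are assumptions made for illustration.

```python
import argparse
from typing import List, Mapping


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag discussed in this PR."""
    parser = argparse.ArgumentParser(description="CRI override sketch")
    parser.add_argument(
        "--memory_request_override",
        type=str,
        default="",
        help="Explicit pod memory request, e.g. 100Mi; empty keeps auto-calculation",
    )
    return parser


def cli_args_from_env(env: Mapping[str, str]) -> List[str]:
    """Translate the pipeline env var into CLI arguments (assumed behavior)."""
    value = env.get("MEMORY_REQUEST_OVERRIDE", "").strip()
    # Omit the flag entirely when the variable is unset or blank.
    return ["--memory_request_override", value] if value else []


if __name__ == "__main__":
    args = build_parser().parse_args(
        cli_args_from_env({"MEMORY_REQUEST_OVERRIDE": "100Mi"})
    )
    print(args.memory_request_override)
```

Leaving the flag out when the variable is blank keeps existing scenarios on the auto-calculated path without any change to their pipelines.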

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| steps/topology/image-pull/validate-resources.yml | Switch validation to use runtime $(desired_nodes) and increase validation timeout. |
| steps/engine/clusterloader2/large-cluster/validate.yml | Change desired_nodes parameter type to string to support runtime substitution. |
| steps/engine/clusterloader2/cri/execute.yml | Plumb MEMORY_REQUEST_OVERRIDE env var into CRI override CLI invocation. |
| modules/python/clusterloader2/cri/cri.py | Add --memory_request_override flag and implement override parsing/behavior in override generation. |
| scenarios/perf-eval/image-pull-n1000/terraform-test-inputs/azure.json | Add terraform test input for the new scenario. |
| scenarios/perf-eval/image-pull-n1000/terraform-inputs/azure.tfvars | Add Terraform configuration for the 1000-node image pull cluster. |
| scenarios/perf-eval/image-pull-n1000/README.md | Document the new image-pull-n1000 scenario. |
| pipelines/system/new-pipeline-test.yml | Minor formatting/line adjustment in pipeline template example. |
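
For the desired_nodes change, the sketch below illustrates why a string-typed value still needs checking once it reaches a script. It assumes the value arrives as text after the pipeline expands $(desired_nodes), and that the n10 scenario uses the same 3 default + 1 Prometheus fixed nodes as n1000, which matches the stated totals of 14 and 1004. Both function names are hypothetical.

```python
def expected_node_count(
    user_nodes: int, default_nodes: int = 3, prometheus_nodes: int = 1
) -> int:
    """Total cluster size: fixed system nodes plus the scenario's user nodes."""
    return default_nodes + prometheus_nodes + user_nodes


def parse_desired_nodes(raw: str) -> int:
    """Convert the string-typed desired_nodes value into a positive integer."""
    value = raw.strip()
    # An undefined pipeline variable is left as the literal macro text.
    if value.startswith("$("):
        raise ValueError(f"unexpanded pipeline variable: {value}")
    count = int(value)
    if count <= 0:
        raise ValueError("desired_nodes must be positive")
    return count
```

Here expected_node_count(1000) gives 1004 and expected_node_count(10) gives 14, so a single topology can validate either scenario from the matrix value alone.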

1000n acr image pull

1000n acr image pull

1000n acr image pull

1000n acr image pull

1000n test

1000n test

fix desired node

clean up test changes

format and update test

fix format

fix format

Revert new-pipeline-test.yml to match main
@vlin-ms vlin-ms force-pushed the vlin/acr-1000n-perftest branch from 55ffd48 to 5854625 Compare February 18, 2026 03:10
@vlin-ms vlin-ms requested a review from liyu-ma as a code owner February 18, 2026 03:10
vlin-ms commented Feb 18, 2026

@vlin-ms please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"

@vlin-ms vlin-ms closed this Feb 18, 2026
@vlin-ms vlin-ms reopened this Feb 18, 2026
@vlin-ms vlin-ms merged commit 010e58b into main Feb 18, 2026
11 checks passed
@vlin-ms vlin-ms deleted the vlin/acr-1000n-perftest branch February 18, 2026 06:01
3 participants