
fix(e2e): set PrincipalType on role assignment to avoid AAD replication race #8251

Merged
ganeshkumarashok merged 1 commit into main from fix/gpu-e2e-role-assignment-principal-type
Apr 8, 2026


@ganeshkumarashok
Contributor

Summary

  • Adds PrincipalType: ServicePrincipal to the Storage Blob Data Contributor role assignment in assignRolesToVMIdentity
  • Fixes intermittent PrincipalNotFound errors in GPU E2E tests (e.g. Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG) caused by AAD replication delay when a managed identity is created and immediately assigned a role
  • Consistent with assignACRPullToIdentity in cluster.go which already sets this field for the same managed identity

Root Cause

When CreateVMManagedIdentity creates a user-assigned managed identity and immediately calls assignRolesToVMIdentity, ARM tries to look up the principal in AAD. If the AAD replica handling the request hasn't replicated the new principal yet, it returns PrincipalNotFound. Setting PrincipalType tells ARM to skip this lookup.

Reference: https://learn.microsoft.com/en-us/azure/role-based-access-control/role-assignments-rest#new-service-principal
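
To make the effect of the field concrete, here is a minimal sketch of the request body that the role assignment REST API described in the reference above accepts. It is not the code from this PR: the struct types and the buildRoleAssignmentBody helper are hypothetical names introduced for illustration, and the IDs are placeholders. Only the JSON property names (roleDefinitionId, principalId, principalType) come from the REST documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roleAssignmentProperties mirrors the "properties" object of the role
// assignment REST request body. The Go type name is hypothetical; the JSON
// property names follow the REST documentation linked above.
type roleAssignmentProperties struct {
	RoleDefinitionID string `json:"roleDefinitionId"`
	PrincipalID      string `json:"principalId"`
	// PrincipalType is left out of the JSON when empty, which models the
	// request as it looked before this fix.
	PrincipalType string `json:"principalType,omitempty"`
}

// roleAssignmentRequest is the top-level request body (hypothetical type name).
type roleAssignmentRequest struct {
	Properties roleAssignmentProperties `json:"properties"`
}

// buildRoleAssignmentBody is a hypothetical helper that serializes the body.
// Passing "ServicePrincipal" as principalType corresponds to the change
// described in this PR.
func buildRoleAssignmentBody(roleDefinitionID, principalID, principalType string) ([]byte, error) {
	req := roleAssignmentRequest{
		Properties: roleAssignmentProperties{
			RoleDefinitionID: roleDefinitionID,
			PrincipalID:      principalID,
			PrincipalType:    principalType,
		},
	}
	return json.Marshal(req)
}

func main() {
	// Placeholder IDs for illustration only; these are not real resources.
	body, err := buildRoleAssignmentBody(
		"/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/roleDefinitions/11111111-1111-1111-1111-111111111111",
		"22222222-2222-2222-2222-222222222222",
		"ServicePrincipal",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With principalType present, the request declares up front what kind of principal the ID refers to, which is what the description above relies on to avoid the lookup against a replica that may not have the new identity yet.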

Test plan

  • GPU E2E pipeline passes (Agentbaker GPU E2E)

…on race

When creating a managed identity and immediately assigning a role,
the role assignment can fail with PrincipalNotFound due to AAD
replication delay. Setting PrincipalType to ServicePrincipal tells
ARM to skip the AAD principal lookup, avoiding the race condition.

This is consistent with assignACRPullToIdentity in cluster.go which
already sets PrincipalType for the same managed identity.
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Mitigates intermittent Azure RBAC PrincipalNotFound failures in GPU E2E by explicitly setting the principal type during role assignment, avoiding AAD replication lookup races for newly created managed identities.

Changes:

  • Sets PrincipalType: ServicePrincipal on the “Storage Blob Data Contributor” role assignment in assignRolesToVMIdentity.

@ganeshkumarashok ganeshkumarashok merged commit 9811447 into main Apr 8, 2026
27 of 33 checks passed
@ganeshkumarashok ganeshkumarashok deleted the fix/gpu-e2e-role-assignment-principal-type branch April 8, 2026 23:27
3 participants