monitor: reduce memory usage by stripping unused fields from cached d…#4591

Merged
hlipsig merged 1 commit into master from fix/monitor-cache-memory-optimization
Feb 6, 2026

@shubhadapaithankar (Collaborator) commented Feb 5, 2026

Which issue this PR addresses:

Fixes #itn-2026-00027 - Investigate probable memory leak in RP code as shown by RP VMSS behaviour (CentralUS and other regions, pattern going back to December 2025)

What this PR does / why we need it:

The monitor cache was storing full OpenShiftClusterDocument objects (~50-200KB each) including large fields that are never used during monitoring:

  • AdminKubeconfig, AROServiceKubeconfig (~10-15KB each)
  • PullSecret (~1-5KB)
  • SSHKey, KubeadminPassword, RegistryProfiles
  • ServicePrincipalProfile.ClientSecret

Because every cached document carried these fields, the monitor's memory usage grew in proportion to the number of clusters. The existing code even had a TODO comment acknowledging this issue:

// TODO: improve memory usage by storing a subset of doc in mon.docs

This PR introduces stripUnusedFields() which creates a lightweight copy of documents containing only the fields needed for monitoring, reducing memory usage by an estimated 15-30KB per cached document.
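
The idea can be sketched as follows. This is a minimal illustration, not the merged implementation: the struct types below are simplified, hypothetical stand-ins for the real document types, only a handful of fields are modelled, and kubeconfig handling is left out for brevity. The point it shows is that the cache receives a fresh, smaller copy while the original document is left untouched.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real document types.
// Only a few representative fields are modelled.
type Properties struct {
	ProvisioningState string // read by the monitor
	PullSecret        string // never read by the monitor
	SSHKey            []byte // never read by the monitor
	KubeadminPassword string // never read by the monitor
}

type OpenShiftCluster struct {
	ID         string
	Location   string
	Properties Properties
}

type OpenShiftClusterDocument struct {
	ID               string
	Key              string
	OpenShiftCluster *OpenShiftCluster
}

// stripUnusedFields returns a new document holding only the fields the
// monitor reads. The input is never mutated; nil in gives nil out.
func stripUnusedFields(doc *OpenShiftClusterDocument) *OpenShiftClusterDocument {
	if doc == nil {
		return nil
	}
	stripped := &OpenShiftClusterDocument{ID: doc.ID, Key: doc.Key}
	if doc.OpenShiftCluster == nil {
		return stripped
	}
	oc := doc.OpenShiftCluster
	stripped.OpenShiftCluster = &OpenShiftCluster{
		ID:       oc.ID,
		Location: oc.Location,
		Properties: Properties{
			// Copy only what monitoring needs; secrets stay at zero value.
			ProvisioningState: oc.Properties.ProvisioningState,
		},
	}
	return stripped
}

func main() {
	full := &OpenShiftClusterDocument{
		ID:  "doc-1",
		Key: "key-1",
		OpenShiftCluster: &OpenShiftCluster{
			ID:       "cluster-1",
			Location: "centralus",
			Properties: Properties{
				ProvisioningState: "Succeeded",
				PullSecret:        "large-secret-blob",
				SSHKey:            []byte("private-key-bytes"),
				KubeadminPassword: "password",
			},
		},
	}
	cached := stripUnusedFields(full)
	fmt.Printf("state=%q pullSecretLen=%d\n",
		cached.OpenShiftCluster.Properties.ProvisioningState,
		len(cached.OpenShiftCluster.Properties.PullSecret))
}
```

Building a new value with an explicit allow-list of fields, rather than copying the document and zeroing the unwanted fields, means a field added to the type later is excluded from the cache unless someone opts it in.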

Test plan for issue:

  • Added comprehensive unit tests for stripUnusedFields() function covering:
    • Nil document handling
    • Sensitive field stripping (PullSecret, SSHKey, ClientSecret, etc.)
    • Kubeconfig preference (AROServiceKubeconfig over AdminKubeconfig)
    • Preservation of required monitoring fields
    • PlatformWorkloadIdentityProfile presence detection
  • All existing monitor package tests pass

Is there any documentation that needs to be updated for this PR?

No documentation updates required. This is an internal optimization that doesn't change any external behavior or APIs.

How do you know this will function as expected in production?

  1. Code analysis: Thoroughly reviewed all monitor code paths (pkg/monitor/cluster/, pkg/monitor/hive/, pkg/monitor/azure/nsg/) to identify exactly which fields from OpenShiftCluster are accessed. Only those fields are retained in the stripped document.

  2. Existing metrics: The monitor already emits monitor.cache.size metrics which can be used to verify cache behavior remains correct after deployment.

  3. No behavioral change: The stripped document contains all fields needed for:

    • REST config creation (kubeconfig, APIServerPrivateEndpointIP)
    • Cluster monitoring (ProvisioningState, APIServerProfile.URL)
    • NSG monitoring (MasterProfile.SubnetID, WorkerProfiles[].SubnetID)
    • Hive monitoring (HiveProfile.Namespace)
    • Metrics dimensions (ID, Location, SubscriptionID)
  4. Safe fallback: If AROServiceKubeconfig is nil, the code falls back to AdminKubeconfig, matching existing behavior in restconfig.RestConfig().
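
The fallback in item 4 amounts to a small selection step. The sketch below assumes kubeconfigs are plain byte slices and uses a hypothetical helper name, selectKubeconfig; it illustrates the preference order described above and is not the code from restconfig.RestConfig() itself.

```go
package main

import "fmt"

// selectKubeconfig is a hypothetical helper illustrating the preference
// order: use the ARO service kubeconfig when it is set, otherwise fall
// back to the admin kubeconfig. Only the chosen one needs to be cached.
func selectKubeconfig(aroService, admin []byte) []byte {
	if aroService != nil {
		return aroService
	}
	return admin
}

func main() {
	// Both present: the ARO service kubeconfig wins.
	fmt.Println(string(selectKubeconfig([]byte("aro"), []byte("admin")))) // prints "aro"
	// ARO service kubeconfig missing: fall back to the admin kubeconfig.
	fmt.Println(string(selectKubeconfig(nil, []byte("admin")))) // prints "admin"
}
```

Because at most one kubeconfig survives, the cached copy keeps a single entry of roughly 10-15KB instead of two.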

…ocuments

The monitor cache was storing full OpenShiftClusterDocument objects including
large fields like kubeconfigs, pull secrets, SSH keys, and registry credentials
that are never used during monitoring. This caused unbounded memory growth
proportional to cluster count.

This change introduces stripUnusedFields() which creates a lightweight copy of
documents containing only the fields needed for monitoring:
- Document metadata (ID, Key, PartitionKey, Bucket)
- Cluster identity (ID, Name, Location, Type)
- Cluster state (ProvisioningState, FailedProvisioningState, etc.)
- Network config (APIServerPrivateEndpointIP, PreconfiguredNSG)
- Subnet info (MasterProfile.SubnetID, WorkerProfiles[].SubnetID)
- API access (APIServerProfile.URL, one kubeconfig)
- Hive integration (HiveProfile)
- Auth type detection (presence of ServicePrincipalProfile/PlatformWorkloadIdentityProfile)

Fields intentionally stripped to save memory:
- PullSecret, SSHKey, KubeadminPassword
- AdminKubeconfig (when AROServiceKubeconfig exists)
- UserAdminKubeconfig, RegistryProfiles
- ServicePrincipalProfile.ClientSecret
- PlatformWorkloadIdentityProfile details
- Worker profile details (VMSize, DiskSizeGB, etc.)
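
Two of the items above, trimming worker profiles down to their subnet and keeping only the presence of an identity profile, could look roughly like this. The types and helper names (trimWorkerProfiles, presenceOnly) are hypothetical simplifications for illustration, and the Details field is an invented placeholder for whatever the real profile carries.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real profile types.
type WorkerProfile struct {
	Name       string
	VMSize     string
	DiskSizeGB int
	SubnetID   string
}

type PlatformWorkloadIdentityProfile struct {
	Details map[string]string // placeholder for the real profile contents
}

// trimWorkerProfiles keeps one entry per worker profile but copies only
// SubnetID, which is what NSG monitoring needs.
func trimWorkerProfiles(in []WorkerProfile) []WorkerProfile {
	if in == nil {
		return nil
	}
	out := make([]WorkerProfile, len(in))
	for i, wp := range in {
		out[i] = WorkerProfile{SubnetID: wp.SubnetID}
	}
	return out
}

// presenceOnly preserves whether the profile was set (nil vs non-nil)
// while dropping its contents, so auth type detection still works.
func presenceOnly(p *PlatformWorkloadIdentityProfile) *PlatformWorkloadIdentityProfile {
	if p == nil {
		return nil
	}
	return &PlatformWorkloadIdentityProfile{}
}

func main() {
	workers := []WorkerProfile{
		{Name: "worker", VMSize: "Standard_D4s_v3", DiskSizeGB: 128, SubnetID: "subnet-a"},
	}
	trimmed := trimWorkerProfiles(workers)
	fmt.Printf("subnet=%q vmSize=%q\n", trimmed[0].SubnetID, trimmed[0].VMSize)

	pwi := &PlatformWorkloadIdentityProfile{Details: map[string]string{"k": "v"}}
	fmt.Println(presenceOnly(pwi) != nil, presenceOnly(nil) != nil)
}
```

Returning an empty non-nil struct keeps nil checks on the profile meaningful for callers that only ask which authentication type a cluster uses.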

Estimated memory savings: 15-30KB per cached document.
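
To put the per-document estimate in fleet-level terms, the arithmetic is just the saving multiplied by the number of cached documents. The sketch below takes 1KB as 1024 bytes and uses a purely hypothetical document count; it is not a measurement from any region.

```go
package main

import "fmt"

// Per-document savings range quoted in the estimate, in KB.
const (
	lowKBPerDoc  = 15
	highKBPerDoc = 30
)

// estimatedSavingsMiB scales the per-document estimate to a cache holding
// cachedDocs documents and returns the low and high bounds in MiB.
func estimatedSavingsMiB(cachedDocs int) (low, high float64) {
	const kbPerMiB = 1024.0
	low = float64(cachedDocs) * lowKBPerDoc / kbPerMiB
	high = float64(cachedDocs) * highKBPerDoc / kbPerMiB
	return low, high
}

func main() {
	docs := 10000 // hypothetical cache size, for illustration only
	low, high := estimatedSavingsMiB(docs)
	fmt.Printf("%d cached documents: roughly %.0f to %.0f MiB saved\n", docs, low, high)
}
```

For the hypothetical 10,000 documents this works out to roughly 146 to 293 MiB per monitor process.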

This addresses the memory leak pattern observed in CentralUS and other regions
since December 2025 (related to #itn-2026-00027).

@hlipsig (Collaborator) commented Feb 6, 2026

e2e failure is in the cluster clean up, unrelated. Merging.

@hlipsig hlipsig merged commit 7ba8a20 into master Feb 6, 2026
27 of 29 checks passed
@shubhadapaithankar shubhadapaithankar deleted the fix/monitor-cache-memory-optimization branch February 6, 2026 00:10