monitor: reduce memory usage by stripping unused fields from cached d…#4591
Merged
…ocuments

The monitor cache was storing full OpenShiftClusterDocument objects, including large fields like kubeconfigs, pull secrets, SSH keys, and registry credentials that are never used during monitoring. This caused unbounded memory growth proportional to cluster count.

This change introduces stripUnusedFields(), which creates a lightweight copy of documents containing only the fields needed for monitoring:

- Document metadata (ID, Key, PartitionKey, Bucket)
- Cluster identity (ID, Name, Location, Type)
- Cluster state (ProvisioningState, FailedProvisioningState, etc.)
- Network config (APIServerPrivateEndpointIP, PreconfiguredNSG)
- Subnet info (MasterProfile.SubnetID, WorkerProfiles[].SubnetID)
- API access (APIServerProfile.URL, one kubeconfig)
- Hive integration (HiveProfile)
- Auth type detection (presence of ServicePrincipalProfile/PlatformWorkloadIdentityProfile)

Fields intentionally stripped to save memory:

- PullSecret, SSHKey, KubeadminPassword
- AdminKubeconfig (when AROServiceKubeconfig exists)
- UserAdminKubeconfig, RegistryProfiles
- ServicePrincipalProfile.ClientSecret
- PlatformWorkloadIdentityProfile details
- Worker profile details (VMSize, DiskSizeGB, etc.)

Estimated memory savings: 15-30KB per cached document.

This addresses the memory leak pattern observed in CentralUS and other regions since December 2025 (related to #itn-2026-00027).
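The keep/strip split described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `Document`, `Properties`, `WorkerProfile`, and `stripSketch` are hypothetical, heavily flattened stand-ins for the real API types and for stripUnusedFields(), and only a few representative fields are modelled.

```go
package main

// All types below are hypothetical, trimmed stand-ins for the real API types.

type WorkerProfile struct {
	SubnetID   string
	VMSize     string
	DiskSizeGB int
}

type Properties struct {
	ProvisioningState    string
	MasterSubnetID       string
	WorkerProfiles       []WorkerProfile
	AROServiceKubeconfig []byte
	AdminKubeconfig      []byte
	UserAdminKubeconfig  []byte
	PullSecret           []byte
	SSHKey               []byte
	KubeadminPassword    string
}

type Document struct {
	ID, Key, PartitionKey string
	Name, Location        string
	Properties            Properties
}

// stripSketch builds a new document holding only the fields the monitor
// reads. Everything not copied explicitly stays at its zero value, so the
// large secrets become unreachable once the full document is released.
func stripSketch(doc *Document) *Document {
	if doc == nil {
		return nil
	}
	out := &Document{
		ID:           doc.ID,
		Key:          doc.Key,
		PartitionKey: doc.PartitionKey,
		Name:         doc.Name,
		Location:     doc.Location,
		Properties: Properties{
			ProvisioningState: doc.Properties.ProvisioningState,
			MasterSubnetID:    doc.Properties.MasterSubnetID,
		},
	}

	// Keep exactly one kubeconfig: the service one when present,
	// otherwise the admin one.
	if doc.Properties.AROServiceKubeconfig != nil {
		out.Properties.AROServiceKubeconfig = doc.Properties.AROServiceKubeconfig
	} else {
		out.Properties.AdminKubeconfig = doc.Properties.AdminKubeconfig
	}

	// Worker profiles keep only the subnet ID; sizing details are dropped.
	for _, wp := range doc.Properties.WorkerProfiles {
		out.Properties.WorkerProfiles = append(out.Properties.WorkerProfiles,
			WorkerProfile{SubnetID: wp.SubnetID})
	}
	return out
}
```

Building a fresh value and copying an allow-list of fields, rather than cloning and then clearing fields, means a field added to the type later is stripped by default until someone opts it in.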
hlipsig
reviewed
Feb 5, 2026
hlipsig
approved these changes
Feb 5, 2026
Collaborator
e2e failure is in the cluster cleanup, unrelated. Merging.
Which issue this PR addresses:
Fixes #itn-2026-00027 - Investigate probable memory leak in RP code as shown by RP VMSS behaviour (CentralUS and other regions, pattern going back to December 2025)
What this PR does / why we need it:
The monitor cache was storing full OpenShiftClusterDocument objects (~50-200KB each), including large fields that are never used during monitoring:

- AdminKubeconfig, AROServiceKubeconfig (~10-15KB each)
- PullSecret (~1-5KB)
- SSHKey, KubeadminPassword, RegistryProfiles
- ServicePrincipalProfile.ClientSecret

This caused unbounded memory growth proportional to cluster count. The existing code even had a TODO comment acknowledging this issue:

// TODO: improve memory usage by storing a subset of doc in mon.docs

This PR introduces stripUnusedFields(), which creates a lightweight copy of documents containing only the fields needed for monitoring, reducing memory usage by an estimated 15-30KB per cached document.

Test plan for issue:
stripUnusedFields() function covering:
Is there any documentation that needs to be updated for this PR?
No documentation updates required. This is an internal optimization that doesn't change any external behavior or APIs.
How do you know this will function as expected in production?
- Code analysis: Thoroughly reviewed all monitor code paths (pkg/monitor/cluster/, pkg/monitor/hive/, pkg/monitor/azure/nsg/) to identify exactly which fields from OpenShiftCluster are accessed. Only those fields are retained in the stripped document.
- Existing metrics: The monitor already emits monitor.cache.size metrics, which can be used to verify cache behavior remains correct after deployment.
- No behavioral change: The stripped document contains all fields needed for:
- Safe fallback: If AROServiceKubeconfig is nil, the code falls back to AdminKubeconfig, matching existing behavior in restconfig.RestConfig().
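The safe-fallback point can be illustrated with a small sketch. It assumes, as stated above, that the consumer prefers AROServiceKubeconfig and falls back to AdminKubeconfig when it is nil. The `kubeconfigs` type and both functions are hypothetical names for illustration and do not reproduce the real restconfig.RestConfig() signature.

```go
package main

// kubeconfigs is a hypothetical stand-in for the two kubeconfig fields
// on the cluster properties.
type kubeconfigs struct {
	AROServiceKubeconfig []byte
	AdminKubeconfig      []byte
}

// selectKubeconfig mirrors the selection rule described in the PR text:
// prefer the service kubeconfig, fall back to the admin one when it is nil.
func selectKubeconfig(k kubeconfigs) []byte {
	if k.AROServiceKubeconfig != nil {
		return k.AROServiceKubeconfig
	}
	return k.AdminKubeconfig
}

// stripKubeconfigs keeps only the kubeconfig that selectKubeconfig would
// pick, so selection on the stripped value matches the full value.
func stripKubeconfigs(k kubeconfigs) kubeconfigs {
	if k.AROServiceKubeconfig != nil {
		return kubeconfigs{AROServiceKubeconfig: k.AROServiceKubeconfig}
	}
	return kubeconfigs{AdminKubeconfig: k.AdminKubeconfig}
}
```

Because the strip step applies the same rule as the selection step, the kubeconfig chosen from a stripped value is the same one that would be chosen from the full value, both when the service kubeconfig is present and when it is nil.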