[202412] Backport GCU container framework (#4310) + perf enhancements (#4476, #4478, #4554) #352

Open
rimunagala wants to merge 8 commits into 202412 from rimunagala/202412-gcu-container-backport-v2

rimunagala commented May 28, 2026

Backport GCU enhancements to 202412 for per-branch GCU sidecar container

Why

This backport unblocks the MRC isolation scenario by enabling a per-branch
GCU sidecar container build off the 202412 source tree. Without these
enhancements the 202412 GCU has both correctness gaps (multi-ASIC) and
performance gaps (no YANG load cache, no bulk leaf-list moves) compared with
the master container that the team has been delivering via KubeSONiC.

It also pulls in the gcu-standalone packaging from upstream PR #4310 so a
202412-source-based sidecar container can actually be built (entry point +
wheel).

Commits

In dependency order:

  1. #4310 — Create GCU wheel (be57e378)
    Adds setup.py, main.py, pytest.ini, .coveragerc, defines the
    gcu-standalone entry point. Required for any container build off this
    branch.

  2. #4476 — Cache loadData() calls (015eacbd)
    YANG loadData is hashed and cached; subsequent identical loads short-
    circuit. Major apply-patch latency win. (Upstream quiet=True parameter
    intentionally dropped — purely cosmetic logging suppression, not in 202412
    sonic-yang-mgmt, no behavioral impact.)

  3. #4478 — BulkLeafListMoveGenerator (8bedc0a9)
    Coalesces N leaf-list element changes into a single REPLACE move.

  4. #4554 — Multi-ASIC SonicDBConfig init in gcu-standalone (a810a595)
    3-line fix: initializes SonicDBConfig from gcu-standalone for multi-
    ASIC platforms (would otherwise crash on multi-ASIC).

  5. fix: remove --path-trace usage not yet backported to 202412 (21ddca6a)
    Upstream PR #4317 added the --path-trace click option referenced by
    config/main.py:apply_patch. That PR is observability-only and outside
    this backport's scope. Strips the 5 path_trace/trace_io lines from
    apply_patch and removes the 3 corresponding test methods.
    Fixes flake8 F821 undefined name 'path_trace'.

  6. fix(gcu): complete --path-trace removal started in 21ddca6 (47e51daf)
    Sibling cleanup in generic_config_updater/main.py (the gcu-standalone
    entry point) — removes the same --path-trace argparse surface that 21ddca6
    removed from the config click wrapper, so the two entry points stay aligned.

  7. fix(gcu): strip --time from list-checkpoints (320afd98)
    Upstream PR #3746 (which added a time= parameter to
    GenericUpdater.list_checkpoints) is not backported to 202412. Calling
    list_checkpoints(args.time, args.verbose) would crash with
    TypeError: takes 2 positional arguments but 3 were given. Removes the
    -t/--time flag, the args.time parameter passing, and the conditional
    timestamp-formatter branch.

  8. fix(gcu): adapt #4478 BulkLeafListMoveGenerator to 202412 JsonMoveGroup (abab28a7)
    The headline incomplete-cherry-pick fix. Upstream #4478 calls
    JsonMoveGroup(self.__class__.__name__, JsonMove(...)) — passing the
    generator class name as a path-trace label. On 202412 the JsonMoveGroup
    signature (from the prior #254 lineage, "Azure [action] [PR:3831] Generic
    Configuration Updater (GCU) performance enhancements") is
    __init__(self, move=None), so it doesn't accept the name argument. Drops
    the self.__class__.__name__
    argument at the single call site. Functionally equivalent because path-
    trace observability is being explicitly removed by 21ddca6/47e51daf.
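
The hash-and-cache behavior described for #4476 in item 2 can be sketched as follows. This is a minimal illustration, not the upstream implementation: `CachedLoader`, `load_fn`, and the sample ConfigDB payload are hypothetical names, and SHA-256 over canonical JSON is an assumed choice of hash.

```python
import hashlib
import json


class CachedLoader:
    """Hypothetical wrapper: skips a reload when the config payload is unchanged."""

    def __init__(self, load_fn):
        self._load_fn = load_fn      # stand-in for the expensive YANG load
        self._last_digest = None
        self.load_count = 0          # how many times the real load actually ran

    def load(self, config):
        # Canonical JSON so that key order does not change the digest.
        payload = json.dumps(config, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest == self._last_digest:
            return False             # cache hit: the identical load short-circuits
        self._load_fn(config)
        self._last_digest = digest   # only recorded after a successful load
        self.load_count += 1
        return True


loader = CachedLoader(load_fn=lambda cfg: None)
first = loader.load({"PORT": {"Ethernet0": {"mtu": "9100"}}})
second = loader.load({"PORT": {"Ethernet0": {"mtu": "9100"}}})
third = loader.load({"PORT": {"Ethernet0": {"mtu": "1500"}}})
```

Only the first and third calls reach `load_fn`; the second is skipped because its digest matches the previous one.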

Validation

Detailed evidence is in the PR comments below. Summary:

  • CI: license/cla, Semgrep, CodeQL Analyze (python) — all SUCCESS
  • KVM 21-test functional sweep (kvm-t0 testbed) — all green; see comment
    #issuecomment-4604977295
  • Mellanox SN5640 hardware validation (str4-sn5640-7, 514-port Spectrum-4,
    597 KB ConfigDB), wheel built directly from PR HEAD abab28a7:
    • Stock-vs-fix matrix on 12 scenarios → 25.3× speedup on leaf-list shrink/grow,
      33.9× on rollback, 26.4× on dry-run, ~13× overall matrix wall time
    • 8 functional/entrypoint/negative tests (T1-T8): gcu-standalone subcommands,
      idempotency, invalid input handling, --ignore-path, and confirmation that
      --time and --path-trace are rejected
    • See comment #issuecomment-4607582232
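
The leaf-list shrink/grow scenarios are the case #4478 targets: N element changes coalesced into a single REPLACE move. A rough sketch of that idea using plain JSON Patch-style dictionaries follows. `leaf_list_moves` and the VLAN path are hypothetical and do not reproduce GCU's JsonMove classes.

```python
def leaf_list_moves(path, current, target, bulk=True):
    """Return JSON Patch-style operations turning `current` into `target`."""
    if current == target:
        return []
    if bulk:
        # One REPLACE of the whole leaf-list, however many elements differ.
        return [{"op": "replace", "path": path, "value": list(target)}]
    # Per-element alternative: remove extras from the back, then append missing ones.
    ops = []
    for index in range(len(current) - 1, -1, -1):
        if current[index] not in target:
            ops.append({"op": "remove", "path": f"{path}/{index}"})
    for item in target:
        if item not in current:
            ops.append({"op": "add", "path": f"{path}/-", "value": item})
    return ops


path = "/VLAN/Vlan100/dhcp_servers"          # illustrative leaf-list path
current = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
target = ["10.0.0.1", "10.0.0.5", "10.0.0.6"]

bulk_ops = leaf_list_moves(path, current, target)
per_element_ops = leaf_list_moves(path, current, target, bulk=False)
```

In this example the per-element path produces five operations (three removes, two adds) while the bulk path produces one, which is where a sorter that validates every intermediate move would save work.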

Multi-ASIC path (a810a595) was not directly exercised on a multi-ASIC device
in this validation cycle — single-ASIC behavior is unchanged and the fix follows
the standard SONiC load_db_config() initialization pattern (same as config,
show, sonic-cfggen).
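
As a sketch of the initialization pattern referred to above, assuming it amounts to "load the global DB config once on multi-ASIC platforms, the local one otherwise": `StubDBConfig` and its method names are hypothetical stand-ins so the example runs off-device, and they are not the swsscommon API.

```python
class StubDBConfig:
    """Stand-in for the DB config singleton; not the swsscommon class."""
    global_init = False
    local_init = False
    load_calls = []

    @classmethod
    def load_global(cls):
        cls.global_init = True
        cls.load_calls.append("global")

    @classmethod
    def load_local(cls):
        cls.local_init = True
        cls.load_calls.append("local")


def load_db_config(is_multi_asic, db_config=StubDBConfig):
    """Initialize the DB config once, picking the multi-ASIC or single-ASIC file."""
    if is_multi_asic:
        if not db_config.global_init:
            db_config.load_global()
    elif not db_config.local_init:
        db_config.load_local()


load_db_config(is_multi_asic=True)
load_db_config(is_multi_asic=True)   # second call is a no-op thanks to the guard
```

The guard is what makes the call safe to place at an entry point: calling it again from another code path does not reload the config.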

Out of scope (separate follow-ups)

  • sonic-net/sonic-mgmt#23848 (enforce timeout in apply_patch() shared
    helper) → backport to Azure/sonic-mgmt:202412 first; framework prereq
    for the smoke test below
  • sonic-net/sonic-mgmt#24092 (GCU apply-patch performance smoke test) →
    backport to Azure/sonic-mgmt:202412 after #23848
  • Triggering the 202412 sonic-addon-containers build to produce the
    per-branch GCU sidecar image — done after this PR merges

Background

  • Internal feature: ADO #37580620 "Enable GCU Sidecar Container Framework"
  • Delivery: KubeSONiC sidecar from sonic-addon-containers repo
  • Blocker without this PR: container build needs OSVersion="202412" source

xincunli-sonic and others added 5 commits May 28, 2026 16:45
* Create GCU wheel

Signed-off-by: Xincun Li <stli@microsoft.com>

* fix test

Signed-off-by: Xincun Li <stli@microsoft.com>

* Refactor

Signed-off-by: Xincun Li <stli@microsoft.com>

* Fixed show vxlan remotemac ambiguity (#4121)

Summary:
When users type ambiguous commands like show vxlan remote, they see a Python traceback instead of a clean error message because the CLI finds multiple matching commands and raises an exception.

Root Cause
The AliasedGroup.get_command() method calls ctx.fail() which raises a UsageError exception that propagates through Click's bash completion system, causing the traceback.

Approach:
Implement context-aware error handling in the AliasedGroup.get_command() method to differentiate between:
- Bash completion context, where tracebacks should be suppressed
- Normal command execution context, where clean error messages should be displayed

How did you do it?
Added environment variable detection to handle bash completion context differently
Command execution results in a clear error message without any tracebacks
No changes to the existing CLI functionality
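
The context-aware handling described above can be illustrated without Click. This is a sketch under assumptions: `resolve_command`, `AmbiguousCommand`, the subcommand names, and detecting completion through an environment variable ending in `_COMPLETE` are all illustrative choices, not the code from #4121.

```python
import os


class AmbiguousCommand(Exception):
    """Raised for an ambiguous prefix during normal command execution."""


def resolve_command(prefix, commands, environ=None):
    """Resolve a unique command prefix; treat ambiguity differently per context."""
    environ = os.environ if environ is None else environ
    if prefix in commands:
        return prefix
    matches = sorted(name for name in commands if name.startswith(prefix))
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return None
    # Assumption: shell completion is signalled by a *_COMPLETE variable.
    completing = any(key.endswith("_COMPLETE") for key in environ)
    if completing:
        return None  # stay silent so completion output is not polluted
    raise AmbiguousCommand("Too many matches: " + ", ".join(matches))


commands = ["remotemac", "remotevni", "remotevtep", "tunnel"]
unique = resolve_command("tun", commands, environ={})
silent = resolve_command("remote", commands, environ={"_SHOW_COMPLETE": "bash_complete"})
try:
    resolve_command("remote", commands, environ={})
    message = ""
except AmbiguousCommand as exc:
    message = str(exc)
```

In a real CLI the exception would be caught at the top level and printed as a one-line usage error, so the user never sees a traceback.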

Signed-off-by: Gnanapriya Sethuramarajan <gsethuramara@marvell.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos across utilities_common, config plugins, and misc modules (#4264)

* Fix spelling typos across utilities_common, config plugins, and misc modules

Fix misspellings found via codespell across 37 files:

utilities_common/: Neighbhor, seperate, Contants, Explicity, classs,
  retreive/retreived, wont, fileds, statisitics
crm/: recources, neigbor
flow_counter_util/: formated
sonic_cli_gen/: separeted
sonic_installer/: Excpetion, commond, necessarry, threhold, Verifing,
  orignal, reconcilation
sonic_package_manager/: wether, componenets, infromation, spliting,
  Seperator, Wether
pfcwd/: explicitely, Paramter
sfputil/: EEEPROM
pcieutil/: Vender
acl_loader/: overriden
dump/: Multipe, incluing, orignal, recieved, Proceding
generic_config_updater/: relevent, acending, confing, happend
config/: Defualt, configurtion, patter, seperated, obect, cummulative
rcli/: commmand
sonic-utilities-data/: obect (template)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Lihua Yuan <lihuay@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* In route_check.py, Convey the IJSON Backend using an env variable (#4294)

* Fix route_check.py to not hog a lot of memory

This diff modifies route_check.py to invoke the vtysh command directly
rather than going through "show". It then attempts to interpret one route
at a time in a paginated manner, which prevents a sudden transient memory
buildup. The zebra process already does the right thing and backs off
when the output socket buffers are full. There is probably scope to
improve that further
(Refer to
https://sonicfoundation.dev/2025-sonic-hackathon-most-impactful-award-spotlight-optimizing-output-buffer-memory-for-show-commands/)

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix merge conflicts related test failure from upstream

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix precommit check failure

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Revert back to using the TIMEOUT from the earlier code.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fixed review comments from upstream

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Removed CHUNK_SIZE as it is not used any more

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix multi asic connection creation (#4109)

- What I did
Create a cache for the SonicV2Connector objects. On a multi-ASIC system we currently create n interfaces × m namespaces connectors, which is very high and causes the show interfaces counters command to crash:

root@sonic:/home/admin# show interfaces counters
Traceback (most recent call last):
  File "/usr/local/bin/portstat", line 168, in <module>
    main()
  File "/usr/local/bin/portstat", line 158, in main
    portstat.cnstat_diff_print(cnstat_dict, {}, ratestat_dict, intf_list, use_json, print_all, errors_only,
  File "/usr/local/lib/python3.11/dist-packages/utilities_common/portstat.py", line 572, in cnstat_diff_print
    port_speed = self.get_port_speed(key)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/utilities_common/portstat.py", line 373, in get_port_speed
    self.db = multi_asic.connect_to_all_dbs_for_ns(ns)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sonic_py_common/multi_asic.py", line 81, in connect_to_all_dbs_for_ns
    db.connect(db_id)
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2069, in connect
    return _swsscommon.SonicV2Connector_Native_connect(self, db_name, retry_on)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Unable to connect to redis - Cannot assign requested address(1): Cannot assign requested address

- How I did it
Cache the connectors in a dictionary

- How to verify it
Run show interfaces counters command
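
A minimal sketch of the dictionary cache described above: `ConnectorCache` and the lambda factory are hypothetical stand-ins for the real connector construction, and the port and namespace names are illustrative.

```python
class ConnectorCache:
    """Caches one connector per namespace instead of creating one per lookup."""

    def __init__(self, factory):
        self._factory = factory   # stand-in for the real connector constructor
        self._by_namespace = {}
        self.created = 0

    def get(self, namespace):
        if namespace not in self._by_namespace:
            self._by_namespace[namespace] = self._factory(namespace)
            self.created += 1
        return self._by_namespace[namespace]


cache = ConnectorCache(factory=lambda ns: {"namespace": ns})
namespaces = ["asic0", "asic1"]
interfaces = [f"Ethernet{i}" for i in range(0, 64, 4)]   # 16 hypothetical ports

for interface in interfaces:
    for ns in namespaces:
        cache.get(ns)   # without the cache this loop would open 32 connections
```

With the cache, the number of connections scales with the number of namespaces rather than with interfaces × namespaces.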

Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Add q3d SKUs to gcu_field_operation_validators.conf.json (#4201)

Signed-off-by: arista-hpandya <hpandya@arista.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* sonic-utilities: Support for clearing aggregate VOQ counters(#2001) (#4044)

* Caching the current counters when sonic-clear queuecounters is executed.
* Calculating and displaying the difference in counter values when the show command is run.
* Providing clear CLI messaging to indicate the behavior when run from supervisor (clear aggregate VOQ counters only).
* Unit test for clear aggregate VOQ counters is added verifying the data is cached and counters are cleared properly.
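
The cache-then-diff approach in the bullets above can be sketched as below. The function names and counter keys are hypothetical, and the sketch keeps the snapshot in memory where a real tool would persist it between invocations.

```python
def clear_counters(current):
    """Snapshot the counters; the underlying values themselves are left untouched."""
    return dict(current)


def show_counters(current, snapshot):
    """Displayed value = current counter minus the value cached at clear time."""
    return {name: value - snapshot.get(name, 0) for name, value in current.items()}


hw = {"Ethernet0|VOQ0": 1500, "Ethernet0|VOQ1": 320}
snapshot = clear_counters(hw)
hw["Ethernet0|VOQ0"] += 40          # traffic arrives after the clear
shown = show_counters(hw, snapshot)
```

After the clear, the first queue shows only the 40 new packets and the second shows zero, even though the raw counters were never reset.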

Signed-off-by: manish <manish1@arista.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [multi-asic][Mellanox] Add multi-ASIC support for generate_dump and update FW upgrade script (#4192)

- What I did
Add multi-ASIC support for generate_dump and update FW upgrade script

- How I did it
1. Refactor collect_mellanox() to support multi-ASIC architecture
2. Add collect_mellanox_sai_sdk_dump() function to collect SAI SDK dumps per ASIC
3. Process CMIS host management files for each ASIC instance separately
4. Collect SAI SDK dumps in parallel for all ASICs using background processes
5. Update fast-reboot to use mlnx-fw-manager instead of mlnx-fw-upgrade.sh
6. Fix file paths to be relative to SKU folder for multi-ASIC setups
7. Support namespace-aware command execution for multi-ASIC environments

- How to verify it
Run regression tests

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Added counterpoll CLI support (#4106)

* Added counterpoll CLI support (enable/disable/interval/show)

Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>

* change port_attr to port_phy_attr

Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>

* add unit tests for counterpoll phy configs

Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>

---------

Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Add current and configured frequency to DOM CLI (#4209)

* Add current and configured frequency to DOM CLI

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Update unit test for 400ZR.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Fix the parameter name.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Update the command reference doc.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Redact vendor details.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Added requested tx power to dom output

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Update command reference.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Fix unit test.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Fix linting error.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Undo the output changes.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

---------

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix multi asic initialization for dump command (#4108)

- What I did
Add initializeGlobalConfig for the dump command on multi-ASIC systems. This prevents the following error:

root@dut:/home/admin# dump state interface Ethernet0 -n asic0
Traceback (most recent call last):
  File "/usr/local/bin/dump", line 8, in <module>
    sys.exit(dump())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/dump/main.py", line 96, in state
    collected_info = populate_fv(collected_info, module, namespace, ctx.obj.conn_pool, obj.return_pb2_obj())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/dump/main.py", line 159, in populate_fv
    conn_pool.get(db_name, namespace)
  File "/usr/local/lib/python3.11/dist-packages/dump/match_infra.py", line 316, in get
    self.cache[ns][CONN] = self.initialize_connector(ns)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/dump/match_infra.py", line 298, in initialize_connector
    return SonicV2Connector(namespace=ns, use_unix_socket_path=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2138, in __init__
    for db_name in self.get_db_list():
                   ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2075, in get_db_list
    return _swsscommon.SonicV2Connector_Native_get_db_list(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: :- validateNamespace: Initialize global DB config using API SonicDBConfig::initializeGlobalConfig
On multi asic system

- How I did it
Initialize global config

- How to verify it
Run unit test

Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix issue that namespace is not correctly fetched in Multi ASIC environment for mirror capability checking (#4159)

- What I did
Fix issue sonic-net/sonic-mgmt#21690

- How I did it
The logic to check the mirror capability is:

1. orchagent exposes the capability in the SWITCH_CAPABILITY table in STATE_DB during initialization.
2. The CLI (config mirror) fetches the capability from the table when a user issues a command.

In a multi-ASIC environment, the table is in the ASIC's namespace, but the CLI command fetches the capability from the host. As a result it always treats mirroring as unsupported and fails the test.

Fixed by checking the mirror capability in the namespaces determined by the source and destination ports.

- How to verify it
Manual test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix the PSU show command error message on platform without psu at all (#4151)

- What I did
Downgrade the message shown when no PSU is detected at all from an error to a more moderate informational message.

- How I did it
Change the printed output and remove the redundant messages.

- How to verify it
UT as well as manual test

- Previous command output (if the output of a command-line utility has changed)
Error: Failed to get the number of PSUs
Error: Failed to get PSU status
Error: failed to get PSU status from state DB

- New command output (if the output of a command-line utility has changed)
PSU not detected

Signed-off-by: Yuanzhe Liu <yualiu@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Update bash completions for sonic-utilities commands (#4163)

What I did
Update the bash completion files for all sonic-utilities commands to make them compatible with the current Click version.

Fixes sonic-net/sonic-buildimage#24594.

How I did it
Use Click's documentation to generate the bash completion script for each command that is packaged from sonic-utilities and uses Click.

How to verify it
Tested in KVM in Trixie image.

admin@vlab-01:~$ sonic-package-manager
install     list        manifests   migrate     repository  reset       show        uninstall   update
admin@vlab-01:~$ sonic-package-manager
install     list        manifests   migrate     repository  reset       show        uninstall   update
admin@vlab-01:~$ sonic-package-manager
install     list        manifests   migrate     repository  reset       show        uninstall   update
admin@vlab-01:~$ spm
install     list        manifests   migrate     repository  reset       show        uninstall   update
admin@vlab-01:~$ spm ^C
admin@vlab-01:~$ show
Display all 105 possibilities? (y or n)
aaa                       buffer_pool               environment               icmp                      macsec                    passw-hardening           runningconfiguration      suppress-fib-pending      vlan
acl                       chassis                   event-counters            interfaces                management_interface      pbh                       serial_console            switch                    vnet
arp                       clock                     fabric                    ip                        mgmt-vrf                  pfc                       services                  switch-hash               vrf
asic-sdk-health-event     copp                      feature                   ipv6                      mirror_session            pfcwd                     sflow                     switch-trimming           vrrp
auto-techsupport          dhcp4relay-counters       fg-nhg                    kdump                     mmu                       platform                  snmpagentaddress          syslog                    vrrp6
auto-techsupport-feature  dhcp6relay_counters       fg-nhg-member             kubernetes                muxcable                  policer                   snmptrap                  system-health             vxlan
banner                    dhcp_relay                fg-nhg-prefix             ldap                      nat                       priority-group            spanning-tree             system-memory             warm_restart
bfd                       dhcp_server               fgnhg                     ldap-server               ndp                       processes                 srv6                      tacacs                    watermark
bgp                       dhcprelay_helper          flowcnt-route             line                      ntp                       queue                     ssh                       techsupport               ztp
bmp                       dns                       flowcnt-trap              lldp                      nvgre-tunnel              radius                    startupconfiguration      uptime
boot                      dropcounters              headroom-pool             logging                   nvgre-tunnel-map          reboot-cause              storm-control             users
buffer                    ecn                       history                   mac                       p4-table                  route-map                 subinterfaces             version
admin@vlab-01:~$ config
aaa                       cbf                       dropcounters              interface_naming_mode     loopback                  nvgre-tunnel-map          reload                    spanning-tree             unique-ip
acl                       chassis                   ecn                       ipv6                      macsec                    override-config-table     replace                   ssh                       vlan
apply-patch               checkpoint                fabric                    kdump                     mclag                     passw-hardening           rollback                  subinterface              vnet
asic-sdk-health-event     clock                     feature                   kubernetes                member                    pbh                       route                     suppress-fib-pending      vrf
auto-techsupport          console                   fg-nhg                    ldap                      mirror_session            pfcwd                     save                      switch-hash               vxlan
auto-techsupport-feature  delete-checkpoint         fg-nhg-member             ldap-server               mmu                       platform                  serial_console            switch-trimming           warm_restart
banner                    dhcp_relay                fg-nhg-prefix             list-checkpoints          muxcable                  portchannel               sflow                     switchport                watermark
bgp                       dhcp_server               flowcnt-route             load                      nat                       qos                       snmp                      synchronous_mode          yang_config_validation
bmp                       dhcpv4_relay              hostname                  load_mgmt_config          ntp                       radius                    snmpagentaddress          syslog                    ztp
buffer                    dns                       interface                 load_minigraph            nvgre-tunnel              rate                      snmptrap                  tacac
Note that these commands don't have a completion script generated, likely because an exception is being raised when just importing that module:

Cannot generate completion for counterpoll.main:cli!
Cannot generate completion for debug.main:cli!
Cannot generate completion for fwutil.main:cli!
Cannot generate completion for psuutil.main:cli!
Cannot generate completion for sfputil.main:cli!
Cannot generate completion for undebug.main:cli!

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [GCU] Update WRED_PROFILE and BUFFER_POOL validators for GCU (#4219)

What I did
- Remove strict validation for WRED_PROFILE changes
- Add stricter controls on BUFFER_POOL changes
- Other RDMA tables do not need strict validators

How I did it
Modify the allowlist of ops and fields

How to verify it
Tested on lab device

admin@STR-SN5640-RDMA-1:~$ sudo config apply-patch -v buffer_pool_allowed_replace.json
Patch Applier: localhost: Patch application starting.
Patch Applier: localhost: Patch: [{"op": "replace", "path": "/BUFFER_POOL/ingress_lossless_pool/size", "value": "136200192"}, {"op": "replace", "path": "/BUFFER_POOL/egress_lossy_pool/size", "value": "136200192"}]
Patch Applier: localhost getting current config db.
Patch Applier: localhost: simulating the target full config after applying the patch.
Patch Applier: localhost: validating all JsonPatch operations are permitted on the specified fields
Failed to apply patch due to: Failed to apply patch on the following scopes:
- localhost: Modification of BUFFER_POOL table is illegal- validating function generic_config_updater.field_operation_validators.rdma_config_update_validator returned False
Usage: config apply-patch [OPTIONS] PATCH_FILE_PATH
Try "config apply-patch -h" for help.

Error: Failed to apply patch on the following scopes:
- localhost: Modification of BUFFER_POOL table is illegal- validating function generic_config_updater.field_operation_validators.rdma_config_update_validator returned False

Validation for RDMA tables

| Table                           | GCU Supported | Validator Present | Allowed Ops                         | Notes |
|---------------------------------|---------------|-------------------|-------------------------------------|-------|
| WRED_PROFILE                    | ✅ Yes        | ❌ Removed        | add, replace, remove                | YANG-only enforcement is sufficient |
| BUFFER_POOL                     | 🚫 No         | ✅ Yes            | none (blocked)                      | Blocked due to potential unintended ASIC impact |
| BUFFER_PROFILE                  | ⚠️ Limited    | ✅ Yes            | replace, add (field-specific)       | Strictly allow-listed by validator. Only `dynamic_th` field change allowed on this table |
| BUFFER_QUEUE                    | ✅ Yes        | ❌ No             | add, replace, remove (entry-level)  | Field-level remove of profile is invalid (leafref → "0"); entry-level remove works |
| BUFFER_PG                       | ✅ Yes        | ❌ No             | add, replace, remove (entry-level)  | Field-level remove of profile is invalid (leafref → "0"); entry-level remove works |
| BUFFER_PORT_EGRESS_PROFILE_LIST | ✅ Yes        | ❌ No             | add, replace, remove                | No RDMA-specific validator |
| BUFFER_PORT_INGRESS_PROFILE_LIST| ✅ Yes        | ❌ No             | add, replace, remove                | No RDMA-specific validator |
| QUEUE                           | ✅ Yes        | ❌ No             | add, replace, remove                | Used to bind scheduler and wred_profile per (port\|queue). Remove likely unsafe unless entry-level delete is supported by YANG |
| PORT_QOS_MAP                    | ✅ Yes        | ❌ No             | add, replace                        | Bindings only (`dscp_to_tc_map`, `tc_to_pg_map`, `tc_to_queue_map`, `tc_to_dscp_map`). Ignore PFC/PFCWD for this SKU |
| SCHEDULER                       | ✅ Yes        | ❌ No             | replace                             | Update weight for DWRR schedulers only. Type changes not permitted |
| DSCP_TO_TC_MAP                  | 🚫 No (blocked)| ❌ No            | none (blocked)                      | Observed failure: config apply-patch fails at “Patch Sorter - Strict … scopes” (YANG/scope enforcement). Treat as no-ops allowed for now |
| TC_TO_QUEUE_MAP                 | 🚫 No (blocked)| ❌ No            | none (blocked)                      | Observed failure: “Failed to apply patch on scopes …” → treat as no-ops allowed for now |
| TC_TO_PRIORITY_GROUP_MAP        | 🚫 No (blocked)| ❌ No            | none (blocked)                      | Same class of failure as mapping tables above |

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* generate_dump: add interface FEC stats (#4093)

Add FEC stats to the tarball produced by "show tech". The stats can
be found in files named "interface.counters.fec-stats_$idx".

Signed-off-by: Fraser Gordon <fraserg@arista.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [sfputil] Fix issue: should not do low power mode or reset for non-present ports (#4206)

- What I did
Ignore get_lpmode, set_lpmode, and reset for ports with no module present

- How I did it
Check module presence before calling get_lpmode, set_lpmode, reset

- How to verify it
New unit test - PASSED
Manual test - PASSED

Signed-off-by: Junchao-Mellanox <junchao@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Use Singleton PlatformDataProvider to reduce module import time (#4183)

- What I did
Reduce the time taken by the fwutil show command, which displays the usage/help message, by lazily importing PlatformDataProvider. This reduced the average time by about 50% (from 1156 ms to 558 ms in the runs below).

- How I did it
Use a singleton PlatformDataProvider in fwutil/main.py

- How to verify it
Before the change

Running 'fwutil show' 10 times (gap 5s)...
Run 1: 972 ms
Run 2: 1058 ms
Run 3: 948 ms
Run 4: 1213 ms
Run 5: 1507 ms
Run 6: 1235 ms
Run 7: 1553 ms
Run 8: 1037 ms
Run 9: 1000 ms
Run 10: 1037 ms
---- fwutil show stats ----
Avg: 1156 ms
Min: 948 ms
Max: 1553 ms

After the change

Running 'fwutil show' 10 times (gap 5s)...
Run 1: 496 ms
Run 2: 482 ms
Run 3: 466 ms
Run 4: 445 ms
Run 5: 482 ms
Run 6: 463 ms
Run 7: 780 ms
Run 8: 662 ms
Run 9: 653 ms
Run 10: 659 ms
---- fwutil show stats ----
Avg: 558 ms
Min: 445 ms
Max: 780 ms
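
One way to get a lazily constructed singleton in Python is `functools.lru_cache` on a zero-argument accessor, sketched below. The `PlatformDataProvider` here is a stand-in with a simulated delay, and `get_provider`, `show_help`, and the usage string are hypothetical; the actual change in fwutil/main.py may be structured differently.

```python
import functools
import time


class PlatformDataProvider:
    """Stand-in for an expensive-to-construct provider (not the fwutil class)."""
    instances = 0

    def __init__(self):
        time.sleep(0.01)             # simulates slow platform probing
        PlatformDataProvider.instances += 1


@functools.lru_cache(maxsize=None)
def get_provider():
    """Build the provider on first use and hand back the same object afterwards."""
    return PlatformDataProvider()


def show_help():
    return "Usage: fwutil show [OPTIONS] COMMAND"   # placeholder; never touches the provider


help_text = show_help()
built_for_help = PlatformDataProvider.instances
first = get_provider()
second = get_provider()
```

Paths that only print help never pay the construction cost, and paths that do need the provider pay it once.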

Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [Fast-linkup] Added CLIs for config/show (#4182)

HLD: fast-link-up-hld.md

What I did
Implemented CLI for Fast-linkup feature including:

- config feature parameters
- enable/disable the feature per-port
- show feature parameters
- show interfaces feature status

How I did it
By adding the new command support to config and show CLI

How to verify it
Run Fast-linkup CLIs

Which release branch to backport (provide reason below if selected)
- 202511

New command output (if the output of a command-line utility has changed)
admin@sonic:/home/admin# show switch-fast-linkup global
+---------------+---------+
| Field         |   Value |
+===============+=========+
| ber_threshold |      10 |
+---------------+---------+
| guard_time    |      15 |
+---------------+---------+
| polling_time  |      60 |
+---------------+---------+
admin@sonic:/home/admin# show interfaces fast-linkup status
+-------------+---------------+
| Interface   | fast_linkup   |
+=============+===============+
| Ethernet0   | true          |
| Ethernet4   | true          |
| Ethernet8   | true          |
| Ethernet12  | false         |
| Ethernet16  | false         |
| Ethernet20  | false         |
| Ethernet24  | false         |
| Ethernet28  | false         |
| Ethernet32  | false         |
| Ethernet36  | false         |
| Ethernet40  | false         |
| Ethernet44  | false         |
| Ethernet48  | false         |
| Ethernet52  | false         |
| Ethernet56  | false         |
| Ethernet60  | false         |
| Ethernet64  | false         |
| Ethernet68  | false         |
| Ethernet72  | false         |
| Ethernet76  | false         |
| Ethernet80  | false         |
| Ethernet84  | false         |
| Ethernet88  | false         |
| Ethernet92  | false         |
| Ethernet96  | false         |
| Ethernet100 | false         |
| Ethernet104 | false         |
| Ethernet108 | false         |
| Ethernet112 | false         |
| Ethernet116 | false         |
| Ethernet120 | false         |
| Ethernet124 | false         |
+-------------+---------------+

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Update the error message for sfputil debug loopback command (#4224)

* Update the error message for sfputil debug loopback command when diag pages are not supported.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Update unit tests.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Fix flake8 error.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

* Fix unit test.

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>

---------

Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* refactor: enhance show bfd summary command (#4242)

Update show bfd summary to aggregate BFD sessions across all ASIC namespaces when no -n <namespace> is provided.
Extend multi-ASIC BFD tests and expected output for the all-ASIC summary.

Signed-off-by: Chenyang Wang <chenyangw233@gmail.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix JsonMove._get_value to Support Both String and Integer List Indices (#4237)

What I did:
Issue: #4221

Updated JsonMove._get_value to handle both string and integer indices when traversing lists in config data.
Adjusted related unit tests to reflect the new behavior.
How I did it:
Modified the traversal logic to convert string tokens to integers when accessing lists, allowing both "1" and 1 as valid indices.
Removed the test expecting a TypeError for integer indices and added assertions for both string and integer index access.
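
The traversal described above can be sketched as a standalone function. The loop body mirrors the lines shown in the grep output in the verification section; the function name `get_value` and the sample config are illustrative assumptions.

```python
def get_value(config, tokens):
    """Walk `config` along `tokens`, converting a token to int for lists."""
    for token in tokens:
        if isinstance(config, list):
            token = int(token)
        config = config[token]
    return config


# Hypothetical sample data: a table whose value is a list.
sample = {"TABLE": {"key": ["a", "b", "c"]}}
by_string = get_value(sample, ["TABLE", "key", "1"])
by_integer = get_value(sample, ["TABLE", "key", 1])
```

Both calls reach the same list element, because `int()` accepts either form of the index.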
How to verify it:
Patched change in lab device, confirmed.

admin@STR-SN5640-RDMA-1:~$ cat /usr/local/lib/python3.11/dist-packages/generic_config_updater/patch_sorter.py | grep -C 2 "int(token)"
        for token in tokens:
            if isinstance(config, list):
                token = int(token)
            config = config[token]

admin@STR-SN5640-RDMA-1:~$ cat t_tc_to_queue_map_modify.json
[
  {
    "op": "replace",
    "path": "/TC_TO_QUEUE_MAP/AZURE/8",
    "value": "8"
  },
  {
    "op": "add",
    "path": "/TC_TO_QUEUE_MAP/AZURE/7",
    "value": "7"
  }
]

admin@STR-SN5640-RDMA-1:~$ sudo config apply-patch -v t_tc_to_queue_map_modify.json
Patch Applier: localhost: Patch application starting.
Patch Applier: localhost: Patch: [{"op": "replace", "path": "/TC_TO_QUEUE_MAP/AZURE/8", "value": "8"}, {"op": "add", "path": "/TC_TO_QUEUE_MAP/AZURE/7", "value": "7"}]
Patch Applier: localhost getting current config db.
Patch Applier: localhost: simulating the target full config after applying the patch.
Patch Applier: localhost: validating all JsonPatch operations are permitted on the specified fields
Patch Applier: localhost: validating target config does not have empty tables,
                            since they do not show up in ConfigDb.
Patch Applier: localhost: sorting patch updates.
Patch Sorter - Strict: Validating patch is not making changes to tables without YANG models.
Patch Sorter - Strict: Validating target config according to YANG models.
Patch Sorter - Strict: Sorting patch updates.
Patch Applier: The localhost patch was converted into 1 change:
Patch Applier: localhost: applying 1 change in order:
Patch Applier:   * [{"op": "replace", "path": "/TC_TO_QUEUE_MAP/AZURE/7", "value": "7"}, {"op": "replace", "path": "/TC_TO_QUEUE_MAP/AZURE/8", "value": "8"}]
Patch Applier: localhost: verifying patch updates are reflected on ConfigDB.
Patch Applier: localhost patch application completed.
Patch applied successfully.
Also run the updated unit tests; all tests should pass, confirming the fix.

Signed-off-by: Xincun Li <stli@microsoft.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix j2 files not getting packaged (#4250)

What I did

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix failure with ijson library

A failure occurred when sonic-mgmt tests were run in a KVM. It appears to be caused by the environment: there, ijson is not able to find the C libraries required to set a default backend. Force a Python backend.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Incorporate feedback from Sai

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Pick the python backend for ijson

The alternative C backend has an issue that is best described by a
comment from saiarcot895 in
https://github.com/sonic-net/sonic-utilities/pull/4205

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Add multi-asic support for sonic-clear queue wredcounters and counter poll , --nonzero support for show queue wredcounters (#4152)

* Add multi-asic support for sonic-clear queue wredcounters and counterpoll , --nonzero support for show queue wredcounters

* Add multi-asic support for sonic-clear queue wredcounters

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>

* Fix the flake8 error

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>

---------

Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [Mellanox] Add restricted sysfs to fw control list (#4240)

- What I did
Added the interrupt sysfs to the restricted fw control sysfs list, and take the hw_present value only if control == 1.

- How I did it
Updated generate_dump script

- How to verify it
run show techsupport on switch

Signed-off-by: noaOrMlnx <noaor@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Clearing /tmp/tmp* is unsafe with parallel builds (#4268)

* Clearing /tmp/tmp* is unsafe with parallel builds

Many tests for various packages use /tmp/tmp.XXXXXXXX or
/tmp/tmpi_XXXXX as the temporary file or directory pattern for
mktemp.  Since the same slave container is used for multiple
simultaneous builds, destroying an in-progress build's temporary
file or directory will cause those builds to fail.

While this has existed for a year, it appears the introduction
of Trixie has reordered the builds a bit so that packages using
the temp file patterns impacted are built simultaneously.

Signed-off-by: Brad House <bhouse@nexthop.ai>

* subprocess does not need to invoke the shell

glob pattern is no longer used so we don't need to spawn a shell to
interpret.
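
A rough illustration of this point: once the command takes one explicit path rather than a glob, it can be passed as an argument list and no shell is needed. The `remove_path` helper and the `example_` prefix are hypothetical, the actual command in the repository may differ, and a POSIX `rm` is assumed to be available.

```python
import os
import subprocess
import tempfile


def remove_path(path):
    """Remove one explicit path; an argument list means no shell is spawned."""
    subprocess.run(["rm", "-rf", path], check=True)


# Create a private scratch directory and remove only that directory.
scratch = tempfile.mkdtemp(prefix="example_")
remove_path(scratch)
```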

Signed-off-by: Brad House <bhouse@nexthop.ai>

---------

Signed-off-by: Brad House <bhouse@nexthop.ai>
Co-authored-by: Brad House <brad@brad-house.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix dump port state CLI command crash on multi-asic platforms (#4229)

* Fix masic dump port state crash

The error occurs because the code checks if any database configuration is loaded,
but multi-ASIC systems specifically need the global database configuration to be loaded.

Fixed it by using isGlobalInit() check for multi-ASIC and isInit() for single-ASIC to
ensure the correct DB configuration is loaded before creating connectors.

Signed-off-by: setu <setu@arista.com>

* Fix masic dump port state crash

The error occurs because the code checks if any database configuration is loaded,
but multi-ASIC systems specifically need the global database configuration to be loaded.

Fixed it by calling load_db_config helper function to ensure the correct
DB configuration is loaded before creating connectors.

Signed-off-by: setu <setu@arista.com>

---------

Signed-off-by: setu <setu@arista.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Add .github/copilot-instructions.md for AI-assisted development (#4271)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Add filesystem sync after plugin installation (#4251)

- Why I did it
In some scenarios, if a power cycle happens right after a plugin is installed, the file content may be lost.
Before the power cycle the file size is 205 and the register function can be found in the Python file, but after the power cycle the file size is 0. The assumed cause is that the page cache was not written back to disk before the power cycle happened.
Before power cycle:

2026 Feb  3 10:34:16.156531 sonic-testbed INFO  [DIAGNOSTIC] Starting CLI plugins installation for package: cpu-report
2026 Feb  3 10:34:16.157013 sonic-testbed INFO  [DIAGNOSTIC] Installing CLI plugin: package=cpu-report, command=show, src=/show.py, dst=/usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
2026 Feb  3 10:34:16.157177 sonic-testbed INFO  [DIAGNOSTIC] Starting extract: image=sha256:1230c222517c88863253c94dba34a788b580604618373fff24ab737a7d519c3f, src=/show.py, dst=/usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
2026 Feb  3 10:34:16.267834 sonic-testbed INFO  [DIAGNOSTIC] Tar buffer size: 2048 bytes, MD5: b0b48780efda61d230dc2e3592cc3ba6
2026 Feb  3 10:34:16.268709 sonic-testbed INFO  [DIAGNOSTIC] Tar member: name=show.py, size=205, isfile=True
2026 Feb  3 10:34:16.269652 sonic-testbed INFO  [DIAGNOSTIC] File extracted successfully: path=/usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py, size=205, MD5=f2f3ca5258fd0685adf2cc44567934fb, elapsed=0.112s
2026 Feb  3 10:34:16.270313 sonic-testbed INFO  [DIAGNOSTIC] Python syntax validation: PASS for /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
2026 Feb  3 10:34:16.270820 sonic-testbed INFO  [DIAGNOSTIC] Plugin file verification after extract: path=/usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py, size=205, MD5=f2f3ca5258fd0685adf2cc44567934fb, mtime=1684332898.0, extract_time=0.113s
2026 Feb  3 10:34:16.271351 sonic-testbed INFO  [DIAGNOSTIC] Python syntax check: PASS for /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
2026 Feb  3 10:34:16.271638 sonic-testbed INFO  [DIAGNOSTIC] Found "def register" in plugin file: /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
2026 Feb  3 10:34:16.271918 sonic-testbed INFO  [DIAGNOSTIC] Completed CLI plugins installation for package: cpu-report, elapsed=0.115s
After power cycle:

admin@sonic-testbed:~$ show version 2>&1
failed to import plugin show.plugins.cpu-report: module 'show.plugins.cpu-report' has no attribute 'register'

admin@sonic-testbed:~$ ls -lih /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
830572 -rw-r--r-- 1 root root 0 May 17  2023 /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
admin@sonic-testbed:~$ sudo md5sum /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
d41d8cd98f00b204e9800998ecf8427e  /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
admin@sonic-testbed:~$ sudo stat /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
  File: /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 0,27    Inode: 830572      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2026-02-03 10:34:16.266593882 +0200
Modify: 2023-05-17 17:14:58.000000000 +0300
Change: 2026-02-03 10:34:16.262593831 +0200
 Birth: 2026-02-03 10:34:16.262593831 +0200
admin@sonic-testbed:~$ cat /usr/local/lib/python3.13/dist-packages/show/plugins/cpu-report.py
admin@sonic-testbed:~$

- What I did
Fix intermittent plugin corruption after power cycle by adding os.sync() to flush filesystem buffers after all CLI plugins are installed. This prevents incomplete plugin files that cause 'module has no attribute 'register'' errors in show commands after system reboot.

- How I did it
Added os.sync() system call in PackageManager._install_cli_plugins() method after all CLI plugin files are extracted and installed. This ensures that:

- All plugin file data is flushed from the OS page cache to disk
- File metadata and data are both persisted before the method returns
- Plugin files remain intact even if an abrupt power loss occurs shortly after installation
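
The approach can be sketched as below, assuming plugins are plain files written into a destination directory. `install_plugins` and its arguments are hypothetical and stand in for the real extraction logic in PackageManager._install_cli_plugins(); only `os.sync()` is taken from the description above.

```python
import os
import tempfile


def install_plugins(plugins, dest_dir):
    """Write each plugin file, then flush filesystem buffers once."""
    written = []
    for name, source in plugins.items():
        path = os.path.join(dest_dir, name)
        with open(path, "w") as f:
            f.write(source)
        written.append(path)
    # One call after all files are written, rather than one per file.
    os.sync()
    return written


dest = tempfile.mkdtemp()
paths = install_plugins({"cpu_report.py": "def register(cli):\n    pass\n"}, dest)
```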

- How to verify it
1. Install cpu-report package: sonic-package-manager install cpu-report==1.0.0 -y
2. Enable feature: config feature state cpu-report enabled
3. Upgrade package: sonic-package-manager install cpu-report==1.0.7 -y
4. Upgrade again: sonic-package-manager install cpu-report==1.0.8 -y
5. Immediately perform a power cycle
6. After reboot, run: show version
If the problem is present, the error is: failed to import plugin show.plugins.cpu-report: module 'show.plugins.cpu-report' has no attribute 'register'.

Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [multi-asic][warm_restart] add Multi-ASIC support for warm_restart commands (#4200)

- What I did
Added Multi-ASIC support for warm_restart commands.

- How I did it
Updated the warm restart commands to operate per ASIC namespace and handle multi-ASIC execution consistently.

- How to verify it
Run warm_restart commands on a Multi-ASIC system and confirm per-ASIC namespaces are handled.
Verify warm restart flags/status are correct per namespace.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [multi-asic][warm-reboot] Support warm-reboot on Multi-ASIC systems (#4199)

- What I did
Implement warm-reboot script support for Multi-ASIC systems.

- How I did it
Modified warm-reboot script.

- How to verify it
1. Verified on Multi-ASIC KVM with 4 ASICs
2. On boot SAI started in warm boot mode
3. Tested on single-ASIC real HW to ensure the flow is unchanged

---------

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Yair Raviv <yraviv@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [centralize_database] Add --namespace option (#4198)

- What I did
Added --namespace option to centralize_database script

- How I did it
Added --namespace option to centralize_database script

- How to verify it
Run centralize_database script with --namespace option

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [check_db_integrity] Add NETNS environment (#4197)

- What I did
Renamed DB dump files to include database name and namespace.

- How I did it
Adjusted the dump file naming to ".json" to uniquely identify per-ASIC/namespace outputs.

- How to verify it
Run the DB dump command with and without a namespace.
Confirm the output file name matches DBNAME plus NETNS (when provided).
Ensure dumps are still created successfully.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [warm/fast-reboot] check per-ASIC FW upgrade status (#4196)

- What I did
Added per-ASIC firmware upgrade status checks during warm/fast reboot.

- How I did it
Updated the warm/fast reboot flow to query and validate FW upgrade status per ASIC namespace instead of relying on a single/global check.

- How to verify it
Trigger warm/fast reboot on a Multi-ASIC system with mixed FW upgrade states and confirm the per-ASIC check reflects each namespace.
Confirm reboot proceeds only when all ASICs report FW upgrade completion.
Run existing warm reboot tests and ensure they pass.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [teamd_retry_count] Add support for --namespace parameter (#4195)

- What I did
Added support for --namespace parameter in both config portchannel retry-count CLI as well as teamd_increase_retry_count.py script to support Multi-ASIC systems.

- How I did it
Pass the namespace to DB interfaces and CLI commands. In the teamd_increase_retry_count.py script, switch to the network namespace to perform network operations within that namespace.

- How to verify it
Manual test.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [lag_keepalive] add `--namespace` option (#4194)

- What I did
Added --namespace option to lag_keepalive.py.

- How I did it
Added --namespace option to lag_keepalive.py.

- How to verify it
Run lag_keepalive.py with --namespace option.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [fast-reboot] Remove teamsyncd timer override by fast-boot (#4233)

Timer override to 1 sec was used to speed up kernel IP configuration on PortChannel as a W/A.
This PR reopens #3996

- What I did
Remove teamsyncd 1 sec timer override. It was used to speed up kernel IP configuration on PortChannel as a W/A.
Original issue is solved by sonic-net/sonic-swss#4170

- How I did it
Remove teamsyncd 1 sec timer override.

- How to verify it
Ran fast-boot and warm-boot tests.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Prevent early exit of reboot status (#4282)

Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* [multi-asic] fix utilities_common Db helper (#4273)

- What I did
This is to fix the utilities_common.Db() helper class.

Using it now in the multi-asic environment leads to an error:

RuntimeError: :- validateNamespace: Initialize global DB config using API SonicDBConfig::initializeGlobalConfig
This impacts the counterpoll switch CLI command.

- How I did it
Added a proper DB config initialization

- How to verify it
Manual test for the Db() helper
Running counterpoll switch disable in multi-asic environment

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Convey the IJSON Backend using an env variable

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Revert "Convey the IJSON Backend using an env variable"

This reverts commit 916442c9df260653783f14dcebfa65aa7f1ed393.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Convey the IJSON Backend using an env variable

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix flake8 error

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix flake8 errors

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

* Fix merge conflict error

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>

---------

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>
Signed-off-by: gpunathilell <gpunathilell@nvidia.com>
Signed-off-by: arista-hpandya <hpandya@arista.com>
Signed-off-by: manish <manish1@arista.com>
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>
Signed-off-by: Ariz Zubair <arizzubair@microsoft.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Yuanzhe Liu <yualiu@nvidia.com>
Signed-off-by: Fraser Gordon <fraserg@arista.com>
Signed-off-by: Junchao-Mellanox <junchao@nvidia.com>
Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>
Signed-off-by: Chenyang Wang <chenyangw233@gmail.com>
Signed-off-by: Xincun Li <stli@microsoft.com>
Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
Signed-off-by: noaOrMlnx <noaor@nvidia.com>
Signed-off-by: Brad House <bhouse@nexthop.ai>
Signed-off-by: setu <setu@arista.com>
Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Yair Raviv <yraviv@nvidia.com>
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Co-authored-by: Gagan Punathil Ellath <gpunathilell@nvidia.com>
Co-authored-by: HP <hpandya@arista.com>
Co-authored-by: manish1-arista <manish1@arista.com>
Co-authored-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Co-authored-by: Dhanasekar Rathinavel <dhanasekar@arista.com>
Co-authored-by: Ariz Zubair <5427064+az-pz@users.noreply.github.com>
Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Yuanzhe <150663541+yuazhe@users.noreply.github.com>
Co-authored-by: Saikrishna Arcot <sarcot@microsoft.com>
Co-authored-by: Dev Ojha <47282568+developfast@users.noreply.github.com>
Co-authored-by: Fraser Gordon <fraserg@arista.com>
Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
Co-authored-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>
Co-authored-by: Yair Raviv <73100906+YairRaviv@users.noreply.github.com>
Co-authored-by: Chenyang Wang <49756587+cyw233@users.noreply.github.com>
Co-authored-by: Xincun Li <147451452+xincunli-sonic@users.noreply.github.com>
Co-authored-by: saksarav-nokia <sakthivadivu.saravanaraj@nokia.com>
Co-authored-by: Noa Or <58519608+noaOrMlnx@users.noreply.github.com>
Co-authored-by: Brad House - NextHop <bhouse@nexthop.ai>
Co-authored-by: Brad House <brad@brad-house.com>
Co-authored-by: Setu Patel <171176331+arista-setu@users.noreply.github.com>
Co-authored-by: rustiqly <245760149+rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Jianyue Wu <jianyuew@nvidia.com>
Co-authored-by: Yakiv Huryk <62013282+Yakiv-Huryk@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in config/nat.py (#4258)

Fix repeated misspelling of 'configuration' (configutation) throughout
NAT configuration commands, plus 'suported' -> 'supported' and
'Enbale' -> 'Enable'.

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in config/config_mgmt.py (#4260)

Fix misspellings: managment, Seperator, dependecies, delets,
sucessful, relavant, compitible.

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in show/ and clear/ modules (#4263)

Fix misspellings in show and clear commands:
- dislay -> display (bgp_common.py)
- lastest -> latest (kdump.py)
- continous -> continuous (show/main.py, clear/main.py)
- deafult -> default (interfaces/__init__.py)
- Erorrs -> Errors (interfaces/__init__.py)
- fomatted -> formatted (5 plugin files)
- cummulative -> cumulative (auto_techsupport.py)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in scripts/ (#4262)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in config/main.py (#4261)

Fix the following spelling errors in comments and string literals:

- relavent -> relevant
- retreive -> retrieve
- cant -> can't
- environmnet -> environment
- funtion -> function
- dependecy -> dependency
- overriden -> overridden (2 occurrences)
- exmaple -> example
- sepcified -> specified (5 occurrences)
- Interation -> Iteration
- Remvoe -> Remove
- transciever -> transceiver (2 occurrences)
- Disble -> Disable
- doesnt exists -> doesn't exist (2 occurrences)
- doesnt exist -> doesn't exist
- doesnot exist -> does not exist (2 occurrences)
- cant delete -> can't delete (2 occurrences)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix spelling typos in muxcable modules (#4259)

Fix 'retreive' -> 'retrieve', 'cant' -> 'can\'t', 'standy' -> 'standby',
and 'detemine' -> 'determine' in config/muxcable.py and show/muxcable.py.

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix unit test assertions broken by spelling typo PRs (#4321)

What is the motivation for this PR
Fix unit test assertions broken by recent spelling correction PRs, and revert the 'Neighbhor' → 'Neighbor' header change which is intentionally preserved for backward compatibility.

How did you do it
Updated test expected strings to match corrected source messages and restored the 'Neighbhor' header in bgp_util.py.

How did you verify/test it
Not provided in PR description.

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Add fsync to config save to persist config across power cycle (#4313)

What I did
Fixed config_db.json not persisting across power cycle. Config changes (e.g., FEC) were lost after power cycle because data stayed in page cache and was never flushed to disk.

How I did it
Added flush() and os.fsync() after json.dump() to ensure the config is written to disk before returning, so it survives a power cycle.
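
A minimal sketch of the flush-then-fsync pattern, assuming the config is a plain dict. `save_config`, the sample config, and the temporary path are illustrative and not the actual code in config/main.py.

```python
import json
import os
import tempfile


def save_config(config, path):
    """Dump config as JSON and force it to disk before returning."""
    with open(path, "w") as f:
        json.dump(config, f, sort_keys=True, indent=4)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write the file data to disk


target = os.path.join(tempfile.mkdtemp(), "config_db.json")
save_config({"PORT": {"Ethernet0": {"fec": "auto"}}}, target)
```

Both calls are needed: `flush()` alone only moves data from the Python buffer into the OS page cache.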
How to verify it
config interface fec Ethernet0 auto
config save -y
cat /etc/sonic/config_db.json | grep -i fec

Signed-off-by: Xincun Li <stli@microsoft.com>

* [LACP retry-count] Syntax Fix for Trixie (#4274)

Signed-off-by: Yair Raviv <yraviv@nvidia.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* fix scapy delayed import when we have large routes (#4315)

* Fix delayed scapy import in teamd retry count script

Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>

* fix scapy delayed import.

Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>

---------

Signed-off-by: Hemanth Kumar Tirupati <htirupati@nvidia.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* fix: skip PORT_INGRESS/EGRESS_MIRROR_CAPABLE check for ERSPAN mirror sessions (#4323)

* fix: skip PORT_INGRESS/EGRESS_MIRROR_CAPABLE check for ERSPAN sessions

ERSPAN sessions (direction=None) use source/destination IPs, not ports.
The PORT_INGRESS_MIRROR_CAPABLE and PORT_EGRESS_MIRROR_CAPABLE capability
flags in STATE_DB only apply to SPAN (port mirror) sessions. Checking
these flags for ERSPAN incorrectly blocks session creation on platforms
that do not populate these STATE_DB keys (e.g., multi-ASIC T1 KVM).

Changes:
- Return True immediately when direction=None (ERSPAN) in
  is_port_mirror_capability_supported(), bypassing the capability check
- Treat absent STATE_DB keys (None value) as 'supported' for backward
  compatibility on platforms that don't populate SWITCH_CAPABILITY table

Fixes: https://github.com/sonic-net/sonic-mgmt/issues/21690

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Bing Wang <bingwang@microsoft.com>

* fix: skip PORT_INGRESS/EGRESS_MIRROR_CAPABLE check for ERSPAN sessions

ERSPAN sessions use src/dst IPs (GRE tunnel), not ports. The capability
flags PORT_INGRESS_MIRROR_CAPABLE and PORT_EGRESS_MIRROR_CAPABLE in
STATE_DB SWITCH_CAPABILITY|switch only apply to SPAN (port mirror) sessions.

Root cause: platforms that do not populate these STATE_DB keys return None,
which != 'true', so is_port_mirror_capability_supported() incorrectly returns
False and blocks ERSPAN session creation.

Fix:
- In validate_mirror_session_config(): skip the capability check entirely
  for ERSPAN sessions (dst_port=None is always passed by the ERSPAN code path)
- In is_port_mirror_capability_supported(): treat absent STATE_DB keys (None)
  as 'supported' for backward compatibility; direction=None now correctly
  checks both ingress and egress capabilities for SPAN sessions
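
The capability rule described in this fix can be sketched as follows. A plain dict stands in for the STATE_DB SWITCH_CAPABILITY|switch entry, and the direction values "rx", "tx", and "both" are assumptions for illustration; this is not the actual function body.

```python
INGRESS_KEY = "PORT_INGRESS_MIRROR_CAPABLE"
EGRESS_KEY = "PORT_EGRESS_MIRROR_CAPABLE"


def is_port_mirror_capability_supported(capabilities, direction):
    """Return False only when a relevant flag is present and not 'true'."""
    keys = []
    if direction in (None, "rx", "both"):
        keys.append(INGRESS_KEY)
    if direction in (None, "tx", "both"):
        keys.append(EGRESS_KEY)
    for key in keys:
        value = capabilities.get(key)
        # An absent key (None) is treated as supported.
        if value is not None and value != "true":
            return False
    return True
```

With this rule, a platform that never populates the table no longer blocks session creation, while an explicit non-'true' flag still does for SPAN sessions.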

Fixes: https://github.com/sonic-net/sonic-mgmt/issues/21690

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Bing Wang <bingwang@microsoft.com>

---------

Signed-off-by: Bing Wang <bingwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Fix 'show version' KeyError when sonic_version.yml has missing fields (#4324)

'show version' crashes with KeyError when debian_version or
kernel_version are missing from sonic_version.yml. This happens in
docker-sonic-vs containers where the version file is generated without
these fields (they are only set during full image builds).

Use .get() with sensible runtime fallbacks:
- debian_version: 'N/A' (not available in container context)
- kernel_version: os.uname().release (actual host kernel at runtime)
- build_version: 'N/A'
- sonic_os_version: 'N/A'
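
The fallbacks listed above can be sketched like this, assuming the parsed sonic_version.yml is available as a dict. The `version_fields` helper is hypothetical; only the keys and fallback values come from the description.

```python
import os


def version_fields(version_info):
    """Read version fields with fallbacks instead of raising KeyError."""
    return {
        "debian_version": version_info.get("debian_version", "N/A"),
        "kernel_version": version_info.get("kernel_version", os.uname().release),
        "build_version": version_info.get("build_version", "N/A"),
        "sonic_os_version": version_info.get("sonic_os_version", "N/A"),
    }


# A container-style version file that lacks most fields.
fields = version_fields({"build_version": "example-build"})
```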

Fixes sonic-net/sonic-buildimage#25765

Signed-off-by: securely1g <securely1g@users.noreply.github.com>
Signed-off-by: Xincun Li <stli@microsoft.com>

* Modified dualtor_neighbor_check to use mux neighbor_mode (#4227)

What I did
Adjusted the dualtor_neighbor_check.py based on mux neighbor_mode described in HLD : sonic-net/SONiC#2176

Output of dualtor_neighbor_check will now depend on neighbor_mode set in STATE_DB|MUX_CABLE_TABLE

How I did it
How to verify it
Previous command output (if the output of a command-line utility has changed)
NEIGHBOR      MAC                PORT         MUX_STATE    IN_MUX_TOGGLE    NEIGHBOR_IN_ASIC    TUNNEL_IN_ASIC    HWSTATUS
------------  -----------------  -----------  -----------  ---------------  ------------------  ----------------  ----------
192.168.0.3   16:8d:06:da:8d:0d  Ethernet8    active       no               yes                 no                consistent
192.168.0.5   42:85:ce:ff:2b:7a  Ethernet16   active       no               yes                 no                consistent
New command output (if the output of a command-line utility has changed)
================================================================================
Neighbors in PREFIX-ROUTE mode:
================================================================================
NEIGHBOR      MAC                PORT         MUX_STATE    IN_MUX_TOGGLE    NEIGHBOR_IN_ASIC    PREFIX_ROUTE    NEXTHOP_TYPE    HWSTATUS
------------  -----------------  -----------  -----------  ---------------  ------------------  --------------  --------------  ----------
192.168.0.7   5e:9d:89:07:66:83  Ethernet24   active       no               yes                 yes             NEIGHBOR        consistent
192.168.0.9   e2:2a:a8:65:1e:50  Ethernet32   active       no               yes                 yes             NEIGHBOR        consistent
================================================================================
Neighbors in HOST-ROUTE mode:
================================================================================
NEIGHBOR     MAC                PORT        MUX_STATE    IN_MUX_TOGGLE    NEIGHBOR_IN_ASIC    TUNNEL_IN_ASIC    HWSTATUS
-----------  -----------------  ----------  -----------  ---------------  ------------------  ----------------  ----------
192.168.0.3  16:8d:06:da:8d:0d  Ethernet8   active       no               yes                 no                consistent
192.168.0.5  42:85:ce:ff:2b:7a  Ethernet16  active       no               yes                 no                consistent

Signed-off-by: Xincun Li <stli@microsoft.com>

* [tests/gcu]: Improve code coverage for generic_config_updater/main.py and config/main.py

Add a new test module tests/generic_config_updater/main_test.py with
comprehensive unit tests covering:
- validate_patch_format: all valid ops, non-list/non-dict/missing-field/
  invalid-op branches
- get_all_running_config: success return, nonzero returncode exception
- filter_duplicate_patch_operations: no-leaf-list fast path, duplicate
  removal, non-duplicate unchanged, dict config input
- append_emptytables_if_required: no-insert, single/multiple missing
  tables, op-without-path skip, asic-scoped paths
- validate_patch: ImportError skip, True/False validation, unexpected
  exception, multi-asic all-asics loop and per-asic failure
- apply_patch_for_scope: success, exception -> failure, HOST_NAMESPACE
  scope mapping
- apply_patch_from_file: invalid format, no-preprocess success, preprocess
  helper invocations, preprocess validation failure, parallel threadpool
  dispatch, scope failure aggregation, empty patch single/multi-asic
- print_error / print_success output targets
- multiasic_save_to_singlefile: host + asic configs
- Sub-command functions: create_checkpoint, delete_checkpoint,
  list_checkpoints, apply_patch, replace_config, save_config,
  rollback_config (success, verbose output, failure -> sys.exit)
- build_parser: all 7 sub-commands with default and non-default flags
- main(): no-command help, all commands dispatched, missing-file exit

Extend tests/config_test.py (TestGenericUpdateCommands) to cover the
previously uncovered lines in config/main.py:
- print_dry_run_message: dry_run=True banner / dry_run=False no output
- run_gcu_standalone: basic cmd construction, non-default format flag,
  all optional flags (--dry-run, --parallel, --ignore-non-yang-tables,
  --ignore-path, --verbose), return value pass-through
- apply-patch GCU standalone redirect: success (returncode 0) and
  failure (returncode != 0 -> ctx.fail) branches

Signed-off-by: Xincun Li <stli@microsoft.com>

* [tests/gcu]: Fix flake8 lint errors in main_test.py

- Remove unused 'jsonpatch' import (F401)
- Add '# noqa: E402' to imports that follow sys.path.insert calls (E402)

Signed-off-by: Xincun Li <stli@microsoft.com>

* [GCU/config]: Port path-trace support from PR #4317

Adopt the --path-trace option added in upstream PR #4317.

config/main.py:
- Add import jsonpatch and validate_patch from generic_config_updater.main
- Update run_gcu_standalone() to forward --path-trace to the standalone binary
- In apply_patch(): when --path-trace is set, open the trace file and call
  GenericUpdater.apply_patch() directly with trace_io parameter, closing the
  file in a finally block; for the non-trace path delegate unchanged to
  _gcu_apply_patch_from_file()

generic_config_updater/main.py:
- Add trace_io=None parameter to apply_patch_for_scope() and
  apply_patch_from_file(), propagating it through parallel and serial dispatch
- Document the new parameter in the docstring
- Remove the incorrect open() call inside apply_patch_from_file() that would
  have leaked file handles and conflated file-path strings with IO objects

tests/config_test.py:
- Add test_apply_patch__path_trace_option__trace_file_opened_and_passed
- Update existing assertion helpers to include trace_io=None for the no-trace path

Signed-off-by: Xincun Li <stli@microsoft.com>

* [tests/config]: Update run_gcu_standalone test calls to pass path_trace=None

After adding the path_trace parameter to run_gcu_standalone(), existing
test call sites need to be updated to supply the new argument so the
function signature matches and --path-trace absence is explicitly asserted.

Signed-off-by: Xincun Li <stli@microsoft.com>

* [config/GCU]: Cleanup imports and fix multi-asic mock in apply_patch tests

config/main.py: remove unused validate_patch import from
generic_config_updater.main (apply_patch no longer calls it directly
after the path_trace refactor).

tests/generic_config_updater/main_test.py: add
mock.patch('sonic_py_common.multi_asic.is_multi_asic', return_value=False)
around the apply_patch_from_file test so it does not attempt real
multi-ASIC detection when running in a unit-test environment.

Signed-off-by: Xincun Li <stli@microsoft.com>

* [tests/config]: Fix validate_patch mock target after import cleanup

After removing the validate_patch re-import from config.main, the
@patch decorator in the path_trace tests must target the canonical
location generic_config_updater.main.validate_patch instead of
config.main.validate_patch.

Signed-off-by: Xincun Li <stli@microsoft.com>

* [config/apply-patch]: Guard standalone GCU redirect against infinite loop

The redirect to gcu-standalone used os.path.exists(GCU_STANDALONE_BIN)
as the sole condition. If the standalone binary itself ever re-enters
this code path (e.g. it also delegates to GCU_STANDALONE_BIN, or is
replaced by a wrapper that calls 'config apply-patch'), the process
would recurse or loop indefinitely.

Fix: set GCU_STANDALONE_ACTIVE=1 in the subprocess environment inside
run_gcu_standalone(), and add 'not os.environ.get(GCU_STANDALONE_ACTIVE)'
to the redirect guard. This ensures at most one level of delegation
regardless of what the standalone binary does.
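
The guard described above can be sketched as follows. This is a minimal illustration, not the code in config/main.py: the constant values, the helper names (should_redirect, build_child_env, run_delegated) and the injectable environ/exists/runner parameters are assumptions added so the logic can be exercised without a SONiC device.

```python
import os
import subprocess

# Names come from the commit message; the concrete values are assumptions.
GCU_STANDALONE_BIN = "/usr/local/bin/gcu-standalone"
GCU_STANDALONE_ACTIVE = "GCU_STANDALONE_ACTIVE"


def should_redirect(environ=None, exists=os.path.exists):
    """True only if the binary exists and we are not already inside a delegated call."""
    environ = os.environ if environ is None else environ
    return bool(exists(GCU_STANDALONE_BIN)) and not environ.get(GCU_STANDALONE_ACTIVE)


def build_child_env(environ=None):
    """Copy the environment and set the marker so a re-entrant call skips the redirect."""
    env = dict(os.environ if environ is None else environ)
    env[GCU_STANDALONE_ACTIVE] = "1"
    return env


def run_delegated(args, runner=subprocess.run):
    """Hypothetical wrapper: invoke the standalone binary with the marker set."""
    return runner([GCU_STANDALONE_BIN, *args], env=build_child_env())
```

Because the marker travels in the child's environment, any process the standalone binary spawns inherits it, which is what bounds the delegation depth at one.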

Signed-off-by: Xincun Li <stli@microsoft.com>

* [config/apply-patch]: Always delegate to _gcu_apply_patch_from_file, pass trace_io directly

Signed-off-by: Xincun Li <stli@microsoft.com>

* [config,generic_config_updater,utilities_common]: Fix stale comment and add sync warnings for DEFAULT_SUPPORTED_FECS_LIST

Signed-off-by: Xincun Li <stli@microsoft.com>

* [tests/config]: Fix GenericUpdater mock target for path_trace tests after GCU refactor

Signed-off-by: Xincun Li <stli@microsoft.com>

* [gcu]: Add explanatory comment at top of setup.py

Clarifies why gcu/ exists as a separate build context for the sonic-gcu
wheel, how gcu-standalone relates to sonic-utilities, and why setup.py
and pytest.ini must stay in gcu/ rather than being moved into
generic_config_updater/.

Signed-off-by: Xincun Li <stli@microsoft.com>

* [GCU/standalone]: Add --path-trace support to gcu-standalone apply-patch

The apply-patch subparser in build_parser() was missing the -t/--path-trace
argument, so any invocation that included --path-trace would be rejected
with 'unrecognized arguments' by the standalone binary.

- Add '-t'/'--path-trace' argument to the apply-patch subparser
- Wire it through apply_patch(args) by opening the file and passing the
  handle as trace_io= to apply_patch_from_file(), closing it in a finally
  block to avoid resource leaks

Signed-off-by: Xincun Li <stli@microsoft.com>

* [GCU]: Fix AttributeError for path_trace in apply_patch and test helper

apply_patch() accessed args.path_trace directly, causing AttributeError
when called from test Namespace objects or older callers that predate the
attribute. Use…
…ANG parsing (#4476)

The two caches in this PR target different layers:

1. _currently_loaded_hash in SonicYangCfg.loadData() — skips re-parsing when the same config (by content hash) is loaded consecutively. This helps when multiple validators call loadData() with identical config within a single move validation.

2. _validate_config_cache in ConfigWrapper.validate_config_db() — caches the validation result for a given config hash, so if the same config state is validated again later, it returns the cached pass/fail without calling loadData() at all.
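
How the two layers interact can be sketched with stand-in classes. This is an illustration under assumptions, not the sonic-yang-mgmt or GCU code: CachedLoader, CachedValidator, config_hash and the parse/check callables are hypothetical, and only the attribute names _currently_loaded_hash and _validate_config_cache are taken from the description above.

```python
import hashlib
import json


def config_hash(config):
    """Stable content hash of a JSON-serialisable config dict."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


class CachedLoader:
    """Stand-in for the loadData() layer: skip re-parsing the config already loaded."""

    def __init__(self, parse):
        self._parse = parse
        self._currently_loaded_hash = None
        self.parse_calls = 0  # instrumentation for the example only

    def load_data(self, config):
        digest = config_hash(config)
        if digest == self._currently_loaded_hash:
            return  # same content as the previous load: nothing to do
        self._currently_loaded_hash = None  # a failed parse must not leave a stale hash
        self._parse(config)
        self.parse_calls += 1
        self._currently_loaded_hash = digest


class CachedValidator:
    """Stand-in for the validate_config_db() layer: remember pass/fail per config hash."""

    def __init__(self, loader, check):
        self._loader = loader
        self._check = check
        self._validate_config_cache = {}

    def validate(self, config):
        digest = config_hash(config)
        if digest not in self._validate_config_cache:
            self._loader.load_data(config)
            self._validate_config_cache[digest] = bool(self._check(config))
        return self._validate_config_cache[digest]
```

The loader cache only helps for consecutive identical loads, while the validator cache also hits when the same state recurs later, which matches the split described in the two points above.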

Per-operation analysis

| Operation | Helps? | Why |
|---|---|---|
| REMOVE (individual) | ❌ No | Each DFS step removes one item → unique config at each step. Neither cache hits because every state is different. |
| ADD | ⚠️ Marginal | Typically 1 move → few loadData calls total. Cache might save 1 call if FullConfigMoveValidator and NoDependencyMoveValidator validate the same state. |
| REPLACE (scalar) | ⚠️ Marginal | Same as ADD: few moves, small absolute savings. |
| REMOVE (batched via #4478) | ✅ Yes | #4478 collapses N individual REMOVEs into 1 bulk REPLACE move. That single move still triggers multiple validator calls with the same config. Cache deduplicates those, reducing loads/move from ~10.6x to ~7.7 |

---------

Signed-off-by: Rithvick Reddy Munagala <rimunagala@microsoft.com>
(cherry picked from commit 5d54e441ad9a7881ca4182a56e77f45a8e44300b)
Add BulkLeafListMoveGenerator that produces a single REPLACE move for
leaf-list fields whose items differ between current and target configs,
instead of decomposing into N individual REMOVE/ADD moves.

This is registered as a non-extendable generator (tried before individual
moves in DFS). If validation fails, DFS falls through to per-item moves.

Impact: For a 512-port ACL table where half the ports are removed, this
reduces ~256 individual moves (each triggering 2 loadData calls at ~1.4s
each = ~717s) to 1 move (1 loadData = ~1.4s).

Conservative scope:
- Only handles leaf-lists (lists of scalars, not lists of dicts)
- Only replaces lists that exist in both current and target
- Falls through to individual moves if the bulk replace fails validation
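
The core idea, one REPLACE per differing leaf-list instead of N per-item moves, can be sketched on plain dicts. This is a simplified illustration and not the BulkLeafListMoveGenerator itself: it emits JSON-patch style operations rather than JsonMove objects, skips the DFS and validation fallback entirely, and the helper names are hypothetical.

```python
def is_leaf_list(value):
    """A leaf-list here is a list whose items are all scalars (no dicts or lists)."""
    return isinstance(value, list) and all(not isinstance(v, (dict, list)) for v in value)


def escape(token):
    """Escape a key for use in a JSON Pointer (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def bulk_leaf_list_replaces(current, target, tokens=()):
    """Yield one 'replace' op per leaf-list that exists in both configs with different items."""
    for key, cur_val in current.items():
        if key not in target:
            continue  # conservative scope: only nodes present on both sides
        tgt_val = target[key]
        path = tokens + (key,)
        if isinstance(cur_val, dict) and isinstance(tgt_val, dict):
            yield from bulk_leaf_list_replaces(cur_val, tgt_val, path)
        elif is_leaf_list(cur_val) and is_leaf_list(tgt_val) and cur_val != tgt_val:
            pointer = "/" + "/".join(escape(t) for t in path)
            yield {"op": "replace", "path": pointer, "value": tgt_val}
```

For an ACL table whose ports list shrinks from four entries to two, this yields a single replace on /ACL_TABLE/DATAACL/ports, where per-item decomposition would yield two removes.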

Signed-off-by: vaibhavhd <vaibhav.dixit@microsoft.com>
Co-authored-by: rookie-who <rookie-who@users.noreply.github.com>
(cherry picked from commit bfc67f56837f56a7e0fea1d22a6a653836d6e74a)
…upport (#4554)

The gcu-standalone entry point (main()) does not call load_db_config(), causing ConfigDBConnector(namespace='asicN') to fail with:

swsscommon.SonicDBException: validateNamespace: Initialize global DB config
on multi-ASIC platforms (e.g., Nokia 7250 IXR chassis).

Root Cause
When config apply-patch runs, the config Click group callback in config/main.py calls load_db_config() before dispatching — so the DB config is already initialized. However, gcu-standalone invokes generic_config_updater/main.py:main() directly, bypassing that initialization entirely.

Fix
Add load_db_config() as the first call in main(). This matches the established pattern used by every other standalone entry point in sonic-utilities (config/main.py, show/main.py, scripts/port2alias, scripts/portconfig, acl_loader/main.py, etc.).

load_db_config() is idempotent (guarded by isGlobalInit()/isInit() checks) and a no-op on single-ASIC devices where DB is already initialized.

Testing
- Verified on Nokia 7250 IXR multi-ASIC chassis (2 ASICs: asic0, asic1)
- gcu-standalone apply-patch with asic0-scoped patch proceeds past DB init (previously crashed)
- Single-ASIC regression: no behavior change (idempotent no-op)

(cherry picked from commit 5734b264e79c2a1e307b7818aa98c89881ab9d9e)
Upstream PR #4317 (GCU path tracing) is not in this backport scope. Strip the path_trace/trace_io references from apply_patch and remove the 3 corresponding --path-trace test methods. Fixes flake8 F821 undefined name path_trace at config/main.py:1576-1577 introduced by cherry-pick of #4310.
@rimunagala
Author

@microsoft-github-policy-service agree company="Microsoft"

@rimunagala
Author

/azp run Azure.sonic-utilities.msft.PR

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

 Commit 21ddca6 removed --path-trace from the outer 'config apply-patch'
 click command but left trace_io plumbing in generic_config_updater/main.py
 (the gcu-standalone entry point) and the corresponding -t/--path-trace
 argparse option. With GenericUpdater.apply_patch (the receiver in
 generic_updater.py) not accepting trace_io, every 'config apply-patch'
 invocation on 202412 fails:

   TypeError: GenericUpdater.apply_patch() got an unexpected keyword
   argument 'trace_io'

 Discovered during on-image validation on KVM SONiC-OS-20241212.58.

 This commit removes:
   - trace_io kwarg from apply_patch_for_scope / apply_patch_from_file
   - trace_io kwarg from the GenericUpdater().apply_patch() call
   - trace_file open/close logic in the apply_patch CLI handler
   - -t/--path-trace argparse option on gcu-standalone apply-patch
   - Stale trace_io docstring entry

 The --path-trace observability feature depends on upstream PR #4317
 which is intentionally out of scope for this 202412 backport (whose
 goal is GCU sidecar container delivery for MRC isolation). A separate
 followup backport can introduce path-trace cleanly when needed.
…ported)

PR #4310's gcu-standalone main.py calls
   updater.list_checkpoints(args.time, args.verbose)
 which assumes the 3-arg signature added by upstream PR
 sonic-net/sonic-utilities#3746 (Enhance list-checkpoints CLI,
 merged 2025-03-05). #3746 is NOT backported to 202412, so the
 202412 receiver still has the original 2-arg signature
 list_checkpoints(self, verbose), causing:

   Error: Failed to list checkpoints: GenericUpdater.list_checkpoints()
   takes 2 positional arguments but 3 were given

 Same class of incomplete-cherry-pick as the trace_io cleanup in
 47e51da — feature plumbing brought in by #4310, but its
 dependency PR was never backported.

 Same approach as 21ddca6 (--path-trace strip): drop the
 unbackported feature surface from gcu-standalone, leave the
 underlying receiver untouched.

 - Removes -t/--time argparse option
 - Simplifies list_checkpoints() output flow (no time-dict branch)
 - Caller now matches 202412 receiver's 2-arg signature

 Diagnostic-only loss: --time was UX sugar for last-modified
 timestamps; operators can still 'stat' checkpoint files
 directly. MRC apply-patch path unaffected.
…up signature

Upstream PR #4478 (BulkLeafListMoveGenerator, master commit bfc67f56) was written
against master's JsonMoveGroup(generator_name, move) signature introduced by
upstream PR #3831 (Generic Configuration Updater performance enhancements,
merged 2025-11-03).

When #3831 was backported to Azure 202412 as PR #254 (merge 3967604, merged
2025-11-09), the generator_name diagnostic parameter was deliberately omitted.
All 21 existing call sites were adapted to the 1-arg form JsonMoveGroup(move).

Our cherry-pick of #4478 (8bedc0a) added a 22nd call site in
BulkLeafListMoveGenerator._traverse but missed the same adaptation, leaving:

    yield JsonMoveGroup(
        self.__class__.__name__,
        JsonMove(diff, OperationType.REPLACE, list(tokens), list(tokens)),
    )

This fails at runtime the moment any leaf-list REPLACE is sorted (e.g. ACL
table ports replace):

    JsonMoveGroup.__init__() takes from 1 to 2 positional arguments
    but 3 were given

Drop the generator_name argument to match #254's convention. No functional
impact: generator_name is unused on 202412 (the field does not exist on the
class) and the leaf-list batching optimization itself is preserved.

Validated on KVM SONiC-OS-20241212.58: leaf-list REPLACE applies cleanly via
a single bulk move.
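
A signature mismatch of this kind can be caught before runtime with a small introspection check. The sketch below uses stand-in classes, not the real generic_config_updater types: the constructor signature mirrors the 1-arg form quoted in the validation notes, while the class body and the positional_params helper are assumptions for illustration.

```python
import inspect


class JsonMove:
    """Stand-in for the real JsonMove type."""


class JsonMoveGroup:
    """Stand-in with the 202412 constructor shape: (self, move: JsonMove = None)."""

    def __init__(self, move: JsonMove = None):
        self.moves = [] if move is None else [move]


def positional_params(cls):
    """Names of the positional parameters accepted by cls.__init__, excluding self."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = inspect.signature(cls.__init__).parameters.values()
    return [p.name for p in params if p.name != "self" and p.kind in positional_kinds]


# A post-install assertion could then read:
assert positional_params(JsonMoveGroup) == ["move"]
```

Calling the stand-in with the master-style two arguments raises TypeError, the same class of failure as the one quoted above.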

Signed-off-by: Ramana Munagala <rimunagala@microsoft.com>
@rimunagala
Author

rimunagala commented Jun 2, 2026

PR #352 — Updated validation evidence (post-fix sweep)

This PR now contains 8 commits: the 4 upstream cherry-picks plus 4 small fix commits covering the 3 incomplete-cherry-pick issues uncovered during validation. Each fix has a clear upstream-PR provenance trail and is independently surgical (1–17 net lines).

Commit stack

| SHA | Title | Net | Origin |
|---|---|---|---|
| be57e378 | Create GCU wheel (#4310) | cherry-pick | upstream #4310 |
| 015eacbd | Cache loadData() (#4476) | cherry-pick | upstream #4476 |
| 8bedc0a9 | Batch leaf-list (#4478) | cherry-pick | upstream #4478 |
| a810a595 | SonicDBConfig multi-ASIC (#4554) | cherry-pick | upstream #4554 |
| 21ddca6a | remove --path-trace usage | -1 file | upstream #4317 not in 202412 |
| 47e51daf | complete --path-trace removal | small | finishes 21ddca6 |
| 320afd98 | strip --time from list-checkpoints | +4/-16 | upstream #3746 not in 202412 |
| abab28a7 | adapt #4478 to 202412 JsonMoveGroup signature | -1 line | upstream #3831 backported as #254 dropping generator_name |

Three incomplete-cherry-pick fixes — provenance traced

1. 21ddca6a + 47e51daf — strip --path-trace

  • Upstream dependency: PR sonic-net/sonic-utilities #4317 "GCU: Add path tracing support"
  • Not present in 202412 — path_addressing.trace_io and related receiver methods don't exist
  • Cherry-pick of #4310 created gcu-standalone callers using args.path_trace
  • Fix: caller-side strip; container behavior is unchanged (path tracing was diagnostic only)

2. 320afd98 — strip --time from list-checkpoints

  • Upstream dependency: PR sonic-net/sonic-utilities #3746 "Enhance list-checkpoints CLI" (merged 2025-03-05, +183/-23)
  • Not present in 202412 — GenericUpdater.list_checkpoints(self, verbose) is the original 2020 signature from #1536, unchanged
  • Cherry-pick of #4310 wrote gcu-standalone calling list_checkpoints(args.time, args.verbose) (3-arg)
  • Repro before fix: gcu-standalone list-checkpoints → TypeError: list_checkpoints() takes 2 positional arguments but 3 were given
  • Fix: strip the --time flag and time-dict output branch from the gcu-standalone main.py — net +4/-16

3. abab28a7 — adapt #4478 to 202412 JsonMoveGroup signature

Validation matrix (KVM SONiC-OS-20241212.58)

All 21 scenarios run against the final commit (abab28a7):

| ID | Test | Result | Time | Notes |
|---|---|---|---|---|
| | All 7 gcu-standalone subcommands enumerated | | | apply-patch, replace, rollback, checkpoint, list-checkpoints, delete-checkpoint, save |
| A1 | apply-patch ADD | | ~1.6s | 1 change |
| A2 | apply-patch REMOVE | | ~1.6s | 1 change |
| A3 | apply-patch REPLACE leaf-list | | 3.30s | 1 change; was crashing pre-fix |
| B | mixed multi-op (replace+add+remove) | | 4.46s | sorter deduped no-op add → 2 changes |
| C1 | leaf-list replace cold | | 3.35s | |
| C2 | leaf-list replace warm | | 3.32s | small workload, cache delta minimal |
| D | invalid port name | ✅ fail-clean | 0.9s | libyang validation error, no traceback |
| E | SONICYANG format | ✅ path alive | 0.9s | (rejected on test input shape, not code bug) |
| F1 | config checkpoint | | 0.5s | |
| F2 | config list-checkpoints | | | post --time strip |
| F3 | config rollback <ckpt> | | 2.98s | round-trip, witness removed |
| F4 | config replace <whole-config> | | 3.00s | round-trip, witness removed |
| F5 | config delete-checkpoint | | | |
| G2 | leaf-list grow 4→32 ports | | 3.34s | 1 change; #4478 batching at scale |
| G3 | leaf-list shrink 32→4 ports | | 3.38s | 1 change; #4478 batching at scale |
| T1 | gcu-standalone apply-patch (container entrypoint) | | 2.21s | Same code path as config apply-patch |
| T2 | apply-patch -d (dry-run) on leaf-list REPLACE | | 1.63s | "DryRun: Would apply..." printed; ConfigDB unchanged |
| T3 | Import sanity | | | JsonMoveGroup sig: (self, move: JsonMove = None) |
| T4 | Idempotency (same patch twice) | | | run 2: 0 changes |
| T5 | Verbose mode -v | | | strict-validation logging clean, no generator_name refs |

Performance signal

  • Average apply-patch latency: ~1.6–3.4s depending on patch size
  • #4478 batching confirmed working: leaf-list grow 4→32 and shrink 32→4 both produce 1 sorted change instead of N. Without this PR's abab28a7 fix, that code path was unreachable.
  • #4476 cache is active; small workloads on KVM don't surface the per-call delta meaningfully — the actual win is at scale (master claim: ~256 moves × 1.4s/move → 1 move × 1.4s for 512-port ACL replace)

Container scope note

This PR ships the wheel — the Python code that the GCU sidecar container will install. The container image build (Dockerfile, helm/k8s plumbing, sidecar lifecycle) is intentionally a separate follow-up PR. No GCU container image exists on 202412 today; gcu-standalone is currently a wheel-installed script at /usr/local/bin/gcu-standalone. The container PR will use this wheel as its base layer.

What's deliberately not in scope

  • Backporting upstream #3746 (--time enhancement): out of scope for MRC isolation; 320afd98 strips the broken caller-side reference instead
  • Backporting upstream #4317 (path tracing): same rationale; 21ddca6a + 47e51daf strip the broken caller-side reference
  • Restoring generator_name field on JsonMoveGroup: matches the explicit choice already made by #254 (the Azure backport of #3831)

@rimunagala
Author

Hardware validation on Mellanox SN5640 (str4-sn5640-7)

Following up on the KVM evidence above (#issuecomment-4604977295) with end-to-end validation on a real Spectrum-4 device using a wheel built directly from this PR's HEAD (abab28a7).

Setup

  • Device: str4-sn5640-7, Mellanox SN5640 (Spectrum-4), HwSKU Mellanox-SN5640-C512S2
  • Image: SONiC 20241212.55, build 17bf2f6bae — clean stock 202412 image
  • Scale: 514 PORTs, 6 ACL_TABLEs (EVERFLOW/V6 at 66 ports, DATAACL at 34), 597 KB ConfigDB
  • Wheel: built from PR HEAD abab28a7 on dev-vm (sha256 3aa755c068f5dd6d7b64e5abb3215e7d62c2d1a83dad39c6d22648cdb95953cd); installed on device, post-install asserts confirm JsonMoveGroup.__init__ is 1-arg form (#254 lineage preserved) and gcu-standalone resolves.
  • Methodology: stock-baseline matrix → install wheel → repeat identical matrix → measure delta.

Caveat: stock 202412.55 ships with PFC_WD/*/{detection_time,restoration_time} and PFC_WD/GLOBAL/POLL_INTERVAL set to 3400, which violates the YANG 100..3000 range and blocks ANY config apply-patch (even an empty patch). This is not a regression introduced by this PR — it's an image-level data issue. Worked around by clamping the offending fields to 3000 in ConfigDB for the duration of the matrix, then restored to 3400 at the end. Also validated that gcu-standalone apply-patch -i /PFC_WD ... (the --ignore-path flag from #4554's standalone surface) sidesteps it cleanly — useful operator workaround.
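
The clamp-then-restore workaround can be sketched on an in-memory config dict. This is an illustration only: clamp_pfc_wd and restore_pfc_wd are hypothetical helpers, the dict layout and string-typed values are assumptions about how the PFC_WD table is represented, and nothing here talks to ConfigDB.

```python
import copy

YANG_MIN, YANG_MAX = 100, 3000  # range quoted in the caveat above
TIMER_FIELDS = ("detection_time", "restoration_time", "POLL_INTERVAL")


def clamp_pfc_wd(config):
    """Return (clamped_copy, originals); originals maps (key, field) to the pre-clamp value."""
    clamped = copy.deepcopy(config)
    originals = {}
    for key, entry in clamped.get("PFC_WD", {}).items():
        for field in TIMER_FIELDS:
            if field not in entry:
                continue
            raw = entry[field]
            bounded = min(max(int(raw), YANG_MIN), YANG_MAX)
            if bounded != int(raw):
                originals[(key, field)] = raw
                entry[field] = type(raw)(bounded)  # keep str or int as found
    return clamped, originals


def restore_pfc_wd(config, originals):
    """Return a copy of config with the saved pre-clamp values written back."""
    restored = copy.deepcopy(config)
    for (key, field), value in originals.items():
        restored["PFC_WD"][key][field] = value
    return restored
```

Recording the original values alongside the clamp is what makes the final restore to 3400 mechanical rather than manual.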

Headline speedup matrix (stock 202412.55 → wheel installed)

| # | Scenario | Stock | Post-fix | Speedup |
|---|---|---|---|---|
| P1 | Scalar add (1 move) | 11.7 s | 10.3 s | 1.14× |
| P2 | Scalar replace (1 move) | 13.5 s | 10.3 s | 1.31× |
| P3 | Scalar remove (1 move) | 13.7 s | 10.2 s | 1.34× |
| P5 | Leaf-list shrink 64→16 ports | 4 m 08 s (48 moves) | 9.8 s (1 move) | 25.3× |
| P6 | Leaf-list grow 16→64 ports | 4 m 13 s (48 moves) | 10.0 s (1 move) | 25.3× |
| P7 | Dry-run shrink 64→16 | 3 m 15 s | 7.4 s | 26.4× |
| P8 | Rollback (48-move reverse-apply) | 4 m 04 s | 7.2 s | 33.9× |
| P11 | Mixed multi-op | 2 m 59 s (34 changes) | 20.1 s (3 changes) | 8.9× |
| P12 | Whole-config replace (small delta) | 11.4 s | 8.2 s | 1.39× |
| | Total wall time | ~20 min | ~93 sec | ~13× overall |

P5/P6/P8 are the headline. Stock decomposes a leaf-list replace into N×REMOVE + M×ADD individual moves (each round-tripping ConfigDB validation). The BulkLeafListMoveGenerator from #4478 emits a single batched REPLACE move, collapsing 48 sequential round-trips into 1. This is the dominant MRC isolation customer pattern.
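
The speedup column and the totals follow directly from the two timing columns. The short script below recomputes them; the timings are copied from the matrix above and converted to seconds, and the variable and function names are illustrative.

```python
# (stock_seconds, post_fix_seconds) per scenario, taken from the matrix above.
TIMINGS = {
    "P1": (11.7, 10.3),
    "P2": (13.5, 10.3),
    "P3": (13.7, 10.2),
    "P5": (4 * 60 + 8, 9.8),
    "P6": (4 * 60 + 13, 10.0),
    "P7": (3 * 60 + 15, 7.4),
    "P8": (4 * 60 + 4, 7.2),
    "P11": (2 * 60 + 59, 20.1),
    "P12": (11.4, 8.2),
}


def speedup(scenario):
    """Stock time divided by post-fix time for one scenario."""
    stock, post = TIMINGS[scenario]
    return stock / post


total_stock = sum(stock for stock, _ in TIMINGS.values())
total_post = sum(post for _, post in TIMINGS.values())

for name in TIMINGS:
    print(f"{name}: {speedup(name):.2f}x")
print(f"total: {total_stock:.1f} s -> {total_post:.1f} s ({total_stock / total_post:.1f}x)")
```

The per-row sums come to roughly 1169 s stock and 93.5 s post-fix, a ratio of about 12.5, which is consistent with the rounded "~20 min", "~93 sec" and "~13× overall" totals.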

Functional + entrypoint + flag-removal coverage

| Test | Result |
|---|---|
| T1: gcu-standalone apply-patch end-to-end | ✅ 8.7 s |
| T2a: gcu-standalone create-checkpoint (positional name) | ✅ 0.57 s |
| T2b: gcu-standalone list-checkpoints | |
| T2c: gcu-standalone rollback after mutation | ✅ 7.27 s, ports restored |
| T2d: gcu-standalone replace whole-config | ✅ 8.0 s |
| T2e: gcu-standalone delete-checkpoint | |
| T3: idempotency (same patch twice) | ✅ run 2 = 0 changes / 6.9 s |
| T4: invalid port Ethernet9999 | ✅ clean libyang error, no crash |
| T5: replace on non-existent table | ✅ clean error path, no crash |
| T6: verbose -v shows full Patch Sorter logs | |
| T7: --ignore-path /PFC_WD accepted on gcu-standalone apply-patch | |
| T8a: config list-checkpoints --time rejected (320afd9) | ✅ "no such option" |
| T8b: config apply-patch --path-trace rejected (21ddca6/47e51da) | ✅ "no such option" |

Why this matters for the backport

  • All 8 commits in this PR (4 upstream + 3 incomplete-cherry-pick fixes + the JsonMoveGroup signature adapter for the in-tree #254 lineage) are exercised on real hardware.
  • No regression vs. stock for any scenario: scalar ops are slightly faster (#4476 YANG cache amortization), bulk leaf-list ops are 25–34× faster (#4478), removed flags stay removed (320afd9 / 21ddca6 / 47e51da), negative paths still error cleanly through libyang.
  • Both entrypoints are wired: config apply-patch (existing) and gcu-standalone <subcommand> (new from #4310 + #4554) reach the same patch_sorter / change_applier plumbing.

Device restored to stock state (PFC_WD timers re-set to 3400, no leftover checkpoints, no TEST_ACL).

@rimunagala rimunagala changed the title Backport GCU enhancements to 202412 for per-branch GCU sidecar container [202412] Backport GCU container framework (#4310) + perf enhancements (#4476, #4478, #4554) Jun 2, 2026