Development and Testing

Julius Bairaktaris edited this page Jun 21, 2026 · 4 revisions

Build specifics, the no-serial-console test methodology this project was developed with, and the landmines that cost real debugging time. Future developers: read the landmines before touching anything.

Building

The easiest path is the Qualcommax_NSS_Builder: its edma-nss variant already wires the nss-packages feed in by default, so there is nothing to add by hand — just run the build.

To build by hand instead, add the feed to feeds.conf:

# feeds.conf
src-git nss https://github.com/JuliusBairaktaris/nss-packages.git;edma-nss
# local checkout instead of a remote clone:
#   src-link nss /path/to/nss-packages      # branch edma-nss

Minimum NSS config on top of a normal ipq807x build:

CONFIG_PACKAGE_kmod-qca-nss-drv=y
CONFIG_PACKAGE_kmod-qca-nss-ecm=y
CONFIG_PACKAGE_kmod-qca-nss-drv-pppoe=y     # PPPoE offload
CONFIG_PACKAGE_kmod-qca-nss-drv-qdisc=y     # NSS qdiscs for SQM
CONFIG_PACKAGE_kmod-qca-nss-drv-igs=y
CONFIG_PACKAGE_sqm-scripts-nss=y
CONFIG_NSS_FIRMWARE_VERSION_12_5=y
CONFIG_NSS_MEM_PROFILE_MEDIUM=y             # 512 MB boards!

nss-firmware follows kmod-qca-nss-drv automatically. The vendor code is not warning-clean: OpenWrt main sets CONFIG_KERNEL_WERROR, which leaks -Werror into external modules, so the NSS packages build with -Wno-error (documented in each Makefile).
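A missing symbol is cheaper to catch before a long build than after it. The following is a minimal sketch that checks a config file for the minimum NSS symbols listed above. The function name check_nss_config is hypothetical, and the sketch assumes the file uses the usual SYMBOL=y lines.

```shell
#!/bin/sh
# check_nss_config FILE
# Prints every minimum NSS symbol that is not set to =y in FILE and
# returns the number of missing symbols (0 means the config is complete).
# Illustrative helper only; it is not part of OpenWrt or nss-packages.
check_nss_config() {
    cfg=$1
    missing=0
    for sym in \
        CONFIG_PACKAGE_kmod-qca-nss-drv \
        CONFIG_PACKAGE_kmod-qca-nss-ecm \
        CONFIG_PACKAGE_kmod-qca-nss-drv-pppoe \
        CONFIG_PACKAGE_kmod-qca-nss-drv-qdisc \
        CONFIG_PACKAGE_kmod-qca-nss-drv-igs \
        CONFIG_PACKAGE_sqm-scripts-nss \
        CONFIG_NSS_FIRMWARE_VERSION_12_5 \
        CONFIG_NSS_MEM_PROFILE_MEDIUM
    do
        # Anchoring at line start and requiring "=y" right after the name
        # keeps kmod-qca-nss-drv from matching kmod-qca-nss-drv-pppoe.
        if ! grep -q "^${sym}=y" "$cfg"; then
            echo "missing: ${sym}=y"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example (run from the buildroot):
#   check_nss_config .config || echo "NSS config incomplete"
```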

Build landmines

  • sk_buff layout changes force a full kmod rebuild. Kernel patch 0969 adds a bitfield to struct sk_buff. If that patch (or anything else touching skbuff layout) changes, wipe the target build_dir and staging_dir — incremental builds will happily link kmods against stale layouts and the result misbehaves at runtime.
  • Hand-written patches are forbidden. Generate patches with git/quilt and verify them by application (make package/X/prepare / quilt push). A hand-typed hunk with a malformed offset cost a build day.
  • WSL2 only: building with the Windows PATH leaked into the environment breaks find -execdir inside the kernel build. Sanitize PATH for every build (export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin). Build on a real ext4 filesystem, never on /mnt/c.
  • DISTRIB_REVISION is the git HEAD at build time. Building with uncommitted changes produces an image that reports the old revision. When in doubt, identify an image by symbols (/proc/kallsyms markers), not by the revision string.
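The WSL2 landmine can be guarded with a small pre-build check. The sketch below assumes WSL2's default layout, where Windows drives are mounted under /mnt/<letter>; the function names are hypothetical, and the clean PATH is the one quoted above.

```shell
#!/bin/sh
# Known-clean PATH from the WSL2 landmine above.
CLEAN_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# path_is_clean PATH_STRING
# Fails if any PATH entry lives under /mnt/ (assumed Windows mount root).
path_is_clean() {
    case ":$1:" in
        *:/mnt/*) return 1 ;;
        *)        return 0 ;;
    esac
}

# dir_is_safe DIR
# Fails if DIR is a Windows drive mount such as /mnt/c or anything below it.
dir_is_safe() {
    case "$1" in
        /mnt/[a-z]|/mnt/[a-z]/*) return 1 ;;
        *)                       return 0 ;;
    esac
}

# Example, at the top of a build wrapper:
#   path_is_clean "$PATH" || export PATH=$CLEAN_PATH
#   dir_is_safe "$PWD" || { echo "build on ext4, not $PWD" >&2; exit 1; }
```

The check only looks at path strings. It does not verify that the filesystem is really ext4, so a custom automount root would slip past it.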

Test methodology (no serial console)

The entire bring-up was done sysupgrade-only: every image must boot with working EDMA networking by itself, and NSS is strictly runtime opt-in. The techniques that made that survivable:

  • pstore/ramoops is the crash console. The DT reserves a ramoops region (with console-size, so the last kernel console survives a hard crash); the pstore-archive init script copies /sys/fs/pstore to flash on every boot. After any suspicious reboot, read the archive first.
  • Dead-man's switch for risky experiments. Run experiments detached, with logs synced to flash continuously, ending in reboot -f unless a barrier file is present:
    (experiment; sleep 1500; [ -f /tmp/keep ] || reboot -f) &
    reboot -f works even with networking gone, which turns "drive out and pull the plug" into "wait 25 minutes".
  • ssh liveness is not a hang detector. A firmware boot with unarmed ports kills all wired RX while the SoC is healthy. Only synced logs + pstore distinguish "network dead" from "SoC dead".
  • Reachability without ICMP. Client firewalls often drop echo: test reachability via ARP (ip neigh flush dev X; ping -c1 <ip>; ip neigh show <ip> → REACHABLE/STALE state), not ping success.
  • tcpdump on pppoe-wan silently matches nothing (unsupported linktype for BPF filters). Capture on the physical port or the ifb instead.
  • Flash quirk: sysupgrade's final reboot can hang after remoteproc: stopped q6v5_wcss. The flash itself succeeded — a power cycle boots the new image. Do not misread this as a failed upgrade.
  • /root is not preserved by sysupgrade unless listed in /etc/sysupgrade.conf. Either list your bring-up scripts there or re-push them after every flash.
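The dead-man's switch from the list above can be wrapped in a function so the logic can be dry-run before it is trusted with a real reboot. This is a sketch, not the script used during bring-up: the name deadman and its argument order are assumptions, and the fallback command is a parameter so that something harmless can stand in for reboot -f.

```shell
#!/bin/sh
# deadman TIMEOUT BARRIER FALLBACK CMD...
# Runs CMD, sleeps TIMEOUT seconds, then runs FALLBACK unless the file
# BARRIER exists. CMD is expected to sync its own logs to flash.
# Hypothetical helper; on the router FALLBACK would be "reboot -f".
deadman() {
    timeout=$1
    barrier=$2
    fallback=$3
    shift 3
    "$@" || true          # the experiment; a failure must not skip the fallback
    sleep "$timeout"
    # FALLBACK is deliberately unquoted so "reboot -f" splits into words.
    [ -f "$barrier" ] || $fallback
}

# Example on the device (/root/experiment.sh is a placeholder name):
#   ( deadman 1500 /tmp/keep "reboot -f" sh /root/experiment.sh ) &
# Dry run on a workstation:
#   deadman 2 /tmp/keep "echo would reboot" true
```

To cancel the reboot after a successful experiment, create the barrier file (touch /tmp/keep) before the timeout expires.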

Useful debug surfaces

Surface                                 | What it tells you
----------------------------------------+------------------------------------------------------------
/sys/kernel/debug/qca-ppe-nss/status    | per-port attach state, tx_redirect_pkts, rx_fw_pkts, tx_busy, rx_unexpected
/sys/kernel/debug/qca-nss-drv/stats/n2h | firmware↔host queue counters (n2h_rx_pkts climbing = fw path alive)
/sys/kernel/debug/ecm/                  | per-flow acceleration state, defunct_all, front-end stats
tc -s qdisc show dev <wan> / ifb        | nsstbl overlimits prove the fw shaper has authority
/sys/fs/pstore, archived by init script | crash console of the previous boot
dmesg at drv load                       | firmware version banner (e.g. NSS.FW.12.5-210-HK.R)

Verifying offload is actually working

To confirm that forwarded traffic is genuinely riding the firmware (not just that counters exist), use a delta under a confirmed-traversing load, not a static snapshot:

  1. Pin a flow through the router. On a split-routing test host only some paths traverse the device — drive a sustained download over a path you know hits the WAN (IPv6, or a DNAT/explicit route) and confirm the WAN counters move with it.
  2. Sample a window straddling the load. Read counters, hold ~25 s under load, read again. The offload signature:
    • pppoe_rx_bytes (or the WAN port rx_fw_pkts) climbs by the transferred volume — the bytes entered via the firmware;
    • n2h_n2h_data_byts (firmware→host delivery) climbs by a tiny fraction of that (host sees <0.1 % of the bytes) — the rest was forwarded inside the firmware;
    • per-port glue tx_redirect_pkts/rx_fw_pkts stay nearly flat for the accelerated flow (it bypasses the host redirect path entirely);
    • host CPU stays ~idle: /proc/stat idle delta ≈ 100 % of the window × nproc, softirq delta near zero, ksoftirqd at 0 %. ecm_nss_ipv{4,6}/accelerated_count should be non-zero with pending_accel/pending_decel at 0.
  3. Read the exception stats as by-design, not as leaks. The big numbers in ecm/stats/ecm_v{4,6}_exception_stats (local_packets_ignored, bcast/mcast_feature_disabled, not_ip_pppoe_packet, *_tcp_not_estab/not_confirm, fragments) are traffic a flow engine cannot accelerate (router-local, broadcast/multicast, ARP/ND/L2, flow-setup first packets). They are the expected residual, not a fault.
  4. Rule out a competing native engine. Low host CPU only proves NSS offload if nothing else is doing the forwarding: confirm there is no nft flowtable (nft list ruleset | grep -i flowtable), firewall.@defaults[0].flow_offloading{,_hw} is unset, and ethtool -k <conduit> shows hw-tc-offload: off. Otherwise a host fastpath, not NSS, may be moving the bytes.
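The arithmetic behind step 2 is easy to get wrong by eye, so a sketch of it follows. It works on raw counter values passed as arguments; extracting them from the debugfs files is left out because the exact file format is not given here. The function names are hypothetical, and the sketch assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# delta BEFORE AFTER
# Counter growth over the sampling window.
delta() {
    echo $(( $2 - $1 ))
}

# host_share_ppm FW_BYTES HOST_BYTES
# Bytes delivered to the host, in parts per million of the bytes that
# entered via the firmware during the same window.
host_share_ppm() {
    [ "$1" -gt 0 ] || { echo "no firmware traffic in window" >&2; return 1; }
    echo $(( $2 * 1000000 / $1 ))
}

# looks_offloaded FW_BYTES HOST_BYTES
# Succeeds when the host saw less than 0.1 % (1000 ppm) of the firmware
# bytes, the threshold quoted in step 2.
looks_offloaded() {
    ppm=$(host_share_ppm "$1" "$2") || return 1
    [ "$ppm" -lt 1000 ]
}

# Example with made-up numbers: firmware-side bytes grew by 2.5 GB,
# host-side delivery grew by 1.2 MB during the window:
#   fw=$(delta 1000 2500001000)       # 2500000000
#   host=$(delta 500 1200500)         # 1200000
#   looks_offloaded "$fw" "$host" && echo "offload signature present"
```

A passing check covers only the byte-ratio part of the signature. The CPU idle delta and the competing-engine check in step 4 still have to be confirmed separately.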

Gate suite (acceptance tests)

The integration was accepted through staged hardware gates; re-run the relevant ones after any significant change:

  1. Firmware boot gate — fw boots with armed ports, n2h_rx_pkts climbing, two consecutive clean boots, pstore clean.
  2. Data plane gate — 100 attach/detach cycles per port under traffic with zero stuck cycles; rmmod with attached ports; no duplicate-delivery (host rings must stay silent: zero duplicate ICMP sequence numbers).
  3. All-ports gate — every physical port through the fw path simultaneously, including a PPPoE(+VLAN) uplink, with per-port RX verified by cable hops; 15-minute soak; memory flat.
  4. ECM gate — accelerated bulk flow at line rate with CPU ~idle; ECM stop returns flows to software path; Wi-Fi clients unaffected.
  5. SQM gate — shaper authority at a tight rate; RTT-under-load at the production rate; sqm stop clean; reboot-with-sqm-enabled boots safely (guard refuses, wired RX alive).
  6. Soak — multi-hour production traffic, then: counters sane (tx_busy=0, rx_unexpected=0), memory flat, pstore empty.
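The duplicate-delivery check in the data plane gate can be automated from a saved ping log. The sketch below assumes the log contains reply lines with a seq=N field, as printed by both iputils ping (icmp_seq=N) and BusyBox ping (seq=N); the function names are hypothetical.

```shell
#!/bin/sh
# dup_seqs FILE
# Prints every ICMP sequence number that appears more than once in FILE.
# Caveat: sequence numbers are 16-bit and wrap, so a log covering more
# than 65536 probes will show false duplicates.
dup_seqs() {
    sed -n 's/.*seq=\([0-9][0-9]*\).*/\1/p' "$1" | sort -n | uniq -d
}

# gate_no_dups FILE
# Succeeds only if no sequence number was delivered twice.
gate_no_dups() {
    [ -z "$(dup_seqs "$1")" ]
}

# Example (the target address and log path are placeholders):
#   ping -c 1000 192.0.2.1 > /tmp/ping.log
#   gate_no_dups /tmp/ping.log || echo "duplicate delivery: $(dup_seqs /tmp/ping.log)"
```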
