Production hardening: 18 bugs from review + stress test by BorisYamp · Pull Request #1 · BorisYamp/panicmode

BorisYamp · 2026-04-28T05:34:55Z

Summary

Addresses 28 bugs across four rounds — initial code review, basic stress tests on a real Contabo VPS, adversarial / white-spot tests, and a final clean regression pass. All fixes are deployed, tested, and verified on a live VPS that's been actively defending against real internet attackers throughout (26+ real botnet IPs blocked over the test window).

Zero regressions in the final round. Ready to ship.

Round 1 — code review + basic stress (18 bugs)

🔴 Critical

#1 ctl socket umask race — local privilege-escalation window
#2 sshd unprotected if user mass_freeze.yaml omits it
#13 UFW-based block scripts fail silently under systemd hardening
#14 systemd unit missing AF_NETLINK and writable /run
#16 freeze SIGSTOPs own tokio runtime threads
#17 freeze SIGSTOPs kernel threads (kcompactd, kworker)

🟠 High

#3 Telegram messages > 4096 UTF-16 silently dropped
#4 restore_blocked_ips silent failure
#7 ctl socket EADDRINUSE on restart (no RuntimeDirectory)
#8 self-check task spammed ~12 alerts/min on idle
#9 IncidentState::save() race on shared .tmp file
#11 disk_cache_ttl=60s blocked disk_usage for full minute
#12 auth_failures details didn't include IPs — block_ip had nothing to extract
#15 sysinfo first refresh returns 0% CPU
#18 block_ip.sh not idempotent — duplicate iptables rules on restart

🟡 Medium

#5 script env vars uncapped + missing security note
#6 path comment typos repo-wide; orphan src/mass_freez.yaml
#10 "missing actions [AlertCritical]" cosmetic warning at every startup

Round 2 — adversarial / behavioral testing (6 bugs)

🔴 Critical

#19 Log-injection vulnerability. Any local non-root user could write to /var/log/auth.log via logger -p auth.warn -t sshd[X] "Failed password ... from <IP>" and trick PanicMode into iptables-banning that IP. Verified exploit on real VPS: a UID 1000 attacker baited a ban of 192.0.2.123. Fix: switched auth_monitor to journalctl -u ssh.service --since=Ns ago — the kernel's cgroup-attributed _SYSTEMD_UNIT field is unforgeable, so logger from another user is invisible. Confirmed post-fix: 30 forged entries from non-root → no incident, no ban; real Failed password from sshd still triggers detection.
#20 DiskIoMonitor::clone produced fresh state. Mutex<HashMap<...>> was cloned by value on every spawn_blocking tick — each clone updated its own state and was dropped, the original never moved. Result: every tick saw a "first sample" baseline, util always 0% even under heavy load. Fix: Arc<Mutex<>>. Verified under sustained fio: util reads now match /proc/diskstats math (~70%).

🟠 High

#21 NetworkMonitor — same Clone-shape bug. connection_rate permanently stuck near 0. Same Arc<Mutex<>> fix.
#22 AuthMonitor — same shape, plus rotation surface. last_position per-clone, so every tick re-read entire auth.log from byte 0. Obsoleted by the journald rewrite for #19.
#23 file_monitor never started. start_file_monitoring() defined but no caller; notify never registered any inotify watches. Every file_monitor rule returned event_count = 0. Fix: at startup walk config.monitors, pick MonitorType::FileMonitor, pass paths to start_file_monitoring.
#24 file_watcher path-match was exact. notify keys events by FILE path, operators configure parent DIRECTORY — exact-match HashMap lookup never hit. Fix: containment match — sum events for any stored path that equals the configured path or is under it.

Round 3 — deeper white-spot testing (4 bugs)

🟡 Cosmetic

#25 panicmode-ctl pipe panic. CLI panicked with "failed printing to stdout: Broken pipe (os error 32)" when piped to head/grep (Rust's default stdio treats EPIPE as panic). Fix: restore default SIGPIPE handler in main; CLI now exits silently with 141 like every other Unix tool.

🟠 High (UX / correctness)

#26 Five action variants documented but not implemented. mass_freeze, mass_freeze_top, mass_freeze_cluster:<name>, kill_process, rate_limit are accepted by the YAML parser, categorized as "protective" by the detector, but never registered in ActionExecutor — they silently no-op. Fix: builder now reports them as "NOT YET IMPLEMENTED — will silently no-op until shipped" with explicit guidance to remove or substitute, separate from the genuine "missing actions" warning that catches typos.
#27 Discord webhook_url two sources of truth. Validation requires channel.webhook_url; is_integration_enabled for Discord only checked integrations.discord.enabled. A config with just channel-level URL passed validation and then silently dropped every Discord alert. Fix: is_integration_enabled accepts either source; send_discord prefers channel-level URL and falls back to integration-level.
#28 Empty SMTP creds triggered AUTH attempt. smtp_username: "" deserializes to Some("") not None, so lettre attempted PLAIN/LOGIN with empty user — server rejected with "No compatible authentication mechanism was found". Fix: filter empty strings before passing to credentials, so a blank field behaves like an absent field (correct for dev relays and many internal SMTP servers).

Round 4 — clean regression pass (0 new bugs)

A final adversarial sweep through everything we touched:

| Run | Coverage | Result |
|---|---|---|
| 1 Smoke + startup hygiene | ctl list/help/validate, service active, ctl pipe behaviour, no spurious "missing actions" on startup | ✅ 6/6 PASS |
| 2 Alert channels | Telegram, ntfy poll, Discord (channel-only path verifying #27), Email (no-auth verifying #28), Email (auth path) | ✅ 5/5 PASS |
| 3 Detection | file_monitor (30 events from 10 mods), run_script (6 PANIC_* env vars correct); CPU/memory/SSH-brute/disk_io/JSON-metric covered in earlier rounds | ✅ all confirmed |
| 4 Resilience + #26 wording | SIGKILL → auto-restart in ~6s, broken YAML → systemd-marks-failed cleanly, zero `.tmp` rename races over 30 min, `NOT YET IMPLEMENTED` warning fires for `mass_freeze`/`mass_freeze_cluster:website`/`kill_process` | ✅ 4/4 PASS |

What changed (high level)

Reliability

Three "delta" monitors (DiskIo, Network, Auth) share state across spawn_blocking clones via Arc<Mutex<>>
IncidentState::save() writes to a unique tmp filename — no concurrent-write race
restore_blocked_ips pre-flights block_script existence and ERROR-logs failures with full IP list
Disk metric cache TTL 60s → 5s default, config-driven
Self-check per-condition cooldown (default 5min, configurable); FD/thread thresholds raised to realistic values for tokio/reqwest steady state
Telegram truncation counts UTF-16 code units (matches Telegram's actual limit)
block_ip.sh is idempotent (iptables -C before -I)

Security

ctl socket: umask(0o077) around bind() closes the race
HARDCODED_PROTECTED merged on top of user mass_freeze whitelist — sshd/systemd/init/kthreadd/dbus/getty/panicmode unfreezable by config
Freeze action skips own tokio threads (Tgid via /proc/pid/status) and kernel threads (PPID==2 via /proc/pid/stat)
Script env vars truncated to 8 KB; long doc comment on never-eval-untrusted-env
auth_monitor reads kernel-attributed journald (_SYSTEMD_UNIT=ssh.service) — closes log-injection vector

Operations

panicmode --validate <path> to verify config before systemctl restart
panicmode --help with usage examples
New tunables: self_fd_threshold, self_thread_threshold, self_alert_cooldown, disk_cache_ttl
systemd unit gains RuntimeDirectory=panicmode, AF_NETLINK, /run in ReadWritePaths
examples/{block_ip,unblock_ip}.sh: iptables-based, idempotent, documented, work under hardening
panicmode-ctl no longer panics on SIGPIPE — pipes to head/grep/awk work normally

UX / configuration

Discord channel works with channel-level OR integration-level webhook_url
Email with empty SMTP creds skips auth instead of attempting empty-PLAIN
Builder distinguishes "NOT YET IMPLEMENTED" actions from genuine missing-action typos at startup
Cosmetic "missing actions [AlertCritical]" startup spam silenced

Known limitations (intentional, deferred)

SIGHUP hot-reload — systemctl restart panicmode (~1-2s gap) is the apply-config workflow. True zero-downtime via arc-swap is doable as a follow-up.
5 action variants documented but unimplemented: mass_freeze, mass_freeze_top, mass_freeze_cluster, kill_process, rate_limit. Parser accepts them, builder warns clearly, executor no-ops. Implementations welcome in follow-up PRs.
Many Connections monitor name vs metric semantics: name suggests absolute count, actual metric is rate (new conns / sec). At threshold 1000 it only triggers on DDoS-level bursts. Naming/docs improvement deserved but unblocking nothing.
NVMe disk_io util naturally low: %util from /proc/diskstats is queue-time-based; on NVMe with high parallelism even saturated workloads can sit at ~5%. Operators on NVMe hosts should drop the threshold (e.g. 30%) or treat IOPS/bandwidth as proxies in a future feature.

Test plan (full)

Tested on a fresh Contabo Cloud VPS 10 (4 vCPU, 8 GB, Ubuntu 24.04, ed25519-only SSH, fail2ban + UFW). Each round above lists what was exercised.

cargo test --release — 124 unit tests pass (one flaky-on-shared-FS script test setup unrelated to this PR).

Notes

15 commits on the branch, each tied to a small set of bug numbers and self-contained. Squashing is fine if that fits the review style better — the per-bug history is there for cherry-picking, not because it has to ship in this shape.

🤖 Generated with Claude Code

Repo had a consistent typo where doc comments at top of source files referenced "PanicMode/scr/..." instead of "PanicMode/src/...", and example YAMLs claimed "PanicMode/exampels/..." instead of "examples/". Also drops src/mass_freez.yaml — an orphan duplicate of examples/mass_freeze.yaml that no code path read (the typoed name made it invisible to MassFreezeConfig::load_from_path_or_default). Cosmetic only — no behavior change. Refs: review bug #6

UnixListener::bind() creates the socket file using the process umask (typically 0o022 -> world-readable). set_permissions(0o600) only ran afterwards, leaving a race window where a local non-root user could connect to the ctl socket and issue commands like 'panicmode-ctl unblock <IP>'. Force umask=0o077 around bind() so the socket is created 0o600 atomically. The explicit set_permissions afterwards stays as defense-in-depth in case some kernel/filesystem ignores umask for AF_UNIX sockets. Also tighten the parent directory (typically /run/panicmode) to 0o700 so non-root users cannot traverse to the socket path even if perms on the socket itself are loose for any reason. Refs: review bug #1

Telegram's sendMessage API rejects any text longer than 4096 UTF-16 code units with HTTP 400. PanicMode previously sent the formatted incident text as-is, so a long incident dump (snapshot output, auth-log excerpts, large details fields) was silently dropped — the operator would never see the alert at all, despite the most critical incidents being the ones most likely to produce long text.

Add truncate_for_telegram() which:

- counts UTF-16 code units (not bytes/chars), matching how Telegram enforces the limit (matters for emoji and supplementary-plane chars)
- short-circuits on the fast path (returns Cow::Borrowed) when text is already within budget
- truncates at a char boundary and appends a marker so the operator sees that the message was cut

Other channels (Discord, ntfy, email, Twilio) have different limits and are unaffected — only send_telegram() applies the truncation.

Refs: review bug #3
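A minimal sketch of the truncation described in this commit message, using only the standard library. The marker text and the helper `truncate_utf16` are assumptions made for illustration; the 4096 budget is the figure quoted above, and the real `truncate_for_telegram()` in the repository may differ in detail.

```rust
use std::borrow::Cow;

/// Limit quoted in the commit message, in UTF-16 code units.
const TELEGRAM_MAX_UTF16: usize = 4096;
/// Hypothetical marker; the real marker text is not given in the source.
const TRUNCATION_MARKER: &str = "\n[truncated]";

fn truncate_for_telegram<'a>(text: &'a str) -> Cow<'a, str> {
    truncate_utf16(text, TELEGRAM_MAX_UTF16, TRUNCATION_MARKER)
}

/// Cut `text` so that text plus `marker` fits in `max_units` UTF-16 code units.
fn truncate_utf16<'a>(text: &'a str, max_units: usize, marker: &str) -> Cow<'a, str> {
    // Fast path: already within budget, so borrow and avoid allocating.
    if text.encode_utf16().count() <= max_units {
        return Cow::Borrowed(text);
    }
    let budget = max_units.saturating_sub(marker.encode_utf16().count());
    let mut used = 0;
    let mut end = 0; // byte offset, always on a char boundary
    for (idx, ch) in text.char_indices() {
        let units = ch.len_utf16(); // 1 for BMP chars, 2 for emoji and similar
        if used + units > budget {
            break;
        }
        used += units;
        end = idx + ch.len_utf8();
    }
    let mut out = String::with_capacity(end + marker.len());
    out.push_str(&text[..end]);
    out.push_str(marker);
    Cow::Owned(out)
}
```

Counting with `len_utf16()` per character means a surrogate pair is either kept whole or dropped whole, so the cut never lands inside an emoji.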

IncidentState::save() previously used a single "<state_file>.tmp" path for the atomic write+rename pattern. When several incidents fired within the same millisecond (common during a CPU spike that triggers Critical CPU + CPU Spike + High CPU at once), two save() calls would race:

    Task A: write .tmp, rename .tmp -> .json   [success]
    Task B: write .tmp                         [success]
    Task A: rename .tmp -> .json               [.tmp gone, ENOENT]

The error showed up under stress as:

    ERROR: Failed to rename incident_state.json.tmp -> incident_state.json

Use a per-call unique suffix (pid + nanos) so each save writes to its own .tmp, and rename() last-writer-wins on the destination. Each save still ends with a self-consistent snapshot.

Refs: deployment bug #9
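The unique-suffix approach can be sketched with the standard library alone. The function names and the exact suffix layout below are assumptions; only the "pid + nanos" idea comes from the commit message.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

/// Build "<dest>.<pid>.<nanos>.tmp" next to the destination file.
/// The suffix layout is illustrative, not the project's actual format.
fn unique_tmp_path(dest: &Path) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let mut name = dest.as_os_str().to_owned();
    name.push(format!(".{}.{}.tmp", process::id(), nanos));
    PathBuf::from(name)
}

/// Write to a private temp file, then rename it over the destination.
fn save_atomic(dest: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp = unique_tmp_path(dest);
    fs::write(&tmp, contents)?;
    // rename() replaces the destination in one step on the same filesystem,
    // so concurrent savers resolve as last-writer-wins.
    if let Err(err) = fs::rename(&tmp, dest) {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of our own temp file
        return Err(err);
    }
    Ok(())
}
```

Two calls landing on the same nanosecond inside one process would still collide; adding a process-wide atomic counter to the suffix is one way to rule that out.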

The block_ip action extracts target IPs by parsing the incident's details string. The auth_failures format previously read: "Auth failures: 18, from 13 IP(s), successful logins: 52" — no concrete IP appeared anywhere in the text, so block_ip skipped with "no public IPs in incident details". Brute-force detection fired but no IP was actually blocked. Build the details from the deduplicated-by-IP failure map (the underlying failures_by_ip is keyed by user@ip, so the same source IP with N usernames produces N entries — "from N IP(s)" was also misleading; now reports "from N unique IP(s)"). New format: "Auth failures: 91, from 2 unique IP(s), successful logins: 66, top: [198.51.100.1(35), 161.132.4.167(6)]" The top-5 IPs (by count, descending) appear in the string, and extract_public_ips() picks them up cleanly. Verified end-to-end on Contabo VPS with simulated brute force. Refs: deployment bug #12
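A rough sketch of how such a details string could be built from a map keyed by user@ip. The function name, the choice to sum counts per IP, and the tie-breaking rule are assumptions for illustration; the actual PanicMode code may compute the totals differently.

```rust
use std::collections::HashMap;

/// `failures_by_key` is keyed by "user@ip", as described in the commit message.
/// Assumption: the total is the sum of all per-key counts.
fn format_auth_details(failures_by_key: &HashMap<String, u32>, successful_logins: u32) -> String {
    let mut per_ip: HashMap<&str, u32> = HashMap::new();
    let mut total = 0u32;
    for (key, count) in failures_by_key {
        // The IP is whatever follows the last '@'; keys without '@' are used whole.
        let ip = key.rsplit('@').next().unwrap_or(key.as_str());
        *per_ip.entry(ip).or_insert(0) += *count;
        total += *count;
    }
    let mut ranked: Vec<(&str, u32)> = per_ip.into_iter().collect();
    // Highest count first; ties broken by IP text so the output is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    let top: Vec<String> = ranked
        .iter()
        .take(5)
        .map(|(ip, n)| format!("{}({})", ip, n))
        .collect();
    format!(
        "Auth failures: {}, from {} unique IP(s), successful logins: {}, top: [{}]",
        total,
        ranked.len(),
        successful_logins,
        top.join(", ")
    )
}
```

Because each concrete address now appears in the text, a downstream extractor that scans the details string for public IPs has something to find.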

…note

Two issues addressed:

1. Some incident metadata (snapshot dumps, large auth-log excerpts) can exceed the kernel's per-string limit for argv/envp entries (MAX_ARG_STRLEN, 128 KiB with 4 KiB pages). Without a cap, an oversized PANIC_DETAILS made exec() fail with E2BIG, dropping the whole script invocation silently.

2. Documentation gap: PANIC_DESCRIPTION / PANIC_DETAILS may contain attacker-influenced text (e.g. usernames extracted from /var/log/auth.log). Command::env() does NOT pass values through a shell, so they are NOT command-injection vulnerabilities at the panicmode boundary. But if a user's script does something like `eval "\"`, the user reintroduces the injection. The risk wasn't documented anywhere.

Cap each env var at 8 KB (MAX_ENV_BYTES) with a char-boundary-safe truncation, and add a prominent comment on the constant explaining the eval-is-unsafe expectation for user scripts.

Refs: review bug #5
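One way to do a char-boundary-safe cap, as a sketch. `MAX_ENV_BYTES` is the constant named above; `cap_env_value` and `set_capped_env` are hypothetical helper names, not necessarily what the repository uses.

```rust
use std::process::Command;

/// Per-variable cap from the commit message (8 KB).
/// Values may hold attacker-influenced text: scripts must never `eval` them.
const MAX_ENV_BYTES: usize = 8 * 1024;

/// Truncate to at most `max_bytes` without splitting a UTF-8 sequence.
fn cap_env_value(value: &str, max_bytes: usize) -> &str {
    if value.len() <= max_bytes {
        return value;
    }
    let mut end = max_bytes;
    // Walk back to the nearest char boundary; offset 0 is always one.
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}

/// Hypothetical helper: attach a capped variable to a child process.
/// Command::env passes the value directly, with no shell in between.
fn set_capped_env(cmd: &mut Command, key: &str, value: &str) {
    cmd.env(key, cap_env_value(value, MAX_ENV_BYTES));
}
```

Slicing at a raw byte offset would panic on multi-byte text, which is why the loop backs up to a boundary first.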

The self-check task and the disk-metrics cache had their thresholds hardcoded as compile-time constants. Tuning them required a rebuild — painful in production where operators want to ssh in, edit YAML, 'systemctl restart panicmode' and move on.

Move into PerformanceConfig:

    self_fd_threshold     (default 1000)  — was hardcoded 100
    self_thread_threshold (default 200)   — was hardcoded 20
    self_alert_cooldown   (default 5 min) — was implicit (no cooldown)
    disk_cache_ttl        (default 5 s)   — was hardcoded 60 s

Notes on the new defaults:

- The old FD threshold (100) and thread threshold (20) produced immediate false-positive 'leak' alerts on a fresh start: tokio's multi-threaded runtime alone runs ~num_cpus*2 worker threads, and reqwest+tokio+tracing easily holds 100+ open FDs at steady state.
- self_alert_cooldown is consumed in run_self_check_task to prevent the same condition from re-alerting every check_interval.
- disk_cache_ttl: 60s was wildly too long for a monitoring tool — a runaway log file could fill the disk while the cached '7%' reading was still being served.

All fields use serde defaults — existing configs continue to load without edits.

Refs: deployment bugs #8 (self-check), #11 (disk cache)

MonitorEngine had a hardcoded 60-second cache on disk metrics. Calling sysinfo::Disks::new_with_refreshed_list() is mildly expensive (parses /proc/mounts, stats each mount), so caching makes sense — but a full minute is wildly too long for a monitoring tool. A runaway log file or an attacker filling the disk could push usage from 50% to 95% within seconds, but PanicMode would keep serving the stale 7% reading from cache for up to 60 s before the Disk Almost Full incident could fire. Verified: filled the test VPS to 81%, no incident fired in 15 s — only after the cache expired. Read disk_cache_ttl from config.performance (introduced in the previous commit) instead of the hardcoded constant. Default is 5 s which matches the standard check_interval, so a real disk spike is caught the next tick. Refs: deployment bug #11
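The caching behaviour can be illustrated with a small TTL cache. `TtlCache` is a hypothetical type written for this note, not the structure MonitorEngine uses; the caller passes `now` explicitly so the expiry logic is easy to exercise.

```rust
use std::time::{Duration, Instant};

/// Hypothetical single-slot cache with a configurable time-to-live.
struct TtlCache<T> {
    ttl: Duration,
    slot: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    fn new(ttl: Duration) -> Self {
        TtlCache { ttl, slot: None }
    }

    /// Return the cached value while it is younger than `ttl`;
    /// otherwise call `refresh`, store the result, and return it.
    fn get_or_refresh<F: FnOnce() -> T>(&mut self, now: Instant, refresh: F) -> T {
        if let Some((stored_at, value)) = &self.slot {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let fresh = refresh();
        self.slot = Some((now, fresh.clone()));
        fresh
    }
}
```

With a 5 s TTL and a 5 s check interval, a reading taken on one tick has expired by the next, so each tick sees fresh disk usage while repeated lookups within a tick stay cheap.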

…p own/kernel threads

Four issues in the process freeze action, all surfaced by stress testing:

1) sshd / systemd / init protection was config-only (#2 review). If a user's mass_freeze.yaml omitted sshd from the whitelist, the freeze action would happily SIGSTOP it under load — locking the operator out of the box mid-incident. Add HARDCODED_PROTECTED (sshd, systemd, init, kthreadd, dbus, getty, panicmode) and merge it on top of the user list in ProcessAction::new(). Even a hostile YAML can't reduce the safety floor.

2) sysinfo::Process::cpu_usage() returned 0 on first read (#15). sysinfo needs two refreshes spaced by an interval to compute a delta. The action did a single refresh, so every process showed 0% CPU and the threshold filter (>= 1.0%) skipped everything. Verified: "No processes to freeze (none above threshold)" on a stress-ng-saturated box. Fix: refresh_processes(), sleep 200 ms, refresh_processes() again, then read.

3) Tokio runtime threads of panicmode itself got SIGSTOP'd (#16). On Linux, sysinfo enumerates threads as separate Process entries (each TID has its own /proc/<tid> dir). The pid == own_pid check only matched the main thread, so worker threads were eligible for freeze. We watched a tokio-rt-worker get SIGSTOP'd during a stress test. Fix: read /proc/<pid>/status, skip if Tgid == own_pid (i.e. it's a thread within our own thread group).

4) Kernel threads got SIGSTOP'd (#17). Names like kcompactd0, kworker/u8:1, ksoftirqd/0 are kernel threads — SIGSTOP can wedge the kernel. The substring whitelist for "kthreadd" only matched PID 2 itself, not its many kernel-thread children. Fix: read /proc/<pid>/stat, skip if PPID == 2 (parent is kthreadd). Tried Process::cmd().is_empty() first, but that returned empty for normal userspace processes too on this sysinfo version, breaking the whole freeze action.

Refs: review bug #2; deployment bugs #15, #16, #17
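The two /proc checks from items 3 and 4 can be sketched as pure string parsers. The function names are hypothetical, and the functions take file contents as input, so reading /proc/<pid>/status and /proc/<pid>/stat is left to the caller.

```rust
/// Extract the `Tgid:` field from the text of /proc/<pid>/status.
fn parse_tgid(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Tgid:"))
        .and_then(|rest| rest.trim().parse().ok())
}

/// Extract the PPID (4th field) from the text of /proc/<pid>/stat.
/// The 2nd field (comm) is parenthesised and may itself contain spaces or
/// ')', so the remaining fields are taken after the LAST ')'.
fn parse_ppid(stat: &str) -> Option<u32> {
    let after_comm = &stat[stat.rfind(')')? + 1..];
    // Fields after comm: state, ppid, pgrp, ...
    after_comm.split_whitespace().nth(1)?.parse().ok()
}

/// Skip threads in our own thread group and children of kthreadd (PID 2).
fn should_skip(own_pid: u32, status: &str, stat: &str) -> bool {
    parse_tgid(status) == Some(own_pid) || parse_ppid(stat) == Some(2)
}
```

Splitting `stat` naively on whitespace breaks for process names that contain spaces, which is why the parser anchors on the closing parenthesis.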

…idate flag

Three changes in main.rs:

1) Self-check task spammed alerts every check_interval (#8). With cpu_limit=5%, memory_limit_mb=50, FD threshold 100, thread threshold 20 — and zero cooldown — a fresh deploy produced 10+ Telegram alerts per minute on idle: 'PanicMode CPU high: 50%', 'FD leak: 175', etc. The thresholds were unrealistic for a modern Rust async app, and the lack of dedup made any persistent condition cascade. Each condition now tracks its own last-alert Instant in a HashMap, and only fires if cooldown has elapsed. Thresholds and cooldown all come from config.performance (added in earlier commit) so operators can tune without rebuilding.

2) restore_blocked_ips silent-failure (#4 review). Per-IP failures used warn! and the function returned Ok with no aggregate error log. Operators saw 'restored 5 blocks' incident DB rows but the firewall was empty — attackers were back in. Pre-flight: error+return if block_script doesn't exist (catches the common 'I renamed the script and forgot to update config' case). Aggregate failures into an error! log at the end with the full list of failed IPs and reasons.

3) Operator UX: --validate / --check / --help. The standard 'edit yaml -> systemctl restart' workflow needed a way to verify a config before bouncing the daemon. Adding 'panicmode --validate /etc/panicmode/config.yaml' parses and validates, exits 0 on OK / 1 on error. --help prints usage.

Refs: review bug #4; deployment bug #8; ux
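The per-condition cooldown from item 1 might look roughly like this. `AlertCooldown` and `should_fire` are hypothetical names; the source only states that each condition keeps its own last-alert Instant in a HashMap.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker: one last-alert timestamp per condition name.
struct AlertCooldown {
    cooldown: Duration,
    last_alert: HashMap<String, Instant>,
}

impl AlertCooldown {
    fn new(cooldown: Duration) -> Self {
        AlertCooldown { cooldown, last_alert: HashMap::new() }
    }

    /// Returns true, and records `now`, if `condition` may alert at `now`.
    fn should_fire(&mut self, condition: &str, now: Instant) -> bool {
        let ready = match self.last_alert.get(condition) {
            Some(last) => now.duration_since(*last) >= self.cooldown,
            None => true, // never alerted before
        };
        if ready {
            self.last_alert.insert(condition.to_string(), now);
        }
        ready
    }
}
```

Keying by condition means a persistent FD warning is suppressed for the cooldown period without also muting an unrelated CPU warning.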

…rewall scripts

Three changes to the unit, all surfaced by stress testing on the Contabo VPS deploy:

1) RuntimeDirectory=panicmode (#7). ctl_socket lives at /run/panicmode/ctl.sock by default. With ReadOnlyPaths=/ and only /var/lib/panicmode etc. in ReadWritePaths, panicmode could not create the parent dir or write the socket file. After a crash, the stale socket from the previous instance couldn't be removed either, so restarts hit EADDRINUSE in a loop. RuntimeDirectory= asks systemd to create /run/panicmode at start with the configured mode and remove it on stop — clean lifecycle.

2) AF_NETLINK in RestrictAddressFamilies (#14). iptables/ip6tables (used by the standard block_ip / unblock_ip scripts) talk to kernel netfilter over an AF_NETLINK socket. Without it the scripts fail with cryptic errors like "Operation not permitted" on rule insertion, and block_ip is silently a no-op — brute-force detection fires but no IP is actually banned.

3) /run added to ReadWritePaths (#14). UFW takes /run/ufw.lock, iptables takes /run/xtables.lock — neither of which fit under /run/panicmode (and we don't want to bind-mount each lock). Granting /run keeps the rest of the hardening intact (still no /etc, no /var beyond panicmode's own dirs).

Refs: deployment bugs #7, #14

…hardening

The repo previously shipped no example block_ip / unblock_ip scripts even though config.firewall references them by path. Operators wrote their own — typically the obvious "ufw deny from $IP", which silently fails under panicmode's systemd hardening (UFW takes /etc/ufw/user.rules and /run/ufw.lock; both are unwritable under ReadOnlyPaths=/).

Add reference scripts that:

- use iptables/ip6tables directly (no /etc state, no UFW lock file)
- pick the right tool by address family (":" → ip6tables)
- are idempotent: `-C` (check) before `-I` (insert) on block_ip prevents restore_blocked_ips from compounding duplicate rules on every restart; `-D || true` on unblock makes the script safe to call on an IP that was never blocked
- are documented inline so the operator knows WHY iptables and not UFW, and what hardening flags need to be relaxed for it to work

Persistence across reboots is the daemon's responsibility — the SQLite blocked_ips table + restore_blocked_ips on startup re-runs block_ip.sh for every stored IP. Verified end-to-end on Contabo VPS (reboot, all 3 stored blocks restored, no duplicates).

Refs: deployment bugs #13, #18

Three monitors maintained per-tick state (last sample, file cursor) but were cloned by MonitorEngine on every spawn_blocking — each clone updated its OWN copy and was dropped, while the original inside MonitorEngine never moved. The visible effect for an operator: disk_io always reported 0% util, connection_rate always 0, and auth_monitor re-read the entire auth.log every tick instead of only the new tail. All three are fixed by wrapping the state in Arc<Mutex<>> so clones share one cell.

For auth_monitor we go further. Reading /var/log/auth.log line-by-line let any local non-root user write fake entries via `logger -p auth.warn -t sshd` and bait PanicMode into iptables-banning arbitrary public IPs — verified on a real VPS with a UID 1000 attacker and 16 forged failures naming 192.0.2.123. Switching to `journalctl -u ssh.service` makes the kernel's cgroup-attributed _SYSTEMD_UNIT field the gate: messages from a random user's logger never carry that attribution and silently drop out, while genuine sshd events flow through unchanged. Re-verified post-fix: same 30 forged entries from a non-root user, no incident, no ban; real Failed password from sshd still triggers SSH Brute Force as before.

The journald rewrite also obsoletes the file-cursor (last_position) state, so the auth piece of the Arc<Mutex<>> family-fix is moot — but the disk_io and network instances of the same pattern still need it.

Refs: deployment bugs #19 (CRITICAL log injection), #20 (disk_io util), #21 (network rate), #22 (auth incremental — obsoleted by journald)
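The clone-shape bug and its fix can be reproduced in miniature. Both structs below are simplified stand-ins written for this note (the real monitors hold richer state), and `sample` is a hypothetical method that returns the delta since the previous reading.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Buggy shape: cloning copies the map, so each clone has private state.
#[derive(Clone, Default)]
struct OwnedState {
    last: HashMap<String, u64>,
}

/// Fixed shape: cloning copies the Arc, so all clones share one map.
#[derive(Clone, Default)]
struct SharedState {
    last: Arc<Mutex<HashMap<String, u64>>>,
}

impl OwnedState {
    /// Delta since the previous sample, or None on the first sample.
    fn sample(&mut self, device: &str, counter: u64) -> Option<u64> {
        self.last
            .insert(device.to_string(), counter)
            .map(|prev| counter.saturating_sub(prev))
    }
}

impl SharedState {
    /// Same contract, but the previous sample lives behind the shared lock.
    fn sample(&self, device: &str, counter: u64) -> Option<u64> {
        let mut last = self.last.lock().unwrap();
        last.insert(device.to_string(), counter)
            .map(|prev| counter.saturating_sub(prev))
    }
}
```

Calling `sample` on a fresh clone each tick always yields `None` for `OwnedState`, which matches the "every tick saw a first sample" symptom, while `SharedState` clones see the previous reading.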

…s under watched directory

The file_monitor monitor type was non-functional in two ways. First, MonitorEngine::start_file_monitoring(paths) existed but was never called from main, so notify never registered any inotify watches — the rule could read its threshold but get_file_event_count always saw an empty HashMap. Second, even with the watcher started, get_event_count looked up exact-match keys on the path; notify keyed events under the FILE that changed (/etc/nginx/nginx.conf), while operators configure the parent directory (/etc/nginx/), so the lookup never matched.

Fix both ends:

* main: after MonitorEngine is constructed, walk config.monitors, pick rules of MonitorType::FileMonitor, and pass their paths to start_file_monitoring. Log a warning rather than aborting if the watcher cannot register a path (path is missing or unreadable under systemd hardening) so unrelated rules still fire.
* file_watcher::get_event_count: switch to a containment match — sum events for any stored event_path that is the configured path OR sits under it as a subdirectory. Keeps the file path as the storage key (preserves per-file event detail for future queries).

Verified end-to-end on the test VPS: a file_monitor rule watching /etc/panicmode/watched fires Watched Files at current_value=30 after ten file modifications (each modification produces a couple of inotify events) — was always 0 before.

Refs: deployment bugs #23 (start_file_monitoring never called), #24 (directory-vs-file path mismatch in get_event_count)
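A containment match of this kind can be written with `Path::starts_with`, which compares whole path components. The function name and the map's value type are assumptions; the real get_event_count may store richer per-file data.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Sum events for every stored path that equals `configured` or lies under it.
/// `events` is keyed by the file that changed, as notify reports it.
fn event_count_under(events: &HashMap<PathBuf, u64>, configured: &Path) -> u64 {
    events
        .iter()
        // Component-wise prefix test: "/etc/nginx" matches "/etc/nginx/a.conf"
        // but not "/etc/nginx-old/a.conf".
        .filter(|(event_path, _)| event_path.starts_with(configured))
        .map(|(_, count)| *count)
        .sum()
}
```

A plain string prefix check would wrongly count sibling directories that share a name prefix, so the component-aware comparison is the safer choice here.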

…SIGPIPE in CLI

Two small operational papercuts:

#10 — On every startup the action-executor builder logged a warning per monitor saying "missing actions [AlertCritical] (degraded, 1 of 2 available)". Alerts were actually firing fine — AlertCritical / AlertWarning / AlertInfo route through AlertDispatcher (the alert_tx channel), not through ActionExecutor. The validator iterates the ActionExecutor registry and naively flagged anything it didn't own. Filter the three Alert* variants out of the missing-actions check so genuine misconfiguration is still flagged, but the noise stops.

#25 — `panicmode-ctl list | head -5` panicked with "failed printing to stdout: Broken pipe (os error 32)" because Rust's default stdio treats EPIPE as a panic. Restore the kernel's default SIGPIPE handler in main(), which terminates the process silently with exit 141 — the expected Unix CLI behavior. Also makes piping to grep/awk/jq feel ordinary in shell scripts.

Trivial cosmetic fixes; included so a fresh `systemctl restart panicmode` startup is clean and panicmode-ctl behaves like a regular Unix utility.

Refs: deployment bugs #10, #25

…spot tests

Three fixes from a deeper round of channel testing on the live VPS:

#26 — Several action variants are documented in examples/config.yaml, accepted by the parser, and routed by the detector as "protective", but never registered in ActionExecutor (mass_freeze, mass_freeze_top, mass_freeze_cluster:<name>, kill_process, rate_limit). An operator copying from the example would see a generic "missing actions" warning and reasonably assume they typed a name wrong, when in fact the feature itself isn't shipped yet. Builder now reports those as "NOT YET IMPLEMENTED — will silently no-op" with explicit guidance to remove them or substitute freeze_top_process / run_script. Genuine "missing" warnings are still emitted separately for typo'd action names so legitimate misconfigurations stay visible.

#27 — Discord can be configured at two places: integrations.discord (with an .enabled flag) and channel.webhook_url at the alerts list. Validation requires channel.webhook_url, but is_integration_enabled ONLY checked integrations.discord.enabled — so a config with just channel.webhook_url passed validation and then silently dropped every Discord alert. Two changes: is_integration_enabled now accepts EITHER source; send_discord prefers channel.webhook_url and falls back to integrations.discord.webhook_url. Verified end-to-end against a mock localhost webhook receiver; the Discord-shaped payload ("content": text JSON) arrives correctly.

#28 — EmailConfig has smtp_username/smtp_password as Option<String>, but YAML smtp_username: "" deserializes as Some("") not None. lettre then attempts PLAIN/LOGIN with an empty user and fails with "No compatible authentication mechanism was found". A user who legitimately wants unauthenticated send (dev relay, internal SMTP that allows local clients without auth) had to omit the fields entirely — easy to miss when copying the example. Filter empty strings so an empty value behaves the same as absent.

Refs: deployment bugs #26, #27, #28
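The #27 and #28 fixes both come down to normalising optional strings. A sketch with hypothetical helper names follows; requiring both username and password before attempting auth is an assumption of this sketch, not something the commit message states.

```rust
/// Treat an empty string like an absent field.
fn non_empty(value: Option<String>) -> Option<String> {
    value.filter(|s| !s.is_empty())
}

/// Return (username, password) only when both are present and non-empty.
/// Assumption: auth is skipped entirely unless both halves are supplied.
fn smtp_credentials(user: Option<String>, pass: Option<String>) -> Option<(String, String)> {
    match (non_empty(user), non_empty(pass)) {
        (Some(u), Some(p)) => Some((u, p)),
        _ => None,
    }
}

/// Prefer the channel-level webhook URL, fall back to the integration-level one.
fn discord_webhook<'a>(channel: Option<&'a str>, integration: Option<&'a str>) -> Option<&'a str> {
    channel
        .filter(|url| !url.is_empty())
        .or(integration.filter(|url| !url.is_empty()))
}
```

A caller would build SMTP credentials only when `smtp_credentials` returns `Some`, and treat Discord as enabled whenever `discord_webhook` returns `Some`.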

Earlier draft fabricated several details that didn't match what the shop hit:

- Primary cause was a *DDoS attack lasting a week*, not "junior pushed a regression". The juniors were a separate compounding factor on top of the DDoS, not the lead failure mode.
- The mid-level engineer wasn't a formal on-call hire — he was someone the manager called in "as a friend", as a personal favour.
- The "they tried fail2ban + monit + a Telegram bot, after the third 'thanks but it broke again' I sat down" beat was invented for narrative tempo. What actually happened: I was asked outright to find a real solution; that's how this got built.
- The freeze rationale lost its specific technical edge. Reframed to its actual point: the original failure typically doesn't get a chance to flush its logs to disk before a restart cycle would have wiped them — keeping the process suspended in RAM keeps the evidence accessible.
- Restored the "without third-party servers and additional costs" framing on priority #1; previous draft had abstracted that away.

Same rewrite mirrored in docs/show-hn.md so the HN first-comment matches the README narrative beat-for-beat.

BorisYamp added 16 commits April 28, 2026 00:17

BorisYamp force-pushed the production-hardening-2026-04 branch from d378ff6 to 0a1111f Compare April 28, 2026 23:50

BorisYamp marked this pull request as ready for review April 29, 2026 00:02

BorisYamp merged commit 0ca5d0d into main Apr 29, 2026
2 checks passed

BorisYamp deleted the production-hardening-2026-04 branch April 29, 2026 00:02
