
Production hardening: 18 bugs from review + stress test #1

Merged
BorisYamp merged 16 commits into main from production-hardening-2026-04 on Apr 29, 2026

@BorisYamp BorisYamp commented Apr 28, 2026

Summary

Addresses 28 bugs across four rounds — initial code review, basic stress tests on a real Contabo VPS, adversarial / white-spot tests, and a final clean regression pass. All fixes are deployed, tested, and verified on a live VPS that's been actively defending against real internet attackers throughout (26+ real botnet IPs blocked over the test window).

Zero regressions in the final round. Ready to ship.

Round 1 — code review + basic stress (18 bugs)

🔴 Critical

  • #1 ctl socket umask race — local privilege-escalation window
  • #2 sshd unprotected if user mass_freeze.yaml omits it
  • #13 UFW-based block scripts fail silently under systemd hardening
  • #14 systemd unit missing AF_NETLINK and writable /run
  • #16 freeze SIGSTOPs own tokio runtime threads
  • #17 freeze SIGSTOPs kernel threads (kcompactd, kworker)

🟠 High

  • #3 Telegram messages > 4096 UTF-16 silently dropped
  • #4 restore_blocked_ips silent failure
  • #7 ctl socket EADDRINUSE on restart (no RuntimeDirectory)
  • #8 self-check task spammed ~12 alerts/min on idle
  • #9 IncidentState::save() race on shared .tmp file
  • #11 disk_cache_ttl=60s blocked disk_usage for full minute
  • #12 auth_failures details didn't include IPs — block_ip had nothing to extract
  • #15 sysinfo first refresh returns 0% CPU
  • #18 block_ip.sh not idempotent — duplicate iptables rules on restart

🟡 Medium

  • #5 script env vars uncapped + missing security note
  • #6 path comment typos repo-wide; orphan src/mass_freez.yaml
  • #10 "missing actions [AlertCritical]" cosmetic warning at every startup

Round 2 — adversarial / behavioral testing (6 bugs)

🔴 Critical

  • #19 Log-injection vulnerability. Any local non-root user could write to /var/log/auth.log via logger -p auth.warn -t sshd[X] "Failed password ... from <IP>" and trick PanicMode into iptables-banning that IP. Verified exploit on real VPS: a UID 1000 attacker baited a ban of 192.0.2.123. Fix: switched auth_monitor to journalctl -u ssh.service --since=Ns ago — the kernel's cgroup-attributed _SYSTEMD_UNIT field is unforgeable, so logger from another user is invisible. Confirmed post-fix: 30 forged entries from non-root → no incident, no ban; real Failed password from sshd still triggers detection.
  • #20 DiskIoMonitor::clone produced fresh state. Mutex<HashMap<...>> was cloned by value on every spawn_blocking tick — each clone updated its own state and was dropped, the original never moved. Result: every tick saw a "first sample" baseline, util always 0% even under heavy load. Fix: Arc<Mutex<>>. Verified under sustained fio: util reads now match /proc/diskstats math (~70%).

🟠 High

  • #21 NetworkMonitor — same Clone-shape bug. connection_rate permanently stuck near 0. Same Arc<Mutex<>> fix.
  • #22 AuthMonitor — same shape, plus rotation surface. last_position per-clone, so every tick re-read entire auth.log from byte 0. Obsoleted by the journald rewrite for #19.
  • #23 file_monitor never started. start_file_monitoring() defined but no caller; notify never registered any inotify watches. Every file_monitor rule returned event_count = 0. Fix: at startup walk config.monitors, pick MonitorType::FileMonitor, pass paths to start_file_monitoring.
  • #24 file_watcher path-match was exact. notify keys events by FILE path, operators configure parent DIRECTORY — exact-match HashMap lookup never hit. Fix: containment match — sum events for any stored path that equals the configured path or is under it.

Round 3 — deeper white-spot testing (4 bugs)

🟡 Cosmetic

  • #25 panicmode-ctl pipe panic. CLI panicked with "failed printing to stdout: Broken pipe (os error 32)" when piped to head/grep (the Rust runtime ignores SIGPIPE at startup, so the write fails with EPIPE and println! panics). Fix: restore default SIGPIPE handler in main; CLI now exits silently with 141 like every other Unix tool.

🟠 High (UX / correctness)

  • #26 Five action variants documented but not implemented. mass_freeze, mass_freeze_top, mass_freeze_cluster:<name>, kill_process, rate_limit are accepted by the YAML parser, categorized as "protective" by the detector, but never registered in ActionExecutor — they silently no-op. Fix: builder now reports them as "NOT YET IMPLEMENTED — will silently no-op until shipped" with explicit guidance to remove or substitute, separate from the genuine "missing actions" warning that catches typos.
  • #27 Discord webhook_url two sources of truth. Validation requires channel.webhook_url; is_integration_enabled for Discord only checked integrations.discord.enabled. A config with just channel-level URL passed validation and then silently dropped every Discord alert. Fix: is_integration_enabled accepts either source; send_discord prefers channel-level URL and falls back to integration-level.
  • #28 Empty SMTP creds triggered AUTH attempt. smtp_username: "" deserializes to Some("") not None, so lettre attempted PLAIN/LOGIN with empty user — server rejected with "No compatible authentication mechanism was found". Fix: filter empty strings before passing to credentials, so a blank field behaves like an absent field (correct for dev relays and many internal SMTP servers).

Round 4 — clean regression pass (0 new bugs)

A final adversarial sweep through everything we touched:

| Run | Coverage | Result |
|---|---|---|
| 1. Smoke + startup hygiene | ctl list/help/validate, service active, ctl pipe behaviour, no spurious "missing actions" on startup | ✅ 6/6 PASS |
| 2. Alert channels | Telegram, ntfy poll, Discord (channel-only path verifying #27), Email (no-auth verifying #28), Email (auth path) | ✅ 5/5 PASS |
| 3. Detection | file_monitor (30 events from 10 mods), run_script (6 PANIC_* env vars correct); CPU/memory/SSH-brute/disk_io/JSON-metric covered in earlier rounds | ✅ all confirmed |
| 4. Resilience + #26 wording | SIGKILL → auto-restart in ~6s, broken YAML → systemd-marks-failed cleanly, zero .tmp rename races over 30 min, NOT YET IMPLEMENTED warning fires for mass_freeze/mass_freeze_cluster:website/kill_process | ✅ 4/4 PASS |

What changed (high level)

Reliability

  • Three "delta" monitors (DiskIo, Network, Auth) share state across spawn_blocking clones via Arc<Mutex<>>
  • IncidentState::save() writes to a unique tmp filename — no concurrent-write race
  • restore_blocked_ips pre-flights block_script existence and ERROR-logs failures with full IP list
  • Disk metric cache TTL 60s → 5s default, config-driven
  • Self-check per-condition cooldown (default 5min, configurable); FD/thread thresholds raised to realistic values for tokio/reqwest steady state
  • Telegram truncation counts UTF-16 code units (matches Telegram's actual limit)
  • block_ip.sh is idempotent (iptables -C before -I)

Security

  • ctl socket: umask(0o077) around bind() closes the race
  • HARDCODED_PROTECTED merged on top of user mass_freeze whitelist — sshd/systemd/init/kthreadd/dbus/getty/panicmode unfreezable by config
  • Freeze action skips own tokio threads (Tgid via /proc/pid/status) and kernel threads (PPID==2 via /proc/pid/stat)
  • Script env vars truncated to 8 KB; long doc comment on never-eval-untrusted-env
  • auth_monitor reads kernel-attributed journald (_SYSTEMD_UNIT=ssh.service) — closes log-injection vector

Operations

  • panicmode --validate <path> to verify config before systemctl restart
  • panicmode --help with usage examples
  • New tunables: self_fd_threshold, self_thread_threshold, self_alert_cooldown, disk_cache_ttl
  • systemd unit gains RuntimeDirectory=panicmode, AF_NETLINK, /run in ReadWritePaths
  • examples/{block_ip,unblock_ip}.sh: iptables-based, idempotent, documented, work under hardening
  • panicmode-ctl no longer panics on SIGPIPE — pipes to head/grep/awk work normally

UX / configuration

  • Discord channel works with channel-level OR integration-level webhook_url
  • Email with empty SMTP creds skips auth instead of attempting empty-PLAIN
  • Builder distinguishes "NOT YET IMPLEMENTED" actions from genuine missing-action typos at startup
  • Cosmetic "missing actions [AlertCritical]" startup spam silenced

Known limitations (intentional, deferred)

  • SIGHUP hot-reload is not implemented: systemctl restart panicmode (~1-2s gap) is the apply-config workflow. True zero-downtime via arc-swap is doable as a follow-up.
  • 5 action variants documented but unimplemented: mass_freeze, mass_freeze_top, mass_freeze_cluster, kill_process, rate_limit. Parser accepts them, builder warns clearly, executor no-ops. Implementations welcome in follow-up PRs.
  • Many Connections monitor name vs metric semantics: name suggests absolute count, actual metric is rate (new conns / sec). At threshold 1000 it only triggers on DDoS-level bursts. A naming/docs improvement is warranted, but it blocks nothing.
  • NVMe disk_io util naturally low: %util from /proc/diskstats is queue-time-based; on NVMe with high parallelism even saturated workloads can sit at ~5%. Operators on NVMe hosts should drop the threshold (e.g. 30%) or treat IOPS/bandwidth as proxies in a future feature.

Test plan (full)

Tested on a fresh Contabo Cloud VPS 10 (4 vCPU, 8 GB, Ubuntu 24.04, ed25519-only SSH, fail2ban + UFW). Each round above lists what was exercised.

cargo test --release — 124 unit tests pass (one flaky-on-shared-FS script test setup unrelated to this PR).

Notes

15 commits on the branch, each tied to a small set of bug numbers and self-contained. Squashing is fine if that fits the review style better — the per-bug history is there for cherry-picking, not because it has to ship in this shape.

🤖 Generated with Claude Code

Repo had a consistent typo where doc comments at top of source files
referenced "PanicMode/scr/..." instead of "PanicMode/src/...", and
example YAMLs claimed "PanicMode/exampels/..." instead of "examples/".
Also drops src/mass_freez.yaml — an orphan duplicate of
examples/mass_freeze.yaml that no code path read (the typoed name made
it invisible to MassFreezeConfig::load_from_path_or_default).

Cosmetic only — no behavior change.

Refs: review bug #6
UnixListener::bind() creates the socket file using the process umask
(typically 0o022 -> world-readable). set_permissions(0o600) only ran
afterwards, leaving a race window where a local non-root user could
connect to the ctl socket and issue commands like 'panicmode-ctl
unblock <IP>'.

Force umask=0o077 around bind() so the socket is created 0o600
atomically. The explicit set_permissions afterwards stays as
defense-in-depth in case some kernel/filesystem ignores umask for
AF_UNIX sockets.

Also tighten the parent directory (typically /run/panicmode) to 0o700
so non-root users cannot traverse to the socket path even if perms
on the socket itself are loose for any reason.

Refs: review bug #1
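
The directory-tightening and defense-in-depth steps described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the function name `bind_in_private_dir` is hypothetical, and the `umask(0o077)` call itself is omitted because it needs a libc binding rather than `std`.

```rust
use std::fs::{self, DirBuilder, Permissions};
use std::io;
use std::os::unix::fs::{DirBuilderExt, PermissionsExt};
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Create an owner-only directory, then bind the control socket inside it.
/// Hypothetical helper; the umask(0o077) step from the commit is not shown.
pub fn bind_in_private_dir(dir: &Path, socket_name: &str) -> io::Result<UnixListener> {
    // mode(0o700) applies at mkdir time, so a freshly created directory is
    // never group- or world-accessible, not even briefly.
    DirBuilder::new().recursive(true).mode(0o700).create(dir)?;
    // If the directory already existed, force its mode back to 0o700.
    fs::set_permissions(dir, Permissions::from_mode(0o700))?;

    let socket_path = dir.join(socket_name);
    // Remove a stale socket left by a previous instance; "not found" is fine.
    match fs::remove_file(&socket_path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    let listener = UnixListener::bind(&socket_path)?;
    // Defense-in-depth: pin the socket itself to 0o600 after bind().
    fs::set_permissions(&socket_path, Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

Because other users cannot traverse a 0o700 directory, the window between `bind()` and `set_permissions()` is not reachable from outside in this sketch, which is the same property the parent-directory change in the commit is after.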
Telegram's sendMessage API rejects any text longer than 4096 UTF-16
code units with HTTP 400. PanicMode previously sent the formatted
incident text as-is, so a long incident dump (snapshot output,
auth-log excerpts, large details fields) was silently dropped — the
operator would never see the alert at all, despite the most critical
incidents being the ones most likely to produce long text.

Add truncate_for_telegram() which:
- counts UTF-16 code units (not bytes/chars), matching how Telegram
  enforces the limit (matters for emoji and supplementary-plane chars)
- short-circuits on the fast path (returns Cow::Borrowed) when text
  is already within budget
- truncates at a char boundary and appends a marker so the operator
  sees that the message was cut

Other channels (Discord, ntfy, email, Twilio) have different limits
and are unaffected — only send_telegram() applies the truncation.

Refs: review bug #3
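
A minimal sketch of the truncation described above, assuming a hypothetical marker string and a helper `truncate_utf16` that takes the limit as a parameter. The real `truncate_for_telegram()` may differ in its marker text and edge-case handling.

```rust
use std::borrow::Cow;

/// Limit quoted in the commit message, in UTF-16 code units.
const TELEGRAM_LIMIT_UTF16: usize = 4096;
/// Marker appended when text is cut (hypothetical wording).
const TRUNCATION_MARKER: &str = "\n[truncated]";

pub fn truncate_for_telegram(text: &str) -> Cow<'_, str> {
    truncate_utf16(text, TELEGRAM_LIMIT_UTF16)
}

/// Cut `text` so that text plus marker fits in `limit` UTF-16 code units.
pub fn truncate_utf16(text: &str, limit: usize) -> Cow<'_, str> {
    // Fast path: already within budget, so borrow and allocate nothing.
    if text.encode_utf16().count() <= limit {
        return Cow::Borrowed(text);
    }
    let budget = limit.saturating_sub(TRUNCATION_MARKER.encode_utf16().count());
    let mut used = 0; // UTF-16 code units consumed so far
    let mut end = 0; // byte offset of the cut, always on a char boundary
    for (idx, ch) in text.char_indices() {
        let units = ch.len_utf16(); // 2 for supplementary-plane chars such as emoji
        if used + units > budget {
            break;
        }
        used += units;
        end = idx + ch.len_utf8();
    }
    Cow::Owned(format!("{}{}", &text[..end], TRUNCATION_MARKER))
}
```

Counting with `len_utf16()` rather than `len()` or `chars().count()` is what keeps an emoji-heavy message from overshooting: each emoji costs two code units even though it is one `char`.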
IncidentState::save() previously used a single "<state_file>.tmp"
path for the atomic write+rename pattern. When several incidents
fired within the same millisecond (common during a CPU spike that
triggers Critical CPU + CPU Spike + High CPU at once), two save()
calls would race:

  Task A: write .tmp                          [success]
  Task B: write .tmp (overwrites A's data)    [success]
  Task A: rename .tmp -> .json                [success]
  Task B: rename .tmp -> .json                [.tmp gone, ENOENT]

The error showed up under stress as:
  ERROR: Failed to rename incident_state.json.tmp -> incident_state.json

Use a per-call unique suffix (pid + nanos) so each save writes to
its own .tmp, and rename() last-writer-wins on the destination.
Each save still ends with a self-consistent snapshot.

Refs: deployment bug #9
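
The unique-suffix write described above might look roughly like the sketch below. The function names are hypothetical, and the process-wide counter is an addition that is not in the commit message; it is there only so two threads saving in the same nanosecond still get distinct names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Extra tie-breaker (an assumption, not described in the commit).
static SAVE_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Build "<state_file>.<pid>.<nanos>.<seq>.tmp" so concurrent saves never share a file.
fn unique_tmp_path(state_file: &Path) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = SAVE_COUNTER.fetch_add(1, Ordering::Relaxed);
    let mut name = state_file.as_os_str().to_owned();
    name.push(format!(".{}.{}.{}.tmp", process::id(), nanos, seq));
    PathBuf::from(name)
}

/// Write `contents` to a private temp file, then rename it over the destination.
pub fn save_atomically(state_file: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp = unique_tmp_path(state_file);
    fs::write(&tmp, contents)?;
    // rename() replaces the destination in one step; the last writer wins.
    fs::rename(&tmp, state_file).map_err(|e| {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of the orphan
        e
    })
}
```

Each caller renames only the file it wrote, so no task can find its temp file already moved by another.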
The block_ip action extracts target IPs by parsing the incident's
details string. The auth_failures format previously read:

  "Auth failures: 18, from 13 IP(s), successful logins: 52"

— no concrete IP appeared anywhere in the text, so block_ip skipped
with "no public IPs in incident details". Brute-force detection
fired but no IP was actually blocked.

Build the details from the deduplicated-by-IP failure map (the
underlying failures_by_ip is keyed by user@ip, so the same source
IP with N usernames produces N entries — "from N IP(s)" was
also misleading; now reports "from N unique IP(s)").

New format:
  "Auth failures: 91, from 2 unique IP(s), successful logins: 66,
   top: [198.51.100.1(35), 161.132.4.167(6)]"

The top-5 IPs (by count, descending) appear in the string, and
extract_public_ips() picks them up cleanly. Verified end-to-end
on Contabo VPS with simulated brute force.

Refs: deployment bug #12
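
As an illustration of the dedup-by-IP step above, the sketch below collapses counts keyed by `user@ip` into per-IP totals and formats the top five. The function name and the rule that the leading total is the sum of all counts are assumptions; the real monitor may compute the total differently.

```rust
use std::collections::HashMap;

/// Collapse "user@ip" keyed counts into per-IP totals and format the details string.
/// Hypothetical helper; key shape "user@ip" is taken from the commit message.
pub fn format_auth_details(
    failures_by_user_ip: &HashMap<String, u32>,
    successful_logins: u32,
) -> String {
    let mut by_ip: HashMap<&str, u32> = HashMap::new();
    for (key, count) in failures_by_user_ip {
        // Keys without '@' are treated as a bare IP.
        let ip = key.rsplit_once('@').map(|(_, ip)| ip).unwrap_or(key.as_str());
        *by_ip.entry(ip).or_insert(0) += count;
    }
    let total: u32 = by_ip.values().sum();
    let mut ranked: Vec<(&str, u32)> = by_ip.into_iter().collect();
    // Highest count first; ties broken by IP text so the output is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    let top: Vec<String> = ranked
        .iter()
        .take(5)
        .map(|(ip, n)| format!("{}({})", ip, n))
        .collect();
    format!(
        "Auth failures: {}, from {} unique IP(s), successful logins: {}, top: [{}]",
        total,
        ranked.len(),
        successful_logins,
        top.join(", ")
    )
}
```

With the IPs spelled out in the string, a downstream extractor has concrete addresses to work with instead of only a count.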
…note

Two issues addressed:

1. Some incident metadata (snapshot dumps, large auth-log excerpts)
   can exceed the kernel's per-string limit for argv/envp entries
   (MAX_ARG_STRLEN, 128 KiB with 4 KiB pages). Without a cap, an
   oversized PANIC_DETAILS made exec() fail with E2BIG, dropping the
   whole script invocation silently.

2. Documentation gap: PANIC_DESCRIPTION / PANIC_DETAILS may contain
   attacker-influenced text (e.g. usernames extracted from
   /var/log/auth.log). Command::env() does NOT pass values through
   a shell, so they are NOT command-injection vulnerabilities at
   the panicmode boundary. But if a user's script passes one of
   these variables to `eval`, the script reintroduces the
   injection. The risk wasn't documented anywhere.

Cap each env var at 8 KB (MAX_ENV_BYTES) with a char-boundary-safe
truncation, and add a prominent comment on the constant explaining
the eval-is-unsafe expectation for user scripts.

Refs: review bug #5
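
A char-boundary-safe cap can be sketched as below. `MAX_ENV_BYTES` and the 8 KB value come from the commit message; the function name `cap_env_value` is hypothetical.

```rust
/// Cap taken from the commit message (8 KB per variable).
pub const MAX_ENV_BYTES: usize = 8 * 1024;

/// Return the longest prefix of `value` that fits in `max_bytes`
/// and ends on a UTF-8 char boundary.
pub fn cap_env_value(value: &str, max_bytes: usize) -> &str {
    if value.len() <= max_bytes {
        return value;
    }
    let mut end = max_bytes;
    // Walk back until the cut no longer splits a multi-byte sequence.
    // Offset 0 is always a boundary, so the loop terminates.
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}
```

A caller would pass the result to `Command::env`, for example `cmd.env("PANIC_DETAILS", cap_env_value(details, MAX_ENV_BYTES))`. Slicing at a raw byte offset instead would panic whenever the offset lands inside a multi-byte character.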
The self-check task and the disk-metrics cache had their thresholds
hardcoded as compile-time constants. Tuning them required a rebuild
— painful in production where operators want to ssh in, edit YAML,
'systemctl restart panicmode' and move on.

Move into PerformanceConfig:

  self_fd_threshold      (default 1000)  — was hardcoded 100
  self_thread_threshold  (default 200)   — was hardcoded 20
  self_alert_cooldown    (default 5 min) — was implicit (no cooldown)
  disk_cache_ttl         (default 5 s)   — was hardcoded 60 s

Notes on the new defaults:
- The old FD threshold (100) and thread threshold (20) produced
  immediate false-positive 'leak' alerts on a fresh start: tokio's
  multi-threaded runtime alone runs ~num_cpus*2 worker threads, and
  reqwest+tokio+tracing easily holds 100+ open FDs at steady state.
- self_alert_cooldown is consumed in run_self_check_task to prevent
  the same condition from re-alerting every check_interval.
- disk_cache_ttl: 60s was wildly too long for a monitoring tool —
  a runaway log file could fill the disk while the cached '7%'
  reading was still being served.

All fields use serde defaults — existing configs continue to load
without edits.

Refs: deployment bugs #8 (self-check), #11 (disk cache)
MonitorEngine had a hardcoded 60-second cache on disk metrics. Calling
sysinfo::Disks::new_with_refreshed_list() is mildly expensive (parses
/proc/mounts, stats each mount), so caching makes sense — but a full
minute is wildly too long for a monitoring tool.

A runaway log file or an attacker filling the disk could push usage
from 50% to 95% within seconds, but PanicMode would keep serving the
stale 7% reading from cache for up to 60 s before the Disk Almost
Full incident could fire. Verified: filled the test VPS to 81%, no
incident fired in 15 s — only after the cache expired.

Read disk_cache_ttl from config.performance (introduced in the
previous commit) instead of the hardcoded constant. Default is
5 s which matches the standard check_interval, so a real disk
spike is caught the next tick.

Refs: deployment bug #11
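
The effect of the TTL on staleness can be shown with a small cache sketch. `TtlCache` and `get_or_refresh` are hypothetical names, and the clock is passed in as a parameter purely so the expiry logic can be exercised without sleeping; the real MonitorEngine cache is likely shaped differently.

```rust
use std::time::{Duration, Instant};

/// Caches one value and recomputes it once it is at least `ttl` old.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Return the cached value while it is fresh, otherwise call `refresh`.
    pub fn get_or_refresh(&mut self, now: Instant, refresh: impl FnOnce() -> T) -> T {
        if let Some((stored_at, value)) = &self.entry {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = refresh();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

With a 5 s TTL and a 5 s check interval, a reading taken at one tick has expired by the next, so a disk that fills between ticks is seen on the following check rather than up to a minute later.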
…p own/kernel threads

Four issues in the process freeze action, all surfaced by stress testing:

1) sshd / systemd / init protection was config-only (#2 review).
   If a user's mass_freeze.yaml omitted sshd from the whitelist, the
   freeze action would happily SIGSTOP it under load — locking the
   operator out of the box mid-incident. Add HARDCODED_PROTECTED
   (sshd, systemd, init, kthreadd, dbus, getty, panicmode) and merge
   it on top of the user list in ProcessAction::new(). Even a hostile
   YAML can't reduce the safety floor.

2) sysinfo::Process::cpu_usage() returned 0 on first read (#15).
   sysinfo needs two refreshes spaced by an interval to compute a
   delta. The action did a single refresh, so every process showed
   0% CPU and the threshold filter (>= 1.0%) skipped everything.
   Verified: "No processes to freeze (none above threshold)" on a
   stress-ng-saturated box. Fix: refresh_processes(), sleep 200 ms,
   refresh_processes() again, then read.

3) Tokio runtime threads of panicmode itself got SIGSTOP'd (#16).
   On Linux, sysinfo enumerates threads as separate Process entries
   (each TID has its own /proc/<tid> dir). The pid == own_pid check
   only matched the main thread, so worker threads were eligible
   for freeze. We watched a tokio-rt-worker get SIGSTOP'd during
   a stress test. Fix: read /proc/<pid>/status, skip if Tgid ==
   own_pid (i.e. it's a thread within our own thread group).

4) Kernel threads got SIGSTOP'd (#17).
   Names like kcompactd0, kworker/u8:1, ksoftirqd/0 are kernel
   threads — SIGSTOP can wedge the kernel. The substring whitelist
   for "kthreadd" only matched PID 2 itself, not its many kernel-
   thread children. Fix: read /proc/<pid>/stat, skip if PPID == 2
   (parent is kthreadd). Tried Process::cmd().is_empty() first, but
   that returned empty for normal userspace processes too on this
   sysinfo version, breaking the whole freeze action.

Refs: review bug #2; deployment bugs #15, #16, #17
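
The two /proc checks from items 3 and 4 above reduce to small text parsers. The sketch below works on the file contents as strings so it can run anywhere; all three function names are hypothetical.

```rust
/// Extract the Tgid value from the text of /proc/<pid>/status.
pub fn parse_tgid(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Tgid:"))
        .and_then(|rest| rest.trim().parse().ok())
}

/// Extract the parent PID (field 4) from the text of /proc/<pid>/stat.
/// The comm field (field 2) is wrapped in parentheses and may itself contain
/// spaces or ')' characters, so parsing starts after the last ')'.
pub fn parse_ppid(stat: &str) -> Option<u32> {
    let after_comm = &stat[stat.rfind(')')? + 1..];
    // Remaining fields are: state, ppid, pgrp, ...
    after_comm.split_whitespace().nth(1)?.parse().ok()
}

/// Decide whether a freeze candidate must be skipped.
pub fn must_skip(own_pid: u32, status: &str, stat: &str) -> bool {
    // A thread of our own process reports our PID as its thread-group id.
    let own_thread = parse_tgid(status) == Some(own_pid);
    // Kernel threads are children of kthreadd, which is PID 2.
    let kernel_thread = parse_ppid(stat) == Some(2);
    own_thread || kernel_thread
}
```

kthreadd itself has PPID 0, so this check alone does not cover it; in the commit that case is handled by the HARDCODED_PROTECTED name list.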
…idate flag

Three changes in main.rs:

1) Self-check task spammed alerts every check_interval (#8).
   With cpu_limit=5%, memory_limit_mb=50, FD threshold 100, thread
   threshold 20 — and zero cooldown — a fresh deploy produced 10+
   Telegram alerts per minute on idle: 'PanicMode CPU high: 50%',
   'FD leak: 175', etc. The thresholds were unrealistic for a
   modern Rust async app, and the lack of dedup made any persistent
   condition cascade.

   Each condition now tracks its own last-alert Instant in a
   HashMap, and only fires if cooldown has elapsed. Thresholds and
   cooldown all come from config.performance (added in earlier
   commit) so operators can tune without rebuilding.

2) restore_blocked_ips silent-failure (#4 review).
   Per-IP failures used warn! and the function returned Ok with
   no aggregate error log. Operators saw 'restored 5 blocks'
   incident DB rows but the firewall was empty — attackers were
   back in. Pre-flight: error+return if block_script doesn't
   exist (catches the common 'I renamed the script and forgot to
   update config' case). Aggregate failures into an error! log
   at the end with the full list of failed IPs and reasons.

3) Operator UX: --validate / --check / --help.
   The standard 'edit yaml -> systemctl restart' workflow needed
   a way to verify a config before bouncing the daemon. Adding
   'panicmode --validate /etc/panicmode/config.yaml' parses and
   validates, exits 0 on OK / 1 on error. --help prints usage.

Refs: review bug #4; deployment bug #8; ux
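
The per-condition cooldown from item 1 above can be sketched as a map from condition name to last-alert time. `AlertCooldown` and `should_alert` are hypothetical names, and the clock is injected so the behaviour can be checked without waiting.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the last alert time per condition and suppresses repeats inside the cooldown.
pub struct AlertCooldown {
    cooldown: Duration,
    last_alert: HashMap<String, Instant>,
}

impl AlertCooldown {
    pub fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_alert: HashMap::new() }
    }

    /// Returns true (and records `now`) if an alert for `condition` may fire.
    pub fn should_alert(&mut self, condition: &str, now: Instant) -> bool {
        let in_cooldown = self
            .last_alert
            .get(condition)
            .map_or(false, |last| now.duration_since(*last) < self.cooldown);
        if in_cooldown {
            return false;
        }
        self.last_alert.insert(condition.to_string(), now);
        true
    }
}
```

Keying by condition means a persistent FD warning does not suppress an unrelated CPU warning; each one is rate-limited on its own clock.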
…rewall scripts

Three changes to the unit, all surfaced by stress testing on the
Contabo VPS deploy:

1) RuntimeDirectory=panicmode (#7).
   ctl_socket lives at /run/panicmode/ctl.sock by default. With
   ReadOnlyPaths=/ and only /var/lib/panicmode etc. in ReadWritePaths,
   panicmode could not create the parent dir or write the socket
   file. After a crash, the stale socket from the previous instance
   couldn't be removed either, so restarts hit EADDRINUSE in a loop.
   RuntimeDirectory= asks systemd to create /run/panicmode at start
   with the configured mode and remove it on stop — clean lifecycle.

2) AF_NETLINK in RestrictAddressFamilies (#14).
   iptables/ip6tables (used by the standard block_ip / unblock_ip
   scripts) talk to kernel netfilter over an AF_NETLINK socket.
   Without it the scripts fail with cryptic errors like "Operation
   not permitted" on rule insertion, and block_ip is silently a
   no-op — brute-force detection fires but no IP is actually banned.

3) /run added to ReadWritePaths (#14).
   UFW takes /run/ufw.lock, iptables takes /run/xtables.lock —
   neither of which fit under /run/panicmode (and we don't want
   to bind-mount each lock). Granting /run keeps the rest of the
   hardening intact (still no /etc, no /var beyond panicmode's
   own dirs).

Refs: deployment bugs #7, #14
…hardening

The repo previously shipped no example block_ip / unblock_ip scripts
even though config.firewall references them by path. Operators wrote
their own — typically the obvious "ufw deny from $IP", which
silently fails under panicmode's systemd hardening (UFW takes
/etc/ufw/user.rules and /run/ufw.lock; both are unwritable under
ReadOnlyPaths=/).

Add reference scripts that:
- use iptables/ip6tables directly (no /etc state, no UFW lock file)
- pick the right tool by address family (":" → ip6tables)
- are idempotent: `-C` (check) before `-I` (insert) on block_ip
  prevents restore_blocked_ips from compounding duplicate rules on
  every restart; `-D || true` on unblock makes the script safe to
  call on an IP that was never blocked
- are documented inline so the operator knows WHY iptables and not
  UFW, and what hardening flags need to be relaxed for it to work

Persistence across reboots is the daemon's responsibility — the
SQLite blocked_ips table + restore_blocked_ips on startup re-runs
block_ip.sh for every stored IP. Verified end-to-end on Contabo VPS
(reboot, all 3 stored blocks restored, no duplicates).

Refs: deployment bugs #13, #18
Three monitors maintained per-tick state (last sample, file cursor) but
were cloned by MonitorEngine on every spawn_blocking — each clone updated
its OWN copy and was dropped, while the original inside MonitorEngine
never moved. The visible effect for an operator: disk_io always reported
0% util, connection_rate always 0, and auth_monitor re-read the entire
auth.log every tick instead of only the new tail. All three are fixed by
wrapping the state in Arc<Mutex<>> so clones share one cell.

For auth_monitor we go further. Reading /var/log/auth.log line-by-line
let any local non-root user write fake entries via `logger -p auth.warn
-t sshd` and bait PanicMode into iptables-banning arbitrary public IPs
— verified on a real VPS with a UID 1000 attacker and 16 forged failures
naming 192.0.2.123. Switching to `journalctl -u ssh.service` makes the
kernel's cgroup-attributed _SYSTEMD_UNIT field the gate: messages from a
random user's logger never carry that attribution and silently drop
out, while genuine sshd events flow through unchanged. Re-verified post-
fix: same 30 forged entries from a non-root user, no incident, no ban;
real Failed password from sshd still triggers SSH Brute Force as before.

The journald rewrite also obsoletes the file-cursor (last_position)
state, so the auth piece of the Arc<Mutex<>> family-fix is moot — but
the disk_io and network instances of the same pattern still need it.

Refs: deployment bugs #19 (CRITICAL log injection), #20 (disk_io util),
#21 (network rate), #22 (auth incremental — obsoleted by journald)
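
The clone-shape bug is easy to reproduce in isolation. The sketch below is a simplified stand-in, not the real monitors: `PerCloneMonitor` uses a plain map with a derived `Clone` to model "each clone gets its own state", and `SharedMonitor` shows the `Arc<Mutex<>>` shape where clones share one cell.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Buggy shape (simplified): cloning copies the map, so each clone has private state.
#[derive(Clone, Default)]
pub struct PerCloneMonitor {
    last_sample: HashMap<String, u64>,
}

/// Fixed shape: cloning copies the Arc, so every clone sees the same map.
#[derive(Clone, Default)]
pub struct SharedMonitor {
    last_sample: Arc<Mutex<HashMap<String, u64>>>,
}

impl PerCloneMonitor {
    /// Returns the delta since the previous sample, or None on the first sample.
    pub fn tick(&mut self, dev: &str, counter: u64) -> Option<u64> {
        self.last_sample
            .insert(dev.to_string(), counter)
            .map(|prev| counter.saturating_sub(prev))
    }
}

impl SharedMonitor {
    /// Same contract as above, but the previous sample survives across clones.
    pub fn tick(&self, dev: &str, counter: u64) -> Option<u64> {
        let mut map = self.last_sample.lock().unwrap();
        map.insert(dev.to_string(), counter)
            .map(|prev| counter.saturating_sub(prev))
    }
}
```

Ticking two fresh clones of `PerCloneMonitor` yields `None` both times, which is the "every tick is a first sample" symptom; two clones of `SharedMonitor` yield `None` and then the real delta.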
…s under watched directory

The file_monitor monitor type was non-functional in two ways. First,
MonitorEngine::start_file_monitoring(paths) existed but was never called
from main, so notify never registered any inotify watches — the rule
could read its threshold but get_file_event_count always saw an empty
HashMap. Second, even with the watcher started, get_event_count looked
up exact-match keys on the path; notify keyed events under the FILE
that changed (/etc/nginx/nginx.conf), while operators configure the
parent directory (/etc/nginx/), so the lookup never matched.

Fix both ends:

  * main: after MonitorEngine is constructed, walk config.monitors,
    pick rules of MonitorType::FileMonitor, and pass their paths to
    start_file_monitoring. Log a warning rather than aborting if the
    watcher cannot register a path (path is missing or unreadable
    under systemd hardening) so unrelated rules still fire.

  * file_watcher::get_event_count: switch to a containment match —
    sum events for any stored event_path that is the configured path
    OR sits under it as a subdirectory. Keeps the file path as the
    storage key (preserves per-file event detail for future queries).

Verified end-to-end on the test VPS: a file_monitor rule watching
/etc/panicmode/watched fires Watched Files at current_value=30 after
ten file modifications (each modification produces a couple of
inotify events) — was always 0 before.

Refs: deployment bugs #23 (start_file_monitoring never called),
#24 (directory-vs-file path mismatch in get_event_count)
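
The containment match can be sketched with `Path::starts_with`, which compares whole path components rather than raw string prefixes. The function name `count_events_under` and the map shape are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Sum events for every stored path equal to `configured` or located under it.
pub fn count_events_under(events: &HashMap<PathBuf, u64>, configured: &Path) -> u64 {
    events
        .iter()
        // Component-wise comparison: /etc/nginx does not match /etc/nginx-old.
        .filter(|(event_path, _)| event_path.starts_with(configured))
        .map(|(_, count)| *count)
        .sum()
}
```

A plain string prefix test would wrongly count events from a sibling directory that merely shares a name prefix, which is why the component-aware comparison is used here.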
…SIGPIPE in CLI

Two small operational papercuts:

#10 — On every startup the action-executor builder logged a warning
per monitor saying "missing actions [AlertCritical] (degraded, 1 of
2 available)". Alerts were actually firing fine — AlertCritical /
AlertWarning / AlertInfo route through AlertDispatcher (the alert_tx
channel), not through ActionExecutor. The validator iterates the
ActionExecutor registry and naively flagged anything it didn't own.
Filter the three Alert* variants out of the missing-actions check
so genuine misconfiguration is still flagged, but the noise stops.

#25 — `panicmode-ctl list | head -5` panicked with "failed printing
to stdout: Broken pipe (os error 32)" because the Rust runtime ignores
SIGPIPE at startup, so a write to a closed pipe returns EPIPE and
println! panics on the error. Restore the default SIGPIPE disposition
in main(), which terminates the process silently with exit 141, the
expected Unix CLI behavior. Also makes piping to grep/awk/jq feel
ordinary in shell scripts.

Trivial cosmetic fixes; included so a fresh `systemctl restart
panicmode` startup is clean and panicmode-ctl behaves like a regular
Unix utility.

Refs: deployment bugs #10, #25
…spot tests

Three fixes from a deeper round of channel testing on the live VPS:

#26 — Several action variants are documented in examples/config.yaml,
accepted by the parser, and routed by the detector as "protective",
but never registered in ActionExecutor (mass_freeze, mass_freeze_top,
mass_freeze_cluster:<name>, kill_process, rate_limit). An operator
copying from the example would see a generic "missing actions"
warning and reasonably assume they typed a name wrong, when in fact
the feature itself isn't shipped yet. Builder now reports those as
"NOT YET IMPLEMENTED — will silently no-op" with explicit guidance
to remove them or substitute freeze_top_process / run_script. Genuine
"missing" warnings are still emitted separately for typo'd action
names so legitimate misconfigurations stay visible.

#27 — Discord can be configured at two places: integrations.discord
(with an .enabled flag) and channel.webhook_url at the alerts list.
Validation requires channel.webhook_url, but is_integration_enabled
ONLY checked integrations.discord.enabled — so a config with just
channel.webhook_url passed validation and then silently dropped every
Discord alert. Two changes: is_integration_enabled now accepts
EITHER source; send_discord prefers channel.webhook_url and falls
back to integrations.discord.webhook_url. Verified end-to-end
against a mock localhost webhook receiver; the Discord-shaped
payload ("content": text JSON) arrives correctly.

#28 — EmailConfig has smtp_username/smtp_password as Option<String>,
but YAML 'smtp_username: ""' deserializes as Some("")
not None. lettre then attempts PLAIN/LOGIN with an empty user and
fails with "No compatible authentication mechanism was found".
A user who legitimately wants unauthenticated send (dev relay,
internal SMTP that allows local clients without auth) had to omit
the fields entirely — easy to miss when copying the example. Filter
empty strings so an empty value behaves the same as absent.

Refs: deployment bugs #26, #27, #28
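
The #27 and #28 fixes both come down to treating an empty value as absent. A minimal sketch follows, with hypothetical function names; requiring both username and password to be non-empty is an assumption made here, not something the commit states.

```rust
/// Treat an empty string as if the field had been omitted.
pub fn non_empty(value: Option<String>) -> Option<String> {
    value.filter(|s| !s.is_empty())
}

/// Return SMTP credentials only when both fields carry real values
/// (assumption: a blank username or password means "send without auth").
pub fn smtp_credentials(user: Option<String>, pass: Option<String>) -> Option<(String, String)> {
    Some((non_empty(user)?, non_empty(pass)?))
}

/// Prefer the channel-level webhook URL and fall back to the integration-level one.
pub fn discord_webhook<'a>(
    channel: Option<&'a str>,
    integration: Option<&'a str>,
) -> Option<&'a str> {
    channel
        .filter(|u| !u.is_empty())
        .or(integration.filter(|u| !u.is_empty()))
}
```

When `smtp_credentials` returns `None`, the caller would build the mail transport without an authentication step, which matches the blank-equals-absent behaviour described above.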
@BorisYamp BorisYamp force-pushed the production-hardening-2026-04 branch from d378ff6 to 0a1111f on April 28, 2026 23:50
@BorisYamp BorisYamp marked this pull request as ready for review April 29, 2026 00:02
@BorisYamp BorisYamp merged commit 0ca5d0d into main Apr 29, 2026
2 checks passed
BorisYamp added a commit that referenced this pull request Apr 29, 2026
@BorisYamp BorisYamp deleted the production-hardening-2026-04 branch April 29, 2026 00:02
BorisYamp added a commit that referenced this pull request May 5, 2026
Earlier draft fabricated several details that didn't match what the
shop hit:

- Primary cause was a *DDoS attack lasting a week*, not "junior
  pushed a regression". The juniors were a separate compounding
  factor on top of the DDoS, not the lead failure mode.
- The mid-level engineer wasn't a formal on-call hire — he was
  someone the manager called in "as a friend", as a personal favour.
- The "they tried fail2ban + monit + a Telegram bot, after the
  third 'thanks but it broke again' I sat down" beat was
  invented for narrative tempo. What actually happened: I was
  asked outright to find a real solution; that's how this got
  built.
- The freeze rationale lost its specific technical edge.
  Reframed to its actual point: the original failure typically
  doesn't get a chance to flush its logs to disk before a
  restart cycle would have wiped them — keeping the process
  suspended in RAM keeps the evidence accessible.
- Restored the "without third-party servers and additional
  costs" framing on priority #1; previous draft had abstracted
  that away.

Same rewrite mirrored in docs/show-hn.md so the HN first-comment
matches the README narrative beat-for-beat.