fix(instance.controller): emit remove.instance even when logout fails (zombie cleanup) #2520
Conversation
When a Baileys WebSocket dies but the in-memory `waInstances[name]`
entry still exists (a "zombie" instance), `deleteInstance()` calls
`await this.logout()` which throws "Connection Closed". The throw
causes the outer try/catch to skip the `eventEmitter.emit('remove.instance')`
call — which is the only mechanism that purges the zombie from
`waInstances`.
Result: zombies persist in memory until the entire `evo2_api`
container is restarted, affecting ALL instances on the host (not
just the broken one). Operators have no per-instance recovery path
in v2.3.x — their only option is `docker restart`, which forces
every connected user to re-scan the QR code.
Fix: wrap the inner `logout()` call in its own try/catch. Log a
warning when it fails but continue to the cleanup emit. The
in-memory entry must be removed regardless of whether logout
completed cleanly — `remove.instance` is the canonical way to
purge a stuck instance, and DB/cache cleanup happens in the same
event handler.
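
In code, the guarded call is shaped like this (abridged sketch — the log message is elided here; see the exact change in the diff included in the Summary below):

```typescript
if (instance.state === 'connecting' || instance.state === 'open') {
  try {
    await this.logout({ instanceName });
  } catch (logoutError) {
    // A dead ("zombie") socket makes logout() throw "Connection Closed".
    // Log and fall through: the cleanup emit below must still run.
    this.logger.warn(logoutError);
  }
}
// eventEmitter.emit('remove.instance', ...) now runs unconditionally.
```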
This makes `DELETE /instance/:name` idempotent against zombies: a
caller can always recover a single instance without nuking the
whole host.
Refs:
- EvolutionAPI#693 (instance/restart closes the session)
- EvolutionAPI#1286 (Connection Closed in v2.2.3)
- EvolutionAPI#2026 (Sync lost after reboot)
- EvolutionAPI#2027 (Loss of synchronization on reboot)
Tested in production at Rigarr (14 instances, ~25k msgs/day) by
overlaying this patch on v2.3.7 via Docker. Before: any zombie
forced a full container restart. After: per-instance cleanup
works cleanly while other vendors stay connected.
Signed-off-by: Bruno Cavalcante Sgarbi <bcsgarbi@gmail.com>
Reviewer's Guide (Sourcery)

Makes instance deletion resilient to Baileys WebSocket "zombie" instances by catching logout failures and always proceeding to emit the internal `remove.instance` cleanup event.

Sequence diagram for instance deletion with zombie cleanup:

```mermaid
sequenceDiagram
actor Client
participant InstanceController
participant BaileysSocket
participant Logger
participant EventEmitter
participant MonitorService
participant WaInstances
Client->>InstanceController: DELETE /instance/:name
InstanceController->>InstanceController: load instance by name
InstanceController->>InstanceController: clearCacheChatwoot if enabled
InstanceController->>InstanceController: check instance.state
alt instance.state is connecting or open
InstanceController->>BaileysSocket: logout(instanceName)
BaileysSocket-->>InstanceController: Connection Closed error
InstanceController->>InstanceController: catch logoutError
InstanceController->>Logger: warn [ZOMBIE-CLEANUP] logout failed
else instance.state is not connecting or open
InstanceController->>InstanceController: skip logout
end
InstanceController->>EventEmitter: emit remove.instance(instanceName)
EventEmitter->>MonitorService: remove.instance(instanceName)
MonitorService->>MonitorService: cleaningUp()
MonitorService->>WaInstances: delete entry for instanceName
MonitorService->>MonitorService: finalize instance cleanup
InstanceController-->>Client: 200 SUCCESS Instance deleted
```
Hey - I've left some high level feedback:

- The in-line comment and log message referencing `RIGARR PATCH`/`[ZOMBIE-CLEANUP]` are quite vendor-specific; consider rephrasing them in neutral, upstream-focused terms so the intent is clear without tying the behavior to a particular deployment.
- In the `catch` block for `logout`, it may be helpful to log the full error object or stack (e.g., `this.logger.warn('... ', logoutError);`) rather than only `toString()`, to preserve diagnostic details if this path is hit unexpectedly.
Per EvolutionAPI#2520 review:

1. Drop vendor-specific markers in code comment and log message (was `[ZOMBIE-CLEANUP]` and `RIGARR PATCH`). Comment now describes the bug in upstream-friendly terms.
2. Pass the full error object to `logger.warn` instead of `toString()`, following the existing convention in `monitor.service.ts` (the `'no.connection'` handler), where structured object logging is used to preserve diagnostic detail.

No behavior change.
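
After that commit, the block plausibly ends up shaped like this (illustrative sketch only — the exact comment wording and log-call shape are assumptions based on the review suggestion; the project's `Logger` API may differ):

```typescript
try {
  await this.logout({ instanceName });
} catch (logoutError) {
  // A dead ("zombie") socket makes logout throw; continue so the
  // 'remove.instance' event below still purges the in-memory entry.
  this.logger.warn(`logout failed for "${instanceName}"; proceeding with cleanup`);
  this.logger.warn(logoutError); // full error object, not toString()
}
```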
Thanks for the review @sourcery-ai. Pushed a follow-up commit (diff: 93b9081a). Happy to squash to one commit before merge if that's the project preference.
Summary
When a Baileys WebSocket dies but the in-memory `waInstances[name]` entry still exists (a "zombie" instance), `deleteInstance()` in `src/api/controllers/instance.controller.ts` calls `await this.logout()` which throws `Connection Closed`. The outer `try/catch` then propagates a `BadRequestException` and `eventEmitter.emit('remove.instance')` is never reached — leaving the zombie in `waInstances` forever, only fixable by restarting the entire process.

This is the practical reason behind several long-standing reports about reconnect/sync issues:
- `instance/restart/{$instanceName}` close the session (#693) — `instance/restart` closes the session
- Connection Closed in v2.2.3 (#1286)
- Sync lost after reboot (#2026)
- Loss of synchronization on reboot (#2027)

In production, operators today have no per-instance recovery when a Baileys socket dies in a way that prevents
`logout()` from completing. Their only escape is `docker restart`, which kicks every connected user off the host and forces a full QR re-scan for everyone.
Fix

Wrap the inner `await this.logout(...)` call in its own try/catch. Log the failure but proceed to the cleanup emit so `remove.instance` always fires. The in-memory entry must be purged regardless of whether logout completed cleanly — `cleaningUp()` (DB / Session / cache) and `delete this.waInstances[name]` both happen inside the `remove.instance` handler in `monitor.service.ts`.

```diff
 if (instance.state === 'connecting' || instance.state === 'open') {
-  await this.logout({ instanceName });
+  try {
+    await this.logout({ instanceName });
+  } catch (logoutError) {
+    this.logger.warn(
+      `[ZOMBIE-CLEANUP] logout failed for "${instanceName}" (likely zombie socket): ${logoutError?.toString?.() || logoutError}. Proceeding with cleanup.`,
+    );
+  }
 }
```

12 added, 1 changed line. No public API change.
`DELETE /instance/:name` becomes idempotent against zombies: a caller can always recover a single instance without affecting other instances on the same host.
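
For illustration, per-instance recovery then looks like this from a client (hypothetical base URL and key; the route spelling and the `apikey` header follow this PR's shorthand and may differ in your deployment):

```typescript
// Hypothetical values — substitute your deployment's base URL and key.
const BASE_URL = 'http://localhost:8080';
const API_KEY = '<global-apikey>';

// With this patch, deleting a zombie instance succeeds instead of
// surfacing the propagated "Connection Closed" error.
async function recoverZombie(instanceName: string): Promise<void> {
  const res = await fetch(`${BASE_URL}/instance/${instanceName}`, {
    method: 'DELETE',
    headers: { apikey: API_KEY },
  });
  console.log(res.status, await res.json()); // expect status SUCCESS
}

void recoverZombie('pedro-rca340-sp');
```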
Why this is safe

`remove.instance` is the canonical, internal cleanup path. The handler in `monitor.service.ts` already runs `cleaningUp()` (deletes the `Session` row, sets `connectionStatus='close'` in `Instance`, `rmSync` of the on-disk session dir, clears Redis cache, removes provider session). All of that is independent of whether `logout()` completed gracefully on the WhatsApp side. If the socket is truly dead, talking to WhatsApp wasn't going to happen anyway — what we need is the local cleanup, and this fix unblocks it.
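
To make that concrete, a simplified sketch of the handler shape this relies on — stand-in names (`waMonitor`, the inline `cleaningUp` stub), not the verbatim `monitor.service.ts` code:

```typescript
import { EventEmitter } from 'events';

// Stand-ins so the sketch is self-contained; in the real codebase these
// are the monitor service and its members.
const eventEmitter = new EventEmitter();
const waMonitor = {
  waInstances: {} as Record<string, unknown>,
  async cleaningUp(instanceName: string): Promise<void> {
    // Local-only cleanup: DB Session row, connectionStatus='close',
    // on-disk session dir, Redis cache, provider session. None of it
    // requires the WhatsApp socket to be alive.
  },
};

eventEmitter.on('remove.instance', async (instanceName: string) => {
  await waMonitor.cleaningUp(instanceName);
  // The only place the in-memory entry is purged — previously
  // unreachable whenever logout() threw.
  delete waMonitor.waInstances[instanceName];
});
```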
Tested

Patch deployed in production at Rigarr (Brazilian distributor, 14 active vendor instances, ~25k msgs/day) via Docker image overlay on v2.3.7:

- Before: any zombie forced `docker restart evo2_api` (~30s downtime, all 14 vendors needed to re-scan QR).
- After: `DELETE /instance/:name` returns `{"status":"SUCCESS","response":{"message":"Instance deleted"}}` even when the underlying socket is dead. Other vendors stay connected.

Smoke test before submitting: applied patch on a `2.3.7` clone, ran `npm ci && npm run build`, no compile errors. Container boots cleanly. Tested DELETE on a known-zombie instance (`pedro-rca340-sp`, ~115h idle) — the instance was correctly removed from memory (`Instance "X" - REMOVED` log line) and from the `Instance` table.

Related context
We're maintaining a downstream Docker overlay (`ghcr.io/inovacaoecrescimento/evolution-api-rigarr:2.3.7-zombiefix-1`) until this lands upstream. We'd love to drop the overlay and go back to using the official image — happy to address review feedback or split the change differently if useful.

Refs:

- EvolutionAPI#693 (instance/restart closes the session)
- EvolutionAPI#1286 (Connection Closed in v2.2.3)
- EvolutionAPI#2026 (Sync lost after reboot)
- EvolutionAPI#2027 (Loss of synchronization on reboot)
Signed-off-by: Bruno Cavalcante Sgarbi <bcsgarbi@gmail.com>
Summary by Sourcery

Bug Fixes:

- Make instance deletion resilient to Baileys WebSocket "zombie" instances by catching logout failures and always emitting the internal `remove.instance` cleanup event, so a stuck instance can be removed without restarting the whole process.