
fix(conversations): bounded ThreadPoolExecutor for background work #4827

Closed
beastoin wants to merge 4 commits into main from fix/process-conversation-thread-pool

Conversation

@beastoin
Collaborator

Summary

  • Replace 7+ raw threading.Thread().start() per conversation completion with ThreadPoolExecutor(max_workers=32)
  • Affected functions: save_structured_vector, _extract_memories, _extract_trends, _save_action_items, _update_goal_progress, conversation_created_webhook, update_personas_async, _run_auto_sync
  • Under sustained load, the old pattern spawned hundreds of threads per minute with no pooling or rate limiting
  • The bounded pool queues work when all workers are busy instead of spawning unlimited threads

Part of #4825 (Fix 2/3). Follow-up to PR #4784.

Test plan

  • Verify conversation processing still completes (memories extracted, trends saved, action items created, goals updated)
  • Verify webhook notifications still fire on conversation creation
  • Verify persona updates still happen after conversations
  • Load test: confirm thread count stays bounded under sustained conversation volume

🤖 Generated with Claude Code

…olExecutor

Each conversation completion was spawning 7+ raw threading.Thread() calls
(save_structured_vector, _extract_memories, _extract_trends, _save_action_items,
_update_goal_progress, conversation_created_webhook, update_personas_async).
No pooling, no rate limiting. Under sustained load this creates hundreds of
concurrent threads, each holding full Conversation objects in memory.

Replaced with a bounded ThreadPoolExecutor(max_workers=32) that queues work
when all workers are busy instead of spawning unlimited threads.

Found during deep memory leak audit (follow-up to PR #4784).
Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request is a solid improvement for the application's stability and resource management. Replacing unbounded threading.Thread creation with a bounded ThreadPoolExecutor is the correct approach to handle background tasks under sustained load. My review focuses on ensuring this new pattern is implemented robustly, with particular attention to exception handling and resource lifecycle management to prevent silent failures and resource leaks. All original comments are valid and have been kept.

Comment on lines +690 to +695
if not is_reprocess:
    _conversation_bg_executor.submit(save_structured_vector, uid, conversation)
    _conversation_bg_executor.submit(_extract_memories, uid, conversation)
    _conversation_bg_executor.submit(_extract_trends, uid, conversation)
    _conversation_bg_executor.submit(_save_action_items, uid, conversation)
    _conversation_bg_executor.submit(_update_goal_progress, uid, conversation)
Contributor


critical

By switching to ThreadPoolExecutor.submit, any exceptions that occur within the submitted tasks will be swallowed silently if the returned Future object is not checked. This is a critical issue as it can lead to silent failures of background processing, causing data inconsistencies or incomplete conversation processing without any logs.

Several of the submitted functions do not have top-level exception handling, including:

  • save_structured_vector
  • _extract_memories
  • _save_action_items

To fix this, you should add a try...except block within each of these functions to catch all exceptions and log them properly. For example, in save_structured_vector:

def save_structured_vector(uid: str, conversation: Conversation, update_only: bool = False):
    try:
        # ... function body ...
    except Exception as e:
        logging.error(f"Error in background task save_structured_vector for conv {conversation.id}: {e}", exc_info=True)

Please apply this pattern to all background tasks submitted to the executor that currently lack robust exception handling.
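One way to apply that pattern uniformly, rather than hand-editing each function body, is a small logging decorator (a sketch only, not from the PR; the decorator name `_log_bg_exceptions` is hypothetical):

```python
import functools
import logging

def _log_bg_exceptions(fn):
    """Catch and log any exception so executor-submitted tasks never fail silently."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Future.result() is never called on fire-and-forget submits,
            # so this is the only place the traceback gets recorded.
            logging.exception("Background task %s failed", fn.__name__)
            return None
    return wrapper

@_log_bg_exceptions
def save_structured_vector(uid, conversation):
    raise RuntimeError("simulated failure")  # stands in for real vector work
```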

Collaborator Author


Fixed in 9f21832. Added try/except Exception with logging.exception() to all executor-submitted functions:

  • save_structured_vector
  • _extract_memories
  • _extract_trends
  • _save_action_items (refactored to _save_action_items_inner)

All exceptions are now logged with full traceback and conversation ID for debugging.


# Bounded thread pool for post-conversation background work.
# Prevents thread explosion under sustained load (was spawning 7+ raw threads per conversation).
_conversation_bg_executor = ThreadPoolExecutor(max_workers=32, thread_name_prefix="conv-bg")
Contributor


high

The ThreadPoolExecutor is created as a global instance, but there's no corresponding call to shutdown() when the application exits. This can lead to tasks in the queue being lost during application restarts or deployments, causing incomplete background work. It's important to ensure a graceful shutdown.

You can register a shutdown function using atexit to ensure the executor waits for running tasks to complete before the process exits. You will also need to add import atexit at the top of the file.

Suggested change
- _conversation_bg_executor = ThreadPoolExecutor(max_workers=32, thread_name_prefix="conv-bg")
+ _conversation_bg_executor = ThreadPoolExecutor(max_workers=32, thread_name_prefix="conv-bg")
+ atexit.register(_conversation_bg_executor.shutdown, wait=True)

Collaborator Author


Fixed in 9f21832. Added import atexit and atexit.register(_conversation_bg_executor.shutdown, wait=True) immediately after executor creation.
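The behavior that `atexit.register(..., wait=True)` relies on can be observed in a standalone sketch (here `shutdown(wait=True)` is called explicitly rather than via `atexit`, so the draining is visible in one script):

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="demo")
results = []

# Queue five tasks on a two-worker pool: three of them sit in the
# internal queue while the first two run.
for i in range(5):
    executor.submit(lambda n=i: (time.sleep(0.05), results.append(n)))

# wait=True blocks until every queued task has run, so no work is lost
# on process exit.
executor.shutdown(wait=True)
```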

# Update persona prompts with new conversation
threading.Thread(target=update_personas_async, args=(uid,)).start()
_conversation_bg_executor.submit(conversation_created_webhook, uid, conversation)
_conversation_bg_executor.submit(update_personas_async, uid)
Contributor


high

The function update_personas_async is being submitted to the thread pool, but its implementation in utils/apps.py creates and manages its own threads using threading.Thread and join(). This is an anti-pattern when using a thread pool.

A worker from _conversation_bg_executor will be blocked waiting for the new threads inside update_personas_async to complete. This negates the benefits of the thread pool for this task and can still lead to an uncontrolled number of threads if many conversations are processed concurrently.

It's recommended to refactor update_personas_async to submit its individual persona update tasks (sync_update_persona_prompt) directly to the shared _conversation_bg_executor instead of creating new threads internally.

Collaborator Author


Fixed in 9f21832. Replaced the call to update_personas_async (which spawns inner threads) with a new _update_personas_via_pool function that:

  1. Checks rate limiting via can_update_persona(uid)
  2. Sets update timestamp via set_persona_update_timestamp(uid)
  3. Submits each individual persona update to the shared _conversation_bg_executor pool

This eliminates the nested thread anti-pattern — no more raw threading.Thread creation inside a pool worker.

@beastoin
Collaborator Author

Chaos Engineering Test Results — Thread Explosion

Test: 50 rapid conversation completions, each spawning 7 background tasks with 2-5s sleep (simulating slow LLM/DB work).

Metric                  | Vulnerable (main)      | Fixed (this PR)
Peak total threads      | 352                    | 34
Peak background threads | 350                    | 32 (pool cap)
Thread creation pattern | 50 × 7 = 350 unbounded | Queued at 32 workers

Verdict: PASS — Vulnerable explodes to 350 threads, fixed caps at 32.

Reproducer:

cd backend/testing/chaos-threadpool/
./run_chaos_test.sh

Test harness at backend/testing/chaos-threadpool/ — standalone Python, no Docker.
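The core of such a measurement can be sketched like this (an illustration of the idea, not the actual harness; it samples `threading.active_count()` while submitting sleeping tasks):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def measure_peak_threads(submit_fn, n_tasks=50):
    """Submit n_tasks sleeping tasks and return the peak active thread count seen."""
    peak = threading.active_count()
    for _ in range(n_tasks):
        submit_fn(lambda: time.sleep(0.1))
        peak = max(peak, threading.active_count())
    time.sleep(0.3)  # let the sleeping tasks drain
    return peak

pool = ThreadPoolExecutor(max_workers=32, thread_name_prefix="conv-bg")
bounded_peak = measure_peak_threads(lambda task: pool.submit(task))

# The old pattern: one raw thread per task, no cap.
unbounded_peak = measure_peak_threads(lambda task: threading.Thread(target=task).start())
pool.shutdown(wait=True)
```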

Kelvin (AI Agent) and others added 3 commits February 15, 2026 04:36
…read fix

- Add try/except to all executor-submitted functions to prevent silent
  failures: save_structured_vector, _extract_memories, _extract_trends,
  _save_action_items
- Register atexit.shutdown(wait=True) for graceful executor cleanup
- Replace update_personas_async (spawns raw threads inside pool worker)
  with _update_personas_via_pool that submits individual persona updates
  to the shared executor, eliminating the nested thread anti-pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wraps asyncio.run(auto_sync_action_items_batch) with try/except
to prevent silent swallowing of exceptions when submitted via
_conversation_bg_executor.submit().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 tests covering:
- Executor setup: max_workers cap, atexit registration, submit returns Future
- Exception handling: logged vs swallowed, wrapped functions don't propagate
- Persona pool: rate limiting, empty list, per-persona submission, fault isolation
- Boundary: concurrent tasks capped at max_workers, contrast with raw threads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin
Collaborator Author

Closing for now — will revisit and review later.

@beastoin beastoin closed this Feb 21, 2026
@github-actions
Contributor

Hey @beastoin 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜
