fix(wren-ai-service): make DDLChunker synchronous to fix asyncio Task passed to embedding node #2194

Open
octo-patch wants to merge 2 commits into Canner:main from octo-patch:fix/issue-2138-sync-ddl-chunker

Conversation

@octo-patch octo-patch commented Apr 19, 2026

Fixes #2138

Problem

When Hamilton's AsyncDriver executes the DB schema indexing DAG, it wraps async nodes in asyncio Tasks. Under complex MDL schemas (many tables with many relationships), the async chunk node's Task was sometimes passed unawaited to the downstream embedding node.

This caused embedding to receive an asyncio Task object (e.g. "<Task finished name='Task-1479' coro=<AsyncGraphAd...") instead of the actual {"documents": [...]} dict, which then caused embedder.run() to send invalid content to the embedding endpoint, returning a 400 Bad Request from Ollama.
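The symptom can be illustrated with plain asyncio. This is a minimal sketch and does not reproduce Hamilton's AsyncDriver scheduling; `chunk` here is a stand-in coroutine and `describe_input` is a hypothetical helper, both introduced only for illustration. It shows the difference between handing a Task object downstream and handing over its awaited result.

```python
import asyncio


async def chunk() -> dict:
    # Stand-in for the async chunk node: returns the payload embedding expects.
    return {"documents": ["CREATE TABLE orders (id INT)"]}


def describe_input(value) -> str:
    # Hypothetical helper: classifies what a downstream node was handed.
    if isinstance(value, asyncio.Task):
        return "task"
    if isinstance(value, dict) and "documents" in value:
        return "documents"
    return "unknown"


async def demo() -> tuple:
    task = asyncio.create_task(chunk())
    unawaited_kind = describe_input(task)       # the Task object itself
    task_repr = repr(task)                      # what ends up in a request body
    awaited_kind = describe_input(await task)   # the dict, once awaited
    return unawaited_kind, task_repr, awaited_kind


unawaited_kind, task_repr, awaited_kind = asyncio.run(demo())
print(unawaited_kind, awaited_kind)
print(task_repr)
```

If the Task object is stringified and sent as document content, the embedding endpoint receives the `<Task ...>` repr rather than DDL text, which is consistent with the 400 described above.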

The bug was consistently reproducible with complex MDL schemas (20+ tables, 30+ relationships) but did not appear with simpler schemas, matching the issue description.

Solution

Make DDLChunker.run() and its internal helpers synchronous, eliminating the async Task from the Hamilton DAG. This matches the pattern used by all other indexing pipelines (HistoricalQuestion, TableDescription, ProjectMeta, SqlPairs) where the chunk node is already synchronous.

The async machinery in _model_preprocessor (using asyncio.gather) was unnecessary because:

  • MODEL_PREPROCESSORS is empty by default — the await in the dict comprehension never executes
  • All actual work (string manipulation, dict building) is CPU-bound and has no I/O to parallelize
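A simplified sketch of this argument, assuming an empty preprocessor registry as described above. The function names `preprocess_async`, `preprocess_sync`, and `run_async` are hypothetical and do not mirror the actual `db_schema.py` code; the point is only that with no I/O and no registered preprocessors, the `asyncio.gather` version and the plain loop produce the same result.

```python
import asyncio

# Assumption: mirrors the PR's statement that the registry is empty by default.
MODEL_PREPROCESSORS: dict = {}


async def preprocess_async(model: dict) -> dict:
    # With an empty registry, the await inside the comprehension never runs.
    extra = {name: await fn(model) for name, fn in MODEL_PREPROCESSORS.items()}
    return {"name": model["name"], **extra}


def preprocess_sync(model: dict) -> dict:
    # Same work without a coroutine: pure dict building, no I/O.
    extra = {name: fn(model) for name, fn in MODEL_PREPROCESSORS.items()}
    return {"name": model["name"], **extra}


async def run_async(models: list) -> list:
    # gather schedules the coroutines, but there is nothing to overlap.
    return list(await asyncio.gather(*(preprocess_async(m) for m in models)))


models = [{"name": "orders"}, {"name": "customers"}]
async_result = asyncio.run(run_async(models))
sync_result = [preprocess_sync(m) for m in models]
print(async_result == sync_result)
```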

Changes:

  • DDLChunker.run(): async def → def, removes await
  • DDLChunker._get_ddl_commands(): async def → def, removes await
  • DDLChunker._model_preprocessor(): async def → def, removes asyncio.gather
  • DDLChunker._preprocessor() (inner): async def → def, removes await in dict comprehension
  • chunk() Hamilton node: async def → def, removes await
  • Removes now-unused import asyncio
  • Updates tests to call chunker.run() synchronously

Testing

  • All existing unit tests in test_db_schema.py updated and pass with the sync interface
  • The test_pipeline_run integration test (which uses the full DBSchema Hamilton pipeline) continues to work since the downstream embedding, clean, and write nodes remain async
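The sync/async boundary can be sketched without Hamilton as follows. This is an illustration under assumptions, not the pipeline's real code: `chunk`, `embedding`, and `run_pipeline` are simplified stand-ins, and `asyncio.sleep(0)` takes the place of the I/O-bound embedder call.

```python
import asyncio


def chunk(mdl: dict) -> dict:
    # Synchronous stage: returns a plain dict, never a Task.
    return {"documents": [f"CREATE TABLE {m['name']}" for m in mdl["models"]]}


async def embedding(chunk_result: dict) -> dict:
    # Async stage: stand-in for the network call to the embedder.
    await asyncio.sleep(0)
    return {
        "documents": [
            {"content": doc, "embedding": [0.0]}
            for doc in chunk_result["documents"]
        ]
    }


async def run_pipeline(mdl: dict) -> dict:
    # The sync stage's output feeds the async stage directly.
    return await embedding(chunk(mdl))


result = asyncio.run(run_pipeline({"models": [{"name": "orders"}]}))
print(result["documents"][0]["content"])
```

Because the synchronous stage returns its value directly, there is no intermediate awaitable that could reach the async stage unresolved.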

Summary by CodeRabbit

  • Refactor
    • Converted the database schema indexing pipeline from asynchronous to synchronous execution, simplifying the DDL chunking and preprocessing flow while preserving indexing behavior.
  • Tests
    • Updated test suite to match the synchronous pipeline changes, converting async tests to synchronous equivalents.

…to embedding node

When Hamilton's AsyncDriver executes the indexing DAG, it wraps async
nodes in asyncio Tasks. Under complex MDL schemas with many relationships,
the async chunk node's Task was being passed unawaited to the downstream
embedding node instead of the actual dict result, causing the embedder to
receive an asyncio Task repr string rather than the document chunks.

This makes DDLChunker.run() and its helpers synchronous, matching the
pattern used by all other indexing pipelines (historical_question,
table_description, project_meta). The async machinery in
_model_preprocessor was unnecessary since MODEL_PREPROCESSORS is empty
by default and all helper operations are CPU-bound string manipulations.

Update tests to call chunker.run() synchronously accordingly.

Fixes Canner#2138
@coderabbitai
Contributor

coderabbitai bot commented Apr 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ff467508-971a-41c7-b1f2-8b310bf8faa0

📥 Commits

Reviewing files that changed from the base of the PR and between 166e4af and 33999d1.

📒 Files selected for processing (1)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py

Walkthrough

Convert the DB schema DDL chunking pipeline from asynchronous to synchronous execution by removing async/await from DDLChunker methods and the chunk pipeline stage, and update tests to call the synchronous APIs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| DB Schema Pipeline Synchronization: wren-ai-service/src/pipelines/indexing/db_schema.py | Changed DDLChunker.run, DDLChunker._model_preprocessor, DDLChunker._get_ddl_commands, and the chunk stage from async def to def; removed asyncio.gather and replaced awaited helper calls with direct synchronous calls. |
| Test Updates (sync conversion): wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py | Converted asyncio tests to synchronous: removed @pytest.mark.asyncio/async def where applicable and replaced await chunker.run(...) with chunker.run(...) (one test still awaits pipe.run(...)). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled at async threads all night,

Now syncy carrots gleam in morning light.
No Task-shaped shadows left to fear,
Chunks are plain text — the path is clear. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(wren-ai-service): make DDLChunker synchronous to fix asyncio Task passed to embedding node' accurately describes the main change: converting DDLChunker from async to synchronous to resolve the issue where asyncio Tasks were incorrectly passed to the embedding node. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from issue #2138: converting DDLChunker.run(), _get_ddl_commands(), _model_preprocessor(), and chunk() from async to sync functions, removing asyncio.gather and await statements, updating call sites, and updating tests accordingly. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: the PR only modifies DDLChunker and related test functions to convert from async to sync, directly addressing issue #2138 without introducing unrelated modifications. |

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py (1)

11-547: Optional: drop async/@pytest.mark.asyncio from chunker-only tests.

After this change, every test_* from test_empty_mdl through test_metric contains no await and only exercises the now-synchronous chunker.run(...). The async def signature and @pytest.mark.asyncio marker are redundant for these and can be converted to plain def to make the sync contract of DDLChunker.run explicit in the tests. test_pipeline_run (Line 550) must stay async since it awaits pipe.run(...) at Line 619.

♻️ Example diff for one test (apply the same pattern to the other chunker-only tests)
-@pytest.mark.asyncio
-async def test_empty_mdl():
+def test_empty_mdl():
     chunker = DDLChunker()
     mdl = {"models": [], "views": [], "relationships": [], "metrics": []}

     document = chunker.run(mdl, column_batch_size=1)
     assert document == {"documents": []}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py` around
lines 11 - 547, The tests that only call the synchronous DDLChunker.run
(test_empty_mdl, test_single_model, test_multiple_models,
test_column_is_primary_key, test_column_with_properties,
test_column_with_nested_columns, test_column_with_calculated_property,
test_column_with_relationship, test_column_batch_size, test_view,
test_view_with_properties, test_metric) should be converted from async def +
`@pytest.mark.asyncio` to plain def without the marker; remove the
`@pytest.mark.asyncio` decorator and change each async def test_* to def so they
clearly reflect that DDLChunker.run is synchronous, but leave test_pipeline_run
(which awaits pipe.run(...)) as async with its marker intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 926a368c-bf58-4f8e-850d-9e90e31c8e5a

📥 Commits

Reviewing files that changed from the base of the PR and between b6d5932 and 166e4af.

📒 Files selected for processing (2)
  • wren-ai-service/src/pipelines/indexing/db_schema.py
  • wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py

DDLChunker.run() is now synchronous, so the chunker test cases no longer
need pytest.mark.asyncio or async def. Only test_pipeline_run keeps
async because it still awaits DBSchema.run.
@octo-patch
Author

Thanks for the careful review @coderabbitai!

Pushed 33999d1a which drops @pytest.mark.asyncio and async from the chunker-only tests now that DDLChunker.run() is synchronous. test_pipeline_run keeps the async decorator because it still awaits DBSchema.run(...).

Please re-review when you have a chance.

@coderabbitai
Contributor

coderabbitai bot commented Apr 19, 2026

🧩 Analysis chain

Let me pull up the current state of the files to verify everything looks good.


🏁 Script executed:

#!/bin/bash
cat wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py

Length of output: 18048


🏁 Script executed:

#!/bin/bash
cat wren-ai-service/src/pipelines/indexing/db_schema.py

Length of output: 12861


@octo-patch: The changes in commit 33999d1 look great! Here's a summary of what I verified:

Test file (test_db_schema.py)

  • All 12 chunker-only tests (test_empty_mdl through test_metric) are now plain synchronous def with no @pytest.mark.asyncio decorator — exactly right since DDLChunker.run() is now synchronous.
  • test_pipeline_run correctly retains @pytest.mark.asyncio + async def + await pipe.run(...) because it exercises the full DBSchema async pipeline.

Source file (db_schema.py)

  • DDLChunker.run(), _model_preprocessor(), _get_ddl_commands(), and the chunk() Hamilton node are all plain synchronous def — no await, no asyncio.
  • The downstream Hamilton nodes (embedding, clean, write) and DBSchema.run() correctly remain async def, so the AsyncDriver graph still works end-to-end.
  • No import asyncio remains in the file.

The sync/async boundary is clean and the fix directly addresses the root cause: Hamilton's AsyncDriver no longer wraps chunk() output in an unawaited asyncio.Task before it reaches the embedding node. LGTM 🐇

Development

Successfully merging this pull request may close these issues.

400 on /api/embed: db_schema embedding receives asyncio Task repr instead of chunk text (Ollama embedder)

1 participant