Description improvements - Unified #366
Conversation
Completed: Working on "Code Review". ✅ Workflow completed successfully.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the … You can disable this status message by setting the …

📝 Walkthrough

The pull request refactors the data sampling and column description strategy across loaders. It replaces distinct/count query approaches with a single-query sampling method (`_execute_sample_query`).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Loader as Loader<br/>(Graph/MySQL/Postgres)
    participant DB as Database
    participant LLM as LLM API
    participant Graph as Graph<br/>Storage
    Loader->>Loader: load_to_graph() / extract_tables_info()
    rect rgb(220, 240, 255)
        Note over Loader,DB: Sampling Phase (NEW)
        Loader->>DB: _execute_sample_query<br/>(for each column)
        DB-->>Loader: List[sample_values]
        Loader->>Loader: Build column_info with<br/>sample_values field
    end
    rect rgb(240, 255, 220)
        Note over Loader,LLM: Description Enhancement Phase (NEW)
        Loader->>LLM: create_combined_description<br/>(table_info, batch_size)
        LLM-->>Loader: Enhanced table descriptions
        Loader->>Loader: Update table_info with<br/>LLM descriptions
    end
    rect rgb(255, 240, 220)
        Note over Loader,Graph: Graph Building Phase
        Loader->>Loader: Build column nodes<br/>with augmented<br/>descriptions<br/>(original + sample_values)
        Loader->>Graph: Create nodes & relationships
        Graph-->>Loader: Confirmation
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
- ❌ Failed checks (1 inconclusive)
- ✅ Passed checks (2 passed)
🚅 Deployed to the QueryWeaver-pr-366 environment in queryweaver
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Pull request overview
This PR refactors the table and column description generation system to use batch completion for table descriptions and switches from distinct values to sample values for column descriptions.
Key Changes:
- Introduces a `create_combined_description()` function in `api/utils.py` that uses batch completion to generate AI-powered table descriptions
- Replaces distinct value extraction with random sample value extraction across all database loaders
- Separates sample values from column descriptions, storing them in a dedicated `sample_values` field that gets appended during graph loading (see the sketch below)
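To illustrate that last point, here is a minimal sketch of how the graph loader might append sample values to a description. `augment_description` and the exact separator text are assumptions for illustration, not the PR's actual code:

```python
from typing import Any, List

def augment_description(description: str, sample_values: List[Any]) -> str:
    """Append formatted sample values to a column description (hypothetical helper)."""
    if not sample_values:
        return description
    samples = ", ".join(str(v) for v in sample_values)
    return f"{description} Sample values: {samples}"

# Usage:
# augment_description("Customer email address.", ["a@example.com", "b@example.com"])
# -> "Customer email address. Sample values: a@example.com, b@example.com"
```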
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| `api/utils.py` | Adds new `create_combined_description()` function to generate table descriptions via batch completion API |
| `api/loaders/postgres_loader.py` | Replaces `_execute_count_query` and `_execute_distinct_query` with `_execute_sample_query` for random sampling |
| `api/loaders/mysql_loader.py` | Replaces count/distinct queries with a sample query implementation using MySQL's `RAND()` |
| `api/loaders/graph_loader.py` | Integrates table description generation and formats sample values into final column descriptions |
| `api/loaders/base_loader.py` | Updates the abstract base class to reflect the new sample-based approach instead of distinct value logic |
Summary of findings:
- Importance counts: BLOCKER 0, CRITICAL 0, MAJOR 4, MINOR 0, SUGGESTION 0, PRAISE 0 across api/utils.py and api/loaders/graph_loader.py. No blocker-level items were raised, but several major regressions were identified.
Key themes:
- The new `create_combined_description` flow is fragile: serialization fails on Decimal/datetime defaults, in-place mutation of `col_descriptions` disables batching, and batch LLM errors now bubble up without fallback, collectively threatening pipeline reliability and performance.
- Graph loader column embeddings are now inconsistent with the augmented descriptions once sample values are appended, undermining search relevance.
Recommended next steps:
- Sanitize or deep-copy `table_prop` before `json.dumps`, work on non-mutating copies so batching metadata survives, and wrap the batch completion call with retry/fallback behavior.
- Ensure embeddings are generated from the same final description text that users see after sample values are appended (or append before embedding) so vectors and descriptions stay aligned; see the sketch below.
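A minimal sketch of the second recommendation, reusing the hypothetical `augment_description` helper sketched earlier; `embed` stands in for the project's real embedding call, which is not shown in this review:

```python
from typing import Any, Callable, Dict, List

def finalize_column(column: Dict[str, Any], embed: Callable[[str], List[float]]) -> Dict[str, Any]:
    """Store and embed the same final description text, keeping vectors aligned."""
    final_description = augment_description(
        column.get("description", ""), column.get("sample_values", [])
    )
    column["description"] = final_description
    # Embedding the augmented text (not the raw description) keeps search consistent
    column["embedding"] = embed(final_description)
    return column
```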
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@coderabbitai review
✅ Actions performed: review triggered.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @api/loaders/base_loader.py:
- Around lines 42-67: The method `extract_sample_values_for_column` currently converts sample values to strings despite the docstring saying it returns raw values. Change the implementation to return raw values instead of stringifying them: call `cls._execute_sample_query(...)` as before, then validate that all items in `sample_values` are primitive types (e.g., `isinstance(v, (str, int, float))`) and, if so, return `sample_values` directly (`List[Any]`); otherwise return `[]`. Update only the logic in `extract_sample_values_for_column` and leave `_execute_sample_query` untouched. A sketch follows below.
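A minimal sketch of that fix, with the surrounding class stubbed out (reconstructed from the review text, not copied from the PR):

```python
from abc import ABC
from typing import Any, List

class BaseLoader(ABC):  # stand-in for the project's abstract base class
    @classmethod
    def extract_sample_values_for_column(
        cls, cursor, table_name: str, col_name: str
    ) -> List[Any]:
        """Return raw sample values, or [] when any value is not a simple primitive."""
        sample_values = cls._execute_sample_query(cursor, table_name, col_name)
        # Return values as-is instead of stringifying them
        if all(isinstance(v, (str, int, float)) for v in sample_values):
            return sample_values
        return []
```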
In @api/utils.py:
- Around lines 41-55: Do not mutate the original `table_prop`, and make JSON serialization robust: create a shallow copy (e.g., `table_prop_copy = table_prop.copy()`), remove the `"col_descriptions"` key from that copy (`table_prop_copy.pop("col_descriptions", None)`) so downstream code that reads `table_prop` still has the original data, and use `json.dumps(table_prop_copy, default=str)` when formatting the user prompt template to safely serialize non-JSON types; keep appending the original (unaltered) `table_prop` to any lists like `messages_list`/`table_keys` as before. A sketch follows below.
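A short runnable sketch of that suggestion; the template string and the sample `table_prop` are made up for illustration:

```python
import json
from decimal import Decimal

USER_PROMPT_TEMPLATE = "Describe this table:\n{table_json}"  # illustrative template

table_prop = {"name": "orders", "default_total": Decimal("0.00"), "col_descriptions": {}}

# Work on a shallow copy so the caller's table_prop is never mutated
table_prop_copy = table_prop.copy()
table_prop_copy.pop("col_descriptions", None)

# default=str safely serializes Decimal, datetime, and other non-JSON types
user_prompt = USER_PROMPT_TEMPLATE.format(
    table_json=json.dumps(table_prop_copy, default=str)
)
```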
- Around lines 57-76: Wrap the call to `batch_completion` (the call using `Config.COMPLETION_MODEL` with `messages=batch_messages`) in a try/except so transient API/network errors don't abort the loop; on exception, iterate the current batch's table keys (using `batch_start`, `batch_size`, and `table_keys`) and set each `table_info[table_name]["description"] = table_name` (or a fallback string), optionally log the exception, then continue to the next batch; keep the existing per-response exception handling for successful responses. A sketch follows below.
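A sketch of the suggested guard as it would sit inside the batching loop; `batch_completion` is assumed to follow litellm-style semantics, and the surrounding variable names are taken from the comment above:

```python
import logging

try:
    responses = batch_completion(
        model=Config.COMPLETION_MODEL, messages=batch_messages
    )
except Exception as exc:  # transient API/network errors must not abort the loop
    logging.warning("batch_completion failed at offset %d: %s", batch_start, exc)
    # Fall back to the table name as its own description for this batch
    for table_name in table_keys[batch_start : batch_start + batch_size]:
        table_info[table_name]["description"] = table_name
    continue  # proceed to the next batch
```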
- Around lines 10-77: `create_combined_description` currently exceeds pylint's local-variable limit; extract the per-table message construction into a small helper (e.g., `build_table_messages(table_name, table_prop, system_prompt, user_prompt_template)`) that returns the messages list, and use `table_prop.pop("col_descriptions", None)` to avoid extra locals; replace the inline construction in `create_combined_description` with calls to that helper and adjust the caller to append the returned messages and `table_name`, which will reduce the locals in `create_combined_description` enough to satisfy pylint. A sketch follows below.
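One way the suggested helper might look; the message structure and template placeholders are assumptions for illustration:

```python
import json
from typing import Any, Dict, List

def build_table_messages(
    table_name: str,
    table_prop: Dict[str, Any],
    system_prompt: str,
    user_prompt_template: str,
) -> List[Dict[str, str]]:
    """Build the chat messages for one table, extracted to reduce locals."""
    prop = table_prop.copy()
    prop.pop("col_descriptions", None)  # drop the verbose key only from the copy
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt_template.format(
                table_name=table_name,
                table_json=json.dumps(prop, default=str),
            ),
        },
    ]
```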
🧹 Nitpick comments (1)
api/loaders/mysql_loader.py (1)
Lines 57-74: Consider using parameterized identifiers for SQL construction. Static analysis flags potential SQL injection at lines 64-70. While `table_name` and `col_name` originate from `INFORMATION_SCHEMA` queries (not user input), using string formatting for SQL construction is generally discouraged. Consider using a SQL builder library that supports parameterized identifiers, or at minimum add identifier validation to ensure they match expected patterns.
♻️ Optional: Add identifier validation
```diff
 @staticmethod
 def _execute_sample_query(
     cursor, table_name: str, col_name: str, sample_size: int = 3
 ) -> List[Any]:
     """
     Execute query to get random sample values for a column.
     MySQL implementation using ORDER BY RAND() for random sampling.
     """
+    # Validate identifiers to prevent injection (identifiers come from INFORMATION_SCHEMA)
+    if not re.match(r'^[a-zA-Z0-9_]+$', table_name) or not re.match(r'^[a-zA-Z0-9_]+$', col_name):
+        return []
+
     query = f"""
         SELECT DISTINCT `{col_name}`
         FROM `{table_name}`
```

Note: This is defensive coding; the current implementation is likely safe given the data source.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- api/loaders/base_loader.py
- api/loaders/graph_loader.py
- api/loaders/mysql_loader.py
- api/loaders/postgres_loader.py
- api/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
All Python code should pass pylint (use make lint)
Files:
- api/loaders/graph_loader.py
- api/utils.py
- api/loaders/mysql_loader.py
- api/loaders/base_loader.py
- api/loaders/postgres_loader.py
🧬 Code graph analysis (3)
api/loaders/graph_loader.py (1)
- api/utils.py (1): create_combined_description (10-77)

api/utils.py (1)
- api/config.py (1): Config (62-148)

api/loaders/postgres_loader.py (2)
- api/loaders/base_loader.py (2): _execute_sample_query (26-40), extract_sample_values_for_column (43-67)
- api/loaders/mysql_loader.py (1): _execute_sample_query (57-74)
🪛 GitHub Actions: Pylint
api/utils.py
[error] 10-10: pylint: Too many local variables (R0914) in api/utils.py (16/15)
🪛 Ruff (0.14.10)
api/utils.py
23-23: Avoid specifying long messages outside the exception class
(TRY003)
api/loaders/mysql_loader.py
64-70: Possible SQL injection vector through string-based query construction
(S608)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (1)
api/loaders/postgres_loader.py (1)
Lines 54-76: LGTM: proper SQL injection protection. The implementation correctly uses `psycopg2.sql.Identifier()` for `table_name` and `col_name` (lines 71-72), providing robust protection against SQL injection. The query structure with a parameterized `LIMIT` (line 74) follows best practices.
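For reference, a minimal sketch of the pattern this comment praises; the exact query shape (e.g., `ORDER BY random()`) is an assumption, since only the identifier handling and the parameterized `LIMIT` are described here:

```python
from typing import Any, List
from psycopg2 import sql

def _execute_sample_query(
    cursor, table_name: str, col_name: str, sample_size: int = 3
) -> List[Any]:
    """Fetch random sample values using safely quoted identifiers."""
    query = sql.SQL(
        "SELECT DISTINCT {col} FROM {tbl} "
        "WHERE {col} IS NOT NULL ORDER BY random() LIMIT %s"
    ).format(col=sql.Identifier(col_name), tbl=sql.Identifier(table_name))
    cursor.execute(query, (sample_size,))  # LIMIT value is a bound parameter
    return [row[0] for row in cursor.fetchall()]
```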