
Conversation


@galshubeli galshubeli commented Jan 7, 2026

Summary by CodeRabbit

  • New Features
    • Database columns now include sample values automatically extracted alongside column metadata
    • Column descriptions are enhanced with actual sample data for improved context


@galshubeli galshubeli requested review from Copilot and gkorland January 7, 2026 13:09
@galshubeli galshubeli self-assigned this Jan 7, 2026

overcut-ai bot commented Jan 7, 2026

Completed Working on "Code Review"

✅ Workflow completed successfully.




coderabbitai bot commented Jan 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


The pull request refactors the data sampling and column description strategy across loaders. It replaces distinct/count query approaches with a single-query sampling method (_execute_sample_query) and introduces LLM-based table description generation via a new create_combined_description function. Column metadata now includes sampled values alongside existing fields.

Changes

  • Base Loader Abstraction (api/loaders/base_loader.py): Consolidates two abstract methods (_execute_count_query, _execute_distinct_query) into one (_execute_sample_query); renames extract_distinct_values_for_column to extract_sample_values_for_column with an updated return type (List[Any]); removes the Config dependency.
  • Database Loader Implementations (api/loaders/mysql_loader.py, api/loaders/postgres_loader.py): Both implement the new _execute_sample_query method with database-specific syntax (MySQL RAND(), PostgreSQL RANDOM()); extract_columns_info now populates a 'sample_values' field in the returned column descriptors; query logic shifts from aggregate counting to random sampling.
  • Graph Loader Enhancement (api/loaders/graph_loader.py): Imports and invokes create_combined_description early in the load flow; augments column node descriptions by appending sample values to the original descriptions before node creation.
  • LLM Description Generation (api/utils.py): Introduces a create_combined_description function that batches table metadata through the LLM (via litellm batch_completion) to generate enhanced table descriptions, with configurable batch size and error handling.
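Based on that summary, a minimal sketch of such a batched generator might look like the following. This is not the actual implementation: the completer parameter stands in for litellm.batch_completion, and the prompt text and fallback behavior are assumptions.

```python
import json
from typing import Any, Callable, Dict, List


def build_table_messages(table_name: str, table_prop: Dict[str, Any],
                         system_prompt: str) -> List[Dict[str, str]]:
    """Build one chat-message list per table; default=str keeps
    non-JSON values (Decimal, datetime) serializable."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"Describe table {table_name}: "
                    f"{json.dumps(table_prop, default=str)}"},
    ]


def create_combined_description(table_info: Dict[str, Dict[str, Any]],
                                completer: Callable, model: str,
                                batch_size: int = 10) -> None:
    """Generate table descriptions in batches; completer is expected to
    behave like litellm.batch_completion (one response per message list)."""
    system_prompt = "You write one-sentence descriptions of database tables."
    table_keys = list(table_info)
    for start in range(0, len(table_keys), batch_size):
        batch_keys = table_keys[start:start + batch_size]
        batch_messages = [build_table_messages(k, table_info[k], system_prompt)
                          for k in batch_keys]
        responses = completer(model=model, messages=batch_messages)
        for name, resp in zip(batch_keys, responses):
            try:
                table_info[name]["description"] = resp.choices[0].message.content
            except Exception:
                # litellm can return exception objects in the batch list;
                # fall back to the table name as the description
                table_info[name]["description"] = name
```

Because the LLM call is injected, the batching logic can be tested without network access.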

Sequence Diagram(s)

sequenceDiagram
    participant Loader as Loader<br/>(Graph/MySQL/Postgres)
    participant DB as Database
    participant LLM as LLM API
    participant Graph as Graph<br/>Storage

    Loader->>Loader: load_to_graph() / extract_tables_info()
    
    rect rgb(220, 240, 255)
    Note over Loader,DB: Sampling Phase (NEW)
    Loader->>DB: _execute_sample_query<br/>(for each column)
    DB-->>Loader: List[sample_values]
    Loader->>Loader: Build column_info with<br/>sample_values field
    end
    
    rect rgb(240, 255, 220)
    Note over Loader,LLM: Description Enhancement Phase (NEW)
    Loader->>LLM: create_combined_description<br/>(table_info, batch_size)
    LLM-->>Loader: Enhanced table descriptions
    Loader->>Loader: Update table_info with<br/>LLM descriptions
    end
    
    rect rgb(255, 240, 220)
    Note over Loader,Graph: Graph Building Phase
    Loader->>Loader: Build column nodes<br/>with augmented<br/>descriptions<br/>(original + sample_values)
    Loader->>Graph: Create nodes & relationships
    Graph-->>Loader: Confirmation
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hops of joy through sampling flows,
Where random data gently glows,
LLM whispers descriptions true,
Distinct no more—a fresh debut!
Sample by sample, the graph now grows. 🌿✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check: ❓ Inconclusive. The title 'Description improvements - Unified' is vague and generic, using non-descriptive terms that don't convey specific information about the substantial architectural changes in the changeset. Resolution: consider a more specific title that reflects the main change, such as 'Replace distinct/count queries with sampling approach for column values' or 'Unify column value extraction to use sampling instead of distinct counts'.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



railway-app bot commented Jan 7, 2026

🚅 Deployed to the QueryWeaver-pr-366 environment in queryweaver

  • QueryWeaver (Web): ✅ Success (View Logs), updated Jan 7, 2026 at 2:04 pm UTC


github-actions bot commented Jan 7, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None


Copilot AI left a comment


Pull request overview

This PR refactors the table and column description generation system to use batch completion for table descriptions and switches from distinct values to sample values for column descriptions.

Key Changes:

  • Introduces create_combined_description() function in api/utils.py that uses batch completion to generate AI-powered table descriptions
  • Replaces distinct value extraction with random sample value extraction across all database loaders
  • Separates sample values from column descriptions, storing them in a dedicated sample_values field that gets appended during graph loading
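The append step described in the last bullet could be sketched like this (augment_description is a hypothetical helper name, and the exact separator text is an assumption):

```python
from typing import Any, List


def augment_description(description: str, sample_values: List[Any]) -> str:
    """Append sampled values to a column description for extra context;
    an empty sample list leaves the description unchanged."""
    if not sample_values:
        return description
    samples = ", ".join(str(v) for v in sample_values)
    return f"{description} Sample values: {samples}"
```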

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • api/utils.py: Adds the new create_combined_description() function to generate table descriptions via the batch completion API
  • api/loaders/postgres_loader.py: Replaces _execute_count_query and _execute_distinct_query with _execute_sample_query for random sampling
  • api/loaders/mysql_loader.py: Replaces count/distinct queries with a sample query implementation using MySQL's RAND()
  • api/loaders/graph_loader.py: Integrates table description generation and formats sample values into final column descriptions
  • api/loaders/base_loader.py: Updates the abstract base class to reflect the new sample-based approach instead of distinct value logic


@overcut-ai overcut-ai bot left a comment


Summary of findings:

  • Importance counts: BLOCKER 0, CRITICAL 0, MAJOR 4, MINOR 0, SUGGESTION 0, PRAISE 0 across api/utils.py and api/loaders/graph_loader.py. No blocker-level items were raised, but several major regressions were identified.

Key themes:

  1. The new create_combined_description flow is fragile: serialization fails on Decimal/datetime defaults, in-place mutation of col_descriptions disables batching, and batch LLM errors now bubble up without fallback, collectively threatening pipeline reliability and performance.
  2. Graph loader column embeddings are now inconsistent with the augmented descriptions once sample values are appended, undermining search relevance.

Recommended next steps:

  • Sanitize or deep-copy table_prop before json.dumps, work on non-mutating copies so batching metadata survives, and wrap the batch completion call with retry/fallback behavior.
  • Ensure embeddings are generated from the same final description text that users see after sample values are appended (or append before embedding) so vectors and descriptions stay aligned.
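The second recommendation can be made concrete with a small sketch (all names here are hypothetical; embed stands for whatever text-to-vector callable the project uses): compute the embedding from the same final text that is stored on the node, after sample values are appended.

```python
from typing import Any, Callable, Dict, List


def build_column_node(description: str, sample_values: List[Any],
                      embed: Callable[[str], List[float]]) -> Dict[str, Any]:
    """Compute the embedding AFTER sample values are appended, so the
    stored vector and the stored description stay aligned."""
    final_text = description
    if sample_values:
        final_text += " Sample values: " + ", ".join(map(str, sample_values))
    return {"description": final_text, "embedding": embed(final_text)}
```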

@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:30 Destroyed
@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:33 Destroyed

gkorland commented Jan 7, 2026

@coderabbitai review


coderabbitai bot commented Jan 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @api/loaders/base_loader.py:
- Around line 42-67: The method extract_sample_values_for_column currently
converts sample values to strings despite the docstring saying it returns raw
values; change the implementation to return raw values instead of stringifying
them: call cls._execute_sample_query(...) as before, then validate that all
items in sample_values are primitive types (e.g., isinstance(v, (str, int,
float))) and if so return sample_values directly (List[Any]); otherwise return
[]. Update only the logic in extract_sample_values_for_column (and leave
_execute_sample_query untouched).

In @api/utils.py:
- Around line 41-55: Do not mutate the original table_prop and make JSON
serialization robust: create a shallow copy (e.g., table_prop_copy =
table_prop.copy()), remove the "col_descriptions" key from that copy
(table_prop_copy.pop("col_descriptions", None)) so downstream code that reads
table_prop still has the original data, and use json.dumps(table_prop_copy,
default=str) when formatting the user_prompt_template to safely serialize
non-JSON types; keep appending the original table_prop (unaltered) to any lists
like messages_list/table_keys as before.
- Around line 57-76: Wrap the call to batch_completion (the call using
Config.COMPLETION_MODEL with messages=batch_messages) in a try/except so
transient API/network errors don't abort the loop; on exception, iterate the
current batch's table keys (using batch_start, batch_size and table_keys) and
set each table_info[table_name]["description"] = table_name (or a fallback
string) and optionally log the exception, then continue to the next batch; keep
the existing per-response Exception handling for successful responses.
- Around line 10-77: create_combined_description currently exceeds pylint's
local-variable limit; extract the per-table message construction into a small
helper (e.g., build_table_messages(table_name, table_prop, system_prompt,
user_prompt_template)) that returns the messages list and use
table_prop.pop("col_descriptions", None) to avoid extra locals; replace the
inline construction in create_combined_description with calls to that helper and
adjust the caller to append the returned messages and table_name, which will
reduce locals in create_combined_description to satisfy pylint.
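Under the assumptions stated in the comments above (the table_info shape, litellm-style batch semantics, and the table-name fallback), the two api/utils.py fixes might be sketched as:

```python
import json
from typing import Any, Dict, List


def prepare_prompt(table_name: str, table_prop: Dict[str, Any],
                   user_prompt_template: str) -> str:
    """Work on a copy so the caller's table_prop is never mutated, and
    serialize with default=str so Decimal/datetime defaults don't raise."""
    table_prop_copy = table_prop.copy()
    table_prop_copy.pop("col_descriptions", None)
    return user_prompt_template.format(
        table_name=table_name,
        table_json=json.dumps(table_prop_copy, default=str),
    )


def run_batch_with_fallback(completer, model: str,
                            batch_messages: List[Any],
                            batch_keys: List[str],
                            table_info: Dict[str, Dict[str, Any]]) -> None:
    """Wrap the LLM call so one failed batch doesn't abort the pipeline:
    on error, fall back to the table name as its description."""
    try:
        responses = completer(model=model, messages=batch_messages)
    except Exception:
        for name in batch_keys:
            table_info[name]["description"] = name
        return
    for name, resp in zip(batch_keys, responses):
        try:
            table_info[name]["description"] = resp.choices[0].message.content
        except Exception:
            table_info[name]["description"] = name
```

The template placeholders (table_name, table_json) are illustrative; the real prompt format is not shown in this review.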
🧹 Nitpick comments (1)
api/loaders/mysql_loader.py (1)

57-74: Consider using parameterized identifiers for SQL construction.

Static analysis flags potential SQL injection at lines 64-70. While table_name and col_name originate from INFORMATION_SCHEMA queries (not user input), using string formatting for SQL construction is generally discouraged.

Consider using a SQL builder library that supports parameterized identifiers, or at minimum, add identifier validation to ensure they match expected patterns.

♻️ Optional: Add identifier validation (note: the snippet assumes import re at the top of the module)
     @staticmethod
     def _execute_sample_query(
         cursor, table_name: str, col_name: str, sample_size: int = 3
     ) -> List[Any]:
         """
         Execute query to get random sample values for a column.
         MySQL implementation using ORDER BY RAND() for random sampling.
         """
+        # Validate identifiers to prevent injection (identifiers from INFORMATION_SCHEMA)
+        if not re.match(r'^[a-zA-Z0-9_]+$', table_name) or not re.match(r'^[a-zA-Z0-9_]+$', col_name):
+            return []
+        
         query = f"""
             SELECT DISTINCT `{col_name}`
             FROM `{table_name}`

Note: This is defensive coding; the current implementation is likely safe given the data source.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be15e06 and 6aeb699.

📒 Files selected for processing (5)
  • api/loaders/base_loader.py
  • api/loaders/graph_loader.py
  • api/loaders/mysql_loader.py
  • api/loaders/postgres_loader.py
  • api/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

All Python code should pass pylint (use make lint)

Files:

  • api/loaders/graph_loader.py
  • api/utils.py
  • api/loaders/mysql_loader.py
  • api/loaders/base_loader.py
  • api/loaders/postgres_loader.py
🧬 Code graph analysis (3)
api/loaders/graph_loader.py (1)
api/utils.py (1)
  • create_combined_description (10-77)
api/utils.py (1)
api/config.py (1)
  • Config (62-148)
api/loaders/postgres_loader.py (2)
api/loaders/base_loader.py (2)
  • _execute_sample_query (26-40)
  • extract_sample_values_for_column (43-67)
api/loaders/mysql_loader.py (1)
  • _execute_sample_query (57-74)
🪛 GitHub Actions: Pylint
api/utils.py

[error] 10-10: pylint: Too many local variables (R0914) in api/utils.py (16/15)

🪛 Ruff (0.14.10)
api/utils.py

23-23: Avoid specifying long messages outside the exception class

(TRY003)

api/loaders/mysql_loader.py

64-70: Possible SQL injection vector through string-based query construction

(S608)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
api/loaders/postgres_loader.py (1)

54-76: LGTM: Proper SQL injection protection.

The implementation correctly uses psycopg2.sql.Identifier() for table_name and col_name (lines 71-72), providing robust protection against SQL injection. The query structure with parameterized LIMIT (line 74) follows best practices.
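For context, sql.Identifier quotes a name in double quotes and doubles any embedded quote. A pure-Python illustration of the resulting query shape follows (real code should keep using psycopg2.sql rather than string formatting; the LIMIT stays a bound parameter):

```python
def quote_ident(name: str) -> str:
    """PostgreSQL-style identifier quoting: wrap in double quotes and
    double any embedded double quote (what sql.Identifier produces)."""
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(table_name: str, col_name: str) -> str:
    """Illustrates the composed query's shape; %s remains a parameter."""
    col, tbl = quote_ident(col_name), quote_ident(table_name)
    return (f"SELECT DISTINCT {col} FROM {tbl} "
            f"WHERE {col} IS NOT NULL ORDER BY RANDOM() LIMIT %s")
```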

@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:43 Destroyed
@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:46 Destroyed
@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:47 Destroyed
@railway-app railway-app bot temporarily deployed to queryweaver / QueryWeaver-pr-366 January 7, 2026 13:51 Destroyed
@galshubeli galshubeli merged commit f0d79f7 into staging Jan 7, 2026
11 checks passed
@galshubeli galshubeli deleted the description-improvements branch January 7, 2026 14:36
