Description improvements - Unified #366
Conversation
Completed: Working on "Code Review". ✅ Workflow completed successfully.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the … You can disable this status message by setting the …

📝 Walkthrough

The pull request refactors the data sampling and column description strategy across loaders. It replaces distinct/count query approaches with a single-query sampling method (`_execute_sample_query`).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Loader as Loader<br/>(Graph/MySQL/Postgres)
    participant DB as Database
    participant LLM as LLM API
    participant Graph as Graph<br/>Storage
    Loader->>Loader: load_to_graph() / extract_tables_info()
    rect rgb(220, 240, 255)
        Note over Loader,DB: Sampling Phase (NEW)
        Loader->>DB: _execute_sample_query<br/>(for each column)
        DB-->>Loader: List[sample_values]
        Loader->>Loader: Build column_info with<br/>sample_values field
    end
    rect rgb(240, 255, 220)
        Note over Loader,LLM: Description Enhancement Phase (NEW)
        Loader->>LLM: create_combined_description<br/>(table_info, batch_size)
        LLM-->>Loader: Enhanced table descriptions
        Loader->>Loader: Update table_info with<br/>LLM descriptions
    end
    rect rgb(255, 240, 220)
        Note over Loader,Graph: Graph Building Phase
        Loader->>Loader: Build column nodes<br/>with augmented<br/>descriptions<br/>(original + sample_values)
        Loader->>Graph: Create nodes & relationships
        Graph-->>Loader: Confirmation
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
- ❌ Failed checks (1 inconclusive)
- ✅ Passed checks (2 passed)
🚅 Deployed to the QueryWeaver-pr-366 environment in queryweaver
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Pull request overview
This PR refactors the table and column description generation system to use batch completion for table descriptions and switches from distinct values to sample values for column descriptions.
Key Changes:
- Introduces a `create_combined_description()` function in `api/utils.py` that uses batch completion to generate AI-powered table descriptions
- Replaces distinct value extraction with random sample value extraction across all database loaders
- Separates sample values from column descriptions, storing them in a dedicated `sample_values` field that gets appended during graph loading (see the sketch below)
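To illustrate that last point, here is a minimal sketch of how the graph loader might append sample values to a description. `augment_description` and the exact separator text are assumptions for illustration, not the PR's actual code:

```python
from typing import Any, List

def augment_description(description: str, sample_values: List[Any]) -> str:
    """Append formatted sample values to a column description (hypothetical helper)."""
    if not sample_values:
        return description
    samples = ", ".join(str(v) for v in sample_values)
    return f"{description} Sample values: {samples}"

# Usage:
# augment_description("Customer email address.", ["a@example.com", "b@example.com"])
# -> "Customer email address. Sample values: a@example.com, b@example.com"
```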
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| `api/utils.py` | Adds new `create_combined_description()` function to generate table descriptions via batch completion API |
| `api/loaders/postgres_loader.py` | Replaces `_execute_count_query` and `_execute_distinct_query` with `_execute_sample_query` for random sampling |
| `api/loaders/mysql_loader.py` | Replaces count/distinct queries with a sample query implementation using MySQL's `RAND()` |
| `api/loaders/graph_loader.py` | Integrates table description generation and formats sample values into final column descriptions |
| `api/loaders/base_loader.py` | Updates the abstract base class to reflect the new sample-based approach instead of distinct value logic |
Summary of findings:
- Importance counts: BLOCKER 0, CRITICAL 0, MAJOR 4, MINOR 0, SUGGESTION 0, PRAISE 0 across api/utils.py and api/loaders/graph_loader.py. No blocker-level items were raised, but several major regressions were identified.
Key themes:
- The new `create_combined_description` flow is fragile: serialization fails on Decimal/datetime defaults, in-place mutation of `col_descriptions` disables batching, and batch LLM errors now bubble up without fallback, collectively threatening pipeline reliability and performance.
- Graph loader column embeddings are now inconsistent with the augmented descriptions once sample values are appended, undermining search relevance.
Recommended next steps:
- Sanitize or deep-copy `table_prop` before `json.dumps`, work on non-mutating copies so batching metadata survives, and wrap the batch completion call with retry/fallback behavior.
- Ensure embeddings are generated from the same final description text that users see after sample values are appended (or append before embedding) so vectors and descriptions stay aligned; see the sketch below.
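A minimal sketch of the second recommendation, reusing the hypothetical `augment_description` helper sketched earlier; `embed` stands in for the project's real embedding call, which is not shown in this review:

```python
from typing import Any, Callable, Dict, List

def finalize_column(column: Dict[str, Any], embed: Callable[[str], List[float]]) -> Dict[str, Any]:
    """Store and embed the same final description text, keeping vectors aligned."""
    final_description = augment_description(
        column.get("description", ""), column.get("sample_values", [])
    )
    column["description"] = final_description
    # Embedding the augmented text (not the raw description) keeps search consistent
    column["embedding"] = embed(final_description)
    return column
```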
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@coderabbitai review
✅ Actions performed: review triggered.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @api/loaders/base_loader.py:
- Around lines 42-67: The method `extract_sample_values_for_column` currently converts sample values to strings despite the docstring saying it returns raw values. Change the implementation to return raw values instead of stringifying them: call `cls._execute_sample_query(...)` as before, then validate that all items in `sample_values` are primitive types (e.g., `isinstance(v, (str, int, float))`) and, if so, return `sample_values` directly (`List[Any]`); otherwise return `[]`. Update only the logic in `extract_sample_values_for_column` and leave `_execute_sample_query` untouched. A sketch follows below.
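A minimal sketch of that fix, with the surrounding class stubbed out (reconstructed from the review text, not copied from the PR):

```python
from abc import ABC
from typing import Any, List

class BaseLoader(ABC):  # stand-in for the project's abstract base class
    @classmethod
    def extract_sample_values_for_column(
        cls, cursor, table_name: str, col_name: str
    ) -> List[Any]:
        """Return raw sample values, or [] when any value is not a simple primitive."""
        sample_values = cls._execute_sample_query(cursor, table_name, col_name)
        # Return values as-is instead of stringifying them
        if all(isinstance(v, (str, int, float)) for v in sample_values):
            return sample_values
        return []
```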
In @api/utils.py:
- Around lines 41-55: Do not mutate the original `table_prop`, and make JSON serialization robust: create a shallow copy (e.g., `table_prop_copy = table_prop.copy()`), remove the `"col_descriptions"` key from that copy (`table_prop_copy.pop("col_descriptions", None)`) so downstream code that reads `table_prop` still has the original data, and use `json.dumps(table_prop_copy, default=str)` when formatting the user prompt template to safely serialize non-JSON types; keep appending the original (unaltered) `table_prop` to any lists like `messages_list`/`table_keys` as before. A sketch follows below.
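A short runnable sketch of that suggestion; the template string and the sample `table_prop` are made up for illustration:

```python
import json
from decimal import Decimal

USER_PROMPT_TEMPLATE = "Describe this table:\n{table_json}"  # illustrative template

table_prop = {"name": "orders", "default_total": Decimal("0.00"), "col_descriptions": {}}

# Work on a shallow copy so the caller's table_prop is never mutated
table_prop_copy = table_prop.copy()
table_prop_copy.pop("col_descriptions", None)

# default=str safely serializes Decimal, datetime, and other non-JSON types
user_prompt = USER_PROMPT_TEMPLATE.format(
    table_json=json.dumps(table_prop_copy, default=str)
)
```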
- Around lines 57-76: Wrap the call to `batch_completion` (the call using `Config.COMPLETION_MODEL` with `messages=batch_messages`) in a try/except so transient API/network errors don't abort the loop; on exception, iterate the current batch's table keys (using `batch_start`, `batch_size`, and `table_keys`) and set each `table_info[table_name]["description"] = table_name` (or a fallback string), optionally log the exception, then continue to the next batch; keep the existing per-response exception handling for successful responses. A sketch follows below.
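A sketch of the suggested guard as it would sit inside the batching loop; `batch_completion` is assumed to follow litellm-style semantics, and the surrounding variable names are taken from the comment above:

```python
import logging

try:
    responses = batch_completion(
        model=Config.COMPLETION_MODEL, messages=batch_messages
    )
except Exception as exc:  # transient API/network errors must not abort the loop
    logging.warning("batch_completion failed at offset %d: %s", batch_start, exc)
    # Fall back to the table name as its own description for this batch
    for table_name in table_keys[batch_start : batch_start + batch_size]:
        table_info[table_name]["description"] = table_name
    continue  # proceed to the next batch
```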
- Around lines 10-77: `create_combined_description` currently exceeds pylint's local-variable limit; extract the per-table message construction into a small helper (e.g., `build_table_messages(table_name, table_prop, system_prompt, user_prompt_template)`) that returns the messages list, and use `table_prop.pop("col_descriptions", None)` to avoid extra locals; replace the inline construction in `create_combined_description` with calls to that helper and adjust the caller to append the returned messages and `table_name`, which will reduce the locals in `create_combined_description` enough to satisfy pylint. A sketch follows below.
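One way the suggested helper might look; the message structure and template placeholders are assumptions for illustration:

```python
import json
from typing import Any, Dict, List

def build_table_messages(
    table_name: str,
    table_prop: Dict[str, Any],
    system_prompt: str,
    user_prompt_template: str,
) -> List[Dict[str, str]]:
    """Build the chat messages for one table, extracted to reduce locals."""
    prop = table_prop.copy()
    prop.pop("col_descriptions", None)  # drop the verbose key only from the copy
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": user_prompt_template.format(
                table_name=table_name,
                table_json=json.dumps(prop, default=str),
            ),
        },
    ]
```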
🧹 Nitpick comments (1)
api/loaders/mysql_loader.py (1)
Lines 57-74: Consider using parameterized identifiers for SQL construction. Static analysis flags potential SQL injection at lines 64-70. While `table_name` and `col_name` originate from `INFORMATION_SCHEMA` queries (not user input), using string formatting for SQL construction is generally discouraged. Consider using a SQL builder library that supports parameterized identifiers, or at minimum add identifier validation to ensure they match expected patterns.
♻️ Optional: Add identifier validation
```diff
 @staticmethod
 def _execute_sample_query(
     cursor, table_name: str, col_name: str, sample_size: int = 3
 ) -> List[Any]:
     """
     Execute query to get random sample values for a column.
     MySQL implementation using ORDER BY RAND() for random sampling.
     """
+    # Validate identifiers to prevent injection (identifiers come from INFORMATION_SCHEMA)
+    if not re.match(r'^[a-zA-Z0-9_]+$', table_name) or not re.match(r'^[a-zA-Z0-9_]+$', col_name):
+        return []
+
     query = f"""
         SELECT DISTINCT `{col_name}`
         FROM `{table_name}`
```

Note: This is defensive coding; the current implementation is likely safe given the data source.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- api/loaders/base_loader.py
- api/loaders/graph_loader.py
- api/loaders/mysql_loader.py
- api/loaders/postgres_loader.py
- api/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
All Python code should pass pylint (use make lint)
Files:
- api/loaders/graph_loader.py
- api/utils.py
- api/loaders/mysql_loader.py
- api/loaders/base_loader.py
- api/loaders/postgres_loader.py
🧬 Code graph analysis (3)
api/loaders/graph_loader.py (1)
- api/utils.py (1): create_combined_description (10-77)

api/utils.py (1)
- api/config.py (1): Config (62-148)

api/loaders/postgres_loader.py (2)
- api/loaders/base_loader.py (2): _execute_sample_query (26-40), extract_sample_values_for_column (43-67)
- api/loaders/mysql_loader.py (1): _execute_sample_query (57-74)
🪛 GitHub Actions: Pylint
api/utils.py
[error] 10-10: pylint: Too many local variables (R0914) in api/utils.py (16/15)
🪛 Ruff (0.14.10)
api/utils.py
23-23: Avoid specifying long messages outside the exception class
(TRY003)
api/loaders/mysql_loader.py
64-70: Possible SQL injection vector through string-based query construction
(S608)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (1)
api/loaders/postgres_loader.py (1)
Lines 54-76: LGTM: proper SQL injection protection. The implementation correctly uses `psycopg2.sql.Identifier()` for `table_name` and `col_name` (lines 71-72), providing robust protection against SQL injection. The query structure with a parameterized `LIMIT` (line 74) follows best practices.
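For reference, a minimal sketch of the pattern this comment praises; the exact query shape (e.g., `ORDER BY random()`) is an assumption, since only the identifier handling and the parameterized `LIMIT` are described here:

```python
from typing import Any, List
from psycopg2 import sql

def _execute_sample_query(
    cursor, table_name: str, col_name: str, sample_size: int = 3
) -> List[Any]:
    """Fetch random sample values using safely quoted identifiers."""
    query = sql.SQL(
        "SELECT DISTINCT {col} FROM {tbl} "
        "WHERE {col} IS NOT NULL ORDER BY random() LIMIT %s"
    ).format(col=sql.Identifier(col_name), tbl=sql.Identifier(table_name))
    cursor.execute(query, (sample_size,))  # LIMIT value is a bound parameter
    return [row[0] for row in cursor.fetchall()]
```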