@sciapanCA
Contributor

No description provided.

@gemini-code-assist

Summary of Changes

Hello @sciapanCA, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring to how data sources are identified and utilized within the system. Instead of relying on internal, opaque data source IDs, the system now leverages more user-friendly and descriptive data source names. This change streamlines the user experience when interacting with tools that require specifying data sources, such as for code consultation and search. The update includes modifications to function signatures, internal processing logic, and comprehensive test coverage, while also ensuring backward compatibility for existing integrations.

Highlights

  • Refactored Data Source Handling: The codebase now primarily uses 'data source names' instead of 'data source IDs' for specifying repositories and workspaces in tools like codebase_consultant and codebase_search.
  • Updated Tool Signatures and Logic: The codebase_consultant and codebase_search functions have been updated to accept a data_sources parameter (expecting names), and their internal logic now processes these names for API requests.
  • New Utility Functions for Name Normalization: New utility functions, normalize_data_source_names and format_data_source_names, have been introduced to handle various input formats for data source names, including strings, lists, and dictionaries, prioritizing the 'name' field.
  • Backward Compatibility: Aliases have been added in src/utils/errors.py to ensure that older calls using normalize_data_source_ids and format_data_source_ids continue to function by mapping to the new name-based implementations.
  • Enhanced Documentation and Tests: Docstrings, example usage, and test cases across multiple files have been updated to reflect the shift to data source names and to validate the new normalization logic.
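
As a rough illustration of the normalization described in the highlights, the sketch below accepts a single string, a list, or dictionaries and returns a flat list of names. The function name `normalize_names_sketch` is hypothetical, and the fallback from "name" to "id" is an assumption borrowed from the review suggestion later in this thread; the actual `normalize_data_source_names` in the PR may behave differently.

```python
from typing import Any, List


def normalize_names_sketch(data_sources: Any) -> List[str]:
    """Hypothetical stand-in, not the PR's normalize_data_source_names."""
    if data_sources is None:
        return []
    # Treat a single value (string or dict) as a one-element list.
    if isinstance(data_sources, (list, tuple)):
        items = list(data_sources)
    else:
        items = [data_sources]

    names: List[str] = []
    for item in items:
        if isinstance(item, dict):
            # Assumption: prefer "name", fall back to "id".
            candidate = item.get("name") or item.get("id")
        else:
            candidate = item
        # Keep only non-empty strings; everything else is dropped.
        if isinstance(candidate, str) and candidate.strip():
            names.append(candidate.strip())
    return names


print(normalize_names_sketch([{"name": "repo-a", "id": "123"}, " ws-b "]))
```

Dropping non-string entries silently is only one possible policy; rejecting them with an error is another.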

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the codebase to use data source names instead of IDs, which is a good improvement for usability. The changes are mostly consistent across the tools and tests. However, I've found a critical issue with how backward compatibility is handled, which could lead to silent breaking changes, and an incorrect test case that should be fixed.

Comment on lines 115 to 117
# Backward compatibility aliases for legacy imports
normalize_data_source_ids = normalize_data_source_names
format_data_source_ids = format_data_source_names

critical

Aliasing format_data_source_ids to format_data_source_names introduces a significant and silent breaking change. The original format_data_source_ids returned a list of dictionaries (e.g., [{"id": "id1"}]), while the new function returns a flat list of strings (e.g., ["id1"]).

Any part of the codebase that still imports and uses format_data_source_ids will now receive an unexpected data structure, which will likely cause runtime errors in downstream code. This is a very risky change.
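
The failure mode can be reproduced with a small, self-contained sketch. Both formatters and the `build_payload` consumer are hypothetical stand-ins written to match the return shapes described in this comment; they are not the repository's code.

```python
def format_ids_old(ids):
    """Stand-in for the old contract: a list of {"id": ...} dictionaries."""
    return [{"id": i} for i in ids]


def format_names_new(names):
    """Stand-in for the new contract: a flat list of strings."""
    return [str(n) for n in names]


def build_payload(formatter, sources):
    """Hypothetical downstream consumer written against the old contract."""
    return [entry["id"] for entry in formatter(sources)]


old_result = build_payload(format_ids_old, ["id1"])

try:
    build_payload(format_names_new, ["id1"])
    failure = None
except TypeError as exc:
    # Indexing a str with "id" raises TypeError, so the alias breaks the caller.
    failure = exc

print(old_result, type(failure).__name__)
```

Nothing fails at import time; the error surfaces only when the consumer touches the returned entries.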

To avoid this, I recommend one of the following approaches:

  1. If format_data_source_ids is no longer used anywhere, it should be completely removed along with its export from utils/__init__.py.
  2. If it is still used, you should perform a full repository-wide refactor to replace all usages with the new function, and then remove the old one.
  3. If a gradual migration is needed, the old function should be kept and marked as deprecated, rather than being silently replaced by an alias.
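
For the third option, one possible shape is a thin wrapper that keeps the old return type and emits a `DeprecationWarning`. This is a sketch under assumptions: the `format_data_source_names` body here is a simplified stand-in, and the old contract is taken from the `[{"id": "id1"}]` example above.

```python
import warnings
from typing import Dict, List, Optional


def format_data_source_names(data_sources: Optional[list]) -> List[str]:
    """Simplified stand-in for the new name-based formatter."""
    return [
        ds.strip()
        for ds in (data_sources or [])
        if isinstance(ds, str) and ds.strip()
    ]


def format_data_source_ids(data_sources: Optional[list]) -> List[Dict[str, str]]:
    """Deprecated: preserves the old list-of-dicts contract during migration."""
    warnings.warn(
        "format_data_source_ids is deprecated; use format_data_source_names",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return [{"id": name} for name in format_data_source_names(data_sources)]
```

Existing callers keep receiving dictionaries, and the warning makes remaining usages visible in test runs before the old function is removed.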

@codealive-github-integration

🔍 Code Review

Hello @sciapanCA, thank you for your contribution to the 'Feature/mcp names utilize' pull request!

Overall, this pull request moves towards utilizing data source names, which is a positive architectural direction. However, the review has identified critical architectural and security concerns that require immediate attention to ensure system stability and data integrity.

🎯 Key Changes & Highlights

✨ Data Source Name Transition

The shift towards utilizing data source names instead of generic IDs is a beneficial architectural evolution. This change enhances clarity and semantic meaning, improving readability and maintainability for future development and debugging efforts.

🚨 Issues & Suggestions

🏗️ Breaking Alias Contract

A backward compatibility alias for `format_data_source_ids` introduces a breaking change in its return type, risking runtime errors across the system.

Problem:
The change introduces a backward compatibility alias format_data_source_ids = format_data_source_names in src/utils/errors.py. However, the original format_data_source_ids function was designed to return a list of dictionaries (e.g., [{"id": "id1"}]), while the new format_data_source_names function returns a simple list of strings (e.g., ["id1"]). This fundamental change in return type for an aliased function constitutes a breaking change in the API contract for any existing code that still imports and uses format_data_source_ids. This will lead to runtime errors in consumers expecting the old dictionary format, severely compromising system integrity during the migration period.

Fix:
The current alias format_data_source_ids = format_data_source_names should be removed. Instead, a clear migration strategy is needed to ensure backward compatibility or a controlled breaking change:

  1. Option A (Recommended for backward compatibility): Keep the original format_data_source_ids function (or a wrapper that preserves its original return type) for backward compatibility, and introduce format_data_source_names as a distinct new function. All call sites of format_data_source_ids must then be explicitly refactored to use format_data_source_names when the new behavior is desired.
  2. Option B (Forcing explicit migration): If format_data_source_ids must be deprecated or removed, then all its existing call sites must be identified and refactored before this change is deployed. No alias should be used if it fundamentally alters the function's contract.
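
Identifying the existing call sites, as Option B requires, could be done with a small script built on the standard `ast` module. The helper names `find_legacy_usages` and `scan_tree` are hypothetical, and the sketch only finds direct references by name; dynamic lookups such as `getattr` would be missed.

```python
import ast
from pathlib import Path
from typing import List, Tuple

LEGACY_NAMES = {"normalize_data_source_ids", "format_data_source_ids"}


def find_legacy_usages(source: str, filename: str = "<string>") -> List[Tuple[str, int, str]]:
    """Return (file, line, name) for each reference to a legacy helper."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Name) and node.id in LEGACY_NAMES:
            hits.append((filename, node.lineno, node.id))
        elif isinstance(node, ast.Attribute) and node.attr in LEGACY_NAMES:
            hits.append((filename, node.lineno, node.attr))
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name in LEGACY_NAMES:
                    hits.append((filename, node.lineno, alias.name))
    return sorted(hits)


def scan_tree(root: str) -> List[Tuple[str, int, str]]:
    """Scan every .py file under root for legacy helper references."""
    results = []
    for path in Path(root).rglob("*.py"):
        results.extend(find_legacy_usages(path.read_text(encoding="utf-8"), str(path)))
    return results
```

Running `scan_tree("src")` (the directory name is an assumption) before removing the alias would give a list of files and line numbers to refactor.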

Specific Actionable Steps:
Modify src/utils/errors.py (lines 119-120 in the modified code) to remove or correctly implement backward compatibility for format_data_source_ids:

# Remove these aliases if they break existing contracts
# normalize_data_source_ids = normalize_data_source_names
# format_data_source_ids = format_data_source_names

# OR, if backward compatibility is critical, provide a wrapper that preserves the old contract:
# def format_data_source_ids(data_sources: Optional[list]) -> list:
#     # This would need to convert the output of format_data_source_names back to the old dict format
#     names = format_data_source_names(data_sources)
#     return [{"id": name} for name in names]

🔐 Unsafe Data Source Handling

The `format_data_source_names` function implicitly converts unexpected types, potentially leading to malformed data and security risks.

Problem:
The format_data_source_names function in src/utils/errors.py implicitly converts non-string, non-dictionary types (and dictionary values that are not strings) into strings using str(). As a result, string representations of whole objects (e.g., "{'key': 'value'}", "[1, 2]") can be passed to the backend API as data source names. The immediate impact is most likely a backend error, but if the backend's name resolution is overly permissive or has parsing quirks, this could also cause unexpected behavior or unintended data source resolution. Data source names are security-sensitive because they control access to codebases.
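
To make the concern concrete, the snippet below shows what Python's built-in `str()` produces for a few non-string inputs. The `stringify_fallback` helper is a hypothetical stand-in for the implicit conversion described above, not the PR's code.

```python
def stringify_fallback(value):
    """Mimics an implicit str() conversion of an unexpected input."""
    return str(value)


examples = [{"key": "value"}, [1, 2], 42, None]
rendered = [stringify_fallback(v) for v in examples]

for original, text in zip(examples, rendered):
    print(f"{type(original).__name__:>8} -> {text!r}")
# rendered == ["{'key': 'value'}", "[1, 2]", "42", "None"]
```

Each of these strings would be sent as if it were a data source name.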

Fix:
The format_data_source_names function should explicitly validate and only accept expected types (strings or dictionaries that contain string names/ids). Any other types should be explicitly rejected or handled with a clear error/warning, rather than implicitly converted. This ensures that only well-formed data source names are sent to the backend, reducing the attack surface and improving predictability.

Specific Actionable Steps:
Modify src/utils/errors.py in the format_data_source_names function (lines 106-117 in the modified code) to explicitly handle expected types and skip or log unexpected ones:

from typing import Optional


def format_data_source_names(data_sources: Optional[list]) -> list:
    formatted: list[str] = []
    if not data_sources:
        return formatted

    for ds in data_sources:
        if isinstance(ds, str):
            name = ds.strip()
            if name:
                formatted.append(name)
        elif isinstance(ds, dict):
            # Prefer the "name" field and fall back to "id"
            name = ds.get("name") or ds.get("id")
            if isinstance(name, str):
                name = name.strip()
                if name:
                    formatted.append(name)
            else:
                # Log a warning or raise an error for unexpected type in dict value
                # For now, skip to avoid sending malformed names
                # Example: logging.warning(f"Skipping data source with unexpected name type in dict: {type(name)}")
                pass
        else:
            # Log a warning or raise an error for unexpected data source type
            # This prevents complex objects (like lists, numbers, or booleans passed directly as ds) from being stringified.
            # Example: logging.warning(f"Skipping data source with unexpected type: {type(ds)}")
            pass
    return formatted
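
The suggestion above skips unexpected entries. The other alternative named in the fix, explicit rejection, might look like the following sketch. The name `validate_data_source_names` and the choice of `TypeError`/`ValueError` are assumptions for illustration only.

```python
from typing import Any, List, Optional


def validate_data_source_names(data_sources: Optional[List[Any]]) -> List[str]:
    """Hypothetical strict variant: raise instead of skipping bad entries."""
    names: List[str] = []
    for index, ds in enumerate(data_sources or []):
        if isinstance(ds, dict):
            candidate = ds.get("name") or ds.get("id")
        else:
            candidate = ds
        if not isinstance(candidate, str):
            raise TypeError(
                f"data_sources[{index}] must be a name string or a dict with a "
                f"string 'name' or 'id', got {type(candidate).__name__}"
            )
        candidate = candidate.strip()
        if not candidate:
            raise ValueError(f"data_sources[{index}] is an empty name")
        names.append(candidate)
    return names
```

Raising surfaces the bad input to the caller immediately, at the cost of failing the whole request when a single entry is malformed; skipping with a warning is more lenient but can hide mistakes.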

🚦 Merge Recommendation

❌ Request Changes - Critical architectural and security issues require resolution before this change can be safely merged.

Reasoning

The identified breaking API contract due to the alias for format_data_source_ids and the insecure handling of data source name types pose significant risks to system stability and security. Addressing these issues is paramount to prevent runtime errors and potential vulnerabilities in production.


🤖 This review was generated by CodeAlive AI

AI can make mistakes, so please verify all suggestions and use with caution. We value your feedback about this review - please contact us at support@codealive.ai.

💡 Tip: Comment /codealive review to retrigger reviews. You can also manage reviews at https://app.codealive.ai/pull-requests.

@sciapanCA sciapanCA merged commit 4c0aa5f into main Oct 16, 2025
1 check passed