Skip to content

RAG: DocumentCollections and VectorIndices#86

Merged
JeanKaddour merged 71 commits intomainfrom
feat/rag
Jan 15, 2025
Merged

RAG: DocumentCollections and VectorIndices#86
JeanKaddour merged 71 commits intomainfrom
feat/rag

Conversation

@JeanKaddour
Copy link
Copy Markdown
Contributor

@JeanKaddour JeanKaddour commented Jan 15, 2025

TL;DR

We now support the creation of

  1. DocumentCollection objects: a set of documents to be chunked
  2. VectorIndex objects: given 1), we embed these chunks and upsert them into a vector DB

Details

Provider Configurations and API Updates:

  • backend/app/api/key_management.py: Added new provider configurations for various LLM and vector store providers. Introduced new API endpoints to get provider configurations, embedding models, and vector stores. Updated the mask_key_value function to handle different parameter types. [1] [2] [3] [4] [5]

Database Schema Modifications:

Documentation and Miscellaneous:

These changes enhance the project's provider management capabilities, improve database schema handling, and update documentation and configurations.


Important

This pull request adds support for managing document collections and vector indices, including frontend components, backend API endpoints, and updates to the Nginx configuration for larger file uploads.

  • Frontend:
    • Add DocumentCollectionWizard, VectorIndexWizard, KnowledgeBases, DocumentCollectionDetails, and VectorIndexDetails components for managing document collections and vector indices.
    • Update Header to include a new "RAG" page.
    • Add pages for creating and viewing document collections and vector indices.
  • Backend:
    • Add API endpoints for managing document collections and vector indices in rag_management.py.
    • Implement models for DocumentCollectionModel, VectorIndexModel, and ProcessingProgressModel in dc_and_vi_model.py.
    • Add functions for handling document collections and vector indices in api.py.
  • Nginx:
    • Update default.conf to increase client_max_body_size to 100M for larger file uploads.

This description was created by Ellipsis for b3b9027. It will automatically update as commits are pushed.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 3d9422a in 40 seconds

More details
  • Looked at 1218 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_q9FgcYDnCjSOPbaB


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 8555f5f in 2 minutes and 20 seconds

More details
  • Looked at 11214 lines of code in 58 files
  • Skipped 4 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. backend/app/api/key_management.py:213
  • Draft comment:
    The mask_key_value function should explicitly handle non-password types by returning the full value. Consider adding a comment or clarifying the logic to ensure this behavior is clear.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The mask_key_value function has been updated to handle different parameter types, but the logic for masking non-password types is not clear. It should return the full value for non-password types, but this is not explicitly stated in the code.

Workflow ID: wflow_GC1TZKohXhODJo76


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Comment thread backend/app/api/rag_management.py
Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on fd402ac in 1 minute and 9 seconds

More details
  • Looked at 608 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. backend/app/api/rag_management.py:70
  • Draft comment:
    The use of cast here is unnecessary since collection.status is already a string type. You can directly assign the value without casting.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code is using cast to change the type of collection.status, but this is unnecessary since collection.status is already a string type. The cast function is redundant here.
2. backend/app/models/dc_and_vi_model.py:113
  • Draft comment:
    Consider using func.now() from SQLAlchemy for the default value of created_at to ensure the timestamp is set at the time of record creation.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The created_at and updated_at fields in the ProcessingProgressModel are using datetime.now(UTC) for their default values. This is not ideal because it sets the time at the moment the server starts, not when the record is created. Using func.now() from SQLAlchemy would be more appropriate.
3. backend/app/models/dc_and_vi_model.py:119
  • Draft comment:
    Consider using func.now() from SQLAlchemy for the default value of updated_at to ensure the timestamp is set at the time of record creation and updated correctly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The created_at and updated_at fields in the ProcessingProgressModel are using datetime.now(UTC) for their default values. This is not ideal because it sets the time at the moment the server starts, not when the record is created. Using func.now() from SQLAlchemy would be more appropriate.
4. backend/app/schemas/rag_schemas.py:25
  • Draft comment:
    Using os.getenv to fetch API keys can lead to security issues. Consider using a more secure method to handle sensitive information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_vision_config method in TextProcessingConfig uses os.getenv to fetch API keys. This can lead to security issues if the environment variables are not properly managed. Consider using a more secure method to handle sensitive information.
5. backend/app/schemas/rag_schemas.py:27
  • Draft comment:
    Using os.getenv to fetch API keys can lead to security issues. Consider using a more secure method to handle sensitive information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_vision_config method in TextProcessingConfig uses os.getenv to fetch API keys. This can lead to security issues if the environment variables are not properly managed. Consider using a more secure method to handle sensitive information.
6. frontend/src/utils/api.ts:888
  • Draft comment:
    Returning null for a 404 error is good practice, but consider handling other errors more gracefully, possibly by logging them or providing a default fallback.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The getIndexProgress function in api.ts returns null for a 404 error, which is a good practice. However, it should also handle other potential errors gracefully, possibly by logging them or providing a default fallback.

Workflow ID: wflow_f9LTjKfQcpcRvfEM


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on f7fda46 in 1 minute and 21 seconds

More details
  • Looked at 126 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. frontend/src/components/rag/VectorIndexWizard.tsx:169
  • Draft comment:
    The useEffect hook at line 131 duplicates data fetching already done in the useEffect at line 94. Consider removing it to avoid redundant network requests.
  • Reason this comment was not posted:
    Comment was not on a valid diff hunk.
2. frontend/src/components/rag/VectorIndexWizard.tsx:617
  • Draft comment:
    Ensure that the 'Next' button is disabled if the name is only whitespace by checking !config.name.trim() in the condition.
  • Reason this comment was not posted:
    Comment did not seem useful.

Workflow ID: wflow_oyGpNXslPcZprk7S


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Copy Markdown
Collaborator

@srijanpatel srijanpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the new and modified frontend code uses two spaces for tab, move it to four spaces to match with the rest of it.

Comment thread backend/app/models/dc_and_vi_model.py
Comment thread backend/app/models/management/alembic/versions/004_update_dc_and_vi.py Outdated
Comment thread backend/app/rag/models/datastore_api_schemas.py Outdated
Comment thread backend/app/rag/models/document_schemas.py
Comment thread backend/app/schemas/rag_schemas.py
Comment thread frontend/src/pages/rag/[id].tsx Outdated
Comment thread frontend/src/utils/api.ts
Comment thread frontend/src/utils/api.ts
Comment thread requirements.txt
Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on fb86ac9 in 2 minutes and 9 seconds

More details
  • Looked at 9174 lines of code in 30 files
  • Skipped 1 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. frontend/src/hooks/useWorkflowExecution.ts:29
  • Draft comment:
    Consider using a ref to store currentStatusInterval instead of a top-level let variable. This ensures each hook instance has its own interval and avoids potential conflicts when multiple components use this hook simultaneously.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. frontend/src/components/rag/AddDocumentsWizard.tsx:27
  • Draft comment:
    Consider using a ref to store pollingInterval instead of state. This avoids unnecessary re-renders when the interval is set or cleared.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. frontend/src/components/rag/DocumentCollectionWizard.tsx:124
  • Draft comment:
    Remove the console.log statement used for debugging purposes.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    In DocumentCollectionWizard.tsx, the handleSubmit function logs the config to the console. This is likely a leftover from debugging and should be removed before merging.
4. frontend/src/components/rag/VectorIndexDetails.tsx:63
  • Draft comment:
    Consider using a ref to store pollInterval instead of a let variable. This ensures the interval is properly managed across renders and avoids potential issues with stale closures.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_ARqXdzs2dHEotp5n


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 73227d9 in 44 seconds

More details
  • Looked at 61 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. backend/app/models/management/alembic/versions/004_add_progress_status.py:3
  • Draft comment:
    Ensure that the revision and down_revision IDs are unique and correctly ordered to maintain migration integrity.
  • Reason this comment was not posted:
    Comment did not seem useful.

Workflow ID: wflow_8G8NEoRhhELux5rS


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on b3b9027 in 1 minute and 4 seconds

More details
  • Looked at 401 lines of code in 10 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. backend/app/api/rag_management.py:257
  • Draft comment:
    Consider adding db.rollback() in the exception handling to ensure database consistency in case of errors.
  • Reason this comment was not posted:
    Comment was not on a valid diff hunk.

Workflow ID: wflow_SR4mepy9zUtT2SNP


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@JeanKaddour JeanKaddour merged commit 0ac2e0e into main Jan 15, 2025
@JeanKaddour JeanKaddour deleted the feat/rag branch January 15, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants