Add better handling for ingesting duplicates #450

Merged: 2 commits, Jun 13, 2024

Conversation

NolanTrem
Collaborator

@NolanTrem NolanTrem commented Jun 13, 2024

🚀 This description was created by Ellipsis for commit 2658f60

Summary:

Improved duplicate handling in document and file ingestion functions in r2r/main/r2r_app.py, ensuring duplicates are skipped and appropriate messages are logged and returned.

Key points:

  • Enhanced duplicate handling in r2r/main/r2r_app.py for document and file ingestion.
  • Updated aingest_documents to skip duplicates and log appropriate messages.
  • Modified aingest_files to handle duplicate files similarly.
  • Added checks for existing document IDs and raised HTTPException for conflicts.
  • Updated return values to include lists of processed and skipped documents/files.
  • Adjusted ingest_documents_app and ingest_files_app to propagate HTTPException correctly.
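The skip-and-report behavior described in the key points can be sketched roughly as follows. This is a minimal illustration, not the code from r2r/main/r2r_app.py: the `Document` class, the `ingest_documents` function, the `existing_ids` argument, and the `ConflictError` stand-in (used here in place of `HTTPException` so the sketch has no third-party dependencies) are all hypothetical, and raising only when every submitted document is a duplicate is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical minimal document record for this sketch.
    id: str
    title: str


class ConflictError(Exception):
    """Stand-in for an HTTP conflict error (the PR itself raises HTTPException)."""


async def ingest_documents(documents, existing_ids):
    """Skip documents whose ID is already stored and report both groups."""
    seen = set(existing_ids)
    processed, skipped = [], []
    for doc in documents:
        if doc.id in seen:
            # Duplicate: record it and move on instead of failing the batch.
            skipped.append(doc.title)
            continue
        seen.add(doc.id)  # also catches repeats within the same batch
        processed.append(doc.title)
    # Assumption: a conflict is raised only when nothing new was ingested.
    if skipped and not processed:
        raise ConflictError(f"All provided documents already exist: {skipped}")
    return {"processed_documents": processed, "skipped_documents": skipped}


result = asyncio.run(
    ingest_documents(
        [Document("a", "Alpha"), Document("b", "Beta")],
        existing_ids=["a"],
    )
)
print(result)
```

Returning both lists lets a caller tell a partial ingestion apart from a full one without parsing log messages.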

Generated with ❤️ by ellipsis.dev

vercel bot commented Jun 13, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
r2r-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 13, 2024 11:17pm

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

❌ Changes requested. Reviewed everything up to 2658f60 in 43 seconds

More details
  • Looked at 245 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_6L6hAdlRx9Vm6gYh


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@@ -289,7 +289,27 @@ async def aingest_documents(
)

document_infos = []
skipped_documents = []
processed_documents = []
existing_document_ids = [

The current approach of fetching all document IDs from the database to check for duplicates can be inefficient, especially with a large number of documents. Consider querying the database for each document ID directly or using a batch query to improve performance and scalability.
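One way to read the batch-query suggestion is to ask the database only about the IDs being ingested, rather than loading every stored ID. The sketch below uses an in-memory SQLite database purely for illustration; the `documents` table, the `document_id` column, and the `find_existing_ids` helper are assumptions and do not reflect the project's actual storage layer.

```python
import sqlite3


def find_existing_ids(conn, candidate_ids):
    """Return the subset of candidate_ids already stored, using one IN query."""
    ids = list(candidate_ids)
    if not ids:
        return set()
    # One bound parameter per candidate ID; values are never interpolated.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT document_id FROM documents WHERE document_id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0] for row in rows}


# Hypothetical table standing in for the real document store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (document_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO documents VALUES (?)", [("a",), ("b",), ("c",)])

existing = find_existing_ids(conn, ["b", "x", "c"])
print(sorted(existing))
```

With this shape the work scales with the size of the incoming batch, not with the total number of stored documents. For very large batches the ID list would likely need to be split into chunks, since databases cap the number of bound parameters per statement.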

@NolanTrem NolanTrem merged commit 536429c into main Jun 13, 2024
2 of 3 checks passed
@NolanTrem NolanTrem deleted the Nolan/DoubleIngestion branch June 13, 2024 23:17