@lpi-tn (Collaborator) commented Nov 19, 2025

This pull request updates the document collection logic in the DocumentHubCollector workflow to ensure that each corpus plugin processes only its relevant documents. The main change improves data handling accuracy by passing the correct set of documents to each collector.

Data extraction logic improvement:

  • Updated the call to corpus_collector.run in document_collector.py to use batch_docs[corpus_name] instead of welearn_documents, ensuring each corpus processes only its own documents.
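
The change above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of document_collector.py: only `corpus_collector.run`, `batch_docs[corpus_name]`, and `welearn_documents` come from this description, while `Document`, `RecordingCollector`, `group_by_corpus`, and `collect` are hypothetical names introduced for the example, and grouping by a `corpus_name` attribute is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document record."""
    url: str
    corpus_name: str


@dataclass
class RecordingCollector:
    """Hypothetical corpus plugin that records the documents it is given."""
    received: list = field(default_factory=list)

    def run(self, documents):
        self.received.extend(documents)
        return list(documents)


def group_by_corpus(welearn_documents):
    """Group documents by corpus name (assumed shape of batch_docs)."""
    batch_docs = defaultdict(list)
    for doc in welearn_documents:
        batch_docs[doc.corpus_name].append(doc)
    return batch_docs


def collect(welearn_documents, collectors):
    """Run each collector on the documents of its own corpus only."""
    batch_docs = group_by_corpus(welearn_documents)
    results = {}
    for corpus_name, corpus_collector in collectors.items():
        # Before the fix, as described: corpus_collector.run(welearn_documents)
        # After the fix: pass only the subset belonging to this corpus.
        results[corpus_name] = corpus_collector.run(batch_docs[corpus_name])
    return results
```

With two collectors registered under different corpus names, each one's `received` list ends up containing only documents whose `corpus_name` matches its key, instead of the full mixed batch.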

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes a data processing bug where all corpus plugins were incorrectly receiving the complete set of documents instead of their specific subset. The change ensures each corpus collector processes only its relevant documents, improving data handling accuracy.

Key Changes:

  • Updated the document filtering logic to pass corpus-specific documents to each collector

@lpi-tn merged commit 02a1150 into main on Nov 19, 2025
7 checks passed
@lpi-tn deleted the Fix/mix-corpus-collectors branch on November 19, 2025 14:51

3 participants