feat: Implement batch processing for document ingestion in MediawikiETL #52
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant MediawikiETL
    participant ThreadPoolExecutor
    participant IngestionPipeline
    User->>MediawikiETL: load(documents)
    MediawikiETL->>ThreadPoolExecutor: submit batch ingestion tasks
    ThreadPoolExecutor->>IngestionPipeline: run_pipeline(batch)
    IngestionPipeline-->>ThreadPoolExecutor: result or exception
    ThreadPoolExecutor-->>MediawikiETL: future completion
    MediawikiETL->>MediawikiETL: log success or error
    MediawikiETL->>MediawikiETL: log total documents loaded
```
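Read as code, the flow in the diagram might look roughly like the sketch below. The submit/collect pattern over `ThreadPoolExecutor` and the per-batch logging come from the diagram; the function signature, the `max_workers` value, and the batch-size default are assumptions, not taken from the PR.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

def load(documents, ingestion_pipeline, batch_size=1000, max_workers=4):
    """Submit document batches to the pipeline and log per-batch outcomes."""
    batches = [
        documents[i : i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    loaded = 0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each submitted future back to its batch index for logging.
        futures = {
            executor.submit(ingestion_pipeline.run_pipeline, batch): idx
            for idx, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                future.result()  # re-raises any exception from run_pipeline
                loaded += len(batches[idx])
                logging.info("Batch %d/%d loaded", idx + 1, len(batches))
            except Exception:
                logging.exception("Batch %d/%d failed", idx + 1, len(batches))
    logging.info("Total documents loaded: %d", loaded)
```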
Actionable comments posted: 1
🧹 Nitpick comments (2)
hivemind_etl/mediawiki/etl.py (2)
103-103: Consider making batch size configurable. The hardcoded batch size of 1000 might not be optimal for all use cases. Consider making it configurable through the constructor or as a parameter.
For example, add a batch_size parameter to the constructor:
```diff
 def __init__(
     self,
     community_id: str,
     namespaces: list[int],
     platform_id: str,
     delete_dump_after_load: bool = True,
+    batch_size: int = 1000,
 ) -> None:
```

Then use `self.batch_size` in the load method.
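As a rough sketch of that follow-through (the loop shape mirrors the existing load code; the method signature and the bare `ingestion_pipeline` reference are assumptions):

```python
def load(self, documents: list) -> None:
    # self.batch_size replaces the hardcoded 1000 in the original loop.
    for i in range(0, len(documents), self.batch_size):
        ingestion_pipeline.run_pipeline(documents[i : i + self.batch_size])
```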
104-106: Consider adding error handling for individual batches. If one batch fails during ingestion, the entire process will stop. Consider adding error handling to log failures and continue with remaining batches, or at least provide better error context.
Example implementation:
```diff
 for i in range(0, len(documents), batch_size):
     batch_num = (i // batch_size) + 1
     total_batches = (len(documents) + batch_size - 1) // batch_size
     logging.info("Loading batch %d/%d into Qdrant!", batch_num, total_batches)
-    ingestion_pipeline.run_pipeline(documents[i : i + batch_size])
+    try:
+        ingestion_pipeline.run_pipeline(documents[i : i + batch_size])
+    except Exception as e:
+        logging.error("Failed to load batch %d/%d: %s", batch_num, total_batches, e)
+        raise
```
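If partial progress is acceptable, a variant could record failures and keep going rather than re-raising on the first error. A hedged sketch, wrapped as a standalone function for clarity (`load_batches` and the `failed_batches` bookkeeping are illustrative, not from the PR):

```python
import logging

def load_batches(documents, ingestion_pipeline, batch_size: int = 1000) -> None:
    failed_batches: list[int] = []
    for i in range(0, len(documents), batch_size):
        batch_num = (i // batch_size) + 1
        try:
            ingestion_pipeline.run_pipeline(documents[i : i + batch_size])
        except Exception:
            logging.exception("Failed to load batch %d", batch_num)
            failed_batches.append(batch_num)
    if failed_batches:
        # Surface one aggregate failure after every batch has been attempted.
        raise RuntimeError(f"Batches failed to load: {failed_batches}")
```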
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 105-105: Use lazy % formatting in logging functions
(W1203)
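For reference, W1203 flags eager f-string interpolation in logging calls; the %-style form defers formatting until the record is actually emitted. A small illustration:

```python
import logging

logging.basicConfig(level=logging.INFO)
batch_num, total_batches = 1, 5

# Flagged by W1203: the f-string is built even if this log level is disabled.
logging.info(f"Loading batch {batch_num}/{total_batches} into Qdrant!")

# Lazy %-style formatting: arguments are interpolated only when emitted.
logging.info("Loading batch %d/%d into Qdrant!", batch_num, total_batches)
```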
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hivemind_etl/mediawiki/etl.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
hivemind_etl/mediawiki/etl.py
[warning] 105-105: Use lazy % formatting in logging functions
(W1203)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / build-push / Build + Push Image