feat: limit batch size to 1! #61
Conversation
As a temporary fix for the llama-index docstore/vectorstore loading issue, we limit the batch size to 1. The issue: in the llama-index pipeline, documents are first loaded into the docstore and then into the vectorstore. If anything goes wrong while a batch is being loaded into the docstore, that batch never makes it into the vectorstore. Limiting the batch size to 1 means documents are loaded one by one into the docstore and vectorstore, so a failure can only affect a single document.
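Roughly, the change amounts to something like the following (a minimal sketch only; `load_documents` and `process_batch` are placeholder names, not the exact `MediawikiETL.load` code):

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_documents(documents, process_batch):
    # Temporary workaround: a batch size of 1 means each document is written to
    # the docstore and the vectorstore on its own, so one bad document cannot
    # silently drop the rest of its batch from the vectorstore.
    batch_size = 1
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]

    with ThreadPoolExecutor(max_workers=10) as executor:
        # `process_batch` stands in for whatever pushes a batch through the
        # llama-index ingestion pipeline (docstore + Qdrant vectorstore).
        futures = {
            executor.submit(process_batch, batch): idx
            for idx, batch in enumerate(batches, start=1)
        }
        for future in as_completed(futures):
            batch_idx = futures[future]
            future.result()  # re-raises any exception from the worker
            logging.info(f"Loaded batch {batch_idx} of {len(batches)} into the stores")
```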
Walkthrough
The batch size for document processing in `hivemind_etl/mediawiki/etl.py` was reduced to 1, so each document is submitted to the executor and loaded into the docstore and vectorstore individually.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Caller
    participant MediawikiETL
    participant ThreadPoolExecutor
    Caller->>MediawikiETL: load(documents)
    MediawikiETL->>MediawikiETL: Split documents into batches of 1
    loop For each document
        MediawikiETL->>ThreadPoolExecutor: submit(process, [document])
    end
    ThreadPoolExecutor-->>MediawikiETL: Return results as completed
    MediawikiETL-->>Caller: Return processing results
```
Actionable comments posted: 1
🔭 Outside diff range comments (1)
hivemind_etl/mediawiki/etl.py (1)
119-124: Improve error handling for single-document processing. With batch_size=1, any single document failure now stops the entire ETL process. This makes the pipeline more fragile than necessary. Consider implementing partial failure handling to skip problematic documents while continuing to process the rest.
```diff
 try:
     future.result()  # This will raise any exceptions that occurred
     logging.info(f"Successfully loaded batch {batch_idx} of {len(batches)} documents into Qdrant!")
 except Exception as e:
-    logging.error(f"Error processing batch {batch_idx}: {e}")
-    raise  # Re-raise the exception to stop the process
+    logging.error(f"Error processing document {batch_idx}: {e}")
+    # Continue processing other documents instead of failing the entire batch
+    continue
```
This aligns with the retrieved learning that "Errors should be caught in ETL processes, especially in the `load` method."
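A hedged sketch of what such partial-failure handling could look like in the loading loop (the function name and the failure bookkeeping are illustrative, not the repository's actual code):

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_with_partial_failures(batches, process_batch, max_workers=10):
    """Load batches concurrently, skipping failed ones instead of aborting."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_batch, b): i for i, b in enumerate(batches, 1)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                future.result()
                logging.info(f"Loaded batch {idx} of {len(batches)}")
            except Exception as exc:
                logging.error(f"Error processing batch {idx}: {exc}")
                failed.append(idx)  # keep going; report failures at the end
    if failed:
        logging.warning(f"{len(failed)} batch(es) failed: {failed}")
    return failed
```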
🧹 Nitpick comments (1)
hivemind_etl/mediawiki/etl.py (1)
109-114: Consider reducing ThreadPoolExecutor workers for single-document batches. With batch_size=1, having max_workers=10 may create unnecessary overhead since each worker now processes only one document. Consider reducing the worker count or implementing a different concurrency strategy.
```diff
-    with ThreadPoolExecutor(max_workers=10) as executor:
+    # Reduced workers for single-document processing to minimize overhead
+    max_workers = min(10, len(documents), 5)  # Cap at 5 for single-document batches
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
```
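As a quick illustration of the suggested cap (a sketch only; the cap of 5 comes from the review comment, not from any benchmark):

```python
# min(10, len(documents), 5) never exceeds 5 workers and never exceeds the
# number of single-document batches, so small inputs don't spin up idle threads.
for n_docs in (1, 3, 8, 50):
    print(n_docs, "documents ->", min(10, n_docs, 5), "workers")
# 1 documents -> 1 workers
# 3 documents -> 3 workers
# 8 documents -> 5 workers
# 50 documents -> 5 workers
```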
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hivemind_etl/mediawiki/etl.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: amindadgar
PR: TogetherCrew/temporal-worker-python#15
File: hivemind_etl/mediawiki/llama_xml_reader.py:77-95
Timestamp: 2025-04-13T06:12:17.600Z
Learning: Error handling for XML parsing in the `XMLReader.load_data` method in `hivemind_etl/mediawiki/llama_xml_reader.py` has been deferred for a future implementation.
Learnt from: amindadgar
PR: TogetherCrew/temporal-worker-python#1
File: hivemind_etl/website/website_etl.py:84-94
Timestamp: 2024-11-25T11:29:38.063Z
Learning: Errors should be caught in ETL processes, especially in the `load` method of the `WebsiteETL` class in `hivemind_etl/website/website_etl.py`.
hivemind_etl/mediawiki/etl.py (1)
Learnt from: amindadgar
PR: TogetherCrew/temporal-worker-python#15
File: hivemind_etl/mediawiki/llama_xml_reader.py:77-95
Timestamp: 2025-04-13T06:12:17.600Z
Learning: Error handling for XML parsing in the `XMLReader.load_data` method in `hivemind_etl/mediawiki/llama_xml_reader.py` has been deferred for a future implementation.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test