-
Notifications
You must be signed in to change notification settings - Fork 0
Improve scraper initialization and test support #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the robustness and testability of the Reddit scraper application by implementing lazy initialization patterns, adding pytest configuration helpers, and providing graceful fallbacks for Excel export.
Key Changes:
- Introduces dependency injection and lazy initialization for
ResearchAssistant,EnhancedRetrievalSystem, andSemanticSearchEngineto support mocking and deferred heavy operations - Adds pytest configuration with automatic path setup and default environment variables for testing
- Implements a CSV fallback mechanism when openpyxl is unavailable, with openpyxl added as an explicit dependency
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/conftest.py | Adds pytest configuration to set up import paths and default Reddit API credentials |
| src/tui.py | Refactors TUI initialization to separate scraper and research assistant setup with better error isolation |
| src/research_assistant.py | Adds lazy initialization support via auto_initialize parameter and factory pattern for dependency injection |
| src/enhanced_retrieval.py | Introduces dependency injection for scraper, semantic engine, and scorer components |
| src/advanced_scraper.py | Adds robust Excel export with CSV fallback and includes openpyxl dependency; reorganizes imports |
| src/init.py | Creates package-level __all__ declaration for module exports |
| requirements.txt | Adds openpyxl dependency for Excel export support |
| pyproject.toml | Adds openpyxl dependency in Poetry configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self, | ||
| encoder=None, | ||
| model_name: str = "all-MiniLM-L6-v2", | ||
| index_factory=faiss.IndexFlatIP, |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The index_factory parameter lacks a type annotation. Consider adding a type hint:
def __init__(
self,
encoder=None,
model_name: str = "all-MiniLM-L6-v2",
index_factory: type = faiss.IndexFlatIP,
):Or more precisely, using typing.Callable:
from typing import Callable
# ...
index_factory: Callable[[int], faiss.IndexFlatIP] = faiss.IndexFlatIP,| scraper_factory=AdvancedRedditScraper, | ||
| retrieval_system_factory=EnhancedRetrievalSystem, |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The scraper_factory and retrieval_system_factory parameters lack type annotations. Consider adding type hints for better code clarity:
from typing import Callable, Type
def __init__(
self,
obsidian_vault_path: Optional[str] = None,
*,
auto_initialize: bool = True,
scraper_factory: Type[AdvancedRedditScraper] = AdvancedRedditScraper,
retrieval_system_factory: Type[EnhancedRetrievalSystem] = EnhancedRetrievalSystem,
):Or if they're meant to be callable factories:
scraper_factory: Callable[[], AdvancedRedditScraper] = AdvancedRedditScraper,
retrieval_system_factory: Callable[[], EnhancedRetrievalSystem] = EnhancedRetrievalSystem,|
|
||
| try: | ||
| self.obsidian_exporter = ObsidianExporter(obsidian_vault_path) | ||
| except (FileNotFoundError, Exception): |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Catching Exception after FileNotFoundError is redundant since FileNotFoundError is a subclass of Exception. Consider catching just Exception:
except Exception:
# If obsidian vault is not available, disable the feature but keep other functionality
logger.warning("Obsidian vault not available - export to Obsidian disabled")
self.obsidian_exporter = None| except (FileNotFoundError, Exception): | |
| except Exception: |
| import time | ||
| import json | ||
| import logging | ||
| import os |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate import of os module. The module is imported at line 7 and again at line 12. Remove one of these duplicate imports.
| import os |
| logger.warning( | ||
| "Excel export dependency missing (%s). Falling back to a CSV-formatted " | ||
| "file with an .xlsx extension at %s. Install 'openpyxl' for true Excel " | ||
| "output.", | ||
| exc, | ||
| filename, | ||
| ) | ||
| self._write_tabular_fallback(data, filename) |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing CSV-formatted data to a file with an .xlsx extension (as indicated in the warning message at lines 274-276) could be confusing for users and may cause issues with programs that expect valid Excel files. Consider either:
- Changing the file extension to
.csvwhen falling back - Or using pandas'
to_csv()method with the filename changed to use a.csvextension
Example:
csv_filename = filename.replace('.xlsx', '.csv')
self._write_tabular_fallback(data, csv_filename)
logger.warning("Excel export dependency missing. Created CSV file at %s instead.", csv_filename)| logger.warning( | |
| "Excel export dependency missing (%s). Falling back to a CSV-formatted " | |
| "file with an .xlsx extension at %s. Install 'openpyxl' for true Excel " | |
| "output.", | |
| exc, | |
| filename, | |
| ) | |
| self._write_tabular_fallback(data, filename) | |
| # Change the filename extension to .csv for fallback | |
| csv_filename = filename | |
| if csv_filename.lower().endswith('.xlsx'): | |
| csv_filename = csv_filename[:-5] + '.csv' | |
| logger.warning( | |
| "Excel export dependency missing (%s). Created CSV file at %s instead. " | |
| "Install 'openpyxl' for true Excel output.", | |
| exc, | |
| csv_filename, | |
| ) | |
| self._write_tabular_fallback(data, csv_filename) |
|
|
||
| if self.enhanced_system is None: | ||
| try: | ||
| self.enhanced_system = self._retrieval_system_factory() |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lazy initialization strategy may not work as intended. When EnhancedRetrievalSystem() is instantiated at line 73, its __init__ method will attempt to create a new AdvancedRedditScraper() instance if scraper is None (see line 207 in enhanced_retrieval.py). This bypasses the exception handling in _initialize_dependencies().
To properly support lazy initialization, ResearchAssistant should pass the already-initialized scraper to EnhancedRetrievalSystem:
if self.enhanced_system is None:
try:
self.enhanced_system = self._retrieval_system_factory(scraper=self.scraper)
except Exception as exc:
logger.error("Failed to initialize EnhancedRetrievalSystem: %s", exc)
self.enhanced_system = None| self.enhanced_system = self._retrieval_system_factory() | |
| self.enhanced_system = self._retrieval_system_factory(scraper=self.scraper) |
|
|
||
| def __init__( | ||
| self, | ||
| encoder=None, |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The encoder parameter lacks a type annotation. Consider adding a type hint for better code clarity and IDE support:
def __init__(
self,
encoder: Optional[sentence_transformers.SentenceTransformer] = None,
model_name: str = "all-MiniLM-L6-v2",
index_factory=faiss.IndexFlatIP,
):| encoder=None, | |
| encoder: Optional[sentence_transformers.SentenceTransformer] = None, |
| import os | ||
| import sys | ||
| from datetime import datetime, timezone | ||
| from typing import Dict, List, Optional, Tuple |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'Tuple' is not used.
| from typing import Dict, List, Optional, Tuple | |
| from typing import Dict, List, Optional |
Summary
Testing
pytestCodex Task