Feature/pressbook #47
Pull Request Overview
This PR adds support for collecting and processing PressBooks data by introducing a new REST plugin and URL collector, along with related utilities and tests.
- Added `clean_text_keep_punctuation` and updated `clean_text` to preserve punctuation.
- Implemented `PressBooksCollector` to fetch book metadata/content and produce `ScrapedWeLearnDocument` instances.
- Introduced `PressBooksURLCollector` workflow node and unit tests for both URL collection and content extraction.
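To make the first bullet concrete, here is a minimal sketch of what a punctuation-preserving cleaner could look like. Only the function name comes from this PR; the body is an assumption, and the specific normalization steps (entity decoding, tag removal, whitespace collapsing) are guesses rather than the code under review.

```python
import html
import re


def clean_text_keep_punctuation(text: str) -> str:
    """Hypothetical sketch: normalize scraped text without dropping punctuation."""
    # Decode HTML entities such as &amp; or &nbsp;
    text = html.unescape(text)
    # Replace anything that looks like an HTML tag with a space
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace (including non-breaking spaces) into one space
    text = re.sub(r"\s+", " ", text)
    # Remove the space that tag removal may leave before punctuation
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()
```

Note that the last substitution also removes intentional spaces before punctuation (as in French typography), so a real implementation may want to skip it.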
Reviewed Changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| welearn_datastack/utils_/scraping_utils.py | Added clean_text_keep_punctuation and updated clean_text. |
| welearn_datastack/plugins/rest_requesters/pressbooks.py | New PressBooksCollector implementation for metadata and content. |
| welearn_datastack/plugins/rest_requesters/__init__.py | Registered PressBooksCollector in the plugin exports. |
| welearn_datastack/collectors/press_books_collector.py | Added PressBooksURLCollector for retrieving PressBooks URLs. |
| welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py | New workflow node to invoke PressBooksURLCollector and insert URLs. |
| tests/url_collector/test_press_books_collector.py | Unit tests for PressBooksURLCollector. |
| tests/url_collector/resources/pb_algolia_response.json | Sample Algolia API response for URL collector tests. |
| tests/document_collector_hub/resources/pb_chapter_5_metadata.json | Sample PressBooks metadata for document collector tests. |
| tests/document_collector_hub/plugins_test/test_pressbooks.py | Unit tests for PressBooksCollector. |
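
The URL collection step listed in the table can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names, the `url` field on each hit, and the per-chapter REST path are all hypothetical. The only detail taken as given is that Algolia search responses list their results under `hits`.

```python
from typing import Any


def extract_book_urls(algolia_response: dict[str, Any]) -> list[str]:
    """Hypothetical sketch: pull unique book URLs out of an Algolia-style payload."""
    urls: list[str] = []
    seen: set[str] = set()
    for hit in algolia_response.get("hits", []):
        url = hit.get("url")  # assumed field name, not confirmed by the PR
        if not url:
            continue
        # Normalize so "https://x/book" and "https://x/book/" are counted once
        normalized = url.rstrip("/") + "/"
        if normalized not in seen:
            seen.add(normalized)
            urls.append(normalized)
    return urls


def chapter_api_url(book_url: str, chapter_id: int) -> str:
    """Build a per-chapter REST URL (the path layout is an assumption)."""
    return f"{book_url.rstrip('/')}/wp-json/pressbooks/v2/chapters/{chapter_id}"
```

Keeping both helpers as pure functions means they can be tested against a saved payload such as `pb_algolia_response.json` without any network access.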
Comments suppressed due to low confidence (3)
welearn_datastack/utils_/scraping_utils.py:64
- [nitpick] Consider adding a docstring to 'clean_text_keep_punctuation' to document its purpose, parameters, and behavior.
def clean_text_keep_punctuation(text):
welearn_datastack/utils_/scraping_utils.py:64
- Missing import for 're' module. Add 'import re' at the top of the file to enable the regex substitutions in clean_text_keep_punctuation.
def clean_text_keep_punctuation(text):
welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py:5
- Typo in import path ('press_books_coolector'); it should be 'press_books_collector'.
from welearn_datastack.collectors.press_books_coolector import PressBooksURLCollector
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new `PressBooksCollector` plugin and related functionality for collecting and processing data from PressBooks. It includes new implementations for URL collection and metadata extraction, plus unit tests to check that the functionality works as expected. Below is a breakdown of the most important changes:
New Plugin Implementation
- `welearn_datastack/plugins/rest_requesters/pressbooks.py`: Added the `PressBooksCollector` class to fetch and process book metadata and content from PressBooks. It includes methods for extracting book URLs, processing metadata, validating licenses, and generating structured documents.
URL Collection Enhancements
- `welearn_datastack/collectors/press_books_collector.py`: Introduced the `PressBooksURLCollector` class to retrieve book URLs and chapter URLs from PressBooks using the Algolia API. This class handles TOC parsing and URL formatting.
- `welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py`: Added a workflow node to execute the `PressBooksURLCollector`, retrieve URLs, and insert them into the database.
Unit Tests
- `tests/document_collector_hub/plugins_test/test_pressbooks.py`: Added unit tests for `PressBooksCollector` to validate metadata extraction, content parsing, and error handling for unauthorized licenses and HTTP errors.
- `tests/url_collector/test_press_books_collector.py`: Added unit tests for `PressBooksURLCollector` to verify URL collection and integration with the Algolia API.
Test Resources
- `tests/document_collector_hub/resources/pb_chapter_5_metadata.json`: Added a sample metadata JSON file for testing the `PressBooksCollector`.
- `tests/url_collector/resources/pb_algolia_response.json`: Added a sample Algolia response JSON for testing URL collection.
Integration with Existing Codebase
- `welearn_datastack/plugins/rest_requesters/__init__.py`: Registered the `PressBooksCollector` plugin in the module's imports and exports.
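
The license validation mentioned above, together with the error handling for unauthorized licenses, might look something like the sketch below. Everything here is an assumption made for illustration: the exception name, the `validate_license` helper, and the allow-list of Creative Commons prefixes are hypothetical, since the PR text does not show which licenses are actually accepted.

```python
class UnauthorizedLicense(Exception):
    """Raised when a license is not on the allow-list (hypothetical name)."""


# Assumed allow-list; the licenses actually accepted by the PR are not shown.
AUTHORIZED_LICENSE_PREFIXES = (
    "https://creativecommons.org/licenses/by/",
    "https://creativecommons.org/licenses/by-sa/",
    "https://creativecommons.org/publicdomain/zero/",
)


def validate_license(license_url: str) -> str:
    """Return the normalized license URL, or raise if it is not authorized."""
    # Trim, lowercase, and upgrade the scheme so comparisons are consistent
    normalized = license_url.strip().lower().replace("http://", "https://", 1)
    # str.startswith accepts a tuple and matches any of its prefixes
    if not normalized.startswith(AUTHORIZED_LICENSE_PREFIXES):
        raise UnauthorizedLicense(f"License not authorized: {license_url!r}")
    return normalized
```

Matching on a prefix that ends with a slash keeps `by/4.0/` from accidentally matching more restrictive variants such as `by-nc-nd/4.0/`.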