lpi-tn (Collaborator) commented Jun 23, 2025

This pull request introduces a new PressBooksCollector plugin and related functionality for collecting and processing data from PressBooks. It adds URL collection, metadata extraction, and unit tests covering both. The most important changes are:

New Plugin Implementation

  • welearn_datastack/plugins/rest_requesters/pressbooks.py: Added the PressBooksCollector class to fetch and process book metadata and content from PressBooks. It includes methods for extracting book URLs, processing metadata, validating licenses, and generating structured documents.
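The license validation step mentioned above could look roughly like the sketch below. It is not the code from this PR: the allow-list, the `UnauthorizedLicenseError` name, and the `check_license` helper are all hypothetical, chosen only to illustrate rejecting a book whose license URL is not on an approved list.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; the real plugin may accept a different set of licenses.
AUTHORIZED_LICENSE_PREFIXES = (
    "creativecommons.org/licenses/by/",
    "creativecommons.org/licenses/by-sa/",
    "creativecommons.org/publicdomain/zero/",
)


class UnauthorizedLicenseError(Exception):
    """Hypothetical error raised when a book's license is not on the allow-list."""


def normalize_license_url(url: str) -> str:
    """Reduce a license URL to 'host/path/' so scheme and 'www.' do not matter."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc.removeprefix("www.")
    path = parsed.path if parsed.path.endswith("/") else parsed.path + "/"
    return host + path


def check_license(url: str) -> str:
    """Return the normalized license URL, or raise if it is not authorized."""
    normalized = normalize_license_url(url)
    if not normalized.startswith(AUTHORIZED_LICENSE_PREFIXES):
        raise UnauthorizedLicenseError(f"License not authorized: {url}")
    return normalized
```

Matching on a normalized prefix rather than the raw string means `http://`, `https://`, and `www.` variants of the same license are treated alike, while restrictive variants such as `by-nc-nd` are still rejected.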

URL Collection Enhancements

  • welearn_datastack/collectors/press_books_collector.py: Introduced the PressBooksURLCollector class to retrieve book URLs and chapter URLs from PressBooks using the Algolia API. This class handles TOC parsing and URL formatting.
  • welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py: Added a workflow node to execute the PressBooksURLCollector, retrieve URLs, and insert them into the database.
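As a rough illustration of the URL collection flow described above, the sketch below turns an Algolia-style search response into book URLs and then expands a table of contents into chapter URLs. It makes no network calls. The `url` field on each hit, the shape of the `toc` dictionary, and the chapter path template are assumptions made for the example, not details taken from this PR.

```python
from typing import Any

# Hypothetical path template for a chapter resource; the real collector may differ.
CHAPTER_PATH_TEMPLATE = "wp-json/pressbooks/v2/chapters/{chapter_id}"


def book_urls_from_hits(response: dict[str, Any]) -> list[str]:
    """Collect unique book URLs (with a trailing slash) from a search response.

    Assumes each hit may carry a 'url' field; hits without one are skipped.
    """
    urls: list[str] = []
    for hit in response.get("hits", []):
        url = hit.get("url")
        if not url:
            continue
        url = url if url.endswith("/") else url + "/"
        if url not in urls:
            urls.append(url)
    return urls


def chapter_urls_from_toc(book_url: str, toc: dict[str, Any]) -> list[str]:
    """Build one URL per chapter listed in an assumed TOC payload.

    The TOC is assumed to look like {"chapters": [{"id": 5}, ...]}.
    """
    base = book_url if book_url.endswith("/") else book_url + "/"
    return [
        base + CHAPTER_PATH_TEMPLATE.format(chapter_id=chapter["id"])
        for chapter in toc.get("chapters", [])
        if "id" in chapter
    ]
```

Keeping both steps as pure functions over already-fetched JSON makes them easy to test against saved fixtures such as a sample Algolia response.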

Unit Tests

  • tests/document_collector_hub/plugins_test/test_pressbooks.py: Added unit tests for PressBooksCollector to validate metadata extraction, content parsing, and error handling for unauthorized licenses and HTTP errors.
  • tests/url_collector/test_press_books_collector.py: Added unit tests for PressBooksURLCollector to verify URL collection and integration with the Algolia API.

Test Resources

  • tests/document_collector_hub/resources/pb_chapter_5_metadata.json: Added a sample metadata JSON file for testing the PressBooksCollector.
  • tests/url_collector/resources/pb_algolia_response.json: Added a sample Algolia response JSON for testing URL collection.

Integration with Existing Codebase

  • welearn_datastack/plugins/rest_requesters/__init__.py: Registered the PressBooksCollector plugin in the module's imports and exports.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for collecting and processing PressBooks data by introducing a new REST plugin and URL collector, along with related utilities and tests.

  • Added clean_text_keep_punctuation and updated clean_text to preserve punctuation.
  • Implemented PressBooksCollector to fetch book metadata/content and produce ScrapedWeLearnDocument instances.
  • Introduced PressBooksURLCollector workflow node and unit tests for both URL collection and content extraction.
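For readers unfamiliar with this kind of helper, a punctuation-preserving cleaner might look like the sketch below. This is not the implementation added in `scraping_utils.py`; the exact normalization steps are assumptions. It only shows the general idea: normalize Unicode, drop control characters, and collapse whitespace while leaving punctuation untouched.

```python
import re
import unicodedata


def clean_text_keep_punctuation(text: str) -> str:
    """Normalize whitespace and strip control characters, keeping punctuation.

    Illustrative sketch only; the function in this PR may behave differently.
    """
    # Fold compatibility characters (e.g. non-breaking spaces) into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Remove ASCII control characters other than tab, newline, and carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse any run of whitespace (including newlines and tabs) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For instance, `clean_text_keep_punctuation("Hello,\n\n  world!\t(test)")` returns `"Hello, world! (test)"`, with commas, exclamation marks, and parentheses intact.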

Reviewed Changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| welearn_datastack/utils_/scraping_utils.py | Added clean_text_keep_punctuation and updated clean_text. |
| welearn_datastack/plugins/rest_requesters/pressbooks.py | New PressBooksCollector implementation for metadata and content. |
| welearn_datastack/plugins/rest_requesters/__init__.py | Registered PressBooksCollector in the plugin exports. |
| welearn_datastack/collectors/press_books_collector.py | Added PressBooksURLCollector for retrieving PressBooks URLs. |
| welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py | New workflow node to invoke PressBooksURLCollector and insert URLs. |
| tests/url_collector/test_press_books_collector.py | Unit tests for PressBooksURLCollector. |
| tests/url_collector/resources/pb_algolia_response.json | Sample Algolia API response for URL collector tests. |
| tests/document_collector_hub/resources/pb_chapter_5_metadata.json | Sample PressBooks metadata for document collector tests. |
| tests/document_collector_hub/plugins_test/test_pressbooks.py | Unit tests for PressBooksCollector. |

Comments suppressed due to low confidence (3)

welearn_datastack/utils_/scraping_utils.py:64

  • [nitpick] Consider adding a docstring to 'clean_text_keep_punctuation' to document its purpose, parameters, and behavior.
def clean_text_keep_punctuation(text):

welearn_datastack/utils_/scraping_utils.py:64

  • Missing import for 're' module. Add 'import re' at the top of the file to enable the regex substitutions in clean_text_keep_punctuation.
def clean_text_keep_punctuation(text):

welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py:5

  • Typo in import path ('press_books_coolector'); it should be 'press_books_collector'.
from welearn_datastack.collectors.press_books_coolector import PressBooksURLCollector

lpi-tn and others added 2 commits June 23, 2025 15:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
lpi-tn merged commit ee5993b into main Jun 23, 2025
7 checks passed
lpi-tn deleted the Feature/pressbook branch June 23, 2025 15:22