Feature/pressbook #47

lpi-tn · 2025-06-23T13:31:41Z

This pull request introduces a new PressBooksCollector plugin and related functionality for collecting and processing data from PressBooks. It includes new implementations for URL collection, metadata extraction, and unit tests to ensure the functionality works as expected. Below is a breakdown of the most important changes:

New Plugin Implementation

welearn_datastack/plugins/rest_requesters/pressbooks.py: Added the PressBooksCollector class to fetch and process book metadata and content from PressBooks. It includes methods for extracting book URLs, processing metadata, validating licenses, and generating structured documents.
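
The license validation step mentioned above could look roughly like the sketch below. This is not the code from the PR: the allow-list, the `UnauthorizedLicense` exception, and both function names are hypothetical, chosen only to illustrate rejecting a book whose license is not accepted.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: the licenses the real plugin accepts are not shown in this PR.
AUTHORIZED_LICENSE_PREFIXES = (
    "creativecommons.org/licenses/by/",
    "creativecommons.org/licenses/by-sa/",
    "creativecommons.org/publicdomain/zero/",
)


class UnauthorizedLicense(Exception):
    """Raised when a license is not on the allow-list (hypothetical name)."""


def normalize_license_url(url: str) -> str:
    """Lower-case the URL and drop the scheme and a leading 'www.'."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc.removeprefix("www.")
    return f"{host}{parsed.path}"


def validate_license(url: str) -> str:
    """Return the normalized license URL, or raise if it is not authorized."""
    normalized = normalize_license_url(url)
    if not normalized.startswith(AUTHORIZED_LICENSE_PREFIXES):
        raise UnauthorizedLicense(f"License not authorized: {url}")
    return normalized
```

Matching on a normalized prefix means that `http`/`https` and `www.` variants of the same license URL are treated alike, while a more restrictive variant such as `by-nc-nd` is rejected because it does not share the `by/` prefix.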

URL Collection Enhancements

welearn_datastack/collectors/press_books_collector.py: Introduced the PressBooksURLCollector class to retrieve book URLs and chapter URLs from PressBooks using the Algolia API. This class handles TOC parsing and URL formatting.
welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py: Added a workflow node to execute the PressBooksURLCollector, retrieve URLs, and insert them into the database.
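
As an illustration of the URL collection flow described above, here is a minimal sketch that works on already-parsed JSON, with no network calls. It assumes an Algolia-style search response with a top-level `hits` list; the `url` field on each hit, the `parts`/`chapters`/`slug` layout of the TOC, and the `/chapter/<slug>/` URL pattern are assumptions made for the example, and the function names are hypothetical rather than taken from PressBooksURLCollector.

```python
from typing import Any


def book_urls_from_hits(response: dict[str, Any]) -> list[str]:
    """Extract de-duplicated book URLs from an Algolia-style response."""
    seen: set[str] = set()
    urls: list[str] = []
    for hit in response.get("hits", []):
        url = hit.get("url")  # assumed field name
        if not url:
            continue
        # Normalize to exactly one trailing slash so duplicates compare equal.
        url = url.rstrip("/") + "/"
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def chapter_urls_from_toc(book_url: str, toc: dict[str, Any]) -> list[str]:
    """Build chapter URLs from a TOC dict (assumed parts -> chapters -> slug)."""
    base = book_url.rstrip("/")
    urls: list[str] = []
    for part in toc.get("parts", []):
        for chapter in part.get("chapters", []):
            slug = chapter.get("slug")
            if slug:
                urls.append(f"{base}/chapter/{slug}/")
    return urls
```

Keeping the parsing separate from the HTTP requests lets tests feed these functions a stored fixture, in the spirit of the pb_algolia_response.json resource added in this PR.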

Unit Tests

tests/document_collector_hub/plugins_test/test_pressbooks.py: Added unit tests for PressBooksCollector to validate metadata extraction, content parsing, and error handling for unauthorized licenses and HTTP errors.
tests/url_collector/test_press_books_collector.py: Added unit tests for PressBooksURLCollector to verify URL collection and integration with the Algolia API.

Test Resources

tests/document_collector_hub/resources/pb_chapter_5_metadata.json: Added a sample metadata JSON file for testing the PressBooksCollector.
tests/url_collector/resources/pb_algolia_response.json: Added a sample Algolia response JSON for testing URL collection.

Integration with Existing Codebase

welearn_datastack/plugins/rest_requesters/__init__.py: Registered the PressBooksCollector plugin in the module's imports and exports.

Copilot

Pull Request Overview

This PR adds support for collecting and processing PressBooks data by introducing a new REST plugin and URL collector, along with related utilities and tests.

Added clean_text_keep_punctuation and updated clean_text to preserve punctuation.
Implemented PressBooksCollector to fetch book metadata/content and produce ScrapedWeLearnDocument instances.
Introduced PressBooksURLCollector workflow node and unit tests for both URL collection and content extraction.

Reviewed Changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.

File	Description
welearn_datastack/utils_/scraping_utils.py	Added `clean_text_keep_punctuation` and updated `clean_text`.
welearn_datastack/plugins/rest_requesters/pressbooks.py	New `PressBooksCollector` implementation for metadata and content.
welearn_datastack/plugins/rest_requesters/__init__.py	Registered `PressBooksCollector` in the plugin exports.
welearn_datastack/collectors/press_books_collector.py	Added `PressBooksURLCollector` for retrieving PressBooks URLs.
welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py	New workflow node to invoke `PressBooksURLCollector` and insert URLs.
tests/url_collector/test_press_books_collector.py	Unit tests for `PressBooksURLCollector`.
tests/url_collector/resources/pb_algolia_response.json	Sample Algolia API response for URL collector tests.
tests/document_collector_hub/resources/pb_chapter_5_metadata.json	Sample PressBooks metadata for document collector tests.
tests/document_collector_hub/plugins_test/test_pressbooks.py	Unit tests for `PressBooksCollector`.

Comments suppressed due to low confidence (3)

welearn_datastack/utils_/scraping_utils.py:64

[nitpick] Consider adding a docstring to 'clean_text_keep_punctuation' to document its purpose, parameters, and behavior.

def clean_text_keep_punctuation(text):

welearn_datastack/utils_/scraping_utils.py:64

Missing import for 're' module. Add 'import re' at the top of the file to enable the regex substitutions in clean_text_keep_punctuation.

def clean_text_keep_punctuation(text):
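
Taken together, the two comments above ask for an `import re` and a docstring. A minimal sketch of a function with that shape follows. The body is an assumption, since the actual implementation in scraping_utils.py is not shown on this page; only the function name and its use of regex substitutions come from the review comments.

```python
import re


def clean_text_keep_punctuation(text: str) -> str:
    """Normalize whitespace in ``text`` while leaving punctuation in place.

    Assumed behavior for this sketch:
    - non-whitespace control characters are removed,
    - runs of whitespace (including newlines) become a single space,
    - whitespace directly before , . ; : ! ? is dropped,
    - leading and trailing whitespace is stripped.
    """
    # Drop control characters that are not ordinary whitespace.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse any whitespace run into one space.
    text = re.sub(r"\s+", " ", text)
    # Remove the space left in front of closing punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()
```

For example, `clean_text_keep_punctuation("Hello ,\n  world !")` returns `"Hello, world!"`.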

welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py:5

Typo in import path ('press_books_coolector'); it should be 'press_books_collector'.

from welearn_datastack.collectors.press_books_coolector import PressBooksURLCollector

lpi-tn added 9 commits June 18, 2025 11:57

create pressbook collector

7d63b76

rename

864f4c2

url collector + test

e9346f2

pressbooks content retrievement

761f4ee

détails stuff

180ff2f

rm comment + add new cleanup method in clean text

ff7307f

start of unit test

09018c8

Tests methods

86ab77d

add plugin

5bfcd87

lpi-tn requested review from Copilot, jmsevin and sandragjacinto June 23, 2025 13:31

Copilot AI reviewed Jun 23, 2025

welearn_datastack/plugins/rest_requesters/pressbooks.py (outdated, resolved)

welearn_datastack/plugins/rest_requesters/pressbooks.py (resolved)

welearn_datastack/nodes_workflow/URLCollectors/node_press_books_collect.py (outdated, resolved)

lpi-tn and others added 2 commits June 23, 2025 15:35

remove useless vibecoded function

895884b

Update welearn_datastack/plugins/rest_requesters/pressbooks.py

ff44e3a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

sandragjacinto reviewed Jun 23, 2025

welearn_datastack/collectors/press_books_collector.py (resolved)

lpi-tn added 3 commits June 23, 2025 15:41

apply corrections from copilot

55c86c8

Merge remote-tracking branch 'origin/Feature/pressbook' into Feature/…

f8d3490

…pressbook

lint

6a82678

sandragjacinto reviewed Jun 23, 2025

welearn_datastack/collectors/press_books_collector.py (resolved)

typo

509181b

sandragjacinto reviewed Jun 23, 2025

welearn_datastack/collectors/press_books_collector.py (resolved)

sandragjacinto approved these changes Jun 23, 2025

jmsevin approved these changes Jun 23, 2025

lpi-tn merged commit ee5993b into main Jun 23, 2025
7 checks passed

lpi-tn deleted the Feature/pressbook branch June 23, 2025 15:22


4 participants
