Fix/pressbook fix for app #51

lpi-tn · 2025-06-24T15:29:06Z

No description provided.

Copilot

Pull Request Overview

This PR enhances HTML/text cleaning utilities with type guards and HTML entity decoding, and refines the Pressbooks plugin to build composite titles and sanitize metadata fields.

Introduces unescape and renames remove_html_tags to remove_html_stuff, adding type checks and HTML entity decoding
Adds type validation in several text utilities (format_cc_license, clean_return_to_line, clean_text, get_url_without_hal_like_versionning)
Updates the Pressbooks plugin to remove dead code, compose document_title from isPartOf and name, and apply clean_text to metadata fields; test adjusted accordingly

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
welearn_datastack/utils_/scraping_utils.py	Renamed/extended HTML stripping, added `unescape`, and inserted `isinstance` guards in cleaners
welearn_datastack/plugins/rest_requesters/pressbooks.py	Removed commented-out code, implemented composite titles, cleaned author/editor/publisher fields
tests/document_collector_hub/plugins_test/test_pressbooks.py	Updated assertion to verify new composite title format

Comments suppressed due to low confidence (5)

welearn_datastack/utils_/scraping_utils.py:37

[nitpick] Function name remove_html_stuff is somewhat ambiguous. Consider renaming to remove_html_tags_and_unescape or similar to clearly convey that it also decodes HTML entities.

def remove_html_stuff(text: str) -> str:

welearn_datastack/utils_/scraping_utils.py:39

The docstring mentions removing HTML tags and special entities but doesn’t specify that html.unescape is applied. Consider updating it to document the entity decoding step.

    removes html tags and special stuff like &amp; from text

welearn_datastack/utils_/scraping_utils.py:56

Parameter name license shadows the Python built-in. Consider renaming it to license_str or cc_license to avoid confusion.

def format_cc_license(license: str) -> str:

welearn_datastack/utils_/scraping_utils.py:137

[nitpick] Function name clean_return_to_line is unclear. Consider renaming to remove_line_breaks or strip_line_breaks to better describe its behavior.

def clean_return_to_line(string: str):

welearn_datastack/plugins/rest_requesters/pressbooks.py:86

This block of commented-out code appears dead. Consider removing it to keep the codebase clean and maintainable.

                    url = self._create_pressbook_id(main_url, post_id)

welearn_datastack/utils_/scraping_utils.py

lpi-tn added 4 commits June 24, 2025 17:15

unescape

72a19f7

clean_text

24a409a

add fallback case

1d8f9ea

add fallback case

08a83c3

lpi-tn requested a review from Copilot June 24, 2025 15:29

Copilot AI reviewed Jun 24, 2025

View reviewed changes

welearn_datastack/utils_/scraping_utils.py Show resolved Hide resolved

lpi-tn merged commit aa9005b into main Jun 24, 2025
6 checks passed

lpi-tn deleted the Fix/pressbook-fix-for-app branch June 24, 2025 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/pressbook fix for app #51

Fix/pressbook fix for app #51

Uh oh!

lpi-tn commented Jun 24, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix/pressbook fix for app #51

Fix/pressbook fix for app #51

Uh oh!

Conversation

lpi-tn commented Jun 24, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants