Skip to content

Conversation

@lpi-tn
Copy link
Collaborator

@lpi-tn lpi-tn commented Jun 24, 2025

No description provided.

@lpi-tn lpi-tn requested a review from Copilot June 24, 2025 15:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances HTML/text cleaning utilities with type guards and HTML entity decoding, and refines the Pressbooks plugin to build composite titles and sanitize metadata fields.

  • Introduces unescape and renames remove_html_tags to remove_html_stuff, adding type checks and HTML entity decoding
  • Adds type validation in several text utilities (format_cc_license, clean_return_to_line, clean_text, get_url_without_hal_like_versionning)
  • Updates the Pressbooks plugin to remove dead code, compose document_title from isPartOf and name, and apply clean_text to metadata fields; test adjusted accordingly

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
welearn_datastack/utils_/scraping_utils.py Renamed/extended HTML stripping, added unescape, and inserted isinstance guards in cleaners
welearn_datastack/plugins/rest_requesters/pressbooks.py Removed commented-out code, implemented composite titles, cleaned author/editor/publisher fields
tests/document_collector_hub/plugins_test/test_pressbooks.py Updated assertion to verify new composite title format
Comments suppressed due to low confidence (5)

welearn_datastack/utils_/scraping_utils.py:37

  • [nitpick] Function name remove_html_stuff is somewhat ambiguous. Consider renaming to remove_html_tags_and_unescape or similar to clearly convey that it also decodes HTML entities.
def remove_html_stuff(text: str) -> str:

welearn_datastack/utils_/scraping_utils.py:39

  • The docstring mentions removing HTML tags and special entities but doesn’t specify that html.unescape is applied. Consider updating it to document the entity decoding step.
    removes html tags and special stuff like & from text

welearn_datastack/utils_/scraping_utils.py:56

  • Parameter name license shadows the Python built-in. Consider renaming it to license_str or cc_license to avoid confusion.
def format_cc_license(license: str) -> str:

welearn_datastack/utils_/scraping_utils.py:137

  • [nitpick] Function name clean_return_to_line is unclear. Consider renaming to remove_line_breaks or strip_line_breaks to better describe its behavior.
def clean_return_to_line(string: str):

welearn_datastack/plugins/rest_requesters/pressbooks.py:86

  • This block of commented-out code appears dead. Consider removing it to keep the codebase clean and maintainable.
                    url = self._create_pressbook_id(main_url, post_id)

@lpi-tn lpi-tn merged commit aa9005b into main Jun 24, 2025
6 checks passed
@lpi-tn lpi-tn deleted the Fix/pressbook-fix-for-app branch June 24, 2025 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants