@lpi-tn commented Sep 4, 2025

This pull request introduces a new URL collector and scraping plugin for UN CC:Learn (unccelearn.org) courses, along with comprehensive tests and workflow integration. The main changes include implementing the UNCCeLearnURLCollector to gather course URLs, a detailed scraping plugin to extract course metadata and content (including PDF extraction with Tika), and corresponding tests to ensure robustness. The workflow for collecting URLs is also integrated into the nodes system.

New UN CC:Learn URL Collector and Scraper

  • Implemented UNCCeLearnURLCollector to fetch course URLs from unccelearn.org, with error handling for missing data and logging for traceability. (welearn_datastack/collectors/unccelearn_collector.py)
  • Added a new scraping plugin UNCCeLearnCollector that extracts detailed course information, including metadata from HTML and associated PDF syllabus files using Tika, with robust parsing and cleaning logic. (welearn_datastack/plugins/scrapers/unccelearn.py)
  • Updated the PDF extraction utility to support returning both extracted text and metadata, enabling richer downstream processing. (welearn_datastack/modules/pdf_extractor.py)
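One way to picture the "text plus optional metadata" behaviour is the minimal sketch below. It is not the code in pdf_extractor.py: the function name, the `with_metadata` flag, and the input shape are assumptions made for illustration. The input is assumed to be an already-decoded list of records in the style of Tika's recursive-metadata JSON output, where the extracted text is stored under the `X-TIKA:content` key.

```python
from typing import Any, Dict, List, Tuple, Union

# Assumption: text lives under this key, as in Tika's recursive-metadata JSON.
CONTENT_KEY = "X-TIKA:content"


def extract_text_and_metadata(
    tika_records: List[Dict[str, Any]],
    with_metadata: bool = False,
) -> Union[str, Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: return the text, or (text, metadata) on request."""
    texts: List[str] = []
    metadata: Dict[str, Any] = {}
    for record in tika_records:
        # A record may have no content (None or missing key); treat it as empty.
        content = record.get(CONTENT_KEY) or ""
        texts.append(content.strip())
        for key, value in record.items():
            if key != CONTENT_KEY:
                # Keep the first value seen for each metadata key.
                metadata.setdefault(key, value)
    text = "\n".join(t for t in texts if t)
    return (text, metadata) if with_metadata else text


records = [
    {CONTENT_KEY: "  Syllabus intro  ", "dc:title": "Course"},
    {CONTENT_KEY: None},
]
plain_text = extract_text_and_metadata(records)
text_with_meta = extract_text_and_metadata(records, with_metadata=True)
```

Keeping the flag off by default would leave existing callers that expect a plain string unaffected, which is one plausible reason for making the metadata return optional.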

Testing and Workflow Integration

  • Added comprehensive tests for both the URL collector and the scraping plugin, covering normal and error scenarios, content extraction, and metadata parsing. (tests/url_collector/test_uncclearn_collector.py, tests/document_collector_hub/plugins_test/test_unccelearn.py)
  • Integrated the new collector into the workflow system via a new node, including logging and database session management. (welearn_datastack/nodes_workflow/URLCollectors/node_uncclearn_collect.py)
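The normal and error scenarios mentioned above could be exercised roughly as in the sketch below. Everything here is hypothetical: `CourseURLCollector`, the injected `fetch_listing` callable, and the `courses`/`url` keys are illustrative assumptions, not the actual UNCCeLearnURLCollector or the site's real response format. Injecting the fetch function lets a test supply canned data without touching the network.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


class CourseURLCollector:
    """Hypothetical collector; the listing shape is an assumption."""

    def __init__(self, fetch_listing: Callable[[], Dict[str, Any]]) -> None:
        self._fetch_listing = fetch_listing

    def collect(self) -> List[str]:
        listing = self._fetch_listing()
        courses = listing.get("courses")
        if not courses:
            # Error scenario: the listing has no course data at all.
            logger.warning("No courses found in listing")
            return []
        urls: List[str] = []
        for course in courses:
            url = course.get("url")
            if not url:
                # Error scenario: a single entry is missing its URL.
                logger.warning("Skipping course without URL: %r", course)
                continue
            if url not in urls:  # Drop duplicates, keep first-seen order.
                urls.append(url)
        logger.info("Collected %d course URLs", len(urls))
        return urls


def fake_listing() -> Dict[str, Any]:
    # Canned data standing in for a real HTTP response.
    return {
        "courses": [
            {"url": "https://example.org/course/1"},
            {"title": "entry without a URL"},
            {"url": "https://example.org/course/1"},
        ]
    }


collected = CourseURLCollector(fake_listing).collect()
empty = CourseURLCollector(lambda: {}).collect()
```

A test in this style would assert on `collected` for the normal path and on `empty` for the missing-data path.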

Minor Cleanup

  • Removed unused imports in the conversation scraper module. (welearn_datastack/plugins/scrapers/conversation.py)

Copilot AI left a comment

Pull Request Overview

This PR introduces comprehensive support for UN CC:Learn (unccelearn.org) course scraping, including URL collection, content extraction with PDF processing, and test coverage. The implementation adds a new URL collector to gather course URLs and a scraping plugin that extracts detailed course metadata and content from both HTML pages and associated PDF syllabus files using Tika.

  • Adds URL collection and scraping capabilities for UN CC:Learn courses
  • Implements PDF content extraction with metadata parsing using Tika
  • Integrates comprehensive test coverage and workflow configuration

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| welearn_datastack/plugins/scrapers/unccelearn.py | New scraping plugin for extracting course content and metadata from UN CC:Learn |
| welearn_datastack/collectors/unccelearn_collector.py | URL collector implementation for gathering course URLs from the site |
| welearn_datastack/modules/pdf_extractor.py | Enhanced PDF extraction utility to optionally return metadata |
| welearn_datastack/nodes_workflow/URLCollectors/node_uncclearn_collect.py | Workflow node for integrating the collector into the processing pipeline |
| tests/url_collector/test_uncclearn_collector.py | Test coverage for the URL collector functionality |
| tests/document_collector_hub/plugins_test/test_unccelearn.py | Comprehensive test suite for the scraping plugin |
| welearn_datastack/plugins/scrapers/conversation.py | Cleanup of unused imports |
| k8s/welearn-datastack/templates/urlcollectors/workflow-template.yaml | Kubernetes workflow template for the new collector |
| k8s/welearn-datastack/templates/urlcollectors/cron-workflow.yaml | Cron workflow configuration for scheduled collection |

lpi-tn and others added 5 commits September 4, 2025 17:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…collect.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…collect.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lpi-tn merged commit 56dcb10 into main Sep 4, 2025
7 checks passed
@lpi-tn deleted the Features/uncclearn branch September 4, 2025 16:00
@lpi-tn removed the request for review from sandragjacinto September 4, 2025 16:00