Features/uncclearn #62
Pull Request Overview
This PR introduces comprehensive support for UN CC:Learn (unccelearn.org) course scraping, including URL collection, content extraction with PDF processing, and test coverage. The implementation adds a new URL collector to gather course URLs and a scraping plugin that extracts detailed course metadata and content from both HTML pages and associated PDF syllabus files using Tika.
- Adds URL collection and scraping capabilities for UN CC:Learn courses
- Implements PDF content extraction with metadata parsing using Tika
- Integrates comprehensive test coverage and workflow configuration
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| welearn_datastack/plugins/scrapers/unccelearn.py | New scraping plugin for extracting course content and metadata from UN CC:Learn |
| welearn_datastack/collectors/unccelearn_collector.py | URL collector implementation for gathering course URLs from the site |
| welearn_datastack/modules/pdf_extractor.py | Enhanced PDF extraction utility to optionally return metadata |
| welearn_datastack/nodes_workflow/URLCollectors/node_uncclearn_collect.py | Workflow node for integrating the collector into the processing pipeline |
| tests/url_collector/test_uncclearn_collector.py | Test coverage for the URL collector functionality |
| tests/document_collector_hub/plugins_test/test_unccelearn.py | Comprehensive test suite for the scraping plugin |
| welearn_datastack/plugins/scrapers/conversation.py | Cleanup of unused imports |
| k8s/welearn-datastack/templates/urlcollectors/workflow-template.yaml | Kubernetes workflow template for the new collector |
| k8s/welearn-datastack/templates/urlcollectors/cron-workflow.yaml | Cron workflow configuration for scheduled collection |
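The `pdf_extractor.py` row describes extraction that can optionally return metadata. A minimal sketch of that pattern follows. It assumes a Tika-style response dict with `content` and `metadata` keys; the function name `extract_pdf_text`, the `with_metadata` flag, and the `fake_parse` stand-in are hypothetical and are not taken from the PR.

```python
from typing import Any, Callable, Dict, Tuple, Union

# Hypothetical type: a callable that takes raw PDF bytes and returns a
# Tika-style response dict with "content" and "metadata" keys (assumed shape).
TikaLikeParser = Callable[[bytes], Dict[str, Any]]


def extract_pdf_text(
    pdf_bytes: bytes,
    parse: TikaLikeParser,
    with_metadata: bool = False,
) -> Union[str, Tuple[str, Dict[str, Any]]]:
    """Return the extracted text, plus the metadata dict when requested."""
    response = parse(pdf_bytes)
    # Content can be None (for example, for an image-only PDF), so fall back to "".
    text = (response.get("content") or "").strip()
    if not with_metadata:
        return text
    metadata = response.get("metadata") or {}
    return text, metadata


def fake_parse(_: bytes) -> Dict[str, Any]:
    # Stand-in for a real Tika call so the sketch runs offline.
    return {"content": "\n\nCourse syllabus\n", "metadata": {"dc:title": "Syllabus"}}


print(extract_pdf_text(b"%PDF-", fake_parse))
print(extract_pdf_text(b"%PDF-", fake_parse, with_metadata=True))
```

Keeping the default at `with_metadata=False` means existing callers that expect a plain string would not need to change, which is one plausible reason to make the metadata return optional.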
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new URL collector and scraping plugin for UN CC:Learn (unccelearn.org) courses, along with comprehensive tests and workflow integration. The main changes include implementing the `UNCCeLearnURLCollector` to gather course URLs, a detailed scraping plugin to extract course metadata and content (including PDF extraction with Tika), and corresponding tests to ensure robustness. The workflow for collecting URLs is also integrated into the nodes system.
New UN CC:Learn URL Collector and Scraper
- Added `UNCCeLearnURLCollector` to fetch course URLs from unccelearn.org, with error handling for missing data and logging for traceability. (welearn_datastack/collectors/unccelearn_collector.py)
- Added the scraping plugin `UNCCeLearnCollector` that extracts detailed course information, including metadata from HTML and associated PDF syllabus files using Tika, with robust parsing and cleaning logic. (welearn_datastack/plugins/scrapers/unccelearn.py)
Testing and Workflow Integration
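As an illustration of how a collector with error handling for missing data might be tested, here is a small sketch. `SketchURLCollector`, the injected `fetch` callable, and the `{"courses": [{"url": ...}]}` payload shape are all assumptions made for this example; the PR's actual `UNCCeLearnURLCollector` and its tests may be structured quite differently.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

# Hypothetical payload shape: {"courses": [{"url": "..."}, ...]}.
FetchFn = Callable[[], Dict[str, Any]]


class SketchURLCollector:
    """Illustrative stand-in, not the PR's UNCCeLearnURLCollector."""

    def __init__(self, fetch: FetchFn) -> None:
        # The fetch function is injected so tests never touch the network.
        self._fetch = fetch

    def collect(self) -> List[str]:
        payload = self._fetch()
        urls: List[str] = []
        for index, course in enumerate(payload.get("courses", [])):
            url = course.get("url")
            if not url:
                # Missing or empty URL: log it and move on instead of failing.
                logger.warning("Skipping course %d: missing URL", index)
                continue
            urls.append(url)
        return urls


def test_collect_skips_entries_without_url() -> None:
    payload = {
        "courses": [
            {"url": "https://example.org/course/1"},
            {"title": "no url"},
            {"url": ""},
        ]
    }
    collector = SketchURLCollector(lambda: payload)
    assert collector.collect() == ["https://example.org/course/1"]


def test_collect_handles_missing_course_list() -> None:
    assert SketchURLCollector(lambda: {}).collect() == []


test_collect_skips_entries_without_url()
test_collect_handles_missing_course_list()
```

Injecting the fetch step keeps the test deterministic and lets the missing-data paths be exercised with plain dictionaries.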
Minor Cleanup