Description
While reviewing the `backend/apps/owasp/` module in preparation for GSoC 2026, I noticed that the current `OwaspScraper` in `scraper.py` has a few limitations in page-structure parsing and production observability.
If the scraper encounters a non-standard page layout or a request fails, it logs the exception locally but provides no database-level visibility for maintainers to track scrape health over time.
Proposed Changes
To make the data ingestion pipeline more robust, I propose the following enhancements:
- **Flexible selectors** (`scraper.py`): Update `get_urls()` and other parsing methods to use more resilient XPath/CSS selectors, falling back to alternative content containers if `div[@class='sidebar']` is missing.
- **Scrape logging model** (`scrape_log.py`): Create a lightweight Django model to record the status, duration, and specific errors of automated scraping tasks.
- **Admin integration** (`admin.py`): Register the new logging model in the Django Admin panel so maintainers can visually monitor scraper health and identify failing project URLs without digging through server logs.
Note to Maintainers: I already have my local environment fully synced and am actively testing these XPath fallback strategies. I would love to be assigned this issue so I can submit a Pull Request with the implementation!
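For the logging model and admin integration, a minimal sketch might look like the following. All field names, choices, and the admin list configuration are assumptions for discussion, not a final schema:

```python
# scrape_log.py -- hypothetical ScrapeLog sketch; fields are illustrative
# assumptions, not a finalized schema.
from django.db import models


class ScrapeLog(models.Model):
    """Records the outcome of one automated scrape run."""

    class Status(models.TextChoices):
        SUCCESS = "success"
        FAILURE = "failure"

    url = models.URLField()                       # page that was scraped
    status = models.CharField(max_length=10, choices=Status.choices)
    duration = models.DurationField(null=True)    # wall-clock scrape time
    error_message = models.TextField(blank=True)  # summary/traceback on failure
    created_at = models.DateTimeField(auto_now_add=True)


# admin.py -- register the model so maintainers can filter failing runs
# directly in the Django Admin instead of reading server logs.
from django.contrib import admin


@admin.register(ScrapeLog)
class ScrapeLogAdmin(admin.ModelAdmin):
    list_display = ("url", "status", "duration", "created_at")
    list_filter = ("status",)
```

Keeping the model this small means one row per scrape run, which is cheap to write from the existing exception handler and easy to prune with a periodic cleanup task if volume becomes a concern.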