Proposal: Enhance Project Scraper Robustness and Observability (GSoC 2026) #4380

@Makeepan-dev

Description

While reviewing the backend/apps/owasp/ module in preparation for GSoC 2026, I noticed that the current OwaspScraper in scraper.py has two limitations: its parsing assumes a fixed page structure, and it offers little observability in production.

If the scraper encounters a non-standard page layout or a request fails, the exception is only logged locally. Nothing is recorded at the database level, so maintainers cannot track scrape health over time.

Proposed Changes

To make the data ingestion pipeline more robust, I propose the following enhancements:

  1. Flexible Selectors (scraper.py): Update get_urls() and other parsing methods to use more resilient XPath/CSS selectors, falling back to alternative content containers if div[@class='sidebar'] is missing.
  2. Scrape Logging Model (scrape_log.py): Create a lightweight Django model to record the status, duration, and specific errors of automated scraping tasks.
  3. Admin Integration (admin.py): Register the new logging model in the Django Admin panel so maintainers can visually monitor scraper health and identify failing project URLs without digging through server logs.
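To make items 1 and 2 more concrete, here is a minimal, dependency-free sketch of the idea, not the actual implementation. Only `div[@class='sidebar']` comes from the current scraper; the fallback containers, the `ScrapeLog` fields, the status strings, and the `scrape()` helper are all hypothetical names chosen for illustration. It uses the standard library's `xml.etree.ElementTree` so it runs anywhere, which means it only handles well-formed markup; the real change would stay on whatever HTML parser scraper.py already uses, and `ScrapeLog` would become a Django model rather than a dataclass.

```python
import time
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical fallback order. Only the first entry is taken from the
# current scraper; the other containers are assumptions for illustration.
CONTAINER_XPATHS = (
    ".//div[@class='sidebar']",
    ".//div[@id='main']",
    ".//body",
)


def get_urls(page):
    """Return hrefs from the first container that exists and holds links."""
    for xpath in CONTAINER_XPATHS:
        container = page.find(xpath)
        if container is None:
            continue  # container missing on this layout, try the next one
        urls = [a.get("href") for a in container.iter("a") if a.get("href")]
        if urls:
            return urls
    return []


@dataclass
class ScrapeLog:
    """Plain stand-in for the fields the proposed model could record."""

    url: str
    status: str  # "success", "empty", or "failed" (assumed values)
    duration_seconds: float
    error: str = ""


def scrape(url, markup):
    """Parse one page and describe the outcome instead of raising."""
    started = time.monotonic()
    try:
        urls = get_urls(ET.fromstring(markup))
        status, error = ("success" if urls else "empty"), ""
    except ET.ParseError as exc:
        urls, status, error = [], "failed", str(exc)
    return urls, ScrapeLog(url, status, time.monotonic() - started, error)
```

With this shape, a page without a sidebar still yields links from a fallback container, and a page that cannot be parsed produces a `ScrapeLog` with status `"failed"` and the error text, which is the kind of row item 3 would surface in the Django Admin.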

Note to Maintainers: I already have my local environment fully synced and am actively testing these XPath fallback strategies. I would love to be assigned this issue so I can submit a Pull Request with the implementation!

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects: Status: Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests