Ingest SCS page chunks from main section #4

@MathyouMB

Description

NOTE: Don't assign yourself unless you have confirmed with Matthew that you have a working environment.

🧠 Context

Pages under https://carleton.ca/scs/** follow a consistent layout in which the meaningful page content sits inside a specific section of the HTML (<div id="content"> or similar). Our current ingestion logic does not account for this, so it may pick up irrelevant navigation bars, side menus, and other layout elements.

See the green section. We don't want the navbar ingested each time.

Image

To improve quality and consistency, we should restrict ingestion for these pages to only the main content section.


🛠 Implementation Plan

  1. In WebpageIngestionService, detect if the source URL starts with https://carleton.ca/scs/.

  2. If it matches:

    • Parse the HTML and extract only the content within the main section (typically <div id="content">).
    • Use this content for chunking instead of the full page body.
  3. Add a test with a sample HTML page from carleton.ca/scs to verify that only the expected content is ingested.
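
The steps above could be sketched roughly as follows. This is a minimal Python illustration only, not the project's actual code: the real WebpageIngestionService may be written in a different language and structured differently. The names SCS_PREFIX, SKIPPED_TAGS, MainSectionParser, extract_main_text, and select_ingestion_text are hypothetical, as are the choice to drop nested script/style/nav/footer/aside elements and the fallback to the full-page text when no <div id="content"> is found.

```python
from html.parser import HTMLParser

# URL prefix named in the issue.
SCS_PREFIX = "https://carleton.ca/scs/"

# Assumption: text inside these tags is dropped even within the main section.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "aside"}


class MainSectionParser(HTMLParser):
    """Collects the text found inside the first element with a given id."""

    def __init__(self, section_id="content"):
        super().__init__(convert_charrefs=True)
        self.section_id = section_id
        self.section_tag = None  # tag name of the matched element, e.g. "div"
        self.found = False       # True once the main section has been seen
        self.depth = 0           # > 0 while the parser is inside the section
        self.skip_depth = 0      # > 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Only the first matching element is treated as the main section.
            if not self.found and dict(attrs).get("id") == self.section_id:
                self.found = True
                self.section_tag = tag
                self.depth = 1
            return
        if tag == self.section_tag:
            self.depth += 1  # nested element of the same type, e.g. inner <div>
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag == self.section_tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def extract_main_text(html, section_id="content"):
    """Return the main section's text, or None if the section is missing."""
    parser = MainSectionParser(section_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return "\n".join(parser.parts)


def select_ingestion_text(url, html, full_page_text):
    """Pick the text to chunk: main section for SCS pages, full page otherwise."""
    if url.startswith(SCS_PREFIX):
        main_text = extract_main_text(html)
        if main_text is not None:
            return main_text
    # Assumption: fall back to the existing full-page text when nothing matches.
    return full_page_text
```

A test for step 3 could feed a saved SCS page through select_ingestion_text and assert that navbar and footer strings are absent from the result while a known sentence from the main body is present. Tracking nesting depth rather than stopping at the first closing tag matters here because the main section usually contains nested <div> elements.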


✅ Acceptance Criteria

  • When ingesting pages from https://carleton.ca/scs/**, extract content only from the main content section of the page (e.g., <div id="content">).
  • Exclude headers, navigation, footers, sidebars, or any boilerplate elements.
  • The chunk(s) should contain only the relevant main body content.

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: No type
Project status: Ready
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests