Ingest SCS page chunks from main section #4

## **NOTE:** Don't assign yourself unless you have confirmed with Matthew that you've got a working environment


## 🧠 Context

Pages under `https://carleton.ca/scs/**` follow a consistent layout where the meaningful page content is contained within a specific section of the HTML (`<div id="content">` or similar). Our current ingestion logic does not account for this, so it may pick up irrelevant navigation bars, side menus, or other layout elements.

See the green section in the screenshot below. We don't want the navbar ingested each time.

![Image](https://github.com/user-attachments/assets/9ce9bccf-b07b-49be-a104-0891abf7099e)

To improve quality and consistency, we should restrict ingestion for these pages to only the main content section.

---

## 🛠 Implementation Plan

1. In `WebpageIngestionService`, detect if the source URL starts with `https://carleton.ca/scs/`.
2. If it matches:

   * Parse the HTML and extract only the content within the main section (typically `<div id="content">`).
   * Use this content for chunking instead of the full page body.
3. Add a test with a sample HTML page from `carleton.ca/scs` to verify that only the expected content is ingested.
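
The steps above could be sketched roughly as follows. This is only an illustration in Python using the standard-library `html.parser`; the real `WebpageIngestionService` may be written in another language and use a different HTML library. The names `MainContentExtractor` and `extract_text_for_chunking` are hypothetical, and returning `None` to mean "fall back to the existing full-page logic" is an assumption, not a decision made in this issue.

```python
from html.parser import HTMLParser

SCS_PREFIX = "https://carleton.ca/scs/"  # URL prefix named in the plan


class MainContentExtractor(HTMLParser):
    """Hypothetical helper: collects text found inside <div id="content">."""

    def __init__(self, container_id="content"):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0   # 0 means we are outside the container
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the container
            elif dict(attrs).get("id") == self.container_id:
                self.div_depth = 1   # entering the container
        elif tag in ("script", "style") and self.div_depth > 0:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1      # reaches 0 when the container closes
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.div_depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def extract_text_for_chunking(url, html):
    """Return main-section text for SCS pages.

    Returns None (assumed convention) when the URL is not an SCS page or
    no container text is found, so the caller can keep its current behaviour.
    """
    if not url.startswith(SCS_PREFIX):
        return None
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts) or None
```

For the test in step 3, a small hand-written HTML fixture with a `<nav>`, a `<div id="content">`, and a `<footer>` would be enough to check that only the text inside the content container is returned. Only `div` nesting is tracked here, which avoids miscounting void elements such as `<br>` or `<img>`.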

---

## ✅ Acceptance Criteria

* When ingesting pages from `https://carleton.ca/scs/**`, extract content only from the main content section of the page (e.g., `<div id="content">`).
* Exclude headers, navigation, footers, sidebars, or any boilerplate elements.
* The chunk(s) should contain only the relevant main body content.



