# Prepare Chunks (Slides and Labs) + Build Resource Catalog (Syllabus)

This notebook prepares our course materials so they can be used later for search and plan generation.

We treat each source differently based on its role:

- **Slides & Labs:** broken into small text “chunks” so we can embed them, store them in FAISS, and retrieve them when a student asks about a topic.

- **Syllabus:** not chunked; instead, we extract its links (such as pre-class readings, Kaggle, DataCamp, YouTube) and use them to build a resource catalog for supplemental study material.

**What happens here**

1. Collect files from `data/raw/`

- Slides (PDFs)
- Labs (Jupyter notebooks)
- Syllabus (PDF with resource links)

2. Extract and process text

- Slides -> page-level text
- Labs -> Markdown + Code cells as text
- Syllabus -> links only (added to catalog, not chunked)

3. Chunk text for slides/labs (~700 chars text).

4. Save outputs into structured tables.

**Outputs**

- `slides_chunks.parquet` / `slides_chunks.csv`
- `labs_chunks.parquet` / `labs_chunks.csv`
- `resources_catalog.csv` (built from syllabus links)

**Schema: Chunk Files (slides/labs)**

- `chunk_id` -> stable id (such as `<doc>#p<page>#c<chunk>` or `<nb>#cell<idx>#c<chunk>`)

- `doc_id` -> source document/notebook name

- `source_type` -> `slide | lab`

- `page` / `cell_index` -> where the chunk came from

- `topic_hint` -> (blank for now) placeholder for future tagging

- `text` -> the chunk text

- `citation_url` -> file path with page anchor

**Schema: Resource Catalog (syllabus)**

- `id` ->  unique numeric identifier

- `title` -> 

- `url` -> the resource link

- `category` -> course or external (such as Kaggle, DataCamp, YouTube)

- `topic_tags` -> 

- `description` -> line snippet or short description from syllabus

In [None]:
%pip install PyMuPDF

In [None]:
# Library 
from pathlib import Path # For handling file paths
import re # This is for regular expressions is used for pattern matching in strings used for text cleaning,processing
import fitz # For reading PDF files
import pandas as pd

## Find all the necessary data files

In [6]:
# Access raw data folders
raw = Path("../data/raw").resolve()
slides_folder = raw/"slides"
syllabus_folder = raw/"syllabus"
labs_folder = raw/"labs"

print("Slides dir:", slides_folder,   "| exists:", slides_folder.exists())
print("Syllabus :", syllabus_folder, "| exists:", syllabus_folder.exists())
print("Labs     :", labs_folder,     "| exists:", labs_folder.exists())

Slides dir: C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\slides | exists: True
Syllabus : C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\syllabus | exists: True
Labs     : C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\labs | exists: True


In [7]:
# Make a list of the files
slide_files = []
for file in slides_folder.rglob("*.pdf"): # look for all pdf files
    slide_files.append(file) # add them to the slide_files list

syllabus_files = []
for file in syllabus_folder.rglob("*.pdf"): # look for all pdf files
    syllabus_files.append(file) # add them to the syllabus_files list
    
lab_files = []
for file in labs_folder.rglob("*.ipynb"): # look for all ipynb files
    if "checkpoint" not in str(file): # ignore autosave checkpoints Jupyter creates
        lab_files.append(file) # add the rest to the lab_files list

# Show the files found
print("Found", len(slide_files), "slide PDFs")
print("Found", len(syllabus_files), "syllabus PDFs")
print("Found", len(lab_files), "lab notebooks")


Found 39 slide PDFs
Found 1 syllabus PDFs
Found 39 lab notebooks


# Extract Text and Chunking 

### Chunking logic

Some slide pages are very short (just a few bullets), while others may contain lots of text.  
To make retrieval effective, we use a simple rule:

- If a page is short (≤700 characters) → keep the page as one chunk.  
- If a page is long (>700 characters) → split into multiple 700-character chunks.  

This way:
- Sparse slides are not split unnecessarily.
- Dense slides are broken into smaller, more searchable pieces.


In [None]:
def create_chunks(text, max_characters=700):
    """If text is short, keep it as is. If it's long, split into 700 char chunks."""
    if len(text) <= max_characters:
        return [text] # return a list with the original text if it's short enough
    else:
        chunks = []
        for i in range(0, len(text), max_characters):
            chunk = text[i:i + max_characters] # get a chunk of max_characters
            chunks.append(chunk)  # add the chunk to the list
        return chunks      

**Slides**

In [None]:
# Extract from slides
# Each slide pdf is read and loop through its pages. For each page, the text is extracted and stored in a list of dictionaries with metadata (source, file name, page number, and text content).
# Each row  = one slide page
# Each column = source, file, page, text


# Hold the extracted text chunk 
slides_rows = []

# Loop through is slide pdf 
for file in slide_files:
    pdf = fitz.open(file.as_posix()) # open the PDF file, convert Path to string with as_posix()
    # Iterate over each page and extract text
    for page_num in range(len(pdf)): # iterate over each page
        page_text = pdf[page_num].get_text("text").strip() # extract text and add a newline for separation
        if not page_text: # skip empty pages
            continue

    # Apply chunking
    chunks = create_chunks(page_text, max_characters=700)

    for i, chunk in enumerate(chunks):
        slides_rows.append({"source": "slide", 
                            "file": file.name, # the file name 
                            "page": page_num + 1, # page number 
                            "chunk": i + 1, # chunk number
                            "text": chunk # the chunked text
        }) # store the extracted text with metadata

print(f"Extracted {len(slides_rows)} chunks from {len(slide_files)} slide PDFs") 
            
# Convert to DataFrame and show sample
slides_df = pd.DataFrame(slides_rows)
slides_df.head()

# Save to data/processed
slides_df.to_csv("../data/processed/slides_chunks.csv", index=False)
slides_df.to_parquet("../data/processed/slides_chunks.parquet", index=False)

print ("Saved to data/processed/slides_chunks.csv and slides_chunks.parquet")

Extracted 40 chunks from 39 slide PDFs
Saved to data/processed/slides_chunks.csv and slides_chunks.parquet


In [None]:
df = pd.DataFrame(slides_rows)
print(df['page'].describe())      # sanity check page numbers
print(df['text'].str.len().describe())  # avg text length
df.head()


count     40.000000
mean      60.675000
std       23.432814
min       15.000000
25%       43.750000
50%       62.000000
75%       77.500000
max      104.000000
Name: page, dtype: float64
count     40.000000
mean     213.975000
std      121.045126
min        0.000000
25%      162.500000
50%      213.500000
75%      238.250000
max      700.000000
Name: text, dtype: float64


Unnamed: 0,source,file,page,chunk,text
0,slide,(Re)-Introduction to Data Science & Control Fl...,102,1,
1,slide,AB Testing.pdf,49,1,Thursday\nOn Thursday we will be meeting for o...
2,slide,Advanced Abstraction.pptx.pdf,66,1,Tuesday\nTomorrow will entail:\n●\nFurther exp...
3,slide,Advanced Control Flow.pptx.pdf,93,1,
4,slide,Advanced Data Processing.pptx.pdf,61,1,Wednesday\nWednesday will entail:\n●\nA review...


**Labs**

In [None]:
import json

# Extract text from code and markdown cells in lab notebooks

labs_rows = [] # hold the extracted text chunks

for file in lab_files:
    try:
        with open(file, "r", encoding="utf-8") as f: # r allows reading the file
            content = f.read() # read the entire file content
            if not content.strip(): # skip empty files
                print(f"Skipping empty file: {file}")
                continue
            notebook = json.loads(content)  # load the notebook as a dict
    except Exception as e:
        print(f"Skipping file {file} due to error: {e}")
        continue

    cells = notebook.get('cells', [])  # access the cells section
    for cell in cells:  # loop through each cell
        cell_type = cell.get('cell_type')  # get the type of cell (code or markdown)
        if cell_type in ['code', 'markdown']:  # only want code and markdown cells
            source = ''.join(cell.get('source', [])).strip()  # join the list of strings into one string and strip whitespace
            if not source:  # skip empty cells
                continue

            # Apply chunking
            chunks = create_chunks(source, max_characters=700)

            for i, chunk in enumerate(chunks):
                labs_rows.append({
                    "source": "lab", # source type
                    "file": file.name, # the file name
                    "cell_type": cell_type, # code or markdown
                    "chunk": i + 1, # chunk number
                    "text": chunk # the chunked text
                })  # store the extracted text with metadata

print("Extracted", len(labs_rows), "chunks from labs")

# Convert to DataFrame and show sample
labs_df = pd.DataFrame(labs_rows)
labs_df.to_csv("../data/processed/labs_chunks.csv", index=False)
labs_df.to_parquet("../data/processed/labs_chunks.parquet", index=False)

labs_df.head()
print("Saved labs to ../data/processed/labs_chunks.csv and labs_chunks.parquet")

Skipping empty file: C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\labs\phase2\week09\transformers.ipynb
Extracted 1410 chunks from labs
Saved labs to ../data/processed/labs_chunks.csv and labs_chunks.parquet


**Syllabus**

In [23]:
SYLL_PDF_PATH = Path("../data/raw/syllabus/IF Data Science 2025 Syllabus.pdf")
OUT_ALL_LINKS  = Path("../data/processed/syllabus_all_links.csv")


In [None]:
def extract_links_from_syll(syll_pdf_path):
    """Extract all links and pre-class links from the syllabus PDF.
    Extracts:
      A) Link annotations (true clickable links)
      B) URLs found in page text via regex
    Returns a DataFrame with: page, url, source ("annotation" or "text"), line_snippet
    """
    syll_pdf = fitz.open(syll_pdf_path.as_posix())  # open the PDF file

    url_regex = re.compile(r"https?://[^\s]+")  # regex to find URLs starting with http or https

    rows = []  # hold extracted links

    for page_num in range(len(syll_pdf)):  # iterate over each page
        page = syll_pdf[page_num] # get the page
        page_number = page_num + 1 # page number

        # A) Extract link annotations
        for link in page.get_links():
            uri = (link.get("uri") or "").strip() # get the URL from the link annotation
            if uri:
                rows.append({
                    "page": page_number, 
                    "url": uri, # the URL; uri is used for web links
                    "source": "annotation",
                    "line_snippet": ""  # no line snippet for annotations
                })

    df = pd.DataFrame(rows)
    if df.empty:
        return df

    df = df.drop_duplicates(subset=["url"]).reset_index(drop=True)
    return df

# Get all links from the pdf 
all_links_df = extract_links_from_syll(SYLL_PDF_PATH)
all_links_df.to_csv(OUT_ALL_LINKS, index=False)

print(f"Extracted {len(all_links_df)} total links from {SYLL_PDF_PATH.name} and saved to {OUT_ALL_LINKS}")



Extracted 89 total links from IF Data Science 2025 Syllabus.pdf and saved to ..\data\processed\syllabus_all_links.csv


In [32]:
ALL_LINKS = Path("../data/processed/syllabus_all_links.csv")
OUT_LINKS = Path("../catalog/resources_catalog.csv")

syll_df = pd.read_csv(ALL_LINKS)

# External resources categories (Kaggle/DataCamp/YouTube)

def categorize_link(url):
    url = str(url).lower()
    if any (keyword in url for keyword in ["kaggle.com", "datacamp.com", "youtube.com"]):
        return "external"
    return "course"

# Create catalog category col

catalog_df = pd.DataFrame(
    {
        "id": range(1, len(syll_df) + 1), # unique ID
        "title": "",  # 
        "url": syll_df["url"].fillna(""),  # the URL
        "category": syll_df["url"].apply(categorize_link),  # categorize as external or course
        "topic_tags": "",  
        "description": syll_df.get("line_snippet").fillna(""),  # use line snippet as description if available
    }
)

catalog_df.to_csv(OUT_LINKS, index=False)
print(f"Saved resource catalog with {len(catalog_df)} resources to {OUT_LINKS}")

Saved resource catalog with 89 resources to ..\catalog\resources_catalog.csv


**Clean Data**


In [None]:
# Remove white spaces in Slides
# slides_df["text"] = (
#     slides_df["text"]. astype(str)  # ensure it's a string
#     .str.replace(r'\s+', ' ', regex=True)  # replace multiple whitespace with single space
#     .str.strip()  # remove leading/trailing whitespace
# )

# # Drop empty rows
# slides_before = len(slides_df)
# slides_df = slides_df[slides_df["text"] != ""].copy()
# print(f"Dropped {slides_before - len(slides_df)} empty rows from slides_df")

# # Remove Agenda slides
# # Count how many rows mention "agenda"
# print("Before filtering:", len(slides_df))
# print(slides_df[slides_df["text"].str.contains("agenda", case=False, na=False)].head(5))


# # Drop rows that contain the word "agenda" (case-insensitive)
# slides_df = slides_df[~slides_df["text"].str.contains("agenda", case=False, na=False)].copy()

# print("After filtering:", len(slides_df))

Dropped 0 empty rows from slides_df
Before filtering: 38
Empty DataFrame
Columns: [source, file, page, chunk, text]
Index: []
After filtering: 38


## Note for Future Work 

Currently, the slide data is going to be used as is after chunking. Some extra processing is needed such as removing the agendas from the slides to improve retreiving data. 