# Prepare Chunks (Syllabus, Sluides and Labs)

The **goal of this notebook** is to turn all of the course files into small chunks of text that we can search later with embeddings/ FAISS. 

**Inputs** *(from `data/raw/`)
- `syllabus/`: syllabus pdf 
- `slides/`  : lecture slide PDFs (convert PPTX : PDF if needed)
- `labs/`    : Jupyter notebooks (`.ipynb`) organized by Phase/Week/etc.

**What this notebook does**
1. Finds all PDFs (`syllabus/`, `slides/`) and all `.ipynb` files (`labs/`) recursively.
2. Extracts text:
   - PDFs → per-page text
   - Notebooks → Markdown + Code cells (kept as text)
3. Splits long text into ~900-char chunks (600 for code cells).
4. Writes a single table with chunk metadata.

**Outputs (written to `data/processed/`)**
- `chunk_meta.parquet` — canonical table of chunks + metadata
- `chunks.csv` — same as parquet, human-readable

**Schema (columns)**
- `chunk_id`  — stable id like `<doc>#p<page>#c<chunk>` or `<nb>#cell<idx>#c<chunk>`
- `doc_id`    — source document/notebook stem
- `source_type` — `syllabus | slides | lab`
- `page` / `cell_index` — where the chunk came from
- `topic_hint` — (blank for now) we can fill later
- `text` — the chunk text
- `citation_url` — file:// link with page anchor when available

**Run checklist**
- [ ] Files are local 
- [ ] Slide files are pdf, not pptx 

In [None]:
%pip install PyMuPDF
%pip install pdfplumber

Collecting PyMuPDF
  Downloading pymupdf-1.26.4-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-win_amd64.whl (18.7 MB)
   ---------------------------------------- 0.0/18.7 MB ? eta -:--:--
   -------------------- ------------------- 9.7/18.7 MB 67.4 MB/s eta 0:00:01
   ---------------------------------------- 18.7/18.7 MB 47.4 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.4
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Library 
from pathlib import Path # For handling file paths
import re # This is for regular expressions is used for pattern matching in strings used for text cleaning,processing
import fitz # For reading PDF files
import pdfplumber # For reading PDF files
import pandas as pd

Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-win_amd64.whl.metadata (48 kB)
Collecting cryptography>=36.0.0 (from pdfminer.six==20250506->pdfplumber)
  Downloading cryptography-45.0.6-cp311-abi3-win_amd64.whl.metadata (5.7 kB)
Downloading pdfplumber-0.11.7-py3-none-any.whl (60 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
   ---------------------------------------- 0.0/5.6 MB ? eta -:--:--
   ---------------------------------------- 5.6/5.6 MB 38.3 MB/s eta 0:00:00
Downloading pypdfium2-4.30.0-py3-none-win_amd64.whl (2.9 MB)
   ---------------------------------------- 0.0/2.9 MB ? eta -:--:--
   ---------------------------------------- 2.9/2.9 MB 56.0 MB/s eta 0:00:00
Downloading cryptography-45.0.6-cp311-abi3-win_amd64

## Find all the necessary data files

In [16]:
# Access raw data folders
raw = Path("../data/raw").resolve()
slides_folder = raw/"slides"
syllabus_folder = raw/"syllabus"
labs_folder = raw/"labs"

print("Slides dir:", slides_folder,   "| exists:", slides_folder.exists())
print("Syllabus :", syllabus_folder, "| exists:", syllabus_folder.exists())
print("Labs     :", labs_folder,     "| exists:", labs_folder.exists())

Slides dir: C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\slides | exists: True
Syllabus : C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\syllabus | exists: True
Labs     : C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\labs | exists: True


In [17]:
# Make a list of the files
slide_files = []
for file in slides_folder.rglob("*.pdf"): # look for all pdf files
    slide_files.append(file) # add them to the slide_files list

syllabus_files = []
for file in syllabus_folder.rglob("*.pdf"): # look for all pdf files
    syllabus_files.append(file) # add them to the syllabus_files list
    
lab_files = []
for file in labs_folder.rglob("*.ipynb"): # look for all ipynb files
    if "checkpoint" not in str(file): # ignore autosave checkpoints Jupyter creates
        lab_files.append(file) # add the rest to the lab_files list

# Show the files found
print("Found", len(slide_files), "slide PDFs")
print("Found", len(syllabus_files), "syllabus PDFs")
print("Found", len(lab_files), "lab notebooks")


Found 39 slide PDFs
Found 1 syllabus PDFs
Found 39 lab notebooks


# Extract Text and Chunking 

### Chunking logic

Some slide pages are very short (just a few bullets), while others may contain lots of text.  
To make retrieval effective, we use a simple rule:

- If a page is short (≤700 characters) → keep the page as one chunk.  
- If a page is long (>700 characters) → split into multiple 700-character chunks.  

This way:
- Sparse slides are not split unnecessarily.
- Dense slides are broken into smaller, more searchable pieces.


In [21]:
def create_chunks(text, max_characters=700):
    """If text is short, keep it as is. If it's long, split into 700 char chunks."""
    if len(text) <= max_characters:
        return [text] # return a list with the original text if it's short enough
    else:
        chunks = []
        for i in range(0, len(text), max_characters):
            chunk = text[i:i + max_characters]
            chunks.append(chunk)  
        return chunks      

**Slides**

In [26]:
# Extract from slides
# Each slide pdf is read and loop through its pages. For each page, the text is extracted and stored in a list of dictionaries with metadata (source, file name, page number, and text content).
# Each row  = one slide page
# Each column = source, file, page, text


# Hold the extracted text chunk 
slides_rows = []

# Loop through is slide pdf 
for file in slide_files:
    pdf = fitz.open(file.as_posix()) # open the PDF file, convert Path to string with as_posix()
    # Iterate over each page and extract text
    for page_num in range(len(pdf)): # iterate over each page
        page_text = pdf[page_num].get_text("text").strip() # extract text and add a newline for separation
        if not page_text: # skip empty pages
            continue

    # Apply chunking
    chunks = create_chunks(page_text, max_characters=700)

    for i, chunk in enumerate(chunks):
        slides_rows.append({"source": "slide", 
                            "file": file.name,
                            "page": page_num + 1,
                            "chunk": i + 1,
                            "text": chunk
        }) # store the extracted text with metadata

print(f"Extracted {len(slides_rows)} chunks from {len(slide_files)} slide PDFs")
            
# Convert to DataFrame and show sample
slides_df = pd.DataFrame(slides_rows)
slides_df.head()

# Save to data/processed
slides_df.to_csv("../data/processed/slides_chunks.csv", index=False)
slides_df.to_parquet("../data/processed/slides_chunks.parquet", index=False)

print ("Saved to data/processed/slides_chunks.csv and slides_chunks.parquet")

Extracted 40 chunks from 39 slide PDFs
Saved to data/processed/slides_chunks.csv and slides_chunks.parquet


In [27]:
import pandas as pd

df = pd.DataFrame(slides_rows)
print(df['page'].describe())      # sanity check page numbers
print(df['text'].str.len().describe())  # avg text length
df.head()


count     40.000000
mean      60.675000
std       23.432814
min       15.000000
25%       43.750000
50%       62.000000
75%       77.500000
max      104.000000
Name: page, dtype: float64
count     40.000000
mean     213.975000
std      121.045126
min        0.000000
25%      162.500000
50%      213.500000
75%      238.250000
max      700.000000
Name: text, dtype: float64


Unnamed: 0,source,file,page,chunk,text
0,slide,(Re)-Introduction to Data Science & Control Fl...,102,1,
1,slide,AB Testing.pdf,49,1,Thursday\nOn Thursday we will be meeting for o...
2,slide,Advanced Abstraction.pptx.pdf,66,1,Tuesday\nTomorrow will entail:\n●\nFurther exp...
3,slide,Advanced Control Flow.pptx.pdf,93,1,
4,slide,Advanced Data Processing.pptx.pdf,61,1,Wednesday\nWednesday will entail:\n●\nA review...


**Labs**

In [31]:
import json

# Extract text from code and markdown cells in lab notebooks

labs_rows = [] # hold the extracted text chunks

for file in lab_files:
    try:
        with open(file, "r", encoding="utf-8") as f: # r allows reading the file
            content = f.read() # read the entire file content
            if not content.strip(): # skip empty files
                print(f"Skipping empty file: {file}")
                continue
            notebook = json.loads(content)  # load the notebook as a dict
    except Exception as e:
        print(f"Skipping file {file} due to error: {e}")
        continue

    cells = notebook.get('cells', [])  # access the cells section
    for cell in cells:  # loop through each cell
        cell_type = cell.get('cell_type')  # get the type of cell (code or markdown)
        if cell_type in ['code', 'markdown']:  # only want code and markdown cells
            source = ''.join(cell.get('source', [])).strip()  # join the list of strings into one string and strip whitespace
            if not source:  # skip empty cells
                continue

            # Apply chunking
            chunks = create_chunks(source, max_characters=700)

            for i, chunk in enumerate(chunks):
                labs_rows.append({
                    "source": "lab",
                    "file": file.name,
                    "cell_type": cell_type,
                    "chunk": i + 1,
                    "text": chunk
                })  # store the extracted text with metadata

print("Extracted", len(labs_rows), "chunks from labs")

# Convert to DataFrame and show sample
labs_df = pd.DataFrame(labs_rows)
labs_df.to_csv("../data/processed/labs_chunks.csv", index=False)
labs_df.to_parquet("../data/processed/labs_chunks.parquet", index=False)

labs_df.head()
print("Saved labs to ../data/processed/labs_chunks.csv and labs_chunks.parquet")

Skipping empty file: C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\raw\labs\phase2\week09\transformers.ipynb
Extracted 1410 chunks from labs
Saved labs to ../data/processed/labs_chunks.csv and labs_chunks.parquet


**Syllabus**

In [None]:
# path to syllabus
pdf_path = Path("../data/raw/syllabus/IF Data Science 2025 Syllabus.pdf")

with pdfplumber.open(pdf_path.as_posix()) as pdf:  # open the PDF file, convert Path to string with as_posix()
    tables = []
    for page in pdf.pages:  # loop through each page
        page_tables = page.extract_tables()  # extract tables from the page
        tables.extend(page_tables)
    print("Found", len(tables), "tables in total")


Found 15 tables in total


In [46]:
def normalize_cell_text(cell_text):
    """Normalize cell text by removing extra whitespace and newlines, clean text from a pdf table cell

    Args:
    cell(str or none): The raw data content from a pdf table cell. This could be a string, dict, or None if the cell is empty.

    Returns:
    str: The cleaned and normalized text from the cell. If the input is None or not a string, returns an empty string.
    Ex.:
    normalize_cell_text("  This is   a sample \n text.  ") -> "This is a sample text."
    normalize_cell_text(None) -> ""
    """
    if isinstance(cell_text, str):
        return cell_text.strip()
    else:
        return ""

# Open the syll pdf
pdf_path = Path("../data/raw/syllabus/IF Data Science 2025 Syllabus.pdf")

table_headers = []
with pdfplumber.open(pdf_path.as_posix()) as pdf:  # open the PDF file, convert Path to string with as_posix()
    for page_num, page in enumerate(pdf.pages):  # loop through each page
        page_tables = page.extract_tables()  # extract tables from the page
        for table in page_tables:  # loop through each table
            if not table:  # skip empty tables
                continue
            # Look at the first few rows of each table to see if it has headers
            headers = [normalize_cell_text(cell) for cell in table]  # first row as headers
            if not headers: # skip empty rows 
                continue

            # Convert everything to lowercase
            lowercase_headers = [header.lower() for header in headers]

            # If row contains words like pre-class, pre-class content
            if any(re.search(r' pre-class' in cell or "pre-class content", cell) for cell in lowercase_headers):
                table_headers.append({
                    "page": page_num + 1,
                    "headers": headers
                })
        print(f"Found potential headers on page {page_num + 1}: {headers}")

Found potential headers on page 1: ['', '']
Found potential headers on page 2: ['', '', '', '']
Found potential headers on page 3: ['', '', '']
Found potential headers on page 4: ['', '', '', '']
Found potential headers on page 5: ['', '', '']
Found potential headers on page 6: ['', '', '', '']
Found potential headers on page 7: ['', '', '', '', '']
Found potential headers on page 8: ['', '', '']
Found potential headers on page 9: ['', '', '', '']
Found potential headers on page 10: ['', '', '']
Found potential headers on page 11: ['', '', '']
Found potential headers on page 12: ['', '', '', '', '', '', '', '', '', '', '', '']
Found potential headers on page 13: ['', '', '', '', '', '', '', '', '', '', '', '']
