In [50]:
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "psychology_book.pdf"
docs = PyPDFLoader(pdf_path).load()   # list of Documents, one per page
print(len(docs), docs[0].page_content[:300])

753 


In [51]:
docs[19].page_content


'2002). Nash was the subject of the 2001 movie A Beautiful Mind. Why did these people have these\nexperiences? How does the human brain work? And what is the connection between the brain’s internal\nprocesses and people’s external behaviors? This textbook will introduce you to various ways that the field of\npsychology has explored these questions.\n1.1 What Is Psychology?\nLEARNING OBJECTIVES\nBy the end of this section, you will be able to:\n• Define psychology\n• Understand the merits of an education in psychology\nWhat is creativity? What are prejudice and discrimination? What is consciousness? The field of psychology\nexplores questions like these. Psychology refers to the scientific study of the mind and behavior. Psychologists\nuse the scientific method to acquire knowledge. To apply the scientific method, a researcher with a question\nabout how or why something happens will propose a tentative explanation, called a hypothesis, to explain the\nphenomenon. A hypothesis should fit

In [52]:
import spacy
import re

nlp = spacy.load("en_core_web_sm")

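for doc in docs:
    text = doc.page_content
    # Join soft-wrapped lines: a newline becomes a space unless it follows
    # sentence-ending punctuation (. ! ?) or precedes an uppercase letter,
    # which usually marks a heading or a new paragraph.
    cleaned = re.sub(r'(?<![.!?])\n(?![A-Z])', ' ', text)
    doc.page_content = cleaned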

In [53]:
docs[19].page_content

'2002). Nash was the subject of the 2001 movie A Beautiful Mind. Why did these people have these experiences? How does the human brain work? And what is the connection between the brain’s internal processes and people’s external behaviors? This textbook will introduce you to various ways that the field of psychology has explored these questions.\n1.1 What Is Psychology?\nLEARNING OBJECTIVES\nBy the end of this section, you will be able to: • Define psychology • Understand the merits of an education in psychology\nWhat is creativity? What are prejudice and discrimination? What is consciousness? The field of psychology explores questions like these. Psychology refers to the scientific study of the mind and behavior. Psychologists use the scientific method to acquire knowledge. To apply the scientific method, a researcher with a question about how or why something happens will propose a tentative explanation, called a hypothesis, to explain the phenomenon. A hypothesis should fit into the
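The effect of the cleaning regex is easier to see on a few short lines. The sketch below reuses the same pattern on a made-up sample; `unwrap_lines` and `sample` are illustrative names, not part of the notebook. It also shows one limitation: a wrapped line that continues with a capitalized word keeps its newline.

```python
import re

# Same pattern as in the cleaning cell; the helper name is hypothetical.
SOFT_WRAP = re.compile(r'(?<![.!?])\n(?![A-Z])')

def unwrap_lines(text: str) -> str:
    """Replace soft line wraps with spaces; keep newlines that look like real breaks."""
    return SOFT_WRAP.sub(' ', text)

sample = (
    "the field of\n"                               # wrap before lowercase: joined
    "psychology has explored these questions.\n"   # newline after '.': kept
    "1.1 What Is Psychology?\n"                    # newline after '?': kept
    "LEARNING OBJECTIVES\n"                        # newline before uppercase: kept
    "By the end"
)
print(unwrap_lines(sample))

# Limitation: a wrap that happens to precede a capitalized word is not joined.
print(repr(unwrap_lines("Abraham\nMaslow")))
```

Bullet lines are merged into the preceding line because `•` is not an uppercase letter, which matches the cleaned output shown above.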

In [54]:
page_lookup = {}

for d in docs:
    page_label = d.metadata.get("page_label")
    if page_label is not None:
        page_lookup[int(page_label)] = d


## Retrieving only the relevant information from the textbook, which includes:
### Keeping and adding:
1. All numbered structured sections (e.g., 6.1, 6.2, 14.1, 14.2, etc.)
2. Adding chapter numbers to the metadata
### Excluding:
1. Introduction
2. Key Terms
3. Summary
4. Review Questions
5. Critical Thinking Questions
6. Personal Application Questions
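The keep/exclude rules above can be expressed as a small classifier. This is only a sketch: `classify_heading`, `NUMBERED_SECTION`, and `EXCLUDED_HEADINGS` are hypothetical names, and the notebook itself relies on a hand-built page table rather than on automatic heading detection.

```python
import re

# A numbered section heading looks like "6.1 What Is Learning?".
NUMBERED_SECTION = re.compile(r"^(\d+)\.(\d+)\s+\S")

# End-of-chapter parts listed under "Excluding" above.
EXCLUDED_HEADINGS = {
    "introduction",
    "key terms",
    "summary",
    "review questions",
    "critical thinking questions",
    "personal application questions",
}

def classify_heading(line: str):
    """Return ("keep", chapter_number), ("exclude", None) or ("other", None)."""
    stripped = line.strip()
    match = NUMBERED_SECTION.match(stripped)
    if match:
        # The chapter number is what later goes into the chunk metadata.
        return ("keep", int(match.group(1)))
    if stripped.lower() in EXCLUDED_HEADINGS:
        return ("exclude", None)
    return ("other", None)

for heading in ["6.1 What Is Learning?", "Key Terms", "14.2 Stressors", "Figure 1.7"]:
    print(heading, "->", classify_heading(heading))
```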



In [55]:
section_pages = {

    "1.1": 20,
    "1.2": 21,
    "1.3": 30,
    "1.4": "38-41",

    "2.1": 48,
    "2.2": 53,
    "2.3": 60,
    "2.4": "71-74",

    "3.1": 84,
    "3.2": 90,
    "3.3": 96,
    "3.4": 98,
    "3.5": "109-111",

    "4.1": 122,
    "4.2": 126,
    "4.3": 129,
    "4.4": 133,
    "4.5": 138,
    "4.6": "146-148",

    "5.1": 158,
    "5.2": 161,
    "5.3": 165,
    "5.4": 173,
    "5.5": 176,
    "5.6": "180-183",

    "6.1": 194,
    "6.2": 195,
    "6.3": 204,
    "6.4": "215-216",

    "7.1": 226,
    "7.2": 230,
    "7.3": 234,
    "7.4": 240,
    "7.5": 243,
    "7.6": "249-252",

    "8.1": 260,
    "8.2": 267,
    "8.3": 271,
    "8.4": "281-284",

    "9.1": 292,
    "9.2": 296,
    "9.3": 304,
    "9.4": "325-326",

    "10.1": 334,
    "10.2": 340,
    "10.3": 346,
    "10.4": "354-364",

    "11.1": 372,
    "11.2": 374,
    "11.3": 380,
    "11.4": 385,
    "11.5": 389,
    "11.6": 390,
    "11.7": 391,
    "11.8": 396,
    "11.9": "398-402",

    "12.1": 412,
    "12.2": 418,
    "12.3": 421,
    "12.4": 427,
    "12.5": 434,
    "12.6": 441,
    "12.7": "444-448",

    "13.1": 460,
    "13.2": 468,
    "13.3": 479,
    "13.4": "489-491",

    "14.1": 498,
    "14.2": 508,
    "14.3": 514,
    "14.4": 526,
    "14.5": "533-540",

    "15.1": 550,
    "15.2": 554,
    "15.3": 557,
    "15.4": 560,
    "15.5": 566,
    "15.6": 570,
    "15.7": 572,
    "15.8": 582,
    "15.9": 586,
    "15.10": 588,
    "15.11": "594-600",

    "16.1": 612,
    "16.2": 617,
    "16.3": 629,
    "16.4": 633,
    "16.5": "635-638",
}
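Because this table is typed by hand, a quick consistency check can catch a mistyped page number before extraction. The sketch below is an assumption about what such a check might look like (`start_page` and `check_page_order` are hypothetical helpers): it sorts section ids numerically and reports neighbours whose start pages go backwards.

```python
from typing import Dict, List, Tuple, Union

def start_page(value: Union[int, str]) -> int:
    """Return the first page of an int entry or a 'start-end' string entry."""
    return int(str(value).split("-")[0].strip())

def check_page_order(pages: Dict[str, Union[int, str]]) -> List[Tuple[str, str]]:
    """Return (previous_id, current_id) pairs whose start pages go backwards."""
    def sort_key(sec_id: str) -> Tuple[int, int]:
        chapter, sub = sec_id.split(".")
        return int(chapter), int(sub)   # numeric order, so "1.10" follows "1.3"

    ordered = sorted(pages, key=sort_key)
    problems = []
    for prev, cur in zip(ordered, ordered[1:]):
        if start_page(pages[cur]) < start_page(pages[prev]):
            problems.append((prev, cur))
    return problems

# Toy table with one deliberate mistake: 2.1 starts before 1.10.
toy = {"1.1": 20, "1.2": 21, "1.10": "38-41", "1.3": 30, "2.1": 25}
print(check_page_order(toy))
```

An empty list means the start pages are non-decreasing in section order.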




In [56]:
section_headings = {

    "1.1": "1.1 What Is Psychology?",
    "1.2": "1.2 History of Psychology",
    "1.3": "1.3 Contemporary Psychology",
    "1.4": "1.4 Careers in Psychology",

    "2.1": "2.1 Why Is Research Important?",
    "2.2": "2.2 Approaches to Research",
    "2.3": "2.3 Analyzing Findings",
    "2.4": "2.4 Ethics",

    "3.1": "3.1 Human Genetics",
    "3.2": "3.2 Cells of the Nervous System",
    "3.3": "3.3 Parts of the Nervous System",
    "3.4": "3.4 The Brain and Spinal Cord",
    "3.5": "3.5 The Endocrine System",

    "4.1": "4.1 What Is Consciousness?",
    "4.2": "4.2 Sleep and Why We Sleep",
    "4.3": "4.3 Stages of Sleep",
    "4.4": "4.4 Sleep Problems and Disorders",
    "4.5": "4.5 Substance Use and Abuse",
    "4.6": "4.6 Other States of Consciousness",

    "5.1": "5.1 Sensation versus Perception",
    "5.2": "5.2 Waves and Wavelengths",
    "5.3": "5.3 Vision",
    "5.4": "5.4 Hearing",
    "5.5": "5.5 The Other Senses",
    "5.6": "5.6 Gestalt Principles of Perception",

    "6.1": "6.1 What Is Learning?",
    "6.2": "6.2 Classical Conditioning",
    "6.3": "6.3 Operant Conditioning",
    "6.4": "6.4 Observational Learning (Modeling)",

    "7.1": "7.1 What Is Cognition?",
    "7.2": "7.2 Language",
    "7.3": "7.3 Problem Solving",
    "7.4": "7.4 What Are Intelligence and Creativity?",
    "7.5": "7.5 Measures of Intelligence",
    "7.6": "7.6 The Source of Intelligence",

    "8.1": "8.1 How Memory Functions",
    "8.2": "8.2 Parts of the Brain Involved with Memory",
    "8.3": "8.3 Problems with Memory",
    "8.4": "8.4 Ways to Enhance Memory",

    "9.1": "9.1 What Is Lifespan Development?",
    "9.2": "9.2 Lifespan Theories",
    "9.3": "9.3 Stages of Development",
    "9.4": "9.4 Death and Dying",

    "10.1": "10.1 Motivation",
    "10.2": "10.2 Hunger and Eating",
    "10.3": "10.3 Sexual Behavior",
    "10.4": "10.4 Emotion",

    "11.1": "11.1 What Is Personality?",
    "11.2": "11.2 Freud and the Psychodynamic Perspective",
    "11.3": "11.3 Neo-Freudians: Adler, Erikson, Jung, and Horney",
    "11.4": "11.4 Learning Approaches",
    "11.5": "11.5 Humanistic Approaches",
    "11.6": "11.6 Biological Approaches",
    "11.7": "11.7 Trait Theorists",
    "11.8": "11.8 Cultural Understandings of Personality",
    "11.9": "11.9 Personality Assessment",

    "12.1": "12.1 What Is Social Psychology?",
    "12.2": "12.2 Self-presentation",
    "12.3": "12.3 Attitudes and Persuasion",
    "12.4": "12.4 Conformity, Compliance, and Obedience",
    "12.5": "12.5 Prejudice and Discrimination",
    "12.6": "12.6 Aggression",
    "12.7": "12.7 Prosocial Behavior",

    "13.1": "13.1 What Is Industrial and Organizational Psychology?",
    "13.2": "13.2 Industrial Psychology: Selecting and Evaluating Employees",
    "13.3": "13.3 Organizational Psychology: The Social Dimension of Work",
    "13.4": "13.4 Human Factors Psychology and Workplace Design",

    "14.1": "14.1 What Is Stress?",
    "14.2": "14.2 Stressors",
    "14.3": "14.3 Stress and Illness",
    "14.4": "14.4 Regulation of Stress",
    "14.5": "14.5 The Pursuit of Happiness",

    "15.1": "15.1 What Are Psychological Disorders?",
    "15.2": "15.2 Diagnosing and Classifying Psychological Disorders",
    "15.3": "15.3 Perspectives on Psychological Disorders",
    "15.4": "15.4 Anxiety Disorders",
    "15.5": "15.5 Obsessive-Compulsive and Related Disorders",
    "15.6": "15.6 Posttraumatic Stress Disorder",
    "15.7": "15.7 Mood and Related Disorders",
    "15.8": "15.8 Schizophrenia",
    "15.9": "15.9 Dissociative Disorders",
    "15.10": "15.10 Disorders in Childhood",
    "15.11": "15.11 Personality Disorders",

    "16.1": "16.1 Mental Health Treatment: Past and Present",
    "16.2": "16.2 Types of Treatment",
    "16.3": "16.3 Treatment Modalities",
    "16.4": "16.4 Substance-Related and Addictive Disorders: A Special Case",
    "16.5": "16.5 The Sociocultural Model and Therapy Utilization",
}


### PARENT LEVEL CHUNKING WHERE EVERY INDIVIDUAL CHUNK IS ONE OF THE ABOVE SECTIONS

In [57]:
from typing import Dict, List, Tuple, Optional, Union
from langchain_core.documents import Document

def page_label_int(d: Document) -> int:
    pl = d.metadata.get("page_label")
    if pl is None:
        return int(d.metadata.get("page", 0)) + 1
    return int(str(pl))

def normalize_for_search(s: str) -> str:
    # tolerate "1.1What" vs "1.1 What" etc.
    import re
    return re.sub(r"\s+", "", s).lower()

def find_heading_index(raw_text: str, heading: str) -> int:
    """
    Return character index in raw_text where the heading begins, or -1 if not found.
    Uses case- and whitespace-insensitive matching.
    """
    import re

    if not raw_text or not heading:
        return -1

    # Allow optional whitespace between every non-whitespace character of the
    # heading, so "1.1What" still matches "1.1 What". Searching raw_text
    # directly gives an exact index; mapping back from a normalized copy via
    # the first occurrence of the section id could land on an earlier "1.1".
    chars = [re.escape(c) for c in heading if not c.isspace()]
    if not chars:
        return -1

    # (?<!\d) keeps "1.1 ..." from matching inside "11.1 ...".
    pattern = r"(?<!\d)" + r"\s*".join(chars)
    match = re.search(pattern, raw_text, flags=re.IGNORECASE)
    return match.start() if match else -1

def section_sort_key(sec_id: str) -> Tuple[int, int]:
    ch, sub = sec_id.split(".")
    return int(ch), int(sub)

def parse_section_value(v: Union[int, str]) -> Tuple[int, Optional[int]]:
    """
    If v is int -> (start, None)
    If v is "start-end" -> (start, end)
    """
    if isinstance(v, int):
        return v, None
    v = str(v).strip()
    if "-" in v:
        a, b = v.split("-", 1)
        return int(a.strip()), int(b.strip())
    return int(v), None

def build_page_lookup(docs: List[Document]) -> Dict[int, Document]:
    lookup = {}
    for d in docs:
        lookup[page_label_int(d)] = d
    return lookup

def extract_section_text(
    page_lookup: Dict[int, Document],
    sec_heading: str,
    start_page: int,
    next_heading: Optional[str],
    next_start_page: Optional[int],
    hard_end_page: Optional[int] = None,
) -> Tuple[str, List[int]]:
    """
    Pure jump logic (no scanning intermediate pages for next heading):

    1) On start_page: find sec_heading and start from there.
    2) Add full pages start_page+1 .. (next_start_page-1) (or .. hard_end_page if set).
    3) On next_start_page: cut everything before next_heading and include that as last part.
       (If hard_end_page is set and ends before next_start_page, we stop at hard_end_page.)
    """

    if start_page not in page_lookup:
        return "", []

    # Determine stopping boundary
    if hard_end_page is not None:
        stop_page = hard_end_page
    elif next_start_page is not None:
        stop_page = next_start_page
    else:
        raise ValueError("Need either hard_end_page or next_start_page to know where to stop.")

    collected_parts: List[str] = []
    pages_used: List[int] = []

    # 1) Start page: find current heading
    raw0 = page_lookup[start_page].page_content
    idx0 = find_heading_index(raw0, sec_heading)
    if idx0 == -1:
        # If you are 100% sure structure is correct, you can raise instead of returning empty
        return "", []

    raw0 = raw0[idx0:]
    pages_used.append(start_page)

    # If next section starts on the same page, cut immediately
    if next_heading:
        nxt0 = find_heading_index(raw0, next_heading)
        if nxt0 != -1:
            collected_parts.append(raw0[:nxt0].rstrip())
            return "\n".join(collected_parts).strip(), pages_used

    collected_parts.append(raw0.rstrip())

    # 2) Middle pages (full pages, no heading searches)
    mid_end = stop_page - 1 if (next_start_page is not None and stop_page == next_start_page) else stop_page
    for p in range(start_page + 1, mid_end + 1):
        d = page_lookup.get(p)
        if not d:
            continue
        collected_parts.append(d.page_content.rstrip())
        pages_used.append(p)

    # 3) Boundary page: next_start_page (only if stop_page is next_start_page and we have next_heading)
    if next_start_page is not None and stop_page == next_start_page and next_heading:
        dN = page_lookup.get(next_start_page)
        if dN:
            rawN = dN.page_content
            pages_used.append(next_start_page)

            nxt_idx = find_heading_index(rawN, next_heading)
            if nxt_idx != -1:
                collected_parts.append(rawN[:nxt_idx].rstrip())
            else:
                # If not found, include nothing (strict), or include whole page (lenient).
                # Strict choice:
                pass

    return "\n".join(collected_parts).strip(), pages_used
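The jump logic is easiest to follow on a toy book. The sketch below is a simplified stand-in for `extract_section_text`, not a replacement: it works on plain strings keyed by page number, uses exact `str.find` instead of whitespace-insensitive matching, and omits `hard_end_page`. The name `slice_section` and the toy pages are made up for illustration.

```python
from typing import Dict, List

def slice_section(pages: Dict[int, str], heading: str, start: int,
                  next_heading: str, next_start: int) -> str:
    """Collect text from `heading` on page `start` up to `next_heading`."""
    first = pages[start]
    begin = first.find(heading)
    if begin == -1:
        return ""
    first = first[begin:]

    # Next section begins on the same page: cut right there.
    cut = first.find(next_heading, len(heading))
    if cut != -1:
        return first[:cut].strip()

    parts: List[str] = [first.strip()]
    # Whole pages strictly between the two start pages.
    for p in range(start + 1, next_start):
        if p in pages:
            parts.append(pages[p].strip())
    # Boundary page: keep only the text before the next heading.
    if next_start != start and next_start in pages:
        cut = pages[next_start].find(next_heading)
        if cut != -1:
            parts.append(pages[next_start][:cut].strip())
    return "\n".join(part for part in parts if part)

toy_pages = {
    20: "chapter intro\n1.1 What Is Psychology?\nbody A",
    21: "body B",
    22: "body C\n1.2 History of Psychology\nbody D",
}
print(slice_section(toy_pages, "1.1 What Is Psychology?", 20,
                    "1.2 History of Psychology", 22))
```

The text before the first heading ("chapter intro") and everything from the next heading onward ("body D") are left out, which is the behaviour the three steps in the docstring describe.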


### parent chunks

In [58]:
from langchain_core.documents import Document

def parse_start_page(v):
    # v can be int or "start-end"
    if isinstance(v, int):
        return v
    return int(str(v).split("-")[0].strip())

def parse_end_page(v):
    # returns end page if "start-end", else None
    if isinstance(v, str) and "-" in v:
        return int(v.split("-")[1].strip())
    return None

def section_sort_key(sec_id: str):
    ch, sub = sec_id.split(".")
    return (int(ch), int(sub))

def build_section_documents_jump(section_pages, section_headings, page_lookup):
    """
    Returns a list of Document objects (one per section), ordered by section id.
    section_docs[0] -> 1.1, section_docs[1] -> 1.2, ...
    Uses extract_section_text (pure jump logic).
    """

    sorted_sections = sorted(section_pages.keys(), key=section_sort_key)
    section_docs = []

    for i, sec_id in enumerate(sorted_sections):
        start_page = parse_start_page(section_pages[sec_id])
        hard_end_page = parse_end_page(section_pages[sec_id])  # only set for "range" sections

        # Next section boundary
        next_heading = None
        next_start_page = None
        if i + 1 < len(sorted_sections):
            next_sec = sorted_sections[i + 1]
            next_heading = section_headings[next_sec]
            next_start_page = parse_start_page(section_pages[next_sec])

        # Extract the section text using your jump-only extractor
        text, pages_used = extract_section_text(
            page_lookup=page_lookup,
            sec_heading=section_headings[sec_id],
            start_page=start_page,
            next_heading=next_heading,
            next_start_page=next_start_page,
            hard_end_page=hard_end_page,
        )

        chapter_id = int(sec_id.split(".")[0])

        section_docs.append(
            Document(
                page_content=text,
                metadata={
                    "chapter_id": chapter_id,
                    "section_id": sec_id,
                    "section_heading": section_headings[sec_id],
                    "start_page": start_page,
                    'source': 'psychology_book.pdf'
                },
            )
        )

    return section_docs

section_docs = build_section_documents_jump(section_pages, section_headings, page_lookup)

print(len(section_docs))
print(section_docs[0].metadata)          # should be 1.1
print(section_docs[1].metadata)          # should be 1.2
print(section_docs[0].page_content[:200])


88
{'chapter_id': 1, 'section_id': '1.1', 'section_heading': '1.1 What Is Psychology?', 'start_page': 20, 'source': 'psychology_book.pdf'}
{'chapter_id': 1, 'section_id': '1.2', 'section_heading': '1.2 History of Psychology', 'start_page': 21, 'source': 'psychology_book.pdf'}
1.1 What Is Psychology?
LEARNING OBJECTIVES
By the end of this section, you will be able to: • Define psychology • Understand the merits of an education in psychology
What is creativity? What are prej


In [59]:
section_docs[7].page_content

'2.4 Ethics\nLEARNING OBJECTIVES\nBy the end of this section, you will be able to: • Discuss how research involving human subjects is regulated • Summarize the processes of informed consent and debriefing • Explain how research involving animal subjects is regulated\nToday, scientists agree that good research is ethical in nature and is guided by a basic respect for human dignity and safety. However, as you will read in the feature box, this has not always been the case. Modern researchers must demonstrate that the research they perform is ethically sound. This section presents how ethical considerations affect the design and implementation of research conducted today.\nResearch Involving Human Participants\nAny experiment involving the participation of human subjects is governed by extensive, strict guidelines designed to ensure that the experiment does not result in harm. Any research institution that receives federal support for research involving human participants must have access

### section_docs NOW CONTAINS A LIST OF CHUNKS, WHERE EVERY CHUNK HOLDS THE TEXT OF ONE SECTION
### NOW THE FINAL-LEVEL CHUNKING STARTS

1. First preference: chunk at \n.(UPPERCASE CHARACTER)
2. Remove every lone \n that is not followed by a dot
3. Second preference: chunk at .(UPPERCASE CHARACTER)
4. Last resort: cut at my cut-off length
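The ".(UPPERCASE CHARACTER)" preference corresponds to a sentence-boundary regex of the form `(?<=[.!?]) (?=[A-Z])`. A minimal sketch of how that boundary behaves, with a hypothetical helper name `split_sentences` and made-up sample text:

```python
import re
from typing import List

# Split at a space that follows . ! or ? and precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?]) (?=[A-Z])")

def split_sentences(block: str) -> List[str]:
    """Split one text block into sentence-like pieces, dropping empty ones."""
    return [piece.strip() for piece in SENTENCE_BOUNDARY.split(block) if piece.strip()]

text = "What is creativity? What is consciousness? The field explores e.g. these questions."
for sentence in split_sentences(text):
    print(repr(sentence))

# Limitation: abbreviations followed by a capitalized word are split too.
print(split_sentences("Dr. Smith agreed."))
```

The abbreviation case is one reason to prefer a real sentence segmenter for the main pass and keep the regex as a fallback for oversize sentences.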

In [60]:
import spacy
import re
def make_nlp() -> "spacy.language.Language":
    """
    Sentence splitter backed by the en_core_web_sm model (must be installed).
    For a lighter setup that needs no model download, use the alternative
    shown in the comments below.
    """
    nlp = spacy.load('en_core_web_sm')
    # Lightweight alternative:
    # nlp = spacy.blank("en")
    # if "sentencizer" not in nlp.pipe_names:
    #     nlp.add_pipe("sentencizer")
    return nlp

def split_keep_delims_regex(text: str, pattern: str) -> List[str]:
    """
    Split by a regex boundary pattern *between* characters, keeping everything.
    Example boundary: r'(?<=[.!?]) (?=[A-Z])'
    """
    parts = re.split(pattern, text)
    return [p for p in (p.strip() for p in parts) if p]

In [61]:
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Union, Dict, Any, Optional, Tuple
import re

try:
    import spacy
except ImportError as e:
    raise ImportError("Install spaCy first: pip install spacy") from e


# -----------------------------
# Helpers
# -----------------------------

def make_nlp() -> "spacy.language.Language":
    """
    Lightweight sentence splitter (no model download needed).
    If you *already* have en_core_web_sm installed, you can swap to:
        nlp = spacy.load("en_core_web_sm", disable=["ner"])
    Keep the parser enabled in that case: without a sentencizer it is the
    component that sets the sentence boundaries read by doc.sents.
    """
    nlp = spacy.blank("en")
    if "sentencizer" not in nlp.pipe_names:
        nlp.add_pipe("sentencizer")
    return nlp


def split_keep_delims_regex(text: str, pattern: str) -> List[str]:
    """
    Split by a regex boundary pattern *between* characters, keeping everything.
    Example boundary: r'(?<=[.!?]) (?=[A-Z])'
    """
    parts = re.split(pattern, text)
    return [p for p in (p.strip() for p in parts) if p]


def recursive_split(text: str, seps: List[str], max_len: int) -> List[str]:
    """
    Recursively split text until each piece <= max_len.
    seps can include:
      - '\n' (literal)
      - '; ' or ', ' (literal)
      - regex boundaries like r'(?<=[.!?]) (?=[A-Z])'
    """
    t = text.strip()
    if not t:
        return []
    if len(t) <= max_len:
        return [t]

    if not seps:
        # Hard cut fallback
        return [t[i : i + max_len].strip() for i in range(0, len(t), max_len) if t[i : i + max_len].strip()]

    sep = seps[0]

    # Choose splitting method
    if sep.startswith("(?") or ("<=" in sep) or ("(?=" in sep) or ("(?" in sep):
        pieces = split_keep_delims_regex(t, sep)
    elif sep == "\n":
        pieces = [p.strip() for p in t.split("\n") if p.strip()]
    else:
        pieces = [p.strip() for p in t.split(sep) if p.strip()]

    # If this separator didn't actually split (or produced 1 giant piece), move to next
    if len(pieces) <= 1:
        return recursive_split(t, seps[1:], max_len)

    out: List[str] = []
    for p in pieces:
        if len(p) <= max_len:
            out.append(p)
        else:
            out.extend(recursive_split(p, seps[1:], max_len))
    return out


def sentences_from_blocks(nlp, section_text: str) -> List[str]:
    """
    Your assumption: section_text already has accurate blocks separated by '\n'.
    We keep blocks isolated, and run sentence segmentation inside each block only.
    """
    blocks = [b.strip() for b in section_text.split("\n") if b.strip()]
    sents: List[str] = []
    for b in blocks:
        doc = nlp(b)
        for s in doc.sents:
            st = s.text.strip()
            if st:
                sents.append(st)
    return sents


def build_overlap_units(
    last_units: List[str],
    overlap_sentences: int,
    max_len: int,
    overlap_fallback_seps: List[str],
) -> List[str]:
    """
    Overlap is primarily last N sentences.
    If one of those sentences is too long, split it using overlap_fallback_seps and keep
    the *last* piece(s) so overlap isn't gigantic.
    """
    if overlap_sentences <= 0 or not last_units:
        return []

    take = last_units[-overlap_sentences:]
    fixed: List[str] = []

    for u in take:
        if len(u) <= max_len:
            fixed.append(u)
        else:
            # Sentence is huge; split and keep the last fragment as overlap anchor
            frags = recursive_split(u, overlap_fallback_seps, max_len)
            if frags:
                fixed.append(frags[-1])
    return fixed


def pack_units_into_chunks(
    units: List[str],
    max_len: int,
    seps_for_oversize_unit: List[str],
    overlap_sentences: int = 2,
    overlap_fallback_seps: Optional[List[str]] = None,
) -> List[str]:
    """
    Pack sentence-units into chunks <= max_len with N-sentence overlap.
    If a single sentence exceeds max_len, split it using seps_for_oversize_unit.
    """
    overlap_fallback_seps = overlap_fallback_seps or seps_for_oversize_unit

    # Expand oversize units (single sentence too large)
    expanded: List[str] = []
    for u in units:
        u = u.strip()
        if not u:
            continue
        if len(u) <= max_len:
            expanded.append(u)
        else:
            # Split the single huge sentence into smaller pieces
            expanded.extend(recursive_split(u, seps_for_oversize_unit, max_len))

    chunks: List[str] = []
    cur: List[str] = []
    cur_len = 0

    for u in expanded:
        u_len = len(u) + (1 if cur else 0)  # +1 for joining space
        if cur and (cur_len + u_len) > max_len:
            # finalize current chunk
            chunk_text = " ".join(cur).strip()
            if chunk_text:
                chunks.append(chunk_text)

            # prepare overlap for next chunk
            overlap_units = build_overlap_units(
                last_units=cur,
                overlap_sentences=overlap_sentences,
                max_len=max_len,
                overlap_fallback_seps=overlap_fallback_seps,
            )
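            cur = overlap_units[:]  # start next with overlap
            # Drop the oldest overlap units until the incoming unit fits,
            # so the new chunk cannot exceed max_len.
            while cur and len(" ".join(cur)) + 1 + len(u) > max_len:
                cur.pop(0)
            cur_len = len(" ".join(cur)) if cur else 0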

        # add unit
        if not cur:
            cur = [u]
            cur_len = len(u)
        else:
            cur.append(u)
            cur_len += len(u) + 1  # + space

    # flush
    final = " ".join(cur).strip()
    if final:
        chunks.append(final)

    return chunks


# -----------------------------
# Main function
# -----------------------------

def chunk_section_docs(
    section_docs: List[Union[str, Any]],
    *,
    max_size: int,
    recursive_seps: Optional[List[str]] = None,
    overlap_sentences: int = 2,
) -> List[Dict[str, Any]]:
    """
    section_docs[i] is one section (1.1, 1.2, ...). Each can be:
      - a plain string
      - a LangChain Document-like object with .page_content and .metadata

    Returns: list of dicts with chunk text + section index + metadata.
    """
    # Separator order, coarse to fine. NOTE: recursive_seps is accepted for API
    # compatibility but is not used below; the splitters actually applied are
    # overlap_splitters and oversize_unit_splitters.
    recursive_seps = recursive_seps or ["\n", r"(?<=[.!?]) (?=[A-Z])", "; ", ", "]

    # Splitters for an *oversize sentence* or an *oversize overlap sentence*
    overlap_splitters = [r"(?<=[.!?]) (?=[A-Z])", "; ", ", "]
    oversize_unit_splitters = overlap_splitters[:]  # same idea

    nlp = make_nlp()

    all_chunks: List[Dict[str, Any]] = []

    for sec_idx, sec in enumerate(section_docs):
        if hasattr(sec, "page_content"):
            text = sec.page_content or ""
            meta = dict(getattr(sec, "metadata", {}) or {})
        else:
            text = str(sec or "")
            meta = {}

        # 1) respect your "accurate blocks separated by \n"
        sents = sentences_from_blocks(nlp, text)

        # 2) pack into max_size with sentence overlap; only split sentences if a sentence is too big
        chunks = pack_units_into_chunks(
            units=sents,
            max_len=max_size,
            seps_for_oversize_unit=oversize_unit_splitters,
            overlap_sentences=overlap_sentences,
            overlap_fallback_seps=overlap_splitters,
        )

        # 3) ensure we never cross section boundaries (we're looping per section)
        for j, ch in enumerate(chunks):
            all_chunks.append(
                {
                    "section_index": sec_idx,         # 0 -> 1.1, 1 -> 1.2, ...
                    "chunk_index_in_section": j,
                    "text": ch,
                    "metadata": meta,
                }
            )

    return all_chunks


# -----------------------------
# Example usage
# -----------------------------
if __name__ == "__main__":
    # Suppose section_docs[0] is section 1.1 text, section_docs[1] is 1.2 text, etc.
    '''section_docs = [
        "LEARNING OBJECTIVES\nBy the end of this section, you will be able to...\nThis is a long paragraph that might wrap.",
        "Another section block.\nSecond block in the same section, still separated by newline.",
    ]'''

    results = chunk_section_docs(
        section_docs,
        max_size=1000,  # chars (you can raise/lower)
        recursive_seps=["\n", r"(?<=[.!?]) (?=[A-Z])", "; ", ", "],
        overlap_sentences=2,
    )

    for r in results[:3]:
        print("SECTION", r["section_index"], "CHUNK", r["chunk_index_in_section"])
        print(r["text"])
        print("---")

SECTION 0 CHUNK 0
1.1 What Is Psychology? LEARNING OBJECTIVES By the end of this section, you will be able to: • Define psychology • Understand the merits of an education in psychology What is creativity? What are prejudice and discrimination? What is consciousness? The field of psychology explores questions like these. Psychology refers to the scientific study of the mind and behavior. Psychologists use the scientific method to acquire knowledge. To apply the scientific method, a researcher with a question about how or why something happens will propose a tentative explanation, called a hypothesis, to explain the phenomenon. A hypothesis should fit into the context of a scientific theory, which is a broad explanation or group of explanations for some aspect of the natural world that is consistently supported by evidence over time. A theory is the best understanding we have of that part of the natural world.
---
SECTION 0 CHUNK 1
A hypothesis should fit into the context of a scientific

In [62]:
results[4]["metadata"]

{'chapter_id': 1,
 'section_id': '1.1',
 'section_heading': '1.1 What Is Psychology?',
 'start_page': 20,
 'source': 'psychology_book.pdf'}

In [63]:
len(results)

2212
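Beyond counting the chunks, it can be worth auditing their lengths against the `max_size` that was requested. The helper below is a sketch with a hypothetical name (`oversize_chunks`); it only assumes the result dictionaries have the `text`, `section_index`, and `chunk_index_in_section` keys produced above, and it is demonstrated on toy data rather than on the real `results`.

```python
from typing import Any, Dict, List

def oversize_chunks(results: List[Dict[str, Any]], max_size: int) -> List[Dict[str, int]]:
    """Report every chunk whose text is longer than max_size characters."""
    report = []
    for r in results:
        length = len(r["text"])
        if length > max_size:
            report.append(
                {
                    "section_index": r["section_index"],
                    "chunk_index_in_section": r["chunk_index_in_section"],
                    "length": length,
                }
            )
    return report

toy_results = [
    {"section_index": 0, "chunk_index_in_section": 0, "text": "a" * 10},
    {"section_index": 0, "chunk_index_in_section": 1, "text": "b" * 25},
]
print(oversize_chunks(toy_results, max_size=20))
```

On the real data the call would be `oversize_chunks(results, max_size=1000)`; an empty list means every chunk respects the limit.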

In [49]:
metadatas = [
    {
        "section_index": r["section_index"],
        "chunk_index_in_section": r["chunk_index_in_section"],
        **(r.get("metadata") or {}),
    }
    for r in results
]
type(metadatas[0]['section_index'])

int

### chunks to embedding model and to chromadb

In [64]:
# pip install chromadb openai   (or: pip install chromadb sentence-transformers)

import chromadb
from chromadb.config import Settings

# --------------------------------------------
# 1) Prepare documents + metadata + ids
# --------------------------------------------
docs = [r["text"] for r in results]

metadatas = [
    {
        "section_index": r["section_index"],
        "chunk_index_in_section": r["chunk_index_in_section"],
        **(r.get("metadata") or {}),
    }
    for r in results
]

# stable ids (important for dedup / re-runs)
ids = [
    f"sec{r['section_index']}_chunk{r['chunk_index_in_section']}"
    for r in results
]

# --------------------------------------------
# Option A: Let Chroma embed for you (SentenceTransformers)
# --------------------------------------------
from chromadb.utils import embedding_functions

embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    name="psych_book_sections",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"},
)

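# upsert in batches to avoid memory spikes; unlike add(), upsert() also
# updates records whose ids already exist, so re-running this cell is safe
BATCH = 256
for i in range(0, len(docs), BATCH):
    collection.upsert(
        documents=docs[i:i+BATCH],
        metadatas=metadatas[i:i+BATCH],
        ids=ids[i:i+BATCH],
    )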

print("Total stored:", collection.count())

Total stored: 2212
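Two small checks make the ingest step easier to trust: the ids must be unique, and the batching must cover every item exactly once. The sketch below uses only the standard library; `duplicate_ids` and `batch_bounds` are hypothetical helper names, and the numbers mirror the 2212 chunks and batch size of 256 used above.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Tuple

def duplicate_ids(ids: Sequence[str]) -> List[str]:
    """Return the ids that occur more than once, sorted."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)

def batch_bounds(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) slice bounds that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

print(duplicate_ids(["sec0_chunk0", "sec0_chunk1", "sec0_chunk0"]))

bounds = list(batch_bounds(2212, 256))
print(len(bounds), bounds[0], bounds[-1])
```

With the ids built from `section_index` and `chunk_index_in_section`, `duplicate_ids(ids)` should come back empty, since each pair is unique by construction.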


### retrieving similar chunks

In [77]:
# pip install chromadb sentence-transformers openai
# (If you only want retrieval and not LLM answering, you can skip openai.)

import chromadb
from chromadb.utils import embedding_functions

# --------------------------------------------
# 0) Load the same persistent collection
#    (must use the SAME embedding model + space used during ingest)
# --------------------------------------------
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

client = chromadb.PersistentClient(path="./chroma_store")

collection = client.get_collection(
    name="psych_book_sections",
    embedding_function=embedding_fn,     # important so query_texts embeds consistently
)

print("Total stored:", collection.count())

# --------------------------------------------
# 1) Retrieval function (top-k chunks)
# --------------------------------------------
def retrieve(query: str, k: int = 5, where: dict | None = None):
    """
    where: optional metadata filter, e.g. {"section_index": 12}
    """
    res = collection.query(
        query_texts=[query],
        n_results=k,
        where=where,  # None means no filter
        include=["documents", "metadatas", "distances"],
    )
    return res

# Example retrieval
q = "What is humanity?"
res = retrieve(q, k=5)

print("\nTop results:")
for rank, (doc, meta, dist, _id) in enumerate(
    zip(res["documents"][0], res["metadatas"][0], res["distances"][0], res["ids"][0]),
    start=1
):
    print(f"\n#{rank}  distance={dist:.4f}")

    # Prefer book-meaningful metadata fields
    print(
        "meta:",
        "chapter_id=", meta.get("chapter_id"),
        "section_id=", meta.get("section_id"),
        "heading=", meta.get("section_heading"),
        "start_page=", meta.get("start_page"),
    )

    print(doc[:400], "..." if len(doc) > 400 else "")

Total stored: 2212

Top results:

#1  distance=0.5297
meta: chapter_id= 1 section_id= 1.2 heading= 1.2 History of Psychology start_page= 21
Humanism is a perspective within psychology that emphasizes the potential for good that is innate to all humans. Two of the most well-known proponents of humanistic psychology are Abraham Maslow and Carl Rogers (O’Hara, n.d.). Abraham Maslow (1908–1970) was an American psychologist who is best known for proposing a hierarchy of human needs in motivating behavior (Figure 1.7). Although this concept ...

#2  distance=0.5630
meta: chapter_id= 1 section_id= 1.2 heading= 1.2 History of Psychology start_page= 21
During the early 20th century, American psychology was dominated by behaviorism and psychoanalysis. However, some psychologists were uncomfortable with what they viewed as limited perspectives being so influential to the field. They objected to the pessimism and determinism (all actions driven by the unconscious) of Freud. They also disliked the 
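The distances printed above are easier to read as similarities. Assuming the collection uses cosine space, as configured at ingest, the cosine distance is 1 minus the cosine similarity, so a distance of 0.5297 corresponds to a similarity of about 0.47. The sketch below flattens a query result into a list of hits and drops weak matches; `flatten_hits` and `fake_res` are illustrative, with `fake_res` shaped like the dictionary that `collection.query` returns.

```python
from typing import Any, Dict, List

def flatten_hits(res: Dict[str, Any], max_distance: float = 1.0) -> List[Dict[str, Any]]:
    """Turn a single-query result into a flat list, dropping weak matches."""
    hits = []
    rows = zip(res["documents"][0], res["metadatas"][0], res["distances"][0])
    for doc, meta, dist in rows:
        if dist > max_distance:
            continue
        hits.append(
            {
                "similarity": round(1.0 - dist, 4),   # cosine space only
                "section_id": meta.get("section_id"),
                "text": doc,
            }
        )
    return hits

# Stand-in for a real query result; values copied from the output above.
fake_res = {
    "documents": [["Humanism is a perspective ...", "During the early 20th century ..."]],
    "metadatas": [[{"section_id": "1.2"}, {"section_id": "1.2"}]],
    "distances": [[0.5297, 0.5630]],
}

for hit in flatten_hits(fake_res, max_distance=0.55):
    print(hit["similarity"], hit["section_id"], hit["text"])
```

The `max_distance` threshold is a tuning choice, not something the notebook prescribes; for a query as broad as "What is humanity?" even the best hit is only moderately similar.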

In [80]:
results[30]["text"]

'During the early 20th century, American psychology was dominated by behaviorism and psychoanalysis. However, some psychologists were uncomfortable with what they viewed as limited perspectives being so influential to the field. They objected to the pessimism and determinism (all actions driven by the unconscious) of Freud. They also disliked the reductionism, or simplifying nature, of behaviorism. Behaviorism is also deterministic at its core, because it sees human behavior as entirely determined by a combination of genetics and environment. Some psychologists began to form their own ideas that emphasized personal control, intentionality, and a true predisposition for “good” as important for our self-concept and our behavior. Thus, humanism emerged. Humanism is a perspective within psychology that emphasizes the potential for good that is innate to all humans. Two of the most well-known proponents of humanistic psychology are Abraham Maslow and Carl Rogers (O’Hara, n.d.).'

In [74]:
results[22]["metadata"]

{'chapter_id': 1,
 'section_id': '1.2',
 'section_heading': '1.2 History of Psychology',
 'start_page': 21,
 'source': 'psychology_book.pdf'}

### using OpenAI (or other LLMs)