### Chapter content extraction

In [172]:
import fitz  # PyMuPDF
import re
import codecs

# Dictionary mapping example keys to PDF paths
examples = {
    "pdf_path1": "../../data/mcelreath_2020_statistical-rethinking.pdf",
    "pdf_path2": "../../data/Theory of Statistic.pdf",
    "pdf_path3": "../../data/Deep Learning with Python.pdf",
    "pdf_path4": "../../data/Natural_Image_Statistics.pdf",
    "pdf_path5": "../../data/mml-book.pdf"
}

# Dictionary mapping example keys to page ranges to extract content from
content_page_ranges = {
    "pdf_path1": range(5, 8),
    "pdf_path2": range(10, 17),
    "pdf_path3": range(7, 13),
    "pdf_path4": range(4, 13),
    "pdf_path5": range(2, 5),
}

# Select example number
n_example = 5
key = f"pdf_path{n_example}"

# Open the PDF
doc = fitz.open(examples[key])

# Extract text from the specified page range
chapters_content_list = []
for page_num in content_page_ranges[key]:
    page = doc[page_num]
    text = page.get_text("text")
    chapters_content_list.append(text)

# Join all text pages into a single string if needed
chapters_content = "\n".join(chapters_content_list)

print(chapters_content)  # or pass it to your model

Contents
Foreword
1
Part I
Mathematical Foundations
9
1
Introduction and Motivation
11
1.1
Finding Words for Intuitions
12
1.2
Two Ways to Read This Book
13
1.3
Exercises and Feedback
16
2
Linear Algebra
17
2.1
Systems of Linear Equations
19
2.2
Matrices
22
2.3
Solving Systems of Linear Equations
27
2.4
Vector Spaces
35
2.5
Linear Independence
40
2.6
Basis and Rank
44
2.7
Linear Mappings
48
2.8
Afﬁne Spaces
61
2.9
Further Reading
63
Exercises
64
3
Analytic Geometry
70
3.1
Norms
71
3.2
Inner Products
72
3.3
Lengths and Distances
75
3.4
Angles and Orthogonality
76
3.5
Orthonormal Basis
78
3.6
Orthogonal Complement
79
3.7
Inner Product of Functions
80
3.8
Orthogonal Projections
81
3.9
Rotations
91
3.10
Further Reading
94
Exercises
96
4
Matrix Decompositions
98
4.1
Determinant and Trace
99
i
This material will be published by Cambridge University Press as Mathematics for Machine Learn-
ing by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is
free to
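
The extracted text above puts each TOC entry on three separate lines (label, title, page), and unnumbered entries such as "Foreword" or "Exercises" on two. A deterministic baseline is useful for judging the model's answer, so here is a minimal sketch of one. It assumes that exact line layout, arabic page numbers, and single-line titles; `parse_toc_lines` and `SAMPLE_TOC` are hypothetical names introduced for illustration, and end pages follow the same "next chapter's start page minus 1" rule used in the prompt further down.

```python
import re

NUMBER = re.compile(r"^\d+(\.\d+)*$")  # section label such as "2", "2.1" or "3.10"
INTEGER = re.compile(r"^\d+$")         # chapter label or page number such as "17"


def parse_toc_lines(toc_text):
    """Extract main chapters from a TOC whose label, title and page sit on separate lines."""
    lines = [ln.strip() for ln in toc_text.splitlines() if ln.strip()]
    chapters = []
    i = 0
    while i < len(lines):
        # Numbered entry: label line, then a title line, then a page-number line.
        numbered = (
            i + 2 < len(lines)
            and NUMBER.match(lines[i])
            and not NUMBER.match(lines[i + 1])
            and INTEGER.match(lines[i + 2])
        )
        # Unnumbered entry: title line followed directly by a page-number line.
        unnumbered = (
            i + 1 < len(lines)
            and not NUMBER.match(lines[i])
            and INTEGER.match(lines[i + 1])
        )
        if numbered:
            if INTEGER.match(lines[i]):  # keep "2", skip subsections like "2.1"
                chapters.append({
                    "chapter_number": lines[i],
                    "chapter_title": lines[i + 1],
                    "start_page": int(lines[i + 2]),
                    "end_page": None,  # unknown for the last chapter
                })
            i += 3
        elif unnumbered:
            i += 2  # consume e.g. "Foreword" / "1" so its page is not read as a label
        else:
            i += 1  # heading or footer text with no page number
    # End page = next chapter's start page minus 1.
    for current, following in zip(chapters, chapters[1:]):
        current["end_page"] = following["start_page"] - 1
    return chapters


# Shortened copy of the extracted text, for illustration only.
SAMPLE_TOC = """Contents
Foreword
1
Part I
Mathematical Foundations
9
1
Introduction and Motivation
11
1.1
Finding Words for Intuitions
12
2
Linear Algebra
17
2.9
Further Reading
63
Exercises
64
3
Analytic Geometry
70
"""

baseline = parse_toc_lines(SAMPLE_TOC)
for chapter in baseline:
    print(chapter)
```

In the notebook, `parse_toc_lines(chapters_content)` would give a reference list to compare against the model output. Consuming entries sequentially matters here: a simple "number, title, number" pattern match would misread "63 / Exercises / 64" as a chapter.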

### Check Gemma3 performance on chapter information extraction


In [35]:
import requests
from dotenv import load_dotenv
import os
import re
import json
import time

load_dotenv()  # Loads .env file into environment

# Your endpoint ID and API key
api_key = os.getenv("RUNPOD_API_KEY")
endpoint2 = "https://api.runpod.ai/v2/4zyobam3zy2bci"
endpoint = "https://api.runpod.ai/v2/hmje50gz4lr97c"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}


In [173]:
def format_messages_as_prompt(messages):
    """Convert messages list to a single prompt string for Ollama generate endpoint"""
    prompt_parts = []
    
    for message in messages:
        role = message["role"]
        content = message["content"]
        
        if role == "system":
            prompt_parts.append(f"System: {content}")
        elif role == "user":
            prompt_parts.append(f"User: {content}")
        elif role == "assistant":
            prompt_parts.append(f"Assistant: {content}")
    
    # Add final prompt for the assistant to respond
    prompt_parts.append("Assistant:")
    
    return "\n\n".join(prompt_parts)


# Your messages array
messages = [
        {
            "role": "system",
            "content": "You are a precise document parser that extracts structured information from table of contents. You NEVER hallucinate, invent, or make up information. You ONLY extract what is explicitly present in the provided text. If you cannot find clear chapter information, you return an empty array. You do not guess chapter titles or page numbers."
        },
        {
            "role": "user",
            "content": "I need to extract main chapter information from this table of contents. Only extract numbered chapters, ignore subsections. Do not make up any information."
        },
        {
            "role": "assistant",
            "content": "I understand. I will extract ONLY the main chapters that are explicitly shown in your table of contents. I will not invent, guess, or hallucinate any chapter titles or page numbers. I will only use the exact information present in the document."
        },
        {
            "role": "user",
            "content": f"""Here is the table of contents:

{chapters_content}

WARNING: DO NOT HALLUCINATE OR INVENT INFORMATION
- Do NOT make up chapter titles like "Probability", "Statistical Inference", "Linear Regression"
- Do NOT guess page numbers
- Do NOT create generic textbook chapters
- ONLY extract what you can clearly see in the provided text

CRITICAL RULES:
1. Extract ONLY main chapters that start with a number (1, 2, 3, etc.)
2. Do NOT extract subsections (like 1.1, 1.2, 2.1, etc.)
3. Use the EXACT chapter titles shown in the document
4. Use the EXACT page numbers shown in the document
5. Handle both roman numerals (i, ii, iii, v, x) and arabic numerals (1, 25, 100)
6. Calculate end pages as: next chapter's start page minus 1
7. Return ONLY valid JSON - no explanations, no markdown formatting
8. If you cannot clearly identify chapters, return empty array []

Look for patterns like:
- "1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1"
- "2 Distribution Theory and Statistical Models . . . . . . . . . . . . . . . . 155"
- "3 Basic Statistical Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205"

DO NOT extract lines like:
- "1.1 Some Important Music Concepts . . . . . . . . . . . 3"
- "Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v"

Use ONLY the exact titles from the document. Do not shorten or modify them.

Return JSON array: [{{"chapter_number": "X", "chapter_title": "...", "start_page": X, "end_page": X}}]

REMEMBER: Extract only what is explicitly visible in the text. Do not hallucinate. Be complete and extract all chapters that are clearly numbered. If you cannot identify any chapters, return an empty array []."""
        },
        {
            "role": "assistant",
            "content": "I will carefully examine the table of contents and extract only the main chapters that are explicitly shown, using their exact titles and page numbers. I will not invent or hallucinate any information."
        }
    ]


# Convert to proper format
formatted_prompt = format_messages_as_prompt(messages)

# Build the payload for RunPod
payload = {
    "input": {
        "prompt": formatted_prompt  # Now it's a single string
    }
}

# What the formatted prompt will look like:
print("Formatted prompt:")
print(f'"""{formatted_prompt}"""')



Formatted prompt:
"""System: You are a precise document parser that extracts structured information from table of contents. You NEVER hallucinate, invent, or make up information. You ONLY extract what is explicitly present in the provided text. If you cannot find clear chapter information, you return an empty array. You do not guess chapter titles or page numbers.

User: I need to extract main chapter information from this table of contents. Only extract numbered chapters, ignore subsections. Do not make up any information.

Assistant: I understand. I will extract ONLY the main chapters that are explicitly shown in your table of contents. I will not invent, guess, or hallucinate any chapter titles or page numbers. I will only use the exact information present in the document.

User: Here is the table of contents:

Contents
Foreword
1
Part I
Mathematical Foundations
9
1
Introduction and Motivation
11
1.1
Finding Words for Intuitions
12
1.2
Two Ways to Read This Book
13
1.3
Exercises and

In [174]:
# 1. Start a job
start_response = requests.post(f"{endpoint}/run", json=payload, headers=headers)
start_response.raise_for_status()  # Fail early on authentication or endpoint errors
job = start_response.json()
job_id = job["id"]

print(f"Job started with ID: {job_id}")

# 2. Poll until the job reaches a terminal state
status = None
while status not in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
    time.sleep(3)
    poll_response = requests.get(f"{endpoint}/status/{job_id}", headers=headers)
    poll_data = poll_response.json()
    status = poll_data["status"]
    print(f"Job status: {status}")

# 3. Get result if completed
if status == "COMPLETED":
    output_raw = poll_data['output']['response']
    print("Job Output:")
    print(output_raw)
else:
    print(f"Job ended with status: {status}")

Job started with ID: 2925388e-9ade-424d-a75a-54e7811054d6-e1
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: COMPLETED
Job Output:
```json
[
  {
    "chapter_number": "1",
    "chapter_title": "Introduction and Motivation",
    "start_page": 11,
    "end_page": 15
  },
  {
    "chapter_number": "2",
    "chapter_title": "Linear Algebra",
    "start_page": 17,
    "end_page": 62
  },
  {
    "chapter_number": "3",
    "chapter_title": "Analytic Geometry",
    "start_page": 70,
    "end_page": 93
  },
  {
    "chapter_number": "4",
    "chapter_title": "Matrix Decompositions",
    "start_page": 98,
    "end_page": 136
  },
  {
    "chapter_number": "5",
    "chapter_title": "Vector Calculus",
    "start_page": 139,
    "end_page": 164
  },
  {
    "chapter_number": "6",
    "chapter_title": "Probability and Distributions",
    "start_

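The polling loop above has no upper bound on how long it waits, so a stuck job would block the notebook indefinitely. A minimal sketch of a bounded version follows. `poll_until_done` and `TERMINAL_STATUSES` are hypothetical names, and the set of terminal status strings is an assumption about what the endpoint reports; the status fetch is passed in as a callable so the helper can be exercised without a network call.

```python
import time

# Assumed terminal states; adjust to whatever the endpoint actually reports.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def poll_until_done(fetch_status, interval=3.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal status or the timeout expires.

    fetch_status must return a dict with a "status" key, for example
    lambda: requests.get(f"{endpoint}/status/{job_id}", headers=headers).json()
    """
    deadline = clock() + timeout
    while True:
        data = fetch_status()
        if data.get("status") in TERMINAL_STATUSES:
            return data
        if clock() >= deadline:
            raise TimeoutError(f"Job still {data.get('status')!r} after {timeout} s")
        sleep(interval)


# Offline demonstration with canned responses instead of HTTP calls.
fake_responses = iter([
    {"status": "IN_QUEUE"},
    {"status": "IN_PROGRESS"},
    {"status": "COMPLETED", "output": {"response": "[]"}},
])
result = poll_until_done(lambda: next(fake_responses), sleep=lambda seconds: None)
print(result["status"])
```
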
In [175]:
def clean_and_parse_json(raw_text):
    """Strip an optional Markdown code fence from the model output and parse it as JSON."""
    cleaned = raw_text.strip()

    # Remove a leading ```json (or bare ```) fence and a trailing ``` fence.
    # str.strip('```json') would strip a set of characters, not a prefix.
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)

    # json.loads handles JSON escapes itself; decoding with 'unicode_escape'
    # first corrupts non-ASCII characters such as the "ﬁ" ligature.
    return json.loads(cleaned)


chapters = clean_and_parse_json(output_raw)
for chapter in chapters:
    print(f"{chapter['chapter_number']}: {chapter['chapter_title']}")

1: Introduction and Motivation
2: Linear Algebra
3: Analytic Geometry
4: Matrix Decompositions
5: Vector Calculus
6: Probability and Distributions
7: Continuous Optimization
8: When Models Meet Data
9: Linear Regression
10: Dimensionality Reduction with Principal Component Analysis
11: Density Estimation with Gaussian Mixture Models
12: Classiﬁcation with Support Vector Machines
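
Printing the titles does not check the page numbers. In the job output above, chapter 1 is given `end_page` 15 although chapter 2 starts on page 17, so rule 6 of the prompt (end page = next chapter's start page minus 1) would have produced 16. The sketch below recomputes end pages from the start pages and lists titles that do not appear verbatim in the source text. `recompute_end_pages` and `ungrounded_titles` are hypothetical helpers, and the sketch assumes integer `start_page` values.

```python
def recompute_end_pages(chapters):
    """Return copies of the chapter dicts with end_page = next start_page - 1.

    The last chapter keeps whatever end_page it already had, because the
    table of contents alone does not say where it ends.
    """
    ordered = sorted(chapters, key=lambda c: c["start_page"])
    fixed = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        end = following["start_page"] - 1 if following else current.get("end_page")
        fixed.append({**current, "end_page": end})
    return fixed


def ungrounded_titles(chapters, source_text):
    """Return titles that do not occur verbatim (ignoring whitespace runs) in source_text."""
    normalized = " ".join(source_text.split())
    return [
        c["chapter_title"]
        for c in chapters
        if " ".join(c["chapter_title"].split()) not in normalized
    ]


# First three entries of the model output shown above.
sample_chapters = [
    {"chapter_number": "1", "chapter_title": "Introduction and Motivation", "start_page": 11, "end_page": 15},
    {"chapter_number": "2", "chapter_title": "Linear Algebra", "start_page": 17, "end_page": 62},
    {"chapter_number": "3", "chapter_title": "Analytic Geometry", "start_page": 70, "end_page": 93},
]
sample_source = "1\nIntroduction and Motivation\n11\n2\nLinear Algebra\n17\n3\nAnalytic Geometry\n70\n"

fixed = recompute_end_pages(sample_chapters)
print([c["end_page"] for c in fixed])
print(ungrounded_titles(sample_chapters, sample_source))
```

In the notebook, the same calls would be `recompute_end_pages(chapters)` and `ungrounded_titles(chapters, chapters_content)`; a non-empty result from the second is a cheap signal that the model invented or altered a title.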
