# Batch 6 documents:
CHUNK SIZE: 1000 chars

/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Bibliography on Contemporary Issues 1984 PS and Allen.pdf
- Collection: pamphlet_chunks
- Date: 1984
- Add to metadata: this is a list of books that Phyllis Schlafly recommends.


/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Disarmament - The New U.S. Initiative Sept 1962.pdf
- Collection: pamphlet_chunks
- Date: 1962
- Within the metadata for all chunks from this document, include the title of the pamphlet and the date it was published.
- This has multiple essays by multiple authors. Metadata needs to be dependent on the essay and author.

Essay 1: 
- Title: THE CALL FOR LEADERSHIP  
- Author: John J. McCloy, Chairman, General Advisory Committee of the United States Arms Control and Disarmament Agency 
- PDF pages: 4-9 inclusive

Essay 2: 
- Title: WORKING TOWARD A WORLD WITHOUT WAR
- Author: Adlai E. Stevenson, U.S. Representative to the United Nations. 
- PDF pages: 10-18 inclusive

Essay 3: 
- Title: U.S. OUTLINES INITIAL PROPOSALS OF PROGRAM FOR GENERAL AND COMPLETE DISARMAMENT
- Author: Dean Rusk, Secretary of State 
- PDF pages: 18-23 inclusive

Essay 4:
- Title: U.S. URGES SOVIET UNION TO JOIN IN ENDING NUCLEAR WEAPON TESTS
- Author: Dean Rusk, Secretary of State.
- PDF pages: 23-28 inclusive

Essay 5:
- Title: THE INITIATIVE FOR PEACE
- Author: William C. Foster, Director, United States Arms Control and Disarmament Agency
- PDF pages: 28-32 inclusive

Essay 6:
- Title: THE NEW SEARCH FOR DISARMAMENT-GENEVA 1962
- Author: Arthur H. Dean, U.S. Representative to the Conference of the 18-Nation Disarmament Committee 
- PDF pages: 33-38 inclusive


/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Pamphlet Our Moral Duty to Defend Freedom 1983.pdf
- Collection: pamphlet_chunks
- Date: 1983
- This has multiple essays by multiple authors. Metadata needs to be dependent on the essay and author.
- Within the metadata for all chunks from this document, include the title of the pamphlet and the date it was published.

Introduction:
    - Title: Introduction
    - Author: Phyllis Schlafly
    - PDF pages: 4
Essay 1:
    - Title: The Ethics of the Freeze Movement.
    - Author: Dr. Ernest W. Lefever, President of the Ethics and Public Policy Center.
    - PDF pages: 5-9 inclusive
Essay 2:
    - Title: The Just War Doctrine in the Nuclear Age.
    - Author: Dr. William V. O'Brien, Professor of Government at Georgetown University.
    - PDF pages: 10-15 inclusive
Essay 3:
    - Title: The Jewish Tradition and National Defense.
    - Author: Rabbi Joshua O. Haberman, Senior Rabbi of the Washington Hebrew Congregation.
    - PDF pages: 16-24 inclusive
Discussion Section:
    - Title: Discussion
    - Author: Phyllis Schlafly, Dr. Ernest W. Lefever, Dr. William V. O'Brien, Rabbi Joshua O. Haberman, Dr. Ronald P. McArthur
    - PDF pages: 25-33 inclusive
Essay 4:
    - Title: The Challenge of Peace: A Theology of Defeat.
    - Author: Dr. Ronald P. McArthur, President of Thomas Aquinas College of Santa Paula, California.
    - PDF pages: 34-end inclusive


/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Pamphlet The Pastoral Letter on War and Peace We Wish the Bishops Had Written.pdf
- Collection: pamphlet_chunks
- Date: November 14, 1982
- Author: Phyllis Schlafly


/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Program Testimonial Dinner for Phyllis Schlafly 9 23 1994.pdf
- Collection: misc_chunks
- Date: September 23, 1994
- Author: Phyllis Schlafly
- Do not do chunk sizing for this document, just include the entire document in one chunk.

/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Reading List for Americans.pdf
- Collection: pamphlet_chunks
- Date: 1983
- Author: Phyllis Schlafly

/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Report of American Bar Association Special Committee on Communist Tactics March 1 1962.pdf
- Collection: pamphlet_chunks
- Date: March 1, 1962
- Author: American Bar Association
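
The "PDF pages" values above come in three shapes: a closed range (`4-9 inclusive`), an open-ended range (`34-end inclusive`), and a single page (`4`). As a minimal sketch, assuming one wanted to derive `start_page`/`end_page` from these strings rather than typing them by hand, a small parser could look like the following. The name `parse_page_range` is hypothetical and is not part of the notebook.

```python
from __future__ import annotations

import re


def parse_page_range(spec: str) -> tuple[int, int | None]:
    """Parse a 'PDF pages' value into 1-based, inclusive (start_page, end_page).

    Hypothetical helper (not in the original notebook).
    end_page is None when the spec says 'end', meaning "through the last page".
    """
    cleaned = spec.lower().replace("inclusive", "").strip()
    match = re.fullmatch(r"(\d+)(?:\s*-\s*(\d+|end))?", cleaned)
    if match is None:
        raise ValueError(f"Unrecognized page range: {spec!r}")

    start = int(match.group(1))
    end_raw = match.group(2)
    if end_raw is None:
        # A bare number such as "4" means a single page.
        return start, start
    if end_raw == "end":
        return start, None

    end = int(end_raw)
    if end < start:
        raise ValueError(f"End page precedes start page: {spec!r}")
    return start, end


print(parse_page_range("4-9 inclusive"))
print(parse_page_range("34-end inclusive"))
print(parse_page_range("4"))
```

The returned tuple uses the same convention as the `start_page`/`end_page` arguments in the cells that follow: 1-based, inclusive, with `None` standing for the end of the document.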

In [1]:
# Imports and helpers for batch 6 chunking
import json
from pathlib import Path
import fitz

OUTPUT_DIR = Path("/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/chunks/batch6")
CHUNK_SIZE = 1000

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def extract_text_from_pdf(pdf_path: str, start_page: int | None = None, end_page: int | None = None) -> str:
    """Extract text from a PDF. start_page/end_page are 1-based and inclusive. None means start or end of document."""
    doc = fitz.open(pdf_path)
    try:
        total_pages = doc.page_count
        start_idx = 0 if start_page is None else max(0, start_page - 1)
        end_idx = (total_pages - 1) if end_page is None else min(total_pages - 1, end_page - 1)
        if start_idx > end_idx:
            return ""
        text_parts = []
        for i in range(start_idx, end_idx + 1):
            text_parts.append(doc.load_page(i).get_text("text"))
        return "\n".join(text_parts)
    finally:
        doc.close()


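def chunk_text(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Greedily pack non-empty lines into chunks of at most chunk_size characters."""
    # Splitting on "\n" yields individual lines, not blank-line-separated paragraphs.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks: list[str] = []
    current: str = ""
    for para in paragraphs:
        if len(current) + len(para) + 1 <= chunk_size:
            current = (current + " " + para).strip() if current else para
        else:
            if current:
                chunks.append(current)
            if len(para) <= chunk_size:
                current = para
            else:
                # Hard-wrap very long paragraphs. Reset the buffer first so the chunk
                # flushed above is not appended again when len(para) is an exact
                # multiple of chunk_size.
                current = ""
                for i in range(0, len(para), chunk_size):
                    segment = para[i:i+chunk_size]
                    if len(segment) == chunk_size:
                        chunks.append(segment)
                    else:
                        # Keep the short tail so following lines can be packed onto it.
                        current = segment
    if current:
        chunks.append(current)
    return chunks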


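def save_json(chunks: list[dict], outpath: Path) -> None:
    # Write UTF-8 explicitly and keep non-ASCII characters readable in the output.
    with open(outpath, "w", encoding="utf-8") as f:
        json.dump(chunks, f, indent=2, ensure_ascii=False)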


def safe_filename_from_path(path: str) -> str:
    return Path(path).stem.replace(" ", "_").replace("'", "").replace('"', "")
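
`extract_text_from_pdf` converts 1-based, inclusive page numbers into 0-based indices and clamps them to the document length. The sketch below isolates just that arithmetic so it can be checked without opening a PDF. The helper name `page_indices` is hypothetical; it mirrors the index logic of the function above and is not a PyMuPDF call.

```python
from __future__ import annotations


def page_indices(
    total_pages: int,
    start_page: int | None = None,
    end_page: int | None = None,
) -> range:
    """Return the 0-based page indices covered by a 1-based, inclusive page range.

    Hypothetical helper mirroring the clamping in extract_text_from_pdf.
    None means "from the first page" or "through the last page".
    """
    start_idx = 0 if start_page is None else max(0, start_page - 1)
    end_idx = total_pages - 1 if end_page is None else min(total_pages - 1, end_page - 1)
    # When start_idx > end_idx the range is empty, matching the "" early return.
    return range(start_idx, end_idx + 1)


# Pages 4-9 of a 40-page document map to indices 3..8.
print(list(page_indices(40, 4, 9)))
# An open-ended range runs through the last page.
print(list(page_indices(40, 34, None)))
# A range that starts past the end of the document yields no pages.
print(list(page_indices(10, 12, 15)))
```

The 40-page and 10-page totals are made-up values for illustration; the real page counts depend on the PDFs.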


In [2]:
# Document configurations based on markdown instructions

DOCS = []

# 1) Bibliography on Contemporary Issues 1984 PS and Allen
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Bibliography on Contemporary Issues 1984 PS and Allen.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "date": "1984",
        "note": "This is a list of books that Phyllis Schlafly recommends.",
        "pamphlet_title": "Bibliography on Contemporary Issues"
    },
    "sections": [
        {"title": "Bibliography on Contemporary Issues", "author": "Phyllis Schlafly and Allen", "start_page": None, "end_page": None}
    ]
})

# 2) Disarmament - The New U.S. Initiative Sept 1962 (multi-essay)
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Disarmament - The New U.S. Initiative Sept 1962.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "pamphlet_title": "Disarmament - The New U.S. Initiative",
        "date": "1962"
    },
    "sections": [
        {"title": "THE CALL FOR LEADERSHIP", "author": "John J. McCloy", "start_page": 4, "end_page": 9},
        {"title": "WORKING TOWARD A WORLD WITHOUT WAR", "author": "Adlai E. Stevenson", "start_page": 10, "end_page": 18},
        {"title": "U.S. OUTLINES INITIAL PROPOSALS OF PROGRAM FOR GENERAL AND COMPLETE DISARMAMENT", "author": "Dean Rusk", "start_page": 18, "end_page": 23},
        {"title": "U.S. URGES SOVIET UNION TO JOIN IN ENDING NUCLEAR WEAPON TESTS", "author": "Dean Rusk", "start_page": 23, "end_page": 28},
        {"title": "THE INITIATIVE FOR PEACE", "author": "William C. Foster", "start_page": 28, "end_page": 32},
        {"title": "THE NEW SEARCH FOR DISARMAMENT-GENEVA 1962", "author": "Arthur H. Dean", "start_page": 33, "end_page": 38}
    ]
})

# 3) Pamphlet Our Moral Duty to Defend Freedom 1983 (multi-essay)
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Pamphlet Our Moral Duty to Defend Freedom 1983.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "pamphlet_title": "Our Moral Duty to Defend Freedom",
        "date": "1983"
    },
    "sections": [
        {"title": "Introduction", "author": "Phyllis Schlafly", "start_page": 4, "end_page": 4},
        {"title": "The Ethics of the Freeze Movement.", "author": "Dr. Ernest W. Lefever", "start_page": 5, "end_page": 9},
        {"title": "The Just War Doctrine in the Nuclear Age.", "author": "Dr. William V. O'Brien", "start_page": 10, "end_page": 15},
        {"title": "The Jewish Tradition and National Defense.", "author": "Rabbi Joshua O. Haberman", "start_page": 16, "end_page": 24},
        {"title": "Discussion", "author": "Phyllis Schlafly; Dr. Ernest W. Lefever; Dr. William V. O'Brien; Rabbi Joshua O. Haberman; Dr. Ronald P. McArthur", "start_page": 25, "end_page": 33},
        {"title": "The Challenge of Peace: A Theology of Defeat.", "author": "Dr. Ronald P. McArthur", "start_page": 34, "end_page": None}
    ]
})

# 4) The Pastoral Letter on War and Peace We Wish the Bishops Had Written
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Pamphlet The Pastoral Letter on War and Peace We Wish the Bishops Had Written.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "date": "1982-11-14",
        "author": "Phyllis Schlafly",
        "pamphlet_title": "The Pastoral Letter on War and Peace We Wish the Bishops Had Written"
    },
    "sections": [
        {"title": "The Pastoral Letter on War and Peace We Wish the Bishops Had Written", "author": "Phyllis Schlafly", "start_page": None, "end_page": None}
    ]
})

# 5) Program Testimonial Dinner for Phyllis Schlafly 9 23 1994 (single chunk)
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Program Testimonial Dinner for Phyllis Schlafly 9 23 1994.pdf",
    "collection": "misc_chunks",
    "chunk": False,  # single chunk
    "metadata_common": {
        "date": "1994-09-23",
        "author": "Phyllis Schlafly",
        "doc_title": "Program Testimonial Dinner for Phyllis Schlafly"
    },
    "sections": [
        {"title": "Program Testimonial Dinner for Phyllis Schlafly", "author": "Phyllis Schlafly", "start_page": None, "end_page": None}
    ]
})

# 6) Reading List for Americans
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Reading List for Americans.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "date": "1983",
        "author": "Phyllis Schlafly",
        "pamphlet_title": "Reading List for Americans"
    },
    "sections": [
        {"title": "Reading List for Americans", "author": "Phyllis Schlafly", "start_page": None, "end_page": None}
    ]
})

# 7) Report of American Bar Association Special Committee on Communist Tactics March 1 1962
DOCS.append({
    "path": "/Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/raw/Report of American Bar Association Special Committee on Communist Tactics March 1 1962.pdf",
    "collection": "pamphlet_chunks",
    "chunk": True,
    "chunk_size": CHUNK_SIZE,
    "metadata_common": {
        "date": "1962-03-01",
        "author": "American Bar Association",
        "pamphlet_title": "Report of ABA Special Committee on Communist Tactics"
    },
    "sections": [
        {"title": "Report of American Bar Association Special Committee on Communist Tactics", "author": "American Bar Association", "start_page": None, "end_page": None}
    ]
})
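
Several adjacent sections in the Disarmament configuration share a boundary page (one essay ends on the page where the next begins). Because extraction works on whole pages, the full text of such a page is extracted, and therefore chunked, under both sections. The following is a minimal sketch of a check that lists those shared pages; `find_shared_pages` is a hypothetical helper, and the sections are labeled "Essay 1" through "Essay 6" for brevity rather than by their full titles.

```python
from __future__ import annotations


def find_shared_pages(sections: list[dict]) -> list[tuple[str, str, int]]:
    """Return (earlier_title, later_title, page) for pages claimed by two adjacent sections.

    Hypothetical helper; expects dicts with "title", "start_page", "end_page"
    (1-based, inclusive, None meaning open-ended) in document order.
    """
    shared: list[tuple[str, str, int]] = []
    for earlier, later in zip(sections, sections[1:]):
        end_page = earlier["end_page"]
        start_page = later["start_page"]
        # Open-ended bounds cannot be compared without the page count, so skip them.
        if end_page is None or start_page is None:
            continue
        for page in range(start_page, end_page + 1):
            shared.append((earlier["title"], later["title"], page))
    return shared


# Page ranges copied from the Disarmament pamphlet configuration.
DISARMAMENT_SECTIONS = [
    {"title": "Essay 1", "start_page": 4, "end_page": 9},
    {"title": "Essay 2", "start_page": 10, "end_page": 18},
    {"title": "Essay 3", "start_page": 18, "end_page": 23},
    {"title": "Essay 4", "start_page": 23, "end_page": 28},
    {"title": "Essay 5", "start_page": 28, "end_page": 32},
    {"title": "Essay 6", "start_page": 33, "end_page": 38},
]

for earlier_title, later_title, page in find_shared_pages(DISARMAMENT_SECTIONS):
    print(f"page {page} is extracted for both {earlier_title} and {later_title}")
```

Whether the duplication matters depends on how the chunks are used downstream; the check only makes it visible.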


In [3]:
# Process and write JSON chunks for batch 6

all_outputs: list[dict] = []

for doc in DOCS:
    path = doc["path"]
    collection = doc["collection"]
    chunk_enabled = doc.get("chunk", True)
    chunk_size = int(doc.get("chunk_size", CHUNK_SIZE))
    common_meta = doc.get("metadata_common", {})

    if not Path(path).exists():
        print(f"Warning: file missing -> {path}")
        continue

    print(f"Processing: {Path(path).name} | collection={collection} | chunk={chunk_enabled}")

    doc_outputs: list[dict] = []

    for section in doc.get("sections", []):
        title = section.get("title")
        author = section.get("author")
        start_page = section.get("start_page")
        end_page = section.get("end_page")

        text = extract_text_from_pdf(path, start_page, end_page)
        if not text.strip():
            print(f"  - Skipped empty text for section: {title}")
            continue

        if chunk_enabled:
            text_chunks = chunk_text(text, chunk_size)
        else:
            text_chunks = [text]

        for chunk in text_chunks:  # the enumerate index was never used
            payload = {
                "collection": collection,
                "text": chunk,
                "metadata": {
                    **common_meta,
                    "section_title": title,
                    "author": author,
                    "source_file": path,
                },
            }
            doc_outputs.append(payload)

    if not doc_outputs:
        print("  - No outputs for this document")
        continue

    # Write per-document JSON
    outfile = OUTPUT_DIR / f"{safe_filename_from_path(path)}.json"
    save_json(doc_outputs, outfile)
    all_outputs.extend(doc_outputs)
    print(f"  - Wrote {len(doc_outputs)} chunks -> {outfile}")

print(f"\nTotal chunks across all documents: {len(all_outputs)}")


Processing: Bibliography on Contemporary Issues 1984 PS and Allen.pdf | collection=pamphlet_chunks | chunk=True
  - Wrote 51 chunks -> /Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/chunks/batch6/Bibliography_on_Contemporary_Issues_1984_PS_and_Allen.json
Processing: Disarmament - The New U.S. Initiative Sept 1962.pdf | collection=pamphlet_chunks | chunk=True
  - Wrote 183 chunks -> /Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/chunks/batch6/Disarmament_-_The_New_U.S._Initiative_Sept_1962.json
Processing: Pamphlet Our Moral Duty to Defend Freedom 1983.pdf | collection=pamphlet_chunks | chunk=True
  - Wrote 86 chunks -> /Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSAI/chunks/batch6/Pamphlet_Our_Moral_Duty_to_Defend_Freedom_1983.json
Processing: Pamphlet The Pastoral Letter on War and Peace We Wish the Bishops Had Written.pdf | collection=pamphlet_chunks | chunk=True
  - Wrote 33 chunks -> /Users/mason/Desktop/Technical_Projects/PYTHON_Projects/PSA