# Data Processing Pipeline for [Human Pin Code](https://humanpincode.com/) by *Douglas Forbes*

This notebook outlines a data processing pipeline designed to extract, convert, and summarize content from Douglas Forbes' book, "Human Pin Code". The goal is to prepare the book's content, specifically the sections related to individual "pin code" digits, for further analysis or use with large language models (LLMs).

The pipeline consists of four stages: a preparation stage followed by three processing stages.

1.  **Stage 0: Preparations:** Initializes constants, directory paths, and sets up the Google Gemini API client and configuration for subsequent tasks (Markdown conversion and summarization). Includes helper functions for uploading files to Gemini and waiting for them to become active.
2.  **Stage 1: Partition the book by its chapters and digits:** Uses the `pymupdf` library to open the main PDF book and partition it into smaller PDF files. Each partition corresponds to a specific chapter or "digit" section of the book, based on predefined page ranges. This step is crucial for breaking down the large document into manageable chunks relevant to specific topics (digits).
3.  **Stage 2: Convert PDF partitions to Markdown files:** Iterates through the partitioned PDF files. For each PDF, it uploads the file to the Google Gemini API, prompts the model to convert the PDF content into Markdown format using a specialized configuration (`md_content_config`), and saves the resulting Markdown text to a new file. This converts the content into a more easily parsable text format.
4.  **Stage 3: Summarization:** Processes the generated Markdown files. It combines the "initial" section of a chapter with the specific "digit" section's content. This combined text is then sent to the Google Gemini API with a summarization configuration (`summ_content_config`) designed to create a Turkish bulleted list summary of the key concepts related to that digit. The resulting summary is saved to a new Markdown file.

The overall process aims to transform the structured information within the PDF book into summarized, text-based content, organized by the book's internal structure (chapters/digits), making it suitable for downstream applications like knowledge bases or further LLM processing.
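
The stages hand files to one another through a naming convention: every partition is identified by a chapter number plus either `initial` or a digit. A minimal sketch of that convention, assuming the directory layout defined in Stage 0. The helper name `artifact_paths` is hypothetical and is not part of the notebook.

```python
from pathlib import PurePosixPath

# Directory layout mirrored from the constants defined in Stage 0.
DIR_TR = PurePosixPath("tr")


def artifact_paths(place: int, digit: "int | str") -> dict[str, PurePosixPath]:
    """Return the files each stage reads or writes for one partition.

    `artifact_paths` is a hypothetical helper used only for illustration.
    """
    stem = f"{place}_{digit}"
    paths = {
        "pdf": DIR_TR / "PDFs" / f"{stem}.pdf",  # written by Stage 1
        "markdown": DIR_TR / "MDs" / f"{stem}.md",  # written by Stage 2
    }
    # Stage 3 summarizes digit sections only; "initial" sections are inputs.
    if digit != "initial":
        paths["summary"] = DIR_TR / "Summarizations" / f"{stem}.md"
    return paths


print(artifact_paths(1, 3)["summary"])  # tr/Summarizations/1_3.md
```

Keeping the stem (`1_3`, `2_initial`, ...) identical across directories is what lets Stage 3 locate the matching `initial` file from a digit file's name alone.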

In [None]:
import os
import time

from typing import TYPE_CHECKING

import pymupdf
from google import genai
from google.genai import types

if TYPE_CHECKING:
    from io import IOBase
    from os import PathLike
    from pathlib import Path

## Stage 0: Preparations

This initial stage involves setting up the necessary environment and resources for the data processing pipeline. It includes defining constants for directory paths and file names, establishing the connection to the Google Gemini API using an API key from environment variables, and implementing helper functions for managing file uploads to Gemini and monitoring their processing status. Additionally, specific configurations for the Gemini model are defined for the Markdown conversion and summarization tasks, including system instructions, temperature settings, response format, and safety settings.

In [None]:
# Constants
DIR_TR = "./tr"
PDF_TR = f"{DIR_TR}/PDFs"
MD_TR = f"{DIR_TR}/MDs"
SUMM_TR = f"{DIR_TR}/Summarizations"
BOOK = "./forbes_2002.pdf"

TIMEOUT_IN_SECONDS = 15
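
Stages 1 to 3 write into these directories and assume they already exist. A minimal sketch of creating them up front, assuming the layout above. `ensure_output_dirs` is a hypothetical helper, not something the notebook defines.

```python
import os
import tempfile


def ensure_output_dirs(base_dir: str) -> list[str]:
    """Create the PDFs, MDs, and Summarizations directories under `base_dir`."""
    created = []
    for name in ("PDFs", "MDs", "Summarizations"):
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demonstrated on a temporary directory; the notebook would pass DIR_TR instead.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_output_dirs(tmp)
    print([os.path.basename(d) for d in dirs])
```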

In [None]:
# Initialize Gemini API client using environment variable
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

In [None]:
def upload_to_gemini(file_path: "str | Path | PathLike[str] | IOBase", mime_type: str | None = None) -> types.File:
    """Upload a file to the Google Gemini API for processing.

    This function takes a file path or file-like object and uploads it to the Gemini API.
    The file can be used as input for various Gemini model tasks like content generation
    or analysis.

    Args:
        file_path: Path to the file or a file-like object to upload. Can be:
            - A string path
            - A pathlib.Path object
            - A PathLike object
            - An IOBase object (file handle)
        mime_type: Optional MIME type of the file. If None, Gemini will attempt to
            detect the type automatically.

    Returns:
        types.File: A Gemini File object representing the uploaded file.
            Contains metadata like uri, name, and display_name.

    Raises:
        UploadError: If the file upload fails
        ValueError: If the file path is invalid or file cannot be read
        TypeError: If the mime_type is invalid

    Example:
        >>> file = upload_to_gemini("document.pdf", mime_type="application/pdf")
        >>> print(f"Uploaded {file.display_name} as {file.uri}")

    See Also:
        - https://ai.google.dev/gemini-api/docs/prompting_with_media
    """
    file = client.files.upload(file=file_path, config=types.UploadFileConfig(mime_type=mime_type))

    print(f"Uploaded file '{file.display_name}' as: {file.uri}")

    return file
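
When `mime_type` is not passed, detection is left to the API. A small sketch of guessing the type locally with the standard library so it can be passed explicitly. `guess_mime_type` is a hypothetical helper, and the fallback value is an assumption.

```python
import mimetypes


def guess_mime_type(file_path: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to `default`."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    return mime_type or default


print(guess_mime_type("document.pdf"))  # application/pdf
```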

In [None]:
def wait_for_file_active(file: types.File) -> None:
    """Waits for a Gemini API file to become active through polling.

    Continuously monitors the processing state of a file uploaded to the Gemini API
    until it becomes active or fails. Uses a simple polling approach with fixed
    intervals defined by TIMEOUT_IN_SECONDS.

    Args:
        file: A Gemini API File object returned from the upload operation. Must contain
            a valid name attribute.

    Raises:
        AssertionError: If file.name or file.state is None
        Exception: If the file processing fails (final state is not "ACTIVE")
        TypeError: If file is not a valid Gemini File object
        APIError: If there are issues communicating with the Gemini API
            (see google.genai.errors.APIError)

    Notes:
        - Prints status updates to console with dots indicating ongoing processing
        - Current implementation uses basic polling - consider more robust approaches
          for production use like exponential backoff or async polling
        - Success/failure is determined solely by the final state being "ACTIVE"

    Example:
        >>> file = client.files.upload(file="document.pdf")
        >>> wait_for_file_active(file)  # Blocks until file is ready
    """
    assert file.name is not None, "File name should not be None"

    print(f"Waiting for processing file {file.name}...")

    file = client.files.get(name=file.name)

    while file.state is not None and file.state.name == "PROCESSING":
        print(".", end="", flush=True)

        time.sleep(TIMEOUT_IN_SECONDS)

        file = client.files.get(name=file.name)  # type: ignore[reportArgumentType]

    assert file.state is not None, "File state should not be None"

    if file.state.name != "ACTIVE":
        raise Exception(f"File {file.name} failed to process")

    print(f"File {file.name} is ready!")
    print()
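
The docstring's notes suggest exponential backoff as a more robust alternative to fixed-interval polling. A minimal sketch of that idea, written against a generic `get_state` callable so it runs without the API. The function name, its parameters, and the default delays are all assumptions for illustration.

```python
import time
from typing import Callable


def poll_with_backoff(
    get_state: Callable[[], str],
    pending_state: str = "PROCESSING",
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    max_wait: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `get_state` until it stops returning `pending_state`.

    The delay doubles after every attempt, capped at `max_delay`. Raises
    TimeoutError once the total time slept would exceed `max_wait`.
    """
    delay = initial_delay
    waited = 0.0
    state = get_state()
    while state == pending_state:
        if waited + delay > max_wait:
            raise TimeoutError(f"Still {pending_state} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)
        state = get_state()
    return state


# Simulated states stand in for repeated `client.files.get(...)` calls.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
delays = []
print(poll_with_backoff(lambda: next(states), sleep=delays.append))  # ACTIVE
print(delays)  # [1.0, 2.0]
```

Unlike the fixed-interval loop, this variant also gives up after `max_wait` seconds, so a file stuck in processing cannot block the notebook indefinitely.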

In [None]:
md_content_config = types.GenerateContentConfig(
    system_instruction="""You are a specialized Markdown converter designed to process and repair corrupted Turkish text files. Your task is to take potentially corrupted Turkish text, clean it, and convert it into well-formatted Markdown.  You should perform the following steps in order:

1.  **Text Extraction and Preservation:**  Extract the complete text from the input.  Crucially, preserve the original character order and any unusual characters *even if they appear to be errors*.  Do not attempt to "correct" anything at this stage.

2.  **De-hyphenation (Turkish-Specific):** Carefully de-hyphenate the raw text, paying close attention to Turkish hyphenation rules.  This involves:
    *   Identifying hyphens at the end of lines.
    *   Determining if the hyphen represents a true word break (requiring removal and joining of the word parts) or a hyphenated word (requiring the hyphen to be retained).  Use Turkish linguistic rules and context to make this determination. *Prioritize accurately joining words that were split across lines.*  Be conservative; if unsure, it's better to leave a hyphen than to incorrectly join unrelated words.

3.  **Corruption Repair (Turkish-Specific):** This is the most complex step and requires a deep understanding of Turkish orthography and common OCR/scanning errors. Address the following types of corruption:
    *   **Typos:** Correct common Turkish typographical errors, including incorrect characters, transpositions, and omissions. Use a Turkish spellchecker or language model (internally, if possible) to assist, but *prioritize corrections that are highly likely to be accurate*.  Avoid making speculative changes.
    *   **Spacing Errors:**
        *   **Missing Spaces:** Insert spaces between words where they are missing (e.g., "kelimelerarasındaboşluk").
        *   **Extra Spaces:** Remove extraneous spaces within words (e.g., "k e l i m e") or between characters.
        *   **Incorrect Spaces around Punctuation:** Ensure correct spacing around Turkish punctuation marks (periods, commas, question marks, etc.).
    *   **Paragraph Reconstruction:**  Identify and correct incorrect paragraph breaks caused by page endings or scanning artifacts.  Use contextual clues (sentence structure, topic shifts) to determine true paragraph boundaries.  Combine fragments of sentences that were split across lines.
    *   **Character Corruption:** Correct corrupted UTF-8 or other encoding issues. The goal is to correct the text to accurate Turkish spelling.

4.  **Markdown Conversion:** Convert the cleaned and corrected Turkish text to Markdown, adhering to the following rules:
    *   **Headings:**  Identify potential headings based on context and capitalization. Use appropriate Markdown heading levels (`#`, `##`, `###`, etc.).  Be conservative; if unsure, prefer a lower heading level or plain text.
    *   **Lists:**  Identify and format bulleted or numbered lists. Look for common list indicators (e.g., numbers, bullets, dashes).
    *   **Paragraphs:** Separate paragraphs with *two* newline characters (`\n\n`). This is crucial for proper Markdown rendering.
    *   **Other Elements:** If you confidently identify other Markdown elements (e.g., bold, italics, blockquotes), format them appropriately. However, *prioritize accuracy over completeness*.  It's better to have plain text than incorrect Markdown.
    * **Do not add any elements not supported in Markdown**

5. **Output**
    *   **Markdown Only:** Output *only* the resulting Markdown text. Do not include any explanations, comments, or additional information.
    * **No additional changes:** Do not provide additional output.""",
    temperature=0.1,
    response_mime_type="text/plain",
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_UNSPECIFIED, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
    ],
)

In [None]:
summ_content_config = types.GenerateContentConfig(
    system_instruction="""You are a specialist in summarizing personality typing systems, particularly Douglas Forbes' "Human Pin Code" and systems similar to it.  Your task is to create a comprehensive summary of the provided text (which will be pasted below this prompt).  The summary should be in the form of a bulleted list.

**Specific Instructions:**

1.  **Target Audience:** Assume the reader has *some* familiarity with the general concept of personality typing (e.g., Myers-Briggs, Enneagram) but may be new to Forbes' system or similar concepts.
2.  **Focus:** Identify and summarize the *key concepts, principles, and terminology* presented in the text.  Don't get bogged down in minor details; prioritize the core ideas.  If the text describes specific types, profiles, or categories, clearly outline their defining characteristics.
3.  **Language:**  The summary must be written in **Turkish**.
4.  **Format:** Use Markdown for the bulleted list. Each bullet point should be concise but informative.  Use nested bullet points (indentation) to show hierarchical relationships between concepts where appropriate.  For example, if a main concept has several sub-components, list the main concept as a top-level bullet and the sub-components as indented bullets beneath it.
5.  **Comprehensiveness:** While concise, the summary should be comprehensive enough that someone reading it would gain a solid understanding of the main ideas presented in the original text. Avoid overly simplistic or vague summaries.
6.  **Objectivity:** Maintain a neutral and objective tone.  Do not express personal opinions about the validity or usefulness of the system being described.  Present the information as it is presented in the text.
7.  **Terminology:** Pay close attention to any specialized terminology used in the text.  If the Turkish translation of a term isn't immediately obvious, provide the English term in parentheses after the Turkish term the *first* time it appears.  (e.g., "Enerji Tipi (Energy Type)")
8. **Contextualization (if applicable):** If the provided text refers to other personality systems or authors, briefly note these connections in the summary *if they are essential to understanding the main points*.

**Example Structure (Markdown - Turkish):**

```markdown
- **Ana Kavram 1:** Kısa açıklama.
    - Alt Kavram 1.1: Daha detaylı açıklama.
    - Alt Kavram 1.2: Daha detaylı açıklama.
- **Ana Kavram 2:** Kısa açıklama (İngilizce Terim).
    - Alt Kavram 2.1: Daha detaylı açıklama.
- **Ana Kavram 3:** ...
```

Do not provide additional output.""",
    temperature=0.5,
    response_mime_type="text/plain",
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_UNSPECIFIED, threshold=types.HarmBlockThreshold.BLOCK_NONE
        ),
    ],
)

## Stage 1: Partition the book by its chapters and digits

This stage partitions the book "Human Pin Code" into smaller PDF files based on predefined page ranges for each chapter and its associated "pin code" digits. This process is essential for breaking down the large document into manageable, topic-specific chunks, which helps in providing large language models with focused content, thereby reducing the likelihood of generating inaccurate or irrelevant information.

In [None]:
DIGITS_TO_PAGE_RANGE = {
    1: {
        "initial": (59, 61),
        1: (61, 65),
        2: (65, 67),
        3: (67, 69),
        4: (69, 71),
        5: (71, 73),
        6: (73, 76),
        7: (76, 78),
        8: (78, 80),
        9: (80, 83),
    },
    2: {
        "initial": (85, 87),
        1: (87, 89),
        2: (89, 91),
        3: (91, 93),
        4: (93, 96),
        5: (98, 101),
        6: (101, 103),
        7: (103, 105),
        8: (105, 107),
        9: (107, 109),
    },
    3: {
        "initial": (111, 112),
        1: (112, 115),
        2: (115, 117),
        3: (117, 119),
        4: (119, 122),
        5: (122, 124),
        6: (124, 126),
        7: (126, 129),
        8: (129, 131),
        9: (131, 133),
    },
    4: {
        "initial": (135, 136),
        1: (136, 138),
        2: (138, 140),
        3: (140, 142),
        4: (142, 144),
        5: (144, 146),
        6: (146, 148),
        7: (148, 150),
        8: (150, 152),
        9: (152, 155),
    },
    5: {
        "initial": (157, 159),
        1: (159, 161),
        2: (161, 163),
        3: (163, 165),
        4: (165, 167),
        5: (167, 169),
        6: (169, 171),
        7: (171, 173),
        8: (173, 175),
        9: (175, 177),
    },
    6: {
        "initial": (180, 182),
        1: (182, 184),
        2: (184, 186),
        3: (186, 188),
        4: (188, 191),
        5: (191, 193),
        6: (193, 195),
        7: (195, 198),
        8: (198, 201),
        9: (201, 203),
    },
    7: {
        "initial": (204, 205),
        1: (205, 207),
        2: (207, 209),
        3: (209, 211),
        4: (211, 213),
        5: (213, 214),
        6: (214, 216),
        7: (216, 218),
        8: (218, 220),
        9: (220, 222),
    },
    8: {
        "initial": (224, 226),
        1: (226, 228),
        2: (228, 230),
        3: (230, 232),
        4: (232, 234),
        5: (234, 236),
        6: (236, 238),
        7: (238, 240),
        8: (240, 242),
        9: (242, 244),
    },
    9: {
        "initial": (246, 247),
        1: (247, 248),
        2: (248, 249),
        3: (249, 250),
        4: (250, 251),
        5: (251, 252),
        6: (252, 253),
        7: (253, 254),
        8: (254, 255),
        9: (255, 256),
    },
}

In [None]:
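with pymupdf.open(BOOK) as forbes_book:
    for place, digits in DIGITS_TO_PAGE_RANGE.items():
        for digit, (start, end) in digits.items():
            # Each range is treated as half-open: `start` is included, `end` is excluded.
            # `insert_pdf` takes 0-based, inclusive page indices, hence `end - 1`.
            with pymupdf.open() as part_pdf:
                part_pdf.insert_pdf(forbes_book, from_page=start, to_page=end - 1)
                part_pdf.save(f"{PDF_TR}/{place}_{digit}.pdf")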
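
Because each range ends where the next one starts, hand-entered page numbers can be checked for continuity before partitioning. A minimal sketch of such a check, using a hypothetical helper `find_gaps`. A reported gap is not necessarily a mistake, since the skipped pages may simply not belong to either section.

```python
def find_gaps(ranges: dict) -> list[tuple]:
    """Return (previous_key, key, previous_end, start) wherever two consecutive
    half-open ranges do not line up exactly."""
    gaps = []
    items = list(ranges.items())  # dicts preserve insertion order
    for (prev_key, (_, prev_end)), (key, (start, _)) in zip(items, items[1:]):
        if start != prev_end:
            gaps.append((prev_key, key, prev_end, start))
    return gaps


# Chapter 2's ranges, copied from DIGITS_TO_PAGE_RANGE above.
chapter_2 = {
    "initial": (85, 87), 1: (87, 89), 2: (89, 91), 3: (91, 93), 4: (93, 96),
    5: (98, 101), 6: (101, 103), 7: (103, 105), 8: (105, 107), 9: (107, 109),
}
print(find_gaps(chapter_2))  # [(4, 5, 96, 98)]
```

For chapter 2 the check reports that pages 96 and 97 fall between digits 4 and 5, which is worth confirming against the book.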

## Stage 2: Convert PDF partitions to Markdown files

Converting the partitioned PDF files into Markdown format facilitates easier text extraction and processing compared to binary PDF data. This step prepares the content for subsequent analysis or summarization tasks.

In [None]:
def pdf_to_md(pdf_path: str, md_path: str) -> None:
    """Converts a PDF file to Markdown format using the Google Gemini API.

    This function takes a PDF file, uploads it to the Gemini API for processing,
    and converts its content to Markdown format using a specialized configuration.
    The resulting Markdown content is then saved to a new file.

    Args:
        pdf_path: String path to the source PDF file to be converted.
        md_path: String path where the resulting Markdown file should be saved.

    Raises:
        AssertionError: If the uploaded file name or response text is None.
        FileNotFoundError: If the source PDF file doesn't exist or output path is invalid.
        IOError: If there are issues reading the PDF or writing the Markdown file.
        Exception: If the Gemini API request fails or returns an error.

    Note:
        - The function uses the global Gemini client and md_content_config for conversion
        - The uploaded PDF file is automatically deleted from Gemini after conversion
        - The function uses UTF-8 encoding when writing the output file
        - Conversion quality depends on the PDF's text extraction quality

    Example:
        >>> pdf_to_md("chapter1.pdf", "chapter1.md")
        # Converts chapter1.pdf to Markdown and saves as chapter1.md
    """
    file = upload_to_gemini(pdf_path, mime_type="application/pdf")

    assert file.name is not None, "File name should not be None"

    try:
        wait_for_file_active(file)

        response = client.models.generate_content(
            model="gemini-2.5-flash-preview-05-20",
            contents=[file, "Convert this PDF file to Markdown."],
            config=md_content_config,
        )
    finally:
        # Delete the uploaded file even if processing or generation fails.
        client.files.delete(name=file.name)

    assert response.text is not None, "Response text should not be None"

    with open(md_path, "w", encoding="UTF-8") as md_file:
        md_file.write(response.text)

In [None]:
for pdf_path, md_path in [
    (f"{PDF_TR}/{pdf}", f"{MD_TR}/{pdf.split('.')[0]}.md")
    for pdf in sorted(os.listdir(PDF_TR))
    if ".keepdir" not in pdf
]:
    print(f"[PROCESSING] {pdf_path}...")

    try:
        pdf_to_md(pdf_path, md_path)
    except Exception as err:
        print(f"[FAILED] {pdf_path}! Reason: {err}")
    else:
        print(f"[PROCESSED] {pdf_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
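
The loop above logs failures and moves on, so a rerun would convert every PDF again. A minimal sketch of selecting only the partitions that still lack Markdown output. `pending_conversions` is a hypothetical helper, and treating an empty file as missing is an assumption.

```python
import os
import tempfile


def pending_conversions(pdf_dir: str, md_dir: str) -> list[tuple[str, str]]:
    """Return (pdf_path, md_path) pairs whose Markdown output is missing or empty."""
    pending = []
    for name in sorted(os.listdir(pdf_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # skip placeholders such as .keepdir
        md_path = os.path.join(md_dir, f"{stem}.md")
        if not os.path.exists(md_path) or os.path.getsize(md_path) == 0:
            pending.append((os.path.join(pdf_dir, name), md_path))
    return pending


# Demonstrated on temporary directories with empty stand-in files.
with tempfile.TemporaryDirectory() as pdf_dir, tempfile.TemporaryDirectory() as md_dir:
    for name in ("1_1.pdf", "1_2.pdf", ".keepdir"):
        open(os.path.join(pdf_dir, name), "w").close()
    with open(os.path.join(md_dir, "1_1.md"), "w", encoding="UTF-8") as f:
        f.write("# already converted")
    print([os.path.basename(p) for p, _ in pending_conversions(pdf_dir, md_dir)])  # ['1_2.pdf']
```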

## Stage 3: Summarization

This stage focuses on generating concise summaries of the processed content. Due to the potential length of the combined initial chapter and digit-specific text, summarization is necessary to extract key concepts efficiently. The process involves combining the Markdown content from the chapter's introductory section and the specific digit's section. This combined text is then sent to the Google Gemini API, configured to produce a Turkish bulleted list summary highlighting the main ideas related to that digit. The resulting summary is saved to a new Markdown file, providing a condensed, easily digestible overview of the digit's information.

In [None]:
def summarize(initial_path: str, digit_path: str, summarization_path: str) -> None:
    """Generates a summary of Human Pin Code content by combining and processing two related text files.

    Takes the initial chapter content and a specific digit's content from Markdown files,
    combines them, and uses the Google Gemini API to generate a Turkish language summary
    in bullet-point format.

    Args:
        initial_path: Path to the Markdown file containing the chapter's introductory content.
        digit_path: Path to the Markdown file containing the specific digit's content.
        summarization_path: Path where the generated summary will be saved.

    Raises:
        FileNotFoundError: If either initial_path or digit_path doesn't exist.
        IOError: If there are issues reading input files or writing the output file.
        AssertionError: If the Gemini API response text is None.
        Exception: If the Gemini API request fails or returns an error.

    Note:
        - Uses UTF-8 encoding for all file operations
        - Depends on the global client and summ_content_config for API interaction
        - The summary is formatted according to the configuration in summ_content_config
        - Output is saved in Markdown format with Turkish language content

    Example:
        >>> summarize("./tr/MDs/1_initial.md", "./tr/MDs/1_1.md", "./tr/Summarizations/1_1.md")
        # Summarizes initial and digit content and saves as "./tr/Summarizations/1_1.md"
    """
    with (
        open(initial_path, "r", encoding="UTF-8") as initial_f,
        open(digit_path, "r", encoding="UTF-8") as digit_f,
    ):
        content = f"{initial_f.read()}\n\n{digit_f.read()}"

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=content,
        config=summ_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(summarization_path, "w", encoding="UTF-8") as summ_file:
        summ_file.write(response.text)

In [None]:
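for md_path, initial_path, summ_path in [
    (
        f"{MD_TR}/{md}",
        f"{MD_TR}/{md.split('.')[0].split('_')[0]}_initial.md",
        f"{SUMM_TR}/{md.split('.')[0]}.md",
    )
    for md in sorted(os.listdir(MD_TR))
    # Only digit files are summarized; skip "initial" files and placeholders such as .keepdir.
    if md.endswith(".md") and "initial" not in md
]: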
    print(f"[PROCESSING] {md_path}...")

    try:
        summarize(initial_path, md_path, summ_path)
    except Exception as err:
        print(f"[FAILED] {md_path}! Reason: {err}")
    else:
        print(f"[PROCESSED] {md_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
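
Since Stage 2 only logs conversion failures, a chapter's `initial` file may be absent when this loop runs, and every digit of that chapter then fails individually. A minimal sketch of a pre-flight check that lists the affected digit files. `missing_initials` is a hypothetical helper operating on plain file names.

```python
def missing_initials(md_names: list[str]) -> list[str]:
    """Return digit files whose chapter has no matching `<chapter>_initial.md`."""
    names = set(md_names)
    missing = []
    for name in sorted(names):
        if not name.endswith(".md") or name.endswith("_initial.md"):
            continue  # ignore placeholders and the initial files themselves
        chapter = name.split("_")[0]
        if f"{chapter}_initial.md" not in names:
            missing.append(name)
    return missing


# In the notebook this would be called as missing_initials(os.listdir(MD_TR)).
print(missing_initials(["1_initial.md", "1_1.md", "2_1.md", "2_2.md"]))  # ['2_1.md', '2_2.md']
```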