# Data Processing Pipeline for [The Life You Were Born to Live](https://www.peacefulwarrior.com/the-life-you-were-born-to-live/) by *Dan Millman*

This notebook outlines and executes a data processing pipeline for Dan Millman's book, "The Life You Were Born to Live". The primary goal is to extract structured information about the book's "life paths" and their associated traits, challenges, opportunities, etc., and prepare this data for later use.

The pipeline consists of several stages:

1. Partitioning the book's PDF into smaller files based on predefined page ranges.
2. Converting each partition into Markdown format using a large language model (LLM).
3. Extracting key structured data points from the Markdown files and formatting them into JSON objects.
4. Generating concise summaries in English using the extracted data.
5. Translating the full Markdown text, summarized Markdown, and JSON data into Turkish.
6. Extending the translated Turkish JSON data to enrich the dataset with more detailed descriptions.

The notebook utilizes the Google Gemini API for the LLM-based conversion, extraction, summarization, translation, and extension tasks, managing API calls and file handling throughout the process.

In [None]:
import os
import time

from typing import TYPE_CHECKING

import pymupdf
from google import genai
from google.genai import types

if TYPE_CHECKING:
    from io import IOBase
    from os import PathLike
    from pathlib import Path

## Stage 0: Preparations

Before starting the data processing pipeline, we need to initialize necessary constants and set up the Google Gemini API client and configurations.

In [None]:
# Constants
DIR_TR = "./tr"
DIR_EN = "./en"
PDF_TR = f"{DIR_TR}/PDFs"
PDF_EN = f"{DIR_EN}/PDFs"
MD_TR = f"{DIR_TR}/MDs"
MD_EN = f"{DIR_EN}/MDs"
SUMM_TR = f"{DIR_TR}/Summarizations"
SUMM_EN = f"{DIR_EN}/Summarizations"
JSON_TR = f"{DIR_TR}/JSONs"
JSON_EN = f"{DIR_EN}/JSONs"
JSON_TR_EXT = f"{DIR_TR}/JSONs_Extended"
BOOK = "./millman_1995.pdf"

TIMEOUT_IN_SECONDS = 15
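
The later cells write into the directories named by these constants, so the directories have to exist before the pipeline runs. The following is a minimal sketch of creating them up front, assuming the same layout as the constants above; the helper name `ensure_output_dirs` and its `root` parameter are illustrative and not part of the original notebook.

```python
import os


def ensure_output_dirs(root: str = ".") -> list[str]:
    """Create the per-language output directories if they are missing."""
    # Hypothetical helper: mirrors the DIR_EN / DIR_TR layout defined above.
    subdirs = {
        "en": ["PDFs", "MDs", "Summarizations", "JSONs"],
        "tr": ["PDFs", "MDs", "Summarizations", "JSONs", "JSONs_Extended"],
    }
    created = []
    for lang, names in subdirs.items():
        for name in names:
            path = os.path.join(root, lang, name)
            os.makedirs(path, exist_ok=True)  # no error if it already exists
            created.append(path)
    return created
```

Calling `ensure_output_dirs()` once before Stage 1 would make the notebook runnable from a fresh checkout without relying on placeholder files to keep the directories present.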

In [None]:
# Initialize Gemini API client using environment variable
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

In [None]:
def upload_to_gemini(file_path: "str | Path | PathLike[str] | IOBase", mime_type: str | None = None) -> types.File:
    """Upload a file to the Google Gemini API for processing.

    This function takes a file path or file-like object and uploads it to the Gemini API.
    The file can be used as input for various Gemini model tasks like content generation
    or analysis.

    Args:
        file_path: Path to the file or a file-like object to upload. Can be:
            - A string path
            - A pathlib.Path object
            - A PathLike object
            - An IOBase object (file handle)
        mime_type: Optional MIME type of the file. If None, Gemini will attempt to
            detect the type automatically.

    Returns:
        types.File: A Gemini File object representing the uploaded file.
            Contains metadata like uri, name, and display_name.

    Raises:
        UploadError: If the file upload fails
        ValueError: If the file path is invalid or file cannot be read
        TypeError: If the mime_type is invalid

    Example:
        >>> file = upload_to_gemini("document.pdf", mime_type="application/pdf")
        >>> print(f"Uploaded {file.display_name} as {file.uri}")

    See Also:
        - https://ai.google.dev/gemini-api/docs/prompting_with_media
    """
    file = client.files.upload(file=file_path, config=types.UploadFileConfig(mime_type=mime_type))

    print(f"Uploaded file '{file.display_name}' as: {file.uri}")

    return file

In [None]:
def wait_for_file_active(file: types.File) -> None:
    """Waits for a Gemini API file to become active through polling.

    Continuously monitors the processing state of a file uploaded to the Gemini API
    until it becomes active or fails. Uses a simple polling approach with fixed
    intervals defined by TIMEOUT_IN_SECONDS.

    Args:
        file: A Gemini API File object returned from the upload operation. Must contain
            a valid name attribute.

    Raises:
        AssertionError: If file.name or file.state is None
        Exception: If the file processing fails (final state is not "ACTIVE")
        TypeError: If file is not a valid Gemini File object
        ApiError: If there are issues communicating with the Gemini API

    Notes:
        - Prints status updates to console with dots indicating ongoing processing
        - Current implementation uses basic polling - consider more robust approaches
          for production use like exponential backoff or async polling
        - Success/failure is determined solely by the final state being "ACTIVE"

    Example:
        >>> file = client.files.upload(file="document.pdf")
        >>> wait_for_file_active(file)  # Blocks until file is ready
    """
    assert file.name is not None, "File name should not be None"

    print(f"Waiting for processing file {file.name}...")

    file = client.files.get(name=file.name)

    # Check the state before dereferencing it, so a missing state raises the
    # documented AssertionError instead of an AttributeError.
    assert file.state is not None, "File state should not be None"

    while file.state.name == "PROCESSING":
        print(".", end="", flush=True)

        time.sleep(TIMEOUT_IN_SECONDS)

        file = client.files.get(name=file.name)  # type: ignore[reportArgumentType]

        assert file.state is not None, "File state should not be None"

    if file.state.name != "ACTIVE":
        raise Exception(f"File {file.name} failed to process")

    print(f"File {file.name} is ready!")
    print()
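
The docstring above suggests exponential backoff as a more robust alternative to fixed-interval polling. The sketch below shows one way that could look. It is deliberately decoupled from the Gemini client: `wait_until_active`, its parameters, and the injected `get_state` and `sleep` callables are assumptions made for illustration, not part of the notebook or of the Gemini SDK.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    max_wait: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_state() with growing delays until it leaves PROCESSING."""
    delay = initial_delay
    waited = 0.0
    state = get_state()
    while state == "PROCESSING":
        if waited >= max_wait:
            raise TimeoutError(f"Still processing after {waited:.0f} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the delay, up to the cap
        state = get_state()
    if state != "ACTIVE":
        raise RuntimeError(f"Processing ended in state {state!r}")
```

With the notebook's client, `get_state` could be something like `lambda: client.files.get(name=file.name).state.name`. Passing `sleep` as a parameter keeps the function easy to test without real waiting.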

In [None]:
safety_settings = [
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_CIVIC_INTEGRITY, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_UNSPECIFIED, threshold=types.HarmBlockThreshold.BLOCK_NONE
    ),
]

In [None]:
md_content_config = types.GenerateContentConfig(
    system_instruction="""You are a specialized Markdown converter designed to process and repair corrupted Turkish text files. Your task is to take potentially corrupted Turkish text, clean it, and convert it into well-formatted Markdown.  You should perform the following steps in order:

1.  **Text Extraction and Preservation:**  Extract the complete text from the input.  Crucially, preserve the original character order and any unusual characters *even if they appear to be errors*.  Do not attempt to "correct" anything at this stage.

2.  **De-hyphenation (Turkish-Specific):** Carefully de-hyphenate the raw text, paying close attention to Turkish hyphenation rules.  This involves:
    *   Identifying hyphens at the end of lines.
    *   Determining if the hyphen represents a true word break (requiring removal and joining of the word parts) or a hyphenated word (requiring the hyphen to be retained).  Use Turkish linguistic rules and context to make this determination. *Prioritize accurately joining words that were split across lines.*  Be conservative; if unsure, it's better to leave a hyphen than to incorrectly join unrelated words.

3.  **Corruption Repair (Turkish-Specific):** This is the most complex step and requires a deep understanding of Turkish orthography and common OCR/scanning errors. Address the following types of corruption:
    *   **Typos:** Correct common Turkish typographical errors, including incorrect characters, transpositions, and omissions. Use a Turkish spellchecker or language model (internally, if possible) to assist, but *prioritize corrections that are highly likely to be accurate*.  Avoid making speculative changes.
    *   **Spacing Errors:**
        *   **Missing Spaces:** Insert spaces between words where they are missing (e.g., "kelimelerarasındaboşluk").
        *   **Extra Spaces:** Remove extraneous spaces within words (e.g., "k e l i m e") or between characters.
        *   **Incorrect Spaces around Punctuation:** Ensure correct spacing around Turkish punctuation marks (periods, commas, question marks, etc.).
    *   **Paragraph Reconstruction:**  Identify and correct incorrect paragraph breaks caused by page endings or scanning artifacts.  Use contextual clues (sentence structure, topic shifts) to determine true paragraph boundaries.  Combine fragments of sentences that were split across lines.
    *   **Character Corruption:** Correct corrupted UTF-8 or other encoding issues. The goal is to correct the text to accurate Turkish spelling.

4.  **Markdown Conversion:** Convert the cleaned and corrected Turkish text to Markdown, adhering to the following rules:
    *   **Headings:**  Identify potential headings based on context and capitalization. Use appropriate Markdown heading levels (`#`, `##`, `###`, etc.).  Be conservative; if unsure, prefer a lower heading level or plain text.
    *   **Lists:**  Identify and format bulleted or numbered lists. Look for common list indicators (e.g., numbers, bullets, dashes).
    *   **Paragraphs:** Separate paragraphs with *two* newline characters (`\n\n`). This is crucial for proper Markdown rendering.
    *   **Other Elements:** If you confidently identify other Markdown elements (e.g., bold, italics, blockquotes), format them appropriately. However, *prioritize accuracy over completeness*.  It's better to have plain text than incorrect Markdown.
    * **Do not add any elements not supported in Markdown**

5. **Output**
    *   **Markdown Only:** Output *only* the resulting Markdown text. Do not include any explanations, comments, or additional information.
    * **No additional changes:** Do not provide additional output.""",
    temperature=0.1,
    response_mime_type="text/plain",
    safety_settings=safety_settings,
)

In [None]:
json_content_config = types.GenerateContentConfig(
    system_instruction="You are a Markdown to JSON converter. Your task is to parse the provided Markdown input and generate a JSON object representing the key information and structure. Focus on extracting important entities, relationships, and overall document structure. Prioritize accuracy and a logical JSON schema.",
    temperature=0.5,
    response_mime_type="application/json",
    response_schema=types.Schema(
        type=types.Type.OBJECT,
        required=[
            "challenges",
            "famous_people",
            "fulfilling_destiny",
            "health",
            "key_traits",
            "opportunities",
            "relationships",
            "talents_work_finances",
        ],
        properties={
            "challenges": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "famous_people": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "fulfilling_destiny": types.Schema(
                type=types.Type.OBJECT,
                required=["guidelines", "questions"],
                properties={
                    "guidelines": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "questions": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "health": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "key_traits": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "opportunities": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "relationships": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "talents_work_finances": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
        },
    ),
    safety_settings=safety_settings,
)
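
The response schema above asks for four top-level lists of strings and four nested objects whose fields are lists of strings. Since the model's output is later written to disk and reused, it can be worth checking the shape locally as well. The following is a rough sketch using only the standard library; `find_shape_errors` and the two lookup tables are hypothetical names, and the check mirrors the schema rather than replacing it.

```python
import json

# Top-level keys that must be lists of strings.
LIST_KEYS = ["challenges", "famous_people", "key_traits", "opportunities"]
# Top-level keys that must be objects, with the list-of-strings fields each needs.
OBJECT_KEYS = {
    "fulfilling_destiny": ["guidelines", "questions"],
    "health": ["advice", "negative", "positive"],
    "relationships": ["advice", "negative", "positive"],
    "talents_work_finances": ["advice", "negative", "positive"],
}


def is_string_list(value: object) -> bool:
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def find_shape_errors(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key in LIST_KEYS:
        if not is_string_list(data.get(key)):
            errors.append(f"{key}: expected a list of strings")
    for key, fields in OBJECT_KEYS.items():
        section = data.get(key)
        if not isinstance(section, dict):
            errors.append(f"{key}: expected an object")
            continue
        for field in fields:
            if not is_string_list(section.get(field)):
                errors.append(f"{key}.{field}: expected a list of strings")
    return errors
```

A caller could run `find_shape_errors(response.text)` before saving a JSON file and retry or log when the list is not empty.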

In [None]:
summ_content_config = types.GenerateContentConfig(
    system_instruction="""You are a specialist in summarizing personality typing systems, particularly those similar to and including Dan Millman's "The Life You Were Born to Live".  Your task is to create a comprehensive summary of the provided text (which will be pasted below this prompt).  The summary should be in the form of a bulleted list.

**Specific Instructions:**

1.  **Target Audience:** Assume the reader has *some* familiarity with the general concept of personality typing (e.g., Myers-Briggs, Enneagram) but may be new to Millman's system or similar concepts.
2.  **Focus:** Identify and summarize the *key concepts, principles, and terminology* presented in the text.  Don't get bogged down in minor details; prioritize the core ideas.  If the text describes specific types, profiles, or categories, clearly outline their defining characteristics.
3.  **Language:**  The summary must be written in **English**.
4.  **Format:** Use Markdown for the bulleted list. Each bullet point should be concise but informative.  Use nested bullet points (indentation) to show hierarchical relationships between concepts where appropriate.  For example, if a main concept has several sub-components, list the main concept as a top-level bullet and the sub-components as indented bullets beneath it.
5.  **Comprehensiveness:** While concise, the summary should be comprehensive enough that someone reading it would gain a solid understanding of the main ideas presented in the original text. Avoid overly simplistic or vague summaries.
6.  **Objectivity:** Maintain a neutral and objective tone.  Do not express personal opinions about the validity or usefulness of the system being described.  Present the information as it is presented in the text.
7.  **Terminology:** Pay close attention to any specialized terminology used in the text.
8. **Contextualization (if applicable):** If the provided text refers to other personality systems or authors, briefly note these connections in the summary *if they are essential to understanding the main points*.

**Example Structure (Markdown):**

```markdown
- **Main Component 1:** Short description.
    - Sub-component 1.1: More detailed description.
    - Sub-component 1.2: More detailed description.
- **Main Component 2:** Short description.
    - Sub-component 2.1: More detailed description.
- **Main Component 3:** ...
```

Do not provide additional output.""",
    temperature=0.5,
    response_mime_type="text/plain",
    safety_settings=safety_settings,
)

In [None]:
trans_md_content_config = types.GenerateContentConfig(
    system_instruction="You are a highly skilled English-to-Turkish translator with expertise in Markdown. Translate the following English text into idiomatic and accurate Turkish, preserving the original Markdown formatting. Pay close attention to maintaining the tone and style of the original text.",
    response_mime_type="text/plain",
    safety_settings=safety_settings,
)

In [None]:
trans_json_content_config = types.GenerateContentConfig(
    system_instruction="You are a highly skilled English-to-Turkish translator with expertise in JSON. Translate the following English text into idiomatic and accurate Turkish, preserving the original JSON formatting. Pay close attention to maintaining the tone and style of the original text.",
    response_mime_type="application/json",
    response_schema=types.Schema(
        type=types.Type.OBJECT,
        required=[
            "challenges",
            "famous_people",
            "fulfilling_destiny",
            "health",
            "key_traits",
            "opportunities",
            "relationships",
            "talents_work_finances",
        ],
        properties={
            "challenges": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "famous_people": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "fulfilling_destiny": types.Schema(
                type=types.Type.OBJECT,
                required=["guidelines", "questions"],
                properties={
                    "guidelines": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "questions": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "health": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "key_traits": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "opportunities": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "relationships": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "talents_work_finances": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
        },
    ),
    safety_settings=safety_settings,
)

In [None]:
extend_json_content_config = types.GenerateContentConfig(
    system_instruction="""You are a professional summarization and text expansion specialist. Your primary task is to take information provided in a JSON format, and elaborate upon it, creating more detailed and nuanced text in Turkish. You should focus on natural, flowing language, as if a native Turkish speaker were explaining the concepts to a friend or colleague. Avoid overly formal or technical language unless the context specifically requires it (which will be indicated in the JSON if necessary).

**Specific Instructions:**

1.  **Input:** You will receive a JSON object.  The specific structure may vary, but it will always contain keys whose values are *strings* representing sentences or short phrases that need to be expanded.  These are the core ideas you will work with.  There may be additional contextual information within the JSON, also.
2.  **Expansion & Elaboration:** For each of these core strings (sentences/phrases):
    *   **Extend:**  Expand the sentence or phrase into one or more *paragraphs*. The goal is not just to make the text longer, but to add relevant details, examples, implications, or related information.  Think about answering the "who, what, where, when, why, and how" related to the original idea.
    *   **Rewrite for Detail:**  Rephrase the original idea with more descriptive language.  Don't just repeat the same concept; provide greater clarity and depth.  Imagine you're explaining the concept to someone who has very little background knowledge.
3.  **Turkish Language:**  All output *must* be in grammatically correct and natural-sounding Turkish. Use appropriate idioms and expressions where they fit naturally.
4.  **Natural Tone (Default):**  Use a natural, conversational tone. Imagine you are explaining the concepts to a friend or colleague in a relaxed setting.""",
    response_mime_type="application/json",
    response_schema=types.Schema(
        type=types.Type.OBJECT,
        required=[
            "challenges",
            "famous_people",
            "fulfilling_destiny",
            "health",
            "key_traits",
            "opportunities",
            "relationships",
            "talents_work_finances",
        ],
        properties={
            "challenges": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "famous_people": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "fulfilling_destiny": types.Schema(
                type=types.Type.OBJECT,
                required=["guidelines", "questions"],
                properties={
                    "guidelines": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "questions": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "health": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "key_traits": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "opportunities": types.Schema(
                type=types.Type.ARRAY,
                items=types.Schema(
                    type=types.Type.STRING,
                ),
            ),
            "relationships": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
            "talents_work_finances": types.Schema(
                type=types.Type.OBJECT,
                required=["advice", "negative", "positive"],
                properties={
                    "advice": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "negative": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                    "positive": types.Schema(
                        type=types.Type.ARRAY,
                        items=types.Schema(
                            type=types.Type.STRING,
                        ),
                    ),
                },
            ),
        },
    ),
    safety_settings=safety_settings,
)

## Stage 1: Partitioning the book by its chapters and life paths

"The Life You Were Born to Live" groups life paths into chapters, and each life path occupies a specific page range. We partition the book along these ranges to isolate the content for each life path. Giving the LLM only the relevant pages should reduce the likelihood of inaccurate or irrelevant output.

In [None]:
PAGE_RANGE_TO_LIFE_PATHS = {
    (130, 136): [(19, 10)],
    (136, 142): [(28, 10)],
    (142, 148): [(37, 10)],
    (148, 153): [(46, 10)],
    (154, 161): [(29, 11)],
    (161, 167): [(38, 11)],
    (167, 173): [(47, 11)],
    (174, 180): [(20, 2)],
    (181, 188): [(39, 12)],
    (188, 194): [(48, 12)],
    (195, 202): [(30, 3)],
    (202, 207): [(21, 3), (12, 3)],
    (209, 215): [(40, 4)],
    (215, 221): [(22, 4)],
    (221, 228): [(31, 4), (13, 4)],
    (229, 236): [(32, 5), (23, 5)],
    (236, 242): [(41, 5), (14, 5)],
    (243, 249): [(15, 6)],
    (249, 256): [(24, 6), (42, 6)],
    (256, 263): [(33, 6)],
    (264, 269): [(16, 7)],
    (269, 276): [(25, 7)],
    (276, 282): [(34, 7), (43, 7)],
    (283, 289): [(17, 8)],
    (289, 296): [(26, 8)],
    (296, 303): [(35, 8)],
    (303, 309): [(44, 8)],
    (310, 316): [(18, 9)],
    (316, 323): [(27, 9)],
    (323, 331): [(36, 9)],
    (331, 338): [(45, 9)],
}
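
A hand-typed table like this is easy to get wrong. In every entry above, the second number of a life path equals the digit sum of the first (for example 1 + 9 = 10), and the page ranges appear in ascending order. The sketch below checks both properties on an excerpt of the mapping; `digit_sum` and `check_mapping` are hypothetical helpers, and the rules they encode are read off the table rather than taken from the book.

```python
def digit_sum(number: int) -> int:
    return sum(int(digit) for digit in str(number))


def check_mapping(mapping: dict[tuple[int, int], list[tuple[int, int]]]) -> list[str]:
    """Return a list of problems found in a page-range mapping."""
    problems = []
    previous_end = None
    for (start, end), life_paths in mapping.items():
        if start >= end:
            problems.append(f"({start}, {end}): start is not before end")
        if previous_end is not None and start < previous_end:
            problems.append(f"({start}, {end}): starts before the previous range ends")
        previous_end = end
        for first, second in life_paths:
            if digit_sum(first) != second:
                problems.append(f"{first}/{second}: digits do not add up to {second}")
    return problems


# Excerpt of PAGE_RANGE_TO_LIFE_PATHS from the cell above.
excerpt = {
    (130, 136): [(19, 10)],
    (136, 142): [(28, 10)],
    (202, 207): [(21, 3), (12, 3)],
    (209, 215): [(40, 4)],
}
print(check_mapping(excerpt))
```

Running `check_mapping(PAGE_RANGE_TO_LIFE_PATHS)` on the full table would flag a mistyped page number or life path before any PDF is written.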

In [None]:
with pymupdf.open(BOOK) as millman_book:
    for (start, end), life_paths in PAGE_RANGE_TO_LIFE_PATHS.items():
        for life_path in life_paths:
            with pymupdf.open() as part_pdf:
                part_pdf.insert_pdf(millman_book, from_page=start, to_page=end)
                part_pdf.save(f"{PDF_EN}/{life_path[0]}_{life_path[1]}.pdf")
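
When a page range lists two life paths, the loop above saves the same pages twice under different file names. The sketch below previews which files the loop would write, without touching any PDF; `planned_partitions` is a hypothetical helper and the output directory is passed in as a plain string.

```python
import os


def planned_partitions(mapping, out_dir):
    """Yield (start, end, output_path) for every PDF the partitioning loop would write."""
    for (start, end), life_paths in mapping.items():
        for first, second in life_paths:
            yield start, end, os.path.join(out_dir, f"{first}_{second}.pdf")


# Excerpt of PAGE_RANGE_TO_LIFE_PATHS: the second range is shared by two life paths.
excerpt = {
    (195, 202): [(30, 3)],
    (202, 207): [(21, 3), (12, 3)],
}
for start, end, path in planned_partitions(excerpt, "./en/PDFs"):
    print(f"pages {start}-{end} -> {path}")
```

Note that PyMuPDF's `insert_pdf` takes zero-based page numbers and treats `to_page` as inclusive, so adjacent ranges that share a boundary value (such as 202 here) both include that page. Whether the numbers in the table correspond to printed page numbers or PDF page indices is not stated in the notebook.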

## Stage 2: Convert PDF partitions to Markdown files

Converting the partitioned PDF files into Markdown format facilitates easier text extraction and processing compared to binary PDF data. This step prepares the content for subsequent analysis or summarization tasks.

In [None]:
def pdf_to_md(pdf_path: str, md_path: str) -> None:
    """Converts a PDF file to Markdown format using the Google Gemini API.

    This function takes a PDF file, uploads it to the Gemini API for processing,
    and converts its content to Markdown format using a specialized configuration.
    The resulting Markdown content is then saved to a new file.

    Args:
        pdf_path: String path to the source PDF file to be converted.
        md_path: String path where the resulting Markdown file should be saved.

    Raises:
        AssertionError: If the uploaded file name or response text is None.
        FileNotFoundError: If the source PDF file doesn't exist or output path is invalid.
        IOError: If there are issues reading the PDF or writing the Markdown file.
        Exception: If the Gemini API request fails or returns an error.

    Note:
        - The function uses the global Gemini client and md_content_config for conversion
        - The uploaded PDF file is automatically deleted from Gemini after conversion
        - The function uses UTF-8 encoding when writing the output file
        - Conversion quality depends on the PDF's text extraction quality

    Example:
        >>> pdf_to_md("chapter1.pdf", "chapter1.md")
        # Converts chapter1.pdf to Markdown and saves as chapter1.md
    """
    file = upload_to_gemini(pdf_path, mime_type="application/pdf")
    wait_for_file_active(file)

    assert file.name is not None, "File name should not be None"

    try:
        response = client.models.generate_content(
            model="gemini-2.5-flash-preview-05-20",
            contents=[file, "Convert this PDF file to Markdown."],
            config=md_content_config,
        )
    finally:
        # Remove the uploaded file even if the conversion request fails.
        client.files.delete(name=file.name)

    assert response.text is not None, "Response text should not be None"

    with open(md_path, "w", encoding="UTF-8") as md_file:
        md_file.write(response.text)

In [None]:
for pdf_path, md_path in [
    (f"{PDF_EN}/{pdf}", f"{MD_EN}/{pdf.split('.')[0]}.md")
    for pdf in sorted(os.listdir(PDF_EN))
    if ".keepdir" not in pdf
]:
    print(f"[PROCESSING] {pdf_path}...")

    try:
        pdf_to_md(pdf_path, md_path)
    except Exception as err:
        print(f"[FAILED] {pdf_path}! Reason: {err}")
    else:
        print(f"[PROCESSED] {pdf_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
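
Each stage pairs every input file with an output file of the same base name, and skips placeholder files such as `.keepdir`. A minimal sketch of that pairing using `pathlib` is shown here; `paired_paths` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def paired_paths(src_dir: str, dst_dir: str, dst_suffix: str) -> list[tuple[str, str]]:
    """Pair every visible file in src_dir with a same-named target in dst_dir."""
    pairs = []
    for src in sorted(Path(src_dir).iterdir()):
        # Skip placeholders such as .keepdir and anything that is not a file.
        if src.name.startswith(".") or not src.is_file():
            continue
        # Path.stem drops only the final extension, so "19_10.pdf" becomes "19_10".
        target = Path(dst_dir) / f"{src.stem}{dst_suffix}"
        pairs.append((str(src), str(target)))
    return pairs
```

Under these assumptions, `paired_paths(PDF_EN, MD_EN, ".md")` would yield the same pairs as the list comprehension above, in sorted order.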

## Stage 3: Extracting useful structured data using JSON objects

This stage extracts key structured data points from the Markdown files and formats them as JSON objects. The structured output supports the summarization stage and is later translated alongside the text.

In [None]:
def md_to_json(md_path: str, json_path: str) -> None:
    """Converts a Markdown file to a structured JSON format using the Gemini API.

    This function takes a Markdown file, processes its content through the Google Gemini
    API to extract structured information, and saves the result as a JSON file. The
    conversion follows a predefined schema specified in json_content_config that includes
    sections for challenges, famous people, health details, relationships, etc.

    Args:
        md_path: String path to the source Markdown file. Must contain content compatible
            with the expected structure (life path descriptions from Dan Millman's book).
            File must be UTF-8 encoded.
        json_path: String path where the output JSON file should be saved. Will be
            created if it doesn't exist, or overwritten if it does. Parent directory
            must exist.

    Raises:
        FileNotFoundError: If md_path doesn't exist or json_path's parent dir missing
        PermissionError: If lacking read access to md_path or write access to json_path
        UnicodeError: If the Markdown file is not properly UTF-8 encoded
        AssertionError: If the Gemini API response does not contain text
        APIError: If there are issues communicating with the Gemini API
        Exception: If JSON generation fails or response is not in expected format

    Notes:
        - Uses global client and json_content_config for API communication
        - Expects the Markdown content to follow Dan Millman's life path format
        - Generated JSON strictly follows schema defined in json_content_config
        - Both input and output use UTF-8 encoding

    Example:
        >>> # Convert a life path description from MD to JSON
        >>> md_to_json("./en/MDs/19_10.md", "./en/JSONs/19_10.json")
    """
    with open(md_path, "r", encoding="UTF-8") as md_file:
        md_content = md_file.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=md_content,
        config=json_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(json_path, "w", encoding="UTF-8") as json_file:
        json_file.write(response.text)

In [None]:
for md_path, json_path in [
    (f"{MD_EN}/{md}", f"{JSON_EN}/{md.split('.')[0]}.json")
    for md in sorted(os.listdir(MD_EN))
    if ".keepdir" not in md
]:
    print(f"[PROCESSING] {md_path}...")

    try:
        md_to_json(md_path, json_path)
    except Exception as err:
        print(f"[FAILED] {md_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {md_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
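
`md_to_json` writes the model's response to disk without parsing it, so a malformed response only surfaces later. The sketch below shows one way to check that a response is parseable JSON before saving. It assumes the model may occasionally wrap its output in a Markdown code fence; `parse_model_json` is a hypothetical helper, not part of the original notebook.

```python
import json
from typing import Any


def parse_model_json(text: str) -> Any:
    """Parse model output as JSON, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (for example, the one tagged json).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence, if present.
        cleaned = cleaned.rsplit("```", 1)[0]
    # json.loads raises json.JSONDecodeError on malformed input.
    return json.loads(cleaned)
```

A caller could run this on `response.text` and treat a `json.JSONDecodeError` as a failed conversion for that file.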

## Stage 4: Summarization

The full extracted text is too extensive for direct use. This stage generates concise summaries to capture the essential information from each life path description.

In [None]:
def summarize(md_path: str, json_path: str, summarization_path: str) -> None:
    """Generates a structured summary from Markdown and JSON content using the Gemini API.

    Takes input Markdown and JSON files containing content about personality types and life paths
    from Dan Millman's system, processes them through Google's Gemini API to generate a concise,
    bulleted summary, and saves the result to a new file.

    The function reads both source files, combines their content for context, and uses a predefined
    summarization prompt to generate a structured summary focusing on key concepts and principles.

    Args:
        md_path: Path to the source Markdown file containing the detailed content to summarize.
        json_path: Path to a JSON file containing structured data related to the Markdown content.
        summarization_path: Output path where the generated summary will be saved.

    Raises:
        FileNotFoundError: If any of the input files don't exist or output path is invalid.
        AssertionError: If the Gemini API response is empty or invalid.
        IOError: If there are issues reading input files or writing the output file.
        Exception: If the Gemini API request fails or returns an error.

    Example:
        >>> # Generate a summary for Life Path 19/10
        >>> summarize(
        ...     md_path="./en/MDs/19_10.md",
        ...     json_path="./en/JSONs/19_10.json",
        ...     summarization_path="./en/Summarizations/19_10.md",
        ... )
        # Creates a bulleted summary in 19_10.md containing key concepts and traits
        # from both the Markdown and JSON input files.

    Note:
        - Uses the global client and summ_content_config variables
        - Input files must be UTF-8 encoded
        - The JSON content is wrapped in Markdown code fence syntax for proper formatting
        - Generated summaries follow a standardized bulleted list format in Markdown
    """
    with open(md_path, "r", encoding="UTF-8") as md_file:
        md_content = md_file.read()

    with open(json_path, "r", encoding="UTF-8") as json_file:
        # Put the closing fence on its own line so the block is well-formed.
        json_content = f"```json\n{json_file.read()}\n```"

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=[md_content, json_content],
        config=summ_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(summarization_path, "w", encoding="UTF-8") as summ_file:
        summ_file.write(response.text)

In [None]:
for md_path, json_path, summ_path in [
    (
        f"{MD_EN}/{md}",
        f"{JSON_EN}/{md.split('.')[-2]}.json",
        f"{SUMM_EN}/{md.split('.')[-2]}.md",
    )
    for md in sorted(os.listdir(MD_EN))
    if ".keepdir" not in md
]:
    print(f"[PROCESSING] {md_path}...")

    try:
        summarize(md_path, json_path, summ_path)
    except Exception as err:
        print(f"[FAILED] {md_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {md_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
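
The summarization and extension steps pass JSON to the model wrapped in a Markdown code fence. A small sketch of that wrapping follows; `fenced` and `FENCE` are hypothetical names, and the helper only illustrates the string layout with the closing fence on its own line.

```python
FENCE = "`" * 3  # built indirectly so this snippet can sit inside a code fence


def fenced(content: str, lang: str = "json") -> str:
    """Wrap content in a Markdown code fence with the closing fence on its own line."""
    # rstrip() avoids a blank line between the content and the closing fence.
    return f"{FENCE}{lang}\n{content.rstrip()}\n{FENCE}"


wrapped = fenced('{"life_path": "19/10"}\n')
print(wrapped)
```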

## Stage 5: Translation

To cater to a Turkish audience, all processed data, including the full Markdown text, summarized Markdown, and structured JSON, is translated into Turkish.

In [None]:
def translate_md(en_path: str, tr_path: str) -> None:
    """Translates an English Markdown file to Turkish using the Gemini API.

    Uses the configured Gemini model to translate Markdown content while preserving
    the original formatting, headers, lists, and other Markdown elements. The
    translation aims to be idiomatic and natural-sounding in Turkish.

    Args:
        en_path: Path to the source English Markdown file. Must be UTF-8 encoded.
        tr_path: Path where the translated Turkish Markdown file will be saved.
            Will be created if it doesn't exist, or overwritten if it does.

    Raises:
        FileNotFoundError: If en_path doesn't exist or tr_path's parent dir missing
        PermissionError: If lacking read/write permissions for either path
        UnicodeError: If the source file is not properly UTF-8 encoded
        AssertionError: If the Gemini API response returns no text
        Exception: If the translation request fails

    Example:
        >>> # Translate a life path description from English to Turkish
        >>> translate_md("./en/MDs/19_10.md", "./tr/MDs/19_10.md")
        # This will translate the English MD file for Life Path 19/10
        # and save it as a Turkish version in the tr/MDs directory

    Note:
        - Uses the global client and trans_md_content_config for API settings
        - Maintains Markdown formatting from the source file
        - Both input and output files use UTF-8 encoding
    """
    with open(en_path, "r", encoding="UTF-8") as md_en_file:
        md_en_content = md_en_file.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=md_en_content,
        config=trans_md_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(tr_path, "w", encoding="UTF-8") as trans_md_file:
        trans_md_file.write(response.text)

In [None]:
def translate_json(en_path: str, tr_path: str) -> None:
    """Translates a JSON file from English to Turkish using the Gemini API.

    Takes an English JSON file containing personality profile data, processes it through
    the Google Gemini model with translation-specific configuration, and saves the Turkish
    translation to a new file. Maintains the original JSON structure while translating
    all string values.

    Args:
        en_path: Path to the source English JSON file. Must be UTF-8 encoded and contain
            valid JSON data following the expected schema (with challenges, traits,
            relationships, etc.).
        tr_path: Path where the translated Turkish JSON file will be saved. Parent
            directory must exist. Will be created if it doesn't exist or overwritten if it does.

    Raises:
        FileNotFoundError: If en_path doesn't exist or tr_path's parent dir is missing.
        AssertionError: If the Gemini API response contains no text.
        APIError: If there are issues with the Gemini API request/response.
        IOError: If there are file read/write permission issues.

    Examples:
        Translate a single life path personality profile:
        >>> translate_json("./en/JSONs/19_10.json", "./tr/JSONs/19_10.json")

        Process multiple profiles with error handling:
        >>> for json_file in os.listdir("./en/JSONs"):
        ...     try:
        ...         translate_json(f"./en/JSONs/{json_file}", f"./tr/JSONs/{json_file}")
        ...     except Exception as e:
        ...         print(f"Failed to translate {json_file}: {e}")

    Note:
        - Uses the global `client` for API communication and `trans_json_content_config`
          for translation settings
        - Both input and output files must use UTF-8 encoding
        - Translation preserves the exact JSON schema/structure of the original
        - Response validation ensures non-empty translation before saving
    """
    with open(en_path, "r", encoding="UTF-8") as json_en_file:
        json_en_content = json_en_file.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=json_en_content,
        config=trans_json_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(tr_path, "w", encoding="UTF-8") as trans_json_file:
        trans_json_file.write(response.text)

### Stage 5.1: Translating full text Markdown

In [None]:
for en_path, tr_path in [
    (f"{MD_EN}/{md}", f"{MD_TR}/{md}") for md in sorted(os.listdir(MD_EN)) if ".keepdir" not in md
]:
    print(f"[PROCESSING] {en_path}...")

    try:
        translate_md(en_path, tr_path)
    except Exception as err:
        print(f"[FAILED] {en_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {en_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)

### Stage 5.2: Translating summarized Markdown

In [None]:
for en_path, tr_path in [
    (f"{SUMM_EN}/{md}", f"{SUMM_TR}/{md}") for md in sorted(os.listdir(SUMM_EN)) if ".keepdir" not in md
]:
    print(f"[PROCESSING] {en_path}...")

    try:
        translate_md(en_path, tr_path)
    except Exception as err:
        print(f"[FAILED] {en_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {en_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)

### Stage 5.3: Translating JSON

In [None]:
for en_path, tr_path in [
    (f"{JSON_EN}/{name}", f"{JSON_TR}/{name}") for name in sorted(os.listdir(JSON_EN)) if ".keepdir" not in name
]:
    print(f"[PROCESSING] {en_path}...")

    try:
        translate_json(en_path, tr_path)
    except Exception as err:
        print(f"[FAILED] {en_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {en_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
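
`translate_json` is meant to preserve the JSON structure while translating only string values. One way to check that after the fact is to compare the shape of the English and Turkish objects. This is a sketch under that assumption; `same_structure` and the field names in the sample are hypothetical and do not come from the notebook's schema.

```python
def same_structure(left, right) -> bool:
    """Return True if two parsed JSON values share keys, list lengths, and leaf types."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Keys are expected to stay untranslated, so they must match exactly.
        return left.keys() == right.keys() and all(
            same_structure(left[key], right[key]) for key in left
        )
    if isinstance(left, list) and isinstance(right, list):
        return len(left) == len(right) and all(
            same_structure(a, b) for a, b in zip(left, right)
        )
    # Leaves only need to agree on type; their text is expected to differ.
    return type(left) is type(right)


# Hypothetical sample data for illustration only.
english = {"challenges": ["insecurity"], "people": [{"name": "A. Person"}]}
turkish = {"challenges": ["güvensizlik"], "people": [{"name": "A. Person"}]}
print(same_structure(english, turkish))
```

In practice both files would first be loaded with `json.load` and then compared.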

## Stage 6: Extending JSON

To enrich the dataset, the LLM elaborates on the translated Turkish JSON data. Each field is expanded into a more detailed description while the original JSON structure is preserved, which makes the dataset more useful for later applications.

In [None]:
def extend_json(json_path: str, extended_json_path: str) -> None:
    """Takes a Turkish JSON file and expands its content with more detailed descriptions.

    This function processes a JSON file containing personality profile data, enhances
    and elaborates on each field's content using the Google Gemini API, and saves
    the expanded version to a new file. The expansion maintains the original JSON
    structure while providing richer, more detailed descriptions in Turkish.

    Args:
        json_path: Path to the source JSON file containing the original content.
            Must be a valid JSON file following the expected schema (with challenges,
            traits, relationships, etc.) and UTF-8 encoded.
        extended_json_path: Path where the expanded JSON file will be saved. Parent
            directory must exist. Will be created if it doesn't exist or overwritten
            if it does.

    Raises:
        FileNotFoundError: If json_path doesn't exist or extended_json_path's parent
            directory is missing.
        AssertionError: If the Gemini API response contains no text.
        APIError: If there are issues with the Gemini API request/response.
        IOError: If there are file read/write permission issues.

    Notes:
        - Uses the global `client` variable for API communication and
          `extend_json_content_config` for content generation settings.
        - Both input and output files must use UTF-8 encoding.
        - The expansion preserves the exact JSON schema/structure of the original.
        - Content is expanded using natural, conversational Turkish language.
        - Response validation ensures non-empty content before saving.

    Example:
        >>> # Extend a life path personality profile with richer descriptions
        >>> extend_json(json_path="./tr/JSONs/19_10.json", extended_json_path="./tr/JSONs_Extended/19_10.json")
    """
    with open(json_path, "r", encoding="UTF-8") as json_file:
        json_content = json_file.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=f"```json\n{json_content}\n```",
        config=extend_json_content_config,
    )

    assert response.text is not None, "Response text should not be None"

    with open(extended_json_path, "w", encoding="UTF-8") as extend_json_file:
        extend_json_file.write(response.text)

In [None]:
for json_path, json_extend_path in [
    (f"{JSON_TR}/{name}", f"{JSON_TR_EXT}/{name}") for name in sorted(os.listdir(JSON_TR))
]:
    print(f"[PROCESSING] {json_path}...")

    try:
        extend_json(json_path, json_extend_path)
    except Exception as err:
        print(f"[FAILED] {json_path}! Reason: {err}.")
    else:
        print(f"[PROCESSED] {json_path}!")

    time.sleep(TIMEOUT_IN_SECONDS)
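
Because each loop only prints failures, it can help to list which life paths still lack an output file after a stage finishes. The sketch below derives the expected base names from a mapping shaped like `PAGE_RANGE_TO_LIFE_PATHS` and compares them with a directory listing. `expected_stems` and `missing_outputs` are hypothetical helpers, not part of the original notebook.

```python
import os


def expected_stems(page_ranges: dict[tuple[int, int], list[tuple[int, int]]]) -> list[str]:
    """Build base file names such as "19_10" from the page-range mapping."""
    return [f"{a}_{b}" for life_paths in page_ranges.values() for a, b in life_paths]


def missing_outputs(stems: list[str], out_dir: str, suffix: str) -> list[str]:
    """Return the stems that have no matching file in out_dir."""
    present = set(os.listdir(out_dir))
    return [stem for stem in stems if f"{stem}{suffix}" not in present]
```

For example, `missing_outputs(expected_stems(PAGE_RANGE_TO_LIFE_PATHS), JSON_TR_EXT, ".json")` would list the life paths whose extended JSON still needs to be generated.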