# Notebook 01: GenAI Extraction and Document Chunking

This notebook covers the first stage of the end-to-end GenAI insurance document pipeline. The goal is to take a collection of long, unstructured documents and convert them into structured JSON outputs that can be analysed in later notebooks.

The raw dataset includes three document types, stored in the `data/raw` folder:

- Insurance policy wording-style documents
- ESG-style sustainability reports
- Incident and loss investigation summaries

Each document is synthetic, created through paraphrasing or original generation so that it mimics the style, tone, and complexity of real insurance documents without raising copyright issues. The documents are intentionally messy, unstructured, and varied in length to reflect the challenges seen in real underwriting and claims workflows.

Insurance-related documents are often long and difficult to process. Important details such as entities, regions, risk factors, exclusions, operational failures, and ESG concerns are spread across multiple paragraphs, and manual extraction is slow and inconsistent. This notebook focuses on building a reproducible GenAI workflow that extracts these details as structured fields.

In this notebook we will:
1. Load the raw documents from `data/raw`
2. Split each document into smaller chunks so that each extraction call works on a manageable amount of text
3. Define a consistent JSON schema for the key fields we want to extract
4. Build a clear prompt template that enforces structure and discourages hallucination
5. Run the extraction process using helper functions in `src`
6. Apply a simple MapReduce-style pattern by merging chunk outputs
7. Validate and save the final JSON outputs into `data/processed`

The goal is to produce clean, validated, and consistent structured data. These outputs will then be used in Notebook 02 for EDA and in Notebook 03 for feature engineering and classification. This mirrors practical workflows in insurtech companies, where large volumes of unstructured documents must be prepared and structured before modelling or decision support.
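
As a rough illustration of the merge step (step 6), the sketch below combines per-chunk outputs into one document-level record. The field names (`entities`, `regions`, `risk_factors`, `exclusions`), the `merge_chunk_outputs` helper, and the sample values are all assumptions made for illustration, loosely based on the details listed earlier; the schema and the helpers in `src` used by this notebook may differ.

```python
from typing import Dict, List

# Hypothetical list-valued fields; the notebook's real schema may differ.
LIST_FIELDS = ["entities", "regions", "risk_factors", "exclusions"]


def merge_chunk_outputs(
    chunk_outputs: List[Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Union list-valued fields across chunks, keeping first-seen order."""
    merged: Dict[str, List[str]] = {field: [] for field in LIST_FIELDS}
    seen = {field: set() for field in LIST_FIELDS}

    for output in chunk_outputs:
        for field in LIST_FIELDS:
            for value in output.get(field, []):
                key = value.strip().lower()  # case-insensitive de-duplication
                if key and key not in seen[field]:
                    seen[field].add(key)
                    merged[field].append(value.strip())
    return merged


# Made-up chunk outputs, as if returned by two extraction calls
chunk_outputs = [
    {"entities": ["Acme Shipping"], "regions": ["North Sea"]},
    {"entities": ["acme shipping", "Harbour Authority"],
     "risk_factors": ["engine failure"]},
]
merged = merge_chunk_outputs(chunk_outputs)
print(merged)
```

De-duplication matters here because overlapping chunks can report the same item more than once.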


## Step 1: Set up imports and locate the raw documents

Before we can do any chunking or GenAI extraction, we need a reliable way to point the notebook at the `data/raw` folder and see exactly which files we are working with.

In this step we will:

- Import a small set of core Python utilities  
- Define a `PROJECT_ROOT` and `DATA_RAW_DIR` using `pathlib.Path` so paths are robust  
- List all text files in the three raw subfolders: `policy`, `esg`, and `incident`
- Store these file paths in a simple dictionary for later steps




In [1]:
from pathlib import Path

# 1) Define project root and raw data directory
PROJECT_ROOT = Path("..").resolve()
DATA_RAW_DIR = PROJECT_ROOT / "data" / "raw"

print(f"Project root: {PROJECT_ROOT}")
print(f"Raw data directory: {DATA_RAW_DIR}")

# 2) Collect documents by category (one subfolder each: policy, esg, incident)
documents_by_category = {}

for category_dir in sorted(DATA_RAW_DIR.iterdir()):
    if category_dir.is_dir():
        txt_files = sorted(category_dir.glob("*.txt"))
        documents_by_category[category_dir.name] = txt_files

# 3) Print a short summary of what we found
print("\nDiscovered raw documents:")
for category, paths in documents_by_category.items():
    print(f"- {category}: {len(paths)} files")
    for path in paths:
        print(f"  • {path.name}")

# 4) Optional: a flat list of all documents (for later steps if needed)
all_documents = [p for paths in documents_by_category.values() for p in paths]
print(f"\nTotal number of documents: {len(all_documents)}")


Project root: C:\Users\misha\OneDrive - University of Bristol\Job Apps\Concirrus\genai-insurance-risk-extraction
Raw data directory: C:\Users\misha\OneDrive - University of Bristol\Job Apps\Concirrus\genai-insurance-risk-extraction\data\raw

Discovered raw documents:
- esg: 3 files
  • esg_corporate_sustainability.txt
  • esg_energy_transition.txt
  • esg_supply_chain_governance.txt
- incident: 3 files
  • incident_marine_grounding.txt
  • incident_motor_fleet_collision.txt
  • incident_property_fire.txt
- policy: 8 files
  • auto_insurance_policy_synthetic.txt
  • businessowners_insurance_synthetic_01.txt
  • cyber_insurance_policy_synthetic.txt
  • group_life_policy_practice_unstructured_v1.txt
  • homeowners_declarations_synthetic.txt
  • homeowners_policy_ho3_synthetic.txt
  • travel_insurance_policy_synthetic_allianz.txt
  • travel_insurance_policy_synthetic_CHI.txt

Total number of documents: 14


## Step 2: Load document contents and create a simple overview

Now that we know where the files are and how many we have in each category, the next step is to actually **load the text contents** into memory and build a simple summary table.

In this step we will:

- Read each `.txt` file into a Python string  
- Capture basic metadata for each document  
  - category (`policy`, `esg`, or `incident`)
  - filename  
  - full filesystem path  
  - character count  
  - word count  
  - a short text preview  
- Store everything in a pandas `DataFrame` so we can inspect the documents in a structured way


This overview will help us:

- Decide sensible chunk sizes later  
- Check that the documents have realistic lengths  
- Quickly spot any weird or empty files before we start chunking and extraction


In [2]:
import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

# 1) Load contents of each document into memory and collect basic stats
records = []

for category, paths in documents_by_category.items():
    for path in paths:
        # Read the full text of the file
        text = path.read_text(encoding="utf-8")

        # Build a short, single line preview for quick inspection
        preview = text[:400].replace("\n", " ").strip()

        # Append a record (one row) for this document
        records.append(
            {
                "category": category,
                "filename": path.name,
                "path": path,
                "n_chars": len(text),
                "n_words": len(text.split()),
                "preview": preview,
            }
        )

# 2) Create a DataFrame with one row per document
docs_df = pd.DataFrame(records)

# 3) Show a compact summary
print("Document overview:")
display(
    docs_df[["category", "filename", "n_words", "n_chars", "preview"]]
)


Document overview:


| | category | filename | n_words | n_chars | preview |
|---|---|---|---|---|---|
| 0 | esg | esg_corporate_sustainability.txt | 714 | 4958 | Synthetic ESG Report – Corporate Sustainabilit... |
| 1 | esg | esg_energy_transition.txt | 847 | 5874 | Synthetic ESG Report – Energy Transition and E... |
| 2 | esg | esg_supply_chain_governance.txt | 823 | 5755 | Synthetic ESG Report – Supply Chain and Climat... |
| 3 | incident | incident_marine_grounding.txt | 778 | 4941 | Synthetic Marine Incident Report – Engine Fail... |
| 4 | incident | incident_motor_fleet_collision.txt | 788 | 5051 | Synthetic Incident Report – Motor Fleet Collis... |
| 5 | incident | incident_property_fire.txt | 789 | 5086 | Synthetic Incident Report – Commercial Propert... |
| 6 | policy | auto_insurance_policy_synthetic.txt | 1025 | 6981 | Synthetic Auto Insurance Practice Document (f... |
| 7 | policy | businessowners_insurance_synthetic_01.txt | 1896 | 12083 | Synthetic Businessowners Insurance Practice Do... |
| 8 | policy | cyber_insurance_policy_synthetic.txt | 1089 | 7425 | Synthetic Cyber Insurance Practice Document (... |
| 9 | policy | group_life_policy_practice_unstructured_v1.txt | 1376 | 8881 | Synthetic Group Life Insurance Practice Docume... |
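
One way to act on this overview is to flag documents that are empty or unusually short before chunking. The sketch below works on plain dictionaries shaped like the `records` built in the cell above. The `flag_suspect_documents` helper, the `min_words` threshold of 100, and the empty example file are assumptions for illustration, not part of the original notebook.

```python
from typing import Dict, List


def flag_suspect_documents(records: List[Dict], min_words: int = 100) -> List[str]:
    """Return filenames of documents with fewer than min_words words."""
    return [r["filename"] for r in records if r["n_words"] < min_words]


# Two rows copied from the overview, plus one made-up empty file
sample_records = [
    {"filename": "esg_corporate_sustainability.txt", "n_words": 714},
    {"filename": "incident_property_fire.txt", "n_words": 789},
    {"filename": "empty_example.txt", "n_words": 0},  # hypothetical file
]
print(flag_suspect_documents(sample_records))
```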


## Step 3: Define a text chunking helper function

The next step is to define a consistent strategy for splitting each document into smaller chunks that can be sent to the LLM.

In this project we will:

- Chunk by **word count** rather than characters so that chunk sizes are easier to interpret.
- Use fixed-size windows with a **small overlap** so that sentences falling on a chunk boundary appear intact in at least one chunk.
- Keep the function parameters flexible so that chunk sizes can be tuned later without changing the rest of the pipeline.

Design choices:

- `max_words = 250`: target size of each chunk.
- `overlap = 50`: number of words that overlap between consecutive chunks.
- Output: a list of chunks, where each chunk is a string.

In this step we will:

1. Implement a `chunk_text` helper function.
2. Apply it to one example document from `docs_df`.
3. Inspect the number of chunks and a short preview, to check that the behaviour is sensible.
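
To make the window arithmetic concrete: with `max_words = 250` and `overlap = 50`, each window starts `250 - 50 = 200` words after the previous one. The sketch below lists the word-index spans this implies. `window_spans` is a hypothetical helper for illustration only, and it assumes windowing stops as soon as a window reaches the last word.

```python
from typing import List, Tuple


def window_spans(
    n_words: int, max_words: int = 250, overlap: int = 50
) -> List[Tuple[int, int]]:
    """Return (start, end) word-index spans for overlapping windows."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    step = max_words - overlap  # distance between consecutive window starts
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < n_words:
        end = min(start + max_words, n_words)
        spans.append((start, end))
        if end == n_words:  # last word reached, no further window needed
            break
        start += step
    return spans


# 714 is the word count of esg_corporate_sustainability.txt in the overview
print(window_spans(714))
# [(0, 250), (200, 450), (400, 650), (600, 714)]
```

Under these settings a document of around 700 to 900 words yields four or five chunks, which gives a rough idea of how many extraction calls each document needs.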


In [None]:
from typing import List


def chunk_text(
    text: str,
    max_words: int = 250,
    overlap: int = 50,
) -> List[str]:
    """
    Split a long text into overlapping chunks based on word count.

    Parameters
    ----------
    text : str
        The full document text as a single string.
    max_words : int, optional
        Maximum number of words in each chunk.
    overlap : int, optional
        Number of words that overlap between consecutive chunks.

    Returns
    -------
    List[str]
        A list of text chunks. Each chunk is a string containing
        up to max_words words, with overlap between neighbours.
    """
    # Validate parameters before doing any work
    if max_words <= 0:
        raise ValueError("max_words must be a positive integer")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    # Split the text into individual words (any whitespace is a separator)
    words = text.split()

    if not words:
        return []

    chunks: List[str] = []
    step = max_words - overlap  # how far the window advances each iteration
    start = 0

    # Slide a window over the list of words
    while start < len(words):
        end = start + max_words
        chunks.append(" ".join(words[start:end]))

        # Stop once the window reaches the end of the document; otherwise
        # the next chunk would only repeat words already in this one
        if end >= len(words):
            break

        start += step

    return chunks


# Quick test on a single example document
example_row = docs_df.iloc[0]
example_text = example_row["path"].read_text(encoding="utf-8")

example_chunks = chunk_text(example_text, max_words=250, overlap=50)

print(f"Example document: {example_row['filename']}")
print(f"Total words in document: {example_row['n_words']}")
print(f"Number of chunks created: {len(example_chunks)}\n")

for i, chunk in enumerate(example_chunks[:2]):
    print(f"--- Chunk {i} (first 40 words) ---")
    print(" ".join(chunk.split()[:40]))
    print()
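
A further sanity check is to confirm that the tail of each chunk reappears at the start of the next one. The sketch below assumes chunks come from a fixed-step sliding window as described in this step. `check_overlap` is a hypothetical helper, not part of the original notebook, and it is shown on a small toy input so that it runs on its own.

```python
from typing import List


def check_overlap(chunks: List[str], max_words: int, overlap: int) -> bool:
    """Return True if every chunk's tail is a prefix of the next chunk."""
    step = max_words - overlap
    for prev, nxt in zip(chunks, chunks[1:]):
        tail = prev.split()[step:]       # words the next window should repeat
        head = nxt.split()[:len(tail)]
        if tail != head:
            return False
    return True


# Toy input: words w0..w9 chunked with max_words=4 and overlap=1 (step of 3)
toy_chunks = ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
print(check_overlap(toy_chunks, max_words=4, overlap=1))            # True
print(check_overlap(["a b c d", "x y z"], max_words=4, overlap=1))  # False
```

In the notebook this could be called as `check_overlap(example_chunks, max_words=250, overlap=50)`.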
