# Notebook 01: GenAI Extraction and Document Chunking

This notebook covers the first stage of the end-to-end GenAI insurance document pipeline. The goal is to take a collection of long, unstructured documents and convert them into structured JSON outputs that can be analysed in later notebooks.

The raw dataset includes three document types stored in the `data/raw` folder:  
• Insurance policy wording-style documents  
• ESG-style sustainability reports  
• Incident and loss investigation summaries  

Each document is synthetic, created through paraphrasing or original generation so that it mimics the style, tone, and complexity of real insurance documents without raising copyright issues. The documents are intentionally messy, unstructured, and varied in length to reflect the challenges seen in real underwriting and claims workflows.

Insurance-related documents are often long and difficult to process. Important details such as entities, regions, risk factors, exclusions, operational failures, and ESG concerns are spread across multiple paragraphs. Manual extraction is slow and inconsistent. This notebook focuses on building a reproducible GenAI workflow that extracts structured fields reliably.

In this notebook we will:
1. Load the raw documents from `data/raw`  
2. Split each document into smaller chunks for safer extraction  
3. Define a consistent JSON schema for the key fields we want to extract  
4. Build a clear prompt template that enforces structure and discourages hallucination  
5. Run the extraction process using helper functions in `src`  
6. Apply a simple MapReduce-style pattern by merging chunk outputs  
7. Validate and save the final JSON outputs into `data/processed`  

The goal is to produce clean, validated, and consistent structured data. These outputs will then be used in Notebook 02 for exploratory data analysis (EDA) and in Notebook 03 for feature engineering and classification. This mirrors practical workflows in insurtech companies, where large volumes of unstructured documents must be prepared and structured before modelling or decision support.


## Step 1: Set up imports and locate the raw documents

Before we can do any chunking or GenAI extraction, we need a reliable way to point the notebook at the `data/raw` folder and see exactly which files we are working with.

In this step we will:

- Import a small set of core Python utilities  
- Define a `PROJECT_ROOT` and `DATA_RAW_DIR` using `pathlib.Path` so paths are robust  
- List all text files in the three raw subfolders: `policy`, `esg`, and `incident`  
- Store these file paths in a simple dictionary for later steps




In [1]:
from pathlib import Path

# 1) Define project root and raw data directory
PROJECT_ROOT = Path("..").resolve()
DATA_RAW_DIR = PROJECT_ROOT / "data" / "raw"

print(f"Project root: {PROJECT_ROOT}")
print(f"Raw data directory: {DATA_RAW_DIR}")

# 2) Collect documents by category (one subfolder each: esg, incident, policy)
documents_by_category = {}

for category_dir in sorted(DATA_RAW_DIR.iterdir()):
    if category_dir.is_dir():
        txt_files = sorted(category_dir.glob("*.txt"))
        documents_by_category[category_dir.name] = txt_files

# 3) Print a short summary of what we found
print("\nDiscovered raw documents:")
for category, paths in documents_by_category.items():
    print(f"- {category}: {len(paths)} files")
    for path in paths:
        print(f"  • {path.name}")

# 4) Optional: a flat list of all documents (for later steps if needed)
all_documents = [p for paths in documents_by_category.values() for p in paths]
print(f"\nTotal number of documents: {len(all_documents)}")


Project root: C:\Users\misha\OneDrive - University of Bristol\Job Apps\Concirrus\genai-insurance-risk-extraction
Raw data directory: C:\Users\misha\OneDrive - University of Bristol\Job Apps\Concirrus\genai-insurance-risk-extraction\data\raw

Discovered raw documents:
- esg: 3 files
  • esg_corporate_sustainability.txt
  • esg_energy_transition.txt
  • esg_supply_chain_governance.txt
- incident: 3 files
  • incident_marine_grounding.txt
  • incident_motor_fleet_collision.txt
  • incident_property_fire.txt
- policy: 8 files
  • auto_insurance_policy_synthetic.txt
  • businessowners_insurance_synthetic_01.txt
  • cyber_insurance_policy_synthetic.txt
  • group_life_policy_practice_unstructured_v1.txt
  • homeowners_declarations_synthetic.txt
  • homeowners_policy_ho3_synthetic.txt
  • travel_insurance_policy_synthetic_allianz.txt
  • travel_insurance_policy_synthetic_CHI.txt

Total number of documents: 14


## Step 2: Load document contents and create a simple overview

Now that we know where the files are and how many we have in each category, the next step is to actually **load the text contents** into memory and build a simple summary table.

In this step we will:

- Read each `.txt` file into a Python string  
- Capture basic metadata for each document  
  - category (`policy`, `esg`, `incident`)  
  - filename  
  - full filesystem path  
  - character count  
  - word count  
  - a short text preview  
- Store everything in a pandas `DataFrame` so we can inspect the documents in a structured way


This overview will help us:

- Decide sensible chunk sizes later  
- Check that the documents have realistic lengths  
- Quickly spot any weird or empty files before we start chunking and extraction


In [2]:
import pandas as pd

# 1) Load contents of each document into memory and collect basic stats
records = []

for category, paths in documents_by_category.items():
    for path in paths:
        # Read the full text of the file
        text = path.read_text(encoding="utf-8")

        # Build a short, single line preview for quick inspection
        preview = text[:400].replace("\n", " ").strip()

        # Append a record (one row) for this document
        records.append(
            {
                "category": category,
                "filename": path.name,
                "path": path,
                "n_chars": len(text),
                "n_words": len(text.split()),
                "preview": preview,
            }
        )

# 2) Create a DataFrame with one row per document
docs_df = pd.DataFrame(records)

# 3) Show a compact summary
print("Document overview:")
display(
    docs_df[["category", "filename", "n_words", "n_chars", "preview"]]
)


Document overview:


|   | category | filename | n_words | n_chars | preview |
|---|----------|----------|---------|---------|---------|
| 0 | esg | esg_corporate_sustainability.txt | 714 | 4958 | Synthetic ESG Report – Corporate Sustainabilit... |
| 1 | esg | esg_energy_transition.txt | 847 | 5874 | Synthetic ESG Report – Energy Transition and E... |
| 2 | esg | esg_supply_chain_governance.txt | 823 | 5755 | Synthetic ESG Report – Supply Chain and Climat... |
| 3 | incident | incident_marine_grounding.txt | 778 | 4941 | Synthetic Marine Incident Report – Engine Fail... |
| 4 | incident | incident_motor_fleet_collision.txt | 788 | 5051 | Synthetic Incident Report – Motor Fleet Collis... |
| 5 | incident | incident_property_fire.txt | 789 | 5086 | Synthetic Incident Report – Commercial Propert... |
| 6 | policy | auto_insurance_policy_synthetic.txt | 1025 | 6981 | Synthetic Auto Insurance Practice Document (f... |
| 7 | policy | businessowners_insurance_synthetic_01.txt | 1896 | 12083 | Synthetic Businessowners Insurance Practice Do... |
| 8 | policy | cyber_insurance_policy_synthetic.txt | 1089 | 7425 | Synthetic Cyber Insurance Practice Document (... |
| 9 | policy | group_life_policy_practice_unstructured_v1.txt | 1376 | 8881 | Synthetic Group Life Insurance Practice Docume... |
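
The "spot any weird or empty files" check described in this step can also be done programmatically rather than by eye. The snippet below is a minimal sketch on a toy stand-in for `docs_df`: the first two rows reuse word counts from the overview, while the third row and the `MIN_WORDS` threshold are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for docs_df. The first two rows reuse values from the overview;
# the third row is a hypothetical empty file added for illustration.
toy_docs_df = pd.DataFrame(
    {
        "category": ["esg", "incident", "policy"],
        "filename": [
            "esg_corporate_sustainability.txt",
            "incident_marine_grounding.txt",
            "hypothetical_empty_file.txt",
        ],
        "n_words": [714, 778, 0],
    }
)

MIN_WORDS = 50  # assumed threshold, not taken from the notebook

# Documents that are empty or suspiciously short
suspicious = toy_docs_df[toy_docs_df["n_words"] < MIN_WORDS]

# Per-category length summary to help choose a chunk size
per_category = toy_docs_df.groupby("category")["n_words"].agg(["count", "min", "max"])

print(suspicious[["category", "filename", "n_words"]])
print(per_category)
```

In the notebook itself the same two expressions could be applied directly to `docs_df`.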


## Step 3: Apply a consistent text chunking strategy

The chunking helper `chunk_text` is implemented in `src/chunking.py`. Here we import it and use it to split each document into manageable chunks for LLM processing.

In this project we use a word-based chunking strategy with the following design choices:

- Split text into chunks of at most 250 words (`max_words=250`); the final chunk of a document can be shorter.
- Use an overlap of 50 words so that important details near chunk boundaries are preserved.
- Keep the parameters flexible so that chunk sizes can be adjusted later without changing the pipeline.
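
The source of `src/chunking.py` is not reproduced in this notebook, so the following is only a minimal sketch of how a word-based chunker with overlap might work. The function name `chunk_text_sketch` is hypothetical and is chosen so that it does not shadow the real `chunk_text` helper. It assumes that each chunk starts `max_words - overlap` words after the previous one, which is consistent with the chunk sizes reported later in this notebook, but the actual helper may differ.

```python
from typing import List


def chunk_text_sketch(text: str, max_words: int = 250, overlap: int = 50) -> List[str]:
    """Split text into overlapping word-based chunks (illustrative sketch only)."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    # Assumption: consecutive chunks start `step` words apart
    step = max_words - overlap

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks


# Usage on a synthetic 714-word text (same length as the first ESG document)
sample_text = " ".join(f"w{i}" for i in range(714))
sample_chunks = chunk_text_sketch(sample_text, max_words=250, overlap=50)
print([len(c.split()) for c in sample_chunks])  # [250, 250, 250, 114]
```

With a step of 200 words, the last 50 words of each full chunk reappear at the start of the next one, which is what preserves details that straddle a chunk boundary.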

In this step we will:

1. Import the `chunk_text` helper function from `src/chunking.py`.
2. Apply it to a single example document.
3. Inspect the number of chunks and preview the first few to confirm that the behaviour is sensible.


In [3]:
import sys
from pathlib import Path

# Add the project root to the Python path so that the `src` package can be imported
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

# Import the helper function
from src.chunking import chunk_text

# Test on one example document
example_row = docs_df.iloc[0]
example_text = example_row["path"].read_text(encoding="utf-8")

example_chunks = chunk_text(example_text, max_words=250, overlap=50)

print(f"Example document: {example_row['filename']}")
print(f"Total words in document: {example_row['n_words']}")
print(f"Number of chunks created: {len(example_chunks)}\n")

# Preview the first two chunks
for i, chunk in enumerate(example_chunks[:2]):
    print(f"--- Chunk {i} (first 40 words) ---")
    print(" ".join(chunk.split()[:40]))
    print()


Example document: esg_corporate_sustainability.txt
Total words in document: 714
Number of chunks created: 4

--- Chunk 0 (first 40 words) ---
Synthetic ESG Report – Corporate Sustainability Narrative (fully synthetic paraphrased text created for training and GenAI extraction testing; not based on any copyrighted ESG document) (inspired by: corporate ESG and sustainability disclosures from global manufacturing, logistics, and energy companies) The

--- Chunk 1 (first 40 words) ---
the footnotes said the reductions were influenced by lower production volumes rather than actual efficiency improvements. No single team seemed responsible for consolidating the data, which led to confusion over which version was the most accurate. The company recycled some



## Step 4: Chunk all documents and build a chunk level table

In the previous step we confirmed that the `chunk_text` helper function produces sensible chunks for a single document. The next step is to apply this function to every document in the corpus and create a structured table of chunks.

The goal of this step is to move from a **document-level view** (`docs_df`) to a **chunk-level view** that is suitable for LLM extraction.

In this step we will:

1. Loop over all rows in `docs_df` and apply `chunk_text` to each document.
2. For each chunk, record the following metadata:
   - `category` (`policy`, `esg`, `incident`)
   - `filename`
   - `doc_index` (index of the document in `docs_df`)
   - `chunk_index` (position of the chunk within that document)
   - `chunk_text`
   - `n_words_chunk` (word count in the chunk)
3. Store all chunk records in a new pandas DataFrame called `chunks_df`.

The `chunks_df` table will have one row per chunk and will serve as the main input for the LLM extraction step. This makes it easy to track where each chunk came from and to aggregate results back to the document level later in the workflow.
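
Because every chunk row carries its `doc_index` and `filename`, rolling chunk-level values back up to the document level is a single `groupby`. The sketch below uses a small toy table with hypothetical filenames and the same column names as `chunks_df`; the summary columns `n_chunks` and `max_chunk_words` are illustrative names, not part of the pipeline.

```python
import pandas as pd

# Toy chunk-level table with the same column names as chunks_df.
# Filenames and word counts are hypothetical.
toy_chunks_df = pd.DataFrame(
    {
        "doc_index": [0, 0, 0, 0, 1, 1],
        "filename": ["doc_a.txt"] * 4 + ["doc_b.txt"] * 2,
        "chunk_index": [0, 1, 2, 3, 0, 1],
        "n_words_chunk": [250, 250, 250, 114, 250, 120],
    }
)

# Aggregate back to one row per document
doc_summary = (
    toy_chunks_df
    .groupby(["doc_index", "filename"], as_index=False)
    .agg(
        n_chunks=("chunk_index", "count"),
        max_chunk_words=("n_words_chunk", "max"),
    )
)

print(doc_summary)
```

The same grouping keys can later be used to merge per-chunk extraction results into one record per document.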


In [4]:
# Step 4: Chunk all documents and build a chunk-level DataFrame

chunk_records = []

for doc_index, row in docs_df.iterrows():
    # Read full text for this document
    text = row["path"].read_text(encoding="utf-8")

    # Generate chunks using the helper
    chunks = chunk_text(text, max_words=250, overlap=50)

    # Create one record per chunk
    for chunk_index, chunk in enumerate(chunks):
        chunk_records.append(
            {
                "doc_index": doc_index,
                "category": row["category"],
                "filename": row["filename"],
                "chunk_index": chunk_index,
                "chunk_text": chunk,
                "n_words_chunk": len(chunk.split()),
            }
        )

# Build the chunk-level DataFrame
chunks_df = pd.DataFrame(chunk_records)

print("Chunk level overview:")
print(f"- Number of documents: {len(docs_df)}")
print(f"- Total number of chunks: {len(chunks_df)}")
print(f"- Average chunks per document: {len(chunks_df) / len(docs_df):.2f}\n")

display(
    chunks_df[["doc_index", "category", "filename", "chunk_index", "n_words_chunk"]]
    .head(10)
)


Chunk level overview:
- Number of documents: 14
- Total number of chunks: 76
- Average chunks per document: 5.43



|   | doc_index | category | filename | chunk_index | n_words_chunk |
|---|-----------|----------|----------|-------------|---------------|
| 0 | 0 | esg | esg_corporate_sustainability.txt | 0 | 250 |
| 1 | 0 | esg | esg_corporate_sustainability.txt | 1 | 250 |
| 2 | 0 | esg | esg_corporate_sustainability.txt | 2 | 250 |
| 3 | 0 | esg | esg_corporate_sustainability.txt | 3 | 114 |
| 4 | 1 | esg | esg_energy_transition.txt | 0 | 250 |
| 5 | 1 | esg | esg_energy_transition.txt | 1 | 250 |
| 6 | 1 | esg | esg_energy_transition.txt | 2 | 250 |
| 7 | 1 | esg | esg_energy_transition.txt | 3 | 247 |
| 8 | 1 | esg | esg_energy_transition.txt | 4 | 47 |
| 9 | 2 | esg | esg_supply_chain_governance.txt | 0 | 250 |
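
One detail worth noticing in this output: the last chunk of `esg_energy_transition.txt` has 47 words, fewer than the 50-word overlap. If chunks start every 200 words (an assumption that is consistent with the word counts shown here, although the helper's source is not reproduced), such a chunk contains only text already covered by the previous chunk. The sketch below flags these cases, using rows copied from the output above.

```python
import pandas as pd

OVERLAP = 50  # same overlap value passed to chunk_text

# Rows copied from the chunk-level output for esg_energy_transition.txt
energy_chunks_df = pd.DataFrame(
    {
        "filename": ["esg_energy_transition.txt"] * 5,
        "chunk_index": [0, 1, 2, 3, 4],
        "n_words_chunk": [250, 250, 250, 247, 47],
    }
)

# A non-first chunk no longer than the overlap adds no new words,
# assuming each chunk starts (max_words - overlap) words after the previous one.
is_redundant_tail = (energy_chunks_df["chunk_index"] > 0) & (
    energy_chunks_df["n_words_chunk"] <= OVERLAP
)

print(energy_chunks_df[is_redundant_tail])
```

Whether to drop such chunks before extraction is a judgment call: keeping them costs an extra LLM call per document, while dropping them relies on the assumption above holding for the real helper.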


## Step 5: Define the JSON schema and controlled vocabularies

Before calling the LLM, we need a clear and consistent definition of the structured output we expect from each chunk. This ensures that:

- The model always returns the same fields.
- Values are constrained where appropriate.
- Validation in later steps is straightforward.
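
To make the validation point concrete, here is a minimal sketch of a structural check for one extraction result. The field names and value types mirror the JSON structure defined in this step. The function name `validate_extraction` is hypothetical, and treating `key_risk_factors` as a list of strings is an assumption. Controlled-vocabulary checks are deliberately left out.

```python
from typing import Any, Dict, List

# Field names and expected Python types, mirroring the JSON structure in this step
EXPECTED_FIELDS: Dict[str, type] = {
    "entity_name": str,
    "region": str,
    "sector": str,
    "risk_type": str,
    "time_horizon": str,
    "key_risk_factors": list,
    "risk_summary": str,
}


def validate_extraction(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one extraction result (empty if none)."""
    problems = []

    # Check that exactly the expected fields are present
    for name in sorted(set(EXPECTED_FIELDS) - set(record)):
        problems.append(f"missing field: {name}")
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")

    # Check the type of each field that is present
    for name, expected_type in EXPECTED_FIELDS.items():
        if name in record and not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    # Assumption: each risk factor is a plain string
    factors = record.get("key_risk_factors")
    if isinstance(factors, list) and not all(isinstance(f, str) for f in factors):
        problems.append("key_risk_factors should contain only strings")

    return problems


# Usage: an empty but well-formed record passes, a malformed one does not
good_record = {name: ([] if t is list else "") for name, t in EXPECTED_FIELDS.items()}
bad_record = {"entity_name": "Example Co", "key_risk_factors": "fire"}

print(validate_extraction(good_record))
print(validate_extraction(bad_record))
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue for a chunk and decide later whether to retry or discard it.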

In this project, each extraction call is expected to return the following JSON structure:

```json
{
  "entity_name": "",
  "region": "",
  "sector": "",
  "risk_type": "",
  "time_horizon": "",
  "key_risk_factors": [],
  "risk_summary": ""
}
