# Dataset Chunking Strategy & Analysis

### 1. Context & Objectives

This notebook outlines the semantic chunking strategy employed to process the EU AI Act. Our primary objective was to transform the raw legal text into high-quality, retrieval-ready units.

Legal texts present a unique challenge for RAG (Retrieval-Augmented Generation) pipelines: precision is paramount. A simple character-split would sever critical references (e.g., separating "Article 5" from its sub-paragraphs). Therefore, we adopted a structure-aware approach to ensure every chunk remains self-contained and semantically meaningful while adhering to token limits for optimal embedding performance.

### 2. Methodology

**Step 1: Structural Extraction**
We began by parsing the official AI Act text into its native hierarchical format, separating the document into **Articles, Recitals, and Annexes**. Crucially, this extraction captured the full depth of the legal structure, nesting all sub-points (e.g., "1.", "2.") and sub-sub-points (e.g., "(a)", "(b)") as children of their parent elements.

* *Source:* `ai_act_extracted.json`
* *Outcome:* A clean JSON hierarchy reflecting the complete official legal document structure, preserving parent-child relationships.
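
The nesting described above can be illustrated with a small sketch. It is not the parser used in this project: the function name `nest`, the `text`/`children` keys, and the two regular expressions are assumptions chosen for illustration, and the sample lines are placeholders rather than text from the Act.

```python
import re

# Assumed numbering patterns: "1. ..." for points, "(a) ..." for sub-points.
POINT = re.compile(r"^\d+\.\s")
SUBPOINT = re.compile(r"^\([a-z]+\)\s")


def nest(title, lines):
    """Build a parent/child tree from the flat lines of one article."""
    root = {"text": title, "children": []}
    current_point = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        node = {"text": line, "children": []}
        if SUBPOINT.match(line) and current_point is not None:
            # "(a)", "(b)", ... become children of the latest numbered point.
            current_point["children"].append(node)
        else:
            root["children"].append(node)
            current_point = node if POINT.match(line) else None
    return root


# Placeholder lines, for illustration only.
tree = nest(
    "Article X",
    [
        "1. High-risk AI systems shall:",
        "(a) first placeholder obligation;",
        "(b) second placeholder obligation.",
        "2. A standalone placeholder paragraph.",
    ],
)
```

Here `tree` has two children (points 1 and 2), and point 1 carries the two sub-points as its own children.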

**Step 2: Contextual Flattening**
A major issue in legal chunking is "orphan text." For example, a sub-point labeled "(a)" is meaningless without the preceding clause "1. High-risk AI systems shall...".
To solve this, we implemented a **concatenation strategy**. We flattened the hierarchy by prepending parent headers to child elements.

* *Source:* `ai_act_chunks.json`
* *Transformation:* `Header` + `Sub-point` → `Unified Chunk`
* *Outcome:* Every chunk now carries its full context, preventing semantic ambiguity.
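
A minimal sketch of this kind of flattening, assuming each node is a dictionary with a `text` field and an optional `children` list. Both field names and the sample node are hypothetical; the actual schema of `ai_act_extracted.json` may differ.

```python
def flatten(node, context=()):
    """Yield one chunk per leaf, prefixed with the text of all its ancestors."""
    path = context + (node["text"],)
    children = node.get("children", [])
    if not children:
        # Leaf: join the ancestor headers and the leaf text into one chunk.
        yield " ".join(path)
        return
    for child in children:
        yield from flatten(child, path)


# Hypothetical nested node, not taken from the project data.
article = {
    "text": "Article X",
    "children": [
        {
            "text": "1. High-risk AI systems shall:",
            "children": [
                {"text": "(a) first placeholder obligation;"},
                {"text": "(b) second placeholder obligation."},
            ],
        },
        {"text": "2. A standalone placeholder paragraph."},
    ],
}

chunks = list(flatten(article))
for chunk in chunks:
    print(chunk)
```

Each printed chunk begins with "Article X", and the two sub-points additionally carry the clause "1. High-risk AI systems shall:", so none of them is orphaned.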

**Step 3: Size Optimization (Refining Oversized Chunks)**
Initial profiling revealed a discrepancy in text density. While Articles are typically concise, Recitals (the explanatory background text) are often verbose. To prevent token overflow and context dilution, we applied a secondary **word-based split** specifically targeting these outliers.

* *Source:* `ai_act_chunks_split.json`
* *Transformation:* Any chunk exceeding **250 words** was split into smaller segments of roughly **150 words** with a **30-word overlap** to maintain narrative continuity.
* *Outcome:* No chunk exceeds 250 words; the chunk count rises from 1,389 to 1,518.
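
The split can be sketched as a sliding window over words, assuming a window of 150 words that advances by 120 words (150 minus the 30-word overlap). The function name and parameter names are illustrative and not taken from the project code, and the way a short trailing segment is handled here is only one of several reasonable choices.

```python
def split_by_words(text, max_words=250, target=150, overlap=30):
    """Split text into overlapping word windows if it exceeds max_words."""
    words = text.split()
    if len(words) <= max_words:
        return [text]  # short enough: keep the chunk unchanged
    step = target - overlap  # each window starts 120 words after the previous one
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + target]))
        if start + target >= len(words):
            break  # this window already reached the end of the text
    return segments


# Synthetic 400-word input, for illustration only.
sample = " ".join(f"w{i}" for i in range(400))
segments = split_by_words(sample)
print([len(s.split()) for s in segments])  # [150, 150, 150, 40]
```

With this variant the last segment can be much shorter than 150 words; merging it into the previous segment, or balancing segment lengths, would be alternative designs.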



### 3. Data Analysis

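In [13]:
import sys
from pathlib import Path

# Parent of the working directory, used as the project root.
project_root = Path.cwd().parent

# Make the `src` package importable from the notebook.
sys.path.append(str(project_root))
from src.analysis.analyze_dataset import analyze_chunks

# Histogram settings: 50-word bins; lengths from 500 words up share the "500+" column.
bin_size = 50
max_threshold = 500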
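
For readers without access to `analyze_chunks`, the binning that `bin_size` and `max_threshold` control can be approximated as follows. This is a sketch under assumptions, not the project's implementation: the `type` and `text` field names, the half-open bin edges, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def bin_label(n_words, bin_size=50, max_threshold=500):
    """Return the histogram column for a chunk of n_words words."""
    if n_words >= max_threshold:
        return f"{max_threshold}+"
    low = (n_words // bin_size) * bin_size
    return f"{low}-{low + bin_size}"  # assumed half-open bin: low <= n < low + bin_size


def histogram_by_type(chunks, bin_size=50, max_threshold=500):
    """Count chunks per (type, word-count bin)."""
    table = defaultdict(Counter)
    for chunk in chunks:
        n_words = len(chunk["text"].split())
        table[chunk["type"]][bin_label(n_words, bin_size, max_threshold)] += 1
    return table


# Synthetic chunks, for illustration only.
sample_chunks = [
    {"type": "Recital", "text": "word " * 260},
    {"type": "Articles", "text": "word " * 40},
]
table = histogram_by_type(sample_chunks)
print(dict(table["Recital"]))   # {'250-300': 1}
print(dict(table["Articles"]))  # {'0-50': 1}
```

Under this edge convention a chunk of exactly 250 words lands in the 250-300 column, so column totals need not equal the number of chunks strictly above 250 words.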

**Phase 1: Initial Distribution (Before Splitting)**
Our initial pass yielded **1,389 chunks**. The structural extraction worked well for the binding provisions (Articles and Annexes), but the interpretative Recitals were disproportionately long.

* **Articles & Annexes:** These were naturally concise, averaging roughly **57–62 words** per chunk.
* **Recitals (The Problem Area):** The 180 Recitals averaged **193 words**, significantly longer than the rest of the text. Crucially, **42 Recitals exceeded 250 words**, creating oversized chunks that would likely degrade retrieval performance.


In [11]:
file = project_root / "data" / "json" / "ai_act_chunks.json"
analyze_chunks(file, bin_size=bin_size, max_threshold=max_threshold)

Reading from: /Users/arthurreuss/code/Arthurreuss/Python/RUG/LTP/RAG_AI_Act/data/json/ai_act_chunks.json
Analyzing 1389 chunks...

TYPE                      | 0-50      | 50-100    | 100-150   | 150-200   | 200-250   | 250-300   | 300-350   | 350-400   | 400-450   | 450-500   | 500+      | TOTAL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Annexes                   | 100       | 105       | 14        | 3         | .         | 2         | .         | .         | .         | .         | .         | 224
Articles                  | 408       | 450       | 110       | 8         | 6         | 2         | .         | .         | 1         | .         | .         | 985
Recital                   | 9         | 32        | 43        | 33        | 20        | 12        | 7         | 5         | 7         | 10        | 2         | 180

AVERAGE LENGTHS BY TYPE
TYPE                

**Phase 2: Final Distribution (After Splitting)**
After applying our size optimization logic, the dataset grew from 1,389 to **1,518 chunks** and shows a much more even length distribution.

* **Normalization Success:** The average length of Recitals dropped from 193 words to a manageable **129 words**.
* **Eliminating Outliers:** We successfully eliminated all chunks over 250 words. The bulk of the dataset now sits in the **50–200 word range**.
* **Stability:** The Articles and Annexes remained largely untouched, gaining only 11 chunks in total (Articles 985 → 992, Annexes 224 → 228), while Recitals grew from 180 to 298. This confirms that the splitting logic acted almost exclusively on the oversized Recitals.
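
The per-type totals printed by `analyze_chunks` before and after splitting can be cross-checked with a few lines of arithmetic. The numbers below are copied from the TOTAL columns of the two output tables in this section; the variable names are only illustrative.

```python
# TOTAL column of each output table (before and after splitting).
before = {"Annexes": 224, "Articles": 985, "Recital": 180}
after = {"Annexes": 228, "Articles": 992, "Recital": 298}

# Chunks gained per type through splitting.
delta = {name: after[name] - before[name] for name in before}
print(delta)  # {'Annexes': 4, 'Articles': 7, 'Recital': 118}

# Overall totals match the "Analyzing ... chunks" lines.
print(sum(before.values()), sum(after.values()))  # 1389 1518
```

Of the 129 additional chunks, 118 come from Recitals and 11 from Articles and Annexes combined.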

In [12]:
file2 = project_root / "data" / "json" / "ai_act_chunks_split.json"
analyze_chunks(file2, bin_size=bin_size, max_threshold=max_threshold)

Reading from: /Users/arthurreuss/code/Arthurreuss/Python/RUG/LTP/RAG_AI_Act/data/json/ai_act_chunks_split.json
Analyzing 1518 chunks...

TYPE                      | 0-50      | 50-100    | 100-150   | 150-200   | 200-250   | 250-300   | 300-350   | 350-400   | 400-450   | 450-500   | 500+      | TOTAL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Annexes                   | 102       | 105       | 15        | 6         | .         | .         | .         | .         | .         | .         | .         | 228
Articles                  | 409       | 452       | 110       | 15        | 6         | .         | .         | .         | .         | .         | .         | 992
Recital                   | 29        | 47        | 65        | 137       | 20        | .         | .         | .         | .         | .         | .         | 298

AVERAGE LENGTHS BY TYPE
TYPE          