# Analysis of the policies & classification

*This notebook uses Gemini AI to analyze privacy policy documents and extract important clauses classified as BLOCKER, BAD, NEUTRAL, or GOOD according to ToS;DR standards.*

## 1. Load Libraries

*Import required packages for async processing, AI model interaction, environment variables, and rich console visualization.*

In [1]:
import os
import json
import asyncio
import re
import random
import hashlib
from typing import List, Dict
from google import genai
from google.genai import types
from dotenv import load_dotenv
from pathlib import Path
from sklearn.model_selection import train_test_split

from rich.console import Console
from rich.progress import (
    Progress, SpinnerColumn, BarColumn, TextColumn, 
    TimeRemainingColumn, MofNCompleteColumn
)
from rich.panel import Panel
from rich.rule import Rule

## 2. Configuration

*Define file paths, AI model settings, processing limits, and initialize the Google Gemini client.*

In [2]:
ROOT = Path('../..')
DATA_DIR = ROOT / "data-generated" / "TOSDR"
MARKDOWN_OUTPUT = DATA_DIR / "policies_md.jsonl"
HIGHLIGHTS_OUTPUT = DATA_DIR / "gemini_policies_md.jsonl"
ENV_FILE = ROOT / ".env"
model_name = "gemini-2.0-flash-lite"

# Block size tuned for Qwen3-0.6B: ~4500 characters per block (approx. 1024 tokens)
MAX_CHARS_PER_BLOCK = 4500 
MAX_TO_PROCESS = 1000
CONCURRENCY_LIMIT = 5

console = Console()
load_dotenv(ENV_FILE)
client = genai.Client(api_key=os.getenv("GOOGLE_AI_API_KEY"))
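
The block-size comment in the configuration implies an average of roughly 4.4 characters per token. A minimal sketch of that arithmetic follows; `TARGET_TOKENS` and `estimate_tokens` are hypothetical names introduced here, and the ratio is a rough heuristic rather than the output of a real tokenizer.

```python
MAX_CHARS_PER_BLOCK = 4500   # value from the configuration cell
TARGET_TOKENS = 1024         # token budget quoted in the configuration comment

# Implied average characters per token (approximation, not a tokenizer)
CHARS_PER_TOKEN = MAX_CHARS_PER_BLOCK / TARGET_TOKENS


def estimate_tokens(text: str) -> int:
    """Rough token estimate based only on character count (hypothetical helper)."""
    return round(len(text) / CHARS_PER_TOKEN)


print(round(CHARS_PER_TOKEN, 2))
print(estimate_tokens("x" * MAX_CHARS_PER_BLOCK))
```

Actual token counts depend on the tokenizer and on the language of the policy text, so the estimate is only useful as a sanity check on block size.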

## 3. AI System Prompt

*Define the prompt that instructs Gemini on how to analyze documents and classify clauses according to ToS;DR standards.*

In [3]:
SYSTEM_PROMPT = """You are an expert legal auditor for ToS;dr. Your mission is to parse legal document blocks and extract ONLY the clauses that impact a user's rights, privacy, or safety.

CRITICAL PHILOSOPHY:
- DISTINGUISH BOILERPLATE FROM PROTECTIONS: Generic greetings or index lists are "NO_CLAUSES_FOUND". However, explicit promises (encryption, support, data limits) are valid points.
- BE SELECTIVE BUT COMPLETE: If a block contains 3 risks and 1 protection, extract all 4. 
- IGNORE PURE STRUCTURE: Skip table of contents, contact addresses (unless for rights exercise), and section headers without text.

STRICT CONSTRAINTS:
- ZERO EXTERNAL KNOWLEDGE: Do not use anything you know about this company from your training data. 
- TEXTUAL EVIDENCE ONLY: If a clause is not explicitly written in the provided segment, it does NOT exist.
- NO HALLUCINATIONS: Do not assume or infer policies. If the text doesn't mention "Arbitration", do not list an arbitration clause even if you know the company uses one.
- ANALYZE AS A BLIND DOCUMENT: Treat this segment as if you have never heard of the company before.

CLASSIFICATION:
- [GOOD]: Positive for user rights or security (e.g., encryption, clear notification periods, data deletion).
- [NEUTRAL]: Important facts for transparency (e.g., jurisdiction, specific age limits).
- [BAD]: Negative practices or risks (e.g., arbitration, tracking).
- [BLOCKER]: Critical dangers (e.g., data selling, private message access).

FEW-SHOT EXAMPLES:

Segment: 
"1. Introduction
2. Account Registration
3. Fees and Payments
4. Termination Policy
For more information, visit our help center or contact us at support@example.com."
Output: NO_CLAUSES_FOUND

Segment: 
"Welcome to our platform. We provide cloud storage features. We use industry-standard AES-256 encryption to protect your data at rest. Our customer support team is available 24/7 via live chat to assist with any security concerns."
Output:
- [GOOD] : Strong Encryption : The service uses industry-standard AES-256 encryption to protect stored data.
- [GOOD] : 24/7 Support : Users have constant access to support for technical or security issues.

Segment:
"## 5. Intellectual Property
You retain ownership of your photos. However, you grant us a worldwide, perpetual license to use, reproduce, and distribute your content. We may also use your username in commercial ads without compensation."
Output:
- [BLOCKER] : Perpetual Content License : The service takes an irrevocable and perpetual license to use all your content.
- [BAD] : Commercial use of Identity : The service can use your username for advertising without payment.

Segment:
"We store your data for 30 days after account deletion. Any disputes will be resolved in the courts of Paris, France. We do not track your location when the app is closed."
Output:
- [NEUTRAL] : Data Retention Period : Personal data is kept for 30 days after the account is closed.
- [NEUTRAL] : Jurisdiction : Disputes are handled specifically in Paris, France.
- [GOOD] : No Background Tracking : The app explicitly stops location tracking when not in active use.

OUTPUT FORMAT:
- [LABEL] : SHORT TITLE : Concise explanation.

If the segment contains only structure or irrelevant text, return: NO_CLAUSES_FOUND"""

## 4. Processing Functions

*Functions for segmenting documents, calling the model, parsing its output, and building the dataset.*

### 4.1 Segmentation of markdowns

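*Segment a Markdown document into blocks. The splitter recurses through progressively finer separators (headings, then paragraphs, then lines) so that each block stays under the maximum block size wherever a split point exists.*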

In [4]:
def segment_hierarchical(text: str, max_chars: int = MAX_CHARS_PER_BLOCK) -> List[str]:
    """
    Recursively splits markdown into independent blocks following logical priority:
    H1 (#) -> H2 (##) -> H3 (###) -> Paragraphs (\n\n) -> Lines (\n)
    """
    def split_recursive(content: str, separators: List[str]) -> List[str]:
        if len(content) <= max_chars or not separators:
            return [content]
        
        current_sep = separators[0]
        remaining_seps = separators[1:]
        
        # Split using lookahead to keep headers with their content
        parts = re.split(current_sep, content)
        
        final_parts = []
        current_buffer = ""
        
        for part in parts:
            if not part: continue
            
            if len(part) > max_chars:
                if current_buffer:
                    final_parts.append(current_buffer.strip())
                    current_buffer = ""
                final_parts.extend(split_recursive(part, remaining_seps))
            elif len(current_buffer) + len(part) > max_chars:
                if current_buffer:
                    final_parts.append(current_buffer.strip())
                current_buffer = part
            else:
                current_buffer += part
        
        if current_buffer:
            final_parts.append(current_buffer.strip())
            
        return final_parts

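    # Ordered priority of separators. Each pattern is a zero-width lookahead,
    # so re.split (Python 3.7+) consumes nothing: the newline(s) and heading
    # marker stay at the start of the following part, and parts that are
    # re-joined in the buffer keep their original line breaks.
    separators = [
        r'(?=\n#\s)',    # H1
        r'(?=\n##\s)',   # H2
        r'(?=\n###\s)',  # H3
        r'(?=\n\n)',     # Paragraphs
        r'(?=\n)'        # Lines
    ]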
    
    return [c for c in split_recursive(text, separators) if len(c.strip()) > 50]
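
As a standalone illustration of heading-aware splitting, the sketch below cuts a small made-up document in front of each H2 heading. It assumes Python 3.7 or later, where `re.split` accepts zero-width matches; the sample text is hypothetical.

```python
import re

sample = (
    "Intro text.\n"
    "## Data\nWe collect emails.\n"
    "## Rights\nYou may delete your account."
)

# Zero-width lookahead: split in front of each newline that precedes an H2
# marker. Nothing is consumed, so every part after the first begins with the
# newline and its heading; strip() removes that leading newline afterwards.
parts = [p.strip() for p in re.split(r"(?=\n##\s)", sample) if p.strip()]

for p in parts:
    print(repr(p))
```

Because no characters are consumed, joining the stripped parts with a single newline reproduces the original sample exactly.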

### 4.2 Output Parser

*Parses the AI-generated text to extract structured highlights with labels (BLOCKER, BAD, GOOD, NEUTRAL).*

In [5]:
def parse_generative_output(text: str) -> List[Dict]:
    results = []
    # Pattern handles optional dash and flexible spacing
    pattern = re.compile(r"^[-]?\s*\[(BAD|GOOD|NEUTRAL|BLOCKER)\]\s*:\s*([^:]+):\s*(.+)$", re.MULTILINE)
    
    for match in pattern.finditer(text):
        results.append({
            "label": match.group(1).upper(),
            "title": match.group(2).strip(),
            "explanation": match.group(3).strip()
        })
    return results
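
A short usage sketch of the parser on text shaped like the prompt's output format. The regular expression is copied from the cell above; the `parse` wrapper and the sample strings are illustrative only.

```python
import re
from typing import Dict, List

# Same pattern as in parse_generative_output
PATTERN = re.compile(
    r"^[-]?\s*\[(BAD|GOOD|NEUTRAL|BLOCKER)\]\s*:\s*([^:]+):\s*(.+)$",
    re.MULTILINE,
)


def parse(text: str) -> List[Dict]:
    """Compact stand-in for parse_generative_output (illustrative)."""
    return [
        {
            "label": m.group(1),
            "title": m.group(2).strip(),
            "explanation": m.group(3).strip(),
        }
        for m in PATTERN.finditer(text)
    ]


sample_output = (
    "- [GOOD] : Strong Encryption : The service uses AES-256 encryption.\n"
    "- [BAD] : Commercial use of Identity : The service can use your username for ads."
)

parsed = parse(sample_output)
no_clauses = parse("NO_CLAUSES_FOUND")  # sentinel text yields no matches

for item in parsed:
    print(item["label"], "|", item["title"], "|", item["explanation"])
```

Lines that do not follow the `[LABEL] : TITLE : explanation` shape are silently skipped, so an empty result can mean either the sentinel or malformed model output.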

### 4.3 AI Extraction Worker

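*Segments a document, sends each block to Gemini for analysis, and returns one record per block containing the block text and the raw model output (empty when no clauses are found).*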

In [6]:
async def extract_highlights(document_item, semaphore):
    """Segments the document and returns a list of independent block objects."""
    markdown_text = document_item.get('policy', '')
    if not markdown_text: return []

    chunks = segment_hierarchical(markdown_text)
    blocks_results = []

    async with semaphore:
        for i, chunk in enumerate(chunks):
            try:
                # Use the SDK's async client so the call does not block the event loop
                response = await client.aio.models.generate_content(
                    model=model_name,
                    config=types.GenerateContentConfig(
                        system_instruction=SYSTEM_PROMPT,
                        temperature=0.0,
                    ),
                    contents=f"STRICTLY ANALYZE THIS TEXT ONLY:\n\n{chunk}"
                )
                
                generated_text = response.text.strip() if response.text else ""
                
                final_output = "" if "NO_CLAUSES_FOUND" in generated_text else generated_text
                
                c_hash = hashlib.md5(chunk.encode('utf-8')).hexdigest()[:8]
                block_id = f"{document_item['service_id']}_{i}_{c_hash}"

                blocks_results.append({
                    "id": block_id,
                    "original_service_id": document_item["service_id"],
                    "service_name": document_item["service_name"],
                    "url": document_item["url"],
                    "input": chunk,
                    "output": final_output
                })
                
                await asyncio.sleep(0.1)
            except Exception as e:
                console.print(f"[red]Error on chunk {i} of {document_item['service_name']}: {e}[/red]")
                continue
                
    return blocks_results
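
The block ID combines the service ID, the block's position in the document, and the first eight hex characters of the MD5 digest of the block text. A minimal sketch of that scheme in isolation follows; `make_block_id` is a hypothetical helper, not a function defined in this notebook.

```python
import hashlib


def make_block_id(service_id, index: int, chunk: str) -> str:
    """Mirror the ID format used in extract_highlights (hypothetical helper)."""
    c_hash = hashlib.md5(chunk.encode("utf-8")).hexdigest()[:8]
    return f"{service_id}_{index}_{c_hash}"


text = "We store your data for 30 days after account deletion."

id_first = make_block_id(42, 0, text)
id_again = make_block_id(42, 0, text)   # same inputs, same ID
id_other = make_block_id(42, 1, text)   # different position, different ID

print(id_first)
print(id_other)
```

The hash is used only as a short content fingerprint, so the same block always maps to the same ID across runs.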

### 4.4 Main Dataset Builder

*Coordinates the full pipeline: loading the Markdown policies, skipping services that already have results in the output file, processing the remaining documents with Gemini, and appending the results.*

In [7]:
async def main_dataset_builder():
    if not MARKDOWN_OUTPUT.exists():
        console.print("[red]Source file missing.[/red]")
        return

    already_processed_ids = set()
    if HIGHLIGHTS_OUTPUT.exists():
        with open(HIGHLIGHTS_OUTPUT, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    already_processed_ids.add(json.loads(line).get("original_service_id"))
                except (json.JSONDecodeError, AttributeError):
                    continue  # skip malformed lines

    unprocessed_documents = []
    with open(MARKDOWN_OUTPUT, "r", encoding="utf-8") as f:
        for line in f:
            try:
                doc_data = json.loads(line)
                if doc_data.get("service_id") not in already_processed_ids:
                    unprocessed_documents.append(doc_data)
            except (json.JSONDecodeError, AttributeError): continue  # skip malformed lines

    if not unprocessed_documents:
        console.print("[bold green]✔ Everything processed![/bold green]")
        return

    unprocessed_documents = unprocessed_documents[:MAX_TO_PROCESS]
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    
    with Progress(
        SpinnerColumn(), TextColumn("[progress.description]{task.description}"),
        BarColumn(), MofNCompleteColumn(), TimeRemainingColumn(), console=console,
    ) as progress:
        task = progress.add_task("[cyan]Processing documents...", total=len(unprocessed_documents))
        
        with open(HIGHLIGHTS_OUTPUT, "a", encoding="utf-8") as output_file:
            for document in unprocessed_documents:
                progress.update(task, description=f"[cyan]Analyzing {document['service_name']}...")
                
                # Get the list of processed blocks
                processed_blocks = await extract_highlights(document, semaphore)
                
                for block in processed_blocks:
                    output_file.write(json.dumps(block, ensure_ascii=False) + "\n")
                
                output_file.flush()
                progress.advance(task)

    console.print("\n[bold green]Complete![/bold green]")
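
The builder above awaits `extract_highlights` once per document, in sequence, so at most one document holds the semaphore at any time. If concurrent processing is wanted, one possible approach is to schedule all documents with `asyncio.gather` and let the semaphore cap how many run at once. The sketch below demonstrates this with a hypothetical `fake_extract` worker that sleeps instead of calling the API.

```python
import asyncio

CONCURRENCY_LIMIT = 5  # same value as in the configuration cell


async def fake_extract(doc_id: int, semaphore: asyncio.Semaphore, stats: dict) -> int:
    """Hypothetical stand-in for extract_highlights; sleeps instead of calling the API."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["active"] -= 1
    return doc_id


async def run_all(doc_ids):
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    stats = {"active": 0, "peak": 0}
    # gather schedules every coroutine up front; the semaphore limits how many
    # are inside the guarded section at the same time
    results = await asyncio.gather(
        *(fake_extract(d, semaphore, stats) for d in doc_ids)
    )
    return results, stats["peak"]


# In a notebook cell, use `await run_all(range(20))` instead of asyncio.run
results, peak = asyncio.run(run_all(range(20)))
print(len(results), "documents, peak concurrency:", peak)
```

`asyncio.gather` returns results in submission order, so writing them to the output file afterwards keeps the document order stable. Writing only after all tasks finish trades away the incremental flush of the sequential loop.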

## 5. Run Highlight Extraction

*Execute the main analysis pipeline to generate highlights for all documents.*

In [8]:
await main_dataset_builder()


## 6. Train/Test Split

*Shuffle the analyzed blocks with a fixed random seed and split them into a training set (95%) and a test set (5%).*

In [9]:
def run_split():
    if not HIGHLIGHTS_OUTPUT.exists():
        print("No analyzed data found.")
        return

    block_dataset = []
    with open(HIGHLIGHTS_OUTPUT, "r", encoding="utf-8") as f:
        for line in f:
            try:
                block_dataset.append(json.loads(line))
            except json.JSONDecodeError: continue  # skip malformed lines

    print(f"Total blocks available: {len(block_dataset)}")
    
    train_set, test_set = train_test_split(block_dataset, test_size=0.05, random_state=42)
    
    output_dir = ROOT / "data-generated" / "EULAI"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    for name, samples in [("train", train_set), ("test", test_set)]:
        output_path = output_dir / f"qwen_{name}.jsonl"
        with open(output_path, "w", encoding="utf-8") as f:
            for s in samples:
                f.write(json.dumps(s, ensure_ascii=False) + "\n")
        print(f"Saved: {output_path} ({len(samples)} blocks)")

run_split()

Total blocks available: 22005
Saved: ../../data-generated/EULAI/qwen_train.jsonl (20904 blocks)
Saved: ../../data-generated/EULAI/qwen_test.jsonl (1101 blocks)
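
The reported sizes are consistent with the 5% test fraction, assuming the test count is rounded up and the training set receives the remainder. A quick check of the arithmetic:

```python
import math

n_blocks = 22005   # "Total blocks available" reported above
test_size = 0.05   # fraction passed to train_test_split

# 22005 * 0.05 = 1100.25, which rounds up to 1101 test blocks
n_test = math.ceil(n_blocks * test_size)
n_train = n_blocks - n_test

print(n_train, n_test)
```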
