<a href="https://www.kaggle.com/code/roystondalmeida/legal-document-analyzer-and-summarizer?scriptVersionId=235285936" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 📄 Legal Document Analyzer and Summarizer

This notebook demonstrates an end-to-end workflow for analyzing legal documents (PDF/DOCX) using Google Gemini via the `google-generativeai` SDK. The system extracts structured legal obligations and court decisions from complex legal texts, producing a machine-readable summary and actionable insights.

## 🧠 GenAI Capabilities Demonstrated

**This project showcases the following GenAI capabilities:**

- **Structured Output / Controlled Generation:**  
  The Gemini model is prompted to return information in a strict JSON schema, ensuring reliable, machine-readable output for downstream processing.

- **Document Understanding:**  
  The workflow parses and comprehends complex legal documents (PDF and DOCX), segmenting and analyzing them to extract obligations, deadlines, and court decisions.

- **Long Context Window:**  
  Gemini's ability to process and reason over large chunks of legal text enables accurate extraction of information from lengthy court orders and filings.

## 🔑 Secure API Key Management

> **Note:**  
> This notebook uses the Kaggle Secrets management feature to securely access the Gemini API key.  
> - The key lives in Kaggle Secrets rather than in the notebook source, so it stays hidden even when the notebook is public, provided the code never prints it.  
> - Do **not** hardcode API keys in the notebook.  
> - For more, see [Kaggle Secrets documentation](https://www.kaggle.com/docs/secrets).
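
To run the same code outside Kaggle, the key lookup can fall back to an environment variable. The following is a minimal sketch, not part of the original notebook: the helper names `load_gemini_api_key` and `mask` are hypothetical, and it assumes the key has been exported under the same label when `kaggle_secrets` is unavailable.

```python
import os


def load_gemini_api_key(label="GEMINI_API_KEY"):
    """Return the API key from Kaggle Secrets, or from the environment elsewhere."""
    try:
        from kaggle_secrets import UserSecretsClient  # only importable on Kaggle
    except ImportError:
        key = os.environ.get(label)  # assumption: exported in the local shell
    else:
        key = UserSecretsClient().get_secret(label)
    if not key:
        raise RuntimeError(f"No API key found under the label {label!r}.")
    return key


def mask(secret):
    """Hide all but the last four characters before anything is logged."""
    return "*" * max(len(secret) - 4, 0) + secret[-4:]
```

If a log line really must mention the key, passing it through `mask` first keeps the full value out of the saved notebook output.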

## 🚀 Step 1: Environment Setup and Gemini SDK Initialization

In [1]:
# Install required packages
!pip install -q google-generativeai pypdf2 python-docx tqdm

# Kaggle environment setup and Gemini Python SDK setup
import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

# Initialize the secrets client to securely access your API key stored in Kaggle Secrets
user_secrets = UserSecretsClient()

# Retrieve the Gemini API key using the label you assigned in the Secrets UI ("GEMINI_API_KEY")
gemini_api_key = user_secrets.get_secret("GEMINI_API_KEY")

# Configure the Gemini SDK with your API key
genai.configure(api_key=gemini_api_key)


## 🗂️ Step 2: Document Parsing Functions

Helper functions to extract text from PDF and DOCX files:

In [2]:
# Step 2: Define helper functions for document parsing
from PyPDF2 import PdfReader
from docx import Document
import re

def parse_pdf(file_path):
    """
    Args:
        file_path (str): Path to the PDF file to be parsed
    Returns:
        str: Combined text from all pages separated by newlines, with whitespace normalized.
    Description:
        Extracts and concatenates text content from a PDF document, normalizing whitespace to reduce empty lines.
    """
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            # Rejoin words hyphenated across line breaks first (this needs the
            # newlines intact), then collapse every whitespace run to one space.
            page_text = re.sub(r'(?<=\w)-\n(?=\w)', '', page_text)
            page_text = re.sub(r'\s+', ' ', page_text).strip()
            text += page_text + "\n"

    return text.strip()


def parse_docx(file_path):
    """
    Args:
        file_path (str): Path to the Word document to be parsed
    Returns:
        str: Combined text from all paragraphs separated by newlines, with whitespace normalized.
    Description:
        Extracts and combines text content from a DOCX document, normalizing whitespace to reduce empty lines.
    """
    doc = Document(file_path)
    text = ""
    for para in doc.paragraphs:
        para_text = para.text
        if para_text:
            # Normalize whitespace and remove leading/trailing whitespace
            para_text = re.sub(r'\s+', ' ', para_text).strip()
            text += para_text + "\n"

    # Collapse repeated newlines and trim leading/trailing whitespace
    text = re.sub(r'\n+', '\n', text).strip()
    return text
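
The order of the whitespace substitutions matters: once `\s+` has turned every newline into a space, a later pattern that looks for `-\n` has nothing left to match. The sketch below illustrates the order that keeps the hyphen repair effective. The function name `normalize_page_text` and the sample string are made up for illustration.

```python
import re


def normalize_page_text(raw):
    """Rejoin words split by a line-end hyphen, then collapse all whitespace."""
    # Step 1: needs the newlines intact, so it must run before the collapse.
    joined = re.sub(r'(?<=\w)-\n(?=\w)', '', raw)
    # Step 2: every whitespace run, newlines included, becomes one space.
    return re.sub(r'\s+', ' ', joined).strip()


sample = "The peti-\ntioner  filed\n\n\na motion."
print(normalize_page_text(sample))
```

One caveat of this approach: a genuine compound that happens to break at its hyphen (for example `court-` at a line end followed by `appointed`) loses the hyphen as well.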

## 🧹 Step 3: Preprocessing Pipeline

Text cleaning and segmentation functions:

In [3]:
# Step 3: Preprocessing pipeline, define helper functions for text cleaning and segmentation
import re

def preprocess_text(text):
    """
    Args:
        text (str): Raw text input from parsed legal documents (PDF/DOCX)
    Returns:
        list: Text sections split by double newlines, with cleaned formatting
    Description:
        Cleans and segments legal document text into logical sections.
    Processing Steps:
        1. Collapses runs of three or more newlines into a single blank line
        2. Compresses runs of spaces and tabs to single spaces
        3. Splits content into sections on blank lines, dropping empty sections
    """

    cleaned = re.sub(r'\n{3,}', '\n\n', text)  # Keep at most one blank line
    cleaned = re.sub(r'[ \t]{2,}', ' ', cleaned)  # Collapse spaces but keep newlines
    # Split on blank lines and discard sections that contain only whitespace
    return [section.strip() for section in cleaned.split('\n\n') if section.strip()]
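
Splitting on blank lines can yield many tiny sections, and each one costs a separate model call. A possible refinement, sketched here under assumptions, is to pack consecutive sections into larger chunks. The helper `pack_sections` and the default `max_chars` budget are illustrative choices, not values taken from this notebook.

```python
def pack_sections(sections, max_chars=20000):
    """Greedily merge consecutive sections into chunks of at most max_chars."""
    chunks, current = [], ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars or not current:
            current = candidate  # a single oversized section is kept whole
        else:
            chunks.append(current)
            current = section
    if current:
        chunks.append(current)
    return chunks


print(pack_sections(["aaaa", "bbbb", "cc"], max_chars=10))
```

A character budget is only a rough stand-in for the model's token limit, so the threshold would need tuning for real documents.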

## 🤖 Step 4: Gemini-Powered Analysis

Configure model and define analysis function:

In [4]:
# Step 4: Gemini-Powered Analysis

# Configure model with system instruction
model = genai.GenerativeModel('gemini-1.5-pro', 
                              system_instruction = "You are a legal document analyzer. Extract obligations, deadlines, and key clauses in JSON format.")

# Chunk processing strategy
def analyze_chunk(chunk):
    """Processes text chunks with Gemini."""
    
    prompt = f"""
                Analyze this legal document and return JSON. Extract legal obligations AND court decisions impacting specific parties:
                {chunk}

                Required JSON format:
                {{
                    "summary": "concise summary of document",
                    "obligations": [
                        {{
                            "docket_number": "case identifier, if available",
                            "party": "responsible entity",
                            "action": "specific requirement",
                            "deadline": "YYYY-MM-DD, if applicable",
                            "clause_reference": "section identifier, if available",
                            "details": "further details"
                        }}
                    ],
                    "court_decisions": [
                        {{
                            "docket_number": "case identifier",
                            "case": "case identifier",
                            "decision": "summary of the court's decision",
                            "party": "party affected",
                            "other_party": "opposing party (if any)",
                            "reasoning": "the court's reasoning"
                        }}
                    ], 
                    "justice_opinions": [
                        {{
                            "docket_number": "case identifier",
                            "justice": "Name of the justice providing the opinion",
                            "opinion_summary": "Detailed summary of the justice's opinion on the case",
                            "reasoning": "Reasons for the opinion",
                            "related_cases": "Any cases related to the opinion"
                        }}
                    ],
                    "other_notes": "Include any other noteworthy legal points or observations from this excerpt."
                }}
            """
    
    response = model.generate_content(prompt)
    return response.text
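
The prompt asks for JSON, but a reply can still arrive wrapped in prose or code fences. One way to tighten this, sketched below, is to request JSON output through the generation settings as well. This assumes the installed `google-generativeai` release and the chosen Gemini model accept the `response_mime_type` setting. The helper `build_model_kwargs` and the temperature value are illustrative and not part of the original notebook.

```python
def build_model_kwargs(model_name, system_instruction):
    """Collect keyword arguments for genai.GenerativeModel (hypothetical helper)."""
    return {
        "model_name": model_name,
        "system_instruction": system_instruction,
        # Assumption: the installed SDK and model accept this setting.
        "generation_config": {
            "response_mime_type": "application/json",
            "temperature": 0.0,  # illustrative: favors repeatable extraction
        },
    }


kwargs = build_model_kwargs(
    "gemini-1.5-pro",
    "You are a legal document analyzer. Extract obligations, deadlines, "
    "and key clauses in JSON format.",
)
print(kwargs["generation_config"])
# With the SDK configured: model = genai.GenerativeModel(**kwargs)
```

Even with this setting, validating the parsed result remains worthwhile, since a well-formed JSON reply can still omit fields or use unexpected types.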

## 📊 Step 5: Result Aggregation

Combine results from all chunks:

In [5]:
import json
import re
from tqdm import tqdm

def extract_json_from_response(response):
    """
    Extracts a JSON string from a text response, handling potential errors.
    """
    if not isinstance(response, str):
        print(f"❌ Unexpected response type: {type(response)}, Response: {response}")
        return "{}"

    # Check if response is empty or just whitespace
    if not response.strip():
        print("❌ Empty response received.")
        return "{}"

    try:
        # Grab everything from the first "{" to the last "}", which also
        # drops any Markdown code fences wrapped around the JSON
        match = re.search(r'\{[\s\S]*\}', response)
        if match:
            json_string = match.group(0)
            
            # Validate JSON string before returning
            try:
                json.loads(json_string)
                return json_string
            except json.JSONDecodeError as e:
                print(f"❌ Invalid JSON found after extraction: {e}")
                return "{}"
        else:
            print("❌ No JSON object found in model response.")
            return "{}"  # Return empty JSON object as string if no match found
    except Exception as e:
        print(f"❌ Extraction Error: {e}")
        return "{}"

def process_document(full_text):
    """
    Args:
        full_text (str): Raw text content from parsed legal documents
    Returns:
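        dict: Structured analysis containing:
            - summary (str): Cumulative plain-language summary
            - obligations (list): Extracted legal obligations with metadata
            - court_decisions (list): Court decisions affecting specific parties
            - justice_opinions (list): Opinions of individual justices
            - other_notes (str): Other noteworthy points, concatenated across chunks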
    Description:
        Orchestrates end-to-end analysis of legal documents using GenAI.
    Processing Workflow:
        1. Preprocesses text into logical chunks
        2. Analyzes each chunk with GenAI (Gemini)
        3. Aggregates results into unified structure
        4. Provides progress visualization via tqdm
    """
    
    chunks = preprocess_text(full_text)
    results = {
        "summary": "",
        "obligations": [],
        "court_decisions": [],
        "justice_opinions": [],
        "other_notes": ""
    }

    for chunk in tqdm(chunks):
        response = analyze_chunk(chunk)
        try:
            # extract_json_from_response always returns a JSON string ("{}" on failure)
            json_str = extract_json_from_response(response)
            part = json.loads(json_str)  # Convert JSON string to dictionary
        except (json.JSONDecodeError, TypeError) as e:
            print(f"❌ JSON Processing Error: {e}, Response: {response}")
            part = {}  # Fall back to an empty result for this chunk

        # Aggregate data safely; "or" also covers keys the model set to null
        results["summary"] += (part.get("summary") or "") + " "
        results["obligations"].extend(part.get("obligations") or [])
        results["court_decisions"].extend(part.get("court_decisions") or [])
        results["justice_opinions"].extend(part.get("justice_opinions") or [])
        results["other_notes"] += (part.get("other_notes") or "") + " "

    return results
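
Because the aggregated entries come straight from model output, a light sanity check before relying on them can catch missing fields or malformed dates. The following is a sketch only: `check_obligation` is a hypothetical helper, and the expected keys simply mirror the obligation fields requested in the Step 4 prompt.

```python
from datetime import datetime

OBLIGATION_KEYS = {"docket_number", "party", "action", "deadline",
                   "clause_reference", "details"}


def check_obligation(item):
    """Return a list of problems found in one extracted obligation."""
    if not isinstance(item, dict):
        return ["not a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(OBLIGATION_KEYS - set(item))]
    deadline = item.get("deadline")
    if deadline:  # empty or null deadlines are allowed ("if applicable")
        try:
            datetime.strptime(deadline, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"deadline is not YYYY-MM-DD: {deadline!r}")
    return problems


print(check_obligation({"party": "X", "deadline": "April 21"}))
```

Running it over `results["obligations"]` and printing any non-empty problem lists gives a quick view of which chunks may need a second look.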

## 🏁 Step 6: Execution Workflow

Run the complete analysis pipeline:

In [6]:
# Step 6: Execution workflow

# Workflow
if __name__ == "__main__":
    
    # 1. Access uploaded document
    input_path = '/kaggle/input/court-order/court_order_2.pdf'  # ← Replace with your path
    
    # 2. Process document
    print("🔍 Analyzing document...")
    try:
        # File parsing
        if input_path.lower().endswith('.pdf'):  # Accept .PDF as well as .pdf
            text = parse_pdf(input_path)
        elif input_path.lower().endswith('.docx'):
            text = parse_docx(input_path)
        else:
            raise ValueError("Unsupported file format. Only PDF and DOCX are supported.")
        
        print("\n📄 Extracted Text Preview:\n")
        print(text[:500] + "...")  # Show first 500 characters
        
        # Document analysis
        results = process_document(text)
        
        print("\n🔧 Analysis Results:\n")
        print(json.dumps(results, indent=2))  # Pretty-print JSON
        
    except Exception as e:
        print(f"❌ Error processing document: {e}")
        # Stop the cell here so the save step below never runs without `results`
        raise SystemExit(1)

    # 3. Save results
    output_path = '/kaggle/working/legal_analysis.json'
    with open(output_path, 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\n✅ Analysis saved to {output_path}")

🔍 Analyzing document...

📄 Extracted Text Preview:

(ORDER LIST: 604 U.S.) MONDAY, MARCH 31, 2025 CERTIORARI -- SUMMARY DISPOSITION 24-6415 MORRISSETTE, RAHEEM V. UNITED STATES The motion of petitioner for leave to proceed in forma pauperis and the petition for a writ of certiorari are granted. The judgment is vacated, and the case is remanded to the United States Court of Appeals for the Eleventh Circuit for further consideration in light of United States v. Rahimi, 602 U. S. 680 (2024). ORDERS IN PENDING CASES 24A796 BARKSDALE, TONY V. MARSHALL...


100%|██████████| 1/1 [00:13<00:00, 13.47s/it]


🔧 Analysis Results:

{
  "summary": "This document lists orders from the Supreme Court of the United States, including certiorari grants, denials, and other decisions on pending cases.  A key decision involves vacating a judgment and remanding a case to the Eleventh Circuit in light of *United States v. Rahimi*.  Numerous certiorari petitions are denied. Justice Sotomayor, joined by Justice Jackson, dissents from the denial of certiorari in *Shockley v. Vandergriff*, arguing that a single circuit judge's vote to grant a certificate of appealability should be sufficient for the appeal to proceed, regardless of the majority view of the panel. ",
  "obligations": [
    {
      "docket_number": "24-6725",
      "party": "Yates, Fernando (petitioner)",
      "action": "Pay docketing fee and submit petition compliant with Rule 33.1.",
      "deadline": "2025-04-21",
      "clause_reference": "Rule 38(a)",
      "details": ""
    }
  ],
  "court_decisions": [
    {
      "docket_number": "24




## 📌 Next Steps

1. **Upload Documents**  
   Use Kaggle's "Add Data" feature (right sidebar ➔ "Input" tab ➔ "Add Input" button) to upload legal documents.

2. **Update Paths**  
   Modify `input_path` in Step 6 to match your document's path:  
   `/kaggle/input/[YOUR_DATASET_NAME]/your_document.pdf`

3. **Run All Cells**  
   Execute the full notebook with **Run All**.

4. **Download Results**  
   - **Output Location**: `legal_analysis.json` will appear in:  
     `Right sidebar ➔ Output tab ➔ /kaggle/working/`  
   - **Download Method**:  
     - Click the vertical "⋮" next to the file  
     - Select "Download"  
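
Once downloaded, the file can be inspected with a few lines of Python. The sketch below assumes `legal_analysis.json` sits in the current directory; the helper names `load_analysis` and `sort_obligations` are hypothetical.

```python
import json


def sort_obligations(analysis):
    """Return obligations with the earliest deadline first, undated ones last."""
    obligations = analysis.get("obligations", [])
    # ISO dates (YYYY-MM-DD) sort chronologically when compared as text.
    return sorted(
        obligations,
        key=lambda o: (not o.get("deadline"), o.get("deadline") or ""),
    )


def load_analysis(path="legal_analysis.json"):
    """Read the JSON file written in Step 6 (the path is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example usage after downloading the file:
# for item in sort_obligations(load_analysis()):
#     print(item.get("deadline") or "no deadline", "|", item.get("party"), "|", item.get("action"))
```

Sorting by the text of the deadline only works while every date follows the `YYYY-MM-DD` format requested in the prompt.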