# Document Analyzer - Step-by-Step Notebook

This notebook converts `src/document_analyzer/data_analysis.py` into a procedural, debuggable format.

## What This Does
Analyzes documents using an LLM to extract **structured metadata**:
- Title, Author, Publisher
- Date Created, Last Modified
- Language, Page Count
- Summary (list of key points)
- Sentiment/Tone

## Prerequisites
- API keys set: `GOOGLE_API_KEY` and/or `GROQ_API_KEY`
- Config file at `config/config.yaml`
- A document (text) to analyze

---
## Cell 1: Configuration Placeholders

**Purpose:** Define sample document text for testing. Replace with your actual document content.

**External Dependencies:**
- Environment variables: `GOOGLE_API_KEY`, `GROQ_API_KEY`, `LLM_PROVIDER`

In [None]:
# ============================================================
# CONFIGURATION PLACEHOLDERS - MODIFY THESE BEFORE RUNNING
# ============================================================

# Sample document text for testing (replace with your actual document)
SAMPLE_DOCUMENT_TEXT = """
Annual Report 2025
By: John Smith, Jane Doe
Published by: Acme Corporation
Date: January 15, 2025

Executive Summary:
This report presents the financial performance and strategic initiatives 
of Acme Corporation for the fiscal year 2025. Key highlights include 
a 15% increase in revenue, expansion into three new markets, and the 
successful launch of our flagship product line.

The company demonstrated strong growth despite challenging market conditions.
Our commitment to innovation and customer satisfaction continues to drive 
our success in the competitive landscape.

Looking ahead, we plan to invest heavily in R&D and sustainable practices
to ensure long-term value creation for our stakeholders.
"""  # <-- REPLACE WITH YOUR DOCUMENT TEXT

# Or load from a file:
# DOCUMENT_FILE_PATH = "path/to/your/document.txt"  # <-- UNCOMMENT AND SET

print(f"Document preview ({len(SAMPLE_DOCUMENT_TEXT)} chars):")
print(SAMPLE_DOCUMENT_TEXT[:200] + "...")

---
## Cell 2: Imports

**Purpose:** Import all required libraries and modules.

**Key Dependencies:**
- `langchain_core.output_parsers.JsonOutputParser`: Parses LLM output as JSON
- `langchain.output_parsers.OutputFixingParser`: Auto-fixes malformed JSON responses
- `utils.model_loader.ModelLoader`: Loads LLM from config
- `model.models.Metadata`: Pydantic model defining expected output structure

In [None]:
import os
import sys
from typing import List, Union

# LangChain imports
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import OutputFixingParser

# Project imports
from utils.model_loader import ModelLoader
from logger import GLOBAL_LOGGER as log
from exception.custom_exception import DocumentPortalException
from model.models import Metadata
from prompt.prompt_library import PROMPT_REGISTRY

print("All imports successful!")

---
## Cell 3: Understanding the Metadata Schema

**Purpose:** Show the expected output structure from the document analyzer.

The `Metadata` Pydantic model defines what fields the LLM should extract:

| Field | Type | Description |
|-------|------|-------------|
| `Summary` | `List[str]` | Key points from the document |
| `Title` | `str` | Document title |
| `Author` | `List[str]` | Author names |
| `DateCreated` | `str` | Creation date |
| `LastModifiedDate` | `str` | Last modification date |
| `Publisher` | `str` | Publisher name |
| `Language` | `str` | Document language |
| `PageCount` | `int \| str` | Number of pages (or "Not Available") |
| `SentimentTone` | `str` | Overall tone/sentiment |

In [None]:
# Display the Metadata schema
print("Metadata Schema:")
print("="*60)
for field_name, field_info in Metadata.model_fields.items():
    print(f"  {field_name}: {field_info.annotation}")

print("\nExample output:")
example = Metadata(
    Summary=["Key point 1", "Key point 2"],
    Title="Sample Document",
    Author=["John Doe"],
    DateCreated="2025-01-01",
    LastModifiedDate="2025-01-15",
    Publisher="Example Corp",
    Language="English",
    PageCount=10,
    SentimentTone="Professional"
)
print(example.model_dump_json(indent=2))

---
## Cell 4: Load LLM Model

**Purpose:** Initialize the Language Model that will analyze documents.

**What happens:**
- `ModelLoader` reads `config/config.yaml` for model settings
- Checks `LLM_PROVIDER` env var (default: "google")
- Returns either `ChatGoogleGenerativeAI` or `ChatGroq`
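
The provider selection described above might look roughly like the sketch below. This is not the project's `ModelLoader`: `resolve_provider` and `SUPPORTED_PROVIDERS` are hypothetical names, and the `"groq"` key is an assumption. Only the `LLM_PROVIDER` variable, its `"google"` default, and the two class names come from this notebook.

```python
import os

# Hypothetical mapping from provider name to the chat model class named above.
# The "groq" key is an assumption; check config/config.yaml for the real values.
SUPPORTED_PROVIDERS = {
    "google": "ChatGoogleGenerativeAI",
    "groq": "ChatGroq",
}


def resolve_provider(env=None) -> str:
    """Return the chat model class name for the configured provider."""
    env = os.environ if env is None else env
    # Fall back to "google" when LLM_PROVIDER is not set
    provider = env.get("LLM_PROVIDER", "google").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    return SUPPORTED_PROVIDERS[provider]
```

Passing a plain dict as `env` makes it possible to check the selection logic without touching real environment variables.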

In [None]:
# Load the LLM
try:
    loader = ModelLoader()
    llm = loader.load_llm()
    
    if not llm:
        raise ValueError("LLM could not be loaded")
    
    log.info("LLM loaded successfully")
    print(f"LLM loaded: {type(llm).__name__}")
    
except Exception as e:
    print(f"ERROR loading LLM: {e}")
    # Chain the original exception so the root cause stays in the traceback
    raise DocumentPortalException("LLM loading error", sys) from e

---
## Cell 5: Initialize Output Parsers

**Purpose:** Set up parsers to convert LLM text output into structured JSON.

**Two parsers work together:**

1. **JsonOutputParser**: Primary parser. It expects the LLM to return JSON matching the `Metadata` schema and also supplies the format instructions for the prompt.
2. **OutputFixingParser**: Wraps the primary parser. When parsing fails, it sends the malformed output and the error back to the LLM and asks for a corrected version.

**Why both?** LLMs sometimes return slightly malformed JSON (missing quotes, trailing commas). The fixing parser catches these parse errors and retries, at the cost of one extra LLM call per failure.
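
The parse-then-repair idea can be sketched with the standard library alone. This is an illustration of the pattern, not LangChain's implementation: `parse_with_fix` is a hypothetical helper, and `drop_trailing_commas` is a toy stand-in for the LLM repair call.

```python
import json
import re
from typing import Callable


def parse_with_fix(text: str, fix: Callable[[str, str], str], max_retries: int = 1) -> dict:
    """Parse JSON; on failure, ask `fix` for a corrected string and retry."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise  # give up once the retries are used
            text = fix(text, str(err))


def drop_trailing_commas(text: str, error: str) -> str:
    """Toy repair step: remove commas that directly precede } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", text)


malformed = '{"Title": "A", "Author": ["B"],}'
repaired = parse_with_fix(malformed, drop_trailing_commas)
```

In the real chain the repair step is an LLM call, so each failed parse costs an extra request.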

In [None]:
# Initialize parsers
parser = JsonOutputParser(pydantic_object=Metadata)
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)

print("Parsers initialized:")
print("  - JsonOutputParser (expects Metadata schema)")
print("  - OutputFixingParser (auto-fixes malformed JSON)")

# Show the format instructions that will be sent to the LLM
print("\nFormat instructions for LLM:")
print("="*60)
format_instructions = parser.get_format_instructions()
print(format_instructions[:500] + "...")

---
## Cell 6: Load Document Analysis Prompt

**Purpose:** Load the prompt template that tells the LLM how to analyze documents.

**The prompt includes:**
- Instructions for document analysis
- Required output format (JSON schema)
- Placeholder for the document text
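
As a rough picture of how the two placeholders are filled, the sketch below uses plain `str.format`. The template text is hypothetical (the real one lives in `prompt.prompt_library`); only the placeholder names `format_instructions` and `document_text` come from this notebook.

```python
# Hypothetical template text; the real prompt comes from PROMPT_REGISTRY.
TEMPLATE = (
    "You are a document analyst. Return only JSON that follows the format below.\n\n"
    "{format_instructions}\n\n"
    "Document:\n"
    "{document_text}\n"
)


def render_prompt(document_text: str, format_instructions: str) -> str:
    """Fill the two placeholders; braces inside the arguments are left untouched."""
    return TEMPLATE.format(
        format_instructions=format_instructions,
        document_text=document_text,
    )


rendered = render_prompt("Annual Report 2025", 'Return an object like {"Title": "..."}')
print(rendered)
```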

In [None]:
# Load the document analysis prompt
prompt = PROMPT_REGISTRY["document_analysis"]

print("Document Analysis Prompt loaded!")
print("\nPrompt template:")
print("="*60)
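# PromptTemplate exposes `.template`; chat prompt templates do not, so fall back to the repr
print(getattr(prompt, "template", prompt))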

---
## Cell 7: Build the Analysis Chain

**Purpose:** Create the LCEL chain that connects prompt → LLM → parser.

**Flow:**
```
Document Text + Format Instructions
    ↓
Prompt Template (fills placeholders)
    ↓
LLM (generates JSON response)
    ↓
OutputFixingParser (parses & fixes JSON)
    ↓
Python Dict (Metadata fields)
```
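
To make the pipe syntax less opaque, here is a toy model of how `a | b | c` can compose steps. It is a sketch of the idea only, not how LangChain implements runnables; `Step`, `fill_prompt`, `fake_llm`, and `parse_json` are all hypothetical stand-ins.

```python
import json
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # The combined step feeds this step's output into the next one
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.fn(value)


fill_prompt = Step(lambda inputs: f"Analyze: {inputs['document_text']}")
fake_llm = Step(lambda prompt_text: '{"Title": "Annual Report 2025"}')  # canned reply
parse_json = Step(json.loads)

toy_chain = fill_prompt | fake_llm | parse_json
toy_result = toy_chain.invoke({"document_text": "Annual Report 2025 ..."})
```

Each stage only needs to accept the previous stage's output, which is why the parser can be swapped for `StrOutputParser` when debugging.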

In [None]:
# Build the analysis chain
analysis_chain = prompt | llm | fixing_parser

log.info("Document analysis chain initialized")
print("Analysis chain built successfully!")
print("\nChain structure:")
print("  prompt → llm → fixing_parser")

---
## Cell 8: Helper Function - Analyze Document

**Purpose:** Wrapper function to invoke the analysis chain with proper error handling.

**Parameters:**
- `document_text`: The full text content of the document to analyze

**Returns:** Dictionary with extracted metadata fields

In [None]:
def analyze_document(document_text: str) -> dict:
    """
    Analyze a document's text and extract structured metadata & summary.
    
    Args:
        document_text: The full text content of the document
        
    Returns:
        dict: Extracted metadata with keys:
            - Summary, Title, Author, DateCreated, LastModifiedDate,
            - Publisher, Language, PageCount, SentimentTone
    """
    try:
        log.info("Starting document analysis", doc_length=len(document_text))
        
        response = analysis_chain.invoke({
            "format_instructions": parser.get_format_instructions(),
            "document_text": document_text
        })
        
        log.info("Metadata extraction successful", keys=list(response.keys()))
        return response
        
    except Exception as e:
        print(f"ERROR during analysis: {e}")
        log.error("Metadata analysis failed", error=str(e))
        # Chain the original exception so the root cause stays in the traceback
        raise DocumentPortalException("Metadata extraction failed", sys) from e

print("analyze_document function defined.")

---
## Cell 9: Test - Analyze Sample Document

**Purpose:** Test the analysis chain with the sample document from Cell 1.

**Expected output:** A dictionary containing all Metadata fields.
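
Because `analyze_document` returns a plain dict, a field can be missing or have an unexpected type. A small check like the one below can flag that early. It is a sketch: `EXPECTED_FIELDS` mirrors the schema table in Cell 3, and `find_schema_problems` is a hypothetical helper, not part of the project.

```python
# Field names and types copied from the schema table in Cell 3.
EXPECTED_FIELDS = {
    "Summary": list,
    "Title": str,
    "Author": list,
    "DateCreated": str,
    "LastModifiedDate": str,
    "Publisher": str,
    "Language": str,
    "PageCount": (int, str),
    "SentimentTone": str,
}


def find_schema_problems(result: dict) -> list[str]:
    """Return a list of missing or wrongly typed fields (empty when all is well)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in result:
            problems.append(f"missing: {name}")
        elif not isinstance(result[name], expected_type):
            problems.append(f"wrong type: {name}")
    return problems
```

Calling `find_schema_problems(result)` after the test cell gives a quick list of anything the LLM left out.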

In [None]:
# ============================================================
# TEST: Analyze Sample Document
# ============================================================

print("Analyzing document...")
print("="*60)

result = analyze_document(SAMPLE_DOCUMENT_TEXT)

print("\nExtracted Metadata:")
print("="*60)
for key, value in result.items():
    if isinstance(value, list):
        print(f"\n{key}:")
        for item in value:
            print(f"  - {item}")
    else:
        print(f"{key}: {value}")

---
## Cell 10: Test - Load and Analyze from File

**Purpose:** Analyze a document loaded from a file path.

**Modify `DOCUMENT_FILE_PATH` to point to your actual document.**
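
The cell below cuts long files at a fixed 10,000 characters, which can split a word in half. A slightly gentler variant, sketched here under the assumption that cutting at the last space is acceptable, looks like this (`truncate_text` is a hypothetical helper):

```python
def truncate_text(text: str, limit: int = 10_000) -> tuple[str, bool]:
    """Return (text cut to at most `limit` chars, whether it was truncated)."""
    if len(text) <= limit:
        return text, False
    # Prefer the last space before the limit so no word is split
    cut = text.rfind(" ", 0, limit)
    if cut <= 0:
        cut = limit  # no usable space found: fall back to a hard cut
    return text[:cut], True
```

The boolean makes it easy to print the "first N characters only" note only when truncation actually happened.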

In [None]:
# ============================================================
# TEST: Load and Analyze from File
# ============================================================

DOCUMENT_FILE_PATH = "data/multi_doc_chat/state_of_the_union.txt"  # <-- CHANGE THIS

try:
    # Check if file exists
    if not os.path.exists(DOCUMENT_FILE_PATH):
        print(f"File not found: {DOCUMENT_FILE_PATH}")
        print("Skipping file-based test. Update DOCUMENT_FILE_PATH to test.")
    else:
        # Load document text from file
        with open(DOCUMENT_FILE_PATH, 'r', encoding='utf-8') as f:
            file_content = f.read()
        
        print(f"Loaded document: {DOCUMENT_FILE_PATH}")
        print(f"Document length: {len(file_content)} characters")
        print("="*60)
        
        # Analyze (truncate if too long for demo)
        content_to_analyze = file_content[:10000]  # First 10K chars for testing
        if len(file_content) > 10000:
            print("Note: Analyzing first 10,000 characters only")
        
        file_result = analyze_document(content_to_analyze)
        
        print("\nExtracted Metadata:")
        print("="*60)
        for key, value in file_result.items():
            if isinstance(value, list):
                print(f"\n{key}:")
                for item in value:
                    print(f"  - {item}")
            else:
                print(f"{key}: {value}")

except Exception as e:
    print(f"Error analyzing file: {e}")

---
## Cell 11: Debug - Inspect Raw LLM Response

**Purpose:** Show the raw LLM output before parsing, which helps when debugging JSON issues.

This bypasses the fixing parser to show exactly what the LLM returns.
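
Raw responses are often wrapped in a Markdown code fence, which makes a plain `json.loads` fail even when the JSON inside is fine. The sketch below strips such a fence before parsing; `strip_code_fence` is a hypothetical helper, and the pattern only covers the simple case of a single fence around the whole response.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep this example readable
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def strip_code_fence(raw: str) -> str:
    """Return the text inside a surrounding code fence, or the stripped input."""
    text = raw.strip()
    match = FENCE_RE.match(text)
    return match.group(1) if match else text


fenced = f'{FENCE}json\n{{"Title": "Annual Report 2025"}}\n{FENCE}'
parsed_debug = json.loads(strip_code_fence(fenced))
```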

In [None]:
# ============================================================
# DEBUG: Inspect Raw LLM Response
# ============================================================

from langchain_core.output_parsers import StrOutputParser

# Chain without fixing parser (raw output)
raw_chain = prompt | llm | StrOutputParser()

print("Getting raw LLM response (before parsing)...")
print("="*60)

raw_response = raw_chain.invoke({
    "format_instructions": parser.get_format_instructions(),
    "document_text": SAMPLE_DOCUMENT_TEXT[:500]  # Use short text for debug
})

print("\nRaw LLM Response:")
print("-"*60)
print(raw_response)
print("-"*60)

# Try parsing manually
import json

print("\nAttempting to parse...")
try:
    parsed = json.loads(raw_response)
    print("✓ Valid JSON!")
    print(f"Keys: {list(parsed.keys())}")
except json.JSONDecodeError as e:
    print(f"✗ Not bare JSON: {e}")
    print("JsonOutputParser can still parse JSON wrapped in a Markdown code fence;")
    print("otherwise the OutputFixingParser would attempt to fix this.")

---
## Cell 12: Debug - Test with Different Document Types

**Purpose:** Test the analyzer with different types of documents to see how it handles various content.

In [None]:
# ============================================================
# DEBUG: Test Different Document Types
# ============================================================

test_documents = {
    "Technical Doc": """
        Technical Specification v2.1
        Author: Engineering Team
        Last Updated: 2025-02-01
        
        This document outlines the API specifications for the 
        Document Portal system. Key endpoints include /analyze, 
        /compare, and /chat. Authentication uses JWT tokens.
    """,
    
    "News Article": """
        Breaking: Tech Company Announces Record Profits
        By Sarah Johnson, Tech Times
        February 5, 2025
        
        In a stunning quarterly report, the company exceeded 
        analyst expectations with 40% year-over-year growth.
        CEO stated optimism about future AI investments.
    """,
    
    "Minimal Text": "Just a short note about meetings tomorrow."
}

for doc_type, content in test_documents.items():
    print(f"\n{'='*60}")
    print(f"Testing: {doc_type}")
    print(f"{'='*60}")
    
    try:
        result = analyze_document(content)
        print(f"Title: {result.get('Title', 'N/A')}")
        print(f"Author: {result.get('Author', 'N/A')}")
        print(f"Sentiment: {result.get('SentimentTone', 'N/A')}")
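        # Guard against a missing, empty, or non-list Summary
        summary = result.get('Summary') or ['N/A']
        first_point = str(summary[0]) if isinstance(summary, list) else str(summary)
        print(f"Summary: {first_point[:100]}...")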
    except Exception as e:
        print(f"Error: {e}")

---
## Cell 13: Export Results to JSON

**Purpose:** Save the analysis results to a JSON file for later use.
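
Once the cell below has written the file, it can be read back with the standard library. The sketch assumes the layout this cell produces (`analysis_timestamp`, `document_length`, `results`); `load_analysis` is a hypothetical helper name.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"analysis_timestamp", "document_length", "results"}


def load_analysis(path: str) -> dict:
    """Load a saved analysis file and return its `results` dict."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise KeyError(f"Missing keys in {path}: {sorted(missing)}")
    return data["results"]
```

Checking the top-level keys first gives a clearer error than a bare `KeyError` if the file came from a different tool.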

In [None]:
# ============================================================
# EXPORT: Save Results to JSON
# ============================================================

import json
from datetime import datetime

OUTPUT_PATH = "data/document_analysis/analysis_result.json"  # <-- CHANGE THIS

# Run analysis
result = analyze_document(SAMPLE_DOCUMENT_TEXT)

# Add metadata about the analysis
output = {
    "analysis_timestamp": datetime.now().isoformat(),
    "document_length": len(SAMPLE_DOCUMENT_TEXT),
    "results": result
}

# Ensure directory exists (dirname is empty when OUTPUT_PATH has no directory part)
output_dir = os.path.dirname(OUTPUT_PATH)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

# Save to file
with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

print(f"Results saved to: {OUTPUT_PATH}")
print("\nFile contents:")
print(json.dumps(output, indent=2, ensure_ascii=False))  # match the saved file

---
## Summary

### Variables Persisting Between Cells

| Variable | Type | Description |
|----------|------|-------------|
| `llm` | `ChatGoogleGenerativeAI` or `ChatGroq` | Language model instance |
| `parser` | JsonOutputParser | Parses LLM output to JSON |
| `fixing_parser` | OutputFixingParser | Auto-fixes malformed JSON |
| `prompt` | ChatPromptTemplate | Document analysis prompt template |
| `analysis_chain` | RunnableSequence | Complete analysis pipeline (`prompt | llm | fixing_parser`) |

### Functions

| Function | Purpose |
|----------|--------|
| `analyze_document(text)` | Analyze document and return metadata dict |

### Output Schema (Metadata)

```json
{
  "Summary": ["Point 1", "Point 2"],
  "Title": "Document Title",
  "Author": ["Author Name"],
  "DateCreated": "2025-01-01",
  "LastModifiedDate": "2025-01-15",
  "Publisher": "Publisher Name",
  "Language": "English",
  "PageCount": 10,
  "SentimentTone": "Professional"
}
```