# Document Comparator - Step-by-Step Notebook

This notebook converts `src/document_compare/document_comparator.py` into a procedural, debuggable format.

## What This Does
Compares two PDF documents using an LLM to identify **page-by-page differences**:
- Analyzes content from both documents
- Identifies changes on each page
- Returns structured output as a DataFrame

## Prerequisites
- API keys set: `GOOGLE_API_KEY` and/or `GROQ_API_KEY`
- Config file at `config/config.yaml`
- Two documents (as combined text) to compare

---
## Cell 1: Configuration Placeholders

**Purpose:** Define sample combined document text for testing. Replace with your actual document content.

**External Dependencies:**
- Environment variables: `GOOGLE_API_KEY`, `GROQ_API_KEY`, `LLM_PROVIDER`

**Input Format:** The `combined_docs` string should contain the text of both PDFs, typically formatted as:
```
=== DOCUMENT 1 ===
Page 1: ...
Page 2: ...

=== DOCUMENT 2 ===
Page 1: ...
Page 2: ...
```
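
One way to assemble a string in this format from per-page text is sketched below. `build_combined_docs` is a hypothetical helper, not part of the project; it assumes each document's pages are already available as a list of strings.

```python
def build_combined_docs(doc1_pages: list[str], doc2_pages: list[str]) -> str:
    """Join two lists of page texts into the combined format shown above.

    Hypothetical helper for illustration; not part of the project code.
    """
    def render(label: str, pages: list[str]) -> str:
        # Number pages from 1 so the LLM can cite them in its output.
        body = "\n".join(
            f"Page {number}: {text.strip()}" for number, text in enumerate(pages, start=1)
        )
        return f"=== {label} ===\n{body}"

    return render("DOCUMENT 1", doc1_pages) + "\n\n" + render("DOCUMENT 2", doc2_pages)


combined = build_combined_docs(["Alpha", "Beta"], ["Alpha", "Gamma"])
print(combined)
```

The resulting string can be passed wherever the notebook expects `combined_docs`.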

In [None]:
# ============================================================
# CONFIGURATION PLACEHOLDERS - MODIFY THESE BEFORE RUNNING
# ============================================================

# Sample combined documents for testing (replace with your actual content)
SAMPLE_COMBINED_DOCS = """
=== DOCUMENT 1 (Original Version) ===

Page 1:
Company Policy Manual v1.0
Effective Date: January 1, 2024
Section 1: Employee Benefits
All full-time employees are entitled to 15 days of paid vacation per year.
Health insurance coverage begins after 30 days of employment.

Page 2:
Section 2: Work Hours
Standard work hours are 9:00 AM to 5:00 PM, Monday through Friday.
Overtime requires manager approval.

=== DOCUMENT 2 (Updated Version) ===

Page 1:
Company Policy Manual v2.0
Effective Date: January 1, 2025
Section 1: Employee Benefits
All full-time employees are entitled to 20 days of paid vacation per year.
Health insurance coverage begins immediately upon employment.

Page 2:
Section 2: Work Hours
Standard work hours are 9:00 AM to 5:00 PM, Monday through Friday.
Remote work options available with manager approval.
Overtime requires manager approval.
"""  # <-- REPLACE WITH YOUR COMBINED DOCUMENT CONTENT

print(f"Combined documents preview ({len(SAMPLE_COMBINED_DOCS)} chars):")
print(SAMPLE_COMBINED_DOCS[:300] + "...")

---
## Cell 2: Imports

**Purpose:** Import all required libraries and modules.

**Key Dependencies:**
- `pandas`: For structuring comparison results as DataFrame
- `langchain_core.output_parsers.JsonOutputParser`: Parses LLM output as JSON
- `langchain.output_parsers.OutputFixingParser`: Auto-fixes malformed JSON
- `model.models.SummaryResponse`: Pydantic model for comparison output schema

In [None]:
import sys
from dotenv import load_dotenv
import pandas as pd

# LangChain imports
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import OutputFixingParser

# Project imports
from utils.model_loader import ModelLoader
from logger import GLOBAL_LOGGER as log
from exception.custom_exception import DocumentPortalException
from prompt.prompt_library import PROMPT_REGISTRY
from model.models import SummaryResponse, PromptType

# Load environment variables
load_dotenv()

print("All imports successful!")

---
## Cell 3: Understanding the Output Schema

**Purpose:** Show the expected output structure from the document comparator.

The `SummaryResponse` model expects a list of `ChangeFormat` objects:

| Field | Type | Description |
|-------|------|-------------|
| `Page` | `str` | Page number where change was found |
| `Changes` | `str` | Description of what changed |

In [None]:
from model.models import ChangeFormat

# Display the schema
print("ChangeFormat Schema:")
print("="*60)
for field_name, field_info in ChangeFormat.model_fields.items():
    print(f"  {field_name}: {field_info.annotation}")

print("\nExpected LLM output format:")
print("="*60)
example_output = [
    {"Page": "1", "Changes": "Version updated from v1.0 to v2.0; Vacation days increased from 15 to 20"},
    {"Page": "2", "Changes": "Added remote work options paragraph"}
]
print(example_output)

print("\nAs DataFrame:")
print(pd.DataFrame(example_output))

---
## Cell 4: Load LLM Model

**Purpose:** Initialize the Language Model that will compare documents.

**What happens:**
- `ModelLoader` reads `config/config.yaml` for model settings
- Checks `LLM_PROVIDER` env var (default: "google")
- Returns either `ChatGoogleGenerativeAI` or `ChatGroq`
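
The provider lookup described above can be pictured roughly as follows. This is a sketch of the selection logic only, under the assumption that the provider names are `google` and `groq`; `resolve_provider` is a hypothetical function and the real `ModelLoader` may differ.

```python
import os
from typing import Mapping, Optional

# Assumed provider names, inferred from the two classes listed above.
SUPPORTED_PROVIDERS = ("google", "groq")


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the provider from LLM_PROVIDER, falling back to 'google'."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "google").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    return provider


print(resolve_provider({}))
print(resolve_provider({"LLM_PROVIDER": "Groq"}))
```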

In [None]:
# Load the LLM
try:
    loader = ModelLoader()
    llm = loader.load_llm()
    
    if not llm:
        raise ValueError("LLM could not be loaded")
    
    log.info("LLM loaded successfully", model=llm)
    print(f"LLM loaded: {type(llm).__name__}")
    
except Exception as e:
    print(f"ERROR loading LLM: {e}")
    raise DocumentPortalException("LLM loading error", sys)

---
## Cell 5: Initialize Output Parsers

**Purpose:** Set up parsers to convert LLM text output into structured JSON.

**Two parsers:**
1. **JsonOutputParser**: Primary parser expecting JSON matching `SummaryResponse` schema
2. **OutputFixingParser**: Backup parser that auto-fixes malformed JSON

**Note:** The comparison chain uses the primary `parser` (not fixing_parser) by default.
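
If you want the fixing parser to act as a fallback, the usual pattern is a try/except around the primary parse. The sketch below shows that pattern with plain callables standing in for the two parsers; `parse_with_fallback` and `strip_trailing_commas` are hypothetical and belong to neither LangChain nor the project.

```python
import json
import re
from typing import Any, Callable


def parse_with_fallback(
    text: str,
    primary: Callable[[str], Any],
    fallback: Callable[[str], Any],
) -> Any:
    """Try the primary parser first; call the fallback only if it raises."""
    try:
        return primary(text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return fallback(text)


def strip_trailing_commas(text: str) -> Any:
    """Toy stand-in for a fixing step: drop commas before a closing bracket."""
    return json.loads(re.sub(r",\s*([\]}])", r"\1", text))


malformed = '[{"Page": "1", "Changes": "Version updated"},]'
result = parse_with_fallback(malformed, json.loads, strip_trailing_commas)
print(result)
```

With the notebook's objects you would pass `parser.parse` and `fixing_parser.parse` instead, and widen the `except` clause if the parser raises a different exception type.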

In [None]:
# Initialize parsers
parser = JsonOutputParser(pydantic_object=SummaryResponse)
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)

print("Parsers initialized:")
print("  - JsonOutputParser (expects SummaryResponse schema)")
print("  - OutputFixingParser (available for fixing malformed JSON)")

# Show format instructions
print("\nFormat instructions for LLM:")
print("="*60)
format_instructions = parser.get_format_instructions()
print(format_instructions[:600] + "...")

---
## Cell 6: Load Document Comparison Prompt

**Purpose:** Load the prompt template that tells the LLM how to compare documents.

**The prompt instructs the LLM to:**
1. Compare content from two PDFs
2. Identify differences and note page numbers
3. Provide page-wise comparison
4. Mark unchanged pages as 'NO CHANGE'
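
The chain fills two placeholders in this prompt, `combined_docs` and `format_instruction`. A quick way to see which placeholders a template string expects is sketched below with the standard library; the template text is a made-up stand-in, not the project's prompt.

```python
from string import Formatter

# Hypothetical template text for illustration only.
TEMPLATE = (
    "Compare the two PDFs below and list the differences page by page.\n"
    "Mark pages without differences as 'NO CHANGE'.\n\n"
    "{combined_docs}\n\n"
    "{format_instruction}"
)


def placeholders(template: str) -> list[str]:
    """Return the named fields of a str.format-style template, in order."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


print(placeholders(TEMPLATE))
```

A mismatch between these names and the keys passed to the chain is a common cause of missing-variable errors at invoke time.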

In [None]:
# Load the document comparison prompt
prompt = PROMPT_REGISTRY[PromptType.DOCUMENT_COMPARISON.value]

print("Document Comparison Prompt loaded!")
print("\nPrompt template:")
print("="*60)
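# `.template` exists on PromptTemplate; a ChatPromptTemplate has no such attribute,
# so fall back to printing the prompt object itself.
print(getattr(prompt, "template", prompt))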

---
## Cell 7: Build the Comparison Chain

**Purpose:** Create the LCEL chain that connects prompt → LLM → parser.

**Flow:**
```
Combined Documents + Format Instructions
    ↓
Prompt Template (fills placeholders)
    ↓
LLM (generates JSON comparison)
    ↓
JsonOutputParser (parses to list of dicts)
    ↓
[{"Page": "1", "Changes": "..."}, ...]
```
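
The `|` operator composes the steps left to right, so each step receives the previous step's output. The toy `Step` class below imitates that behavior with plain functions and a canned response, which lets you trace the flow without an API call. It is a stand-in for illustration, not LangChain's implementation.

```python
import json
from typing import Any, Callable


class Step:
    """Minimal stand-in for a chain step that supports `|` composition."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Run this step first, then feed its result into the next one.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


fill_prompt = Step(lambda inputs: f"Compare:\n{inputs['combined_docs']}")
fake_llm = Step(lambda prompt_text: '[{"Page": "1", "Changes": "NO CHANGE"}]')  # canned reply
parse_json = Step(json.loads)

toy_chain = fill_prompt | fake_llm | parse_json
print(toy_chain.invoke({"combined_docs": "Page 1: same text"}))
```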

In [None]:
# Build the comparison chain
comparison_chain = prompt | llm | parser

log.info("DocumentComparatorLLM chain initialized", model=llm)
print("Comparison chain built successfully!")
print("\nChain structure:")
print("  prompt → llm → parser")

---
## Cell 8: Helper Function - Format Response to DataFrame

**Purpose:** Convert the parsed LLM response (list of dicts) into a pandas DataFrame.

**Input:** List of dictionaries with 'Page' and 'Changes' keys

**Output:** pandas DataFrame for easy viewing and export

In [None]:
def format_response(response_parsed: list[dict]) -> pd.DataFrame:
    """
    Convert parsed LLM response into a pandas DataFrame.
    
    Args:
        response_parsed: List of dicts with 'Page' and 'Changes' keys
        
    Returns:
        pd.DataFrame: Formatted comparison results
    """
    try:
        df = pd.DataFrame(response_parsed)
        log.info("Response formatted to DataFrame", rows=len(df))
        return df
    except Exception as e:
        print(f"ERROR formatting response: {e}")
        log.error("Error formatting response into DataFrame", error=str(e))
        raise DocumentPortalException("Error formatting response", sys)

print("format_response function defined.")

---
## Cell 9: Main Function - Compare Documents

**Purpose:** Wrapper function to invoke the comparison chain with proper error handling.

**Parameters:**
- `combined_docs`: String containing content from both documents to compare

**Returns:** pandas DataFrame with page-by-page comparison results

In [None]:
def compare_documents(combined_docs: str) -> pd.DataFrame:
    """
    Compare two documents and return page-wise differences.
    
    Args:
        combined_docs: String containing both documents' content
        
    Returns:
        pd.DataFrame: Comparison results with columns ['Page', 'Changes']
    """
    try:
        inputs = {
            "combined_docs": combined_docs,
            "format_instruction": parser.get_format_instructions()
        }
        
        log.info("Invoking document comparison LLM chain")
        response = comparison_chain.invoke(inputs)
        log.info("Chain invoked successfully", response_preview=str(response)[:200])
        
        return format_response(response)
        
    except Exception as e:
        print(f"ERROR comparing documents: {e}")
        log.error("Error in compare_documents", error=str(e))
        raise DocumentPortalException("Error comparing documents", sys)

print("compare_documents function defined.")

---
## Cell 10: Test - Compare Sample Documents

**Purpose:** Test the comparison chain with the sample documents from Cell 1.

**Expected output:** DataFrame showing page-by-page differences.
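
To focus on pages that actually differ, rows marked 'NO CHANGE' can be dropped. The sketch below works on the list-of-dicts form with sample data; `changed_pages` is a hypothetical helper, and with a DataFrame the equivalent would be a boolean filter on the `Changes` column.

```python
def changed_pages(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only rows whose 'Changes' text is not the 'NO CHANGE' marker."""
    return [row for row in rows if row["Changes"].strip().upper() != "NO CHANGE"]


sample_rows = [
    {"Page": "1", "Changes": "Vacation days increased from 15 to 20"},
    {"Page": "2", "Changes": "NO CHANGE"},
]
print(changed_pages(sample_rows))
```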

In [None]:
# ============================================================
# TEST: Compare Sample Documents
# ============================================================

print("Comparing documents...")
print("="*60)

result_df = compare_documents(SAMPLE_COMBINED_DOCS)

print("\nComparison Results:")
print("="*60)
print(result_df.to_string(index=False))

---
## Cell 11: Test - Compare Documents from Files

**Purpose:** Load and compare documents from actual PDF files.

**Note:** This requires the document loading utilities from your project.
Modify the file paths to point to your actual documents.

In [None]:
# ============================================================
# TEST: Compare Documents from Files
# ============================================================

import os

# File paths (CHANGE THESE)
DOC1_PATH = "data/document_compare/document1.pdf"  # <-- CHANGE THIS
DOC2_PATH = "data/document_compare/document2.pdf"  # <-- CHANGE THIS

# Check if files exist
if not os.path.exists(DOC1_PATH) or not os.path.exists(DOC2_PATH):
    print("Files not found:")
    print(f"  Doc 1: {DOC1_PATH} - {'EXISTS' if os.path.exists(DOC1_PATH) else 'NOT FOUND'}")
    print(f"  Doc 2: {DOC2_PATH} - {'EXISTS' if os.path.exists(DOC2_PATH) else 'NOT FOUND'}")
    print("\nSkipping file-based test. Update paths to test with real files.")
else:
    # You would load documents here using your document loading utilities
    # Example (pseudo-code):
    # from utils.document_ops import load_pdf_text
    # doc1_text = load_pdf_text(DOC1_PATH)
    # doc2_text = load_pdf_text(DOC2_PATH)
    # combined = f"=== DOCUMENT 1 ===\n{doc1_text}\n\n=== DOCUMENT 2 ===\n{doc2_text}"
    # result = compare_documents(combined)
    print("Files found! Implement document loading to test.")

---
## Cell 12: Debug - Inspect Raw LLM Response

**Purpose:** See the raw LLM output before parsing, useful for debugging JSON issues.

This bypasses the parser to show exactly what the LLM returns.
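
A common reason a raw response fails `json.loads` is that the model wraps the JSON in a Markdown code fence. The sketch below strips such a fence before parsing; `strip_code_fence` is a hypothetical helper and the sample response is made up.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence (optionally tagged json) if present."""
    pattern = rf"^\s*{FENCE}(?:json)?\s*\n(.*?)\n?\s*{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()


raw = f'{FENCE}json\n[{{"Page": "1", "Changes": "NO CHANGE"}}]\n{FENCE}'
print(json.loads(strip_code_fence(raw)))
```

Text without a fence passes through with only surrounding whitespace removed, so the helper is safe to apply to every response.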

In [None]:
# ============================================================
# DEBUG: Inspect Raw LLM Response
# ============================================================

from langchain_core.output_parsers import StrOutputParser

# Chain without parser (raw output)
raw_chain = prompt | llm | StrOutputParser()

print("Getting raw LLM response (before parsing)...")
print("="*60)

raw_response = raw_chain.invoke({
    "combined_docs": SAMPLE_COMBINED_DOCS,
    "format_instruction": parser.get_format_instructions()
})

print("\nRaw LLM Response:")
print("-"*60)
print(raw_response)
print("-"*60)

# Try parsing manually
print("\nAttempting to parse...")
import json

try:
    parsed = json.loads(raw_response)
    print("✓ Valid JSON!")
    print(f"Number of changes found: {len(parsed)}")
except json.JSONDecodeError as e:
    print(f"✗ Invalid JSON: {e}")
    print("\nTrying with OutputFixingParser...")
    fixed = fixing_parser.parse(raw_response)
    print(f"Fixed result: {fixed}")

---
## Cell 13: Debug - Test with Different Document Pairs

**Purpose:** Test the comparator with different types of document changes.

In [None]:
# ============================================================
# DEBUG: Test Different Document Pairs
# ============================================================

test_pairs = {
    "Minor Changes": """
=== DOCUMENT 1 ===
Page 1: The quick brown fox jumps over the lazy dog.

=== DOCUMENT 2 ===
Page 1: The quick red fox jumps over the lazy dog.
""",
    
    "Major Rewrite": """
=== DOCUMENT 1 ===
Page 1: Introduction to Machine Learning
This chapter covers basic concepts.

=== DOCUMENT 2 ===
Page 1: Deep Learning Fundamentals
This chapter has been completely rewritten to focus on neural networks.
""",
    
    "Identical Documents": """
=== DOCUMENT 1 ===
Page 1: Same content in both documents.

=== DOCUMENT 2 ===
Page 1: Same content in both documents.
"""
}

for test_name, combined_docs in test_pairs.items():
    print(f"\n{'='*60}")
    print(f"Testing: {test_name}")
    print(f"{'='*60}")
    
    try:
        result = compare_documents(combined_docs)
        print(result.to_string(index=False))
    except Exception as e:
        print(f"Error: {e}")

---
## Cell 14: Export Results to CSV/Excel

**Purpose:** Save the comparison results to a file for reporting.

In [None]:
# ============================================================
# EXPORT: Save Results to File
# ============================================================

import os
from datetime import datetime

OUTPUT_DIR = "data/document_compare"  # <-- CHANGE THIS
OUTPUT_CSV = os.path.join(OUTPUT_DIR, "comparison_result.csv")
OUTPUT_EXCEL = os.path.join(OUTPUT_DIR, "comparison_result.xlsx")

# Run comparison
result_df = compare_documents(SAMPLE_COMBINED_DOCS)

# Add metadata
result_df['Comparison_Date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Ensure directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save to CSV
result_df.to_csv(OUTPUT_CSV, index=False)
print(f"Results saved to CSV: {OUTPUT_CSV}")

# Save to Excel (if openpyxl is installed)
try:
    result_df.to_excel(OUTPUT_EXCEL, index=False)
    print(f"Results saved to Excel: {OUTPUT_EXCEL}")
except ImportError:
    print("Note: Install openpyxl to export to Excel: pip install openpyxl")

print("\nExported DataFrame:")
print(result_df)

---
## Summary

### Variables Persisting Between Cells

| Variable | Type | Description |
|----------|------|-------------|
| `llm` | `ChatGoogleGenerativeAI` or `ChatGroq` | Language model instance |
| `parser` | JsonOutputParser | Parses LLM output to JSON |
| `fixing_parser` | OutputFixingParser | Auto-fixes malformed JSON |
| `prompt` | ChatPromptTemplate | Document comparison prompt template |
| `comparison_chain` | RunnableSequence | Complete comparison pipeline |

### Functions

| Function | Purpose |
|----------|--------|
| `format_response(response_parsed)` | Convert list of dicts to DataFrame |
| `compare_documents(combined_docs)` | Main comparison function |

### Output Schema

```python
# The parser returns a list of dicts:
[{"Page": "1", "Changes": "Description of changes"},
 {"Page": "2", "Changes": "NO CHANGE"},
 ...]

# Converted to DataFrame:
#    Page                      Changes
# 0     1    Description of changes
# 1     2                  NO CHANGE
```