# Experiments: document_comparator

**Original File:** `src/document_compare/document_comparator.py`

## Purpose
This module compares documents with an LLM. It takes the combined text of several documents and returns structured results that highlight similarities, differences, and key insights.

## Key Components
- **DocumentComparatorLLM class**: Main comparator; uses an LLM to compare documents
  - `compare_documents()`: Compare combined document text and return structured results
  - `_format_response()`: Convert comparison results to pandas DataFrame
  - Uses `JsonOutputParser` for structured output

## Prerequisites
- `langchain`, `langchain-core`, `pandas`, `python-dotenv`, `pydantic` installed
- Environment variables configured (API keys for LLM)
- `SummaryResponse` Pydantic model in `model/models.py`
- Comparison prompt template in `prompt/prompt_library.py`

## Instructions & Setup Guide

### Execution Order
1. Run the imports cell
2. Review the DocumentComparatorLLM class definition
3. Initialize the comparator
4. Prepare document text (combine documents to compare)
5. Call `compare_documents()` with the combined text

### Dependencies
```bash
pip install langchain langchain-core pandas python-dotenv pydantic
```

### Configuration
- Ensure `.env` file contains `GROQ_API_KEY` and/or `GOOGLE_API_KEY`
- The `SummaryResponse` Pydantic model defines the comparison output structure
- Run from project root directory for proper imports

## 1. Imports and Dependencies

Import all required modules for document comparison.

In [None]:
import sys
from dotenv import load_dotenv
import pandas as pd

# LangChain imports
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import OutputFixingParser

# Project imports
from utils.model_loader import ModelLoader
from logger import GLOBAL_LOGGER as log
from exception.custom_exception import DocumentPortalException
from prompt.prompt_library import PROMPT_REGISTRY
from model.models import SummaryResponse, PromptType

# Load environment variables
load_dotenv()

print("All imports successful!")

## 2. DocumentComparatorLLM Class Definition

The main class for comparing documents using LLM. Key features:
- **JsonOutputParser**: Parses LLM output into structured format
- **OutputFixingParser**: Constructed in `__init__` for repairing malformed JSON, but not part of the chain as written
- **DataFrame output**: Results formatted as pandas DataFrame for easy analysis

In [None]:
class DocumentComparatorLLM:
    """
    LLM-based document comparator that analyzes and compares multiple documents.
    
    The comparator:
    - Takes combined document text as input
    - Uses LLM to identify similarities and differences
    - Returns structured comparison results as a DataFrame
    """
    
    def __init__(self):
        # Load environment variables
        load_dotenv()
        
        # Initialize model loader and LLM
        self.loader = ModelLoader()
        self.llm = self.loader.load_llm()
        
        # Set up JSON parser with SummaryResponse Pydantic model
        self.parser = JsonOutputParser(pydantic_object=SummaryResponse)
        
        # OutputFixingParser for repairing malformed JSON
        # (constructed here, but not wired into the chain below)
        self.fixing_parser = OutputFixingParser.from_llm(
            parser=self.parser, 
            llm=self.llm
        )
        
        # Load comparison prompt template
        self.prompt = PROMPT_REGISTRY[PromptType.DOCUMENT_COMPARISON.value]
        
        # Build the LCEL chain: prompt -> LLM -> parser
        self.chain = self.prompt | self.llm | self.parser
        
        log.info("DocumentComparatorLLM initialized", model=self.llm)

print("DocumentComparatorLLM class defined successfully!")

## 3. Document Comparison Methods

The `compare_documents()` method processes combined document text and returns comparison results.

### How it works:
1. Takes combined document text (multiple docs concatenated with labels)
2. Invokes the LCEL chain with the text and format instructions
3. LLM analyzes and compares the documents
4. Results are formatted into a pandas DataFrame
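
Step 4 depends on the parser having already turned the model's raw text into Python objects. The block below is a rough, standard-library illustration of that kind of parsing step. It is not LangChain's implementation; `extract_json` is a hypothetical helper, and the reply text and field names are made up for the example.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def extract_json(raw_reply: str):
    """Parse a JSON value from a model reply that may be wrapped in a Markdown fence."""
    # Prefer the contents of a fenced block (optionally tagged "json") if one exists.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, raw_reply, flags=re.DOTALL)
    payload = match.group(1) if match else raw_reply
    return json.loads(payload.strip())


# Hypothetical model reply; the keys are illustrative, not the SummaryResponse schema.
reply = (
    "Here is the comparison:\n"
    + FENCE + "json\n"
    + '[{"aspect": "Annual Leave", "change": "20 to 25 days"}]\n'
    + FENCE
)

rows = extract_json(reply)
print(rows[0]["aspect"])
```

A reply with no fence is passed to `json.loads` unchanged, so plain JSON output works too.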

In [None]:
def compare_documents(self, combined_docs: str) -> pd.DataFrame:
    """
    Compare documents and return structured comparison results.
    
    Args:
        combined_docs: Combined text of all documents to compare,
                      typically with document labels/separators
    
    Returns:
        pd.DataFrame: Comparison results with columns for aspects,
                     document details, similarities, and differences
    
    Raises:
        DocumentPortalException: If comparison fails
    """
    try:
        # Prepare inputs for the chain
        inputs = {
            "combined_docs": combined_docs,
            "format_instruction": self.parser.get_format_instructions()
        }

        log.info("Invoking document comparison LLM chain")
        
        # Invoke the chain
        response = self.chain.invoke(inputs)
        
        log.info("Chain invoked successfully", response_preview=str(response)[:200])
        
        # Format and return as DataFrame
        return self._format_response(response)
        
    except Exception as e:
        log.error("Error in compare_documents", error=str(e))
        # Chain the original exception so the root cause stays in the traceback
        raise DocumentPortalException("Error comparing documents", sys) from e

# Attach to class
DocumentComparatorLLM.compare_documents = compare_documents
print("compare_documents method added!")

In [None]:
def _format_response(self, response_parsed: list[dict]) -> pd.DataFrame:
    """
    Format the parsed response into a pandas DataFrame.
    
    Args:
        response_parsed: List of dictionaries with comparison results
    
    Returns:
        pd.DataFrame: Formatted comparison results
    """
    try:
        df = pd.DataFrame(response_parsed)
        return df
    except Exception as e:
        log.error("Error formatting response into DataFrame", error=str(e))
        # Chain the original exception so the root cause stays in the traceback
        raise DocumentPortalException("Error formatting response", sys) from e

# Attach to class
DocumentComparatorLLM._format_response = _format_response
print("_format_response method added!")
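
`_format_response` is annotated as taking `list[dict]`, but a JSON parser returns whatever shape the model produced, so a single object is possible. Passing a dict of scalar values straight to `pd.DataFrame` raises a `ValueError`. Below is a minimal sketch of a guard that could run before the DataFrame is built. `normalize_rows` is a hypothetical helper, not part of the original module, and the field names are illustrative.

```python
def normalize_rows(parsed):
    """Return a list of row dicts from a parsed reply that may be a dict or a list."""
    if isinstance(parsed, dict):
        # A single JSON object becomes a one-row table.
        return [parsed]
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        return parsed
    raise TypeError(
        f"Expected a dict or a list of dicts, got {type(parsed).__name__}"
    )


# Hypothetical parsed replies
single = {"aspect": "Carry Over", "change": "5 to 10 days"}
many = [single, {"aspect": "Annual Leave", "change": "20 to 25 days"}]

print(len(normalize_rows(single)))
print(len(normalize_rows(many)))
```

With a guard like this, `pd.DataFrame(normalize_rows(response_parsed))` would accept both shapes and fail with a clear message on anything else.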

## 4. Understanding the SummaryResponse Model

Let's examine what the `SummaryResponse` Pydantic model looks like. This defines the structure of the comparison output.

In [None]:
# Print the SummaryResponse model schema
from model.models import SummaryResponse

print("SummaryResponse Model Schema:")
print("-" * 50)
print(SummaryResponse.model_json_schema())

In [None]:
# View the format instructions that are sent to the LLM
parser = JsonOutputParser(pydantic_object=SummaryResponse)
print("Format Instructions for LLM:")
print("-" * 50)
print(parser.get_format_instructions())

## 5. Usage Example

Demonstrate how to use the DocumentComparatorLLM with sample documents.

In [None]:
# Initialize the comparator
comparator = DocumentComparatorLLM()
print("Comparator initialized!")

In [None]:
# Sample documents for comparison
document_1 = """
Document: Company_Policy_2024.pdf

Employee Vacation Policy

1. Annual Leave: Employees are entitled to 20 days of paid vacation per year.
2. Carry Over: Up to 5 unused days can be carried to the next year.
3. Request Process: Submit vacation requests at least 2 weeks in advance.
4. Approval: Manager approval required for all vacation requests.
5. Blackout Dates: December 15-31 are company blackout dates.
"""

document_2 = """
Document: Company_Policy_2025.pdf

Employee Vacation Policy

1. Annual Leave: Employees are entitled to 25 days of paid vacation per year.
2. Carry Over: Up to 10 unused days can be carried to the next year.
3. Request Process: Submit vacation requests at least 1 week in advance.
4. Approval: Manager approval required for requests over 5 consecutive days.
5. Flexible Holidays: Employees can choose 3 floating holidays.
"""

# Combine documents for comparison
combined_docs = f"{document_1}\n\n---\n\n{document_2}"

print("Documents prepared for comparison")
print(f"Total length: {len(combined_docs)} characters")
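
The f-string above joins exactly two documents. For an arbitrary number of documents, the same labeling and `---` separator can be applied in a loop. This is a sketch only: `combine_labeled_docs` is a hypothetical helper, and the `Document: <name>` label simply mirrors the sample text above.

```python
def combine_labeled_docs(docs: dict[str, str], separator: str = "\n\n---\n\n") -> str:
    """Join documents into one string, labeling each with its file name."""
    sections = [
        f"Document: {name}\n\n{text.strip()}"  # strip stray leading/trailing whitespace
        for name, text in docs.items()
    ]
    return separator.join(sections)


# Hypothetical inputs
combined = combine_labeled_docs(
    {
        "Policy_2024.pdf": "Annual Leave: 20 days.",
        "Policy_2025.pdf": "  Annual Leave: 25 days.\n",
    }
)
print(combined)
```

Dictionaries preserve insertion order, so the documents appear in the combined text in the order they were added.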

In [None]:
# Compare the documents
comparison_df = comparator.compare_documents(combined_docs)

print("\nComparison Results:")
print("=" * 80)
display(comparison_df)  # Use display() in Jupyter for better formatting

In [None]:
# Explore the comparison results
print("\nDataFrame Info:")
print(f"Columns: {list(comparison_df.columns)}")
print(f"Rows: {len(comparison_df)}")

# Show detailed view of each row
for idx, row in comparison_df.iterrows():
    print(f"\n--- Row {idx} ---")
    for col in comparison_df.columns:
        print(f"{col}: {row[col]}")

## 6. Working with PDF Files

Example of comparing actual PDF documents using the `DocumentComparator` class from `data_ingestion`.

In [None]:
from src.document_ingestion.data_ingestion import DocumentComparator

# Initialize document handler for PDF operations
doc_handler = DocumentComparator(session_id="comparison_demo")
print(f"DocumentComparator initialized with session: {doc_handler.session_id}")

In [None]:
# Example: Compare PDF files (update paths as needed)
import os

PDF_DIR = "data/document_compare/comparison_demo"  # Session directory

if os.path.exists(PDF_DIR) and os.listdir(PDF_DIR):
    # Combine all PDFs in the session directory
    combined_text = doc_handler.combine_documents()
    
    print(f"Combined document length: {len(combined_text)} characters")
    
    # Compare using LLM
    pdf_comparison = comparator.compare_documents(combined_text)
    print("\nPDF Comparison Results:")
    display(pdf_comparison)
else:
    print(f"No PDFs found in: {PDF_DIR}")
    print("To test PDF comparison:")
    print("1. Upload PDFs using doc_handler.save_uploaded_files()")
    print("2. Then run this cell again")
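
The `os.listdir(PDF_DIR)` check above passes for any file in the directory, not only PDFs. A stricter check might filter by extension first. The sketch below uses only the standard library; `list_pdfs` is a hypothetical helper, not part of `data_ingestion`.

```python
from pathlib import Path


def list_pdfs(directory: str) -> list[Path]:
    """Return the PDF files directly inside `directory` (empty list if it is missing)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    # Match the extension case-insensitively so "REPORT.PDF" is also found.
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


pdf_files = list_pdfs("data/document_compare/comparison_demo")
print(f"Found {len(pdf_files)} PDF file(s)")
```

The condition in the cell above could then become `if list_pdfs(PDF_DIR):`, which also covers the case where the directory does not exist.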

## 7. Export Comparison Results

Export comparison results to various formats for further analysis.

In [None]:
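# Export to CSV
import os

output_path = "data/document_compare/comparison_results.csv"
# Create the target folder first; to_csv fails if it does not exist
os.makedirs(os.path.dirname(output_path), exist_ok=True)
comparison_df.to_csv(output_path, index=False)
print(f"Results exported to: {output_path}")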

In [None]:
# Export to JSON
json_output = comparison_df.to_json(orient='records', indent=2)
print("JSON Output:")
print(json_output)
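
The cell above only prints the JSON string. To keep it on disk, the records can be written with the standard library. This is a sketch under assumptions: `save_records` is a hypothetical helper, the sample record is made up, and the demo writes to a temporary directory rather than the project's `data/` folder.

```python
import json
import tempfile
from pathlib import Path


def save_records(records: list[dict], output_path: str) -> Path:
    """Write a list of row dicts to a JSON file, creating parent folders as needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


# Hypothetical records, shaped like the output of to_json(orient="records")
records = [{"aspect": "Annual Leave", "change": "20 to 25 days"}]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_records(records, str(Path(tmp) / "reports" / "comparison.json"))
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
    print(reloaded == records)
```

In the notebook, `json.loads(json_output)` would yield the list of records to pass in.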

## Summary & Next Steps

### Key Takeaways
1. **DocumentComparatorLLM** uses an LLM to compare documents
2. The LCEL chain (`prompt | llm | parser`) processes combined document text
3. Results are returned as a **pandas DataFrame** for easy analysis
4. Supports both text and PDF document comparison

### Possible Extensions
- Add visualization of comparison results (charts, diff views)
- Support more than two documents per comparison
- Implement version-specific comparisons (track changes over time)
- Add semantic similarity scores between document sections
- Export comparison reports in PDF/HTML format