# Experiments: data_analysis

**Original File:** `src/document_analyzer/data_analysis.py`

## Purpose
This module provides document analysis capabilities using LLM-based extraction. It analyzes document text and extracts structured metadata including topics, entities, key points, and summaries.

## Key Components
- **DocumentAnalyzer class**: Main analyzer for extracting structured metadata
  - `analyze_document()`: Analyze document text and return structured metadata
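  - Uses `JsonOutputParser` with the `Metadata` Pydantic model to generate format instructions and parse the LLM's JSON output into a dict
  - Wraps that parser in `OutputFixingParser`, which asks the LLM to repair output that fails to parse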

## Prerequisites
- `langchain`, `langchain-core` installed
- Environment variables configured (API keys for LLM)
- Pydantic models defined in `model/models.py`
- Prompt templates in `prompt/prompt_library.py`

## Instructions & Setup Guide

### Execution Order
1. Run the imports cell
2. Review the DocumentAnalyzer class definition
3. Initialize the analyzer
4. Load a document (PDF, text, etc.)
5. Call `analyze_document()` with the document text

### Dependencies
```bash
pip install langchain langchain-core python-dotenv pydantic
```

### Configuration
- Ensure `.env` file contains `GROQ_API_KEY` and/or `GOOGLE_API_KEY`
- The `Metadata` Pydantic model defines the output structure
- Run from project root directory for proper imports
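
Before initializing the analyzer, it can help to confirm that at least one key is actually visible to the process. The following is a minimal sketch using only the standard library. The helper name `available_api_keys` is hypothetical, and the sketch assumes the `.env` file has already been loaded into the environment (for example with `load_dotenv()` from `python-dotenv`).

```python
import os

# Key names taken from the configuration notes above.
EXPECTED_KEYS = ("GROQ_API_KEY", "GOOGLE_API_KEY")


def available_api_keys(env=None):
    """Return the expected API key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_KEYS if env.get(name, "").strip()]


found = available_api_keys()
if found:
    print("API keys found:", ", ".join(found))
else:
    print("No API key found; set GROQ_API_KEY and/or GOOGLE_API_KEY in .env")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.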

## 1. Imports and Dependencies

Import all required modules for document analysis.

In [None]:
import os
import sys

# LangChain imports
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import OutputFixingParser

# Project imports
from utils.model_loader import ModelLoader
from logger import GLOBAL_LOGGER as log
from exception.custom_exception import DocumentPortalException
from model.models import Metadata  # Pydantic model for structured output
from prompt.prompt_library import PROMPT_REGISTRY

print("All imports successful!")

## 2. DocumentAnalyzer Class Definition

The main class for analyzing documents and extracting structured metadata. Key features:
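- **JsonOutputParser**: Parses the LLM's JSON output into a Python dict; the `Metadata` Pydantic model supplies the schema used in the format instructions
- **OutputFixingParser**: When parsing fails, sends the malformed output back to the LLM for correction and parses the result again
- **Chain pattern**: Uses LCEL (LangChain Expression Language) for composable processing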

In [None]:
class DocumentAnalyzer:
    """
    Analyzes documents using a pre-trained model.
    Automatically logs all actions and supports session-based organization.
    
    The analyzer extracts structured metadata from document text including:
    - Document title and topics
    - Key entities and concepts
    - Summary and key points
    - Document type classification
    """
    
    def __init__(self):
        try:
            # Initialize model loader and LLM
            self.loader = ModelLoader()
            self.llm = self.loader.load_llm()
            
            # Prepare parsers for structured output
            # JsonOutputParser parses LLM text into a dict; Metadata supplies the schema
            self.parser = JsonOutputParser(pydantic_object=Metadata)
            
            # OutputFixingParser wraps the parser and can fix malformed JSON
            self.fixing_parser = OutputFixingParser.from_llm(
                parser=self.parser, 
                llm=self.llm
            )
            
            # Load the analysis prompt template
            self.prompt = PROMPT_REGISTRY["document_analysis"]
            
            log.info("DocumentAnalyzer initialized successfully")
            
        except Exception as e:
            log.error(f"Error initializing DocumentAnalyzer: {e}")
            raise DocumentPortalException("Error in DocumentAnalyzer initialization", sys) from e

print("DocumentAnalyzer class defined successfully!")

## 3. Document Analysis Method

The `analyze_document()` method processes document text and returns structured metadata.

### How it works:
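1. Builds an LCEL chain: `prompt | llm | fixing_parser`
2. Passes the document text and the parser's format instructions to the chain
3. The LLM generates JSON that follows those instructions
4. The parser converts the JSON into a Python dict; if parsing fails, `OutputFixingParser` asks the LLM to repair the output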

In [None]:
def analyze_document(self, document_text: str) -> dict:
    """
    Analyze a document's text and extract structured metadata & summary.
    
    Args:
        document_text: The full text content of the document to analyze
    
    Returns:
        dict: Structured metadata including title, topics, entities,
              summary, key points, and document type
    
    Raises:
        DocumentPortalException: If metadata extraction fails
    """
    try:
        # Build the LCEL chain: prompt -> LLM -> parser
        chain = self.prompt | self.llm | self.fixing_parser
        
        log.info("Metadata analysis chain initialized")

        # Invoke the chain with document text and format instructions
        response = chain.invoke({
            "format_instructions": self.parser.get_format_instructions(),
            "document_text": document_text
        })

        log.info("Metadata extraction successful", keys=list(response.keys()))
        
        return response

    except Exception as e:
        log.error("Metadata analysis failed", error=str(e))
        raise DocumentPortalException("Metadata extraction failed", sys) from e

# Attach method to class
DocumentAnalyzer.analyze_document = analyze_document
print("analyze_document method added!")
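
The parse-then-repair behavior that the chain relies on can be pictured with a small standard-library sketch. This is a conceptual illustration only, not LangChain's implementation: `parse_with_repair` and `drop_trailing_commas` are hypothetical names, and the `repair` callable stands in for the LLM call that `OutputFixingParser` makes when parsing fails.

```python
import json
import re
from typing import Callable


def parse_with_repair(raw: str, repair: Callable[[str, str], str], max_retries: int = 1) -> dict:
    """Parse JSON text; on failure, ask `repair` for corrected text and retry."""
    text = raw
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the original parse error
            # Stand-in for the LLM call: receives the bad text and the error message.
            text = repair(text, str(exc))
    raise ValueError("max_retries must be >= 0")


def drop_trailing_commas(text: str, error: str) -> str:
    """Toy repair step: remove a comma that directly precedes a closing brace or bracket."""
    return re.sub(r",\s*([}\]])", r"\1", text)


malformed = '{"title": "Intro to ML",}'
print(parse_with_repair(malformed, drop_trailing_commas))  # {'title': 'Intro to ML'}
```

In the real chain the repair step is another LLM request, so each failed parse costs an extra call.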

## 4. Understanding the Metadata Model

Let's examine what the `Metadata` Pydantic model looks like. This defines the structure of the extracted information.

In [None]:
# Print the Metadata model schema
from model.models import Metadata

print("Metadata Model Schema:")
print("-" * 50)
print(Metadata.model_json_schema())

In [None]:
# View the format instructions that are sent to the LLM
from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser(pydantic_object=Metadata)
print("Format Instructions for LLM:")
print("-" * 50)
print(parser.get_format_instructions())
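
Because `analyze_document()` returns a plain dict, a quick check of the result against the schema printed above can catch missing fields. The sketch below uses only the standard library and assumes the schema is a JSON Schema dict with a top-level `required` list, which is the shape Pydantic's `model_json_schema()` produces when a model has required fields. The helper name and the example field names are hypothetical and are not taken from the actual `Metadata` model.

```python
def missing_required_fields(schema: dict, result: dict) -> list:
    """Return schema-required field names that are absent from `result`."""
    required = schema.get("required", [])
    return [name for name in required if name not in result]


# Hypothetical schema and result, for illustration only.
example_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["title", "summary"],
}
example_result = {"title": "Introduction to Machine Learning"}

print(missing_required_fields(example_schema, example_result))  # ['summary']
```

With the real model, the equivalent call would be `missing_required_fields(Metadata.model_json_schema(), result)`.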

## 5. Usage Example

Demonstrate how to use the DocumentAnalyzer with sample text.

In [None]:
# Initialize the analyzer
analyzer = DocumentAnalyzer()
print("Analyzer initialized!")

In [None]:
# Sample document text for analysis
sample_document = """
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence (AI) that enables systems to learn 
and improve from experience without being explicitly programmed. It focuses on developing 
computer programs that can access data and use it to learn for themselves.

The process begins with observations or data, such as examples, direct experience, or 
instruction, to look for patterns in data and make better decisions in the future. The 
primary aim is to allow computers to learn automatically without human intervention.

Key Concepts:
1. Supervised Learning - Learning from labeled training data
2. Unsupervised Learning - Finding patterns in unlabeled data
3. Reinforcement Learning - Learning through rewards and penalties

Applications include image recognition, natural language processing, recommendation systems,
and autonomous vehicles. Major companies like Google, Amazon, and Microsoft heavily invest
in machine learning research and applications.
"""

print(f"Document length: {len(sample_document)} characters")
print("\nFirst 200 characters:")
print(sample_document[:200] + "...")

In [None]:
# Analyze the document
result = analyzer.analyze_document(sample_document)

print("Analysis Result:")
print("=" * 50)
for key, value in result.items():
    print(f"\n{key.upper()}:")
    if isinstance(value, list):
        for item in value:
            print(f"  - {item}")
    else:
        print(f"  {value}")

## 6. Working with PDF Documents

Example of analyzing a PDF document using the DocHandler from data_ingestion.

In [None]:
from src.document_ingestion.data_ingestion import DocHandler

# Initialize document handler
doc_handler = DocHandler(session_id="analysis_demo")
print(f"DocHandler initialized with session: {doc_handler.session_id}")

In [None]:
# Example: Analyze a PDF file (update path as needed)
PDF_PATH = "data/document_analysis/sample.pdf"  # Update this path

if os.path.exists(PDF_PATH):
    # Read PDF content
    pdf_text = doc_handler.read_pdf(PDF_PATH)
    print(f"PDF loaded: {len(pdf_text)} characters")
    
    # Analyze the PDF
    pdf_result = analyzer.analyze_document(pdf_text)
    print("\nPDF Analysis Result:")
    print(pdf_result)
else:
    print(f"PDF not found at: {PDF_PATH}")
    print("Skipping PDF analysis example.")

## Summary & Next Steps

### Key Takeaways
1. **DocumentAnalyzer** extracts structured metadata from document text using an LLM
2. Uses **JsonOutputParser** with a Pydantic model to define the expected JSON structure and parse the response into a dict
3. **OutputFixingParser** asks the LLM to correct malformed JSON responses
4. The LCEL chain pattern (`prompt | llm | fixing_parser`) enables composable processing

### Possible Extensions
- Add support for different analysis types (legal, medical, technical)
- Implement batch processing for multiple documents
- Add confidence scores to extracted metadata
- Support multilingual document analysis
- Integrate with document classification systems
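
As a starting point for the batch-processing extension, the sketch below loops over several documents and collects per-document failures instead of stopping at the first one. It is an outline under stated assumptions: `analyze_batch` is a hypothetical helper, it expects any object exposing `analyze_document(text) -> dict`, and the stub analyzer exists only so the example runs without an LLM.

```python
from typing import Dict, Tuple


def analyze_batch(analyzer, documents: Dict[str, str]) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Analyze each document; return (results, errors) keyed by document name."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for name, text in documents.items():
        try:
            results[name] = analyzer.analyze_document(text)
        except Exception as exc:  # keep going so one bad document does not stop the batch
            errors[name] = str(exc)
    return results, errors


class StubAnalyzer:
    """Stand-in for DocumentAnalyzer so the sketch runs without an LLM."""

    def analyze_document(self, document_text: str) -> dict:
        if not document_text.strip():
            raise ValueError("empty document")
        return {"length": len(document_text)}


results, errors = analyze_batch(StubAnalyzer(), {"a.txt": "hello", "b.txt": "   "})
print(results)  # {'a.txt': {'length': 5}}
print(errors)   # {'b.txt': 'empty document'}
```

Replacing `StubAnalyzer()` with a `DocumentAnalyzer` instance would run the same loop against the real chain, one LLM call per document.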