# Document Processing with Docling and Propositional Chunking

This notebook demonstrates an advanced approach for processing documents into atomic, self-contained propositions using:
1. **Docling**: Convert PDF/documents to clean markdown format
2. **Propositional Chunking with Claude 3.5 Haiku**: Extract atomic facts and claims that can stand alone

## Part 1: Document Processing with Docling

### Step 1: Convert PDF to Markdown using Docling

[Docling](https://github.com/DS4SD/docling) is a powerful document conversion library that can parse complex document layouts (PDFs, PowerPoints, Word docs, etc.) and convert them into clean markdown format. This is particularly useful for:
- Preserving document structure (headings, lists, tables)
- Handling multi-column layouts
- Extracting text from complex academic papers

In this example, we're converting a research paper from arXiv about GraphRAG (Graph-based Retrieval Augmented Generation), which we'll then process using propositional chunking to improve retrieval precision.


In [None]:
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2404.16130"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

### Step 2: Install Required Dependencies

For propositional chunking, we'll use:
- **langchain-aws**: Provides integration with AWS Bedrock for both LLMs and embeddings
- **boto3**: AWS SDK for Python to interact with AWS services
- **langchain**: Core LangChain library for document handling

The `typing` module used for type hints is part of the Python standard library and needs no installation. Step 1 assumes Docling is already installed (`pip install docling`).


In [None]:
%pip install --quiet langchain langchain-aws boto3



In [None]:
# AWS Credentials Configuration (optional)
# If you haven't configured AWS credentials, you can do it in one of these ways:

# Option 1: Use environment variables (uncomment if needed)
# import os
# os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
# os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
# os.environ["AWS_SESSION_TOKEN"] = "your-session-token"  # Optional, for temporary credentials

# Option 2: Pass credentials directly to BedrockEmbeddings (uncomment if needed)
# from langchain_aws import BedrockEmbeddings
# bedrock_embeddings = BedrockEmbeddings(
#     model_id="amazon.titan-embed-text-v2:0",
#     region_name="us-east-1",
#     aws_access_key_id="your-access-key-id",
#     aws_secret_access_key="your-secret-access-key",
#     aws_session_token="your-session-token"  # Optional
# )

# Option 3: Use AWS CLI configuration (run in terminal)
# aws configure


### AWS Credentials Setup

Make sure you have AWS credentials configured to use Bedrock. You can configure them using:
1. AWS CLI: `aws configure`
2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
3. IAM roles (if running on AWS infrastructure)

The cell above shows different configuration options you can uncomment if needed.


### Step 3: Configure Propositional Chunker with Claude 3.5 Haiku

**Propositional Chunking** is an advanced text processing technique that:
- Breaks down text into atomic propositions (single facts or claims)
- Each proposition is self-contained and can stand alone
- Maintains complete context within each proposition
- Provides more granular and precise retrieval for RAG applications

We're using **Anthropic's Claude 3.5 Haiku** model via AWS Bedrock, which:
- Offers excellent performance for text analysis tasks
- Provides fast inference with high quality outputs
- Is cost-effective for processing large documents
- Excels at extracting structured information from unstructured text
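
To make the idea of an atomic, self-contained proposition concrete, here is a small hand-written illustration that involves no LLM call. The passage, the `PRONOUNS` set, and the `has_dangling_pronoun` helper are assumptions made for this sketch; the pronoun check is only a crude heuristic for spotting sentences that depend on earlier text, and it is not part of the chunker defined in this notebook.

```python
import re

# Hand-written example: a passage and the propositions we would hope to get.
passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She later won a second one, in Chemistry, in 1911."
)

expected_propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
]

# Hypothetical heuristic: a sentence that starts with a pronoun probably
# depends on earlier text and is therefore not self-contained.
PRONOUNS = {"he", "she", "it", "they", "this", "that", "these", "those"}

def has_dangling_pronoun(sentence: str) -> bool:
    """Return True if the first word of the sentence is a pronoun."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[0].lower() in PRONOUNS

# Naive sentence splitting keeps "She later won ...", which cannot stand alone.
original_sentences = [s.strip() + "." for s in passage.split(".") if s.strip()]
naive_flags = [has_dangling_pronoun(s) for s in original_sentences]
proposition_flags = [has_dangling_pronoun(p) for p in expected_propositions]
print(naive_flags)
print(proposition_flags)
```

The second original sentence is flagged because "She" has no referent once the sentence is retrieved on its own, while both rewritten propositions name their subject explicitly.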


In [None]:
from langchain_aws import ChatBedrock
from langchain_core.documents import Document  # langchain.schema is a deprecated import path
from typing import List
import json
import re

class PropositionalChunker:
    """
    A custom chunker that breaks down text into atomic propositions using Claude 3.5 Haiku.
    Each proposition contains a single fact or claim that can stand alone.
    """
    
    def __init__(self, llm):
        self.llm = llm
        
    def extract_propositions(self, text: str) -> List[str]:
        """Extract atomic propositions from text using Claude 3.5 Haiku."""
        
        prompt = f"""Break down the following text into atomic propositions. Each proposition should:
1. Contain ONLY ONE fact, claim, or piece of information
2. Be self-contained and understandable without additional context
3. Include necessary context (e.g., who, what, when, where) to stand alone
4. Be a complete sentence

Return the propositions as a JSON array of strings.

Text to analyze:
{text}

Output format:
["proposition 1", "proposition 2", ...]

Important: Return ONLY the JSON array, no explanations or additional text."""

        response = self.llm.invoke(prompt)
        content = response.content

        # Look for a JSON array in the response (greedy match from the first '[' to the last ']')
        json_match = re.search(r'\[.*\]', content, re.DOTALL)
        if json_match:
            try:
                propositions = json.loads(json_match.group())
                # Keep only non-empty strings in case the model returns other JSON types
                cleaned = [p.strip() for p in propositions if isinstance(p, str) and p.strip()]
                if cleaned:
                    return cleaned
            except json.JSONDecodeError:
                pass

        # Fallback: naive splitting on periods (these are sentences, not true propositions)
        return [s.strip() for s in text.split('.') if s.strip()]
    
    def chunk_document(self, text: str, chunk_size: int = 5) -> List[Document]:
        """
        Convert text into document chunks, where each chunk contains 
        a group of related propositions.
        
        Args:
            text: The text to chunk
            chunk_size: Number of propositions per chunk (default: 5)
        
        Returns:
            List of Document objects containing propositional chunks
        """
        # Extract all propositions
        propositions = self.extract_propositions(text)
        
        # Group propositions into chunks
        chunks = []
        for i in range(0, len(propositions), chunk_size):
            chunk_propositions = propositions[i:i + chunk_size]
            chunk_text = " ".join(chunk_propositions)
            
            # Create metadata for the chunk
            metadata = {
                "chunk_index": i // chunk_size,
                "num_propositions": len(chunk_propositions),
                "start_proposition": i,
                "end_proposition": min(i + chunk_size, len(propositions))
            }
            
            chunks.append(Document(page_content=chunk_text, metadata=metadata))
        
        return chunks

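# Initialize Claude 3.5 Haiku via AWS Bedrock
# Note: model availability varies by region, and some regions require a cross-region
# inference profile ID instead. Check the Bedrock console for the ID enabled in your account.
llm = ChatBedrock(
    model_id="anthropic.claude-3-5-haiku-20241022-v1:0",
    region_name="us-east-1",  # Change to your preferred AWS region
    model_kwargs={
        "temperature": 0.1,  # Lower temperature for more consistent extraction
        "max_tokens": 4096
    }
)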

# Create the propositional chunker
propositional_chunker = PropositionalChunker(llm)

### Step 4: Create Propositional Document Chunks

Now we'll process the markdown text through the PropositionalChunker to create document chunks. This process:
1. Takes the markdown output from Docling
2. Uses Claude 3.5 Haiku to extract atomic propositions (single facts/claims)
3. Each proposition is self-contained with necessary context
4. Groups propositions into chunks for efficient storage and retrieval

This approach can outperform both naive and semantic splitting for retrieval because:
- Each proposition is a complete, standalone fact
- Chunk boundaries fall between propositions rather than at arbitrary character offsets
- Fine granularity allows precise matching against queries
- Retrieved chunks are less likely to contain partial or incomplete information
- Complex documents with multiple interrelated topics are separated into distinct claims
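
One caveat: before the LLM sees any text, the demo cell in this step slices the markdown at fixed character offsets, which can cut a sentence in half at a section boundary. A minimal sketch of an alternative, assuming paragraphs in the Docling markdown are separated by blank lines, is to pack whole paragraphs into sections. The function name `split_on_paragraphs` and the `max_chars` parameter are hypothetical and are not part of Docling or LangChain.

```python
from typing import List

def split_on_paragraphs(text: str, max_chars: int = 2000) -> List[str]:
    """Greedily pack whole paragraphs into sections of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sections: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current section.
            sections.append(current)
            current = para
        else:
            current = candidate
    if current:
        sections.append(current)
    return sections

sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph.\n\nThird paragraph."
sample_sections = split_on_paragraphs(sample, max_chars=30)
for section in sample_sections:
    print(repr(section))
```

The resulting list could be passed to the chunker in place of the fixed-offset `sections` list. Sections then vary in length, so `max_chars` should leave headroom for the model's output token limit.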

In [None]:
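# Process the document with propositional chunking
# Note: large documents are processed in smaller sections to stay within token limits
markdown_text = result.document.export_to_markdown()

# Split into fixed-size sections (approximately 20000 characters each).
# Caveat: slicing by character count can cut a sentence in half at a section boundary,
# and a long section may yield more propositions than fit in max_tokens=4096. In that
# case the JSON comes back truncated and the chunker falls back to naive sentence
# splitting. Reduce section_size if that happens.
section_size = 20000
sections = [markdown_text[i:i+section_size] for i in range(0, len(markdown_text), section_size)]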

# Process each section and collect all chunks
all_chunks = []
for i, section in enumerate(sections[:5]):  # Process first 5 sections for demo
    print(f"Processing section {i+1}/{min(5, len(sections))}...")
    chunks = propositional_chunker.chunk_document(section)  # Default 5 propositions per chunk
    all_chunks.extend(chunks)

docs = all_chunks
print(f"\nTotal chunks created: {len(docs)}")


### Step 5: Examine the Results

Let's look at the generated propositional chunks to see how the PropositionalChunker has broken down the text into atomic facts. Each chunk contains a small group of self-contained propositions that can stand alone.


In [None]:
# Display examples of propositional chunks
print("="*80)
print("PROPOSITIONAL CHUNKING RESULTS")
print("="*80)

# Show first few chunks with metadata
for i, doc in enumerate(docs[:3]):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Metadata: {doc.metadata}")
    print(f"\nContent:\n{doc.page_content}")
    print("-"*40)

# Show statistics (guarded so an empty result does not raise on division or min/max)
if docs:
    lengths = [len(d.page_content) for d in docs]
    print("\n\nSTATISTICS:")
    print(f"Total chunks created: {len(docs)}")
    print(f"Average chunk length: {sum(lengths) / len(lengths):.0f} characters")
    print(f"Min chunk length: {min(lengths)} characters")
    print(f"Max chunk length: {max(lengths)} characters")
else:
    print("\n\nNo chunks were created.")

## Part 2: Comparison - Propositional vs Semantic Chunking

### Why Propositional Chunking?

Propositional chunking offers several advantages over traditional semantic chunking:

#### 1. **Atomic Information Units**
- Each proposition contains exactly one fact or claim
- No partial information or incomplete thoughts
- Well suited to fact-checking and verification systems

#### 2. **Self-Contained Context**
- Every proposition includes necessary context (who, what, when, where)
- Can be understood without referring to surrounding text
- Ideal for question-answering systems

#### 3. **Improved Retrieval Precision**
- More granular matching with user queries
- Reduces irrelevant information in retrieved chunks
- Better alignment with specific information needs

#### 4. **Enhanced Flexibility**
- Propositions can be dynamically grouped based on query requirements
- Supports both fine-grained and coarse-grained retrieval
- Can be re-combined in different ways for different use cases

### Use Cases

Propositional chunking is particularly valuable for:
- **Legal documents**: Where every claim must be precise and traceable
- **Scientific papers**: Where individual findings and claims need to be isolated
- **Knowledge bases**: Where facts need to be stored and retrieved independently
- **Fact-checking systems**: Where individual claims need verification
- **Multi-hop reasoning**: Where complex queries require combining multiple facts

### Trade-offs

While propositional chunking can improve retrieval precision, consider these costs:
- **Higher computational cost**: Requires LLM inference for chunking
- **More storage**: Generally creates more chunks than semantic chunking
- **Processing time**: Slower than embedding-based semantic chunking
- **Token usage**: Consumes API tokens for the chunking process itself

### Best Practices

1. **Adjust chunk size based on use case**: 
   - Smaller chunks (1-3 propositions) for high precision
   - Larger chunks (5-10 propositions) for context preservation

2. **Implement caching**: Store processed propositions to avoid re-processing

3. **Combine with embeddings**: Use embeddings on the propositions for similarity search

4. **Consider hybrid approaches**: Use propositional chunking for critical sections and semantic chunking for less critical content
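
As one way to approach the caching practice above, the sketch below stores the propositions for each section in a JSON file keyed by a SHA-256 hash of the section text, so an unchanged section never triggers a second LLM call. The names `cached_extract`, `cache_dir`, and `fake_extract` are hypothetical. In the notebook, `propositional_chunker.extract_propositions` would take the place of the stand-in function.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List

def cached_extract(
    text: str,
    extract_fn: Callable[[str], List[str]],
    cache_dir: str = "proposition_cache",
) -> List[str]:
    """Return cached propositions for text, calling extract_fn only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    propositions = extract_fn(text)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(propositions), encoding="utf-8")
    return propositions

# Stand-in for the LLM call so the sketch runs without AWS access.
calls: List[str] = []

def fake_extract(text: str) -> List[str]:
    calls.append(text)
    return [s.strip() + "." for s in text.split(".") if s.strip()]

with tempfile.TemporaryDirectory() as tmp:
    first = cached_extract("A is B. C is D.", fake_extract, cache_dir=tmp)
    second = cached_extract("A is B. C is D.", fake_extract, cache_dir=tmp)

print(first == second, len(calls))
```

Hashing the text rather than the file name means an edited section is re-processed automatically. The cache key does not include the prompt or model ID, so the cache directory should be cleared if either changes.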


In [None]:
# Optional: Process larger portions of the document
# Uncomment below to process the entire document (this will use more API calls)

# def process_full_document(text, chunker, section_size=2000, max_sections=None):
#     """Process a full document with propositional chunking."""
#     sections = [text[i:i+section_size] for i in range(0, len(text), section_size)]
#     
#     if max_sections:
#         sections = sections[:max_sections]
#     
#     all_chunks = []
#     total_sections = len(sections)
#     
#     for i, section in enumerate(sections):
#         print(f"Processing section {i+1}/{total_sections}...", end="\r")
#         try:
#             chunks = chunker.chunk_document(section, chunk_size=3)
#             all_chunks.extend(chunks)
#         except Exception as e:
#             print(f"\nError processing section {i+1}: {e}")
#             continue
#     
#     print(f"\nCompleted! Total chunks: {len(all_chunks)}")
#     return all_chunks

# # Process more of the document (e.g., first 20 sections)
# full_docs = process_full_document(
#     result.document.export_to_markdown(), 
#     propositional_chunker,
#     max_sections=20
# )


## Part 3: Using Propositional Chunks in RAG Systems

### Integration with Vector Databases

Once you have your propositional chunks, you can integrate them with vector databases for RAG applications:

```python
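# Example: Using with ChromaDB or similar vector stores
# (assumes `pip install langchain-community chromadb`; langchain.vectorstores is a deprecated import path)
# from langchain_community.vectorstores import Chroma
# from langchain_aws import BedrockEmbeddings

# # Initialize embeddings (Titan is used here for embedding the propositions)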
# embeddings = BedrockEmbeddings(
#     model_id="amazon.titan-embed-text-v2:0",
#     region_name="us-east-1"
# )

# # Create vector store from propositional chunks
# vectorstore = Chroma.from_documents(
#     documents=docs,
#     embedding=embeddings,
#     collection_name="propositional_chunks"
# )

# # Query the vector store
# query = "What is GraphRAG?"
# results = vectorstore.similarity_search(query, k=5)
```

### Advanced Retrieval Strategies

With propositional chunks, you can implement sophisticated retrieval strategies:

1. **Multi-proposition retrieval**: Retrieve multiple related propositions and combine them
2. **Contextual expansion**: After retrieving a proposition, fetch neighboring propositions
3. **Hierarchical retrieval**: Store propositions at different granularity levels
4. **Semantic re-ranking**: Use Claude to re-rank retrieved propositions based on relevance
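
Contextual expansion, for example, can be sketched as a simple window over the ordered chunk list. The `Chunk` dataclass below is a minimal stand-in exposing the same `page_content` and `metadata` fields as a LangChain `Document`, and `expand_with_neighbors` is a hypothetical helper. The sketch uses list position rather than the `chunk_index` metadata, because `chunk_index` restarts at zero for every section processed in Step 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Chunk:
    """Minimal stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: Dict = field(default_factory=dict)

def expand_with_neighbors(
    chunks: List[Chunk], hit_position: int, window: int = 1
) -> List[Chunk]:
    """Return the chunk at hit_position plus up to `window` chunks on each side."""
    start = max(0, hit_position - window)
    end = min(len(chunks), hit_position + window + 1)
    return chunks[start:end]

chunks = [Chunk(f"Proposition group {i}", {"position": i}) for i in range(5)]

# A hit at the start of the document only has a right-hand neighbor.
at_start = [c.metadata["position"] for c in expand_with_neighbors(chunks, 0)]
# A hit in the middle gets one neighbor on each side.
in_middle = [c.metadata["position"] for c in expand_with_neighbors(chunks, 3)]
print(at_start, in_middle)
```

To use this with a vector store, the global list position would need to be saved in each chunk's metadata at indexing time so a retrieved chunk can be mapped back to its neighbors.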

### Performance Optimization Tips

- **Batch processing**: Process multiple sections in parallel when possible
- **Caching layer**: Implement Redis or similar for caching processed propositions
- **Incremental processing**: Process new documents incrementally rather than re-processing everything
- **Hybrid storage**: Use both vector and graph databases for different query types
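
For the batch-processing tip, one possible approach is a thread pool, since each LLM call spends most of its time waiting on the network. In the sketch below, `process_sections_parallel` and `fake_chunk_fn` are hypothetical names; in the notebook, `propositional_chunker.chunk_document` would be passed as `chunk_fn`. Service quotas may throttle concurrent requests, so `max_workers` is an assumption to tune rather than a recommended value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def process_sections_parallel(
    sections: List[str],
    chunk_fn: Callable[[str], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Apply chunk_fn to each section concurrently, preserving section order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        per_section = list(pool.map(chunk_fn, sections))
    # Flatten the per-section lists into a single list of chunks.
    return [chunk for section_chunks in per_section for chunk in section_chunks]

# Stand-in for the LLM-backed chunker so the sketch runs without AWS access.
def fake_chunk_fn(section: str) -> List[str]:
    return [s.strip() for s in section.split(".") if s.strip()]

demo_sections = ["A is B. C is D.", "E is F."]
parallel_result = process_sections_parallel(demo_sections, fake_chunk_fn, max_workers=2)
print(parallel_result)
```

An exception raised inside `chunk_fn` surfaces when the results are collected, so a production version would likely wrap each call with retry and error handling.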


In [None]:
# Save propositional chunks for later use
import json

def save_chunks(chunks, filename="propositional_chunks.json"):
    """Save propositional chunks to a JSON file."""
    chunks_data = []
    for chunk in chunks:
        chunks_data.append({
            "content": chunk.page_content,
            "metadata": chunk.metadata
        })
    
    with open(filename, 'w') as f:
        json.dump(chunks_data, f, indent=2)
    
    print(f"Saved {len(chunks)} chunks to {filename}")

def load_chunks(filename="propositional_chunks.json"):
    """Load propositional chunks from a JSON file."""
    with open(filename, 'r') as f:
        chunks_data = json.load(f)
    
    chunks = []
    for data in chunks_data:
        chunks.append(Document(
            page_content=data["content"],
            metadata=data["metadata"]
        ))
    
    print(f"Loaded {len(chunks)} chunks from {filename}")
    return chunks

# Save the chunks
save_chunks(docs)

# Load the chunks (example)
# loaded_docs = load_chunks()
