# Document Summarization Tool

## Project Overview
This project builds a document summarization system using several LangChain techniques.
It can:
- Summarize long documents that exceed LLM context limits
- Generate summaries in different styles (executive, technical, bullet points)
- Use multiple chain types (stuff, map_reduce, refine)
- Handle documents of any length by splitting them into chunks when needed

## Use Cases
- Research: Quickly understand research papers
- Business: Create executive summaries of reports
- Legal: Summarize contracts and legal documents
- Academic: Generate study notes from textbooks

## What You'll Learn
1. Different chain types (stuff, map_reduce, refine)
2. Prompt engineering for different summary styles
3. Handling long documents
4. Custom prompts with LangChain

## Step 1: Environment Setup

In [None]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv()
print("✅ Environment loaded")
print(f"OpenAI API Key found: {'OPENAI_API_KEY' in os.environ}")

## Step 2: Import Required Libraries

In [None]:
# Document loading and processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# LLM and chains
from langchain_openai import ChatOpenAI
from langchain_classic.chains.summarize import load_summarize_chain

# Prompts
from langchain_core.prompts import PromptTemplate

print("✅ All libraries imported successfully")

## Step 3: Load Document to Summarize

In [None]:
# Load PDF document
pdf_path = "../RAG/llm_fundamentals.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

print(f"✅ Loaded {len(documents)} pages")
print("\nDocument info:")
print(f"   Total pages: {len(documents)}")
print(f"   First page preview: {documents[0].page_content[:100]}...")

## Step 4: Initialize LLM

In [None]:
# Initialize OpenAI LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.3,  # Lower temperature for more focused summaries
    api_key=os.environ["OPENAI_API_KEY"]
)

print("✅ LLM initialized")

## Method 1: Stuff Chain (Simple - for short documents)

**How it works:**
- Puts ALL document content into a single prompt
- Best for documents that fit in the LLM's context window
- Fastest and usually most accurate for small documents, since the model sees the whole text in one call
- **Limitation**: Fails if the document exceeds the model's context window
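
The limitation above can be checked before calling the chain. The following is a minimal sketch, not part of LangChain: `estimate_tokens` and `fits_in_context` are hypothetical helpers, and the four-characters-per-token ratio is only a rough heuristic for English text, not a real tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (heuristic only)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(page_texts: list[str], context_limit: int,
                    reserved_for_output: int = 1000) -> bool:
    """Return True if the joined pages likely fit in a single prompt."""
    joined = "\n\n".join(page_texts)
    return estimate_tokens(joined) + reserved_for_output <= context_limit


# Two stand-in pages of 2500 characters each (assumed data, not the real PDF)
pages = ["word " * 500, "word " * 500]
small_enough = fits_in_context(pages, context_limit=8000)
too_long = fits_in_context(pages, context_limit=2000)
print(small_enough, too_long)
```

With the notebook's data, `page_texts` would be `[d.page_content for d in documents]`. If the check fails, one of the chunk-based chain types is the safer choice.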

In [None]:
# Create stuff chain
stuff_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",  # Put everything in one prompt
    verbose=False  # Set to True to see the prompt being used
)

# Generate summary
print("Generating summary using STUFF chain...\n")
stuff_summary = stuff_chain.invoke(documents)

print("="*80)
print("STUFF CHAIN SUMMARY")
print("="*80)
print(stuff_summary['output_text'])
print("\n" + "="*80)

## Method 2: Map-Reduce Chain (For long documents)

**How it works:**
1. **Map**: Summarize each chunk independently
2. **Reduce**: Combine all chunk summaries into final summary

**Advantages:**
- Handles documents of ANY length
- Parallelizable (can process chunks simultaneously)

**Disadvantages:**
- May lose connections between chunks
- More API calls = higher cost
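
The two phases can be sketched without an LLM. This is an illustration under stated assumptions, not LangChain's implementation: `map_reduce_summarize` is a hypothetical helper, `first_sentence` stands in for the per-chunk LLM call, and `join_summaries` stands in for the combining call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def map_reduce_summarize(chunks: list[str],
                         map_fn: Callable[[str], str],
                         reduce_fn: Callable[[list[str]], str]) -> str:
    """Summarize each chunk independently (map), then combine the results (reduce)."""
    # Map: chunks do not depend on each other, so they can run concurrently.
    # pool.map returns results in the same order as the input chunks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(pool.map(map_fn, chunks))
    # Reduce: a single call over all partial summaries
    return reduce_fn(partial_summaries)


def first_sentence(text: str) -> str:
    """Stand-in for the per-chunk LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def join_summaries(summaries: list[str]) -> str:
    """Stand-in for the combining LLM call: concatenate the partial summaries."""
    return " ".join(summaries)


chunks = [
    "LLMs predict the next token. They are trained on large corpora.",
    "Context windows are limited. Long inputs must be split.",
]
result = map_reduce_summarize(chunks, first_sentence, join_summaries)
print(result)
```

Because each map call sees only its own chunk, a fact that spans two chunks can be lost, which is the first disadvantage listed above.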

In [None]:
# Split document into chunks for map-reduce
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,      # Larger chunks for summarization
    chunk_overlap=200
)

split_docs = text_splitter.split_documents(documents)
print(f"Split into {len(split_docs)} chunks for map-reduce\n")

# Create map-reduce chain
map_reduce_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",  # Summarize chunks then combine
    verbose=False
)

# Generate summary
print("Generating summary using MAP-REDUCE chain...\n")
map_reduce_summary = map_reduce_chain.invoke(split_docs)

print("="*80)
print("MAP-REDUCE CHAIN SUMMARY")
print("="*80)
print(map_reduce_summary['output_text'])
print("\n" + "="*80)
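
To see what `chunk_size` and `chunk_overlap` mean in the cell above, here is a simplified sliding-window sketch. It is an assumption-laden illustration: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so real chunk boundaries will differ.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# 5000 characters of stand-in text (assumed data)
sample_text = "".join(str(i % 10) for i in range(5000))
sample_chunks = sliding_chunks(sample_text, chunk_size=2000, chunk_overlap=200)
print([len(c) for c in sample_chunks])
```

The last 200 characters of one chunk reappear at the start of the next, which gives the model some shared context across a boundary.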

## Method 3: Refine Chain (For coherent long summaries)

**How it works:**
1. Summarize first chunk
2. Refine summary with second chunk
3. Refine again with third chunk
4. Continue until all chunks processed

**Advantages:**
- Maintains coherence across chunks
- Builds context iteratively
- Good for narrative documents

**Disadvantages:**
- Sequential (cannot parallelize)
- Slower than map-reduce
- One API call per chunk, and every call after the first also carries the running summary
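
The steps above amount to a simple loop. The sketch below is an illustration rather than LangChain's implementation: `refine_summarize` is a hypothetical helper, and `first_sentence` and `append_first_sentence` stand in for the initial and refining LLM calls.

```python
from typing import Callable


def refine_summarize(chunks: list[str],
                     initial_fn: Callable[[str], str],
                     refine_fn: Callable[[str, str], str]) -> str:
    """Summarize the first chunk, then update that summary once per remaining chunk."""
    if not chunks:
        return ""
    summary = initial_fn(chunks[0])
    # Each step needs the previous summary, so the calls must run in order
    for chunk in chunks[1:]:
        summary = refine_fn(summary, chunk)
    return summary


def first_sentence(text: str) -> str:
    """Stand-in for the initial LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def append_first_sentence(existing_summary: str, chunk: str) -> str:
    """Stand-in for the refining LLM call: extend the running summary."""
    return existing_summary + " " + first_sentence(chunk)


chunks = [
    "LLMs predict the next token. They are trained on large corpora.",
    "Context windows are limited. Long inputs must be split.",
    "Refinement keeps a running summary. Each step sees the previous one.",
]
refined = refine_summarize(chunks, first_sentence, append_first_sentence)
print(refined)
```

The dependency of each step on the previous summary is what keeps the result coherent, and also what prevents parallel execution.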

In [None]:
# Create refine chain
refine_chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",  # Iteratively refine the summary
    verbose=False
)

# Generate summary
print("Generating summary using REFINE chain...\n")
refine_summary = refine_chain.invoke(split_docs)

print("="*80)
print("REFINE CHAIN SUMMARY")
print("="*80)
print(refine_summary['output_text'])
print("\n" + "="*80)

## Custom Prompts: Executive Summary Style

Create custom prompts to control the summary style and format.
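
Each custom prompt in this notebook has a single `{text}` placeholder that receives the document content. The sketch below shows that substitution with plain Python string formatting. It is an illustration only: `template_variables` and `render_prompt` are hypothetical helpers, and joining pages with a blank line is an assumption about how the content is assembled, not a statement about LangChain internals.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def render_prompt(template: str, page_texts: list[str]) -> str:
    """Fill the {text} placeholder with the joined page contents."""
    variables = template_variables(template)
    if variables != {"text"}:
        raise ValueError(f"expected only a 'text' placeholder, found {sorted(variables)}")
    # Assumption: pages are joined with a blank line between them
    return template.format(text="\n\n".join(page_texts))


demo_template = "Summarize in one sentence.\n\nDOCUMENT:\n{text}\n\nSUMMARY:"
rendered = render_prompt(demo_template, ["Page one.", "Page two."])
print(rendered)
```

Checking the placeholder names up front catches a misspelled variable before any API call is made.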

In [None]:
# Custom prompt for executive summary
executive_prompt_template = """
You are a senior executive summarizing a document for C-level executives.

Create a concise executive summary with:
1. Key takeaways (3-5 bullet points)
2. Main themes
3. Actionable insights

Keep it under 150 words and focus on business value.

DOCUMENT:
{text}

EXECUTIVE SUMMARY:
"""

executive_prompt = PromptTemplate(
    template=executive_prompt_template,
    input_variables=["text"]
)

# Create chain with custom prompt
executive_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",
    prompt=executive_prompt,
    verbose=False
)

# Generate executive summary
print("Generating EXECUTIVE SUMMARY...\n")
executive_summary = executive_chain.invoke(documents)

print("="*80)
print("EXECUTIVE SUMMARY")
print("="*80)
print(executive_summary['output_text'])
print("\n" + "="*80)

## Custom Prompts: Technical Summary Style

In [None]:
# Custom prompt for technical summary
technical_prompt_template = """
You are a technical expert summarizing a document for engineers and researchers.

Create a detailed technical summary including:
1. Core concepts and terminology
2. Technical approaches and methods
3. Key technical details
4. Implementation considerations

Use technical language and include specific details.

DOCUMENT:
{text}

TECHNICAL SUMMARY:
"""

technical_prompt = PromptTemplate(
    template=technical_prompt_template,
    input_variables=["text"]
)

# Create chain with custom prompt
technical_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",
    prompt=technical_prompt,
    verbose=False
)

# Generate technical summary
print("Generating TECHNICAL SUMMARY...\n")
technical_summary = technical_chain.invoke(documents)

print("="*80)
print("TECHNICAL SUMMARY")
print("="*80)
print(technical_summary['output_text'])
print("\n" + "="*80)

## Custom Prompts: Bullet Points Style

In [None]:
# Custom prompt for bullet points
bullet_prompt_template = """
Summarize the following document as a structured list of bullet points.

Format:
• Main Point 1
  - Sub-point 1a
  - Sub-point 1b
• Main Point 2
  - Sub-point 2a
  - Sub-point 2b

Focus on the most important information and organize hierarchically.

DOCUMENT:
{text}

BULLET POINT SUMMARY:
"""

bullet_prompt = PromptTemplate(
    template=bullet_prompt_template,
    input_variables=["text"]
)

# Create chain with custom prompt
bullet_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",
    prompt=bullet_prompt,
    verbose=False
)

# Generate bullet point summary
print("Generating BULLET POINT SUMMARY...\n")
bullet_summary = bullet_chain.invoke(documents)

print("="*80)
print("BULLET POINT SUMMARY")
print("="*80)
print(bullet_summary['output_text'])
print("\n" + "="*80)

## Comparing All Summary Methods

Let's compare the different approaches side by side.

In [None]:
def compare_summaries():
    """
    Display all summaries for comparison.
    """
    summaries = [
        ("STUFF CHAIN", stuff_summary['output_text']),
        ("MAP-REDUCE CHAIN", map_reduce_summary['output_text']),
        ("REFINE CHAIN", refine_summary['output_text']),
        ("EXECUTIVE STYLE", executive_summary['output_text']),
        ("TECHNICAL STYLE", technical_summary['output_text']),
        ("BULLET POINTS STYLE", bullet_summary['output_text'])
    ]
    
    print("\n" + "#"*80)
    print("SUMMARY COMPARISON")
    print("#"*80 + "\n")
    
    for method, summary in summaries:
        print(f"\n{'='*80}")
        print(f"{method}")
        print(f"{'='*80}")
        print(f"Length: {len(summary)} characters")
        print(f"\n{summary}")
        print()

# Display comparison
compare_summaries()

## Advanced: Map-Reduce with Custom Prompts

Combine the map-reduce chain with custom prompts to control both the per-chunk summaries and the final combined summary of a long document.

In [None]:
# Custom map prompt (for each chunk)
map_prompt_template = """
Write a concise summary of the following chunk:

{text}

CONCISE SUMMARY:
"""

map_prompt = PromptTemplate(
    template=map_prompt_template,
    input_variables=["text"]
)

# Custom combine prompt (for combining summaries)
combine_prompt_template = """
Combine the following summaries into a comprehensive final summary.
Organize by themes and eliminate redundancy.

{text}

COMPREHENSIVE SUMMARY:
"""

combine_prompt = PromptTemplate(
    template=combine_prompt_template,
    input_variables=["text"]
)

# Create map-reduce chain with custom prompts
custom_map_reduce_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,        # Prompt for each chunk
    combine_prompt=combine_prompt, # Prompt for combining
    verbose=False
)

# Generate custom map-reduce summary
print("Generating CUSTOM MAP-REDUCE SUMMARY...\n")
custom_summary = custom_map_reduce_chain.invoke(split_docs)

print("="*80)
print("CUSTOM MAP-REDUCE SUMMARY")
print("="*80)
print(custom_summary['output_text'])
print("\n" + "="*80)

## Helper Function: Summarize Any Document

In [None]:
def summarize_document(pdf_path: str, style: str = "executive", chain_type: str = "stuff"):
    """
    Flexible function to summarize any document with chosen style and chain type.
    
    Args:
        pdf_path: Path to PDF file
        style: Summary style - "executive", "technical", "bullets", or "default"
        chain_type: Chain type - "stuff", "map_reduce", or "refine"
    
    Returns:
        Summary text
    """
    # Load document
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    
    # Choose prompt based on style
    prompt_map = {
        "executive": executive_prompt,
        "technical": technical_prompt,
        "bullets": bullet_prompt
    }
    
    # Prepare documents based on chain type
    if chain_type in ["map_reduce", "refine"]:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=2000,
            chunk_overlap=200
        )
        docs = text_splitter.split_documents(docs)
    
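    # Create chain. Each chain type accepts its custom prompt under a different
    # keyword, so map the style prompt to the matching argument.
    prompt_kwargs = {}
    if style in prompt_map:
        if chain_type == "stuff":
            prompt_kwargs["prompt"] = prompt_map[style]
        elif chain_type == "map_reduce":
            # Style is applied when combining; chunks use the default map prompt
            prompt_kwargs["combine_prompt"] = prompt_map[style]
        elif chain_type == "refine":
            # Style is applied to the initial summary; later steps use the default refine prompt
            prompt_kwargs["question_prompt"] = prompt_map[style]

    chain = load_summarize_chain(
        llm=llm,
        chain_type=chain_type,
        verbose=False,
        **prompt_kwargs
    )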
    
    # Generate summary
    result = chain.invoke(docs)
    return result['output_text']

# Example usage
print("\nTesting flexible summarization function:\n")
test_summary = summarize_document(
    pdf_path="../RAG/llm_fundamentals.pdf",
    style="bullets",
    chain_type="stuff"
)
print(test_summary)

## Summary

### What You've Built:
- ✅ Three chain types: stuff, map_reduce, refine
- ✅ Three summary styles: executive, technical, bullet points
- ✅ Custom prompts for different use cases
- ✅ Flexible summarization function

### Key Concepts Learned:
1. **Chain Types**:
   - **Stuff**: Fast, simple, for short documents
   - **Map-Reduce**: Handles long documents, parallelizable
   - **Refine**: Iterative, maintains coherence

2. **Prompt Engineering**:
   - Custom prompts control output style
   - Different prompts for different audiences
   - Map and combine prompts for map-reduce

3. **Trade-offs**:
   - Speed vs. quality
   - Cost vs. comprehensiveness
   - Document length considerations

### When to Use Each Method:
| Chain Type | Best For | Speed | Cost |
|------------|----------|-------|------|
| Stuff | Short docs that fit in the context window | Fastest | Lowest |
| Map-Reduce | Long docs, any length | Medium | Medium |
| Refine | Coherent narrative summaries | Slowest | Highest |
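
A rough way to reason about the speed and cost columns is to count LLM calls per chain type. The sketch below uses a hypothetical `estimated_llm_calls` helper and assumes that, for map-reduce, all chunk summaries fit into a single combining call; if they do not, extra intermediate calls would be needed. Call counts are not token counts: refine resends the running summary with every chunk, so its token usage per call tends to be higher.

```python
def estimated_llm_calls(chain_type: str, num_chunks: int) -> int:
    """Estimate how many LLM calls a chain type makes for a given number of chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if chain_type == "stuff":
        return 1                  # everything goes into one prompt
    if chain_type == "map_reduce":
        return num_chunks + 1     # one call per chunk, plus one combining call (assumed)
    if chain_type == "refine":
        return num_chunks         # one initial call, then one per remaining chunk
    raise ValueError(f"unknown chain type: {chain_type}")


call_counts = {name: estimated_llm_calls(name, 12)
               for name in ("stuff", "map_reduce", "refine")}
print(call_counts)
```

The map calls can run in parallel while the refine calls cannot, which is why refine is the slowest option even though it makes a similar number of calls.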

### Next Steps:
- Try summarizing different types of documents (technical papers, articles, books)
- Experiment with different chunk sizes for map-reduce
- Create custom prompts for specific domains (legal, medical, etc.)
- Combine with translation for multilingual summaries