# Module 2: Document Types & Data Sources

## 🎯 Learning Objectives
By the end of this module, you will:
- Identify different document types and their unique challenges
- Use LangChain document loaders for various file formats
- Compare extraction quality across different methods
- Understand when to use traditional vs AI-powered parsing
- Handle real-world document processing challenges

## 📚 Key Concepts

### Why Document Processing Matters
In RAG systems, the quality of your document processing directly impacts your final results. **Garbage In = Garbage Out!**

### Document Types & Challenges

| Document Type | Common Issues | Best Approach |
|---------------|---------------|---------------|
| **Plain Text** | Encoding, structure | Simple loaders |
| **PDF** | Tables, images, layout | AI-powered parsing |
| **HTML** | Noise, dynamic content | Smart cleaning |
| **CSV/JSON** | Structure preservation | Specialized loaders |
| **Images** | OCR accuracy | Multimodal models |
| **Code** | Syntax preservation | Language-aware parsing |
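
The table above can be read as a routing rule: look at the file type, then pick a processing strategy. Here is a minimal sketch of that idea, assuming the file extension is a reliable hint (it often is not for real-world files). The names `STRATEGY_BY_SUFFIX` and `pick_strategy` are hypothetical and not part of LangChain.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the "Best Approach" column of the table.
STRATEGY_BY_SUFFIX = {
    ".txt": "simple loader",
    ".pdf": "AI-powered parsing",
    ".html": "smart cleaning",
    ".htm": "smart cleaning",
    ".csv": "specialized loader",
    ".json": "specialized loader",
    ".png": "multimodal model",
    ".jpg": "multimodal model",
    ".py": "language-aware parsing",
}


def pick_strategy(path: str) -> str:
    """Return a processing strategy for a file, based only on its extension."""
    suffix = Path(path).suffix.lower()
    return STRATEGY_BY_SUFFIX.get(suffix, "unknown: inspect manually")


for name in ["policy.txt", "specs.HTML", "scan.pdf", "notes.xyz"]:
    print(f"{name:12} -> {pick_strategy(name)}")
```

In practice you would probably combine this with content sniffing, since a `.txt` file can contain HTML and a `.pdf` can be a scanned image.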

### 2025 Breakthroughs 🚀
- **LlamaParse**: AI-powered PDF parsing with vision-language models
- **Hybrid Multimodal**: Combining traditional + AI approaches
- **Markdown Intermediate**: Better structure preservation
- **Unstructured Library**: Production-ready document processing
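
To make the "Markdown intermediate" idea concrete, here is a small sketch that converts an HTML table into a Markdown table using only the standard library. It is an illustration under simplifying assumptions (one flat table, first row is the header, no `|` characters inside cells), not how LlamaParse or Unstructured work internally. The class name `TableToMarkdown` is hypothetical.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect <tr>/<th>/<td> text and emit a Markdown table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def to_markdown(self):
        if not self.rows:
            return ""
        header, *body = self.rows
        lines = [
            "| " + " | ".join(header) + " |",
            "|" + "|".join(" --- " for _ in header) + "|",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)


html = """
<table>
  <tr><th>Component</th><th>Specification</th></tr>
  <tr><td>Uptime Guarantee</td><td>99.9% SLA</td></tr>
  <tr><td>API Rate Limit</td><td>10,000 requests/minute</td></tr>
</table>
"""

parser = TableToMarkdown()
parser.feed(html)
markdown_table = parser.to_markdown()
print(markdown_table)
```

Keeping the table as Markdown rather than flattened text preserves the row and column relationships, which plain `get_text()` extraction would lose.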


## 🛠️ Setup
Let's install the required packages for document processing.

In [None]:
# Install required packages
!pip install -q langchain langchain-community python-dotenv
!pip install -q pypdf beautifulsoup4 pandas requests
!pip install -q python-docx openpyxl
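# UnstructuredHTMLLoader needs `unstructured`; JSONLoader needs `jq`
!pip install -q unstructured jq
# For advanced parsing (optional)
# !pip install -q "unstructured[local-inference]" llama-parse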

In [None]:
import os
import pandas as pd
from pathlib import Path
import requests
from io import BytesIO

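# LangChain imports (loaders live in langchain-community; the old
# `langchain.document_loaders` path is deprecated)
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    UnstructuredHTMLLoader,
    CSVLoader,
    JSONLoader,
    UnstructuredWordDocumentLoader
)
from langchain_core.documents import Document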

# For web scraping
from bs4 import BeautifulSoup

print("✅ Setup complete!")
print("📁 Ready to process various document types")

## 📁 Exercise 1: Creating Sample Documents

Let's create different types of documents to work with.

In [None]:
# Create a sample documents directory
docs_dir = Path("sample_documents")
docs_dir.mkdir(exist_ok=True)

print(f"📁 Created directory: {docs_dir}")

In [None]:
# 1. Create a plain text document
text_content = """
COMPANY POLICY DOCUMENT
=======================

Remote Work Policy
------------------

Effective Date: January 1, 2024

Overview:
TechCorp Solutions supports flexible work arrangements to promote work-life balance 
and employee satisfaction while maintaining productivity and collaboration.

Eligibility:
- All full-time employees after 90-day probationary period
- Must have reliable internet connection (minimum 25 Mbps)
- Dedicated workspace meeting security requirements

Guidelines:
1. Maximum 3 remote days per week for hybrid workers
2. Core collaboration hours: 10 AM - 3 PM EST
3. Weekly team meeting attendance mandatory
4. Monthly in-person team building events

Equipment:
- Company provides laptop, monitor, and basic office supplies
- $500 annual stipend for home office improvements
- IT support available 9 AM - 6 PM EST

Performance Metrics:
- Same performance standards apply regardless of location
- Weekly check-ins with direct supervisor
- Quarterly 360-degree feedback reviews

Contact: hr@techcorp.com for questions
"""

with open(docs_dir / "remote_work_policy.txt", "w", encoding="utf-8") as f:
    f.write(text_content)

print("✅ Created: remote_work_policy.txt")

In [None]:
# 2. Create an HTML document
html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Product Specifications</title>
    <style>
        .spec-table { border-collapse: collapse; width: 100%; }
        .spec-table th, .spec-table td { border: 1px solid #ddd; padding: 8px; }
    </style>
</head>
<body>
    <header>
        <h1>TechCorp Model-X Specifications</h1>
        <nav>
            <a href="#overview">Overview</a> | 
            <a href="#specs">Technical Specs</a> | 
            <a href="#pricing">Pricing</a>
        </nav>
    </header>
    
    <main>
        <section id="overview">
            <h2>Product Overview</h2>
            <p>Model-X is our flagship AI-powered analytics platform designed for enterprise customers. 
            It processes large datasets in real-time and provides actionable insights through 
            machine learning algorithms.</p>
            
            <h3>Key Features</h3>
            <ul>
                <li>Real-time data processing</li>
                <li>Advanced machine learning models</li>
                <li>Customizable dashboards</li>
                <li>API integrations</li>
                <li>Enterprise-grade security</li>
            </ul>
        </section>
        
        <section id="specs">
            <h2>Technical Specifications</h2>
            <table class="spec-table">
                <tr><th>Component</th><th>Specification</th></tr>
                <tr><td>Processing Speed</td><td>1M+ queries per second</td></tr>
                <tr><td>Uptime Guarantee</td><td>99.9% SLA</td></tr>
                <tr><td>Data Storage</td><td>Unlimited cloud storage</td></tr>
                <tr><td>API Rate Limit</td><td>10,000 requests/minute</td></tr>
                <tr><td>Supported Formats</td><td>JSON, CSV, XML, Parquet</td></tr>
            </table>
        </section>
        
        <section id="pricing">
            <h2>Pricing Tiers</h2>
            <div class="pricing-card">
                <h3>Starter: $99/month</h3>
                <p>Up to 100GB data processing, basic analytics</p>
            </div>
            <div class="pricing-card">
                <h3>Professional: $499/month</h3>
                <p>Up to 1TB data processing, advanced ML models</p>
            </div>
            <div class="pricing-card">
                <h3>Enterprise: Custom pricing</h3>
                <p>Unlimited processing, dedicated support</p>
            </div>
        </section>
    </main>
    
    <footer>
        <p>© 2024 TechCorp Solutions. All rights reserved.</p>
        <script>
            // Some JavaScript that should be ignored
            console.log("Page loaded");
        </script>
    </footer>
</body>
</html>
"""

# The HTML contains a non-ASCII character (©), so write it explicitly as UTF-8
with open(docs_dir / "product_specs.html", "w", encoding="utf-8") as f:
    f.write(html_content)

print("✅ Created: product_specs.html")

In [None]:
# 3. Create a CSV document
employee_data = {
    'Employee_ID': ['EMP001', 'EMP002', 'EMP003', 'EMP004', 'EMP005'],
    'Name': ['Alice Johnson', 'Bob Smith', 'Carol Davis', 'David Wilson', 'Eva Brown'],
    'Department': ['Engineering', 'Marketing', 'Sales', 'HR', 'Engineering'],
    'Position': ['Senior Developer', 'Marketing Manager', 'Sales Rep', 'HR Specialist', 'DevOps Engineer'],
    'Salary': [120000, 85000, 65000, 70000, 110000],
    'Hire_Date': ['2020-03-15', '2021-07-20', '2019-11-30', '2022-01-10', '2023-05-08'],
    'Remote_Days': [3, 2, 1, 2, 4],
    'Performance_Rating': [4.8, 4.2, 3.9, 4.5, 4.7],
    'Skills': [
        'Python|JavaScript|Docker|AWS',
        'SEO|Content Marketing|Analytics|Adobe Creative',
        'CRM|Salesforce|Communication|Negotiation',
        'Recruitment|Training|Employee Relations|HRIS',
        'Kubernetes|CI/CD|Monitoring|Cloud Infrastructure'
    ]
}

df = pd.DataFrame(employee_data)
df.to_csv(docs_dir / "employee_data.csv", index=False)

print("✅ Created: employee_data.csv")
print(f"📊 Contains {len(df)} employee records")

In [None]:
# 4. Create a JSON document
api_config = {
    "api_version": "v2.1",
    "base_url": "https://api.techcorp.com",
    "authentication": {
        "type": "Bearer Token",
        "header_name": "Authorization",
        "token_prefix": "Bearer "
    },
    "endpoints": {
        "users": {
            "path": "/users",
            "methods": ["GET", "POST", "PUT", "DELETE"],
            "rate_limit": "1000/hour",
            "description": "Manage user accounts and profiles"
        },
        "analytics": {
            "path": "/analytics",
            "methods": ["GET", "POST"],
            "rate_limit": "500/hour",
            "description": "Access analytics data and reports"
        },
        "data": {
            "path": "/data",
            "methods": ["GET", "POST"],
            "rate_limit": "100/hour",
            "description": "Upload and retrieve datasets"
        }
    },
    "error_codes": {
        "400": "Bad Request - Invalid parameters",
        "401": "Unauthorized - Invalid or missing token",
        "403": "Forbidden - Insufficient permissions",
        "404": "Not Found - Resource does not exist",
        "429": "Too Many Requests - Rate limit exceeded",
        "500": "Internal Server Error - Contact support"
    },
    "supported_formats": ["JSON", "XML", "CSV"],
    "max_payload_size": "10MB",
    "timeout": "30 seconds"
}

import json
with open(docs_dir / "api_configuration.json", "w") as f:
    json.dump(api_config, f, indent=2)

print("✅ Created: api_configuration.json")

## 📖 Exercise 2: Loading Documents with LangChain

Now let's use LangChain's document loaders to process each document type.

In [None]:
def analyze_document(documents, doc_type):
    """Helper function to analyze loaded documents"""
    print(f"\n📄 {doc_type} Analysis:")
    print(f"   Number of documents: {len(documents)}")
    
    if documents:
        total_chars = sum(len(doc.page_content) for doc in documents)
        print(f"   Total characters: {total_chars:,}")
        print(f"   Average chars per doc: {total_chars // len(documents):,}")
        
        # Show first document preview
        first_doc = documents[0]
        print(f"   First 200 chars: {first_doc.page_content[:200]}...")
        print(f"   Metadata: {first_doc.metadata}")
    
    return documents

In [None]:
# 1. Load Plain Text Document
print("🔄 Loading Plain Text Document...")
try:
    text_loader = TextLoader(str(docs_dir / "remote_work_policy.txt"), encoding="utf-8")
    text_docs = text_loader.load()
    analyze_document(text_docs, "Plain Text")
except Exception as e:
    print(f"❌ Error loading text: {e}")

In [None]:
# 2. Load HTML Document
print("\n🔄 Loading HTML Document...")
try:
    html_loader = UnstructuredHTMLLoader(str(docs_dir / "product_specs.html"))
    html_docs = html_loader.load()
    analyze_document(html_docs, "HTML")
    
    # Let's also see how much content was extracted vs original
    with open(docs_dir / "product_specs.html", "r", encoding="utf-8") as f:
        original_html = f.read()
    
    print(f"   🎯 Content extraction efficiency:")
    print(f"      Original HTML: {len(original_html):,} chars")
    print(f"      Extracted text: {len(html_docs[0].page_content):,} chars")
    print(f"      Efficiency: {len(html_docs[0].page_content)/len(original_html)*100:.1f}% text extracted")
    
except Exception as e:
    print(f"❌ Error loading HTML: {e}")

In [None]:
# 3. Load CSV Document
print("\n🔄 Loading CSV Document...")
try:
    csv_loader = CSVLoader(str(docs_dir / "employee_data.csv"))
    csv_docs = csv_loader.load()
    analyze_document(csv_docs, "CSV")
    
    # CSV creates one document per row by default
    print(f"   📊 CSV Processing Details:")
    print(f"      Each row becomes a separate document")
    print(f"      Sample document content:")
    print(f"      {csv_docs[0].page_content}")
    
except Exception as e:
    print(f"❌ Error loading CSV: {e}")

In [None]:
# 4. Load JSON Document
print("\n🔄 Loading JSON Document...")
try:
    # JSONLoader needs the `jq` package and a jq_schema that selects what to extract
    json_loader = JSONLoader(
        file_path=str(docs_dir / "api_configuration.json"),
        jq_schema=".",  # Extract everything
        text_content=False
    )
    json_docs = json_loader.load()
    analyze_document(json_docs, "JSON")
    
except Exception as e:
    print(f"❌ Error loading JSON: {e}")
    
    # Alternative approach for JSON
    print("\n🔄 Trying alternative JSON loading...")
    with open(docs_dir / "api_configuration.json", "r") as f:
        json_content = json.load(f)
    
    # Convert JSON to readable text format
    json_text = json.dumps(json_content, indent=2)
    json_doc = Document(
        page_content=json_text,
        metadata={"source": "api_configuration.json", "type": "json"}
    )
    json_docs = [json_doc]  # same variable name as the JSONLoader path, for later cells
    
    analyze_document(json_docs, "JSON (Manual)")

## 🌐 Exercise 3: Web Content Loading

Let's try loading content from web pages.

In [None]:
# Simple web scraping function
def scrape_web_content(url, max_chars=2000):
    """
    Simple web scraping with error handling
    """
    try:
        print(f"🌐 Scraping: {url}")
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.decompose()
        
        # Get text content
        text = soup.get_text(separator=' ', strip=True)
        
        # Limit text length for demo
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        
        # Create LangChain document
        doc = Document(
            page_content=text,
            metadata={
                "source": url,
                # soup.title.string can be None (e.g. nested tags), so use get_text()
                "title": soup.title.get_text(strip=True) if soup.title else "No title",
                "type": "web_page"
            }
        )
        
        return [doc]
        
    except Exception as e:
        print(f"❌ Error scraping {url}: {e}")
        return []

# Test web scraping with a simple page
demo_urls = [
    "https://httpbin.org/html",  # Simple test page
    "https://example.com",      # Basic example page
]

for url in demo_urls:
    web_docs = scrape_web_content(url)
    if web_docs:
        analyze_document(web_docs, f"Web Content ({url})")
    print()

## 📊 Exercise 4: Comparing Document Processing Quality

Let's compare how well different approaches extract meaningful content.

In [None]:
def evaluate_extraction_quality(documents, doc_type):
    """
    Evaluate the quality of document extraction
    """
    if not documents:
        return {"quality_score": 0, "issues": ["No documents loaded"]}
    
    doc = documents[0]  # Analyze first document
    content = doc.page_content
    issues = []
    
    # Check for common issues
    if len(content.strip()) < 50:
        issues.append("Very short content")
    
    if content.count('\n\n\n') > 5:
        issues.append("Too many empty lines")
    
    if len(content.split()) < 10:
        issues.append("Very few words extracted")
    
    # Check for HTML artifacts
    html_artifacts = ['<', '>', '&nbsp;', '&amp;', '\xa0']
    if any(artifact in content for artifact in html_artifacts):
        issues.append("Contains HTML artifacts")
    
    # Calculate quality score (simple heuristic; length dominates, so long documents saturate at 100)
    word_count = len(content.split())
    char_count = len(content)
    
    quality_score = min(100, (
        (word_count / 10) +  # More words = better
        (char_count / 100) + # More chars = better
        (100 - len(issues) * 20)  # Fewer issues = better
    ) / 3)
    
    return {
        "quality_score": max(0, quality_score),
        "word_count": word_count,
        "char_count": char_count,
        "issues": issues
    }

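# Evaluate all our loaded documents.
# globals().get(...) avoids a NameError if an earlier loading cell failed.
document_sets = [
    (globals().get("text_docs", []), "Plain Text"),
    (globals().get("html_docs", []), "HTML"),
    (globals().get("csv_docs", []), "CSV"),
    # JSONLoader sets json_docs; the manual fallback sets json_doc
    (globals().get("json_docs") or ([json_doc] if "json_doc" in globals() else []), "JSON"),
]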

print("📊 DOCUMENT PROCESSING QUALITY REPORT")
print("=" * 50)

for docs, doc_type in document_sets:
    if docs:
        quality = evaluate_extraction_quality(docs, doc_type)
        
        print(f"\n📄 {doc_type}:")
        print(f"   Quality Score: {quality['quality_score']:.1f}/100")
        print(f"   Words: {quality['word_count']:,}")
        print(f"   Characters: {quality['char_count']:,}")
        
        if quality['issues']:
            print(f"   Issues: {', '.join(quality['issues'])}")
        else:
            print(f"   Issues: None ✅")
    else:
        print(f"\n📄 {doc_type}: No documents to evaluate")

## 🚀 Exercise 5: Advanced Document Processing (2025 Techniques)

Let's explore some advanced document processing techniques that are popular in 2025.

In [None]:
# Smart text cleaning and preprocessing
import re

def smart_text_cleaner(text):
    """
    Normalize whitespace, strip control characters, and tidy bullets and repeated punctuation
    """
    # Remove common artifacts first so the whitespace rules below see plain spaces
    text = text.replace('\xa0', ' ')         # Non-breaking spaces
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)  # Control chars only (keeps é, ©)
    
    # Remove extra whitespace but preserve paragraph breaks
    text = re.sub(r'\n\s*\n', '\n\n', text)  # Normalize paragraph breaks
    text = re.sub(r'[ \t]+', ' ', text)      # Normalize spaces
    
    # Smart handling of bullet points and lists
    text = re.sub(r'^\s*[•\-\*]\s+', '- ', text, flags=re.MULTILINE)
    
    # Clean up multiple consecutive punctuation
    text = re.sub(r'[.]{3,}', '...', text)
    text = re.sub(r'[!]{2,}', '!', text)
    text = re.sub(r'[?]{2,}', '?', text)
    
    return text.strip()

def create_enhanced_document(original_doc, cleaning_type="standard"):
    """
    Return a copy of a document with processing metadata attached.
    Only cleaning_type="aggressive" runs smart_text_cleaner; any other value leaves the text unchanged.
    """
    content = original_doc.page_content
    
    if cleaning_type == "aggressive":
        content = smart_text_cleaner(content)
    
    # Add processing metadata
    enhanced_metadata = original_doc.metadata.copy()
    enhanced_metadata.update({
        "processed": True,
        "cleaning_type": cleaning_type,
        "original_length": len(original_doc.page_content),
        "processed_length": len(content),
        "compression_ratio": len(content) / len(original_doc.page_content) if original_doc.page_content else 0
    })
    
    return Document(
        page_content=content,
        metadata=enhanced_metadata
    )

# Test enhanced processing
print("🔧 ENHANCED DOCUMENT PROCESSING")
print("=" * 40)

if html_docs:
    original_html = html_docs[0]
    enhanced_html = create_enhanced_document(original_html, "aggressive")
    
    print(f"\n📄 HTML Document Enhancement:")
    print(f"   Original length: {enhanced_html.metadata['original_length']:,} chars")
    print(f"   Processed length: {enhanced_html.metadata['processed_length']:,} chars")
    print(f"   Compression ratio: {enhanced_html.metadata['compression_ratio']:.2f}")
    
    print(f"\n📝 Content comparison:")
    print(f"   Original preview: {original_html.page_content[:150]}...")
    print(f"   Enhanced preview: {enhanced_html.page_content[:150]}...")

## 📚 Exercise 6: Document Metadata Enrichment

Let's add useful metadata to our documents for better RAG performance.

In [None]:
from datetime import datetime
import hashlib

def enrich_document_metadata(doc, additional_info=None):
    """
    Add comprehensive metadata to documents
    """
    content = doc.page_content
    
    # Calculate content statistics
    words = content.split()
    sentences = [s for s in content.split('.') if s.strip()]  # naive split on '.'; drops empty fragments
    
    # Create content hash for deduplication
    content_hash = hashlib.md5(content.encode()).hexdigest()[:16]
    
    # Analyze content type
    content_type = "general"
    if any(keyword in content.lower() for keyword in ['policy', 'procedure', 'guideline']):
        content_type = "policy"
    elif any(keyword in content.lower() for keyword in ['spec', 'technical', 'api', 'configuration']):
        content_type = "technical"
    elif any(keyword in content.lower() for keyword in ['employee', 'salary', 'department']):
        content_type = "hr_data"
    
    # Enhance metadata
    enriched_metadata = doc.metadata.copy()
    enriched_metadata.update({
        # Content statistics
        "word_count": len(words),
        "sentence_count": len([s for s in sentences if s.strip()]),
        "character_count": len(content),
        "avg_word_length": sum(len(word) for word in words) / len(words) if words else 0,
        
        # Content identification
        "content_hash": content_hash,
        "content_type": content_type,
        
        # Processing metadata
        "processed_at": datetime.now().isoformat(),
        "extraction_method": "langchain",
        
        # Quality indicators
        "has_structure": "\n\n" in content or "\n-" in content,
        # Crude proxy, not a real readability metric: average words per sentence x 2, capped at 10
        "readability_score": min(10, len(words) / max(1, len(sentences)) * 2),
    })
    
    # Add any additional info
    if additional_info:
        enriched_metadata.update(additional_info)
    
    return Document(
        page_content=content,
        metadata=enriched_metadata
    )

# Enrich all our documents
enriched_docs = []

# Process each document type
if text_docs:
    enriched_text = enrich_document_metadata(
        text_docs[0], 
        {"department": "HR", "confidentiality": "internal"}
    )
    enriched_docs.append((enriched_text, "Policy Document"))

if html_docs:
    enriched_html = enrich_document_metadata(
        html_docs[0], 
        {"department": "Product", "public": True}
    )
    enriched_docs.append((enriched_html, "Product Specs"))

print("🔍 ENRICHED DOCUMENT METADATA")
print("=" * 40)

for doc, doc_name in enriched_docs:
    print(f"\n📄 {doc_name}:")
    
    # Display key metadata
    meta = doc.metadata
    print(f"   Content Type: {meta.get('content_type', 'unknown')}")
    print(f"   Word Count: {meta.get('word_count', 0):,}")
    print(f"   Readability Score: {meta.get('readability_score', 0):.1f}/10")
    print(f"   Has Structure: {'Yes' if meta.get('has_structure') else 'No'}")
    print(f"   Content Hash: {meta.get('content_hash', 'unknown')}")
    
    if 'department' in meta:
        print(f"   Department: {meta['department']}")
    if 'confidentiality' in meta:
        print(f"   Confidentiality: {meta['confidentiality']}")

## 🧠 Key Takeaways

From this module, you should now understand:

### 📄 Document Processing Insights:
1. **Different formats require different approaches** - CSV creates multiple docs, HTML needs cleaning, JSON needs structure preservation
2. **Metadata is crucial** - Rich metadata enables better filtering and relevance in RAG systems
3. **Quality varies significantly** - Some formats extract cleaner text than others
4. **Processing choices matter** - Aggressive cleaning vs. structure preservation trade-offs
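
One way to preserve JSON structure in plain text is to flatten each leaf value into a `path: value` line, so that every line still records where the value came from. The sketch below uses only the standard library. `flatten_json` is a hypothetical helper and the dotted-path format is an assumption, not a LangChain convention.

```python
import json


def flatten_json(value, prefix=""):
    """Yield (path, leaf_value) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


config = json.loads("""
{
  "api_version": "v2.1",
  "endpoints": {
    "users": {"path": "/users", "methods": ["GET", "POST"]}
  }
}
""")

flat_lines = [f"{path}: {leaf}" for path, leaf in flatten_json(config)]
print("\n".join(flat_lines))
```

Each line is self-describing, so a chunk that contains only `endpoints.users.path: /users` still tells the retriever which endpoint it belongs to.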

### 🛠️ LangChain Document Loaders:
- **TextLoader**: Simple and reliable for plain text
- **PyPDFLoader**: Good for basic PDFs (we'll cover advanced PDF parsing later)
- **UnstructuredHTMLLoader**: Strips HTML markup (needs the `unstructured` package)
- **CSVLoader**: Creates one document per row by default
- **JSONLoader**: Needs the `jq` package and a `jq_schema` expression that selects what to extract
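
To see why a CSV file turns into many documents, it can help to reproduce the row-per-document idea by hand. The sketch below approximates that behavior with the standard library. Plain dicts stand in for LangChain `Document` objects, and `csv_rows_to_documents` is a hypothetical helper, not the actual `CSVLoader` implementation.

```python
import csv
import io


def csv_rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field keeps the header attached to each value
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


sample_csv = (
    "Employee_ID,Name,Department\n"
    "EMP001,Alice Johnson,Engineering\n"
    "EMP002,Bob Smith,Marketing\n"
)

row_docs = csv_rows_to_documents(sample_csv, source="employee_data.csv")
print(len(row_docs))
print(row_docs[0]["page_content"])
```

Because every row carries its own column names, each document can be retrieved and understood without the rest of the file.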

### 🚀 2025 Best Practices:
1. **Smart text cleaning** - Remove artifacts while preserving structure
2. **Comprehensive metadata** - Add processing info, content type, quality metrics
3. **Content hashing** - Enable deduplication and change detection
4. **Quality assessment** - Automated evaluation of extraction quality
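
Content hashing for deduplication and change detection can be sketched in a few lines. This example swaps in SHA-256 over whitespace-normalized text, which is a design choice for the sketch and differs from the truncated MD5 hash used in Exercise 6. The helper names `content_fingerprint` and `deduplicate` are hypothetical.

```python
import hashlib


def content_fingerprint(text):
    """Hash whitespace-normalized text so trivial spacing changes do not matter."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each distinct fingerprint."""
    seen = set()
    unique = []
    for text in texts:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique


corpus = [
    "Remote work is limited to 3 days per week.",
    "Remote work is  limited to 3 days\nper week.",  # same text, different spacing
    "Remote work is limited to 4 days per week.",     # a real content change
]

unique_texts = deduplicate(corpus)
print(len(unique_texts))
```

The same fingerprint also works for change detection: store it alongside the document and re-index only when a freshly computed fingerprint differs.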

## 🎯 Next Steps

In **Module 3**, we'll learn how to break these documents into optimal chunks for RAG systems:
- Why chunking is necessary
- Different chunking strategies (fixed-size, semantic, adaptive)
- How to preserve context while splitting documents
- Handling overlap and metadata propagation
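
As a small preview, fixed-size chunking with overlap and metadata propagation might look like the sketch below. It splits on raw character offsets, which is the simplest possible strategy and ignores sentence boundaries. The function `chunk_text` and its parameters are hypothetical, and plain dicts stand in for `Document` objects.

```python
def chunk_text(text, chunk_size=200, overlap=50, metadata=None):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        chunks.append({
            "page_content": piece,
            # Copy the parent metadata into every chunk, then add chunk-level fields
            "metadata": {**(metadata or {}), "chunk_index": index, "start": start},
        })
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters
chunks = chunk_text(sample, chunk_size=20, overlap=5, metadata={"source": "demo.txt"})

for chunk in chunks:
    print(chunk["metadata"]["start"], chunk["page_content"])
```

The overlap means the end of one chunk repeats at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.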

The quality of your document processing directly impacts your RAG system's performance!

## 🤔 Discussion Questions

1. Which document type presented the biggest processing challenges and why?
2. How would you handle documents with mixed content (text + tables + images)?
3. What additional metadata would be useful for your specific use case?
4. How would you handle document updates and versioning in a RAG system?

## 📝 Optional Exercise

Try processing documents from your own domain:
1. Find 2-3 documents in different formats from your work or projects
2. Process them using the techniques from this module
3. Evaluate the extraction quality and identify any domain-specific challenges
4. Create appropriate metadata schemas for your document types