# Module 3: Data Preparation for RAG

## Learning Objectives
- Understand why data preparation is critical for RAG success
- Learn data extraction and chunking strategies
- Explore embedding model selection
- Handle complex and unstructured documents
- Use Databricks tools for data preparation


## 1. Why is Data Prep Important for RAG?

### The Foundation of RAG Quality

**Data preparation is the foundation of RAG quality.** Poor data preparation leads to poor RAG performance, regardless of how sophisticated your retrieval or generation models are.

### Potential Issues When Data is Prepared Improperly

#### 1.1 Lost in the Middle

**Problem**: Information in the middle of long contexts may be overlooked by LLMs.

**Research**:
- "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023)
- The "Needle in a Haystack" test probes the same weakness by hiding a fact at varying depths in a long context

**Impact**: 
- Critical information may be missed
- Retrieval returns relevant chunks, but the model doesn't use them effectively

**Solution**: 
- Proper chunking strategies
- Re-ranking retrieved chunks
- Limiting context length
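
One way to act on the re-ranking advice is to reorder the retrieved chunks so the strongest matches sit at the start and end of the prompt, the positions where the Liu et al. results suggest models attend best. The sketch below assumes the chunks are already sorted from most to least relevant; the function name `reorder_for_long_context` is illustrative and not taken from any library.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context.

    `chunks_by_relevance` must be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks go to the front half, odd ranks to the back half.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_long_context(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The least relevant chunk (`c5`) ends up in the middle, where an overlooked passage costs the least.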

#### 1.2 Inefficient Retrieval

**Problems**:
- **Poor chunking**: Chunks too large/small, breaking semantic units
- **Wrong embedding model**: Mismatch between query and document embeddings
- **Missing metadata**: Can't filter or rank effectively

**Impact**:
- Low retrieval precision
- Irrelevant chunks in context
- Poor final responses

#### 1.3 Exposing Data

**Security Risks**:
- Sensitive information in chunks
- PII (Personally Identifiable Information) leakage
- Confidential data exposure

**Impact**:
- Compliance violations
- Security breaches
- Privacy concerns

**Solution**:
- Data masking/redaction
- Access controls
- Chunk-level security
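
As a rough illustration of masking before chunks are embedded, the sketch below replaces a few PII patterns with placeholder labels. The regular expressions and the `redact_pii` helper are assumptions made for this example; they cover only simple US-style formats, and production pipelines usually rely on a dedicated PII detection tool.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.com or 555-123-4567. SSN: 123-45-6789."
print(redact_pii(sample))  # Contact [EMAIL] or [PHONE]. SSN: [SSN].
```

Redacting before chunking keeps sensitive values out of both the chunk text and the embeddings derived from it.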

#### 1.4 Wrong Embedding Model

**Problems**:
- Model doesn't match domain (e.g., using general model for code)
- Model doesn't match language
- Model doesn't capture semantic relationships needed

**Impact**:
- Poor semantic similarity
- Mismatched query-document embeddings
- Low retrieval quality


## 2. Data Prep Process Overview

### Complete Data Preparation Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│              Data Preparation Pipeline                        │
└─────────────────────────────────────────────────────────────┘

External Sources
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  Ingestion and Pre-processing                                │
│  - Extract text from various formats                         │
│  - Clean and normalize                                       │
│  - Handle encoding issues                                     │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Data Storage and Governance                                 │
│  - Delta Lake (storage)                                      │
│  - Unity Catalog (governance)                                │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Chunking                                                    │
│  - Split documents into chunks                               │
│  - chunk1, chunk2, chunk3, ...                              │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Embedding                                                   │
│  - Convert chunks to vectors                                 │
│  - Using embedding models                                    │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Vector Store                                                │
│  - Store embeddings and metadata                             │
│  - Enable similarity search                                  │
└─────────────────────────────────────────────────────────────┘
```

### Key Stages

1. **Ingestion**: Bring data into the system
2. **Storage**: Store in Delta Lake with Unity Catalog governance
3. **Chunking**: Break into manageable pieces
4. **Embedding**: Convert to vector representations
5. **Indexing**: Store in vector database


## 3. Data Storage and Governance

### 3.1 What is Delta Lake?

**Delta Lake** is an open-source storage layer that brings ACID transactions to data lakes:

**Key Features**:
- **ACID Transactions**: Ensures data consistency
- **Time Travel**: Query historical versions of data
- **Schema Enforcement**: Prevents bad data from entering
- **Upserts and Deletes**: Efficient data updates
- **Scalability**: Handles petabytes of data

**Benefits for RAG**:
- Reliable document storage
- Version control for knowledge base
- Efficient updates and deletions
- Integration with Databricks ecosystem

### 3.2 Unity Catalog

**Unity Catalog** is Databricks' unified governance solution:

**Features**:
- **Centralized Metadata**: Single source of truth
- **Fine-grained Access Control**: Table, column, row-level security
- **Data Lineage**: Track data flow and transformations
- **Audit Logging**: Compliance and security
- **Cross-workspace Sharing**: Share data securely

**Benefits for RAG**:
- Secure access to documents
- Compliance with data regulations
- Track document sources and transformations
- Manage permissions down to the chunk level (for example, with row-level security on the chunks table)

### 3.3 Storage Architecture for RAG

```
┌─────────────────────────────────────────────────────────────┐
│              RAG Data Storage Architecture                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  Unity Catalog                                                │
│       │                                                       │
│       ├─── Documents Table (Delta)                           │
│       │    - document_id                                     │
│       │    - source_path                                     │
│       │    - content                                         │
│       │    - metadata                                        │
│       │                                                       │
│       ├─── Chunks Table (Delta)                              │
│       │    - chunk_id                                        │
│       │    - document_id                                     │
│       │    - chunk_text                                      │
│       │    - chunk_index                                     │
│       │    - metadata                                        │
│       │                                                       │
│       └─── Embeddings Table (Delta)                          │
│            - chunk_id                                        │
│            - embedding_vector                                │
│            - model_name                                      │
│                                                               │
└─────────────────────────────────────────────────────────────┘
```


## 4. Data Extraction and Chunking

### 4.1 Typical Process

The standard RAG data preparation process:

1. **Split document into chunks**
   - Break large documents into smaller pieces
   - Maintain semantic coherence

2. **Embed the chunks with a model**
   - Convert text to vector representations
   - Capture semantic meaning

3. **Store in a vector store**
   - Index for fast retrieval
   - Store with metadata
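
The three steps can be sketched end to end with toy components. Everything here is a stand-in chosen for illustration: `embed` builds a word-count vector rather than calling a real embedding model, and the "vector store" is a plain Python list searched by brute force.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=10):
    """Step 1: split into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Step 2 (toy stand-in): a sparse word-count vector, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(doc_id, text):
    """Step 3: keep each vector together with its text and metadata."""
    return [
        {"doc_id": doc_id, "chunk_index": i, "text": c, "vector": embed(c)}
        for i, c in enumerate(chunk_text(text))
    ]


def search(store, query, k=1):
    q = embed(query)
    return sorted(store, key=lambda row: cosine(q, row["vector"]), reverse=True)[:k]


doc = (
    "Delta Lake stores the raw documents and the chunk tables. "
    "Embedding models convert each chunk of text into a vector. "
    "A vector store indexes those vectors for similarity search."
)
store = build_store("doc_1", doc)
best = search(store, "which component indexes vectors for search?")[0]
print(best["chunk_index"], best["text"])
```

Swapping `embed` for a real model and the list for a vector index changes the quality and scale, not the shape of the pipeline.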

### 4.2 Constraints and Risks

#### Risk: Chunk Could Be Out of Context

**Problem**: Breaking documents arbitrarily can lose context

**Example**:
```
Original: "The company's revenue increased by 25% in Q3. This growth 
was driven by strong performance in the European market."

Bad Chunk 1: "The company's revenue increased by 25% in Q3."
Bad Chunk 2: "This growth was driven by strong performance in the 
European market."

Issue: Chunk 2 opens with "This growth" but no longer says which growth
it means, and Chunk 1 no longer carries the cause of the increase.
```

**Solution**: Split at semantic boundaries, or carry context across chunks with overlap or summaries

### 4.3 How to Chunk Data?

#### Strategy 1: Semantic Chunking

**Approach**: Split based on semantic boundaries (sentences, paragraphs, sections)

**Advantages**:
- Preserves semantic meaning
- Natural boundaries
- Better for retrieval

**Implementation**:
- Use an embedding model (e.g., a Sentence Transformers model) to detect topic shifts between adjacent sentences
- Split at paragraph breaks
- Respect document structure (headers, sections)
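
A minimal sketch of the structural part of this strategy follows: it packs whole sentences into chunks and never crosses a paragraph break. It is an assumption-laden simplification. The sentence splitter is a naive regular expression, `max_chars` is an illustrative parameter, and embedding-based boundary detection is not shown.

```python
import re


def chunk_by_structure(text, max_chars=200):
    """Pack whole sentences into chunks, never crossing a paragraph break."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


text = (
    "The company's revenue increased by 25% in Q3. This growth was driven "
    "by strong performance in the European market.\n\n"
    "A second paragraph starts a new chunk."
)
for chunk in chunk_by_structure(text):
    print(repr(chunk))
```

With the default limit, the two related sentences stay in one chunk. A sentence longer than `max_chars` is kept whole rather than cut, so the limit is a target and not a hard cap.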

#### Strategy 2: Fixed Size Chunking

**Approach**: Split into fixed-size chunks (e.g., 500 tokens, 1000 characters)

**Advantages**:
- Simple to implement
- Predictable chunk sizes
- Easy to manage

**Disadvantages**:
- May break semantic units
- Can lose context

**Implementation**:
```python
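# Example: fixed-size chunking by character count
def chunk_fixed_size(text, chunk_size=500, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks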
```


### 4.4 Chunking Strategy is Use-Case Specific

**Key Principle**: There is no one-size-fits-all chunking strategy. It depends on:
- Document type
- Query patterns
- Use case requirements
- Model capabilities

#### Consider Your Document Length

**Questions to Ask**:
- How long are your documents?
- Are they single sentences, paragraphs, or multi-page documents?
- What's the typical query length?

#### Chunk Size Trade-offs

**1 Chunk = 1 Sentence**
- **Pros**: 
  - Very specific embeddings
  - Precise retrieval
- **Cons**: 
  - Embedding may lack broader context
  - May miss relationships between sentences

**1 Chunk = Multiple Paragraphs**
- **Pros**: 
  - Embeddings capture broader themes
  - Better for longer queries
- **Cons**: 
  - Less precise retrieval
  - May include irrelevant information

**Splitting by Headers**
- **Pros**: 
  - Respects document structure
  - Natural semantic boundaries
- **Cons**: 
  - Header sizes vary
  - May need additional splitting
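
For Markdown sources, header-based splitting can be sketched with a single pass over the lines. The `split_by_headers` helper and its record keys are hypothetical; the sketch also ignores that a `#` inside a fenced code block would be mistaken for a header, and it drops sections that have a header but no body.

```python
import re


def split_by_headers(markdown_text):
    """Return one {"header", "body"} record per non-empty Markdown section."""
    sections = []
    header, body = "(preamble)", []

    def flush():
        if any(line.strip() for line in body):
            sections.append({"header": header, "body": "\n".join(body).strip()})

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()  # close the previous section before starting a new one
            header, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Intro\nRAG basics.\n\n## Chunking\nSplit text.\nKeep context.\n\n## Empty\n"
for section in split_by_headers(doc):
    print(section["header"], "->", len(section["body"]), "chars")
```

Because section lengths vary widely, the bodies returned here are often passed through a size-based splitter as a second step.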

#### Chunk Overlap

**Chunk Overlap** defines how much text consecutive chunks share, which reduces the chance that context is lost at a chunk boundary.

**Example**:
```
Document: "A B C D E F G H I J"
Chunk size: 5
Overlap: 2

Chunk 1: "A B C D E"
Chunk 2: "D E F G H"  (overlaps with chunk 1)
Chunk 3: "G H I J"    (overlaps with chunk 2)
```

**Benefits**:
- Preserves context at boundaries
- Reduces information loss
- Improves retrieval for boundary queries

**Trade-off**: More overlap = more chunks = higher storage cost
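
The A to J example can be reproduced with a small sliding-window function over tokens. This is a sketch in which whitespace-separated words stand in for real tokenizer output, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Slide a window of `chunk_size` tokens, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = "A B C D E F G H I J".split()
for chunk in chunk_tokens(tokens, chunk_size=5, overlap=2):
    print(" ".join(chunk))

# The trade-off in numbers: more overlap produces more chunks to embed and store.
for overlap in (0, 2, 4):
    print(overlap, "->", len(chunk_tokens(tokens, 5, overlap)), "chunks")
```

On this ten-token document, overlaps of 0, 2, and 4 yield 2, 3, and 6 chunks respectively.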

#### Windowed Summarization

**Windowed Summarization** is a context-enriching chunking method in which each chunk includes a summary of the previous few chunks (the window).

**Example**:
```
Chunk 1: [Original content]
Chunk 2: [Summary of Chunk 1] + [Original content]
Chunk 3: [Summary of Chunks 1-2] + [Original content]
```

**Benefits**:
- Maintains context across chunks
- Better for long documents
- Improves retrieval quality
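
The mechanics can be sketched as follows. The `summarize` function is a deliberate placeholder that just truncates text; in practice it would be replaced by a call to an LLM or another summarizer. The `window` parameter and the `[Context: ...]` prefix format are assumptions made for this example.

```python
def summarize(text, max_words=8):
    """Placeholder summarizer: keeps the first few words. Swap in an LLM call."""
    words = text.split()
    suffix = "..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


def add_windowed_summaries(chunks, window=2):
    """Prefix each chunk with a summary of up to `window` preceding chunks."""
    enriched = []
    for i, chunk in enumerate(chunks):
        previous = chunks[max(0, i - window):i]
        if previous:
            summary = summarize(" ".join(previous))
            enriched.append(f"[Context: {summary}]\n{chunk}")
        else:
            enriched.append(chunk)  # the first chunk has no preceding context
    return enriched


chunks = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
for item in add_windowed_summaries(chunks):
    print(item)
    print("---")
```

The enriched text is what gets embedded, so each vector reflects some of the preceding context as well as the chunk itself.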

#### Query Pattern Considerations

**Prior knowledge of your users' query patterns helps you choose a chunk size:**

- **Longer queries**: Carry more context, so their embeddings tend to align better with the chunks they should retrieve
- **Shorter queries**: Can match small, focused chunks precisely but may miss broader context

**Strategy**: 
- Analyze query patterns
- Adjust chunk size accordingly
- Consider hybrid approaches


### 4.5 Advanced Chunking Strategies

#### Summarization-Based Chunking

**Approach**: Create chunks with summaries of previous context

**Example**:
```
Original Document:
Section 1: Introduction to RAG
Section 2: Data Preparation
Section 3: Vector Stores

Chunk 1: 
[Section 1 content]

Chunk 2:
Summary: "The document introduces RAG (Retrieval-Augmented Generation) 
as a pattern for enhancing LLMs with external knowledge."
[Section 2 content]

Chunk 3:
Summary: "RAG enhances LLMs with external knowledge and requires careful 
data preparation, including chunking and embedding."
[Section 3 content]
```

**Benefits**:
- Maintains context across sections
- Better for hierarchical documents
- Improves retrieval for broad queries

#### Summarization with Metadata

**Approach**: Include structured metadata with each chunk

**Example**:
```json
{
  "chunk_id": "chunk_001",
  "document_id": "doc_123",
  "content": "[chunk text]",
  "metadata": {
    "section": "Introduction",
    "subsection": "What is RAG?",
    "page_number": 1,
    "summary": "This section introduces RAG...",
    "keywords": ["RAG", "retrieval", "augmentation"],
    "previous_context": "Summary of previous sections..."
  }
}
```

**Benefits**:
- Rich filtering capabilities
- Better ranking
- Improved retrieval precision
- Easier debugging

**Use Cases**:
- Technical documentation
- Research papers
- Legal documents
- Multi-section reports
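
Records of this shape can be assembled while chunking. The sketch below is hypothetical in its details: `build_chunk_records` is an illustrative helper, and the `previous_context` field simply lists earlier section titles where a real pipeline would store a generated summary.

```python
import json


def build_chunk_records(document_id, sections):
    """`sections` is a list of (section_title, text) pairs in document order."""
    records = []
    previous_titles = []
    for index, (title, text) in enumerate(sections, start=1):
        records.append({
            "chunk_id": f"chunk_{index:03d}",
            "document_id": document_id,
            "content": text,
            "metadata": {
                "section": title,
                "chunk_index": index,
                # Stand-in for a real summary: list the sections seen so far.
                "previous_context": "; ".join(previous_titles),
            },
        })
        previous_titles.append(title)
    return records


sections = [
    ("Introduction", "RAG combines retrieval with generation."),
    ("Data Preparation", "Chunk, embed, and index the documents."),
]
records = build_chunk_records("doc_123", sections)
print(json.dumps(records[1], indent=2))
```

Keeping metadata as plain JSON-serializable values makes it straightforward to store alongside the chunk text and to filter on later.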


## 5. Data Extraction and Chunking Challenges

### 5.1 Working with Complex Documents

Real-world documents often contain:

- **Mixed content**: Text, images, tables, prices, disclaimers
- **Irregular layouts**: Text mixed with images, irregular text placement
- **Color encoding**: Important information highlighted by color
- **Hierarchical information**: Charts with nested data
- **Multi-column layouts**: Order of columns is crucial
- **Images with context**: Images must stay with related text

### 5.2 Specific Challenges

#### Challenge 1: Preserving Information Order

**Problem**: Keeping the order of information is critical

**Example - Multi-column Text**:
```
Column 1          Column 2
Item A            Price: $10
Item B            Price: $20
```

If order is lost: "Price: $10" might be associated with "Item B"

**Solution**: 
- Use layout-aware extraction
- Preserve spatial relationships
- Maintain reading order
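
When an extractor returns text blocks with coordinates, the item and price pairing can be preserved by grouping blocks into rows before joining them. This is a sketch under stated assumptions: the block dictionaries with `x`, `y`, and `text` keys are a hypothetical format, and `y_tolerance` is an illustrative threshold. Multi-column prose needs the opposite treatment (read each column top to bottom), which is not shown here.

```python
def rows_in_reading_order(blocks, y_tolerance=3.0):
    """Group blocks into rows by vertical position, then sort each row left to right."""
    rows = []
    for block in sorted(blocks, key=lambda b: b["y"]):
        # Compare against the first block of the current row to avoid drift.
        if rows and abs(block["y"] - rows[-1][0]["y"]) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [
        " | ".join(b["text"] for b in sorted(row, key=lambda b: b["x"]))
        for row in rows
    ]


# Blocks arrive in arbitrary order, as they often do from PDF extraction.
blocks = [
    {"x": 200, "y": 101, "text": "Price: $20"},
    {"x": 10, "y": 80, "text": "Item A"},
    {"x": 200, "y": 81, "text": "Price: $10"},
    {"x": 10, "y": 100, "text": "Item B"},
]
for line in rows_in_reading_order(blocks):
    print(line)
```

Each output line keeps an item next to its own price, so a chunk built from these lines cannot pair "Price: $10" with "Item B".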

#### Challenge 2: Keeping Images with Related Information

**Problem**: Images provide crucial context that must be preserved

**Example**: 
- Product image with description
- Chart with explanation text
- Diagram with annotations

**Solution**:
- Extract image-text relationships
- Store images with metadata
- Use multimodal models

#### Challenge 3: Preserving Logical Sections

**Problem**: Some document types require structure preservation

**Example - HTML Documents**:
- HTML requires tag-based chunking to preserve logical structure
- Tables, lists, sections need special handling

**Challenge - HTML Tables**:
- HTML tables are token-heavy
- Example: a small 3x3 table (exact counts vary by tokenizer)
  - Plain text: 11 tokens
  - JSON: 29 tokens  
  - HTML: 132 tokens

**Solution**: 
- Convert HTML to structured format
- Extract tables separately
- Use specialized parsers
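
A minimal sketch of converting a table to a lighter format uses only the standard library's `html.parser`. The `TableToRows` class is written for this example and handles only simple, well-formed tables without nesting, `rowspan`, or `colspan`. Character counts are used as a rough proxy for tokens.

```python
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of a simple HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


html = (
    "<table><tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>A</td><td>$10</td></tr>"
    "<tr><td>B</td><td>$20</td></tr></table>"
)
parser = TableToRows()
parser.feed(html)
parser.close()
plain = "\n".join(" | ".join(cell.strip() for cell in row) for row in parser.rows)
print(plain)
print(len(html), "characters of HTML vs", len(plain), "characters of plain text")
```

The same rows could be serialized as JSON when downstream code needs explicit column names, at a size between the two extremes.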


### 5.3 General Approaches to Address Unstructured/Complex Documents

#### Approach 1: Traditional Libraries

**Libraries**: PyMuPDF, pdfplumber, python-docx

**Features**:
- Text extraction from PDFs, Word docs
- Basic layout detection
- Table extraction
- Metadata extraction

**Limitations**:
- May not preserve complex layouts
- Limited understanding of document structure
- Manual post-processing often needed

**Example - PyMuPDF**:
```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page in doc:
    text = page.get_text()
    # Process text...
```

#### Approach 2: Layout Models

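**Libraries**:
- LayoutLMv3 (Microsoft; available through Hugging Face Transformers)
- DocTR (Document Text Recognition)
- Unstructured (unstructured.io)
- PyPDF2 (a rule-based PDF text extractor rather than a layout model; it belongs with the traditional libraries in Approach 1)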

**Features**:
- Apply deep learning models for text extraction
- Context extraction from layout
- Better handling of complex documents
- Understanding of document structure

**Advantages**:
- Better accuracy for complex layouts
- Preserves relationships between elements
- Handles tables, forms, multi-column layouts

**Example - Unstructured**:
```python
from unstructured.partition.auto import partition

elements = partition(filename="document.pdf")
# Returns structured elements with layout information
```

#### Approach 3: Multimodal Models

**Models**:
- OpenAI GPT-4o and later
- Google Gemini 1.5 and later
- Open-source: Dolphin series, OpenFlamingo, LLaVA, OLMo

**Features**:
- Multimodal LLMs accept images directly as input
- Can process text and images together
- Better context understanding

**Current Status**:
- More experimental at this stage
- Rapidly improving
- Best for documents with rich visual content

**Use Cases**:
- Documents with charts and graphs
- Screenshots with text
- Mixed media content


## 6. Embedding Models

### 6.1 What is an Embedding?

An **embedding** is a numerical representation of text (or other data) in a high-dimensional vector space where semantically similar items are close together.

**Visualization**: 
```
Text: "machine learning"
Embedding: [0.23, -0.45, 0.67, ..., 0.12]  (vector of numbers)

Similar texts have similar vectors:
"machine learning" ≈ "ML" ≈ "artificial intelligence"
```
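
"Close together" is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors; the numbers are invented for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Made-up vectors; real embeddings have hundreds or thousands of dimensions.
machine_learning = [0.9, 0.8, 0.1]
ml = [0.85, 0.75, 0.2]
cooking = [0.1, 0.2, 0.95]

print(round(cosine_similarity(machine_learning, ml), 3))
print(round(cosine_similarity(machine_learning, cooking), 3))
```

The first pair scores close to 1 and the second much lower, which is the property retrieval relies on when it ranks chunks against a query.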

### 6.2 Dimensionality Reduction for Visualization

**Concept**: High-dimensional embeddings (e.g., 768, 1536 dimensions) can be reduced to 2D/3D for visualization using techniques like:
- t-SNE
- UMAP
- PCA

**Purpose**: 
- Understand embedding space
- Visualize semantic relationships
- Debug embedding quality

**Example Visualization**:
```
2D Projection:
    [ML] ---- [AI]
      |         |
      |         |
   [Deep] --- [Neural]
```

### 6.3 Choosing the Right Model for Your Application

#### Factor 1: Data Text Properties

**Consider**:
- **Language**: Is your data in English, multilingual, or specific languages?
- **Domain**: General, technical, medical, legal, code?
- **Text Type**: Short queries, long documents, code snippets?

**Examples**:
- **Multilingual**: Use multilingual models (e.g., paraphrase-multilingual-MiniLM-L12-v2)
- **Code**: Use code-specific models (e.g., CodeBERT)
- **Medical**: Use domain-specific models (e.g., BioBERT)

#### Factor 2: Model Capabilities

**Consider**:
- **Embedding Dimension**: Higher dimensions = more capacity but more storage
- **Max Sequence Length**: How long can input be?
- **Training Data**: What was the model trained on?

**Common Dimensions**:
- 384: Smaller, faster (e.g., all-MiniLM-L6-v2)
- 768: Balanced (e.g., BERT-base)
- 1536: Larger, more capacity (e.g., OpenAI text-embedding-ada-002)

#### Factor 3: Practical Considerations

**Window Limits**:
- Model's maximum input length
- Your document/chunk sizes
- Query lengths

**Benchmarking**:
- Test on your specific data
- Measure retrieval quality
- Compare multiple models
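
One simple way to compare models is hit rate at k over a small labeled query set. In the sketch below the query ids, chunk ids, and the two result sets are invented for illustration; in practice each result list would come from running the same queries against an index built with a different embedding model.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries whose top-k results contain at least one relevant chunk."""
    hits = sum(
        1 for query, ranked_ids in results.items()
        if set(ranked_ids[:k]) & relevant[query]
    )
    return hits / len(results)


# Ground truth: which chunk ids answer each query (hypothetical labels).
relevant = {"q1": {"c3"}, "q2": {"c7"}, "q3": {"c1", "c2"}}

# Ranked retrieval results per candidate model (hypothetical).
model_a = {"q1": ["c3", "c9", "c4"], "q2": ["c5", "c7", "c8"], "q3": ["c6", "c4", "c1"]}
model_b = {"q1": ["c9", "c4", "c3"], "q2": ["c7", "c5", "c8"], "q3": ["c2", "c4", "c6"]}

for name, results in [("model_a", model_a), ("model_b", model_b)]:
    print(name, {k: round(hit_rate_at_k(results, relevant, k), 2) for k in (1, 3)})
```

Here both models find a relevant chunk within the top 3 for every query, but `model_b` places one first more often, which matters when only a few chunks fit in the prompt.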

### 6.4 Embedding Model Requirements

#### Requirement 1: Represent Both Queries and Documents

**Critical**: The embedding model must work well for both:
- **Query embeddings**: Short, question-like text
- **Document embeddings**: Longer, document-like text

**Problem**: Some models are optimized for one or the other

**Solution**: Use models trained for both (e.g., BGE, OpenAI text-embedding-ada-002)

#### Requirement 2: Ensure Similar Embedding Space

**Critical**: Query and document embeddings must be in the same embedding space

**Problem**: Using different models for queries and documents creates mismatched spaces

**Solution**: 
- Use the same model for both
- If you use separate query and document encoders, choose a pair trained jointly on query-document pairs
- Fine-tune on your data if needed

**Example - What NOT to do**:
```python
# WRONG: Different models
query_embedding = model_A(query)      # Model A
doc_embedding = model_B(document)     # Model B
# These won't be comparable!
```

**Example - What to do**:
```python
# CORRECT: Same model
query_embedding = model(query)         # Same model
doc_embedding = model(document)       # Same model
# These are in the same space!
```
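
A lightweight safeguard is to store the model name with every vector, as the `model_name` column in the embeddings table suggests, and refuse to compare vectors produced by different models. The `Embedding` class and `similarity` function below are a sketch written for this example, and the model names are placeholders.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    vector: tuple
    model_name: str  # which model produced the vector


def similarity(a, b):
    """Cosine similarity that rejects vectors from different models."""
    if a.model_name != b.model_name:
        raise ValueError(
            f"cannot compare embeddings from {a.model_name!r} and {b.model_name!r}"
        )
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * math.sqrt(sum(y * y for y in b.vector))
    return dot / norm if norm else 0.0


query = Embedding((1.0, 0.0), "model_a")
doc_same_model = Embedding((0.6, 0.8), "model_a")
doc_other_model = Embedding((0.6, 0.8), "model_b")

print(similarity(query, doc_same_model))
try:
    similarity(query, doc_other_model)
except ValueError as err:
    print("rejected:", err)
```

A check like this also catches the common operational mistake of switching embedding models without re-embedding the existing chunks.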


## 7. Unstructured Data Prep in Databricks

### Complete Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│        Unstructured Data Prep in Databricks                  │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐
│  Ingestion   │
│  (Tables &   │  → Load documents from various sources
│   Volumes)   │
└──────┬───────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Document Processing                                         │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Workflow / DLT / Notebooks                          │  │
│  │  - Extract text from PDFs, DOCX, etc.                │  │
│  │  - Chunk documents                                    │  │
│  │  - Extract metadata                                   │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Chunks and Features                                        │
│  - Chunk text                                               │
│  - Metadata                                                 │
│  - Document relationships                                   │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Storage (Delta Tables)                                     │
│  - Store chunks                                             │
│  - Store metadata                                           │
│  - Unity Catalog governance                                 │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Automatic Sync                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Vector DB (Vector Search)                           │  │
│  │  - Databricks computes embeddings automatically     │  │
│  │  - Syncs with Delta tables                           │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Model Serving                                               │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  - Custom models                                       │  │
│  │  - External models (OpenAI ada-002)                   │  │
│  │  - Foundational models (BGE)                          │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

### Key Databricks Features

1. **Ingestion**: 
   - Tables: Structured data storage
   - Volumes: Unstructured file storage

2. **Processing**:
   - Workflows: Orchestrate processing pipelines
   - DLT (Delta Live Tables): Declarative data pipelines
   - Notebooks: Custom processing logic

3. **Storage**:
   - Delta Tables: ACID transactions, versioning
   - Unity Catalog: Governance and security

4. **Vector Search**:
   - Automatic embedding computation
   - Delta table sync
   - Managed vector database

5. **Model Serving**:
   - Host embedding models
   - Support for various model types


## 8. Structured Data Prep in Databricks

### Structured vs Unstructured Data

**Structured Data**:
- Tables, databases
- CSV, Parquet, Delta tables
- Already in organized format

**Unstructured Data**:
- Documents, PDFs, images
- Free-form text
- Requires extraction

### Structured Data Preparation

**Process**:
1. **Data Cleaning**:
   - Handle missing values
   - Normalize formats
   - Remove duplicates

2. **Feature Engineering**:
   - Create embeddings from structured fields
   - Combine multiple columns
   - Extract metadata

3. **Storage**:
   - Delta tables
   - Unity Catalog

4. **Embedding**:
   - Convert structured data to text
   - Generate embeddings
   - Store in vector search

**Example - Product Catalog**:
```python
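# Structured data: a single product record
product = {
    "product_id": "123",
    "name": "Laptop",
    "description": "High-performance laptop",
    "category": "Electronics",
    "price": 999.99,
}

# Convert the descriptive fields to text for embedding;
# keep product_id and price as metadata for filtering
text = f"{product['name']} {product['description']} {product['category']}"

# embedding_model is a placeholder for your embedding function or endpoint client
embedding = embedding_model(text)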
```

### Hybrid Approaches

**Combine structured and unstructured**:
- Use structured metadata for filtering
- Use unstructured content for semantic search
- Best of both worlds
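
The hybrid pattern can be sketched as "filter first, then rank". In the example below the product rows are invented, `hybrid_search` is an illustrative helper, and Jaccard word overlap stands in for the vector similarity a real system would compute from embeddings.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hybrid_search(rows, query, category=None, max_price=None, k=2):
    """Filter on structured fields first, then rank the survivors by text overlap."""
    candidates = [
        r for r in rows
        if (category is None or r["category"] == category)
        and (max_price is None or r["price"] <= max_price)
    ]
    q = tokens(query)

    def score(row):
        t = tokens(row["text"])
        return len(q & t) / len(q | t) if q | t else 0.0  # Jaccard overlap

    return sorted(candidates, key=score, reverse=True)[:k]


rows = [
    {"product_id": "123", "category": "Electronics", "price": 999.99,
     "text": "High-performance laptop for developers"},
    {"product_id": "124", "category": "Electronics", "price": 1999.00,
     "text": "High-performance gaming laptop"},
    {"product_id": "125", "category": "Furniture", "price": 149.00,
     "text": "Laptop desk with adjustable height"},
]

top = hybrid_search(rows, "high-performance laptop", category="Electronics", max_price=1500)
print([r["product_id"] for r in top])
```

Without the filters, product 124 would rank first on text alone; the structured constraints remove it before ranking, which semantic search by itself cannot guarantee.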


## 9. Summary and Next Steps

### Key Takeaways

1. **Data preparation is critical**: RAG quality depends on it
2. **Chunking strategy matters**: It is use-case specific and requires experimentation
3. **Embedding model selection matters**: The model must match your domain, language, and use case
4. **Complex documents need care**: They require specialized tools and approaches
5. **Databricks covers the pipeline**: It provides integrated tools from ingestion to vector search

### Common Pitfalls to Avoid

1. ❌ Using wrong chunk sizes (too large/small)
2. ❌ Not preserving context between chunks
3. ❌ Mismatched embedding models for queries and documents
4. ❌ Ignoring document structure
5. ❌ Not testing on your specific data

### Next Module: Vector Stores and Search

In the next module, we'll explore:
- What vector databases are and why they're important
- Vector similarity and distance metrics
- Vector search strategies
- How to filter and rank results
- Introduction to Databricks Vector Search


## Exercises

1. **Exercise 1**: Design a chunking strategy for a specific document type (e.g., research papers, product catalogs)
2. **Exercise 2**: Compare fixed-size vs semantic chunking on a sample document
3. **Exercise 3**: Select an appropriate embedding model for a given use case
4. **Exercise 4**: Design a data preparation pipeline for complex PDFs with tables and images
5. **Exercise 5**: Create a chunking strategy with metadata for a multi-section document
