# 🚀 Quick Start: RAG in 5 Minutes

**Already have Ollama installed with the models?** Jump straight to the "Import Dependencies" section and run the cells in order.

**First time?** Complete setup below (3 terminal commands), then run all cells.

```bash
# 1. Pull embedding model (creates vector representations)
ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf

# 2. Pull language model (generates answers)
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

# 3. Install Python packages
pip install ollama datasets jupyter
```

Then open this notebook and run all cells from top to bottom. No database setup needed!

---

# Foundation 01: Basic RAG (In-Memory)

This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) system using Simple Wikipedia articles with **in-memory storage** and optional JSON file caching.

This is the **simple version** with minimal dependencies - perfect for learning RAG fundamentals without database setup.

**Ready for persistent storage?** See `foundation/02-rag-postgresql-persistent.ipynb` for the PostgreSQL version with durable embeddings and registry integration.

## Setup and Installation

Before running this notebook, you need to:

1. Install Ollama from [ollama.com](https://ollama.com/)
2. Download the required models by running these commands in your terminal:

```bash
ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
```

3. Install the required Python packages:

```bash
pip install ollama datasets jupyter
```

## Learning Progression

- ✅ **You are here:** foundation/01 - In-memory RAG basics
- ⏭️ **Next:** foundation/02 - PostgreSQL persistent storage with registry
- 🎯 **Path:** See `LEARNING_ROADMAP.md` for complete learning paths

## Import Dependencies

In [None]:
import ollama
from datasets import load_dataset
import json
import sys
import math

## Configuration

Set the target dataset size. The loading cell processes articles until the sample reaches approximately this size.

In [None]:
# Target dataset size in MB (adjust as needed: 10, 20, 30, 40, 50)
TARGET_SIZE_MB = 10

# Maximum chunk size in characters (for splitting long articles)
MAX_CHUNK_SIZE = 1000

# Whether to save the dataset locally for reuse
SAVE_LOCALLY = True
LOCAL_DATASET_PATH = f'wikipedia_dataset_{TARGET_SIZE_MB}mb.json'

## Load and Filter the Wikipedia Dataset

We'll use Simple Wikipedia, which has cleaner, more concise articles. The dataset will be filtered to approximately your target size.
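
"Size" can mean two different things here. The loading cell below tracks progress with `sys.getsizeof`, which reports the in-memory footprint of a Python string object, not the number of bytes the text would occupy in a file. The following is a minimal sketch of the difference; `utf8_size_mb` is a hypothetical helper and is not part of the notebook.

```python
import sys


def utf8_size_mb(text: str) -> float:
    """Size of text in MB when encoded as UTF-8 (roughly what a saved file occupies)."""
    return len(text.encode('utf-8')) / (1024 * 1024)


sample = 'Article: Paris\n\nParis is the capital of France. ' * 1000

in_memory_bytes = sys.getsizeof(sample)   # Python object footprint (includes header overhead)
utf8_bytes = len(sample.encode('utf-8'))  # bytes written to a UTF-8 encoded file

print(f'sys.getsizeof: {in_memory_bytes:,} bytes')
print(f'UTF-8 length:  {utf8_bytes:,} bytes ({utf8_size_mb(sample):.3f} MB)')
```

For plain ASCII text the two numbers are close, but for text with many non-ASCII characters they can diverge noticeably, so the target size is best read as approximate.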

In [None]:
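def estimate_size_mb(text):
    """Estimate the size of text in megabytes (UTF-8 encoded)."""
    # sys.getsizeof() would measure the Python object's memory footprint, not the text size
    return len(text.encode('utf-8')) / (1024 * 1024)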

def chunk_text(text, max_size=1000):
    """Split text into chunks of approximately max_size characters.
    
    Why chunking?
    - Long documents don't fit in embedding models efficiently
    - Smaller chunks retrieve more precise context
    - Breaking at paragraph boundaries keeps related sentences together
      (note: this function does not create overlapping chunks)
    
    Tries to break at paragraph boundaries when possible.
    """
    # EARLY EXIT: If text is already short enough, return as-is
    # This avoids unnecessary processing and preserves the original text format
    if len(text) <= max_size:
        return [text]
    
    chunks = []
    # Split by double newlines to respect document structure (paragraphs are semantic units)
    # This is key to the algorithm: paragraphs are natural boundaries in human-written text
    # Example: "Para1\n\nPara2\n\nPara3" → ["Para1", "Para2", "Para3"]
    paragraphs = text.split('\n\n')
    current_chunk = ''
    
    for paragraph in paragraphs:
        # ALGORITHM STEP 1: Check if adding this paragraph would exceed the limit
        # We check BEFORE adding so chunks stay at or near max_size (the '\n\n' separators
        # are not counted, and a single over-long sentence is kept whole)
        # Pattern: accumulate paragraphs until next one would overflow
        if len(current_chunk) + len(paragraph) > max_size:
            # BOUNDARY DETECTION: Current chunk is "full", save it and start fresh
            if current_chunk:  # Only save if not empty (avoid empty chunks at boundaries)
                chunks.append(current_chunk.strip())
                current_chunk = ''
            
            # OVERFLOW HANDLING: If a single paragraph exceeds max_size, we must split further
            # Fall back to sentence-level splitting (finer granularity than paragraphs)
            if len(paragraph) > max_size:
                # Split by sentence boundary (period followed by space)
                # This is less ideal than paragraph boundaries but necessary for overflow handling
                sentences = paragraph.split('. ')
                for j, sentence in enumerate(sentences):
                    # split('. ') drops the separator, so restore it on every sentence
                    # except the last (which keeps whatever punctuation it already had)
                    if j < len(sentences) - 1:
                        sentence += '. '
                    # Accumulate sentences until the next one would overflow the chunk.
                    # A single sentence longer than max_size is kept whole as its own chunk.
                    if len(current_chunk) + len(sentence) > max_size:
                        # Start a new sentence-level chunk
                        if current_chunk:
                            chunks.append(current_chunk.strip())
                        current_chunk = sentence
                    else:
                        # Accumulate this sentence with previous ones
                        current_chunk += sentence
            else:
                # CASE: Single paragraph fits in a chunk by itself
                # Assign it as the start of a new chunk (might accumulate more paragraphs)
                current_chunk = paragraph
        else:
            # ACCUMULATION: Paragraph fits within remaining space, add to current chunk
            # Preserve paragraph boundary by adding newlines (except for first paragraph)
            current_chunk += '\n\n' + paragraph if current_chunk else paragraph
    
    # FINALIZATION: Don't forget the last chunk accumulated
    # Edge case: last paragraph was already added to current_chunk but not committed
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks
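
The `chunk_text` function above produces non-overlapping chunks. If you want neighbouring chunks to share some text, so that a sentence cut at a boundary still appears whole in one of them, a sliding window is one simple option. The sketch below is an illustration only: `chunk_with_overlap` and its `overlap` parameter are hypothetical names, and it splits on raw character positions instead of paragraph boundaries.

```python
def chunk_with_overlap(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if max_size <= 0:
        raise ValueError('max_size must be positive')
    if not 0 <= overlap < max_size:
        raise ValueError('overlap must satisfy 0 <= overlap < max_size')

    step = max_size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = 'abcdefghijklmnopqrstuvwxyz'
for piece in chunk_with_overlap(demo, max_size=10, overlap=3):
    print(piece)
```

Each window repeats the last three characters of the previous one. The trade-off is a larger dataset (more chunks to embed) in exchange for fewer facts split across a boundary.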


In [None]:
import os

print(f'Loading Wikipedia dataset (target size: {TARGET_SIZE_MB}MB)...')
print('Please wait, this may take a minute...\n')

# Check if we have a cached version locally
if SAVE_LOCALLY and os.path.exists(LOCAL_DATASET_PATH):
    print(f'✓ Found cached dataset: {LOCAL_DATASET_PATH}')
    with open(LOCAL_DATASET_PATH, 'r', encoding='utf-8') as f:
        saved_data = json.load(f)
    dataset = saved_data['chunks']
    print(f'✓ Loaded {len(dataset)} chunks from cache')
else:
    # Load from HuggingFace datasets
    print('Downloading Simple Wikipedia from HuggingFace...')
    wikipedia = load_dataset("wikimedia/wikipedia", "20231101.simple", trust_remote_code=True)
    articles = wikipedia['train']

    # Filter and chunk articles to reach target size
    dataset = []
    total_size = 0
    target_bytes = TARGET_SIZE_MB * 1024 * 1024

    print(f'Processing articles (target: {TARGET_SIZE_MB}MB)...\n')

    for i, article in enumerate(articles):
        # Stop when we reach target size
        if total_size >= target_bytes:
            break

        # Format: "Article: {title}\n\n{text}"
        article_text = f"Article: {article['title']}\n\n{article['text']}"

        # Chunk the article text
        chunks = chunk_text(article_text, max_size=MAX_CHUNK_SIZE)
        dataset.extend(chunks)

        # Track size
        article_size = sys.getsizeof(article_text)
        total_size += article_size

        # Progress update every 20 articles
        if (i + 1) % 20 == 0:
            progress_pct = (total_size / target_bytes) * 100
            print(f'  Processed {i+1} articles, {len(dataset)} chunks ({progress_pct:.1f}% of target size)')

    print(f'\n✓ Dataset ready: {len(dataset)} chunks from {i+1} articles')
    print(f'  Total size: {total_size / (1024*1024):.2f} MB\n')

    # Save locally if requested
    if SAVE_LOCALLY:
        print(f'Saving dataset to {LOCAL_DATASET_PATH}...')
        with open(LOCAL_DATASET_PATH, 'w', encoding='utf-8') as f:
            json.dump({'chunks': dataset}, f, ensure_ascii=False)
        print(f'✓ Saved dataset locally for future reuse')

## Sample Data

Let's look at a few examples from our dataset:

In [None]:
print('Sample chunks from the dataset:\n')
for i, chunk in enumerate(dataset[:3]):
    print(f'--- Chunk {i+1} ---')
    print(chunk[:300] + '...' if len(chunk) > 300 else chunk)
    print()

## Configure Models

We'll use two models:
- **Embedding Model**: Converts text into vector representations
- **Language Model**: Generates responses based on retrieved context

In [None]:
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

## Implement the Vector Database

### Indexing Phase

In the indexing phase, we:
1. Break the dataset into chunks (already done during loading)
2. Calculate embedding vectors for each chunk
3. Store chunks with their embeddings in our vector database

Each element in `VECTOR_DB` will be a tuple: `(chunk, embedding)`

The embedding is a list of floats, for example: `[0.1, 0.04, -0.34, 0.21, ...]`

**Note**: This may take a few minutes depending on your dataset size.

In [None]:
# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
# embedding = [0.12, -0.45, 0.78, ...] (768 dimensions for our model)
VECTOR_DB = []

def add_chunk_to_database(chunk):
    """Add a chunk and its embedding to the vector database.
    
    This is the critical step that makes RAG work:
    
    TEXT CHUNK              EMBEDDING MODEL            VECTOR (768 numbers)
    "Paris is the      →    (BGE Model)        →    [0.12, -0.45, 0.78, ...]
     capital of France"     
    
    The embedding captures semantic meaning. Similar chunks get similar vectors!
    """
    # Generate embedding vector from text (768-dimensional for BGE model)
    embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
    # Store both the original text and its vector representation
    VECTOR_DB.append((chunk, embedding))
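
`add_chunk_to_database` makes one embedding call per chunk. A common way to cut per-call overhead is to send several chunks at once. The sketch below shows the batching logic on its own, with the embedder passed in as a function so it can run without an Ollama server. `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[tuple[str, list[float]]]:
    """Embed chunks in batches and return (chunk, embedding) tuples."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')

    vector_db = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError('embed_fn must return one vector per input chunk')
        vector_db.extend(zip(batch, vectors))
    return vector_db


# Stand-in embedder so the sketch runs offline:
# it maps each text to a 2-number vector [length, vowel count].
def fake_embed(texts: list[str]) -> list[list[float]]:
    return [[float(len(t)), float(sum(c in 'aeiou' for c in t))] for t in texts]


db = embed_in_batches(['paris', 'berlin', 'rome'], fake_embed, batch_size=2)
print(db)
```

With Ollama, `embed_fn` could plausibly be `lambda batch: ollama.embed(model=EMBEDDING_MODEL, input=batch)['embeddings']`, assuming your installed client version accepts a list for `input`; check its documentation before relying on this.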

Now let's populate our vector database with all chunks from the dataset:

In [None]:
print(f'Building vector database with {len(dataset)} chunks...')
print('This may take a few minutes...\n')

for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    
    # Progress update every 50 chunks
    if (i + 1) % 50 == 0:
        print(f'Embedded {i+1}/{len(dataset)} chunks ({(i+1)/len(dataset)*100:.1f}%)')

print(f'\n✓ Vector database ready with {len(VECTOR_DB)} embeddings!')

## Implement the Retrieval Function

### Cosine Similarity

To find the most relevant chunks, we need to compare vector similarity. We'll use cosine similarity, which measures how "close" two vectors are in the vector space. Higher cosine similarity means more similar meaning.

In [None]:
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors.
    
    MATHEMATICAL FOUNDATION:
    Formula: similarity = (A · B) / (||A|| × ||B||)
    
    Result ranges from -1 to 1:
      1.0  = identical direction (perfect match)
      0.5  = 60° angle (moderately similar)
      0.0  = 90° angle (perpendicular, unrelated)
    
    Why cosine for text embeddings?
    - Measures DIRECTION not magnitude
    - Short text vs long text with same meaning → same similarity
    - Ignores document length bias (unlike Euclidean distance)
    - Works in high dimensions (768D embeddings)
    
    See CONCEPTS.md Section 5 for detailed derivation
    """
    # STEP 1: Compute dot product (sum of element-wise products)
    # This measures how aligned the vectors are: A · B = Σ(aᵢ × bᵢ)
    # Example: [1,2,3] · [2,3,4] = 1×2 + 2×3 + 3×4 = 2+6+12 = 20
    dot_product = sum([x * y for x, y in zip(a, b)])
    
    # STEP 2: Compute magnitude (L2 norm) of vector A
    # Magnitude = √(a₁² + a₂² + ... + aₙ²)
    # Represents the "length" of the vector
    # Example: [1,2,3] → √(1+4+9) = √14 ≈ 3.742
    norm_a = sum([x ** 2 for x in a]) ** 0.5
    
    # STEP 3: Compute magnitude of vector B
    # Same formula as above, applied to the second vector
    norm_b = sum([x ** 2 for x in b]) ** 0.5
    
    # STEP 4: Normalize by magnitudes to get cosine similarity
    # The division removes magnitude influence, keeping only direction
    # Cosine of angle between vectors = dot_product / (magnitude_a × magnitude_b)
    return dot_product / (norm_a * norm_b)
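
To make the formula concrete, here is a small worked example using the same vectors as the comments above. It is a standalone sketch: the `cosine` helper is a hypothetical variant that also guards against zero-length vectors, which `cosine_similarity` does not do.

```python
import math


def cosine(a, b):
    """Cosine similarity with a guard for zero-length vectors (returns 0.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as "unrelated"
    return dot / (norm_a * norm_b)


a, b = [1, 2, 3], [2, 3, 4]
print(round(cosine(a, b), 4))             # dot = 20, norms = sqrt(14) and sqrt(29)
print(round(cosine(a, [2, 4, 6]), 4))     # same direction, twice the length
print(round(cosine([1, 0], [0, 1]), 4))   # perpendicular vectors
print(round(cosine([1, 0], [-1, 0]), 4))  # opposite directions
```

The second case shows why magnitude does not matter: doubling every component leaves the similarity at 1.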


### Retrieval Function

The retrieval function:
1. Converts the query into an embedding vector
2. Compares it against all vectors in the database
3. Returns the top N most relevant chunks

In [None]:
def retrieve(query, top_n=3):
    """Retrieve the top N most relevant chunks for a given query.
    
    This is the RETRIEVAL phase of RAG:
    
    1. Convert query to embedding  →  [query_vector]
    2. Compare to all stored embeddings using cosine similarity
    3. Return top N most similar chunks
    
    Why it works: Similar meanings produce similar vectors!
    If you ask "What is the capital of France?"
    It will find chunks about Paris because they're semantically similar.
    
    COMPLEXITY ANALYSIS:
    - Query embedding: one forward pass of the embedding model (cost grows with token count)
    - Similarity computation: O(n × d) where n=chunk count, d=embedding dimension
    - Sorting: O(n log n) to rank all chunks
    - Total: O(n × d + n log n); with d=768 the similarity pass usually dominates
    
    For large databases (>100k chunks), consider an approximate nearest-neighbor
    index (e.g. HNSW, available in Faiss) to avoid scanning every vector;
    HNSW lookups scale roughly logarithmically with n.
    """
    # PHASE 2A: EMBED THE QUERY
    # Convert user's natural language question into a vector in the same space as indexed chunks
    # Critical: MUST use the same embedding model as indexing phase!
    # Mismatch (e.g., indexing with BGE, retrieving with OpenAI) causes complete failure
    query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    
    # PHASE 2B: COMPUTE SIMILARITIES
    # For each stored chunk, measure how similar it is to the query
    # This is the semantic search: finding meaning-based matches, not keyword matches
    similarities = []
    for chunk, embedding in VECTOR_DB:
        # COSINE SIMILARITY: 1 = identical direction, 0 = unrelated, -1 = opposite
        # For text embeddings, scores are rarely negative in practice; the typical
        # range depends on the embedding model
        similarity = cosine_similarity(query_embedding, embedding)
        # Store both chunk content and its similarity score for ranking
        similarities.append((chunk, similarity))
    
    # PHASE 2C: TOP-K SELECTION
    # Sort all chunks by similarity (descending = highest first)
    # Time complexity: O(n log n) where n = chunk count
    # OPTIMIZATION: For n > 100k, heapq.nlargest(k, ...) selects the top k in O(n log k)
    # by keeping a min-heap of the k best scores seen so far
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # PHASE 2D: RETURN TOP N
    # Return the top_n most relevant chunks
    # Edge case: if fewer chunks than top_n exist, return all (no error)
    # Each chunk is returned with its similarity score for quality assessment
    return similarities[:top_n]
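
The heap-based selection mentioned in the comments can be sketched with the standard library's `heapq.nlargest`. The example below uses randomly generated scores as a stand-in for the `(chunk, similarity)` pairs built inside `retrieve`; the data is made up purely to compare the two approaches.

```python
import heapq
import random

random.seed(0)
# Stand-in for the (chunk, similarity) pairs built inside retrieve()
scored = [(f'chunk_{i}', random.random()) for i in range(10_000)]

# Full sort, then slice: O(n log n)
top_sorted = sorted(scored, key=lambda pair: pair[1], reverse=True)[:3]

# Heap-based selection: O(n log k), never sorts the whole list
top_heap = heapq.nlargest(3, scored, key=lambda pair: pair[1])

print(top_heap)
print(top_heap == top_sorted)
```

Both approaches return the same top results. The heap version only pays off when `n` is large and `k` is small, and for a few thousand chunks the difference is unlikely to be noticeable next to the cost of computing the similarities.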


## Generation Phase

In the generation phase, the chatbot generates a response based on the retrieved knowledge. We construct a prompt that includes the relevant chunks and instruct the model to only use that context.

In [None]:
def ask_question(query, top_n=3, verbose=True):
    """Ask a question and get a response based on retrieved knowledge.
    
    This is the GENERATION phase of RAG:
    
    USER QUESTION
         ↓
    [Retrieve relevant chunks]  ← uses semantic search
         ↓
    [Build context prompt]  ← "Use only this knowledge to answer"
         ↓
    [Generate response with LLM]  ← Llama model reads context and answers
         ↓
    ANSWER (grounded in Wikipedia)
    
    TWO-STAGE ARCHITECTURE:
    Stage 1 (Retrieval): Find facts → retrieve() function
    Stage 2 (Generation): Synthesize facts → LLM with context
    
    This separation allows independent optimization:
    - Improve retrieval without changing LLM
    - Improve prompting without reindexing
    - Debug easily by testing each stage separately
    
    Args:
        query: The question to ask (natural language string)
        top_n: Number of relevant chunks to retrieve (trade-off: 3-5 typical)
               More chunks = more context but slower and noisier
        verbose: Whether to print retrieved knowledge and response
    
    Returns:
        The chatbot's response as a string (answer to the query)
    """
    # GENERATION PHASE STEP 1: RETRIEVE RELEVANT KNOWLEDGE
    # Use semantic search to find chunks similar to the query
    # These chunks become the "context window" for the LLM
    retrieved_knowledge = retrieve(query, top_n=top_n)
    
    if verbose:
        # DEBUGGING OUTPUT: Show what was retrieved
        # Helps diagnose retrieval failures (low similarity scores indicate poor matches)
        print('Retrieved knowledge:')
        for i, (chunk, similarity) in enumerate(retrieved_knowledge):
            # Show a single-line snippet (first 200 chars) for human review
            flat = chunk.replace('\n', ' ')
            preview = flat[:200] + '...' if len(flat) > 200 else flat
            # Similarity score: 1.0 = perfect match, 0.0 = unrelated
            # Typically see 0.7-0.95 for relevant chunks
            print(f'  [{i+1}] (similarity: {similarity:.3f}) {preview}')
        print()
    
    # GENERATION PHASE STEP 2: BUILD INSTRUCTION PROMPT
    # This is where we implement "grounded generation"
    # The system message constrains the LLM: "Use ONLY provided context, don't hallucinate"
    # This is the key to RAG: retrieval + constraint = grounded answers
    instruction_prompt = f"""You are a helpful chatbot that answers questions based on Wikipedia articles.
Use only the following pieces of context to answer the question. Don't make up any new information.
If the context doesn't contain enough information to answer the question, say so.

Context:
{chr(10).join([f'{i+1}. {chunk.strip()}' for i, (chunk, _) in enumerate(retrieved_knowledge)])}
"""
    
    # GENERATION PHASE STEP 3: SEND TO LANGUAGE MODEL FOR GENERATION
    # The LLM has:
    # - System prompt: constraints and role ("use only provided context")
    # - Retrieved context: the facts it should use
    # - User query: what they're asking about
    # The model synthesizes these into a coherent answer
    # Using stream=True for faster feedback (token-by-token generation)
    stream = ollama.chat(
        model=LANGUAGE_MODEL,
        messages=[
            {'role': 'system', 'content': instruction_prompt},
            {'role': 'user', 'content': query},
        ],
        stream=True,  # Stream response token-by-token for faster feedback
                      # Alternative: stream=False gets full response at once
    )
    
    # GENERATION PHASE STEP 4: COLLECT AND DISPLAY THE RESPONSE
    # As the LLM generates tokens, collect them into a complete response
    # Streaming shows output in real-time (better UX) vs waiting for full response
    if verbose:
        print('Chatbot response:')
    
    response = ''
    for chunk in stream:
        # Extract the text content from the chunk
        content = chunk['message']['content']
        response += content
        if verbose:
            # Print token-by-token as it arrives (better UX than waiting)
            print(content, end='', flush=True)
    
    if verbose:
        print('\n')  # ensure a newline after the streamed response
    
    return response
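
Because retrieval and generation are separate stages, the prompt-building step can be pulled out and inspected without calling the language model. The sketch below mirrors the prompt text used in `ask_question`. `build_instruction_prompt` is a hypothetical helper, and the sample chunks and scores are invented for illustration.

```python
def build_instruction_prompt(retrieved):
    """Format (chunk, similarity) pairs into a numbered context block."""
    context = '\n'.join(
        f'{i}. {chunk.strip()}' for i, (chunk, _score) in enumerate(retrieved, start=1)
    )
    return (
        'You are a helpful chatbot that answers questions based on Wikipedia articles.\n'
        "Use only the following pieces of context to answer the question. "
        "Don't make up any new information.\n"
        "If the context doesn't contain enough information to answer the question, say so.\n"
        '\n'
        'Context:\n'
        f'{context}\n'
    )


# Invented retrieval results, in the same (chunk, similarity) shape that retrieve() returns
sample_hits = [
    ('Article: Paris\n\nParis is the capital of France.', 0.91),
    ('Article: France\n\nFrance is a country in Europe.  ', 0.84),
]
prompt = build_instruction_prompt(sample_hits)
print(prompt)
```

Printing the prompt like this is a quick way to check what the model actually sees when an answer looks wrong.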

## Try It Out!

Now let's ask some questions. The quality of answers will depend on which articles were included in your dataset sample.

In [None]:
ask_question("What is the capital of France?")

In [None]:
ask_question("Tell me about Albert Einstein")

In [None]:
ask_question("What is Python programming language?")

In [None]:
ask_question("How does photosynthesis work?")

## Interactive Chat

You can also use this cell to ask your own questions:

In [None]:
# Ask your own question here
your_question = "What is the solar system?"
ask_question(your_question)

## Export Embeddings for Vector Databases

Export your embeddings to use with various vector database platforms. Choose the format that matches your target platform:

In [None]:
def export_embeddings(chunks=None, embeddings=None, output_path='embeddings_export.json', format='generic'):
    """
    Export embeddings in one of several formats: generic JSON, pgvector SQL, or Pinecone JSON.
    
    This function reduces vendor lock-in by writing plain JSON or SQL files
    that can be imported into PostgreSQL-compatible vector databases or Pinecone.
    
    Args:
        chunks: List of text chunks (optional, will use VECTOR_DB if not provided)
        embeddings: List of embedding vectors (optional, will use VECTOR_DB if not provided)
        output_path (str): Path to output JSON or SQL file
        format (str): Export format - 'generic' (default), 'pgvector', or 'pinecone'
    
    Returns:
        dict: Export statistics (count, dimension, file_size_mb, format, path)
    
    Supports:
        - PostgreSQL with pgvector (local, Neon, Supabase, RDS)
        - Pinecone vector database
        - Generic JSON for custom integrations
        
    Examples:
        # Export as generic JSON
        stats = export_embeddings(format='generic')
        
        # Export as PostgreSQL INSERT statements
        stats = export_embeddings(format='pgvector')
        
        # Export for Pinecone
        stats = export_embeddings(format='pinecone')
    """
    # Use provided chunks/embeddings or default to VECTOR_DB
    if chunks is None or embeddings is None:
        if not VECTOR_DB:
            raise ValueError('No embeddings available. Generate embeddings first or provide chunks and embeddings.')
        chunks = [chunk for chunk, _ in VECTOR_DB]
        embeddings = [emb for _, emb in VECTOR_DB]
    
    if len(chunks) != len(embeddings):
        raise ValueError('chunks and embeddings must have the same length')
    
    embedding_dimension = len(embeddings[0]) if embeddings else 0
    
    if format == 'generic':
        # Generic JSON format: standard structure without vendor lock-in
        export_data = {
            'metadata': {
                'model': EMBEDDING_MODEL,
                'dimension': embedding_dimension,
                'count': len(embeddings),
                # Timezone-aware UTC timestamp (ISO 8601, ends in +00:00)
                'created_at': __import__('datetime').datetime.now(__import__('datetime').timezone.utc).isoformat(),
                'format_type': 'generic'
            },
            'embeddings': [
                {
                    'id': f'chunk_{i}',
                    'vector': embedding,
                    'metadata': {
                        'text': chunk,
                        'source': chunk.split('\n')[0].replace('Article: ', '') if chunk.startswith('Article: ') else 'unknown'
                    }
                }
                for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
            ]
        }
        
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(export_data, f, ensure_ascii=False, indent=2)
        
        file_size_mb = __import__('os').path.getsize(output_path) / (1024 * 1024)
        
        print(f'✓ Exported {len(embeddings)} embeddings in generic JSON format')
        print(f'  Output: {output_path}')
        print(f'  File size: {file_size_mb:.2f} MB')
        print(f'  Dimension: {embedding_dimension}')
        print(f'\nThis format works with:')
        print(f'  - PostgreSQL with pgvector (via JSON import)')
        print(f'  - Neon PostgreSQL')
        print(f'  - Supabase (PostgreSQL + pgvector)')
        print(f'  - AWS RDS with pgvector')
        print(f'  - Custom vector database integrations')
        
    elif format == 'pgvector':
        # PostgreSQL pgvector format: SQL INSERT statements
        sql_lines = ['-- PostgreSQL pgvector export', '-- Insert into embeddings table', '']
        
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            # Extract title for metadata
            title = chunk.split('\n')[0].replace('Article: ', '') if chunk.startswith('Article: ') else 'unknown'
            
            # Escape single quotes in text and title
            safe_chunk = chunk.replace("'", "''")
            safe_title = title.replace("'", "''")
            
            # Format embedding as PostgreSQL vector
            vector_str = '[' + ','.join(str(v) for v in embedding) + ']'
            
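            # Build INSERT statement. Plain '...' literals are used (not E'...') so that
            # backslashes in the text are stored verbatim; single quotes are doubled above.
            # Assumes PostgreSQL's default setting standard_conforming_strings = on.
            sql = f"INSERT INTO embeddings (chunk_id, chunk_text, embedding, source) VALUES ('{i}', '{safe_chunk}', '{vector_str}'::vector, '{safe_title}');"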
            sql_lines.append(sql)
        
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(sql_lines))
        
        file_size_mb = __import__('os').path.getsize(output_path) / (1024 * 1024)
        
        print(f'✓ Exported {len(embeddings)} embeddings as PostgreSQL INSERT statements')
        print(f'  Output: {output_path}')
        print(f'  File size: {file_size_mb:.2f} MB')
        print(f'\nTo import into PostgreSQL:')
        print(f'  psql -U postgres -d your_database -f {output_path}')
        print(f'\nWorks with:')
        print(f'  - Local PostgreSQL + pgvector')
        print(f'  - Neon PostgreSQL')
        print(f'  - Supabase')
        print(f'  - AWS RDS with pgvector')
        
    elif format == 'pinecone':
        # Pinecone format: vectors with metadata
        export_data = {
            'vectors': [
                {
                    'id': f'chunk_{i}',
                    'values': embedding,
                    'metadata': {
                        'text': chunk,
                        'source': chunk.split('\n')[0].replace('Article: ', '') if chunk.startswith('Article: ') else 'unknown'
                    }
                }
                for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
            ]
        }
        
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(export_data, f, ensure_ascii=False, indent=2)
        
        file_size_mb = __import__('os').path.getsize(output_path) / (1024 * 1024)
        
        print(f'✓ Exported {len(embeddings)} embeddings in Pinecone format')
        print(f'  Output: {output_path}')
        print(f'  File size: {file_size_mb:.2f} MB')
        print(f'\nTo import into Pinecone:')
        print(f'  1. Parse the JSON file')
        print(f'  2. Use Pinecone upsert API to insert vectors')
        
    else:
        raise ValueError(f"Unknown format '{format}'. Use 'generic', 'pgvector', or 'pinecone'")
    
    return {
        'count': len(embeddings),
        'dimension': embedding_dimension,
        'file_size_mb': file_size_mb,
        'format': format,
        'path': output_path
    }


# Example usage:
# Generic export (recommended for portability)
# stats = export_embeddings(format='generic')
# print(f"Exported {stats['count']} embeddings to {stats['path']}")

# PostgreSQL export
# stats = export_embeddings(format='pgvector')

# Pinecone export
# stats = export_embeddings(format='pinecone')
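
A generic export can also be read back into the `(chunk, embedding)` tuple layout that `VECTOR_DB` uses, which avoids re-embedding after a kernel restart. The sketch below assumes the file follows the `'generic'` structure written by `export_embeddings`. `load_generic_export` is a hypothetical helper, and the tiny example file is hand-written so the block runs on its own.

```python
import json
import os
import tempfile


def load_generic_export(path):
    """Read a generic-format export and return (vector_db, metadata)."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    expected_dim = data['metadata']['dimension']
    vector_db = []
    for item in data['embeddings']:
        # Reject vectors whose length disagrees with the file's metadata
        if len(item['vector']) != expected_dim:
            raise ValueError(f"{item['id']}: expected {expected_dim} dimensions")
        vector_db.append((item['metadata']['text'], item['vector']))
    return vector_db, data['metadata']


# Tiny hand-written file in the same layout (values are illustrative only)
example = {
    'metadata': {'model': 'example-model', 'dimension': 3, 'count': 1, 'format_type': 'generic'},
    'embeddings': [
        {'id': 'chunk_0', 'vector': [0.1, 0.2, 0.3],
         'metadata': {'text': 'Article: Paris\n\nParis is the capital of France.', 'source': 'Paris'}},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'embeddings_export.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(example, f)
    restored, meta = load_generic_export(path)

print(len(restored), meta['dimension'])
```

Before reusing restored vectors, it is worth checking that `meta['model']` matches `EMBEDDING_MODEL`, since vectors from different embedding models are not comparable.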

## Next Steps and Improvements

### Upgrade to Persistent Storage with PostgreSQL

**Current limitation**: Embeddings live only in memory, so they are lost when the notebook kernel restarts and every chunk must be re-embedded (50+ minutes for larger datasets).

**Solution**: Upgrade to the advanced version with PostgreSQL + pgvector:

1. **See** `wikipedia-rag-tutorial-advanced.ipynb` for the PostgreSQL version
2. **Benefits**:
   - Generate embeddings once, reuse across experiments
   - Store multiple embedding models for comparison
   - Run analyses in minutes instead of regenerating
   - Easy migration path to production databases

3. **Quick start**:
   ```bash
   # Start PostgreSQL
   docker run -d --name pgvector-rag \
     -e POSTGRES_PASSWORD=postgres \
     -e POSTGRES_DB=rag_db \
     -p 5432:5432 \
     -v pgvector_data:/var/lib/postgresql/data \
     pgvector/pgvector:pg16
   
   # Install PostgreSQL adapter
   pip install psycopg2-binary
   
   # Open the advanced notebook
   jupyter notebook wikipedia-rag-tutorial-advanced.ipynb
   ```

### Other RAG Improvements

1. **Hybrid Search**: Combine vector similarity with keyword search (BM25)
   - Better for specific terminology and exact matches
   - Combine results using reciprocal rank fusion

2. **Reranking**: Use a [reranking model](https://www.pinecone.io/learn/series/rag/rerankers/)
   - Cross-encoder models for better relevance
   - Re-score top 10-20 results from initial retrieval

3. **Query Expansion**: Generate multiple query variations
   - Use LLM to create related questions
   - Retrieve for each and merge results

4. **Better Chunking**:
   - Semantic chunking (split by meaning)
   - Overlapping chunks for better context
   - Parent-child chunks (retrieve child, return parent)

5. **Citation Support**:
   - Track which chunks were used
   - Provide Wikipedia URLs as sources
   - Show confidence scores
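
Reciprocal rank fusion, mentioned under Hybrid Search above, merges several ranked lists using only each item's position in each list, so the scores from different retrievers never need to be put on a common scale. A minimal sketch follows; `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is a commonly used default, and the two hit lists are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids; score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ['chunk_3', 'chunk_1', 'chunk_7']   # hypothetical vector-search order
keyword_hits = ['chunk_1', 'chunk_9', 'chunk_3']  # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, while chunks found by only one retriever are kept but ranked lower.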

### Advanced RAG Patterns

- **Graph RAG**: Build knowledge graphs from Wikipedia links
- **Agentic RAG**: Let the LLM decide when to retrieve more information
- **Multi-hop RAG**: Follow reasoning chains across multiple documents
- **RAG Fusion**: Combine multiple retrieval strategies

### Learn More

- [HuggingFace RAG Guide](https://huggingface.co/blog/ngxson/make-your-own-rag)
- [Pinecone Learning Center](https://www.pinecone.io/learn/)
- Our documentation: See `POSTGRESQL_SETUP.md` for detailed setup instructions

## Dataset Statistics

View statistics about your loaded dataset:

In [None]:
def print_dataset_stats():
    """Print statistics about the current dataset."""
    total_chars = sum(len(chunk) for chunk in dataset)
    avg_chunk_size = total_chars / len(dataset) if dataset else 0
    
    # Count unique articles
    articles = set()
    for chunk in dataset:
        if chunk.startswith('Article: '):
            title = chunk.split('\n')[0].replace('Article: ', '')
            articles.add(title)
    
    print('Dataset Statistics:')
    print(f'  Total chunks: {len(dataset):,}')
    print(f'  Unique articles: {len(articles):,}')
    print(f'  Total characters: {total_chars:,}')
    print(f'  Average chunk size: {avg_chunk_size:.0f} characters')
    print(f'  Estimated size: {sys.getsizeof(str(dataset)) / (1024*1024):.2f} MB')
    print(f'\n  Embeddings in database: {len(VECTOR_DB):,}')
    print(f'  Embedding dimension: {len(VECTOR_DB[0][1]) if VECTOR_DB else 0}')

print_dataset_stats()