# RAG Tutorial with Wikipedia Dataset

This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) system using Simple Wikipedia articles. The dataset is configurable by size, making it easy to experiment with different amounts of data while staying within free tier limits.

## Setup and Installation

Before running this notebook, you need to:

1. Install Ollama from [ollama.com](https://ollama.com/)
2. Download the required models by running these commands in your terminal:

```bash
ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
```

3. Install the required Python packages:

```bash
pip install ollama datasets ipywidgets jupyter
```

## Import Dependencies

In [3]:
import ollama
from datasets import load_dataset
import json
import sys
import math

In [None]:
import psycopg2
from psycopg2.extras import execute_values
import time

## Configuration

Set the target dataset size. The script will download articles until it reaches approximately this size.

## Install Additional Dependencies

If you plan to use PostgreSQL for persistent storage, install the additional dependency:

```bash
pip install psycopg2-binary
```

Or if you're using a virtual environment (recommended):

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install psycopg2-binary
```

In [4]:
# Target dataset size in MB (adjust as needed: 10, 20, 30, 40, 50)
TARGET_SIZE_MB = 10

# Maximum chunk size in characters (for splitting long articles)
MAX_CHUNK_SIZE = 1000

# Whether to save the dataset locally for reuse
SAVE_LOCALLY = True
LOCAL_DATASET_PATH = f'wikipedia_dataset_{TARGET_SIZE_MB}mb.json'

In [None]:
# Storage backend configuration
STORAGE_BACKEND = 'memory'  # Options: 'memory', 'json', 'postgresql'

# PostgreSQL configuration (only used if STORAGE_BACKEND == 'postgresql')
POSTGRES_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'rag_db',
    'user': 'postgres',
    'password': 'postgres',
}

# Table name for this embedding model (allows storing multiple models)
# Table name will be: embeddings_{EMBEDDING_MODEL_ALIAS}
EMBEDDING_MODEL_ALIAS = 'bge_base_en_v1.5'

## Load and Filter the Wikipedia Dataset

We'll use Simple Wikipedia, which has cleaner, more concise articles. The dataset will be filtered to approximately your target size.

In [5]:
def estimate_size_mb(text):
    """Estimate the size of text in megabytes."""
    return sys.getsizeof(text) / (1024 * 1024)

def chunk_text(text, max_size=1000):
    """Split text into chunks of approximately max_size characters.
    
    Tries to break at paragraph boundaries when possible.
    """
    if len(text) <= max_size:
        return [text]
    
    chunks = []
    paragraphs = text.split('\n\n')
    current_chunk = ''
    
    for paragraph in paragraphs:
        # If adding this paragraph would exceed max_size
        if len(current_chunk) + len(paragraph) > max_size:
            if current_chunk:  # Save current chunk if not empty
                chunks.append(current_chunk.strip())
                current_chunk = ''
            
            # If single paragraph is too large, split it
            if len(paragraph) > max_size:
                sentences = paragraph.split('. ')
                for sentence in sentences:
                    if len(current_chunk) + len(sentence) > max_size:
                        if current_chunk:
                            chunks.append(current_chunk.strip())
                        current_chunk = sentence + '. '
                    else:
                        current_chunk += sentence + '. '
            else:
                current_chunk = paragraph
        else:
            current_chunk += '\n\n' + paragraph if current_chunk else paragraph
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

def load_wikipedia_dataset(target_size_mb, local_path=None):
    """Load and filter Wikipedia dataset to target size.
    
    Args:
        target_size_mb: Target dataset size in megabytes
        local_path: Path to save/load dataset locally
    
    Returns:
        List of text chunks
    """
    # Try to load from local cache first
    if local_path:
        try:
            print(f'Attempting to load cached dataset from {local_path}...')
            with open(local_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
                print(f'✓ Loaded {len(data["chunks"])} chunks from cache')
                print(f'  Estimated size: {data["size_mb"]:.2f} MB')
                return data['chunks']
        except FileNotFoundError:
            print('No cached dataset found, downloading from HuggingFace...')
    
    # Load Simple Wikipedia dataset
    print('Loading Simple Wikipedia dataset (this may take a minute)...')
    dataset = load_dataset('wikimedia/wikipedia', '20231101.simple', split='train', streaming=True)
    
    chunks = []
    current_size_mb = 0
    target_bytes = target_size_mb * 1024 * 1024
    article_count = 0
    
    print(f'\nCollecting articles (target: {target_size_mb} MB)...')
    
    for article in dataset:
        # Skip very short articles
        if len(article['text']) < 200:
            continue
        
        # Create metadata-enriched chunks
        article_chunks = chunk_text(article['text'], MAX_CHUNK_SIZE)
        
        for chunk in article_chunks:
            # Add title context to help with retrieval
            enriched_chunk = f"Article: {article['title']}\n\n{chunk}"
            chunk_size = sys.getsizeof(enriched_chunk)
            
            chunks.append(enriched_chunk)
            current_size_mb += chunk_size
            
            # Check if we've reached target size
            if current_size_mb >= target_bytes:
                break
        
        article_count += 1
        
        # Progress update every 50 articles
        if article_count % 50 == 0:
            print(f'  Progress: {current_size_mb / (1024*1024):.2f} MB ({article_count} articles, {len(chunks)} chunks)')
        
        if current_size_mb >= target_bytes:
            break
    
    final_size_mb = current_size_mb / (1024 * 1024)
    print(f'\n✓ Dataset loaded: {len(chunks)} chunks from {article_count} articles')
    print(f'  Estimated size: {final_size_mb:.2f} MB')
    
    # Save locally if requested
    if local_path:
        print(f'\nSaving dataset to {local_path}...')
        with open(local_path, 'w', encoding='utf-8') as f:
            json.dump({
                'size_mb': final_size_mb,
                'chunk_count': len(chunks),
                'article_count': article_count,
                'chunks': chunks
            }, f, ensure_ascii=False)
        print('✓ Dataset saved for future use')
    
    return chunks

# Load the dataset
dataset = load_wikipedia_dataset(
    TARGET_SIZE_MB, 
    LOCAL_DATASET_PATH if SAVE_LOCALLY else None
)

print(f'\nReady to build vector database with {len(dataset)} chunks!')

Attempting to load cached dataset from wikipedia_dataset_10mb.json...
No cached dataset found, downloading from HuggingFace...
Loading Simple Wikipedia dataset (this may take a minute)...

Collecting articles (target: 10 MB)...

Collecting articles (target: 10 MB)...
  Progress: 0.28 MB (50 articles, 323 chunks)
  Progress: 0.52 MB (100 articles, 589 chunks)
  Progress: 0.75 MB (150 articles, 856 chunks)
  Progress: 1.01 MB (200 articles, 1135 chunks)
  Progress: 1.26 MB (250 articles, 1402 chunks)
  Progress: 1.54 MB (300 articles, 1721 chunks)
  Progress: 1.75 MB (350 articles, 1951 chunks)
  Progress: 2.05 MB (400 articles, 2255 chunks)
  Progress: 2.30 MB (450 articles, 2541 chunks)
  Progress: 2.49 MB (500 articles, 2761 chunks)
  Progress: 2.82 MB (550 articles, 3094 chunks)
  Progress: 3.07 MB (600 articles, 3346 chunks)
  Progress: 3.41 MB (650 articles, 3680 chunks)
  Progress: 3.60 MB (700 articles, 3892 chunks)
  Progress: 3.85 MB (750 articles, 4174 chunks)
  Progress: 4.10

In [None]:
# Database helper functions for PostgreSQL storage

class PostgreSQLVectorDB:
    """Helper class to manage embeddings in PostgreSQL with pgvector."""
    
    def __init__(self, config, table_name):
        """Initialize database connection.
        
        Args:
            config: Dictionary with host, port, database, user, password
            table_name: Name of the table for this embedding model
        """
        self.config = config
        self.table_name = table_name
        self.conn = None
        self.connect()
        self.setup_table()
    
    def connect(self):
        """Establish database connection."""
        try:
            self.conn = psycopg2.connect(
                host=self.config['host'],
                port=self.config['port'],
                database=self.config['database'],
                user=self.config['user'],
                password=self.config['password']
            )
            print(f'✓ Connected to PostgreSQL at {self.config["host"]}:{self.config["port"]}')
        except psycopg2.OperationalError as e:
            print(f'✗ Failed to connect to PostgreSQL: {e}')
            print('Make sure PostgreSQL is running with pgvector support.')
            print('Start it with: docker run -d --name pgvector-rag \\')
            print('  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=rag_db \\')
            print('  -p 5432:5432 -v pgvector_data:/var/lib/postgresql/data \\')
            print('  pgvector/pgvector:pg16')
            raise
    
    def setup_table(self):
        """Create table if it doesn't exist."""
        with self.conn.cursor() as cur:
            # Enable pgvector extension
            cur.execute('CREATE EXTENSION IF NOT EXISTS vector')
            
            # Create table with vector column
            cur.execute(f'''
                CREATE TABLE IF NOT EXISTS {self.table_name} (
                    id SERIAL PRIMARY KEY,
                    chunk_text TEXT NOT NULL,
                    embedding vector(768),
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            ''')
            
            # Create index for fast similarity search
            index_name = f'{self.table_name}_embedding_idx'
            cur.execute(f'''
                CREATE INDEX IF NOT EXISTS {index_name}
                ON {self.table_name} USING hnsw (embedding vector_cosine_ops)
            ''')
            
            self.conn.commit()
            print(f'✓ Table "{self.table_name}" ready for embeddings')
    
    def insert_embedding(self, chunk, embedding):
        """Insert a chunk and its embedding into the database.
        
        Args:
            chunk: The text chunk
            embedding: The embedding vector (list of floats)
        """
        with self.conn.cursor() as cur:
            cur.execute(f'''
                INSERT INTO {self.table_name} (chunk_text, embedding)
                VALUES (%s, %s)
            ''', (chunk, embedding))
            self.conn.commit()
    
    def insert_batch(self, chunks_embeddings):
        """Batch insert multiple chunks and embeddings.
        
        Args:
            chunks_embeddings: List of (chunk, embedding) tuples
        """
        with self.conn.cursor() as cur:
            execute_values(cur, f'''
                INSERT INTO {self.table_name} (chunk_text, embedding)
                VALUES %s
            ''', chunks_embeddings, page_size=100)
            self.conn.commit()
    
    def get_chunk_count(self):
        """Get the number of stored chunks."""
        with self.conn.cursor() as cur:
            cur.execute(f'SELECT COUNT(*) FROM {self.table_name}')
            return cur.fetchone()[0]
    
    def similarity_search(self, query_embedding, top_n=3):
        """Find most similar chunks using pgvector.
        
        Args:
            query_embedding: The query embedding vector
            top_n: Number of results to return
        
        Returns:
            List of (chunk_text, similarity_score) tuples
        """
        with self.conn.cursor() as cur:
            cur.execute(f'''
                SELECT chunk_text, 
                       1 - (embedding <=> %s::vector) as similarity
                FROM {self.table_name}
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            ''', (embedding, embedding, top_n))
            
            results = cur.fetchall()
            return [(chunk, score) for chunk, score in results]
    
    def close(self):
        """Close database connection."""
        if self.conn:
            self.conn.close()


def get_storage_backend(backend_type, config=None, table_name=None):
    """Factory function to get the appropriate storage backend.
    
    Args:
        backend_type: 'memory', 'json', or 'postgresql'
        config: PostgreSQL config dict (required if backend_type is 'postgresql')
        table_name: Table name (required if backend_type is 'postgresql')
    
    Returns:
        Storage backend instance
    """
    if backend_type == 'postgresql':
        if not config or not table_name:
            raise ValueError('PostgreSQL backend requires config and table_name')
        return PostgreSQLVectorDB(config, table_name)
    return None

## Sample Data

Let's look at a few examples from our dataset:

In [10]:
print('Sample chunks from the dataset:\n')
for i, chunk in enumerate(dataset[:3]):
    print(f'--- Chunk {i+1} ---')
    print(chunk[:300] + '...' if len(chunk) > 300 else chunk)
    print()

Sample chunks from the dataset:

--- Chunk 1 ---
Article: April

April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days.

April always begins on the same day of the week as July, and additionally, January in leap years. April always ends on t...

--- Chunk 2 ---
Article: April

In common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous ye...

--- Chunk 3 ---
Article: April

April is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.

It is unclear as to where April got its name. A common theory is that it comes from the Latin word "aper...



## Configure Models

We'll use two models:
- **Embedding Model**: Converts text into vector representations
- **Language Model**: Generates responses based on retrieved context

In [7]:
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

## Implement the Vector Database

### Indexing Phase

In the indexing phase, we:
1. Break the dataset into chunks (already done during loading)
2. Calculate embedding vectors for each chunk
3. Store chunks with their embeddings in our vector database

Each element in `VECTOR_DB` will be a tuple: `(chunk, embedding)`

The embedding is a list of floats, for example: `[0.1, 0.04, -0.34, 0.21, ...]`

**Note**: This may take a few minutes depending on your dataset size.

In [None]:
# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
VECTOR_DB = []

# Initialize storage backend if using PostgreSQL
PG_DB = None
if STORAGE_BACKEND == 'postgresql':
    table_name = f'embeddings_{EMBEDDING_MODEL_ALIAS.replace(".", "_")}'
    PG_DB = get_storage_backend('postgresql', POSTGRES_CONFIG, table_name)

def add_chunk_to_database(chunk):
    """Add a chunk and its embedding to the vector database.
    
    Stores in memory and/or PostgreSQL depending on STORAGE_BACKEND.
    """
    embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
    
    if STORAGE_BACKEND == 'memory' or STORAGE_BACKEND == 'json':
        VECTOR_DB.append((chunk, embedding))
    elif STORAGE_BACKEND == 'postgresql':
        PG_DB.insert_embedding(chunk, embedding)

Now let's populate our vector database with all chunks from the dataset:

## Optional: Persistent Storage with PostgreSQL & pgvector

**Note on Performance**: Embedding generation takes significant time (~50 minutes for 10MB of data). Consider using PostgreSQL with pgvector for durable storage so you can reuse embeddings across multiple experiments without regenerating them.

### Why PostgreSQL + pgvector?

- **Reusable Embeddings**: Generate embeddings once, use them across multiple notebooks and experiments
- **Multiple Models**: Store embeddings from different embedding models in separate tables for comparison
- **Durable Storage**: Embeddings survive notebook restarts
- **Scalability**: Move to production vector databases more easily

### Quick Start with Docker

1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop) if you haven't already
2. Run PostgreSQL with pgvector:

```bash
docker run --name pgvector-rag \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=rag_db \
  -p 5432:5432 \
  -v pgvector_data:/var/lib/postgresql/data \
  pgvector/pgvector:pg16
```

This creates a persistent volume (`pgvector_data`) so your data survives container restarts.

### Configuration for Persistent Storage

Set the storage backend in the configuration section below. Choose:
- `'memory'` - In-memory only (fast but lost on notebook restart)
- `'json'` - Local JSON file (persists but slower for large datasets)
- `'postgresql'` - PostgreSQL with pgvector (recommended for experiments)

In [11]:
print(f'Building vector database with {len(dataset)} chunks...')
print('This may take a few minutes...\n')

for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    
    # Progress update every 50 chunks
    if (i + 1) % 50 == 0:
        print(f'Embedded {i+1}/{len(dataset)} chunks ({(i+1)/len(dataset)*100:.1f}%)')

print(f'\n✓ Vector database ready with {len(VECTOR_DB)} embeddings!')

Building vector database with 10402 chunks...
This may take a few minutes...

Embedded 50/10402 chunks (0.5%)
Embedded 50/10402 chunks (0.5%)
Embedded 100/10402 chunks (1.0%)
Embedded 100/10402 chunks (1.0%)
Embedded 150/10402 chunks (1.4%)
Embedded 150/10402 chunks (1.4%)
Embedded 200/10402 chunks (1.9%)
Embedded 200/10402 chunks (1.9%)
Embedded 250/10402 chunks (2.4%)
Embedded 250/10402 chunks (2.4%)
Embedded 300/10402 chunks (2.9%)
Embedded 300/10402 chunks (2.9%)
Embedded 350/10402 chunks (3.4%)
Embedded 350/10402 chunks (3.4%)
Embedded 400/10402 chunks (3.8%)
Embedded 400/10402 chunks (3.8%)
Embedded 450/10402 chunks (4.3%)
Embedded 450/10402 chunks (4.3%)
Embedded 500/10402 chunks (4.8%)
Embedded 500/10402 chunks (4.8%)
Embedded 550/10402 chunks (5.3%)
Embedded 550/10402 chunks (5.3%)
Embedded 600/10402 chunks (5.8%)
Embedded 600/10402 chunks (5.8%)
Embedded 650/10402 chunks (6.2%)
Embedded 650/10402 chunks (6.2%)
Embedded 700/10402 chunks (6.7%)
Embedded 700/10402 chunks (6.7%)


## Implement the Retrieval Function

### Cosine Similarity

To find the most relevant chunks, we need to compare vector similarity. We'll use cosine similarity, which measures how "close" two vectors are in the vector space. Higher cosine similarity means more similar meaning.

In [12]:
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    dot_product = sum([x * y for x, y in zip(a, b)])
    norm_a = sum([x ** 2 for x in a]) ** 0.5
    norm_b = sum([x ** 2 for x in b]) ** 0.5
    return dot_product / (norm_a * norm_b)

### Retrieval Function

The retrieval function:
1. Converts the query into an embedding vector
2. Compares it against all vectors in the database
3. Returns the top N most relevant chunks

In [None]:
def retrieve(query, top_n=3):
    """Retrieve the top N most relevant chunks for a given query.
    
    Uses the configured storage backend (memory, JSON, or PostgreSQL).
    """
    query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    
    if STORAGE_BACKEND == 'postgresql':
        # Use PostgreSQL pgvector for similarity search
        return PG_DB.similarity_search(query_embedding, top_n)
    else:
        # Use in-memory cosine similarity
        # temporary list to store (chunk, similarity) pairs
        similarities = []
        for chunk, embedding in VECTOR_DB:
            similarity = cosine_similarity(query_embedding, embedding)
            similarities.append((chunk, similarity))
        # sort by similarity in descending order, because higher similarity means more relevant chunks
        similarities.sort(key=lambda x: x[1], reverse=True)
        # finally, return the top N most relevant chunks
        return similarities[:top_n]

## Generation Phase

In the generation phase, the chatbot generates a response based on the retrieved knowledge. We construct a prompt that includes the relevant chunks and instruct the model to only use that context.

In [14]:
def ask_question(query, top_n=3, verbose=True):
    """Ask a question and get a response based on retrieved knowledge.
    
    Args:
        query: The question to ask
        top_n: Number of relevant chunks to retrieve
        verbose: Whether to print retrieved knowledge
    
    Returns:
        The chatbot's response as a string
    """
    # Retrieve relevant knowledge
    retrieved_knowledge = retrieve(query, top_n=top_n)
    
    if verbose:
        print('Retrieved knowledge:')
        for i, (chunk, similarity) in enumerate(retrieved_knowledge):
            # Extract title from chunk
            title_line = chunk.split('\n')[0]
            preview = chunk[:200].replace('\n', ' ') + '...' if len(chunk) > 200 else chunk
            print(f'  [{i+1}] (similarity: {similarity:.3f}) {preview}')
        print()
    
    # Construct the instruction prompt with retrieved context
    instruction_prompt = f'''You are a helpful chatbot that answers questions based on Wikipedia articles.
Use only the following pieces of context to answer the question. Don't make up any new information.
If the context doesn't contain enough information to answer the question, say so.

Context:
{chr(10).join([f'{i+1}. {chunk.strip()}' for i, (chunk, _) in enumerate(retrieved_knowledge)])}
'''
    
    # Generate response
    stream = ollama.chat(
        model=LANGUAGE_MODEL,
        messages=[
            {'role': 'system', 'content': instruction_prompt},
            {'role': 'user', 'content': query},
        ],
        stream=True,
    )
    
    # Collect and print the response
    if verbose:
        print('Chatbot response:')
    
    response = ''
    for chunk in stream:
        content = chunk['message']['content']
        response += content
        if verbose:
            print(content, end='', flush=True)
    
    if verbose:
        print('\n')  # ensure a newline after the streamed response
    
    return response

## Try It Out!

Now let's ask some questions. The quality of answers will depend on which articles were included in your dataset sample.

In [15]:
ask_question("What is the capital of France?")

Retrieved knowledge:
  [1] (similarity: 0.686) Article: Paris  Paris (nicknamed the "City of light") is the capital city of France, and the largest city in France. The area is , and around 2.15 million people live there. If suburbs are counted, th...
  [2] (similarity: 0.644) Article: France  France ( or ; ), officially the French Republic (, ), is a country in Western Europe. It also includes various departments and territories of France overseas.   Mainland France extend...
  [3] (similarity: 0.617) Article: France  France was one of the first members of the European Union, and has the largest land area of all members. It is also a founding member of the United Nations, and a member of the Franco...

Chatbot response:
The article does not directly state that Paris is the capital of France. It mentions "the capital city of France" but does not specify which one. However, it also states that the area around Paris is the largest in France and has a population of 10.7 million people, sug

'The article does not directly state that Paris is the capital of France. It mentions "the capital city of France" but does not specify which one. However, it also states that the area around Paris is the largest in France and has a population of 10.7 million people, suggesting that Paris may be considered the capital due to its size and influence.'

In [16]:
ask_question("Tell me about Albert Einstein")

Retrieved knowledge:
  [1] (similarity: 0.777) Article: Albert Einstein  Albert Einstein (14 March 1879 – 18 April 1955) was a German-born American scientist. He worked on theoretical physics. He developed the theory of relativity. He received the...
  [2] (similarity: 0.764) Article: Albert Einstein  He is now thought to be one of the greatest scientists of all time.  His contributions helped lay the foundations for all modern branches of physics, including quantum mechan...
  [3] (similarity: 0.760) Article: Albert Einstein  Later life  In spring of 1914, he moved back to Germany, and became ordinary member of the Prussian Academy and director of a newly established institute for physics of the K...

Chatbot response:
According to Wikipedia, Albert Einstein (1879-1955) was a renowned German-born physicist who is best known for his theory of relativity. Here are some key facts about his life:

**Early Life**

Einstein was born in Munich, Germany on March 14, 1879. He was the youngest 

'According to Wikipedia, Albert Einstein (1879-1955) was a renowned German-born physicist who is best known for his theory of relativity. Here are some key facts about his life:\n\n**Early Life**\n\nEinstein was born in Munich, Germany on March 14, 1879. He was the youngest of three children to Hermann and Pauline Einstein.\n\n**Education and Career**\n\nEinstein\'s early education was at a Catholic elementary school, followed by a private Jewish boarding school. In 1894, he enrolled at the Swiss Federal Polytechnic University in Zurich, where he studied physics and mathematics.\n\nIn 1900, Einstein moved to Berlin, Germany to pursue his graduate studies at the University of Berlin. He earned his Ph.D. in 1905 while working as a patent clerk.\n\n**Theory of Relativity**\n\nEinstein\'s most famous contribution is his theory of relativity, which revolutionized our understanding of space and time. The theory consists of two main components:\n\n1. **Special Relativity**: Einstein proposed 

In [17]:
ask_question("What is Python programming language?")

Retrieved knowledge:
  [1] (similarity: 0.766) Article: Programming language  A programming language is a type of written language that tells computers what to do. Examples are: Python, Ruby, Java, JavaScript, C, C++, and C#. Programming languages...
  [2] (similarity: 0.714) Article: Programming language  Usually, the programming language uses real words for some of the commands (e.g. "if... then... else...", "and", "or"), so that the language is easier for a human to und...
  [3] (similarity: 0.656) Article: Visual Basic  Visual Basic (VB) is a programming language developed by Microsoft for their operating system Windows. The BASIC language is said to be easier to read than other languages.   Vi...

Chatbot response:
Python is a popular programming language that tells computers what to do. It's like a set of commands that tell the computer how to do things. Python usually uses real words for some of the commands, making it easier for humans to understand.

The article doesn't mentio

"Python is a popular programming language that tells computers what to do. It's like a set of commands that tell the computer how to do things. Python usually uses real words for some of the commands, making it easier for humans to understand.\n\nThe article doesn't mention anything about the syntax or features of the Python programming language itself, but I can tell you that:\n\n* Python is often used for general-purpose programming and data analysis.\n* It's written using simple English-like words and syntax, which makes it easy for humans to read and write.\n* The language has a vast range of applications, including web development, scientific computing, artificial intelligence, and more."

In [18]:
ask_question("How does photosynthesis work?")

Retrieved knowledge:
  [1] (similarity: 0.822) Article: Photosynthesis  Photosynthesis is how plants and some microorganisms make  carbohydrates. It is an endothermic (takes in heat) chemical process which uses sunlight to turn carbon dioxide into...
  [2] (similarity: 0.784) Article: Photosynthesis  6 CO2(g) + 6 H2O + photons → C6H12O6(aq) + 6 O2(g) carbon dioxide + water + light energy → glucose + oxygen Carbon dioxide enters the leaf through the stomata by diffusion fro...
  [3] (similarity: 0.775) Article: Photosynthesis  Glucose is used in respiration (to release energy in cells). It is stored in the form of starch (which is converted back to glucose for respiration in the dark). Glucose can a...

Chatbot response:
Photosynthesis works by using light energy from the sun to convert carbon dioxide (CO2) and water (H2O) into glucose (C6H12O6) and oxygen (O2). The process can be broken down into two main phases: light-dependent reactions and light-independent reactions.

**Light-Depen

'Photosynthesis works by using light energy from the sun to convert carbon dioxide (CO2) and water (H2O) into glucose (C6H12O6) and oxygen (O2). The process can be broken down into two main phases: light-dependent reactions and light-independent reactions.\n\n**Light-Dependent Reactions**\n\n1. Light energy from the sun is absorbed by pigments such as chlorophyll in the thylakoid membranes of chloroplasts.\n2. This energy excites electrons, which are then passed through a series of electron transport chains.\n3. The electrons ultimately reduce oxygen (O2) and water (H2O) to form a molecule called hydroquinone.\n\n**Light-Independent Reactions**\n\n1. Hydroquinone is converted into flavin mononucleotide (FMN) by the enzyme photosystem I.\n2. This process is known as electron transport, where energy from light is passed along a series of protein complexes and electron carriers.\n3. The electrons are then used to produce ATP (adenosine triphosphate), which is a molecule that provides ener

## Interactive Chat

You can also use this cell to ask your own questions:

In [21]:
# Ask your own question here
your_question = "What is the solar system?"
ask_question(your_question)

Retrieved knowledge:
  [1] (similarity: 0.777) Article: Solar System  The Solar System is the Sun and all the objects that orbit around it. The Sun is orbited by planets, asteroids, comets and other things.   The Solar System is about 4.568 billio...
  [2] (similarity: 0.743) Article: Solar System  The Solar System also contains other things. There are asteroid belts, mostly between Mars and Jupiter. Further out than Neptune, there is the Kuiper belt and the scattered disc...
  [3] (similarity: 0.741) Article: Solar System  There are eight planets in the Solar System. From closest to farthest from the Sun, they are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus and Neptune. The first four pl...

Chatbot response:
According to Article 2: The Solar System, the Solar System refers to:

"The entire system consisting of all objects that orbit around the Sun. It also includes the asteroid belt, the Kuiper belt and the scattered disc, as well as dwarf planets, moons, asteroids, comets,

'According to Article 2: The Solar System, the Solar System refers to:\n\n"The entire system consisting of all objects that orbit around the Sun. It also includes the asteroid belt, the Kuiper belt and the scattered disc, as well as dwarf planets, moons, asteroids, comets, centaurs, and interplanetary dust."\n\n(Note: This definition is very concise and doesn\'t provide much detail about what makes up the Solar System.)'

## Export Dataset for Other Platforms

You can export the dataset for use with Neon (Vercel) or Cloudflare D1 with Vectorize:

In [None]:
def export_for_vectorize(output_path='wikipedia_export.json'):
    """Export dataset in a format ready for Cloudflare Vectorize or Neon.
    
    The output format includes:
    - id: unique identifier
    - text: the chunk content
    - embedding: the vector (optional, can be generated on the platform)
    """
    export_data = []
    
    for i, (chunk, embedding) in enumerate(VECTOR_DB):
        # Extract title from chunk
        lines = chunk.split('\n')
        title = lines[0].replace('Article: ', '') if lines[0].startswith('Article: ') else 'Unknown'
        
        export_data.append({
            'id': f'chunk_{i}',
            'text': chunk,
            'title': title,
            'embedding': embedding  # Include if you want pre-computed embeddings
        })
    
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(export_data, f, ensure_ascii=False, indent=2)
    
    print(f'✓ Exported {len(export_data)} chunks to {output_path}')
    print(f'  File size: {sys.getsizeof(json.dumps(export_data)) / (1024*1024):.2f} MB')
    print('\nYou can now use this file with:')
    print('  - Neon (PostgreSQL with pgvector)')
    print('  - Cloudflare D1 with Vectorize')
    print('  - Any other vector database')

# Uncomment to export:
# export_for_vectorize('wikipedia_vectorize_export.json')

✓ Exported 10402 chunks to wikipedia_vectorize_export.json
  File size: 110.47 MB

You can now use this file with:
  - Neon (PostgreSQL with pgvector)
  - Cloudflare D1 with Vectorize
  - Any other vector database


## Load Embeddings from PostgreSQL

If you've previously generated embeddings and stored them in PostgreSQL, you can load them without regenerating:

**Use this in a new notebook to:**
- Run experiments with existing embeddings (avoiding 50+ minute regeneration)
- Compare different embedding models stored in different tables
- Analyze embedding quality without reprocessing


In [None]:
def load_embeddings_from_postgres(config, embedding_model_alias):
    """Load previously generated embeddings from PostgreSQL.
    
    Useful for running new experiments without regenerating embeddings.
    
    Args:
        config: PostgreSQL connection config
        embedding_model_alias: Alias used when the embeddings were generated
    
    Returns:
        PostgreSQLVectorDB instance ready for retrieval
    """
    table_name = f'embeddings_{embedding_model_alias.replace(".", "_")}'
    
    try:
        db = PostgreSQLVectorDB(config, table_name)
        count = db.get_chunk_count()
        print(f'✓ Loaded {count} embeddings from table "{table_name}"')
        return db
    except psycopg2.ProgrammingError:
        print(f'✗ Table "{table_name}" not found in database')
        print('Run the main notebook first to generate and store embeddings.')
        raise
    except Exception as e:
        print(f'✗ Error loading embeddings: {e}')
        raise


# Example: Uncomment to load existing embeddings in a new notebook
# loaded_db = load_embeddings_from_postgres(POSTGRES_CONFIG, 'bge_base_en_v1.5')
# Then use: loaded_db.similarity_search(query_embedding, top_n=3)

## Next Steps and Improvements

### Migrate to Production Vector Databases

**Neon (with Vercel):**
```sql
-- Create table with pgvector
CREATE TABLE wikipedia_chunks (
  id SERIAL PRIMARY KEY,
  title TEXT,
  text TEXT,
  embedding vector(768)  -- dimension depends on your model
);

-- Create index for fast similarity search
CREATE INDEX ON wikipedia_chunks 
USING ivfflat (embedding vector_cosine_ops);
```

**Cloudflare D1 with Vectorize:**
```javascript
// Use Vectorize for embeddings, D1 for metadata
await env.VECTORIZE.insert([
  {
    id: 'chunk_1',
    values: embedding,
    metadata: { title: 'Article Title', text: 'chunk text' }
  }
]);
```

### Other Improvements

1. **Hybrid Search**: Combine vector similarity with keyword search (BM25) for better retrieval

2. **Reranking**: Use a [reranking model](https://www.pinecone.io/learn/series/rag/rerankers/) to re-score retrieved chunks

3. **Query Expansion**: Generate multiple variations of the user's question for better coverage

4. **Metadata Filtering**: Filter by article categories, dates, or other metadata before similarity search

5. **Better Chunking**: Implement semantic chunking that preserves context better

6. **Citation Support**: Track which chunks were used and provide Wikipedia URLs as sources

### Advanced RAG Architectures

- **Graph RAG**: Build knowledge graphs from Wikipedia's link structure
- **Hybrid RAG**: Combine vectors, graphs, and keyword search
- **Agentic RAG**: Let the LLM decide when to retrieve more information

### Performance Optimization

- **Batch Embeddings**: Embed multiple chunks at once for faster indexing
- **Approximate Search**: Use FAISS, Annoy, or HNSW for faster similarity search
- **Caching**: Cache frequent queries and their results

Learn more about RAG patterns in the [HuggingFace RAG guide](https://huggingface.co/blog/ngxson/make-your-own-rag).

## Dataset Statistics

View statistics about your loaded dataset:

In [22]:
def print_dataset_stats():
    """Print statistics about the current dataset."""
    total_chars = sum(len(chunk) for chunk in dataset)
    avg_chunk_size = total_chars / len(dataset) if dataset else 0
    
    # Count unique articles
    articles = set()
    for chunk in dataset:
        if chunk.startswith('Article: '):
            title = chunk.split('\n')[0].replace('Article: ', '')
            articles.add(title)
    
    print('Dataset Statistics:')
    print(f'  Total chunks: {len(dataset):,}')
    print(f'  Unique articles: {len(articles):,}')
    print(f'  Total characters: {total_chars:,}')
    print(f'  Average chunk size: {avg_chunk_size:.0f} characters')
    print(f'  Estimated size: {sys.getsizeof(str(dataset)) / (1024*1024):.2f} MB')
    print(f'\n  Embeddings in database: {len(VECTOR_DB):,}')
    print(f'  Embedding dimension: {len(VECTOR_DB[0][1]) if VECTOR_DB else 0}')

print_dataset_stats()

Dataset Statistics:
  Total chunks: 10,402
  Unique articles: 1,993
  Total characters: 7,968,175
  Average chunk size: 766 characters
  Estimated size: 30.98 MB

  Embeddings in database: 10,402
  Embedding dimension: 768
