# RAG Tutorial with Wikipedia Dataset

This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) system using Simple Wikipedia articles. The dataset is configurable by size, making it easy to experiment with different amounts of data while staying within free tier limits.

## Setup and Installation

Before running this notebook, you need to:

1. Install Ollama from [ollama.com](https://ollama.com/)
2. Download the required models by running these commands in your terminal:

```bash
ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
```

3. Install the required Python packages:

```bash
pip install ollama datasets ipywidgets jupyter
```

## Import Dependencies

In [None]:
import ollama
from datasets import load_dataset
import json
import sys
import math

In [None]:
import psycopg2
from psycopg2.extras import execute_values
import time

## Install Additional Dependencies

If you plan to use PostgreSQL for persistent storage, install the PostgreSQL driver as well:

```bash
pip install psycopg2-binary
```

Or, if you're using a virtual environment (recommended):

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install psycopg2-binary
```

## Configuration

Set the target dataset size. The loader streams articles until the collected chunks reach approximately this size.
In [None]:
# Target dataset size in MB (adjust as needed: 10, 20, 30, 40, 50)
TARGET_SIZE_MB = 10

# Maximum chunk size in characters (for splitting long articles)
MAX_CHUNK_SIZE = 1000

# Whether to save the dataset locally for reuse
SAVE_LOCALLY = True
LOCAL_DATASET_PATH = f'wikipedia_dataset_{TARGET_SIZE_MB}mb.json'

In [None]:
# Storage backend configuration
STORAGE_BACKEND = 'postgresql'  # Options: 'memory', 'json', 'postgresql'

# PostgreSQL configuration (only used if STORAGE_BACKEND == 'postgresql')
POSTGRES_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'rag_db',
    'user': 'postgres',
    'password': 'postgres',
}

# Table name for this embedding model (allows storing multiple models)
# Table name will be: embeddings_{EMBEDDING_MODEL_ALIAS}, with '.' replaced by '_'
EMBEDDING_MODEL_ALIAS = 'bge_base_en_v1.5'
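
Because the table name is later interpolated directly into SQL strings, it can help to check that the derived name is a plain identifier. The helper below is a minimal sketch and not part of the original notebook: `table_name_for` is a hypothetical name, and the 63-character check reflects PostgreSQL's default identifier length limit.

```python
import re

# Hypothetical helper (not in the original notebook): derive and validate
# the table name used for a given embedding model alias.
_IDENTIFIER = re.compile(r'[a-z_][a-z0-9_]*')

def table_name_for(alias: str) -> str:
    """Return 'embeddings_<alias>' with dots replaced by underscores."""
    name = f'embeddings_{alias.replace(".", "_")}'
    # Reject anything that is not a simple lowercase identifier, or that
    # exceeds PostgreSQL's default 63-character identifier limit.
    if not _IDENTIFIER.fullmatch(name) or len(name) > 63:
        raise ValueError(f'Unsafe table name: {name!r}')
    return name

print(table_name_for('bge_base_en_v1.5'))
```

An alias containing spaces, quotes, or semicolons raises `ValueError` instead of reaching the database.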

## Load and Filter the Wikipedia Dataset

We'll use Simple English Wikipedia, whose articles are shorter and written in simpler language than those on English Wikipedia. The loader collects chunks until it reaches approximately your target size.
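
The target size is easiest to reason about as UTF-8 bytes, since that is roughly what ends up on disk or in a database. The sketch below contrasts that with `sys.getsizeof`, which reports the in-memory size of the Python string object. The helper name `utf8_size_bytes` and the sample text are illustrations, not part of the notebook.

```python
import sys

def utf8_size_bytes(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode('utf-8'))

sample = 'Article: Paris\n\nParis is the capital of France.'
print('UTF-8 bytes: ', utf8_size_bytes(sample))
# getsizeof() measures the Python object, so it also counts interpreter overhead.
print('getsizeof(): ', sys.getsizeof(sample))
```

For many short chunks the per-object overhead adds up, so a size budget based on `sys.getsizeof` fills up with less text than one based on encoded bytes.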

In [None]:
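def estimate_size_mb(text):
    """Estimate the size of text in megabytes (as UTF-8 encoded bytes)."""
    # sys.getsizeof() would measure the Python object, including interpreter overhead
    return len(text.encode('utf-8')) / (1024 * 1024)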

def chunk_text(text, max_size=1000):
    """Split text into chunks of approximately max_size characters.
    
    Tries to break at paragraph boundaries when possible.
    """
    if len(text) <= max_size:
        return [text]
    
    chunks = []
    paragraphs = text.split('\n\n')
    current_chunk = ''
    
    for paragraph in paragraphs:
        # If adding this paragraph would exceed max_size
        if len(current_chunk) + len(paragraph) > max_size:
            if current_chunk:  # Save current chunk if not empty
                chunks.append(current_chunk.strip())
                current_chunk = ''
            
            # If single paragraph is too large, split it
            if len(paragraph) > max_size:
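                sentences = paragraph.split('. ')
                for idx, sentence in enumerate(sentences):
                    # split() removes the '. ' separator; restore it after every sentence except the last
                    piece = sentence if idx == len(sentences) - 1 else sentence + '. '
                    if len(current_chunk) + len(piece) > max_size:
                        if current_chunk:
                            chunks.append(current_chunk.strip())
                        current_chunk = piece
                    else:
                        current_chunk += piece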
            else:
                current_chunk = paragraph
        else:
            current_chunk += '\n\n' + paragraph if current_chunk else paragraph
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

def load_wikipedia_dataset(target_size_mb, local_path=None):
    """Load and filter Wikipedia dataset to target size.
    
    Args:
        target_size_mb: Target dataset size in megabytes
        local_path: Path to save/load dataset locally
    
    Returns:
        List of text chunks
    """
    # Try to load from local cache first
    if local_path:
        try:
            print(f'Attempting to load cached dataset from {local_path}...')
            with open(local_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
                print(f'✓ Loaded {len(data["chunks"])} chunks from cache')
                print(f'  Estimated size: {data["size_mb"]:.2f} MB')
                return data['chunks']
        except FileNotFoundError:
            print('No cached dataset found, downloading from HuggingFace...')
    
    # Load Simple Wikipedia dataset
    print('Loading Simple Wikipedia dataset (this may take a minute)...')
    dataset = load_dataset('wikimedia/wikipedia', '20231101.simple', split='train', streaming=True)
    
    chunks = []
    current_size_mb = 0
    target_bytes = target_size_mb * 1024 * 1024
    article_count = 0
    
    print(f'\nCollecting articles (target: {target_size_mb} MB)...')
    
    for article in dataset:
        # Skip very short articles
        if len(article['text']) < 200:
            continue
        
        # Create metadata-enriched chunks
        article_chunks = chunk_text(article['text'], MAX_CHUNK_SIZE)
        
        for chunk in article_chunks:
            # Add title context to help with retrieval
            enriched_chunk = f"Article: {article['title']}\n\n{chunk}"
            chunk_size = len(enriched_chunk.encode('utf-8'))  # UTF-8 bytes, not Python object size
            
            chunks.append(enriched_chunk)
            current_size_mb += chunk_size  # running total is in bytes, despite the variable name
            
            # Check if we've reached target size
            if current_size_mb >= target_bytes:
                break
        
        article_count += 1
        
        # Progress update every 50 articles
        if article_count % 50 == 0:
            print(f'  Progress: {current_size_mb / (1024*1024):.2f} MB ({article_count} articles, {len(chunks)} chunks)')
        
        if current_size_mb >= target_bytes:
            break
    
    final_size_mb = current_size_mb / (1024 * 1024)
    print(f'\n✓ Dataset loaded: {len(chunks)} chunks from {article_count} articles')
    print(f'  Estimated size: {final_size_mb:.2f} MB')
    
    # Save locally if requested
    if local_path:
        print(f'\nSaving dataset to {local_path}...')
        with open(local_path, 'w', encoding='utf-8') as f:
            json.dump({
                'size_mb': final_size_mb,
                'chunk_count': len(chunks),
                'article_count': article_count,
                'chunks': chunks
            }, f, ensure_ascii=False)
        print('✓ Dataset saved for future use')
    
    return chunks

# Load the dataset
dataset = load_wikipedia_dataset(
    TARGET_SIZE_MB, 
    LOCAL_DATASET_PATH if SAVE_LOCALLY else None
)

print(f'\nReady to build vector database with {len(dataset)} chunks!')

In [None]:
# Database helper functions for PostgreSQL storage

class PostgreSQLVectorDB:
    """Helper class to manage embeddings in PostgreSQL with pgvector."""
    
    def __init__(self, config, table_name):
        """Initialize database connection.
        
        Args:
            config: Dictionary with host, port, database, user, password
            table_name: Name of the table for this embedding model
        """
        self.config = config
        self.table_name = table_name
        self.conn = None
        self.connect()
        self.setup_table()
    
    def connect(self):
        """Establish database connection."""
        try:
            self.conn = psycopg2.connect(
                host=self.config['host'],
                port=self.config['port'],
                database=self.config['database'],
                user=self.config['user'],
                password=self.config['password']
            )
            print(f'✓ Connected to PostgreSQL at {self.config["host"]}:{self.config["port"]}')
        except psycopg2.OperationalError as e:
            print(f'✗ Failed to connect to PostgreSQL: {e}')
            print('Make sure PostgreSQL is running with pgvector support.')
            print('Start it with: docker run -d --name pgvector-rag \\')
            print('  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=rag_db \\')
            print('  -p 5432:5432 -v pgvector_data:/var/lib/postgresql/data \\')
            print('  pgvector/pgvector:pg16')
            raise
    
    def setup_table(self):
        """Drop and recreate the table for a fresh experimental run.

        WARNING: this deletes any embeddings already stored in the table.
        """
        with self.conn.cursor() as cur:
            # Enable pgvector extension
            cur.execute('CREATE EXTENSION IF NOT EXISTS vector')
            
            # Drop existing table and its index
            cur.execute(f'DROP TABLE IF EXISTS {self.table_name} CASCADE')
            
            # Create table with vector column
            cur.execute(f'''
                CREATE TABLE {self.table_name} (
                    id SERIAL PRIMARY KEY,
                    chunk_text TEXT NOT NULL,
                    embedding vector(768),
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            ''')
            
            # Create index for fast similarity search
            index_name = f'{self.table_name}_embedding_idx'
            cur.execute(f'''
                CREATE INDEX {index_name}
                ON {self.table_name} USING hnsw (embedding vector_cosine_ops)
            ''')
            
            self.conn.commit()
            print(f'✓ Table "{self.table_name}" created (fresh start)')
    
    def insert_embedding(self, chunk, embedding):
        """Insert a chunk and its embedding into the database.
        
        Args:
            chunk: The text chunk
            embedding: The embedding vector (list of floats)
        """
        with self.conn.cursor() as cur:
            cur.execute(f'''
                INSERT INTO {self.table_name} (chunk_text, embedding)
                VALUES (%s, %s)
            ''', (chunk, embedding))
            self.conn.commit()
    
    def insert_batch(self, chunks_embeddings):
        """Batch insert multiple chunks and embeddings.
        
        Args:
            chunks_embeddings: List of (chunk, embedding) tuples
        """
        with self.conn.cursor() as cur:
            execute_values(cur, f'''
                INSERT INTO {self.table_name} (chunk_text, embedding)
                VALUES %s
            ''', chunks_embeddings, page_size=100)
            self.conn.commit()
    
    def get_chunk_count(self):
        """Get the number of stored chunks."""
        with self.conn.cursor() as cur:
            cur.execute(f'SELECT COUNT(*) FROM {self.table_name}')
            return cur.fetchone()[0]
    
    def similarity_search(self, query_embedding, top_n=3):
        """Find most similar chunks using pgvector.
        
        Args:
            query_embedding: The query embedding vector
            top_n: Number of results to return
        
        Returns:
            List of (chunk_text, similarity_score) tuples
        """
        with self.conn.cursor() as cur:
            cur.execute(f'''
                SELECT chunk_text, 
                       1 - (embedding <=> %s::vector) as similarity
                FROM {self.table_name}
                ORDER BY embedding <=> %s::vector
                LIMIT %s
            ''', (query_embedding, query_embedding, top_n))
            
            results = cur.fetchall()
            return [(chunk, score) for chunk, score in results]
    
    def close(self):
        """Close database connection."""
        if self.conn:
            self.conn.close()


def get_storage_backend(backend_type, config=None, table_name=None):
    """Factory function to get the appropriate storage backend.
    
    Args:
        backend_type: 'memory', 'json', or 'postgresql'
        config: PostgreSQL config dict (required if backend_type is 'postgresql')
        table_name: Table name (required if backend_type is 'postgresql')
    
    Returns:
        Storage backend instance
    """
    if backend_type == 'postgresql':
        if not config or not table_name:
            raise ValueError('PostgreSQL backend requires config and table_name')
        return PostgreSQLVectorDB(config, table_name)
    return None

## Sample Data

Let's look at a few examples from our dataset:

In [None]:
print('Sample chunks from the dataset:\n')
for i, chunk in enumerate(dataset[:3]):
    print(f'--- Chunk {i+1} ---')
    print(chunk[:300] + '...' if len(chunk) > 300 else chunk)
    print()

## Configure Models

We'll use two models:
- **Embedding Model**: Converts text into vector representations
- **Language Model**: Generates responses based on retrieved context

In [None]:
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

## Implement the Vector Database

### Indexing Phase

In the indexing phase, we:
1. Break the dataset into chunks (already done during loading)
2. Calculate embedding vectors for each chunk
3. Store chunks with their embeddings in our vector database

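With the `'memory'` or `'json'` backend, each element in `VECTOR_DB` is a tuple `(chunk, embedding)`. With the `'postgresql'` backend, each pair is written to the database table instead and `VECTOR_DB` stays empty.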

The embedding is a list of floats, for example: `[0.1, 0.04, -0.34, 0.21, ...]`

**Note**: Depending on your dataset size and hardware, this can take anywhere from a few minutes to about an hour (roughly 50 minutes for 10 MB of data).

In [None]:
# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
VECTOR_DB = []

# Initialize storage backend if using PostgreSQL
PG_DB = None
if STORAGE_BACKEND == 'postgresql':
    table_name = f'embeddings_{EMBEDDING_MODEL_ALIAS.replace(".", "_")}'
    PG_DB = get_storage_backend('postgresql', POSTGRES_CONFIG, table_name)

def add_chunk_to_database(chunk):
    """Add a chunk and its embedding to the vector database.
    
    Stores in memory and/or PostgreSQL depending on STORAGE_BACKEND.
    """
    embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
    
    if STORAGE_BACKEND == 'memory' or STORAGE_BACKEND == 'json':
        VECTOR_DB.append((chunk, embedding))
    elif STORAGE_BACKEND == 'postgresql':
        PG_DB.insert_embedding(chunk, embedding)

## Optional: Persistent Storage with PostgreSQL & pgvector

**Note on Performance**: Embedding generation is slow (roughly 50 minutes for 10 MB of data). Consider using PostgreSQL with pgvector for durable storage so you can reuse embeddings across multiple experiments without regenerating them.

### Why PostgreSQL + pgvector?

- **Reusable Embeddings**: Generate embeddings once, use them across multiple notebooks and experiments
- **Multiple Models**: Store embeddings from different embedding models in separate tables for comparison
- **Durable Storage**: Embeddings survive notebook restarts
- **Scalability**: Move to production vector databases more easily

### Quick Start with Docker

1. Install [Docker Desktop](https://www.docker.com/products/docker-desktop) if you haven't already
2. Run PostgreSQL with pgvector in the background:

```bash
docker run -d --name pgvector-rag \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=rag_db \
  -p 5432:5432 \
  -v pgvector_data:/var/lib/postgresql/data \
  pgvector/pgvector:pg16
```

The named volume (`pgvector_data`) keeps your data across container restarts.

### Configuration for Persistent Storage

Set `STORAGE_BACKEND` in the configuration section above. Choose:
- `'memory'` - In-memory only (fast but lost on notebook restart)
- `'json'` - Intended for a local JSON file; in the code shown here it is held in memory like `'memory'`, and nothing is written to disk unless you export it yourself
- `'postgresql'` - PostgreSQL with pgvector (recommended for experiments)

Now let's populate our vector database with all chunks from the dataset:

In [None]:
print(f'Building vector database with {len(dataset)} chunks...')
print('This may take a few minutes...\n')

for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    
    # Progress update every 50 chunks
    if (i + 1) % 50 == 0:
        print(f'Embedded {i+1}/{len(dataset)} chunks ({(i+1)/len(dataset)*100:.1f}%)')

# Get the correct count based on storage backend
if STORAGE_BACKEND == 'postgresql':
    embedding_count = PG_DB.get_chunk_count()
else:
    embedding_count = len(VECTOR_DB)

print(f'\n✓ Vector database ready with {embedding_count} embeddings!')

## Implement the Retrieval Function

### Cosine Similarity

To find the most relevant chunks, we need to compare vector similarity. We'll use cosine similarity, which measures how "close" two vectors are in the vector space. Higher cosine similarity means more similar meaning.
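
For reference, the function in the next cell computes the standard definition of cosine similarity for two vectors $\mathbf{a}$ and $\mathbf{b}$ of equal length:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

As a small worked example, take $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (1, 1)$. The dot product is $1$, the norms are $1$ and $\sqrt{2}$, so the similarity is $1 / \sqrt{2} \approx 0.707$. Identical directions give $1$, and perpendicular vectors give $0$.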

In [None]:
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    dot_product = sum([x * y for x, y in zip(a, b)])
    norm_a = sum([x ** 2 for x in a]) ** 0.5
    norm_b = sum([x ** 2 for x in b]) ** 0.5
    return dot_product / (norm_a * norm_b)

### Retrieval Function

The retrieval function:
1. Converts the query into an embedding vector
2. Compares it against all vectors in the database
3. Returns the top N most relevant chunks
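
The three steps above can be sketched on toy data without running an embedding model. The 2-D vectors, the chunk strings, and the names `toy_db` and `top_n` are made up for illustration; real embeddings from the model have many more dimensions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 2-D "embeddings" standing in for real model output.
toy_db = [
    ('chunk about Paris', [1.0, 0.0]),
    ('chunk about Einstein', [0.0, 1.0]),
    ('chunk about France', [0.9, 0.1]),
]

def top_n(query_vec, db, n=2):
    """Return the n (chunk, similarity) pairs with the highest similarity."""
    scored = ((chunk, cosine(query_vec, vec)) for chunk, vec in db)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

for chunk, score in top_n([1.0, 0.0], toy_db):
    print(f'{score:.3f}  {chunk}')
```

`heapq.nlargest` avoids sorting the whole list when only a few results are needed, which is a small design choice rather than a requirement; a full sort gives the same ranking.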

In [None]:
def retrieve(query, top_n=3):
    """Retrieve the top N most relevant chunks for a given query.
    
    Uses the configured storage backend (memory, JSON, or PostgreSQL).
    """
    query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    
    if STORAGE_BACKEND == 'postgresql':
        # Use PostgreSQL pgvector for similarity search
        return PG_DB.similarity_search(query_embedding, top_n)
    else:
        # Use in-memory cosine similarity
        # temporary list to store (chunk, similarity) pairs
        similarities = []
        for chunk, embedding in VECTOR_DB:
            similarity = cosine_similarity(query_embedding, embedding)
            similarities.append((chunk, similarity))
        # sort by similarity in descending order, because higher similarity means more relevant chunks
        similarities.sort(key=lambda x: x[1], reverse=True)
        # finally, return the top N most relevant chunks
        return similarities[:top_n]

## Generation Phase

In the generation phase, the chatbot generates a response based on the retrieved knowledge. We construct a prompt that includes the relevant chunks and instruct the model to only use that context.
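
The prompt construction can be looked at on its own, separately from the model call. The sketch below uses made-up retrieval results in the same `(chunk, similarity)` shape that `retrieve()` returns; `build_context` is a hypothetical helper name, and the prompt text is shortened.

```python
def build_context(retrieved):
    """Format (chunk, similarity) pairs as a numbered context block."""
    return '\n'.join(
        f'{i}. {chunk.strip()}'
        for i, (chunk, _score) in enumerate(retrieved, start=1)
    )

# Made-up retrieval results for illustration.
example = [
    ('Article: Paris\n\nParis is the capital of France.', 0.91),
    ('Article: France\n\nFrance is a country in Europe. ', 0.84),
]

system_prompt = (
    'Use only the following pieces of context to answer the question.\n\n'
    'Context:\n' + build_context(example)
)
print(system_prompt)
```

Numbering the chunks makes it possible to ask the model to refer to a specific piece of context in its answer.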

In [None]:
def ask_question(query, top_n=3, verbose=True):
    """Ask a question and get a response based on retrieved knowledge.
    
    Args:
        query: The question to ask
        top_n: Number of relevant chunks to retrieve
        verbose: Whether to print retrieved knowledge
    
    Returns:
        The chatbot's response as a string
    """
    # Retrieve relevant knowledge
    retrieved_knowledge = retrieve(query, top_n=top_n)
    
    if verbose:
        print('Retrieved knowledge:')
        for i, (chunk, similarity) in enumerate(retrieved_knowledge):
            # Build a short single-line preview of the chunk
            preview = chunk[:200].replace('\n', ' ') + '...' if len(chunk) > 200 else chunk
            print(f'  [{i+1}] (similarity: {similarity:.3f}) {preview}')
        print()
    
    # Construct the instruction prompt with retrieved context
    instruction_prompt = f'''You are a helpful chatbot that answers questions based on Wikipedia articles.
Use only the following pieces of context to answer the question. Don't make up any new information.
If the context doesn't contain enough information to answer the question, say so.

Context:
{chr(10).join([f'{i+1}. {chunk.strip()}' for i, (chunk, _) in enumerate(retrieved_knowledge)])}
'''
    
    # Generate response
    stream = ollama.chat(
        model=LANGUAGE_MODEL,
        messages=[
            {'role': 'system', 'content': instruction_prompt},
            {'role': 'user', 'content': query},
        ],
        stream=True,
    )
    
    # Collect and print the response
    if verbose:
        print('Chatbot response:')
    
    response = ''
    for chunk in stream:
        content = chunk['message']['content']
        response += content
        if verbose:
            print(content, end='', flush=True)
    
    if verbose:
        print('\n')  # ensure a newline after the streamed response
    
    return response

## Try It Out!

Now let's ask some questions. The quality of answers will depend on which articles were included in your dataset sample.

In [None]:
ask_question("What is the capital of France?")

In [None]:
ask_question("Tell me about Albert Einstein")

In [None]:
ask_question("What is Python programming language?")

In [None]:
ask_question("How does photosynthesis work?")

## Interactive Chat

You can also use this cell to ask your own questions:

In [None]:
# Ask your own question here
your_question = "What is the solar system?"
ask_question(your_question)

## Export Dataset for Other Platforms

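You can export the dataset for use with Neon (Vercel) or Cloudflare D1 with Vectorize. The export reads from the in-memory `VECTOR_DB`, so it produces output only when `STORAGE_BACKEND` is `'memory'` or `'json'`: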

In [None]:
def export_for_vectorize(output_path='wikipedia_export.json'):
    """Export dataset in a format ready for Cloudflare Vectorize or Neon.
    
    The output format includes:
    - id: unique identifier
    - text: the chunk content
    - embedding: the vector (optional, can be generated on the platform)
    """
    export_data = []
    
    for i, (chunk, embedding) in enumerate(VECTOR_DB):
        # Extract title from chunk
        lines = chunk.split('\n')
        title = lines[0].replace('Article: ', '') if lines[0].startswith('Article: ') else 'Unknown'
        
        export_data.append({
            'id': f'chunk_{i}',
            'text': chunk,
            'title': title,
            'embedding': embedding  # Include if you want pre-computed embeddings
        })
    
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(export_data, f, ensure_ascii=False, indent=2)
    
    print(f'✓ Exported {len(export_data)} chunks to {output_path}')
    import os  # local import: os is not imported at the top of the notebook
    print(f'  File size: {os.path.getsize(output_path) / (1024*1024):.2f} MB')
    print('\nYou can now use this file with:')
    print('  - Neon (PostgreSQL with pgvector)')
    print('  - Cloudflare D1 with Vectorize')
    print('  - Any other vector database')

# Uncomment to export:
# export_for_vectorize('wikipedia_vectorize_export.json')

## Load Embeddings from PostgreSQL

If you've previously generated embeddings and stored them in PostgreSQL, you can load them without regenerating:

**Use this in a new notebook to:**
- Run experiments with existing embeddings (avoiding 50+ minute regeneration)
- Compare different embedding models stored in different tables
- Analyze embedding quality without reprocessing


In [None]:
def load_embeddings_from_postgres(config, embedding_model_alias):
    """Load previously generated embeddings from PostgreSQL.
    
    Useful for running new experiments without regenerating embeddings.
    
    Args:
        config: PostgreSQL connection config
        embedding_model_alias: Alias used when the embeddings were generated
    
    Returns:
        PostgreSQLVectorDB instance ready for retrieval
    """
    table_name = f'embeddings_{embedding_model_alias.replace(".", "_")}'
    
    try:
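        # Build the instance without calling __init__: the constructor runs
        # setup_table(), which drops the table and would erase the stored embeddings.
        db = PostgreSQLVectorDB.__new__(PostgreSQLVectorDB)
        db.config = config
        db.table_name = table_name
        db.conn = None
        db.connect()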
        count = db.get_chunk_count()
        print(f'✓ Loaded {count} embeddings from table "{table_name}"')
        return db
    except psycopg2.ProgrammingError:
        print(f'✗ Table "{table_name}" not found in database')
        print('Run the main notebook first to generate and store embeddings.')
        raise
    except Exception as e:
        print(f'✗ Error loading embeddings: {e}')
        raise


# Example: Uncomment to load existing embeddings in a new notebook
# loaded_db = load_embeddings_from_postgres(POSTGRES_CONFIG, 'bge_base_en_v1.5')
# Then use: loaded_db.similarity_search(query_embedding, top_n=3)

## Next Steps and Improvements

### Migrate to Production Vector Databases

**Neon (with Vercel):**
```sql
-- Create table with pgvector
CREATE TABLE wikipedia_chunks (
  id SERIAL PRIMARY KEY,
  title TEXT,
  text TEXT,
  embedding vector(768)  -- dimension depends on your model
);

-- Create index for fast similarity search
-- (an IVFFlat index is best built after the table already contains data)
CREATE INDEX ON wikipedia_chunks 
USING ivfflat (embedding vector_cosine_ops);
```

**Cloudflare D1 with Vectorize:**
```javascript
// Use Vectorize for embeddings, D1 for metadata
await env.VECTORIZE.insert([
  {
    id: 'chunk_1',
    values: embedding,
    metadata: { title: 'Article Title', text: 'chunk text' }
  }
]);
```

### Other Improvements

1. **Hybrid Search**: Combine vector similarity with keyword search (BM25) for better retrieval

2. **Reranking**: Use a [reranking model](https://www.pinecone.io/learn/series/rag/rerankers/) to re-score retrieved chunks

3. **Query Expansion**: Generate multiple variations of the user's question for better coverage

4. **Metadata Filtering**: Filter by article categories, dates, or other metadata before similarity search

5. **Better Chunking**: Implement semantic chunking that preserves context better

6. **Citation Support**: Track which chunks were used and provide Wikipedia URLs as sources
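
As one example, citation support can start from the `Article: <title>` header that the loader adds to every chunk. The sketch below is an assumption-laden illustration: `source_url` is a hypothetical helper, and it assumes the usual Simple English Wikipedia URL pattern with underscores in place of spaces.

```python
from typing import Optional
from urllib.parse import quote

def source_url(chunk: str) -> Optional[str]:
    """Build a Simple English Wikipedia URL from a chunk's 'Article: <title>' header."""
    first_line = chunk.split('\n', 1)[0]
    prefix = 'Article: '
    if not first_line.startswith(prefix):
        return None  # chunk has no title header
    title = first_line[len(prefix):].strip()
    # Page URLs use underscores in place of spaces; quote() escapes the rest.
    return 'https://simple.wikipedia.org/wiki/' + quote(title.replace(' ', '_'))

print(source_url('Article: Albert Einstein\n\nAlbert Einstein was a physicist.'))
```

The returned URL could be printed next to each retrieved chunk so that an answer can be traced back to its source article.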

### Advanced RAG Architectures

- **Graph RAG**: Build knowledge graphs from Wikipedia's link structure
- **Hybrid RAG**: Combine vectors, graphs, and keyword search
- **Agentic RAG**: Let the LLM decide when to retrieve more information
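
One simple way to combine results from several retrievers, as in hybrid approaches, is reciprocal rank fusion (RRF). This technique is not used anywhere in this notebook; the sketch below shows it with made-up chunk IDs, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ['chunk_3', 'chunk_1', 'chunk_7']   # made-up vector-search order
keyword_hits = ['chunk_1', 'chunk_9', 'chunk_3']  # made-up keyword-search order
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Because RRF uses only ranks, it does not need the similarity scores of the different retrievers to be on the same scale.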

### Performance Optimization

- **Batch Embeddings**: Embed multiple chunks at once for faster indexing
- **Approximate Search**: Use FAISS, Annoy, or HNSW for faster similarity search
- **Caching**: Cache frequent queries and their results
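
Batching, for instance, can be sketched independently of any particular embedding client. In the example below, `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names, and the embedder is a stand-in that returns one tiny vector per text. With Ollama, `embed_fn` could plausibly wrap `ollama.embed(model=..., input=batch)['embeddings']`, assuming the installed client accepts a list for `input`; check that before relying on it.

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunks, embed_fn, batch_size=32):
    """Embed chunks batch by batch and return (chunk, vector) pairs.

    `embed_fn` takes a list of strings and returns one vector per string.
    """
    pairs = []
    for batch in batched(chunks, batch_size):
        vectors = embed_fn(batch)
        pairs.extend(zip(batch, vectors))
    return pairs

# Stand-in embedder for illustration only: maps each text to [length].
def fake_embed(texts):
    return [[float(len(t))] for t in texts]

print(embed_in_batches(['a', 'bb', 'ccc'], fake_embed, batch_size=2))
```

The resulting pairs have the same `(chunk, embedding)` shape as the entries of `VECTOR_DB`, so they could also be passed to `insert_batch` on the PostgreSQL helper.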

Learn more about RAG patterns in the [HuggingFace RAG guide](https://huggingface.co/blog/ngxson/make-your-own-rag).

## Dataset Statistics

View statistics about your loaded dataset:

In [None]:
def print_dataset_stats():
    """Print statistics about the current dataset."""
    total_chars = sum(len(chunk) for chunk in dataset)
    avg_chunk_size = total_chars / len(dataset) if dataset else 0
    
    # Count unique articles
    articles = set()
    for chunk in dataset:
        if chunk.startswith('Article: '):
            title = chunk.split('\n')[0].replace('Article: ', '')
            articles.add(title)
    
    print('Dataset Statistics:')
    print(f'  Total chunks: {len(dataset):,}')
    print(f'  Unique articles: {len(articles):,}')
    print(f'  Total characters: {total_chars:,}')
    print(f'  Average chunk size: {avg_chunk_size:.0f} characters')
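    print(f'  Estimated size: {sum(len(c.encode("utf-8")) for c in dataset) / (1024*1024):.2f} MB')
    # With the PostgreSQL backend the embeddings live in the database, not in VECTOR_DB
    stored = PG_DB.get_chunk_count() if STORAGE_BACKEND == 'postgresql' else len(VECTOR_DB)
    print(f'\n  Embeddings in database: {stored:,}')
    print(f'  Embedding dimension: {len(VECTOR_DB[0][1]) if VECTOR_DB else "n/a (not held in memory)"}')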

print_dataset_stats()