# From Documents to Embeddings: A Complete Tutorial

This notebook demonstrates how to convert text documents into vector embeddings for use in Retrieval-Augmented Generation (RAG) systems. We'll explore two different embedding techniques and compare their characteristics.

## What are Word Embeddings?

**Word embeddings are numerical representations of words or text chunks.** They transform text into vectors (lists of numbers) that capture the meaning, context, and relationships between words.

### Key Concepts:
- Embeddings turn text into **vectors** that can be mathematically compared
- Similar pieces of text have similar vector representations
- These vectors enable **semantic search and retrieval** — critical components in RAG systems

### In Simple Words:
**Embeddings let machines "understand" and "measure" how similar two pieces of text are — even if they use different words.**
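
To make "measuring similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors and the names `doctor`, `physician`, and `banana` are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", hand-written for this example only.
doctor = [0.9, 0.1, 0.2]
physician = [0.8, 0.2, 0.25]
banana = [0.1, 0.9, 0.1]

print(cosine_similarity(doctor, physician))  # close to 1: similar direction
print(cosine_similarity(doctor, banana))     # much lower: different direction
```

A score near 1 means the vectors point in almost the same direction, which is how "similar meaning" shows up numerically.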

---

## Common Word Embedding Techniques

There are several ways to generate embeddings. We'll explore two widely-used methods:

### a) Word2Vec
- A classic method that trains a shallow neural network
- Learns word relationships from the contexts in which words co-occur
- Supports arithmetic on meanings, such as "king - man + woman ≈ queen"
- **Custom dimension size** (e.g., 100 dimensions)
- Best for basic word-level relationships
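
The analogy arithmetic can be sketched with hand-made vectors. Everything below is an assumption for illustration: the two axes ("royalty" and "gender"), the word list, and the helper `nearest_word` are invented, and a real Word2Vec model learns its dimensions from data rather than having them assigned. On a trained gensim model, `model.wv.most_similar(positive=[...], negative=[...])` performs a comparable lookup.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 = "royalty", axis 1 = "gender". Illustrative only.
toy_vectors = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.9, 1.0]),
    "apple":  np.array([-1.0, 0.0]),
}

def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue  # skip the words used to build the query
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# king - man + woman lands on the "royal, female" corner of this toy space.
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
analogy_result = nearest_word(target, toy_vectors, exclude={"king", "man", "woman"})
print(analogy_result)  # queen
```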

### b) OpenAI Embeddings via LangChain
- Uses pre-trained models hosted by OpenAI
- The model used in this notebook: **text-embedding-ada-002**
  - Fast to call, with good quality on general-purpose text
  - Produces a **1536-dimensional vector** for each chunk
  - Well suited to search, retrieval, and similarity tasks

### How It Works:
1. You send a chunk of text to the OpenAI API
2. The model returns a fixed-length embedding (vector) that captures the **semantic meaning** of the text

**OpenAI embeddings accessed through LangChain are widely used in production RAG pipelines**, especially when quality and ease of use matter.

---

## Embedding Dimensions Comparison

| Method | Dimension Size | Notes |
|--------|---------------|-------|
| Word2Vec | Custom (e.g., 100) | Based on training setup; basic context |
| OpenAI (text-embedding-ada-002) | 1536 | Very high quality; best for production use |

**All methods turn text into vectors**, but the quality and context-awareness improve as you move from Word2Vec → OpenAI.

---

## Problem Statement

We've successfully extracted and chunked content from a PDF document:

**"Digital Transformation of the Healthcare Value Chain: Emergence of Medical Internet of Things (MIoT) may need an Integrated Clinical Environment, ICE Platform."**

### Our Goals:
1. **Generate vector representations (embeddings)** for each chunk of text
2. **Try out two different embedding methods**: Word2Vec and OpenAI Embedding Model
3. **Compare the differences** and prepare the data for vector-based retrieval in a RAG system

These embeddings will later help us **match a user's question to the most relevant part of the document**.

---

## Setup: Import Required Libraries

We'll need several libraries for this tutorial:
- **numpy**: For numerical operations on vectors
- **os**: For environment variable management
- **openai**: For interacting with OpenAI's API
- **dotenv**: For loading environment variables from .env files
- **nltk**: Natural Language Toolkit for text processing
- **gensim**: For Word2Vec model training
- **langchain**: For document loading, text splitting, and embeddings
- **tenacity**: For retry logic when calling APIs

In [3]:
import numpy as np
import os
import openai
from dotenv import load_dotenv
import nltk
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)
import warnings
warnings.filterwarnings("ignore")

---

## Download NLTK Data

We need NLTK's punkt tokenizer to split text into sentences. This data package helps NLTK understand sentence boundaries in different languages.

In [11]:
# Some NLTK versions raise errors about 'punkt_tab', so check for it as well.
for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
        print(f"✅ NLTK {resource} tokenizer already available")
    except LookupError:
        print(f"📥 Downloading NLTK {resource} tokenizer...")
        nltk.download(resource)
        print(f"✅ NLTK {resource} tokenizer downloaded successfully")

✅ NLTK punkt tokenizer already available
📥 Downloading NLTK punkt_tab tokenizer...
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\gknerr\AppData\Roaming\nltk_data...
✅ NLTK punkt_tab tokenizer downloaded successfully
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


---

## Authentication: OpenAI Setup

The following code sets up authentication to connect securely to OpenAI. This process:

1. Retrieves the OpenAI API key from an OS environment variable (`OPENAI_API_KEY`)
2. Initializes the OpenAI client for embedding generation

This authentication enables us to:
- Call the embedding model securely
- Get vector embeddings for input text
- Use the embeddings for semantic search and similarity matching

In [12]:
# Load variables from a local .env file (if one exists), then read the API key
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set.")

# Set OpenAI API key for the openai library
openai.api_key = openai_api_key

# Set up LangChain OpenAIEmbeddings
embeddings_client = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)

---

## Define Function: Get Embeddings from OpenAI

This function returns vector embeddings for the provided text chunks using OpenAI's embedding model.

### How it works:
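1. Takes a list of text chunks as input
2. Calls the OpenAI embeddings API through LangChain's `embed_documents`, wrapped in retry logic (using tenacity)
3. Returns a list of embedding vectors, one list of floats per input chunk

### Retry Mechanism:
The `@retry` decorator ensures that if the API call raises an exception (for example, because of rate limits or network issues), it is retried automatically. `wait_random_exponential(min=45, max=120)` applies randomized exponential backoff with the wait capped at 120 seconds, and `stop_after_attempt(6)` gives up after six attempts.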

In [13]:
# This function returns vector embeddings for the provided text chunks.

@retry(wait=wait_random_exponential(min=45, max=120), stop=stop_after_attempt(6))
def get_embeddings(texts_chunk):
    return embeddings_client.embed_documents(texts_chunk)
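
To show what randomized exponential backoff does without calling any API, here is a standard-library sketch. It is not tenacity's implementation: the helper `call_with_backoff`, its parameters, and the simulated `flaky` function are all hypothetical, and the `sleep` argument is injected so the example runs instantly.

```python
import random
import time

def call_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(), retrying on any exception with randomized exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with each failed attempt, capped at max_wait.
            ceiling = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, ceiling))

# Simulated API call that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"

waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))  # ok 2
```

The random jitter keeps many clients that failed at the same moment from all retrying at the same moment.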

---

## Load and Extract PDF Content

We'll use **LangChain's PyMuPDFLoader** to extract text from our healthcare PDF document.

### What this function does:
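1. Uses LangChain's built-in PDF loader
2. Loads the PDF into LangChain's document format
3. Returns a list of `Document` objects (one per page by default), each carrying the page text and metadata, so the "document chunks" counted in the printed message are pages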

This is the first step in our pipeline: getting raw text from the PDF before we can create embeddings.

In [15]:
# Define a function to load and extract text from PDF
def load_pdf_with_langchain(pdf_path):
    
    # Use LangChain's built-in loader
    loader = PyMuPDFLoader(pdf_path)
    
    # Load the PDF into LangChain's document format
    documents = loader.load()
    
    print(f"Successfully loaded {len(documents)} document chunks from the PDF.")
    return documents

### Load the Healthcare Document

Now we'll load our specific PDF about healthcare transformation and medical IoT. Update the path if your PDF is located elsewhere.

In [16]:
# Path to the uploaded PDF (replace with your actual file path)
pdf_path = "./data/41598_2020_Article_64454.pdf"

# Extract the document chunks
docs = load_pdf_with_langchain(pdf_path)

Successfully loaded 13 document chunks from the PDF.


---

## Approach 1: Sentence-Based Chunking for Word2Vec

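Word2Vec learns from the words that surround each token, so sentence boundaries make natural units for its training text. This chunking strategy splits each page into sentences and groups them into small chunks.

### Why sentence-based chunking?
- **Word2Vec learns from context**: Words that appear in similar contexts end up with similar vectors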
- **Sentences provide natural context boundaries**: Each sentence is a complete thought
- **Better training data**: The model learns more accurate word relationships

### How the function works:
1. Extract page content from each document
2. Use NLTK's `sent_tokenize` to split text into sentences
3. Group sentences together (e.g., 3 sentences per chunk)
4. Return a list of text chunks ready for Word2Vec training

This approach is particularly useful for preparing text inputs for Word2Vec embeddings.

In [17]:
def sentence_based_chunking(docs, sentences_per_chunk=3):
    chunks = []
    for doc in docs:
        sentences = sent_tokenize(doc.page_content)
        for i in range(0, len(sentences), sentences_per_chunk):
            chunk = " ".join(sentences[i:i + sentences_per_chunk])
            chunks.append(chunk)
    return chunks

# Generate sentence chunks
text_chunks = sentence_based_chunking(docs)
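
The grouping step inside `sentence_based_chunking` can be tried on its own, without NLTK or a PDF. The sketch below uses made-up sentences and a hypothetical helper, `group_sentences`, that mirrors only the slicing logic. Note that the notebook function runs this per page, so a chunk never spans two pages.

```python
def group_sentences(sentences, sentences_per_chunk=3):
    """Join consecutive sentences into chunks of at most sentences_per_chunk."""
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

# Seven placeholder sentences standing in for sent_tokenize output.
toy_sentences = [f"Sentence {n}." for n in range(1, 8)]
toy_chunks = group_sentences(toy_sentences)

for chunk in toy_chunks:
    print(chunk)
# Sentence 1. Sentence 2. Sentence 3.
# Sentence 4. Sentence 5. Sentence 6.
# Sentence 7.
```

The last chunk is shorter whenever the sentence count is not a multiple of `sentences_per_chunk`.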

---

## Generate Word2Vec Embeddings

Now we'll train a Word2Vec model on our text chunks and generate embeddings.

### Understanding the function:

1. **Tokenize each chunk into words**: Split text into individual words
   - Example: "The patient needs care" → ["The", "patient", "needs", "care"]

2. **Train a Word2Vec model** with these parameters:
   - `sentences=tokenized`: Training data (list of word lists)
   - `vector_size=100`: Each word becomes a 100-dimensional vector
   - `window=5`: Look at 5 words before and after each word for context
   - `min_count=1`: Include words that appear at least once
   - `workers=3`: Use 3 worker threads for parallel training

3. **Generate embeddings for each chunk**:
   - Get the vector for each word in the chunk from the trained model
   - Average all word vectors to create a single chunk vector
   - Skip any word that isn't in the model's vocabulary; if a chunk has no known words at all, use a zero vector

### Key Insight:
**Word2Vec creates chunk embeddings by averaging the individual word vectors.** This gives us a fixed-size representation (100 dimensions) for each chunk, regardless of chunk length.

In [18]:
# Train Word2Vec on the given chunks and return the averaged word vector for each chunk.
def word2vec_embedding(chunks):
    
    # Tokenize each chunk into words
    tokenized = [chunk.split() for chunk in chunks]
    
    # Train a Word2Vec model
    model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=3)
    
    embeddings = []
    for words in tokenized:
        vectors = [model.wv[word] for word in words if word in model.wv]
        # Take average vector for each chunk
        chunk_vector = np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)
        embeddings.append(chunk_vector)
    
    return embeddings

# Run Word2Vec embeddings
w2v_embeddings = word2vec_embedding(text_chunks)
print(f"Generated {len(w2v_embeddings)} Word2Vec chunk embeddings.")
print(f"Generated Word2Vec first chunk embedding dimension {w2v_embeddings[0].shape}.")

Generated 240 Word2Vec chunk embeddings.
Generated Word2Vec first chunk embedding dimension (100,).


### Inspect a Word2Vec Embedding

Let's look at the first 10 values of the first chunk's embedding vector to see what Word2Vec produces.

In [19]:
w2v_embeddings[0][0:10]

array([-0.00612958,  0.00719691,  0.00240686, -0.00054739, -0.0003562 ,
       -0.01261691,  0.00319221,  0.01749715, -0.00551196, -0.00324367],
      dtype=float32)

---

## Observation: Word2Vec Embeddings

Let's analyze what we just accomplished:

### What happened:
- The `word2vec_embedding()` function **split each of our 240 chunks into word tokens** and **trained a Word2Vec model** on them
- It generated a **fixed-size vector (100 dimensions)** for each word, then **averaged them** to represent the entire chunk
- This method works well for capturing **word-level relationships** like similarity and analogies

### Limitations:
- ⚠️ **Does not consider sentence structure or word order**: because averaging ignores order, "dog bites man" and "man bites dog" get the same embedding
- ⚠️ **May miss deeper meaning or context**: each word has a single vector, so the model cannot tell apart different senses of the same word
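
The word-order limitation is easy to verify. This sketch assigns random stand-in vectors to three words (they are not trained Word2Vec vectors) and uses a hypothetical helper, `average_embedding`, that averages them the same way `word2vec_embedding()` does.

```python
import numpy as np

# Random stand-in vectors; a trained model would supply these instead.
rng = np.random.default_rng(seed=0)
toy_word_vectors = {word: rng.normal(size=4) for word in ("dog", "bites", "man")}

def average_embedding(text, word_vectors):
    """Average the vectors of the known words in text."""
    vectors = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vectors, axis=0)

emb_a = average_embedding("dog bites man", toy_word_vectors)
emb_b = average_embedding("man bites dog", toy_word_vectors)

print(emb_a.shape)                # (4,): fixed size regardless of text length
print(np.allclose(emb_a, emb_b))  # True: the order of words does not matter
```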

---

## Good for:
✅ **Quick, lightweight embedding generation**  
✅ **Projects focused on word-level understanding** (e.g., finding synonyms, related terms)

## Limitations:
❌ **Not ideal for understanding full sentences or complex context**  
❌ **Accuracy depends heavily on the size and quality of training data**

For production RAG systems where **semantic understanding matters**, we typically use more advanced models like OpenAI's embeddings.

---

## Approach 2: OpenAI Embeddings via LangChain

Now let's use OpenAI's pre-trained embedding model, which we call through its API. This approach provides:
- **Higher quality embeddings** trained on massive datasets
- **Better semantic understanding** of context and meaning
- **1536-dimensional vectors** that capture rich information

### Step 1: Load and Chunk Using RecursiveCharacterTextSplitter

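Unlike the sentence-based chunking we used for Word2Vec, we'll use **RecursiveCharacterTextSplitter** here. By default it splits on natural separators (paragraphs, then lines, then words) before falling back to individual characters.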

### Why RecursiveCharacterTextSplitter?
- **Preserves context better**: Tries to keep related information together
- **Configurable overlap**: Repeats a little text across neighboring chunks, so a passage cut at a boundary is less likely to lose its context
- **Works well with semantic models**: Coherent chunks give embedding models like OpenAI's more meaningful input

### Function parameters:
- `chunk_size=500`: Each chunk will be at most about 500 characters
- `chunk_overlap=50`: Neighboring chunks share up to 50 characters to preserve context
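
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch. It is deliberately not the RecursiveCharacterTextSplitter algorithm, which prefers to break on separators; the helper `sliding_window_chunks` is hypothetical and cuts at fixed character positions, so the overlap is easy to spot.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows where neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

alphabet_chunks = sliding_window_chunks(
    "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3
)
print(alphabet_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role the 50-character overlap plays in the notebook's settings.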

### How it works:
1. Load the PDF using PyMuPDFLoader
2. Create a RecursiveCharacterTextSplitter with specified parameters
3. Split documents into chunks with overlap
4. Extract just the text content (page_content) from each chunk
5. Return a list of text strings ready for embedding

In [21]:
# Step 1: Load and chunk using RecursiveCharacterTextSplitter
def get_recursive_chunks(pdf_path, chunk_size=500, chunk_overlap=50):
    loader = PyMuPDFLoader(pdf_path)
    raw_docs = loader.load()
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(raw_docs)
    
    # Extract just the text part for embedding
    return [chunk.page_content for chunk in chunks]

# Example: Running it all together
pdf_path = "./data/41598_2020_Article_64454.pdf"  # Update this if needed
text_chunks = get_recursive_chunks(pdf_path)
openai_embeddings = get_embeddings(text_chunks)

print(f"Generated {len(openai_embeddings)} OpenAI embeddings.")
print(f"First chunk embedding size: {len(openai_embeddings[0])}")
print(f"First chunk embedding : {openai_embeddings[0][0:10]}")

Generated 135 OpenAI embeddings.
First chunk embedding size: 1536
First chunk embedding : [-0.01231426652520895, -0.017526544630527496, 0.012570381164550781, -0.02590218558907509, -0.014536234550178051, 0.031730521470308304, -0.004132443573325872, -0.017277352511882782, -0.00802953913807869, -0.01603139005601406]


---

## Observation: OpenAI Embeddings (text-embedding-ada-002)

Let's analyze what makes OpenAI embeddings superior for production use:

### What we did:
- We used the `get_embeddings()` function to generate vector embeddings using **LangChain's OpenAIEmbeddings** wrapper
- The function sent our chunks to OpenAI's **text-embedding-ada-002** model, which returned one **1536-dimensional vector** per chunk (135 in total)
- These embeddings are **context-rich** and designed for tasks like:
  - Semantic search
  - Question answering
  - Document retrieval

---

## What this means:

### 🎯 Deep Semantic Understanding
OpenAI embeddings represent the **meaning of entire text chunks**, not just word proximity or surface-level patterns. The vectors reflect:
- **Intent**: What the text is trying to convey
- **Topic**: The subject matter and domain
- **Semantic structure**: How concepts relate to each other

### 🚀 Optimized for Retrieval
These embeddings are specifically optimized for **retrieval use cases**, making them an excellent choice for building robust RAG pipelines.

### 📊 High-Dimensional Representation
The **1536-dimensional vector** holds rich information about the intent, topic, and semantic structure of the input text. This density of information enables:
- More accurate similarity matching
- Better handling of nuanced queries
- Improved retrieval of relevant context

---

## Comparison: Word2Vec vs OpenAI Embeddings

| Aspect | Word2Vec | OpenAI (text-embedding-ada-002) |
|--------|----------|----------------------------------|
| **Dimension Size** | Custom (e.g., 100) | Fixed (1536) |
| **Training** | Trained on your data | Pre-trained on massive datasets |
| **Context Understanding** | Word proximity only | Deep semantic meaning |
| **Best For** | Word-level relationships | Semantic search, RAG, QA |
| **Production Ready** | Requires careful tuning | Ready to use out-of-the-box |
| **Quality** | Depends on training data | Consistently high quality |

### 💡 Key Takeaway:
For production RAG systems, **OpenAI embeddings are the preferred choice** due to their superior semantic understanding, consistency, and optimization for retrieval tasks.

---

## Summary and Next Steps

In this tutorial, we've covered:

### ✅ What We Learned:
1. **What embeddings are** and why they're critical for RAG systems
2. **Two embedding approaches**:
   - **Word2Vec**: Fast, lightweight, word-level understanding
   - **OpenAI (text-embedding-ada-002)**: Deep semantic understanding, production-ready
3. **Different chunking strategies**:
   - Sentence-based chunking for Word2Vec
   - RecursiveCharacterTextSplitter for OpenAI embeddings
4. **How to generate and compare embeddings** from both methods

### 🎯 Key Insights:
- **Word2Vec** is good for quick prototypes and word-level analysis
- **OpenAI embeddings** are superior for production RAG systems
- **Chunking strategy matters**: Match your chunking approach to your embedding method
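- **1536 dimensions** leave room for richer semantic information than 100, though the gain comes mainly from the larger pre-trained model rather than from the vector size itself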

### 🚀 Next Steps:
1. **Store embeddings in a vector database** (e.g., Pinecone, Weaviate, FAISS)
2. **Implement similarity search** to find relevant chunks for user queries
3. **Build a complete RAG pipeline** that:
   - Takes user questions
   - Converts questions to embeddings
   - Finds similar document chunks
   - Generates answers using retrieved context
4. **Experiment with different chunk sizes** to optimize retrieval quality
5. **Add metadata** to chunks (page numbers, section titles) for better context
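
As a starting point for the similarity-search step, here is a minimal brute-force sketch in NumPy. The helper `top_k_similar` and the tiny three-dimensional vectors are assumptions for illustration. In a real pipeline the chunk matrix would hold the stored chunk embeddings, the query vector would come from the same embedding model (for example via the `embed_query` method of `OpenAIEmbeddings`), and a vector database would replace the brute-force scan.

```python
import numpy as np

def top_k_similar(query_vec, chunk_matrix, k=3):
    """Return (chunk_index, cosine_similarity) pairs for the k closest chunks."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(chunk_matrix, dtype=float)
    # Cosine similarity between the query and every row of the matrix.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy stand-ins for stored chunk embeddings (one row per chunk).
toy_chunk_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = [0.9, 0.1, 0.0]

results = top_k_similar(toy_query, toy_chunk_embeddings, k=2)
for index, score in results:
    print(index, round(score, 3))
```

The returned indices point back into the list of text chunks, so the matching passages can be passed to the language model as context.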

### 📚 Additional Resources:
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Word2Vec Tutorial](https://radimrehurek.com/gensim/models/word2vec.html)
- [RAG Best Practices](https://www.pinecone.io/learn/retrieval-augmented-generation/)

---

**Congratulations!** You now understand how to transform documents into embeddings for RAG systems. 🎉