# Embedding Model Demonstration

This notebook demonstrates the core concept behind Retrieval-Augmented Generation (RAG): **vector embeddings**.


In [1]:
%load_ext dotenv
%dotenv
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))
import plotly.express as px
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from src.util.env_check import get_embed_model

### Loading the Embedding Model

In [2]:
embedding_model = get_embed_model()
print(f"Loaded embedding model: {embedding_model}")

Loaded embedding model: model='qwen3-embedding:0.6b' base_url=None client_kwargs={} mirostat=None mirostat_eta=None mirostat_tau=None num_ctx=None num_gpu=None num_thread=None repeat_last_n=None repeat_penalty=None temperature=None stop=None tfs_z=None top_k=None top_p=None


### Simple Sentence Demonstration

Embedding models are machine learning models that capture the meaning of data. Early versions were shallow neural networks; most current ones are based on transformer architectures. They convert unstructured data (text, images, or audio) into dense numerical vectors called embeddings. These vectors encode semantic meaning, which lets programs measure relationships, context, and similarity between items. That capability underpins semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG).

To demonstrate embedding models on a small example, let's embed two sets of sentences:
1.  Animals/Nature
2.  Computer Science/Data Mining

We expect the model to group semantically similar sentences together.

In [3]:
sentences = [
    #Animals
    "The quick brown fox jumps over the dog.",
    "The neigbourhood dog barks all night",
    "My cat likes to chase dogs in the yard.",
    "Cheetas are very fast.",
    
    #Data Mining
    "Data mining involves discovering patterns in large datasets.",
    "Neural networks are a subset of machine learning algorithms.",
    "Clustering algorithms group similar data points together.",
    "Principal component analysis reduces the dimensionality of data."
]


# For color coding
labels = ["Animal","Animal","Animal","Animal","Tech","Tech","Tech","Tech"]

# Generate embeddings
embeddings = embedding_model.embed_documents(sentences)
embeddings_array = np.array(embeddings)

print(f"Generated embeddings shape: {embeddings_array.shape}")

Generated embeddings shape: (8, 1024)


In [4]:
embeddings_array[0]

array([ 0.02487724, -0.04650597, -0.00378125, ...,  0.04063228,
       -0.00916167, -0.00062493])

### Visualization (PCA)
Embeddings are high-dimensional vectors (1024 dimensions here). Let's use Principal Component Analysis (PCA) to reduce them to 3 dimensions so we can plot them.

You can interact with the plot.

In [5]:
pca = PCA(n_components=3)
reduced_embeddings = pca.fit_transform(embeddings_array)

df = pd.DataFrame({
    'x': reduced_embeddings[:, 0],
    'y': reduced_embeddings[:, 1],
    'z': reduced_embeddings[:, 2],
    'sentence': sentences,
    'category': labels
})

fig = px.scatter_3d(
    df, 
    x='x', y='y', z='z',
    color='category',
    hover_data={'x': False, 'y': False, 'z': False, 'sentence': True},
    title="3D Sentence Embeddings (interactive plot)",
    color_discrete_map={'Animal': 'green', 'Tech': 'blue'}
)

fig.update_traces(marker=dict(size=5))
fig.update_layout(margin=dict(l=0, r=0, b=0, t=40))

fig.show()
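
A 3D projection discards most of the original dimensions, so it helps to check how much variance the three components retain. The sketch below uses random stand-in data (`fake_embeddings` is an assumption, not the notebook's real embeddings); with the real data, the same `explained_variance_ratio_` attribute could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 8 random 1024-dimensional vectors (hypothetical, not real embeddings).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(8, 1024))

pca_check = PCA(n_components=3)
reduced = pca_check.fit_transform(fake_embeddings)

# Fraction of the total variance captured by each principal component.
ratios = pca_check.explained_variance_ratio_
retained = float(ratios.sum())

print(f"Reduced shape: {reduced.shape}")
print(f"Variance retained by 3 components: {retained:.1%}")
```

A low retained fraction means the plot is only a rough view of the original geometry, so apparent distances in 3D should be read with care.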

### Textbook Analysis

Now we apply the same approach to the textbook:
1.  Load the textbook.
2.  Use the `contents.json` map to split it into chapters.
3.  Embed a sample of text chunks from each chapter.
4.  Visualize whether chapters cluster separately in vector space.

In [6]:
import json
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

json_path = "../data/processed/contents.json"
pdf_path = "../data/raw/Textbook.pdf"

with open(json_path, "r") as f:
    chapters_json = json.load(f)

loader = PyPDFLoader(pdf_path)
pages = loader.load()

print(f"Loaded {len(pages)} pages from the textbook.")

Loaded 746 pages from the textbook.


### Processing Chapters
We reuse the logic from `advanced_ingest.py` to split the book by chapter.

In [7]:
# Offset between the page numbers in contents.json and positions in the loaded PDF
PAGE_OFFSET = 26
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chapter_chunks = []
chapter_labels = []

# Processing only the first 6 chapters for clarity in the plot
target_chapters = chapters_json[:6]

for chapter in target_chapters:
    start_page = chapter["start_page"]
    end_page = chapter["end_page"]
    ch_title = chapter["title"]
    ch_num = chapter["chapter_number"]
    
    start_idx = start_page + PAGE_OFFSET - 1
    if end_page is not None:
        end_idx = end_page + PAGE_OFFSET - 1
        current_pages = pages[start_idx : end_idx + 1]
    else:
        current_pages = pages[start_idx : ]
        
    full_text = "\n".join([p.page_content for p in current_pages])
    
    doc = Document(page_content=full_text, metadata={"chapter": f"Ch {ch_num}"})
    chunks = text_splitter.split_documents([doc])
    
    # Keep only the first 15 chunks of each chapter
    sample_chunks = chunks[:15]
    
    for chunk in sample_chunks:
        contextualized_text = f"Chapter {ch_num}: {ch_title}\n{chunk.page_content}"
        chapter_chunks.append(contextualized_text)
        chapter_labels.append(f"Ch {ch_num}")

print(f"Total chunks prepared for embedding: {len(chapter_chunks)}")

Total chunks prepared for embedding: 90
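
The index arithmetic in the loop above can be isolated into a small helper. This is a sketch under the assumption that `start_page` and `end_page` are 1-based page numbers and that the offset counts the PDF pages preceding page 1; `chapter_page_slice` is a hypothetical name, not a function from `advanced_ingest.py`.

```python
from typing import Optional


def chapter_page_slice(start_page: int, end_page: Optional[int], page_offset: int) -> slice:
    """Map 1-based page numbers to a 0-based slice into the PDF page list."""
    start_idx = start_page + page_offset - 1
    if end_page is None:
        # Last chapter: run to the end of the document.
        return slice(start_idx, None)
    # The end page is inclusive, so the slice stop is one past its index.
    return slice(start_idx, end_page + page_offset)


# Stand-in for the loaded PDF: 60 numbered pages.
pdf_pages = list(range(60))

first = pdf_pages[chapter_page_slice(1, 10, page_offset=26)]
last = pdf_pages[chapter_page_slice(30, None, page_offset=26)]
print(len(first), first[0], first[-1])
print(len(last), last[0], last[-1])
```

Keeping the inclusive end page in one place makes off-by-one errors easier to spot than when the `- 1` and `+ 1` adjustments are spread across the loop.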


### Generate Chapter Embeddings

In [8]:
book_embeddings = embedding_model.embed_documents(chapter_chunks)
book_embeddings_array = np.array(book_embeddings)

### 3D Visualization of chapters
We plot the chunks in 3D space, coloring them by chapter. We should see chunks from the same chapters cluster together.

In [9]:
pca = PCA(n_components=3)
reduced_book = pca.fit_transform(book_embeddings_array)

df = pd.DataFrame({
    'x': reduced_book[:, 0],
    'y': reduced_book[:, 1],
    'z': reduced_book[:, 2],
    'chapter': chapter_labels,
    'text': chapter_chunks
})

fig = px.scatter_3d(
    df, 
    x='x', y='y', z='z',
    color='chapter',
    hover_data={'x': False, 'y': False, 'z': False, 'chapter': True},
    title="Semantic Clusters of Textbook Chapters",
    color_discrete_sequence=px.colors.qualitative.Plotly
)

fig.update_traces(
    marker=dict(size=4, opacity=0.8, line=dict(width=0))
)

fig.update_layout(
    margin=dict(l=0, r=0, b=0, t=40),
    legend_title_text='Chapter',
    scene=dict(
        xaxis_title='PC1',
        yaxis_title='PC2',
        zaxis_title='PC3'
    )
)

fig.show()

The plot above demonstrates that semantically related text clusters together in vector space.

- Chunks from the same chapter appear close to each other.
- Chunks from different topics are separated.
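
A visual impression of clustering can be backed by a number. One option (not used in the original notebook) is the silhouette score, which ranges from -1 to 1 and is higher when points sit closer to their own label group than to other groups. The sketch uses synthetic stand-in vectors; with the real data, `book_embeddings_array` and `chapter_labels` would take the place of `demo_vectors` and `demo_labels`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in: two "chapters", each scattered around its own center.
center_a = rng.normal(size=64)
center_b = rng.normal(size=64)
group_a = center_a + 0.1 * rng.normal(size=(15, 64))
group_b = center_b + 0.1 * rng.normal(size=(15, 64))

demo_vectors = np.vstack([group_a, group_b])
demo_labels = ["Ch 1"] * 15 + ["Ch 2"] * 15

# Cosine distance matches the similarity measure used for retrieval.
score = silhouette_score(demo_vectors, demo_labels, metric="cosine")
print(f"Silhouette score: {score:.3f}")
```

Computing the score on the full-dimensional embeddings rather than on the PCA output avoids judging the clusters by a lossy projection.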


### How This Applies to RAG (Retrieval-Augmented Generation)

When a user asks a question, a RAG system performs the following steps:
1. Embed the query: It takes the user's raw text question and passes it through the exact same embedding model used for the database.
2. Calculate similarity: It compares the query's vector against every chunk's vector in the database to find the closest matches.
3. Retrieve and generate: It takes the top K most similar chunks and feeds them to an LLM as context to formulate an accurate answer.
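
The three steps can be sketched end to end with NumPy. Everything here is a stand-in: `toy_embed` is a hypothetical bag-of-words embedder used only to keep the example self-contained (a real system would call the embedding model), and the last step just assembles a prompt string instead of calling an LLM.

```python
import numpy as np

VOCAB = ["cat", "dog", "fast", "data", "cluster", "network"]


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical embedder: counts vocabulary words, then L2-normalizes."""
    words = text.lower().replace(".", "").replace("?", "").split()
    counts = np.array([words.count(term) for term in VOCAB], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Step 1: embed the query with the same embedder used for the chunks.
    query_vec = toy_embed(query)
    chunk_matrix = np.stack([toy_embed(c) for c in chunks])
    # Step 2: for unit vectors, the dot product equals cosine similarity.
    scores = chunk_matrix @ query_vec
    # Step 3: keep the k highest-scoring chunks.
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]


toy_chunks = [
    "The cat is fast.",
    "The dog chased the cat.",
    "Data points form a cluster.",
    "A network processes data.",
]
context = retrieve("Which cat is fast?", toy_chunks, k=2)
prompt = "Answer using this context:\n" + "\n".join(context)
print(prompt)
```

The important constraint carried over from step 1 is that the query and the chunks go through the same embedder; vectors from different models are not comparable.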

**Cosine Similarity**

To find the closest vectors, the most commonly used metric is cosine similarity. Instead of measuring the distance between two points, it measures the angle between two vectors. This works well for text because it compares orientation (which encodes semantic meaning) and ignores magnitude, so two vectors pointing the same way score as similar regardless of their lengths.

The formula for Cosine Similarity between vectors $A$ and $B$ is:
$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}$$

A score of **1** means the vectors point in exactly the same direction (highly similar), **0** means they are orthogonal (unrelated), and **-1** means they point in opposite directions.
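
A few hand-picked 2D vectors make the three reference values concrete and show that scaling a vector does not change the score. The vectors are illustrative, not model outputs, and `cos_sim` is a throwaway helper name.

```python
import numpy as np


def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0])

same_direction = cos_sim(a, 10 * a)             # scaled copy of a
orthogonal = cos_sim(a, np.array([-2.0, 1.0]))  # perpendicular to a
opposite = cos_sim(a, -a)                       # reversed direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```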

### Simple demonstration

In [10]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

user_query = "What cat is super quick?"

query_embedding = embedding_model.embed_documents([user_query])[0]
query_vector = np.array(query_embedding)
query_vector



array([ 0.04200577, -0.01553882, -0.00555002, ...,  0.02724347,
       -0.023457  , -0.01733154])

This is essentially what happens when we run a similarity search on a vector store.

In [11]:

similarities = []
for sentence_vector in embeddings_array:
    sim = cosine_similarity(query_vector, sentence_vector)
    similarities.append(sim)

best_match_idx = np.argmax(similarities)
best_score = similarities[best_match_idx]
best_sentence = sentences[best_match_idx]

print(f"User Query: '{user_query}'\n")
print("-" * 50)
print(f"Top Database Match (Similarity Score: {best_score:.4f}):")
print(f"'{best_sentence}'")
print("-" * 50)

print("\nAll scores ranked:")
ranked_indices = np.argsort(similarities)[::-1]
for idx in ranked_indices:
    print(f"[{similarities[idx]:.4f}] {sentences[idx]}")

User Query: 'What cat is super quick?'

--------------------------------------------------
Top Database Match (Similarity Score: 0.6869):
'Cheetas are very fast.'
--------------------------------------------------

All scores ranked:
[0.6869] Cheetas are very fast.
[0.5760] My cat likes to chase dogs in the yard.
[0.5484] The quick brown fox jumps over the dog.
[0.4342] The neigbourhood dog barks all night
[0.3019] Clustering algorithms group similar data points together.
[0.2867] Data mining involves discovering patterns in large datasets.
[0.2864] Principal component analysis reduces the dimensionality of data.
[0.2815] Neural networks are a subset of machine learning algorithms.
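
The loop in the cell above scores one vector at a time. The same ranking can be computed with a single matrix-vector product. The sketch uses small random stand-in arrays (`doc_matrix` and `query_vec` are hypothetical names); with the notebook's data they would correspond to `embeddings_array` and `query_vector`.

```python
import numpy as np

rng = np.random.default_rng(7)
doc_matrix = rng.normal(size=(8, 16))  # stand-in for the document embeddings
query_vec = rng.normal(size=16)        # stand-in for the query embedding

# Normalize the rows and the query to unit length; one matrix-vector product
# then yields the cosine similarity of the query with every document.
doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
query_unit = query_vec / np.linalg.norm(query_vec)
batch_scores = doc_unit @ query_unit

# Cross-check against the per-vector formula.
loop_scores = np.array([
    np.dot(query_vec, d) / (np.linalg.norm(query_vec) * np.linalg.norm(d))
    for d in doc_matrix
])

top_k = np.argsort(batch_scores)[::-1][:3]
print("Top 3 indices:", top_k)
print("Matches loop:", np.allclose(batch_scores, loop_scores))
```

If the document vectors are normalized once when they are stored, each query only costs one normalization and one product.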
