# Embedding Model Demonstration

This notebook demonstrates the core concept behind **Retrieval-Augmented Generation (RAG)**: **Vector Embeddings**.

We will:
1.  Load the project's embedding model.
2.  Embed simple sentences to visualize how semantic meaning is captured in vector space.
3.  Load the **Data Mining Textbook**, split it into chapters, and visualize how sampled chunks from the first six chapters cluster together in 3D space.

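**Note:** This notebook requires `scikit-learn`, `matplotlib`, and `pandas` to be installed in your environment. The code cells also import `numpy` and several LangChain packages, and they load the `dotenv` IPython extension.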

## 1. Setup and Imports
We load necessary libraries and set up the project environment to access our shared utilities.

In [None]:
%load_ext dotenv
%dotenv
import sys
import os
from pathlib import Path

# Add project root to path to import src modules
sys.path.append(str(Path().resolve().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)
from sklearn.decomposition import PCA
from src.util.env_check import get_embed_model

## 2. Load the Embedding Model
We use the helper function `get_embed_model` to load the model defined in your `.env` file (e.g., `qwen3-embedding` or OpenAI's `text-embedding-3-small`).

In [None]:
embedding_model = get_embed_model()
print(f"Loaded embedding model: {embedding_model}")

## 3. Simple Sentence Demonstration
To understand embeddings, let's start small. We define two groups of sentences:
1.  **Animals/Nature**
2.  **Computer Science/Data Mining**

We expect the model to group semantically similar sentences together.
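One way to check this expectation numerically, rather than only by eye, is to compare the average cosine similarity within a group to the average between groups. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings; `toy_vectors`, `toy_labels`, and `cosine_similarity_matrix` are illustrative names, not part of the project.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of pairwise cosine similarities between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # scale every row to unit length
    return unit @ unit.T

# Hypothetical toy "embeddings"; real ones have far more dimensions.
toy_vectors = np.array([
    [0.9, 0.1, 0.0],  # stands in for an animal sentence
    [0.8, 0.2, 0.1],  # stands in for an animal sentence
    [0.1, 0.9, 0.2],  # stands in for a tech sentence
    [0.0, 0.8, 0.3],  # stands in for a tech sentence
])
toy_labels = np.array(["Animal", "Animal", "Tech", "Tech"])

sims = cosine_similarity_matrix(toy_vectors)
same_group = toy_labels[:, None] == toy_labels[None, :]
off_diagonal = ~np.eye(len(toy_labels), dtype=bool)  # ignore self-similarity

within = sims[same_group & off_diagonal].mean()
between = sims[~same_group].mean()
print(f"mean similarity within groups:  {within:.3f}")
print(f"mean similarity between groups: {between:.3f}")
```

With the notebook's own data, the same function could be applied to `embeddings_array` and `np.array(labels)` once the cell below has run. A within-group mean clearly above the between-group mean would support the expectation.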

In [None]:
sentences = [
    # Group A: Animals
    "The quick brown fox jumps over the dog.",
    "A dog is a man's best friend.",
    "The cat sleeps on the warm windowsill.",
    "Wild wolves hunt in packs in the forest.",
    
    # Group B: Data Mining
    "Data mining involves discovering patterns in large datasets.",
    "Neural networks are a subset of machine learning algorithms.",
    "Clustering algorithms group similar data points together.",
    "Retrieval augmented generation uses vector databases."
]

labels = ["Animal"] * 4 + ["Tech"] * 4

# Generate embeddings
embeddings = embedding_model.embed_documents(sentences)
embeddings_array = np.array(embeddings)

print(f"Generated embeddings shape: {embeddings_array.shape}")

### Visualization (PCA)
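Embeddings are high-dimensional vectors, often with 768 or more dimensions (OpenAI's `text-embedding-3-small`, for example, returns 1,536 by default). We use **Principal Component Analysis (PCA)** to project them onto the 3 directions of greatest variance so we can plot them. The projection discards information, so distances in the plot only approximate distances in the original space.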
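To make the reduction less of a black box, here is a minimal NumPy sketch of what PCA computes: center the data, take a singular value decomposition, and project onto the leading directions. It is an illustration on synthetic data, not scikit-learn's implementation; `pca_project` and `fake_embeddings` are hypothetical names.

```python
import numpy as np

def pca_project(data: np.ndarray, n_components: int):
    """Project rows of `data` onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return projected, explained_ratio

# Synthetic stand-in for 8 sentence embeddings with 768 dimensions each.
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.normal(size=(8, 768))

projected, explained_ratio = pca_project(fake_embeddings, n_components=3)
print(projected.shape)
print(f"variance retained by 3 components: {explained_ratio.sum():.1%}")
```

The retained-variance figure indicates how much of the original spread the 3D plot can show. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_` after fitting, so `pca.explained_variance_ratio_.sum()` gives the equivalent number for the real embeddings. Axis signs may differ between this sketch and scikit-learn.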

In [None]:
pca = PCA(n_components=3)
reduced_embeddings = pca.fit_transform(embeddings_array)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

colors = {'Animal': 'green', 'Tech': 'blue'}

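seen_labels = set()
for (x, y, z), sentence, label in zip(reduced_embeddings, sentences, labels):
    # Pass a legend label only for the first point of each group to avoid duplicate entries
    legend_label = label if label not in seen_labels else None
    seen_labels.add(label)
    ax.scatter(x, y, z, c=colors[label], s=100, label=legend_label)
    ax.text(x, y, z, sentence[:20] + "...", fontsize=9)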

ax.set_title("3D Visualization of Sentence Embeddings")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.legend()
plt.show()

## 4. Textbook Analysis
Now we apply this to the actual course material. We will:
1.  Load the **Textbook PDF**.
2.  Use the `contents.json` map to split it into **Chapters**.
3.  Embed a sample of text chunks from each of the first six chapters.
4.  Visualize whether the chapters cluster separately in vector space.
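Step 3 relies on splitting each chapter into overlapping chunks. The sketch below shows the basic idea with a plain sliding window. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks rather than at fixed character positions, and `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks whose neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.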

In [None]:
import json
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Paths
json_path = "../data/processed/contents.json"
pdf_path = "../data/raw/Textbook.pdf"

# Load Chapter Map
with open(json_path, "r", encoding="utf-8") as f:
    chapters_json = json.load(f)

# Load PDF Pages
loader = PyPDFLoader(pdf_path)
pages = loader.load()

print(f"Loaded {len(pages)} pages from the textbook.")

### Processing Chapters
We use the logic from `advanced_ingest.py` to extract text by chapter boundaries.
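The only subtle part is translating book page numbers into zero-based positions in the list of PDF pages. The sketch below isolates that arithmetic using the same offset as the cell that follows. `chapter_slice`, the stand-in `pages` list, and the example chapter entry are hypothetical and are not taken from `advanced_ingest.py`.

```python
PAGE_OFFSET = 26  # same value as in the notebook cell

def chapter_slice(start_page, end_page, page_offset=PAGE_OFFSET):
    """Return the slice of the PDF page list that covers a chapter (end page inclusive)."""
    start_idx = start_page + page_offset - 1  # -1 converts to a zero-based index
    # A missing end page means "read to the end of the PDF".
    stop_idx = None if end_page is None else end_page + page_offset
    return slice(start_idx, stop_idx)

# Stand-in for loader.load(): 100 fake pages labelled by their 1-based PDF position.
pages = [f"pdf page {n}" for n in range(1, 101)]
chapter = {"start_page": 1, "end_page": 3}  # hypothetical chapter entry

selected = pages[chapter_slice(chapter["start_page"], chapter["end_page"])]
print(selected)  # ['pdf page 27', 'pdf page 28', 'pdf page 29']
```

Under this offset, book page 1 is the 27th page of the PDF, so a chapter spanning book pages 1 to 3 maps to list indices 26 through 28.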

In [None]:
PAGE_OFFSET = 26  # Book page N is the (N + 26)th page of the PDF
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chapter_chunks = []
chapter_labels = []

# We will process only the first 6 chapters for clarity in the plot
target_chapters = chapters_json[:6]

print(f"Processing {len(target_chapters)} chapters...")

for chapter in target_chapters:
    start_page = chapter["start_page"]
    end_page = chapter["end_page"]
    ch_title = chapter["title"]
    ch_num = chapter["chapter_number"]
    
    # Calculate PDF indices
    start_idx = start_page + PAGE_OFFSET - 1
    if end_page is not None:
        end_idx = end_page + PAGE_OFFSET - 1
        current_pages = pages[start_idx : end_idx + 1]
    else:
        current_pages = pages[start_idx : ]
        
    # Merge text for the chapter
    full_text = "\n".join([p.page_content for p in current_pages])
    
    # Create a document and split it
    doc = Document(page_content=full_text, metadata={"chapter": f"Ch {ch_num}"})
    chunks = text_splitter.split_documents([doc])
    
    # Take a sample of chunks (e.g., first 15) to keep the plot readable
    sample_chunks = chunks[:15]
    
    for chunk in sample_chunks:
        # Inject context like in our advanced ingest strategy
        contextualized_text = f"Chapter {ch_num}: {ch_title}\n{chunk.page_content}"
        chapter_chunks.append(contextualized_text)
        chapter_labels.append(f"Ch {ch_num}")

print(f"Total chunks prepared for embedding: {len(chapter_chunks)}")

### Generate Chapter Embeddings
This may take a minute, depending on your hardware (local CPU vs GPU) or, if your `.env` points to a hosted model, on network latency.

In [None]:
book_embeddings = embedding_model.embed_documents(chapter_chunks)
book_embeddings_array = np.array(book_embeddings)

### 3D Visualization of Chapters
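We plot the chunks in 3D space, coloring them by chapter. Chunks from the same chapter should tend to group together, reflecting their different topics (e.g., "Data Preparation" vs "Clustering" vs "Classification"). Keep in mind that every chunk was prefixed with its chapter title, which by itself pulls chunks from the same chapter closer together.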

In [None]:
pca = PCA(n_components=3)
reduced_book = pca.fit_transform(book_embeddings_array)

# Create DataFrame for easier plotting
df = pd.DataFrame(reduced_book, columns=['x', 'y', 'z'])
df['label'] = chapter_labels

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')

# Color map
unique_labels = sorted(set(chapter_labels))
colors = plt.cm.jet(np.linspace(0, 1, len(unique_labels)))
color_map = dict(zip(unique_labels, colors))

for label in unique_labels:
    subset = df[df['label'] == label]
    ax.scatter(subset['x'], subset['y'], subset['z'], c=[color_map[label]], label=label, s=40, alpha=0.7)

ax.set_title("Semantic Clusters of Textbook Chapters")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.legend(title="Chapter")
plt.show()

## 5. Conclusion
The plot above illustrates that **semantically related text tends to cluster together** in vector space.

- Chunks from the same chapter (e.g., "Data Preparation") tend to appear close to each other.
- Chunks from different topics tend to be farther apart.

This spatial property allows our RAG system to find relevant information by looking for the vectors "closest" to the user's question, with closeness typically measured by cosine similarity.
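As a rough illustration of that lookup, the sketch below ranks a few toy document vectors by cosine similarity to a toy query vector. It is a brute-force stand-in for what a vector database does, typically with approximate nearest-neighbor search. The vectors, the chunk descriptions, and `top_k_by_cosine` are all hypothetical.

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Return the indices and scores of the k rows most similar to the query."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_units @ query_unit           # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]      # highest similarity first
    return order, scores[order]

# Hypothetical chunks and hand-made 3-dimensional "embeddings".
docs = [
    "chunk about clustering",
    "chunk about data preparation",
    "chunk about classification",
]
doc_vecs = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
])
query_vec = np.array([0.8, 0.2, 0.1])  # stands in for an embedded question

order, scores = top_k_by_cosine(query_vec, doc_vecs, k=2)
for idx, score in zip(order, scores):
    print(f"{score:.3f}  {docs[idx]}")
```

In the notebook, the document vectors would be `book_embeddings_array`. Assuming the loaded model follows LangChain's `Embeddings` interface, the query vector would come from `embedding_model.embed_query(...)`.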