# Build Embeddings & FAISS Index (Slides + Labs)

The goal of this notebook is to convert our processed text chunks (from slides and labs) into numerical embeddings and organize them in FAISS indexes. These indexes will let us perform fast semantic search later when generating personalized study plans.

**Inputs (from `data/processed/`)**

- `slides_chunks.parquet`: cleaned text chunks from lecture slides
- `labs_chunks.parquet`: cleaned text chunks from lab notebooks

**What this notebook does**

1. Loads the chunked text data for slides and labs.

2. Uses a sentence transformer embedding model to turn each chunk of text into a numerical vector.

3. Normalizes each embedding to unit length, so the inner product of two vectors equals their cosine similarity.

4. Creates two FAISS indexes (one for slides, one for labs) and saves them to disk.

5. Defines helper functions to search the indexes for the most relevant chunks given a query.

6. Runs a smoke test query (`"SQL joins inner left right"`) to check that both indexes return reasonable matches.

**Outputs (written to `data/processed/`)**

- `faiss_slides.index`: FAISS index for slide chunks
- `faiss_labs.index`: FAISS index for lab chunks

**Explanation**

- Embeddings turn unstructured text into vectors that can be compared mathematically.
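- FAISS runs these comparisons efficiently. At the scale of this notebook (roughly 1,450 chunks), an exact `IndexFlatIP` search that scores every stored vector is fast enough.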
- With these indexes, we can map student feedback like *“I lost points on SQL joins”* directly to the most relevant slides or lab examples.
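
To make the normalization point concrete, here is a minimal NumPy sketch (not part of the original notebook; the two vectors are made-up values, not real embeddings). It suggests that once vectors are scaled to unit length, their inner product matches the cosine similarity of the originals.

```python
import numpy as np

# Two arbitrary example vectors (illustrative values, not real embeddings)
a = np.array([3.0, 4.0, 0.0], dtype="float32")
b = np.array([1.0, 2.0, 2.0], dtype="float32")

def cosine_similarity(u, v):
    """Cosine similarity computed directly from its definition."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

cosine = cosine_similarity(a, b)
inner_after_norm = float(np.dot(l2_normalize(a), l2_normalize(b)))

# The two values agree up to floating-point error
print(cosine, inner_after_norm)
```

This is why the notebook can use an inner-product index and still interpret the scores as cosine similarities.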

In [24]:
# libraries
from pathlib import Path
import os
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer



In [25]:
# Folders
PROJECT_ROOT = Path("..").resolve()
RAW_FOLDER = PROJECT_ROOT / "data" / "raw"
PROCESSED_FOLDER = PROJECT_ROOT / "data" / "processed"

# Input files 
SLIDES_CHUNKS_PATH = PROCESSED_FOLDER / "slides_chunks.parquet"
LABS_CHUNKS_PATH   = PROCESSED_FOLDER / "labs_chunks.parquet"

# Output files (Indexes)
FAISS_INDEX_PATH = PROCESSED_FOLDER / "faiss_slides.index"
FAISS_LABS_INDEX_PATH = PROCESSED_FOLDER / "faiss_labs.index"

print("Slides chunks exist? ", SLIDES_CHUNKS_PATH.exists(), SLIDES_CHUNKS_PATH)
print("Labs chunks exist?   ", LABS_CHUNKS_PATH.exists(), LABS_CHUNKS_PATH)



Slides chunks exist?  True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\slides_chunks.parquet
Labs chunks exist?    True C:\Users\julmo\OneDrive - University of Rochester\TKH Labs\Grades2Goals_Planner\data\processed\labs_chunks.parquet


In [26]:
# Load chunked data
slides_df = pd.read_parquet(SLIDES_CHUNKS_PATH)
labs_df   = pd.read_parquet(LABS_CHUNKS_PATH)

# Drop empty rows
slides_df = slides_df.dropna(subset=['text']).reset_index(drop=True)
labs_df   = labs_df.dropna(subset=['text']).reset_index(drop=True)

print("Slides rows:", len(slides_df))
print("Labs rows:  ", len(labs_df))


Slides rows: 40
Labs rows:   1410
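
`dropna` removes rows whose `text` is missing, but it keeps rows whose text is an empty or whitespace-only string. If such rows could occur in the chunk files (an assumption; the notebook does not check for them), a sketch like the following would filter them as well. The helper name `drop_blank_text` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def drop_blank_text(df, column="text"):
    """Drop rows where `column` is missing, empty, or whitespace-only."""
    df = df.dropna(subset=[column])
    has_content = df[column].astype(str).str.strip().ne("")
    return df[has_content].reset_index(drop=True)

# Toy data standing in for a chunks table
toy_df = pd.DataFrame({"text": ["SELECT * FROM a", None, "   ", "", "JOIN b ON ..."]})
cleaned_df = drop_blank_text(toy_df)
print(len(toy_df), "->", len(cleaned_df))
```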


In [27]:
# Embedding model
# Model used in the class example: SentenceTransformer("bert-base-nli-mean-tokens")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # smaller and faster model

def encode_normalized(texts):
    """Encode a list of strings (text chunks or student queries) into normalized float32 vectors.
    - normalize_embeddings=True scales each vector to length 1, so the FAISS inner product equals cosine similarity (up to floating-point error).
    - Returns a NumPy array of shape (len(texts), embedding_dim) with dtype float32, because FAISS expects float32 vectors."""
    embeddings = embedder.encode(texts, normalize_embeddings=True, show_progress_bar=True)
    return np.asarray(embeddings, dtype="float32")

print("Embedding model loaded:", embedder)


Embedding model loaded: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
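
Before adding vectors to an index, it can help to confirm the two properties the `encode_normalized` docstring relies on: float32 dtype and unit length. The check below is a sketch on random stand-in data (no model is loaded here). `check_embeddings` is a hypothetical helper, and 384 is the `word_embedding_dimension` reported in the model summary above.

```python
import numpy as np

def check_embeddings(embeddings, atol=1e-4):
    """Return True if embeddings form a 2-D float32 array with unit-length rows."""
    if embeddings.ndim != 2 or embeddings.dtype != np.float32:
        return False
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))

# Random stand-in for real embeddings: 10 vectors of dimension 384
rng = np.random.default_rng(0)
raw = rng.normal(size=(10, 384)).astype("float32")
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(check_embeddings(raw))   # unnormalized input
print(check_embeddings(unit))  # normalized input
```

In the notebook, the same check could be applied to the arrays returned by `encode_normalized`.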


In [28]:
# Create two FAISS indexes, one for slides and one for labs

# Slides index
slides_embeddings = encode_normalized(slides_df['text'].tolist())
slides_index = faiss.IndexFlatIP(slides_embeddings.shape[1]) # IndexFlatIP for inner product (cosine similarity)
slides_index.add(slides_embeddings)
faiss.write_index(slides_index, FAISS_INDEX_PATH.as_posix())

# Labs index
labs_embeddings = encode_normalized(labs_df['text'].tolist())
labs_index = faiss.IndexFlatIP(labs_embeddings.shape[1])
labs_index.add(labs_embeddings)
faiss.write_index(labs_index, FAISS_LABS_INDEX_PATH.as_posix())
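
`IndexFlatIP` is an exact index: a search computes the inner product between the query and every stored vector and returns the highest-scoring ones. The NumPy sketch below imitates that behavior on random stand-in data to show what the scores mean. It is an illustration, not FAISS's implementation, and `brute_force_top_k` is a hypothetical name.

```python
import numpy as np

def brute_force_top_k(query, vectors, top_k=5):
    """Return (scores, indices) of the top_k rows of `vectors` by inner product with `query`."""
    scores = vectors @ query                 # one inner product per stored vector
    top_k = min(top_k, len(vectors))
    order = np.argsort(-scores)[:top_k]      # indices sorted by descending score
    return scores[order], order

# Random unit-length vectors standing in for chunk embeddings
rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 8)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Use a stored vector as the query, so it should rank first with score close to 1
query = vectors[7]
scores, indices = brute_force_top_k(query, vectors, top_k=3)
print(indices, scores)
```

Unlike FAISS, which pads missing results with `-1` when fewer than `top_k` vectors are available, this sketch simply returns fewer entries.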



In [29]:
def search_slides(query_text, top_k=5):
    """
    Search ONLY the slides index and return a small, readable DataFrame of results.

    Steps:
    1) Encode the query using the same embedding model and normalization.
    2) Ask FAISS for the top_k most similar vectors from the slides index.
    3) For each match, look up the original row in slides_df to get metadata (file, page, text).
    4) Build a small results table with source_type, file, page, text, and the similarity score.
    5) Sort by score descending (higher ≈ more similar).
    """
    # Embed the query
    query_vector = encode_normalized([query_text])

    # FAISS search: for IndexFlatIP the first return value holds inner-product scores
    scores, indices = slides_index.search(query_vector, top_k)

    # Turn indices into a list of rows from slides_df
    rows = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results exist
            continue
        row = slides_df.iloc[idx]  # get the matching slide chunk
        # result structure
        rows.append({
            "source_type": "slide",
            "file": row['file'],  # slide filename
            "page": row['page'],  # slide page number
            "text": row['text'],  # slide text chunk that matched
            "score": score,  # similarity score (higher is better)
        })

    # Create a DataFrame and sort by score descending.
    # Explicit columns keep sort_values from failing when there are no matches.
    columns = ["source_type", "file", "page", "text", "score"]
    results_df = pd.DataFrame(rows, columns=columns).sort_values(by="score", ascending=False).reset_index(drop=True)
    return results_df

In [30]:
def search_labs(query_text, top_k=5):
    """
    Search ONLY the labs index and return a small, readable DataFrame of results.

    Steps:
    1) Encode the query using the same embedding model and normalization.
    2) Ask FAISS for the top_k most similar vectors from the labs index.
    3) For each match, look up the original row in labs_df to get metadata (file, text).
    4) Build a small results table with source_type, file, text, and the similarity score.
    5) Sort by score descending (higher ≈ more similar).
    """
    # Embed the query
    query_vector = encode_normalized([query_text])

    # FAISS search: for IndexFlatIP the first return value holds inner-product scores
    scores, indices = labs_index.search(query_vector, top_k)

    # Turn indices into a list of rows from labs_df
    rows = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results exist
            continue
        row = labs_df.iloc[idx]  # get the matching lab chunk
        # result structure
        rows.append({
            "source_type": "lab",
            "file": row['file'],  # lab filename
            "text": row['text'],  # lab text chunk that matched
            "score": score,  # similarity score (higher is better)
        })

    # Create a DataFrame and sort by score descending.
    # Explicit columns keep sort_values from failing when there are no matches.
    columns = ["source_type", "file", "text", "score"]
    results_df = pd.DataFrame(rows, columns=columns).sort_values(by="score", ascending=False).reset_index(drop=True)
    return results_df
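
`search_slides` and `search_labs` differ only in the index, the DataFrame, the label, and the metadata columns. One possible refactor is a single shared helper, sketched here under the assumption that the index object exposes `search(query_vector, top_k)` returning `(scores, indices)`, as the FAISS indexes in this notebook do. The names `search_chunks` and `_ToyIndex` are hypothetical; the toy index exists only so the sketch runs without FAISS or the embedding model.

```python
import numpy as np
import pandas as pd

def search_chunks(index, chunks_df, query_vector, source_type, meta_columns, top_k=5):
    """Search `index` and return the matching rows of `chunks_df` with their scores."""
    scores, indices = index.search(query_vector, top_k)
    rows = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:  # padding for missing results
            continue
        row = chunks_df.iloc[int(idx)]
        result = {"source_type": source_type}
        result.update({col: row[col] for col in meta_columns})
        result["score"] = float(score)
        rows.append(result)
    columns = ["source_type", *meta_columns, "score"]
    results = pd.DataFrame(rows, columns=columns)
    return results.sort_values(by="score", ascending=False).reset_index(drop=True)

class _ToyIndex:
    """Tiny stand-in with a FAISS-like search(); pads missing results with -1."""
    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, queries, top_k):
        all_scores = queries @ self.vectors.T
        order = np.argsort(-all_scores, axis=1)[:, :top_k]
        scores = np.take_along_axis(all_scores, order, axis=1)
        pad = top_k - order.shape[1]
        if pad > 0:
            order = np.pad(order, ((0, 0), (0, pad)), constant_values=-1)
            scores = np.pad(scores, ((0, 0), (0, pad)), constant_values=-np.inf)
        return scores, order

# Toy data: two "chunks" with 2-D unit vectors
toy_df = pd.DataFrame({"file": ["a.ipynb", "b.ipynb"], "text": ["joins", "indexes"]})
toy_index = _ToyIndex(np.array([[1.0, 0.0], [0.0, 1.0]], dtype="float32"))
query = np.array([[0.6, 0.8]], dtype="float32")

toy_results = search_chunks(toy_index, toy_df, query, "lab", ["file", "text"], top_k=5)
print(toy_results)
```

With such a helper, a slides search would reduce to a call like `search_chunks(slides_index, slides_df, encode_normalized([query_text]), "slide", ["file", "page", "text"])`.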


In [31]:
display(search_slides("SQL joins inner left right", top_k=5))
display(search_labs("SQL joins inner left right", top_k=5))


| | source_type | file | page | text | score |
|---|---|---|---|---|---|
| 0 | slide | Introduction to Structured Databases II.pdf | 42 | Wednesday\nMore SQL Practice!\n●\nSQL Leetcode... | 0.353842 |
| 1 | slide | Introduction to Structured Databases I (1).pdf | 62 | Tuesday\nOn Tuesday we will review…\n●\nWhat i... | 0.259988 |
| 2 | slide | Advanced SQL II.pptx.pdf | 41 | Wednesday\nSQL Leetcode Review\n●\nAnother Lee... | 0.25129 |
| 3 | slide | SQL Review.pdf | 16 | Next Week…\nNext week will entail:\n●\nMonday:... | 0.231525 |
| 4 | slide | Advanced SQL I.pdf | 43 | Tuesday\nSQL + Python\n●\nHow do we design a d... | 0.179609 |



| | source_type | file | text | score |
|---|---|---|---|---|
| 0 | lab | sql_refresher.ipynb | ### SQL JOINs (Review)\nJOINs are used to comb... | 0.636397 |
| 1 | lab | w9-class2.ipynb | ### Joins\n\ninner joins:\n```sql\nSELECT *\nF... | 0.577176 |
| 2 | lab | sql_refresher.ipynb | #### SQL JOINs\nJOINs combine rows from two or... | 0.523579 |
| 3 | lab | w9-class2.ipynb | #### Examples of Joins | 0.518481 |
| 4 | lab | w10-class1.ipynb | ```sql\nSELECT s1.name AS student1, s2.name AS... | 0.437297 |
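
Both indexes were built with the same model and normalization, so their scores are on the same cosine-similarity scale, and results from the two searches can be ranked together. The sketch below shows one way to do that. `merge_results` is a hypothetical helper, and the toy DataFrames reuse a few file names and scores from the smoke test above, with the text column omitted.

```python
import pandas as pd

def merge_results(slides_results, labs_results, top_k=5):
    """Combine slide and lab results into one table ranked by score."""
    combined = pd.concat([slides_results, labs_results], ignore_index=True)
    combined = combined.sort_values(by="score", ascending=False)
    return combined.head(top_k).reset_index(drop=True)

# Toy inputs shaped like the outputs of search_slides and search_labs
slides_results = pd.DataFrame({
    "source_type": ["slide", "slide"],
    "file": ["Introduction to Structured Databases II.pdf", "SQL Review.pdf"],
    "page": [42, 16],
    "score": [0.353842, 0.231525],
})
labs_results = pd.DataFrame({
    "source_type": ["lab", "lab"],
    "file": ["sql_refresher.ipynb", "w9-class2.ipynb"],
    "score": [0.636397, 0.577176],
})

merged = merge_results(slides_results, labs_results, top_k=3)
print(merged[["source_type", "file", "score"]])
```

Lab rows have no `page` value, so that column is NaN for them after concatenation.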
