---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Homework 4**: Embeddings & Semantic Search

### 📅 **Due Date**: Day of Lecture 5, 11:59 PM


**Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission from the professor.

---

In this homework, you'll build on Homework 3 (BM25 search) by adding **embedding-based semantic search**.

You will:
1. **Generate embeddings** using both local (Hugging Face) and API (OpenAI) models
2. **Implement cosine similarity** from scratch
3. **Implement semantic search** from scratch
4. **Compare BM25 vs semantic search** using Recall@k
5. **Compare different embedding models** and analyze their differences

**Total Points: 95**

---

## Instructions

- Complete all tasks by filling in code where you see `# YOUR CODE HERE`
- You may use ChatGPT, Claude, documentation, Stack Overflow, etc.
- When using external resources, briefly cite them in a comment
- Run all cells before submitting to ensure they work

**Submission:**
1. Create a branch called `homework-4`
2. Commit and push your work
3. Create a PR and merge to main
4. Submit the `.ipynb` file on Blackboard

---

## Task 1: Environment Setup (10 points)

### 1a. Imports (5 pts)

Import the required libraries and load the WANDS data.

In [80]:
# ruff: noqa: E402

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings("ignore")

# Import ONLY data loading from helpers.
# A plain `sys.path.append('../scripts')` import raised errors, so helpers.py
# is loaded by file path instead (approach suggested by ChatGPT).
import importlib.util
import sys
import os

helpers_path = os.path.abspath("../scripts/helpers.py")

spec = importlib.util.spec_from_file_location("helpers", helpers_path)
helpers = importlib.util.module_from_spec(spec)
sys.modules["helpers"] = helpers
spec.loader.exec_module(helpers)

from helpers import load_wands_products, load_wands_queries, load_wands_labels

# Embedding libraries - we use these directly
from sentence_transformers import SentenceTransformer
import litellm

# Load environment variables for API keys
from dotenv import load_dotenv
load_dotenv()

pd.set_option('display.max_colwidth', 80)
print("All imports successful!")

All imports successful!


In [81]:
# Load the WANDS dataset
products = load_wands_products()
queries = load_wands_queries()
labels = load_wands_labels()

print(f"Products: {len(products):,}")
print(f"Queries: {len(queries):,}")
print(f"Labels: {len(labels):,}")

Products: 42,994
Queries: 480
Labels: 233,448


### 1b. Copy BM25 functions from HW3 (5 pts)

Copy your BM25 implementation from Homework 3. We'll use it to compare against semantic search.

In [82]:
# Copy your BM25 functions from Homework 3
from collections import Counter  # needed by build_index below
import Stemmer
import string

stemmer = Stemmer.Stemmer('english')
punct_trans = str.maketrans({key: ' ' for key in string.punctuation})

def snowball_tokenize(text: str) -> list[str]:
    """
    Tokenize text with Snowball stemming.
    
    Args:
        text: The text to tokenize
        
    Returns:
        List of stemmed tokens
    """
    if pd.isna(text) or text is None:
        return []
    text = str(text).translate(punct_trans)
    tokens = text.lower().split()
    return [stemmer.stemWord(token) for token in tokens]

def build_index(docs: list[str], tokenizer) -> tuple[dict, list[int]]:
    """
    Build an inverted index from a list of documents.
    
    Args:
        docs: List of document strings to index
        tokenizer: Function that takes text and returns list of tokens
        
    Returns:
        index: dict mapping term -> {doc_id: term_count}
        doc_lengths: list of document lengths (in tokens)
    """
    index = {}
    doc_lengths = []
    
    for doc_id, doc in enumerate(docs):
        tokens = tokenizer(doc)
        doc_lengths.append(len(tokens))
        term_counts = Counter(tokens)
        
        for term, count in term_counts.items():
            if term not in index:
                index[term] = {}
            index[term][doc_id] = count
    
    return index, doc_lengths

def get_tf(term: str, doc_id: int, index: dict) -> int:
    """
    Get term frequency for a term in a document.
    
    Args:
        term: The term to look up
        doc_id: The document ID
        index: The inverted index
        
    Returns:
        Term frequency (count), or 0 if not found
    """
    if term in index and doc_id in index[term]:
        return index[term][doc_id]
    return 0

def get_df(term: str, index: dict) -> int:
    """
    Get document frequency for a term.
    
    Args:
        term: The term to look up
        index: The inverted index
        
    Returns:
        Number of documents containing the term
    """
    if term in index:
        return len(index[term])
    return 0

def bm25_idf(df: int, num_docs: int) -> float:
    """
    BM25 IDF formula.
    
    Args:
        df: Document frequency
        num_docs: Total number of documents
        
    Returns:
        IDF score
    """
    return np.log((num_docs - df + 0.5) / (df + 0.5) + 1)

def bm25_tf(tf: int, doc_len: int, avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """
    BM25 TF normalization.
    
    Args:
        tf: Term frequency
        doc_len: Document length in tokens
        avg_doc_len: Average document length
        k1: Saturation parameter (default 1.2)
        b: Length normalization (default 0.75)
        
    Returns:
        Normalized TF score
    """
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

def score_bm25(query: str, index: dict, num_docs: int, doc_lengths: list[int], 
               tokenizer, k1: float = 1.2, b: float = 0.75) -> np.ndarray:
    """
    Score all documents using BM25.
    
    Args:
        query: The search query
        index: Inverted index
        num_docs: Total number of documents
        doc_lengths: List of document lengths
        tokenizer: Tokenization function
        
    Returns:
        Array of scores for each document
    """
    query_tokens = tokenizer(query)
    scores = np.zeros(num_docs)
    avg_doc_len = np.mean(doc_lengths) if doc_lengths else 1.0
    
    for token in query_tokens:
        df = get_df(token, index)
        if df == 0:
            continue
        
        idf = bm25_idf(df, num_docs)
        
        if token in index:
            for doc_id, tf in index[token].items():
                tf_norm = bm25_tf(tf, doc_lengths[doc_id], avg_doc_len, k1, b)
                scores[doc_id] += idf * tf_norm
    
    return scores

def search_products(query: str, products_df: pd.DataFrame, index: dict, 
                    doc_lengths: list[int], tokenizer, k: int = 10) -> pd.DataFrame:
    """
    Search products and return top-k results.
    
    Args:
        query: The search query
        products_df: DataFrame of products
        index: Inverted index
        doc_lengths: Document lengths
        tokenizer: Tokenization function
        k: Number of results to return
        
    Returns:
        DataFrame with top-k products and scores
    """
    scores = score_bm25(query, index, len(products_df), doc_lengths, tokenizer)
    top_k_idx = np.argsort(-scores)[:k]
    
    results = products_df.iloc[top_k_idx].copy()
    results['score'] = scores[top_k_idx]
    # Use the actual number of rows so this also works when fewer than k products exist
    results['rank'] = range(1, len(results) + 1)
    return results

print("All functions defined!")

All functions defined!
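
As a quick sanity check on the two BM25 formulas above, the minimal sketch below re-implements the IDF and TF components with only the standard library and evaluates them on hand-picked numbers. The corpus size, document frequencies, and document lengths are made up for illustration; they are not taken from WANDS.

```python
import math

def bm25_idf(df: int, num_docs: int) -> float:
    """BM25 IDF: terms that appear in fewer documents get larger weights."""
    return math.log((num_docs - df + 0.5) / (df + 0.5) + 1)

def bm25_tf(tf: int, doc_len: int, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF: grows with tf but saturates below k1 + 1."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# Hypothetical corpus of 1,000 documents
rare_idf = bm25_idf(df=5, num_docs=1000)      # term found in 5 documents
common_idf = bm25_idf(df=500, num_docs=1000)  # term found in 500 documents

# Term-frequency saturation for a document of exactly average length
tf_scores = [bm25_tf(tf, doc_len=10, avg_doc_len=10) for tf in (1, 2, 10, 100)]

print(f"IDF rare term:   {rare_idf:.3f}")
print(f"IDF common term: {common_idf:.3f}")
print("TF scores:", [round(s, 3) for s in tf_scores])
```

The TF scores keep increasing with the raw count but never reach `k1 + 1 = 2.2`, which is why repeating a query term many times in a product name gives diminishing returns.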


---

## Task 2: Understanding Embeddings (15 points)

### 2a. Load a local model and generate embeddings (5 pts)

Use `sentence-transformers` to load a local embedding model and generate embeddings for a list of words.

In [83]:
# Load the all-MiniLM-L6-v2 model using SentenceTransformer
# Then generate embeddings for each word in the list
words = ["wooden coffee table", "oak dining table", "red leather sofa", "blue area rug", "kitchen sink"]
# YOUR CODE HERE
# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words, convert_to_numpy=True)
# Print the number of embeddings you generated and the dimension of the embeddings
print("Number of embeddings:", embeddings.shape[0])
print("Embedding dimension:", embeddings.shape[1])

Number of embeddings: 5
Embedding dimension: 384


### 2b. Implement cosine similarity and create a similarity matrix (5 pts)

Implement cosine similarity from scratch:

$$\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \times \|b\|}$$

In [84]:
# Implement cosine similarity from scratch
def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
# Create similarity matrix
similarity_matrix = np.zeros((len(embeddings), len(embeddings)))
for i in range(len(embeddings)):
    for j in range(len(embeddings)):
        similarity_matrix[i][j] = cosine_similarity(embeddings[i], embeddings[j])
similarity_df = pd.DataFrame(similarity_matrix, index=words, columns=words)

# Display as DataFrame
similarity_df

| | wooden coffee table | oak dining table | red leather sofa | blue area rug | kitchen sink |
|---|---|---|---|---|---|
| wooden coffee table | 1.0 | 0.588631 | 0.370622 | 0.189486 | 0.295712 |
| oak dining table | 0.588631 | 1.0 | 0.33791 | 0.249521 | 0.34141 |
| red leather sofa | 0.370622 | 0.33791 | 1.0 | 0.380311 | 0.05774 |
| blue area rug | 0.189486 | 0.249521 | 0.380311 | 1.0 | 0.125802 |
| kitchen sink | 0.295712 | 0.34141 | 0.05774 | 0.125802 | 1.0 |
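
The nested loop above calls `cosine_similarity` once per pair. An equivalent vectorized form normalizes every row to unit length and then takes a single matrix product. The sketch below shows this on toy 3-dimensional vectors; the function name `cosine_similarity_matrix` and the toy data are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms so an all-zero row does not cause a division by zero
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy vectors standing in for real embeddings
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

Rows 0 and 1 score 1.0 against each other because cosine similarity ignores vector length, and both score 0.0 against the orthogonal row 2.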


### 2c. Embed using OpenAI API (5 pts)

Use `litellm` to get embeddings from OpenAI's API and compare dimensions.

In [85]:
# Use litellm to get an embedding from OpenAI's text-embedding-3-small model
# Compare the dimension with the local model
response = litellm.embedding(model="text-embedding-3-small", input=["wooden coffee table"])
openai_embedding = np.array(response.data[0]["embedding"])
print("OpenAI embedding dimension:", len(openai_embedding))
print("Local model dimension:", embeddings.shape[1])

OpenAI embedding dimension: 1536
Local model dimension: 384


---

## Task 3: Batch Embedding Products (20 points)

### 3a. Embed a product sample (10 pts)

Create a combined text field and embed 5,000 products using the local model.

In [86]:
# Get a consistent sample
products_sample = products.sample(n=5000, random_state=42).reset_index(drop=True)


In [87]:
# Create a combined text field (product_name + product_class)
# Then embed all products using model.encode()
# YOUR CODE HERE
products_sample["combined_text"] = (
    products_sample["product_name"].fillna("")
    + " "
    + products_sample["product_class"].fillna("")
)
product_embeddings = model.encode(
    products_sample["combined_text"].tolist(),
    batch_size=64,
    show_progress_bar=True,
)
print("Embedding shape:", product_embeddings.shape)



Embedding shape: (5000, 384)
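
To see what `batch_size=64` means for 5,000 texts, the sketch below splits a placeholder list into consecutive chunks and counts them. This is a simplification for intuition only; the helper `make_batches` is hypothetical, and the library's internal batching may order texts differently.

```python
import math

def make_batches(items: list, batch_size: int) -> list[list]:
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

texts = [f"product {i}" for i in range(5000)]  # placeholder texts
batches = make_batches(texts, batch_size=64)
expected_batches = math.ceil(len(texts) / 64)

print("Number of batches:", len(batches))
print("Expected from ceil(5000 / 64):", expected_batches)
print("Size of last batch:", len(batches[-1]))
```

Larger batches mean fewer forward passes but more memory per pass, so the best value depends on your hardware.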


### 3b. Save and load embeddings (5 pts)

Save embeddings to a `.npy` file so you don't have to recompute them.

In [88]:
# Save embeddings to ../temp/hw4_embeddings.npy
# (relative paths keep the notebook runnable on other machines)
np.save("../temp/hw4_embeddings.npy", product_embeddings)
# Save products_sample to ../temp/hw4_products.csv
products_sample.to_csv("../temp/hw4_products.csv", index=False)
# Then load them back and verify they match
loaded_embeddings = np.load("../temp/hw4_embeddings.npy")
loaded_products = pd.read_csv("../temp/hw4_products.csv")
print("Embeddings match:", np.array_equal(product_embeddings, loaded_embeddings))


Embeddings match: True
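
One common way to use the saved file is a "load if cached, otherwise compute and save" helper. The sketch below is a minimal version of that pattern; `load_or_compute` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real `model.encode(...)` call so the example runs without downloading a model.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_or_compute(path: Path, compute_fn) -> np.ndarray:
    """Return the cached array at `path`, computing and saving it if missing."""
    if path.exists():
        return np.load(path)
    array = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, array)
    return array

calls = []

def fake_embed() -> np.ndarray:
    # Stand-in for model.encode(...); records each time it is called
    calls.append(1)
    return np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "temp" / "demo_embeddings.npy"
    first = load_or_compute(cache_path, fake_embed)   # computes and saves
    second = load_or_compute(cache_path, fake_embed)  # loads from disk

print("Arrays match:", np.array_equal(first, second))
print("Times computed:", len(calls))
```

A cache like this is only valid while the product sample and the model stay the same, so include both in the file name if you plan to vary them.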


### 3c. Cost estimation (5 pts)

Estimate the cost to embed all 43K products using OpenAI's API.

**Pricing**: text-embedding-3-small costs ~$0.02 per 1 million tokens.

In [89]:
# Use tiktoken to count actual tokens in the sample

# Used ChatGPT to figure out how to use tiktoken
import tiktoken
encoding = tiktoken.encoding_for_model("text-embedding-3-small")
sample_text = products_sample["combined_text"].tolist()
total_tokens = 0
for text in sample_text:
    total_tokens += len(encoding.encode(text))
print("Total tokens in sample:", total_tokens)

# Then extrapolate to estimate cost for the full dataset
avg_tokens = total_tokens / len(products_sample)
estimated_total_tokens = avg_tokens * len(products)
cost_per_million = 0.02  # USD per 1 million tokens
estimated_cost = (estimated_total_tokens / 1_000_000) * cost_per_million
print("Estimated cost to embed 43K products: $", round(estimated_cost, 2))

Total tokens in sample: 65077
Estimated cost to embed 43K products: $ 0.01
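
Rounding to two decimals hides most of the estimate, so it can help to spell out the arithmetic. The sketch below repeats the extrapolation using the token count reported above (65,077 tokens in the 5,000-product sample), the full catalog size of 42,994 products, and the approximate price of $0.02 per 1 million tokens. The variable names are illustrative.

```python
sample_tokens = 65_077       # tokens counted in the 5,000-product sample
sample_size = 5_000
total_products = 42_994      # size of the full product catalog
price_per_million = 0.02     # USD per 1M tokens (approximate)

tokens_per_product = sample_tokens / sample_size
estimated_tokens = tokens_per_product * total_products
estimated_cost = estimated_tokens / 1_000_000 * price_per_million

print(f"Tokens per product: {tokens_per_product:.2f}")
print(f"Estimated tokens:   {estimated_tokens:,.0f}")
print(f"Estimated cost:     ${estimated_cost:.4f}")
```

At roughly 13 tokens per product, the whole catalog comes to about 560K tokens, or a little over one cent. Embedding longer fields such as `product_description` would scale the cost with the token count.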


---

## Task 4: Semantic Search (25 points)

### 4a. Implement semantic search (15 pts)

Implement a semantic search function from scratch.

In [90]:
# Implement batch cosine similarity for efficiency
def batch_cosine_similarity(query_embedding, product_embeddings):
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    product_norms = product_embeddings / np.linalg.norm(product_embeddings, axis=1, keepdims=True)
    similarities = np.dot(product_norms, query_norm)
    return similarities

In [91]:
# Implement semantic search
def semantic_search(query, model, product_embeddings, products_df, top_k=10):
    query_embedding = model.encode(query)
    similarities = batch_cosine_similarity(query_embedding, product_embeddings)
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return products_df.iloc[top_indices]

In [92]:
# Test semantic search
semantic_search("comfortable sofa", model, product_embeddings, products_sample)

Unnamed: 0,product_id,product_name,product_class,category_hierarchy,product_description,product_features,rating_count,average_rating,review_count,combined_text
723,38543,sofa bed with ottoman,,Furniture / Living Room Furniture / Sofas,create a cozy spot in your living room with this sofa bed . the sofa bed and...,seatwidth-sidetoside:61|overallheight-toptobottom:36|seatfillmaterial : foam...,,,,sofa bed with ottoman
611,13047,arshleen patio sofa with cushions,Patio Sofas,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,the outdoor patio furniture wicker rattan sofa seat with olefin cushions are...,piecesincluded : |levelofassembly : none|seatingcapacity:3|cushioncoverclosu...,,,,arshleen patio sofa with cushions Patio Sofas
3032,20279,simge patio sofa with cushions,Patio Sofas,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,this sofa is a beautiful example of today ’ s casual contemporary styling . ...,armheight-floortoarm:25.19|cushioncovermaterial : olefin|cushioncoverclosure...,,,,simge patio sofa with cushions Patio Sofas
2120,9687,kendall sectional sofa with ottoman,Sectionals,Furniture / Living Room Furniture / Sectionals,this sectional has a simple but elegant contemporary look that will combo we...,ottomandepth-fronttoback:23|upholsterycolor : gray|orientation : left hand f...,10.0,4.0,9.0,kendall sectional sofa with ottoman Sectionals
2631,28692,yland patio sofa with cushions,Patio Sofas,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,"build a daybed for daydreaming , a lounge sofa for night-time entertaining ....",fullorlimitedwarranty : full|style : coastal|framematerial : wicker/rattan|s...,,,,yland patio sofa with cushions Patio Sofas
2736,27404,baeten patio sofa with cushions,,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,"5 piece cushioned patio set , which in modern designs , are constructed from...",dssecondaryproductstyle : transitional modern|piecesincluded:5|warrantylengt...,,,,baeten patio sofa with cushions
3371,14295,samuel 91 '' velvet flared arm sofa,Sofas,Furniture / Living Room Furniture / Sofas,surround yourself in elegance as you relax against the various upholsteries ...,dssecondaryproductstyle : contemporary glam|overallproductweight:131|pattern...,,,,samuel 91 '' velvet flared arm sofa Sofas
4768,41382,castilloux patio sofa with cushions,Patio Sofas,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,enjoy al fresco meals on your patio or porch with this seven-piece outdoor d...,supplierintendedandapproveduse : non residential use|framecolor : gray|seati...,12.0,4.0,11.0,castilloux patio sofa with cushions Patio Sofas
4741,18969,94 '' square arm sofa with reversible cushions,,Furniture / Living Room Furniture / Sofas,"this sofa is the biggest , deepest , softest , most comfortable one-piece so...",legcolor : black|seatdepthsd : extra deep ( over 35 '' ) |backfillmaterial :...,282.0,4.5,208.0,94 '' square arm sofa with reversible cushions
1668,22118,abrish patio sectional with cushions,Patio Sofas,Outdoor / Outdoor & Patio Furniture / Outdoor Seating & Patio Chairs / Patio...,,framedurability : rust resistant|cushioncoverclosuremethod : zipper|dsprimar...,6.0,4.0,6.0,abrish patio sectional with cushions Patio Sofas
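
The results above show which products were retrieved but not how similar each one is to the query. The sketch below illustrates the same normalize, dot product, and sort steps on hand-made 2-D vectors, and returns the scores together with the indices. The function `top_k_by_cosine`, the product names, and the vectors are all hypothetical and chosen only to make the ranking easy to follow.

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k most similar rows, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Hand-made 2-D "embeddings" (for illustration only)
doc_names = ["leather sofa", "patio sofa", "area rug", "kitchen sink"]
doc_vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [-0.5, 0.5]])
query_vec = np.array([1.0, 0.0])  # pretend this encodes "comfortable sofa"

idx, scores = top_k_by_cosine(query_vec, doc_vecs, k=2)
for rank, (i, s) in enumerate(zip(idx, scores), start=1):
    print(rank, doc_names[i], round(float(s), 3))
```

In the notebook, the same idea would mean attaching `similarities[top_indices]` as a score column to the returned DataFrame.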


### 4b. Evaluate and compare BM25 vs semantic search (10 pts)

Implement Recall@k and compare the two search methods.

In [93]:
# Implement Recall@k
def recall_at_k(relevant_items, retrieved_items, k=10):
    retrieved_at_k = retrieved_items[:k]
    relevant_set = set(relevant_items)
    hits = 0
    for item in retrieved_at_k:
        if item in relevant_set:
            hits += 1    
    return hits / len(relevant_set) if len(relevant_set) > 0 else 0
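
A small worked example makes the metric concrete. The sketch below uses a compact but equivalent version of the function and made-up product IDs: with four relevant products, finding two of them in the top 3 gives a recall of 0.5, and finding three in the top 5 gives 0.75.

```python
def recall_at_k(relevant_items, retrieved_items, k=10):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    relevant_set = set(relevant_items)
    if not relevant_set:
        return 0.0
    hits = sum(1 for item in retrieved_items[:k] if item in relevant_set)
    return hits / len(relevant_set)

relevant = [101, 102, 103, 104]         # hypothetical relevant product IDs
retrieved = [102, 999, 104, 555, 101]   # hypothetical ranked search results

r3 = recall_at_k(relevant, retrieved, k=3)  # hits in top 3: 102, 104
r5 = recall_at_k(relevant, retrieved, k=5)  # hits in top 5: 102, 104, 101
print("Recall@3:", r3)
print("Recall@5:", r5)
```

Note that when a query has more than `k` relevant products, Recall@k cannot reach 1.0 even for a perfect ranking.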

In [94]:
# Build BM25 index for comparison
# Filter queries to those with products in our sample
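
One possible way to approach the filtering step is to build a mapping from each query to the relevant products that actually made it into the 5,000-product sample, and to drop queries with nothing left. The sketch below uses plain tuples; the row layout `(query_id, product_id, label)` and the label values `"Exact"`, `"Partial"`, and `"Irrelevant"` are assumptions about the labels table, so check them against what `load_wands_labels()` returns before adapting this.

```python
from collections import defaultdict

# Hypothetical label rows: (query_id, product_id, label)
label_rows = [
    (0, 11, "Exact"), (0, 12, "Irrelevant"), (0, 13, "Exact"),
    (1, 21, "Partial"), (1, 22, "Exact"),
    (2, 31, "Exact"),
]
sample_product_ids = {11, 12, 22, 40}  # products present in the sample

def build_ground_truth(rows, sample_ids, relevant_labels=("Exact",)):
    """Map query_id -> set of relevant product_ids that are in the sample."""
    truth = defaultdict(set)
    for query_id, product_id, label in rows:
        if label in relevant_labels and product_id in sample_ids:
            truth[query_id].add(product_id)
    return dict(truth)

ground_truth = build_ground_truth(label_rows, sample_product_ids)
print(ground_truth)
```

Query 2 disappears because its only relevant product is not in the sample. Keeping such queries would drag Recall@10 toward zero for every method without telling you anything about search quality.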

In [95]:
# Evaluate both BM25 and semantic search on all queries
# Calculate Recall@10 for each method
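
A sketch of how the evaluation loop might be organized: write one `evaluate` function that accepts any search function returning ranked product IDs, then call it once per method so both are scored on exactly the same queries. Everything below is hypothetical scaffolding, including the two stand-in search functions; in the notebook they would wrap `search_products` and `semantic_search`.

```python
from statistics import mean

def recall_at_k(relevant_items, retrieved_items, k=10):
    """Fraction of relevant items found in the top-k retrieved items."""
    relevant_set = set(relevant_items)
    if not relevant_set:
        return 0.0
    hits = sum(1 for item in retrieved_items[:k] if item in relevant_set)
    return hits / len(relevant_set)

def evaluate(search_fn, queries, ground_truth, k=10):
    """Per-query Recall@k for a function mapping query text -> ranked product IDs."""
    return {
        query_id: recall_at_k(ground_truth[query_id], search_fn(text), k)
        for query_id, text in queries.items()
        if query_id in ground_truth
    }

# Hypothetical queries and relevance judgments
queries = {0: "comfortable sofa", 1: "star wars rug"}
ground_truth = {0: {11, 12}, 1: {21}}

def fake_bm25(text):
    # Stand-in for a BM25 search returning product IDs
    return [11, 99] if "sofa" in text else [98, 97]

def fake_semantic(text):
    # Stand-in for a semantic search returning product IDs
    return [11, 12] if "sofa" in text else [21, 96]

results = {
    "BM25": evaluate(fake_bm25, queries, ground_truth),
    "Semantic": evaluate(fake_semantic, queries, ground_truth),
}
for name, per_query in results.items():
    print(f"{name}: mean Recall@10 = {mean(per_query.values()):.2f}")
```

Keeping the per-query scores, not just the mean, makes it possible to inspect the individual queries where one method wins.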

In [96]:
# Visualize comparison


---

## Task 5: Compare Embedding Models (20 points)

### 5a. Embed products with two different models (10 pts)

Compare embeddings from:
- `BAAI/bge-base-en-v1.5`
- `sentence-transformers/all-mpnet-base-v2`

In [97]:
# Load the two embedding models
bge_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
mpnet_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")


In [98]:
# Embed products with both models
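
One way to structure this step is a helper that encodes the same texts with every model and keeps the results in a dictionary keyed by model name. The sketch below only assumes that each model exposes `.encode(texts)` and returns one row per text; `FakeEncoder` is a stand-in so the example runs without downloading weights. In the notebook, the dictionary would presumably hold `bge_model` and `mpnet_model`, and the texts would be `products_sample["combined_text"].tolist()`.

```python
import numpy as np

def embed_with_models(models: dict, texts: list[str]) -> dict:
    """Encode `texts` with every model; returns name -> (n_texts, dim) array."""
    return {name: np.asarray(model.encode(texts)) for name, model in models.items()}

class FakeEncoder:
    """Stand-in that mimics the `.encode(texts)` output shape of a real model."""

    def __init__(self, dim: int):
        self.dim = dim

    def encode(self, texts):
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), self.dim))

models = {"model_a": FakeEncoder(dim=8), "model_b": FakeEncoder(dim=16)}
embeddings_by_model = embed_with_models(models, ["oak table", "blue rug", "sofa"])
for name, emb in embeddings_by_model.items():
    print(name, emb.shape)
```

Vectors from different models live in different spaces, so each query must be encoded with the same model that produced the product embeddings it is compared against.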

### 5b. Compare search results between models (10 pts)

Evaluate both models on the same queries and analyze differences.

In [99]:
# Compare results for specific queries
test_queries = ["comfortable sofa", "star wars rug", "modern coffee table"]
# add more!
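
Beyond eyeballing the two result lists, a simple number to report per query is how much the top-k sets overlap. The sketch below computes the Jaccard overlap (intersection over union) between two models' results. The product IDs are invented for illustration, and `jaccard_overlap` is a hypothetical helper, not a required part of the task.

```python
def jaccard_overlap(results_a, results_b) -> float:
    """Jaccard overlap between two result lists, ignoring rank order."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Hypothetical top-5 product IDs returned by two models
top_model_a = {
    "comfortable sofa": [1, 2, 3, 4, 5],
    "star wars rug": [10, 11, 12, 13, 14],
}
top_model_b = {
    "comfortable sofa": [1, 2, 3, 9, 8],
    "star wars rug": [20, 21, 22, 23, 24],
}

overlaps = {q: jaccard_overlap(top_model_a[q], top_model_b[q]) for q in top_model_a}
for query, overlap in overlaps.items():
    print(f"{query!r}: overlap = {overlap:.2f}")
```

Queries with low overlap are the interesting ones to inspect by hand, since that is where the models disagree about what is relevant.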

In [100]:
# Visualize model comparison with a scatter plot
# X-axis: BGE Recall@10, Y-axis: MPNet Recall@10
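
A numeric summary can accompany the scatter plot: count how many queries fall above the diagonal (MPNet better), below it (BGE better), or on it (tie). The sketch below does this for made-up per-query Recall@10 values; `compare_recalls` is a hypothetical helper and the numbers are not real results.

```python
def compare_recalls(recall_x: dict, recall_y: dict, tol: float = 1e-9) -> dict:
    """Count queries where model Y beats, loses to, or ties model X."""
    counts = {"y_better": 0, "x_better": 0, "tie": 0}
    for query_id in recall_x.keys() & recall_y.keys():
        diff = recall_y[query_id] - recall_x[query_id]
        if diff > tol:
            counts["y_better"] += 1   # point above the diagonal
        elif diff < -tol:
            counts["x_better"] += 1   # point below the diagonal
        else:
            counts["tie"] += 1        # point on the diagonal
    return counts

# Hypothetical per-query Recall@10 values (x = BGE, y = MPNet)
bge_recall = {0: 0.50, 1: 0.20, 2: 0.00, 3: 1.00}
mpnet_recall = {0: 0.75, 1: 0.20, 2: 0.10, 3: 0.50}

summary = compare_recalls(bge_recall, mpnet_recall)
print(summary)
```

Only queries scored by both models are compared, which keeps the counts consistent with the points that would appear on the plot.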


---

## Task 6: Git Submission (5 points)

Submit your work using the Git workflow:

- [ ] Create a new branch called `homework-4`
- [ ] Commit your work with a meaningful message
- [ ] Push to GitHub
- [ ] Create a Pull Request
- [ ] Merge the PR to main
- [ ] Submit the `.ipynb` file on Blackboard

The TA will verify your submission by checking the merged PR on GitHub.