# Session 2: Word Vectors and Distributional Semantics

## From Discrete Symbols to Dense Representations

**Machine Learning Master - UPNA**  
**Academic Year 2025-2026**

**Estimated Duration:** 2-3 hours

---

## Table of Contents

1. [Introduction and Setup](#1-introduction-and-setup)
2. [The Problem: Representing Meaning](#2-the-problem-representing-meaning)
3. [Classical Approaches: Sparse Representations](#3-classical-approaches-sparse-representations)
4. [The Distributional Hypothesis](#4-the-distributional-hypothesis)
5. [Word2Vec: Learning Dense Embeddings](#5-word2vec-learning-dense-embeddings)
6. [Alternative Approaches: GloVe and FastText](#6-alternative-approaches-glove-and-fasttext)
7. [Evaluation Methods: Meaningful Assessment with Pretrained Embeddings](#7-evaluation-methods-meaningful-assessment-with-pretrained-embeddings)
8. [Practical Applications](#8-practical-applications)
9. [Limitations and Future Directions](#9-limitations-and-future-directions)
10. [Final Exercises](#10-final-exercises)

---

## Learning Objectives

By the end of this session, you should be able to:

- Explain the limitations of discrete word representations
- Understand the distributional hypothesis and its implications
- Implement and train Word2Vec models
- Compare different embedding approaches (Word2Vec, GloVe, FastText)
- Evaluate word embeddings using intrinsic and extrinsic methods
- Apply embeddings to downstream NLP tasks

---

## 1. Introduction and Setup

### 1.1 Required Libraries

In [None]:
# Install required packages (uncomment if needed)
# !pip install gensim nltk scikit-learn matplotlib seaborn numpy scipy

# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Word embeddings
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.fasttext import FastText

# NLP utilities
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

print(f"NumPy version: {np.__version__}")
print(f"Gensim version: {gensim.__version__}")

# Set random seed for reproducibility
np.random.seed(42)

### 1.2 Helper Functions

In [None]:
def plot_vectors_2d(words, vectors, method='PCA', title='Word Embeddings Visualization'):
    """
    Visualize word vectors in 2D space.

    Args:
        words: List of words
        vectors: Numpy array of word vectors
        method: Dimensionality reduction method ('PCA' or 'TSNE')
        title: Plot title
    """
    if method == 'PCA':
        from sklearn.decomposition import PCA
        reducer = PCA(n_components=2)
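    else:
        from sklearn.manifold import TSNE
        # t-SNE requires perplexity < number of samples (the default is 30),
        # so cap it for small word lists
        perplexity = min(30, max(1, len(words) - 1))
        reducer = TSNE(n_components=2, random_state=42, perplexity=perplexity)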

    vectors_2d = reducer.fit_transform(vectors)

    plt.figure(figsize=(12, 8))
    plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.6)

    for i, word in enumerate(words):
        plt.annotate(word, xy=(vectors_2d[i, 0], vectors_2d[i, 1]),
                    xytext=(5, 2), textcoords='offset points',
                    ha='left', fontsize=10)

    plt.title(title)
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

---

## 2. The Problem: Representing Meaning

### 2.1 Theory: Why Traditional Representations Fail

**Question 2.1:** Before we start coding, think about this:

- How would you represent the word "cat" to a computer?
- What information should be captured?
- How can we capture similarity between "cat" and "dog"?

_Write your thoughts here:_

```
[Your answer]
```

### Exercise 2.1: One-Hot Encoding Analysis

**Objective:** Understand the limitations of one-hot encoded vectors.

In [None]:
def create_one_hot_vectors(vocabulary):
    """
    Create one-hot encoded vectors for a vocabulary.

    Args:
        vocabulary (list): List of words

    Returns:
        dict: Dictionary mapping words to one-hot vectors

    Example:
        >>> vocab = ['cat', 'dog', 'mouse']
        >>> vectors = create_one_hot_vectors(vocab)
        >>> vectors['cat']
        array([1., 0., 0.])
    """
    vocab_size = len(vocabulary)
    one_hot_dict = {}

    # TODO: Complete this function
    # For each word in vocabulary, create a vector of zeros with length vocab_size
    # Set the appropriate index to 1
    # Hint: Use np.zeros() and enumerate()

    # YOUR CODE HERE

    return one_hot_dict

# Test the function
vocabulary = ['cat', 'dog', 'mouse', 'lion', 'tiger']
one_hot_vectors = create_one_hot_vectors(vocabulary)

print("One-hot vectors:")
for word, vector in one_hot_vectors.items():
    print(f"{word}: {vector}")

In [None]:
def compute_cosine_similarity(vec1, vec2):
    """
    Compute cosine similarity between two vectors.

    Args:
        vec1, vec2: numpy arrays

    Returns:
        float: cosine similarity

    Formula: cosine_sim = (a · b) / (||a|| ||b||)
    """
    # TODO: Complete this function
    # Compute cosine similarity using the formula: (a · b) / (||a|| ||b||)
    # Hint: Use np.dot() and np.linalg.norm()

    # YOUR CODE HERE

    pass

# Test similarity computation
cat_vec = one_hot_vectors['cat']
dog_vec = one_hot_vectors['dog']

cat_cat_similarity = compute_cosine_similarity(cat_vec, cat_vec)
cat_dog_similarity = compute_cosine_similarity(cat_vec, dog_vec)

print(f"\nSimilarity(cat, cat): {cat_cat_similarity:.4f}")
print(f"Similarity(cat, dog): {cat_dog_similarity:.4f}")
print(f"Similarity(cat, mouse): {compute_cosine_similarity(cat_vec, one_hot_vectors['mouse']):.4f}")

**Question 2.2:** What do you notice about the similarity between different words in one-hot encoding? Why is this a problem for capturing semantic relationships?

_Your Answer:_

```
[Write your observations here]

Expected observations:
- All different words have similarity of 0 (orthogonal vectors)
- This means "cat" is as similar to "dog" as it is to "mouse" or any other word
- No semantic information is captured
- All words are equally distant from each other in the vector space
```

### Exercise 2.2: Vocabulary Size Problem

In [None]:
def analyze_vocabulary_growth(text_samples):
    """
    Analyze how vocabulary size grows with corpus size.

    Args:
        text_samples (list): List of text samples

    Returns:
        list: Vocabulary sizes at different corpus sizes
    """
    vocabulary = set()
    vocab_sizes = []

    for text in text_samples:
        # TODO: Tokenize the text (simple split by whitespace)
        # Add words to vocabulary set
        # Record the vocabulary size

        # YOUR CODE HERE

        pass

    return vocab_sizes

# Example corpus (you can expand this)
sample_texts = [
    "the cat sat on the mat",
    "the dog played in the park",
    "a cat and a dog are friends",
    "the quick brown fox jumps over the lazy dog",
    "cats and dogs are common pets",
]

vocab_sizes = analyze_vocabulary_growth(sample_texts)

# Plot vocabulary growth
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(vocab_sizes) + 1), vocab_sizes, marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Sentences')
plt.ylabel('Vocabulary Size')
plt.title('Vocabulary Growth with Corpus Size')
plt.grid(True)
plt.show()

print(f"Final vocabulary size: {vocab_sizes[-1]}")
print(f"One-hot vector dimension: {vocab_sizes[-1]}")
print(f"Memory for 1000 words (float32): {vocab_sizes[-1] * 1000 * 4 / 1024:.2f} KB")

**Question 2.3:** What happens to the dimensionality as the vocabulary grows? What are the implications for:

- Memory usage
- Computational efficiency
- Sparsity of data

_Your Answer:_

```
[Write your answer here]
```

---

## 3. Classical Approaches: Sparse Representations

### 3.1 Theory: Bag of Words (BoW)

Before Word2Vec, researchers used count-based methods. The simplest is **Bag of Words**: represent a document as the sum of its one-hot word vectors (i.e., word counts).

**Properties:**

- Disregards grammar and word order
- Keeps multiplicity (counts)
- Still sparse and high-dimensional
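
The first two properties can be seen by counting tokens with Python's `collections.Counter`. This is a minimal sketch, not the matrix implementation asked for in the exercise below: the helper name `bag_of_words` and the two example sentences are assumptions made for illustration, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter

def bag_of_words(text):
    """Return token counts for a text (hypothetical helper, whitespace tokenization)."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order
bow_a = bag_of_words("the dog chased the cat")
bow_b = bag_of_words("the cat chased the dog")

print(bow_a)
print(bow_a == bow_b)   # word order is lost, so both bags compare equal
print(bow_a["the"])     # multiplicity is kept: "the" occurs twice
```

The two sentences mean different things, yet their bags are identical, which is exactly the information BoW throws away.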

### Exercise 3.1: Implementing Bag of Words

In [None]:
def create_bow_representation(documents, vocabulary=None):
    """
    Create Bag of Words representation for documents.

    Args:
        documents (list): List of documents (strings)
        vocabulary (list): Optional vocabulary to use

    Returns:
        np.array: BoW matrix (documents × vocabulary)
        list: vocabulary

    Example:
        >>> docs = ["I love NLP", "I love deep learning"]
        >>> bow_matrix, vocab = create_bow_representation(docs)
        >>> bow_matrix.shape
        (2, 5)  # 2 documents, 5 unique words
    """
    # Build vocabulary if not provided
    if vocabulary is None:
        vocabulary = set()
        for doc in documents:
            words = doc.lower().split()
            vocabulary.update(words)
        vocabulary = sorted(list(vocabulary))

    # Create word to index mapping
    word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}

    # TODO: Complete this function
    # Create a matrix of size (num_documents, vocab_size)
    # For each document, count the occurrences of each word

    # YOUR CODE HERE

    return bow_matrix, vocabulary

# Test the function
documents = [
    "I love natural language processing",
    "I love deep learning",
    "Natural language processing is amazing"
]

bow_matrix, vocab = create_bow_representation(documents)

print("Vocabulary:", vocab)
print("\nBag of Words matrix:")
print(bow_matrix)
print(f"\nMatrix shape: {bow_matrix.shape}")
print(f"Sparsity: {np.sum(bow_matrix == 0) / bow_matrix.size * 100:.2f}%")

### 3.2 Theory: TF-IDF

**TF-IDF** (Term Frequency - Inverse Document Frequency) weights terms by their importance, penalizing common words.

**Formula:**

- **Term Frequency (TF):** $\text{tf}(t,d) = \frac{\text{count}(t,d)}{\text{total\_words}(d)}$
- **Inverse Document Frequency (IDF):** $\text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$
- **TF-IDF:** $\text{tfidf}(t,d,D) = \text{tf}(t,d) \times \text{idf}(t, D)$
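
To make the formulas concrete, here is a small worked example in plain Python. It is a sketch under stated assumptions: the three-document corpus, the helper names `tf`, `idf` and `tfidf`, and the use of the natural logarithm are all choices made here (the formula above does not fix the log base), and `idf` assumes the term occurs in at least one document.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Relative frequency of `term` inside one document."""
    return tokens.count(term) / len(tokens)

def idf(term, all_tokens):
    """log(N / df); assumes `term` appears in at least one document."""
    df = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / df)

def tfidf(term, tokens, all_tokens):
    return tf(term, tokens) * idf(term, all_tokens)

# Compare a frequent function word with a rarer content word in document 1
for term in ("the", "cat"):
    print(term, round(tfidf(term, tokenized[0], tokenized), 4))
```

In the first document "the" occurs twice and "cat" once, yet "cat" receives the higher weight because it appears in only one of the three documents.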

### Exercise 3.2: Implementing TF-IDF

In [None]:
def compute_tf_idf(documents):
    """
    Compute TF-IDF representation for documents.

    Args:
        documents (list): List of documents (strings)

    Returns:
        np.array: TF-IDF matrix
        list: vocabulary
    """
    # First, get the BoW representation
    bow_matrix, vocabulary = create_bow_representation(documents)

    # TODO: Complete this function
    # 1. Compute Term Frequency (TF): normalize BoW by document length
    # 2. Compute Document Frequency (DF): number of documents containing each term
    # 3. Compute Inverse Document Frequency (IDF): idf(t) = log(N / df(t))
    # 4. Compute TF-IDF: tf_idf(t,d) = tf(t,d) * idf(t)

    # YOUR CODE HERE

    return tf_idf_matrix, vocabulary

# Test the function
test_documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are enemies",
    "the cat and the dog are friends"
]

tf_idf_matrix, vocab = compute_tf_idf(test_documents)

print("TF-IDF matrix:")
print(tf_idf_matrix)
print("\nTop weighted terms per document:")
for doc_idx in range(len(test_documents)):
    top_indices = np.argsort(tf_idf_matrix[doc_idx])[-3:][::-1]
    top_terms = [(vocab[idx], tf_idf_matrix[doc_idx, idx]) for idx in top_indices]
    print(f"Document {doc_idx + 1}: {top_terms}")

### Exercise 3.3: Comparing BoW and TF-IDF

In [None]:
def compare_bow_tfidf(documents, query_word):
    """
    Compare BoW and TF-IDF representations for a specific word.

    Args:
        documents (list): List of documents
        query_word (str): Word to analyze
    """
    bow_matrix, vocab = create_bow_representation(documents)
    tf_idf_matrix, _ = compute_tf_idf(documents)

    # TODO: Complete this function
    # Find the index of query_word in vocabulary
    # Print its BoW and TF-IDF weights across documents
    # Create a visualization comparing the two

    # YOUR CODE HERE

    pass

# Test with different words
compare_bow_tfidf(test_documents, "the")
compare_bow_tfidf(test_documents, "cat")

**Question 3.1:** Why does TF-IDF assign lower weights to common words like "the"? How does this help with information retrieval?

_Your Answer:_

```
[Write your answer here]
```

---

## 4. The Distributional Hypothesis

### 4.1 Theory: "You Shall Know a Word by the Company It Keeps"

**J.R. Firth (1957):** Words that occur in similar contexts have similar meanings.

**Mathematical Formulation:**
$$\text{If } C(w_i) \approx C(w_j) \quad \Rightarrow \quad \text{semantics}(w_i) \approx \text{semantics}(w_j)$$

Where:

- $C(w)$: Context of word $w$ (distribution of surrounding words)
- $\approx$: Similarity measure
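
As a rough illustration of the formulation above, $C(w)$ can be approximated by counting the words that appear near $w$, and $\approx$ by cosine similarity between those counts. The sketch below uses sparse `Counter` objects rather than a full matrix; the toy sentences and the helper names `context_counts` and `sparse_cosine` are assumptions made for this example.

```python
from collections import Counter
import math

def context_counts(sentences, target, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def sparse_cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

toy = [
    "i feed my cat every morning",
    "i feed my dog every morning",
    "i drive my car every morning",
]
cat, dog, car = (context_counts(toy, w) for w in ("cat", "dog", "car"))
print("cat vs dog:", sparse_cosine(cat, dog))
print("cat vs car:", sparse_cosine(cat, car))
```

Here "cat" and "dog" share all their context words while "cat" and "car" share only some, so the first similarity is higher, even though no vector was ever told what a cat is.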

This is **one of the most successful ideas of modern statistical NLP!**

### Exercise 4.1: Co-occurrence Matrix

In [None]:
def build_cooccurrence_matrix(corpus, window_size=2):
    """
    Build a word co-occurrence matrix from corpus.

    Args:
        corpus (list): List of sentences (strings)
        window_size (int): Context window size

    Returns:
        np.array: Co-occurrence matrix
        list: vocabulary

    Example:
        Given: "the cat sat" with window_size=1
        For center word "cat":
        - Context words: ["the", "sat"]
        - Increment co-occurrence counts for (cat, the) and (cat, sat)
    """
    # Build vocabulary
    vocabulary = set()
    for sentence in corpus:
        words = sentence.lower().split()
        vocabulary.update(words)
    vocabulary = sorted(list(vocabulary))
    word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}

    # TODO: Complete this function
    # Initialize co-occurrence matrix with zeros
    # For each sentence and each word position:
    #   - Get context words within window_size
    #   - Increment co-occurrence counts symmetrically

    # YOUR CODE HERE

    return cooc_matrix, vocabulary

# Test the function
corpus = [
    "the cat sat on the mat",
    "the dog played in the park",
    "a cat and a dog are friends"
]

cooc_matrix, vocab = build_cooccurrence_matrix(corpus, window_size=2)

print("Vocabulary:", vocab)
print("\nCo-occurrence matrix:")
print(cooc_matrix)

# Visualize the matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cooc_matrix, xticklabels=vocab, yticklabels=vocab,
            annot=True, fmt='g', cmap='YlOrRd')
plt.title('Word Co-occurrence Matrix')
plt.tight_layout()
plt.show()

### Exercise 4.2: Analyzing Context Similarity

In [None]:
def analyze_word_contexts(word, cooc_matrix, vocabulary, top_k=5):
    """
    Analyze and display the top k context words for a given word.

    Args:
        word (str): Target word
        cooc_matrix (np.array): Co-occurrence matrix
        vocabulary (list): Vocabulary
        top_k (int): Number of top context words to show
    """
    # TODO: Complete this function
    # Find the row for the target word
    # Sort by co-occurrence count
    # Display top k context words

    # YOUR CODE HERE

    pass

# Analyze contexts for different words
analyze_word_contexts("cat", cooc_matrix, vocab, top_k=5)
analyze_word_contexts("dog", cooc_matrix, vocab, top_k=5)

**Question 4.1:** Looking at the contexts of "cat" and "dog", do they share similar context words? What does this tell us about their semantic relationship?

_Your Answer:_

```
[Write your answer here]
```

### Exercise 4.3: Computing Similarity from Co-occurrence

In [None]:
def compute_context_similarity(word1, word2, cooc_matrix, vocabulary):
    """
    Compute similarity between two words based on their context vectors.

    Args:
        word1, word2 (str): Words to compare
        cooc_matrix (np.array): Co-occurrence matrix
        vocabulary (list): Vocabulary

    Returns:
        float: Cosine similarity between context vectors
    """
    # TODO: Complete this function
    # Get context vectors for both words (rows from co-occurrence matrix)
    # Compute cosine similarity between them

    # YOUR CODE HERE

    pass

# Test similarity computation
print("Context-based similarities:")
print(f"Similarity(cat, dog): {compute_context_similarity('cat', 'dog', cooc_matrix, vocab):.4f}")
print(f"Similarity(cat, mat): {compute_context_similarity('cat', 'mat', cooc_matrix, vocab):.4f}")
print(f"Similarity(dog, park): {compute_context_similarity('dog', 'park', cooc_matrix, vocab):.4f}")

**Question 4.2:** Compare these context-based similarities with the one-hot similarities from Exercise 2.1. What's the key difference?

_Your Answer:_

```
[Write your answer here]
```

---

## 5. Word2Vec: Learning Dense Embeddings

### 5.1 Theory: The Revolution

**Word2Vec (Mikolov et al. 2013)** learns embeddings by predicting context words from center words (or vice versa).

**Key Idea:**

1. Start with a large corpus of text
2. Represent every word in the vocabulary by a vector
3. Go through each position $t$ in the text, which has a center word $c$ and context words $o$
4. Use the similarity of the vectors for $c$ and $o$ to calculate the probability of $o$ given $c$
5. Keep adjusting the vectors to maximize this probability

**Two Variants:**

- **Skip-gram:** Predict context words given center word
- **CBOW:** Predict center word given context words

**Objective Function (Skip-gram):**
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j} | w_t; \theta)$$

**Prediction Function:**
$$P(o | c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$$

Where:

- $v_c$: center word vector
- $u_o$: context word vector
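
The prediction function is a softmax over dot products. The following sketch evaluates $P(o \mid c)$ for a three-word vocabulary in plain Python. The 2-dimensional vectors are hand-picked for illustration, not learned, and the names `context_vectors`, `v_center` and `skipgram_softmax` are assumptions of this example rather than part of any library.

```python
import math

# Hypothetical 2-d vectors for a 3-word vocabulary (hand-picked, not learned)
context_vectors = {            # u_w: one context vector per vocabulary word
    "cat": [1.0, 0.5],
    "dog": [0.8, 0.7],
    "car": [-1.0, 0.2],
}
v_center = [0.9, 0.6]          # v_c: vector of the current center word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skipgram_softmax(v_c, U):
    """Return P(o | c) for every candidate context word o."""
    scores = {w: dot(u, v_c) for w, u in U.items()}
    max_score = max(scores.values())          # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = skipgram_softmax(v_center, context_vectors)
for word, p in probs.items():
    print(f"P({word} | center) = {p:.3f}")
```

Note that the denominator sums over the whole vocabulary. With three words this is trivial, but with a realistic vocabulary it becomes the expensive part of the computation.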

### Exercise 5.1: Training Your First Word2Vec Model

In [None]:
# Prepare training corpus
training_corpus = [
    "natural language processing is amazing",
    "deep learning revolutionized nlp",
    "word embeddings capture semantic meaning",
    "neural networks learn representations",
    "transformers are powerful models",
    "bert and gpt are popular",
    "language models generate text",
    "semantic similarity measures relatedness",
    "vector space models work well",
    "word2vec learns from context"
]

# TODO: Expand this corpus with more sentences
# Add at least 20 more sentences on topics like:
# - Machine learning
# - Artificial intelligence
# - Natural language processing
# - Deep learning

# YOUR SENTENCES HERE
expanded_corpus = training_corpus + [
    # Add your sentences here
]

# Tokenize corpus
def tokenize_corpus(corpus):
    """Simple tokenization: lowercase and split."""
    return [sentence.lower().split() for sentence in corpus]

tokenized_corpus = tokenize_corpus(expanded_corpus)

print(f"Corpus size: {len(tokenized_corpus)} sentences")
print(f"First 3 tokenized sentences:")
for sent in tokenized_corpus[:3]:
    print(f"  {sent}")

In [None]:
# Train Word2Vec model
from gensim.models import Word2Vec

# TODO: Complete the Word2Vec training
# Set appropriate hyperparameters:
# - vector_size: embedding dimension (try 100)
# - window: context window size (try 5)
# - min_count: minimum word frequency (try 1 for small corpus)
# - workers: number of threads (try 4)
# - sg: 0 for CBOW, 1 for Skip-gram (try 1)

model_w2v = Word2Vec(
    sentences=tokenized_corpus,
    # YOUR HYPERPARAMETERS HERE
)

print(f"\nModel trained!")
print(f"Vocabulary size: {len(model_w2v.wv)}")
print(f"Vector size: {model_w2v.wv.vector_size}")

### Exercise 5.2: Exploring Learned Embeddings

In [None]:
# TODO: Complete this function
def explore_word_embedding(word, model, top_n=5):
    """
    Explore the embedding learned for a specific word.

    Args:
        word (str): Word to explore
        model: Trained Word2Vec model
        top_n (int): Number of similar words to show
    """
    if word not in model.wv:
        print(f"Word '{word}' not in vocabulary")
        return

    # Get vector
    vector = model.wv[word]
    print(f"\nWord: '{word}'")
    print(f"Vector shape: {vector.shape}")
    print(f"Vector (first 10 dimensions): {vector[:10]}")

    # Find most similar words
    # YOUR CODE HERE - use model.wv.most_similar()

    pass

# Test with different words
explore_word_embedding("language", model_w2v)
explore_word_embedding("learning", model_w2v)
explore_word_embedding("model", model_w2v)

### Exercise 5.3: Vector Arithmetic and Analogies

In [None]:
def solve_analogy(model, a, b, c, top_n=1):
    """
    Solve word analogy: a is to b as c is to ?

    Example: king is to queen as man is to woman
    Formula: result = vector(b) - vector(a) + vector(c)

    Args:
        model: Trained Word2Vec model
        a, b, c (str): Analogy words
        top_n (int): Number of results to return

    Returns:
        list: Most similar words to the result
    """
    # TODO: Complete this function
    # Use model.wv.most_similar() with positive and negative arguments
    # positive=[b, c], negative=[a]

    # YOUR CODE HERE

    pass

# Test with analogies (these might not work well with small corpus)
print("Testing analogies:")
print("deep : learning :: natural : ?")
result = solve_analogy(model_w2v, "deep", "learning", "natural")
print(f"Result: {result}\n")

# TODO: Try creating your own analogies
# Examples:
# - "neural : network :: word : ?"
# - "bert : transformer :: word2vec : ?"

**Question 5.1:** Why might analogies not work well with our small training corpus? What would we need to improve them?

_Your Answer:_

```
[Write your answer here]
```

### Exercise 5.4: Visualizing Embeddings

In [None]:
def visualize_embeddings(model, words=None, n_words=20):
    """
    Visualize word embeddings in 2D space using PCA.

    Args:
        model: Trained Word2Vec model
        words (list): Specific words to visualize (optional)
        n_words (int): Number of words to visualize if words=None
    """
    if words is None:
        # Select most frequent words
        # index_to_key is a list of words sorted by descending frequency
        words = list(model.wv.index_to_key[:n_words])

    # TODO: Complete this function
    # 1. Get vectors for selected words
    # 2. Apply PCA to reduce to 2 dimensions
    # 3. Create a scatter plot with word labels

    # YOUR CODE HERE

    pass

# Visualize embeddings
visualize_embeddings(model_w2v, n_words=20)

# Visualize specific semantic groups
semantic_groups = [
    "language", "natural", "processing",
    "learning", "deep", "neural",
    "model", "network", "transformer"
]
visualize_embeddings(model_w2v, words=semantic_groups)

---

## 6. Alternative Approaches: GloVe and FastText

### 6.1 Theory: GloVe (Global Vectors)

**GloVe (Pennington et al. 2014)** combines:

- **Count-based methods:** Uses global co-occurrence statistics
- **Predictive methods:** Learns embeddings via optimization

**Key Insight:** Ratios of co-occurrence probabilities reveal semantic relationships.

**Objective Function:**
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2$$

Where:

- $X_{ij}$: Co-occurrence count of words $i$ and $j$
- $f(X_{ij})$: Weighting function (prevents common words from dominating)
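
A sketch of the weighting function and of a single term of $J$ may help. It assumes the form of $f$ given in the GloVe paper, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with the paper's reported values $x_{\max} = 100$ and $\alpha = 0.75$. The function names `glove_weight` and `glove_term` are made up for this example.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_j, b_i, b_j, x_ij):
    """One (i, j) term of J; only defined for x_ij > 0 because of log(x_ij)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Rare pairs get small weights; very frequent pairs are capped at 1
for count in (1, 10, 100, 10_000):
    print(count, round(glove_weight(count), 4))
```

Because $f(0) = 0$ under this definition, pairs that never co-occur contribute nothing, so in practice the sum only needs to run over the non-zero entries of $X$.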

### 6.2 Theory: FastText

**FastText (Bojanowski et al. 2017)** extends Word2Vec with **subword information**.

**Key Innovation:** Represent words as sums of character n-grams.

**Example:** word "where" with n=3

- N-grams: `<wh`, `whe`, `her`, `ere`, `re>`, `<where>`
- Vector: $v_{where} = \sum_{g \in \text{n-grams}} v_g$
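
The n-gram extraction in this example can be sketched in a few lines. This is an illustrative helper, not gensim's internal implementation: the name `char_ngrams` is an assumption, and so is the default of `min_n = max_n = 3`, chosen here only to match the example above.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Character n-grams of `word` with boundary markers, plus the whole-word token."""
    marked = f"<{word}>"
    grams = [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]
    if marked not in grams:
        grams.append(marked)   # special sequence for the full word
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# Related word forms share n-grams, which is what lets unseen words get a vector
shared = set(char_ngrams("run")) & set(char_ngrams("running"))
print(sorted(shared))
```

The boundary markers matter: the trigram `her` inside "where" is a different unit from the whole word `<her>`.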

**Advantages:**

- Handles **out-of-vocabulary (OOV)** words
- Captures morphological information
- Better for morphologically rich languages

### Exercise 6.1: Training FastText

In [None]:
from gensim.models import FastText

# TODO: Train a FastText model on the same corpus
# Use similar hyperparameters as Word2Vec
# Additional FastText parameters:
# - min_n: minimum length of character n-grams (try 3)
# - max_n: maximum length of character n-grams (try 6)

model_ft = FastText(
    sentences=tokenized_corpus,
    # YOUR HYPERPARAMETERS HERE
)

print(f"\nFastText model trained!")
print(f"Vocabulary size: {len(model_ft.wv)}")
print(f"Vector size: {model_ft.wv.vector_size}")

### Exercise 6.2: Testing OOV Handling

In [None]:
def test_oov_handling(word, model_w2v, model_ft):
    """
    Test how Word2Vec and FastText handle out-of-vocabulary words.

    Args:
        word (str): Word to test (should not be in training vocabulary)
        model_w2v: Trained Word2Vec model
        model_ft: Trained FastText model
    """
    print(f"\nTesting OOV word: '{word}'")

    # Word2Vec
    # TODO: Check if word is in Word2Vec vocabulary
    # If not, print that it's OOV

    # YOUR CODE HERE

    # FastText
    # TODO: Try to get vector from FastText (should work via n-grams)
    # Print the vector shape and find similar words

    # YOUR CODE HERE

    pass

# Test with OOV words (words not in training corpus)
test_words = [
    "running",  # If "run" was in training
    "computation",  # If "compute" was in training
    "networking",  # If "network" was in training
]

for word in test_words:
    test_oov_handling(word, model_w2v, model_ft)

**Question 6.1:** How does FastText generate vectors for OOV words? What are the advantages and disadvantages of this approach?

_Your Answer:_

```
[Write your answer here]
```

---

## 7. Evaluation Methods: Meaningful Assessment with Pretrained Embeddings

> **Critical Insight**: Our custom-trained Word2Vec model (from Section 5) was trained on only ~30 sentences, far too few to learn meaningful semantic relationships. Real embeddings require **massive corpora** (millions to billions of tokens) to capture the geometric regularities that make embeddings useful. In this section, we'll evaluate **pretrained embeddings** trained on large-scale data to see how embeddings _actually work in practice_.

### 7.1 Why Small-Corpus Models Fail Evaluation

Before we proceed, let's understand why our custom model gives poor results:

| Evaluation Metric    | Custom Model (30 sentences) | Production Model (100B+ tokens) | Why the Difference?                              |
| -------------------- | --------------------------- | ------------------------------- | ------------------------------------------------ |
| Vocabulary coverage  | ~50 words                   | 3M+ words                       | Missing rare words breaks analogy tasks          |
| Context diversity    | Limited patterns            | Billions of diverse contexts    | Insufficient statistics for robust relationships |
| Vector stability     | High variance               | Converged representations       | Undertrained vectors lack geometric regularity   |
| Semantic granularity | Crude clusters              | Fine-grained distinctions       | Dimensionality requires massive data to populate |

> 💡 **Key Takeaway**: _Never evaluate embedding quality on toy corpora._ The distributional hypothesis requires observing words across _thousands_ of diverse contexts to extract reliable semantic signals. Our small corpus illustrates the _mechanism_ of embedding learning, not its _capability_.

#### Exercise 7.1: Loading Production-Quality Pretrained Embeddings

In [None]:
import gensim.downloader as api
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# TODO: Load Google News Word2Vec embeddings (300 dimensions, 3M words)
# This model was trained on ~100 billion tokens from Google News articles
# Hint: Use api.load('word2vec-google-news-300')
print("📥 Downloading pretrained Word2Vec (Google News)...")
# YOUR CODE HERE
_________________________

print(f"✓ Successfully loaded {len(w2v_google):,} words with {w2v_google.vector_size} dimensions\n")

# TODO: Perform sanity checks to verify model quality
# 1. Check the analogy: 'king' - 'man' + 'woman' ≈ ?
# 2. Compute similarity between 'computer' and 'keyboard'
# 3. Compute similarity between 'computer' and 'car'
# Hint: Use w2v_google.most_similar() and w2v_google.similarity()

# YOUR CODE HERE
print("🔍 Sanity checks:")
result = _________________________
print(f"   • 'king' - 'man' + 'woman' ≈ {result[0][0]}")
print(f"   • Similarity('computer', 'keyboard') = {_________________________:.4f}")
print(f"   • Similarity('computer', 'car') = {_________________________:.4f}")

**Question 7.1**: Compare these results with what you observed from your custom-trained model. What differences do you notice? Why do you think pretrained embeddings perform better?

_Your Answer:_

```
[Write your observations here]
```

### 7.2 Intrinsic Evaluation I: Word Similarity Benchmarks

#### Theory: Measuring Alignment with Human Judgment

Word similarity tasks evaluate whether cosine similarity between word vectors correlates with human-rated semantic similarity. Standard benchmarks include WordSim-353 and SimLex-999.
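
Spearman correlation compares rankings rather than raw values, which is why it suits this task: cosine similarities and human ratings live on different scales. The sketch below implements it from scratch in plain Python to show the mechanics; in the exercise, `scipy.stats.spearmanr` does the same job. The helper names and the five score pairs are made up for illustration and are not real benchmark or model outputs.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up numbers for illustration only
human = [9.4, 8.6, 7.4, 2.6, 1.8]          # ratings on a 0-10 scale
model = [0.58, 0.65, 0.76, 0.31, 0.22]     # cosine similarities
print(round(spearman(model, human), 3))
```

The model in this toy example separates similar from dissimilar pairs correctly but orders the three similar pairs in reverse, so the correlation is positive yet well below 1.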

#### Exercise 7.2: Implementing Word Similarity Evaluation

In [None]:
# Curated similarity benchmark reflecting human judgments
similarity_benchmark = [
    # High similarity pairs
    ("cat", "dog", 7.42),
    ("car", "automobile", 9.44),
    ("king", "queen", 8.58),
    # Moderate similarity (functional relationships)
    ("book", "paper", 6.48),
    ("computer", "keyboard", 7.62),
    # Low similarity (dissimilar concepts)
    ("cat", "car", 1.82),
    # Antonyms (should have low similarity)
    ("happy", "sad", 2.62),
]

def evaluate_similarity(model, benchmark):
    """
    Evaluate embeddings against human similarity judgments using Spearman correlation.

    Args:
        model: Pretrained embedding model
        benchmark: List of (word1, word2, human_score) tuples

    Returns:
        float: Spearman correlation coefficient
        list: Results for valid word pairs
        list: Missing word pairs
    """
    model_sims, human_sims = [], []
    missing_words = []

    # TODO: Complete this function
    # For each word pair in benchmark:
    #   1. Check if both words exist in model vocabulary
    #   2. If yes: compute cosine similarity and append to model_sims/human_sims
    #   3. If no: append to missing_words list
    # Hint: Use model.similarity(word1, word2)

    # YOUR CODE HERE
    for w1, w2, human_score in benchmark:
        if _________________________:
            model_sim = _________________________
            model_sims.append(model_sim)
            human_sims.append(human_score)
        else:
            _________________________

    # Compute Spearman correlation if we have enough data
    if len(model_sims) < 2:
        return 0.0, [], missing_words

    corr, p_val = spearmanr(model_sims, human_sims)
    results = list(zip(
        [f"{w1}-{w2}" for w1, w2, _ in benchmark if w1 in model and w2 in model],
        model_sims,
        human_sims
    ))

    return corr, results, missing_words

# Evaluate pretrained embeddings
correlation, results, missing = evaluate_similarity(w2v_google, similarity_benchmark)

print(f"📊 Word Similarity Evaluation (Spearman ρ = {correlation:.4f})")
print("="*75)
print(f"{'Word Pair':<22} {'Model Similarity':<20} {'Human Rating':<15}")
print("-"*75)

# TODO: Print results table with alignment indicator
# For each result, compute absolute difference between scaled model similarity and human rating
# Add visual indicator: "✓✓" if diff < 1.5, "✓" if diff < 3.0, "✗" otherwise

# YOUR CODE HERE
for pair, model_sim, human_sim in results:
    diff = _________________________
    indicator = _________________________
    print(f"{pair:<22} {model_sim:<20.4f} {human_sim:<15.2f} {indicator}")

if missing:
    print(f"\n⚠️  Missing words in vocabulary: {missing}")

**Question 7.2**: What does a Spearman correlation of 0.8+ indicate about the quality of embeddings? Why don't we expect perfect correlation (1.0) with human judgments?

_Your Answer:_

```
[Write your answer here]
```

### 7.3 Intrinsic Evaluation II: Semantic Analogies

#### Theory: Testing Relational Reasoning

Analogies evaluate whether vector offsets consistently represent semantic relationships:

```
king : queen :: man : woman     →  king - man + woman ≈ queen
France : Paris :: Japan : Tokyo →  France - Paris + Tokyo ≈ Japan
```
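
The offset arithmetic above can be tried on a tiny hand-made vocabulary before running it on real embeddings. The sketch below is only an illustration: the 2-D vectors in `toy_vectors` are invented (one axis loosely standing for "royalty", the other for "gender"), and `solve_analogy` is a hypothetical helper, not a gensim function. It follows the convention `d' = b - a + c` for `a : b :: c : d` and excludes the query words from the candidates, which is the usual practice.

```python
import math

# Hypothetical 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """For a : b :: c : ?, return the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, otherwise one of them often wins trivially.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# man : woman :: king : ?
answer = solve_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

With real embeddings the target vector rarely lands exactly on a word, which is why the exercise below accepts the expected answer anywhere in the top-n neighbours.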

#### Exercise 7.3: Implementing Analogy Evaluation

In [None]:
# Representative analogy tasks
analogy_tasks = {
    "🌍 Geography": [
        ("france", "paris", "germany", "berlin"),
        ("japan", "tokyo", "china", "beijing"),
    ],
    "👥 Family Relationships": [
        ("brother", "sister", "uncle", "aunt"),
        ("son", "daughter", "nephew", "niece"),
    ],
    "⚖️ Gender": [
        ("man", "woman", "king", "queen"),
        ("actor", "actress", "prince", "princess"),
    ],
}

def evaluate_analogies(model, tasks, topn=4):
    """
    Evaluate analogy solving accuracy.

    Strategy: For a:b :: c:d, compute d' = b - a + c
    Count as correct if true answer d appears in top-n predictions

    Args:
        model: Pretrained embedding model
        tasks: Dictionary of category → list of (a,b,c,d) tuples
        topn: Number of predictions to consider for correctness

    Returns:
        dict: Category-wise results (correct, total)
        float: Overall accuracy
    """
    category_results = {}
    total_correct, total_count = 0, 0

    # TODO: Complete this function
    # For each category and analogy task:
    #   1. Skip if any word missing from vocabulary
    #   2. Solve analogy using model.most_similar(positive=[b,c], negative=[a])
    #   3. Check if expected answer appears in top-n predictions
    #   4. Track correct/total counts per category and overall

    # YOUR CODE HERE
    for category, analogies in tasks.items():
        correct = 0
        valid = 0

        for a, b, c, d in analogies:
            if _________________________:
                continue

            valid += 1
            try:
                predicted = _________________________
                predicted_words = _________________________

                if _________________________:
                    correct += 1
                    total_correct += 1
                total_count += 1
            except Exception as e:
                continue

        if valid > 0:
            category_results[category] = (correct, valid)

    overall_acc = total_correct / total_count if total_count > 0 else 0
    return category_results, overall_acc

# Evaluate pretrained model
results, accuracy = evaluate_analogies(w2v_google, analogy_tasks)

print("🧠 Analogy Reasoning Evaluation")
print("="*75)
print(f"{'Category':<25} {'Correct':<12} {'Total':<10} {'Accuracy':<12}")
print("-"*75)

# TODO: Print results with visual progress bars
# For each category, print accuracy with bar visualization (e.g., "██████████" for 50%)

# YOUR CODE HERE
for category, (correct_cat, total_cat) in results.items():
    acc = _________________________
    bar = _________________________
    print(f"{category:<25} {correct_cat:<12} {total_cat:<10} {acc:>7.1%} {bar}")

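print("-"*75)
# total_correct/total_count are local to evaluate_analogies, so recompute them from results
total_correct = sum(correct_cat for correct_cat, _ in results.values())
total_count = sum(total_cat for _, total_cat in results.values())
print(f"{'OVERALL':<25} {total_correct:<12} {total_count:<10} {accuracy:>7.2%}")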

**Question 7.3**: Why might syntactic analogies (e.g., verb tenses, plurals) be harder for embeddings to capture than semantic analogies (e.g., country-capital pairs)? What does this tell us about the nature of distributional learning?

_Your Answer:_

```
[Write your answer here]
```

### 7.4 Extrinsic Evaluation: Downstream Task Performance

#### Theory: The Ultimate Test: Does It Help Real Applications?

Intrinsic metrics don't guarantee utility for actual tasks. **Extrinsic evaluation** measures performance on downstream applications like text classification.
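
A miniature version of an extrinsic evaluation may make the idea concrete. The sketch below is an assumption-laden toy, not the pipeline used in the exercise: the 2-D `toy_embeddings` are invented, documents are represented by averaging their in-vocabulary word vectors, and a nearest-centroid rule stands in for a real classifier. The point is only that the embeddings are judged by the accuracy of a downstream task, not by similarity scores.

```python
import math

# Hypothetical toy embeddings (assumption): axis 0 ~ baseball, axis 1 ~ hockey.
toy_embeddings = {
    "pitcher": [0.9, 0.1], "inning": [0.8, 0.2], "homerun": [0.95, 0.05],
    "goalie": [0.1, 0.9], "puck": [0.05, 0.95], "rink": [0.2, 0.8],
    "game": [0.5, 0.5],
}

def mean_pool(text, embeddings):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

train = [("pitcher inning game", "baseball"), ("homerun inning", "baseball"),
         ("goalie puck game", "hockey"), ("rink puck", "hockey")]
test = [("the pitcher hit a homerun", "baseball"), ("goalie on the rink", "hockey")]

# One centroid per class: the mean of that class's pooled training documents.
centroids = {}
for label in {lbl for _, lbl in train}:
    pooled = [mean_pool(text, toy_embeddings) for text, lbl in train if lbl == label]
    centroids[label] = [sum(dim) / len(pooled) for dim in zip(*pooled)]

def predict(text):
    """Assign the label whose centroid is most similar to the pooled document.

    Assumes the document contains at least one in-vocabulary token.
    """
    vec = mean_pool(text, toy_embeddings)
    return max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))

# The extrinsic metric: accuracy on held-out documents.
accuracy = sum(predict(text) == label for text, label in test) / len(test)
print(f"Toy downstream accuracy: {accuracy:.0%}")
```

Swapping in a different set of embeddings and comparing the resulting accuracy is the essence of extrinsic evaluation.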

#### Exercise 7.4: Text Classification with Pretrained Embeddings

In [None]:
from sklearn.datasets import fetch_20newsgroups

# Load 20 Newsgroups dataset: Baseball vs Hockey classification
categories = ['rec.sport.baseball', 'rec.sport.hockey']
dataset = fetch_20newsgroups(subset='train', categories=categories,
                           remove=('headers', 'footers', 'quotes'))

# TODO: Implement document vectorization via mean pooling
def doc_to_vec(text, model):
    """
    Convert document to vector by averaging in-vocabulary word embeddings.

    Args:
        text (str): Document text
        model: Pretrained embedding model

    Returns:
        np.array: Document vector (size = model.vector_size)
    """
    # YOUR CODE HERE
    words = _________________________
    if not words:
        return _________________________
    return _________________________

# Prepare features and labels
X = np.array([doc_to_vec(doc, w2v_google) for doc in dataset.data])
y = dataset.target

# TODO: Split data and train classifier
# 1. Use train_test_split with test_size=0.3, random_state=42, stratify=y
# 2. Train LogisticRegression with max_iter=1000
# 3. Evaluate on test set and print classification report

# YOUR CODE HERE
X_train, X_test, y_train, y_test = _________________________

clf = _________________________
clf.fit(X_train, y_train)
y_pred = _________________________

print("⚾🏒 Text Classification: Baseball vs Hockey Articles")
print("="*75)
print(classification_report(y_test, y_pred,
                          target_names=['Baseball', 'Hockey']))
print(f"✅ Test Accuracy: {accuracy_score(y_test, y_pred):.2%}")

**Question 7.4**: Why do pretrained embeddings achieve high accuracy (>90%) on this task with minimal training? How would performance differ if we used one-hot vectors or random embeddings instead?

_Your Answer:_

```
[Write your answer here]
```

---

## 8. Practical Applications with Production Embeddings

### 8.1 Semantic Search Engine (Beyond Keyword Matching)

Traditional search relies on exact keyword matches. Semantic search retrieves documents based on _conceptual relevance_.
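
The contrast between the two retrieval styles can be seen on a two-document toy corpus. This is a sketch under stated assumptions: the vectors in `toy_embeddings` are invented, `mean_pool` and `cosine` are illustrative helpers, and the query deliberately shares no token with either document.

```python
import math

# Hypothetical toy embeddings (assumption): axis 0 ~ optimization, axis 1 ~ language.
toy_embeddings = {
    "gradient": [0.9, 0.1], "descent": [0.8, 0.1], "loss": [0.85, 0.2],
    "optimization": [0.95, 0.05], "algorithms": [0.7, 0.3],
    "word": [0.1, 0.9], "embeddings": [0.2, 0.9], "context": [0.15, 0.8],
}

toy_docs = ["gradient descent minimizes loss", "word embeddings capture context"]
query = "optimization algorithms"

def mean_pool(text):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    vecs = [toy_embeddings[t] for t in text.lower().split() if t in toy_embeddings]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)] if vecs else None

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Keyword search: number of tokens shared by query and document.
keyword_scores = [len(set(query.split()) & set(d.split())) for d in toy_docs]

# Semantic search: cosine similarity between pooled query and document vectors.
q_vec = mean_pool(query)
semantic_scores = [cosine(q_vec, mean_pool(d)) for d in toy_docs]

best = toy_docs[max(range(len(toy_docs)), key=lambda i: semantic_scores[i])]
print("Keyword overlap:", keyword_scores)
print("Semantic scores:", [round(s, 3) for s in semantic_scores])
print("Best semantic match:", best)
```

Keyword overlap cannot rank the two documents because neither shares a token with the query, while the pooled vectors still separate them.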

#### Exercise 8.1: Building a Semantic Search Engine

In [None]:
class SemanticSearchEngine:
    """Search documents using semantic similarity (not keyword matching)."""

    def __init__(self, model, documents):
        """
        Initialize search engine with embeddings model and document corpus.

        Args:
            model: Pretrained embedding model
            documents: List of document strings
        """
        self.model = model
        self.documents = documents

        # TODO: Create document vectors using doc_to_vec function
        # Store in self.doc_vectors as numpy array

        # YOUR CODE HERE
        self.doc_vectors = _________________________

    def search(self, query, top_k=3):
        """
        Return top-k documents most semantically similar to query.

        Args:
            query (str): Search query
            top_k (int): Number of results to return

        Returns:
            list: Top k (similarity_score, document) tuples
        """
        # TODO: Complete search implementation
        # 1. Convert query to vector using doc_to_vec
        # 2. Compute cosine similarities between query and all documents
        #    Hint: Use np.dot and normalize by vector norms
        # 3. Return top-k results sorted by similarity

        # YOUR CODE HERE
        q_vec = _________________________
        if _________________________:
            return []

        sims = _________________________

        top_idx = _________________________
        return [(sims[i], self.documents[i]) for i in top_idx]

# Technical document corpus
docs = [
    "Neural networks learn hierarchical representations from data",
    "Transformers use self-attention to process sequential data",
    "Word2Vec learns word embeddings through context prediction",
    "Convolutional networks excel at spatial feature extraction",
    "Reinforcement learning optimizes actions through environmental rewards",
    "Gradient descent minimizes loss functions by following negative gradients"
]

# Initialize search engine
search = SemanticSearchEngine(w2v_google, docs)

# Test semantic queries (note: NO exact keyword matches required)
queries = [
    "language understanding models",   # Should retrieve transformers/Word2Vec docs
    "optimization algorithms",         # Should retrieve gradient descent doc
    "deep learning architectures"      # Should retrieve CNN/neural network docs
]

print("🔍 Semantic Search Results (No Keyword Matching Required)")
print("="*80)
for q in queries:
    print(f"\n	Query: '{q}'")
    results = search.search(q, top_k=2)

    # TODO: Print results with conceptual match indicator
    # If query words don't appear in document → "✅ Conceptual match"
    # Otherwise → "⚠️  Keyword match"

    # YOUR CODE HERE
    for rank, (score, doc) in enumerate(results, 1):
        match_type = _________________________
        print(f"  {rank}. [{score:.3f}] {match_type}")
        print(f"     → {doc}")

**Question 8.1**: What business applications could benefit from semantic search instead of traditional keyword search? Describe one concrete example.

_Your Answer:_

```
[Write your answer here]
```

### 8.2 Bias Auditing: Critical for Ethical Deployment

Embeddings encode societal biases present in training data. We must audit before deployment.
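
One common way to audit this is to project word vectors onto a "gender direction". The sketch below shows the geometry on invented 2-D vectors, where axis 0 is assumed to carry gender information. The names `gender_direction` and `bias_score` are hypothetical helpers, and averaging difference vectors is a simplification of the direction-finding step in the literature.

```python
import math

# Hypothetical toy vectors (assumption): axis 0 ~ gender, axis 1 ~ everything else.
toy_vectors = {
    "he": [1.0, 0.2], "she": [-1.0, 0.2],
    "man": [0.9, 0.4], "woman": [-0.9, 0.4],
    "engineer": [0.3, 0.9], "nurse": [-0.4, 0.8], "table": [0.0, 1.0],
}

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def gender_direction(pairs, vectors):
    """Unit vector obtained by averaging the (male - female) difference vectors."""
    diffs = [[m - f for m, f in zip(vectors[male], vectors[female])]
             for male, female in pairs]
    mean = [sum(dim) / len(diffs) for dim in zip(*diffs)]
    length = norm(mean)
    return [x / length for x in mean]

def bias_score(word, direction, vectors):
    """Cosine between the word vector and the (unit-length) gender direction."""
    vec = vectors[word]
    return sum(a * b for a, b in zip(vec, direction)) / norm(vec)

direction = gender_direction([("he", "she"), ("man", "woman")], toy_vectors)
scores = {w: bias_score(w, direction, toy_vectors) for w in ("engineer", "nurse", "table")}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:<10} {score:+.3f}")
```

A positive score means the word leans toward the male end of the direction, a negative score toward the female end, and a score near zero indicates no measurable association along this axis.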

#### Exercise 8.2: Detecting Gender Bias in Embeddings

In [None]:
def measure_gender_bias(model, gender_pairs, target_words):
    """
    Quantify gender bias using Bolukbasi et al. (2016) methodology.

    Steps:
    1. Compute gender direction from definitional pairs (he/she, man/woman)
    2. Project target words onto this direction
    3. Positive score = male-associated, Negative = female-associated

    Args:
        model: Pretrained embedding model
        gender_pairs: List of (male_word, female_word) tuples
        target_words: Words to measure bias for

    Returns:
        dict: {word: bias_score} where positive = male-associated
    """
    # TODO: Compute gender direction vector
    # 1. For each gender pair, compute difference vector (male - female)
    # 2. Average all difference vectors to get gender direction
    # 3. Normalize the direction vector

    # YOUR CODE HERE
    gender_vecs = []
    for w1, w2 in gender_pairs:
        if _________________________:
            gender_vecs.append(_________________________)

    if not gender_vecs:
        raise ValueError("Insufficient gender pairs in vocabulary")

    gender_dir = _________________________
    gender_dir /= _________________________

    # TODO: Measure bias for target words
    # For each word in target_words:
    #   1. Get word vector
    #   2. Project onto gender direction: dot product normalized by vector norm
    #   3. Store in biases dictionary

    # YOUR CODE HERE
    biases = {}
    for word in target_words:
        if _________________________:
            vec = _________________________
            bias = _________________________
            biases[word] = bias

    return biases

# Define bias measurement parameters
gender_pairs = [("he", "she"), ("man", "woman"), ("boy", "girl")]
professions = ["doctor", "nurse", "engineer", "teacher", "programmer",
               "secretary", "scientist", "homemaker", "pilot"]

# TODO: Measure and display bias scores
# 1. Call measure_gender_bias with appropriate arguments
# 2. Sort professions by bias score (descending)
# 3. Print table with visual indicators:
#    🔴 for |score| > 0.15 (strong bias)
#    🟠 for 0.05 < |score| <= 0.15 (moderate bias)
#    🟢 for |score| <= 0.05 (neutral)

# YOUR CODE HERE
biases = _________________________

print("⚖️  Gender Bias Audit: Google News Embeddings")
print("(Positive = male-associated | Negative = female-associated)")
print("="*75)
print(f"{'Profession':<18} {'Bias Score':<15} {'Association':<25}")
print("-"*75)

for prof in sorted(biases, key=biases.get, reverse=True):
    score = biases[prof]
    if score > 0.15:
        marker = "🔴"
        assoc = "Strongly male-associated ♂♂"
    elif score > 0.05:
        marker = "🟠"
        assoc = "Moderately male-associated ♂"
    elif score < -0.15:
        marker = "🔴"
        assoc = "Strongly female-associated ♀♀"
    elif score < -0.05:
        marker = "🟠"
        assoc = "Moderately female-associated ♀"
    else:
        marker = "🟢"
        assoc = "Neutral ⚲"

    print(f"{marker} {prof:<16} {score:+.4f}        {assoc}")

**Question 8.2**: Why do embeddings contain societal biases? What are potential real-world harms if biased embeddings are deployed in hiring or lending systems without auditing?

_Your Answer:_

```
[Write your answer here]
```

---

## 9. Limitations and Future Directions

### 9.1 The Polysemy Problem: Static vs. Contextual Embeddings

Static embeddings assign one vector per word type, so they cannot separate the senses of polysemous words.
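
The limitation follows directly from how a static model is queried: it is a lookup table keyed by the word alone. The sketch below uses invented 2-D vectors, and the position of "bank" between the two sense clusters is chosen purely for illustration. `static_embed` is a hypothetical wrapper whose `sentence` argument is ignored on purpose.

```python
import math

# Hypothetical toy vectors (assumption): axis 0 ~ finance, axis 1 ~ nature.
toy_vectors = {
    "money": [1.0, 0.0], "deposit": [0.9, 0.1],
    "river": [0.0, 1.0], "water": [0.1, 0.9],
    # Placed between both senses for illustration only.
    "bank": [0.6, 0.6],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def static_embed(word, sentence):
    """Static lookup: the surrounding sentence has no influence on the result."""
    return toy_vectors[word]

v_finance = static_embed("bank", "I deposited money at the bank")
v_river = static_embed("bank", "We sat on the river bank")

sim_money = cosine(toy_vectors["bank"], toy_vectors["money"])
sim_river = cosine(toy_vectors["bank"], toy_vectors["river"])

print("Same vector in both sentences:", v_finance == v_river)
print(f"bank~money: {sim_money:.3f}, bank~river: {sim_river:.3f}")
```

Whatever the sentence, the vector for "bank" is identical, so any sense distinction has to come from the other words in the context, not from the embedding of "bank" itself.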

#### Exercise 9.1: Demonstrating the Polysemy Limitation

In [None]:
def demonstrate_polysemy(model, word, contexts):
    """
    Show how static embeddings cannot distinguish word senses.

    Args:
        model: Pretrained embedding model
        word (str): Polysemous word to analyze
        contexts (list): Sentences showing different meanings
    """
    if word not in model:
        print(f"⚠️  '{word}' not in vocabulary")
        return

    print(f"\n🔤 Polysemy Analysis: '{word}'")
    print("-"*70)
    print(f"Single static vector (first 10 dimensions): {model[word][:10].round(3)}")
    print("\nContext examples (static embedding cannot distinguish these):")

    # TODO: For each context sentence:
    # 1. Compute sentence vector using doc_to_vec
    # 2. Compute similarity between word vector and sentence context
    # 3. Print context with similarity score

    # YOUR CODE HERE
    for i, ctx in enumerate(contexts, 1):
        ctx_vec = _________________________
        if _________________________:
            continue
        sim = _________________________
        print(f"  {i}. '{ctx}'")
        print(f"     → Similarity to '{word}' vector: {sim:.3f}")

# Demonstrate with polysemous words
demonstrate_polysemy(w2v_google, "bank", [
    "I deposited money at the bank",          # Financial institution
    "We sat on the river bank",               # River edge
    "The plane made a steep bank turn"        # Aviation maneuver
])

# TODO: Add another polysemous word example (e.g., "apple", "light", "crane")
# YOUR CODE HERE
demonstrate_polysemy(w2v_google, _________________________, [
    _________________________,  # Meaning 1
    _________________________,  # Meaning 2
    _________________________   # Meaning 3
])

**Question 9.1**: How would contextual embeddings (like BERT) solve the polysemy problem? Describe the key architectural difference that enables this capability.

_Your Answer:_

```
[Write your answer here]
```

### 9.2 When to Use Static vs. Contextual Embeddings

| Decision Factor          | Static Embeddings (Word2Vec/GloVe)      | Contextual Embeddings (BERT)                |
| ------------------------ | --------------------------------------- | ------------------------------------------- |
| **Task complexity**      | Simple classification, clustering       | WSD, coreference, complex QA                |
| **Compute constraints**  | ✅ Low latency (<10ms), CPU-friendly    | ⚠️ High latency (50-200ms), GPU recommended |
| **Data availability**    | Works with small labeled datasets       | Requires fine-tuning data                   |
| **Polysemy handling**    | ❌ Impossible                           | ✅ Native capability                        |
| **Model size**           | 100-300MB                               | 300MB-2GB+                                  |
| **Real-world use cases** | Recommendation systems, semantic search | Virtual assistants, machine translation     |

> 💡 **Practical Guidance**: Start with static embeddings for most applications. Upgrade to contextual models only when polysemy disambiguation is critical to task success.

---

## 10. Final Exercises

### Exercise 10.1: Semantic Arithmetic Exploration

Explore vector relationships that reveal linguistic regularities:

In [None]:
def explore_analogy(model, positive, negative, topn=5):
    """
    Explore vector arithmetic relationships.

    Args:
        model: Pretrained embedding model
        positive: List of words to add
        negative: List of words to subtract
        topn: Number of results to return
    """
    try:
        result = model.most_similar(positive=positive, negative=negative, topn=topn)
        print(f"\n{' + '.join(positive)} - {' - '.join(negative)} ≈")
        print("-"*60)

        # TODO: Print results with visual confidence bars
        # For each result, print word, score, and bar proportional to score
        # Example: "  1. berlin    0.7821 ████████████████"

        # YOUR CODE HERE
        for i, (word, score) in enumerate(result, 1):
            bar = _________________________
            print(f"  {i}. {word:<20} {score:.4f} {bar}")
        return result
    except KeyError as e:
        print(f"⚠️  Missing word in vocabulary: {e}")
        return []

print("="*70)
print("🧠 SEMANTIC ARITHMETIC WITH GOOGLE NEWS EMBEDDINGS")
print("="*70)

# TODO: Explore at least 3 different types of analogies:
# 1. Geography relationships (capitals)
# 2. Gender transformations
# 3. Temporal relationships (verb tenses)
# 4. Conceptual relationships (abstract concepts)

# YOUR CODE HERE
print("\n🌍 Capital Cities:")
explore_analogy(w2v_google, _________________________, _________________________, topn=3)

print("\n⚖️  Gender Transformations:")
explore_analogy(w2v_google, _________________________, _________________________, topn=3)

print("\n⏳ Verb Tenses:")
explore_analogy(w2v_google, _________________________, _________________________, topn=3)

### Exercise 10.2: Build a Semantic Recommendation System

Implement semantic product recommendations for e-commerce:

In [None]:
# E-commerce product catalog
products = {
    "laptop": "powerful laptop with fast processor and long battery life for work and gaming",
    "tablet": "lightweight tablet with touchscreen display and all-day battery for media consumption",
    "smartphone": "premium smartphone with high-resolution camera system and 5G connectivity",
    "wireless_headphones": "noise-cancelling wireless headphones with 30-hour battery life",
    "mechanical_keyboard": "tactile mechanical keyboard with RGB lighting for gaming and typing",
    "ultrawide_monitor": "34-inch ultrawide curved monitor with 144Hz refresh rate for productivity",
}

# TODO: Convert products to vectors using doc_to_vec
# Store in product_vectors dictionary: {product_name: vector}
# YOUR CODE HERE
product_vectors = {}
for name, desc in products.items():
    vec = _________________________
    if _________________________:
        product_vectors[name] = vec

def recommend_products(product_id, top_k=3):
    """
    Recommend semantically similar products.

    Args:
        product_id (str): Product the user is viewing
        top_k (int): Number of recommendations to return

    Returns:
        list: Top k (product_name, similarity_score) tuples
    """
    # TODO: Implement recommendation logic
    # 1. Get target product vector
    # 2. Compute cosine similarity with all other products
    # 3. Return top-k most similar products (excluding self)

    # YOUR CODE HERE
    if _________________________:
        return []

    target_vec = _________________________
    similarities = {}

    for name, vec in product_vectors.items():
        if _________________________:
            sim = _________________________
            similarities[name] = sim

    return _________________________

# TODO: Generate and display recommendations for 3 different products
# Format output with business-friendly similarity indicators:
#   "Excellent match" for similarity > 0.6
#   "Good match" for 0.45 < similarity <= 0.6
#   "Moderate match" for similarity <= 0.45

# YOUR CODE HERE
print("🛒 SEMANTIC PRODUCT RECOMMENDATIONS")
print("="*75)

for product in ["laptop", "wireless_headphones", "mechanical_keyboard"]:
    print(f"\n👀 User viewing: '{product}'")
    print(f"   Description: {products[product][:60]}...")

    recommendations = _________________________
    print("   ➡️  Recommended alternatives:")

    for rank, (rec, score) in enumerate(recommendations, 1):
        if score > 0.6:
            quality = "Excellent match"
        elif score > 0.45:
            quality = "Good match"
        else:
            quality = "Moderate match"

        print(f"      {rank}. {rec:<22} (similarity: {score:.3f}) → {quality}")
        print(f"         Preview: {products[rec][:50]}...")

**Reflection Question R1**: What was the most surprising semantic relationship you discovered through vector arithmetic? Why was it surprising?

_Your Answer:_

```
[Write your reflection here]
```

**Reflection Question R2**: How might bias in embeddings impact the product recommendations generated in Exercise 10.2? Describe one potential fairness concern and how you might address it.

_Your Answer:_

```
[Write your reflection here]
```

---

### Summary and Key Takeaways

✅ **Embeddings require scale**: Semantic geometry emerges only with massive training data (100M+ tokens)  
✅ **Pretrained > custom-trained**: For most applications, pretrained embeddings outperform custom models unless you have domain-specific data at scale  
✅ **Evaluation must be multi-faceted**: Use both intrinsic (similarity/analogies) and extrinsic (downstream tasks) metrics  
✅ **Bias is inevitable**: Embeddings encode societal patterns, so always audit before deployment in sensitive applications  
✅ **Static vs. contextual tradeoffs**: Choose static embeddings for simplicity/speed; contextual for disambiguation tasks

> "Word embeddings don't _understand_ language; they compress statistical patterns from human text. Their power comes not from intelligence, but from scale: the geometric regularities that emerge when modeling billions of word contexts."

### Reflection Questions

**Question R1:** What was the most surprising thing you learned about word embeddings?

_Your Answer:_

```
[Write your reflection here]
```

**Question R2:** How would you explain the distributional hypothesis to someone without an NLP background?

_Your Answer:_

```
[Write your reflection here]
```

**Question R3:** What are the most important factors when training word embeddings for a real application?

_Your Answer:_

```
[Write your reflection here]
```

**Question R4:** How might word embeddings be used in your own research or projects?

_Your Answer:_

```
[Write your reflection here]
```

---

## Additional Resources

### Essential Papers

1. **Mikolov et al. (2013).** "Efficient Estimation of Word Representations in Vector Space"

   - Original Word2Vec paper

2. **Pennington et al. (2014).** "GloVe: Global Vectors for Word Representation"

   - Combines count-based and predictive methods

3. **Bojanowski et al. (2017).** "Enriching Word Vectors with Subword Information"
   - FastText and character n-grams

### Further Reading

- **Levy & Goldberg (2014):** "Neural Word Embeddings as Implicit Matrix Factorization"
- **Arora et al. (2018):** "Linear Algebraic Structure of Word Senses"
- CS224N: http://web.stanford.edu/class/cs224n/
- Pre-trained embeddings: https://nlp.stanford.edu/projects/glove/

### Next Steps

1. **Session 3:** Text Classification & Methodology

   - Building robust NLP baselines
   - Experimental design
   - Train/Val/Test splitting

2. **Future Topics:**
   - Recurrent Neural Networks (RNNs, LSTMs)
   - Attention mechanisms
   - Transformers and BERT
   - Modern LLMs (GPT, Claude, etc.)

---

## Submission Guidelines

### What to Submit

1. **Completed Notebook** with all exercises filled in
2. **Reflection Document** answering all questions
3. **Final Exercises Code** (Exercises 10.1 and 10.2)

### Grading Criteria

- **Correctness (40%):** Do your implementations work correctly?
- **Completeness (20%):** Did you complete all exercises?
- **Understanding (20%):** Do your answers demonstrate deep understanding?
- **Creativity (20%):** Did you extend beyond the basic requirements?

### Deadline

Check the course website for submission deadline.

---

**End of Notebook**