
# Information Retrieval and Text Mining Course - Tutorial Document Representation
Authors: Gijs Wijngaard and Jan Scholtes

Version: 2025-2026

Welcome to the tutorial about **document representation**. A fundamental challenge in Information Retrieval and Text Mining is converting unstructured text into numerical representations that algorithms can process. In this notebook you will explore a progression of methods — from simple counting-based approaches to modern neural embeddings:

1. **One-Hot Encoding** — the simplest binary representation
2. **N-grams & Bag-of-Words** — counting word occurrences
3. **TF-IDF** — weighting terms by importance
4. **Cosine Similarity** — measuring document similarity
5. **Word2Vec** — learning dense word embeddings
6. **Sentence Transformers** — contextual sentence embeddings

---

## 1. One-Hot Encoding

The simplest way to represent words as numbers is **one-hot encoding**. Given a vocabulary $V = \{w_1, w_2, \ldots, w_{|V|}\}$, each word $w_i$ is represented as a binary vector $\mathbf{e}_i \in \{0, 1\}^{|V|}$ where:

$$
\mathbf{e}_i[j] = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}
$$

For example, with vocabulary $V = \{\text{cat}, \text{dog}, \text{fox}\}$:
- cat $\rightarrow [1, 0, 0]$
- dog $\rightarrow [0, 1, 0]$
- fox $\rightarrow [0, 0, 1]$

**Limitations:**
- Vectors are very high-dimensional (size of vocabulary, typically 10,000+)
- All word vectors are **orthogonal** — there is no notion of similarity ($\text{cosine}(\mathbf{e}_i, \mathbf{e}_j) = 0$ for $i \neq j$)
- No semantic information is captured ("king" and "queen" are as different as "king" and "banana")

In [None]:
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog"
words = sentence.split()
vocabulary = sorted(set(words))

print(f"Vocabulary ({len(vocabulary)} unique words): {vocabulary}\n")

# Create one-hot vectors for each unique word
one_hot = {word: np.eye(len(vocabulary))[i].astype(int) for i, word in enumerate(vocabulary)}

for word, vector in one_hot.items():
    print(f"  {word:6s} → {vector}")
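
The orthogonality limitation listed above can be checked directly. The following is a minimal sketch on a toy three-word vocabulary (the names `toy_vocab`, `toy_one_hot` and `pairwise` are ours, chosen for illustration). Because one-hot vectors have unit length, their dot product equals their cosine similarity.

```python
import numpy as np

# Toy vocabulary, chosen only for illustration
toy_vocab = ["cat", "dog", "fox"]
toy_one_hot = np.eye(len(toy_vocab), dtype=int)  # row i is the one-hot vector of toy_vocab[i]

# One-hot vectors have unit length, so the dot product equals the cosine similarity.
# Entry (i, j) of this matrix is the similarity between word i and word j.
pairwise = toy_one_hot @ toy_one_hot.T
print(pairwise)
```

The result is the identity matrix: every word is maximally similar to itself and has zero similarity to every other word, however related the words are in meaning.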

## 2. N-grams and Bag-of-Words

Instead of representing individual words, we can capture **sequences** of words. An **n-gram** is a contiguous sequence of $n$ items from a text. Given a sequence of tokens $w_1, w_2, \ldots, w_m$, the set of n-grams is:

$$
\text{ngrams}(n) = \{(w_i, w_{i+1}, \ldots, w_{i+n-1}) \mid 1 \leq i \leq m - n + 1\}
$$

Let's see this in practice with our example sentence:

In [None]:
sentence = "the quick brown fox jumps over the lazy dog"

A **bigram** ($n=2$) groups two consecutive words together. Bigrams capture local word co-occurrence patterns:

In [None]:
splitted = sentence.split(" ")
[bigram for bigram in zip(splitted, splitted[1:])]

Grouping three consecutive words gives a **trigram** ($n=3$):

In [None]:
[trigram for trigram in zip(splitted, splitted[1:], splitted[2:])]
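
The `zip` pattern above needs one extra argument per increase in $n$. As a sketch of how the n-gram definition could be written for arbitrary $n$, the hypothetical helper `ngrams` below slices the token list at every valid start position $i$:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    # Valid start positions are 0 .. len(tokens) - n, matching 1 <= i <= m - n + 1
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example_tokens = "the quick brown fox jumps over the lazy dog".split()
print(ngrams(example_tokens, 2)[:3])   # first three bigrams
print(ngrams(example_tokens, 3)[:3])   # first three trigrams
print(len(ngrams(example_tokens, 4)))  # number of 4-grams
```

If $n$ is larger than the number of tokens, the range is empty and the function returns an empty list.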

### Bag-of-Words (BoW)

A **Bag-of-Words** representation ignores word order entirely and represents a document as a vector of word counts. Given vocabulary $V = \{w_1, \ldots, w_{|V|}\}$, a document $d$ is represented as:

$$
\mathbf{d} = [\text{tf}(w_1, d), \text{tf}(w_2, d), \ldots, \text{tf}(w_{|V|}, d)]
$$

where $\text{tf}(w, d)$ is the **term frequency** — the number of times word $w$ appears in document $d$.
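
The vector form above can be sketched as follows. This is a minimal illustration, assuming the vocabulary is simply the sorted set of tokens in the sentence; `bow_vector`, `bow_tokens` and `bow_vocab` are names introduced here, not part of the notebook.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Count vector of `tokens` over the ordered vocabulary `vocab`."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the document
    return [counts[w] for w in vocab]

bow_tokens = "the quick brown fox jumps over the lazy dog".split()
bow_vocab = sorted(set(bow_tokens))

print(bow_vocab)
print(bow_vector(bow_tokens, bow_vocab))
```

Each position in the vector corresponds to one vocabulary word, so two documents encoded over the same vocabulary can be compared position by position.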

In [None]:
from collections import Counter
bag_of_words = Counter(splitted)
print("Bag-of-Words representation:")
for word, count in bag_of_words.items():
    print(f"  {word}: {count}")

Notice that "the" has a count of 2, while all other words appear once. Words like "the", "a", "and" are called **stop words** — they are extremely common but carry little meaning about the document's topic. We need a weighting scheme that reduces the importance of such words. This is where **TF-IDF** comes in.

<a name="dataset"></a>

## 3. Dataset

We use a [movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/) from NLTK. This dataset contains **1000 positive** and **1000 negative** movie reviews, making it suitable for **sentiment analysis** — classifying reviews as positive or negative based on their word content.

In [None]:
import nltk
nltk.download('movie_reviews')
nltk.download('words')
from nltk.corpus import words, movie_reviews as mr
nltk_words = set(words.words())

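We first drop punctuation tokens and tokens that are not in NLTK's English word list, and then count the most common words. Note that `remove_punct` is used as a `filter` predicate: it returns an empty string (which is falsy) for every token that should be discarded.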

In [None]:
import string
from collections import Counter
def remove_punct(word):
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word if word in nltk_words else ''
all_words = Counter(filter(remove_punct, mr.words()))
all_words.most_common(10)

We see the same problem here: words such as *the* and *a* are the most common in the movie reviews of our dataset. However, to do something useful with a review, such as classifying it, we should give these words a lower weight, as they say little about the content itself.

In [None]:
documents = [(list(filter(remove_punct, mr.words(f))), mr.categories(f)) for f in mr.fileids()]
print("Total number of documents:", len(documents))
print("Total number of words in first document:", len(documents[0][0]))

## 4. TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF assigns a **weight** to each term in a document that reflects how important that term is relative to the collection. The score **increases** with the number of occurrences in a document and **increases** with the rarity of the term across all documents.

The TF-IDF weight of term $t$ in document $d$ is:

$$
w_{t,d} = \text{tf-idf}(t, d) = \log(1 + \text{tf}_{t,d}) \times \log_{10}\!\left(\frac{N}{\text{df}_t}\right)
$$

where:
- $\text{tf}_{t,d}$ = **term frequency**: number of times term $t$ appears in document $d$
- $\text{df}_t$ = **document frequency**: number of documents in the collection containing term $t$
- $N$ = total number of documents in the collection
- $\log(1 + \text{tf}_{t,d})$ applies **sublinear** scaling — a word appearing 10× is not 10× as important
- $\log_{10}(N / \text{df}_t)$ is the **inverse document frequency (IDF)** — rare terms get higher weight

**Key insight**: A term gets a high TF-IDF score when it appears frequently in a specific document (high TF) but rarely across the collection (high IDF). Common words like "the" will have $\text{df}_t \approx N$, giving $\text{IDF} \approx 0$.
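
To make the key insight concrete, here is a small worked example of the formula. The counts are hypothetical and chosen only for illustration; `tfidf_weight` and `n_docs` are names introduced for this sketch.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """log(1 + tf) * log10(N / df), the variant used in this tutorial."""
    return math.log(1 + tf) * math.log10(n_docs / df)

n_docs = 2000  # size of a collection, e.g. 1000 positive + 1000 negative reviews

# Hypothetical counts, for illustration only
common = tfidf_weight(tf=30, df=1995, n_docs=n_docs)  # frequent in nearly every document
rare = tfidf_weight(tf=3, df=20, n_docs=n_docs)       # occurs in only 1% of documents

print(f"common term: {common:.4f}")
print(f"rare term:   {rare:.4f}")
```

Although the common term occurs ten times as often in the document, its IDF is close to zero, so the rare term ends up with a much higher weight.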

> **Note:** There are many TF-IDF variants. A popular alternative is **BM25** (Best Matching 25), which adds document length normalization and a saturation parameter. BM25 is used by search engines like Elasticsearch and Solr.

Let's start by calculating the term frequency (tf). So far we have counted words over the whole collection. For the TF-IDF score, however, we need the term frequency per document, so we loop over the documents and count the occurrences of each term in each one.

In [None]:
tf = [Counter(words) for words, category in documents]
tf[0].most_common(10) # Most common terms for the first document

Now let's calculate the **document frequency** ($\text{df}$). For each word in our vocabulary, we count how many documents contain that word. We convert each document to a `set` (unique words) so that membership lookup is $O(1)$ instead of $O(n)$:

In [None]:
setted_docs = [set(doc) for doc, category in documents]
df = {word: sum([1 for doc in setted_docs if word in doc]) for word in all_words.keys()}
list(df.items())[:10]
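
The dictionary comprehension above checks every document for every vocabulary word. As an alternative sketch (not the notebook's method), the same counts can be collected in a single pass over the documents with a `Counter`. The toy documents and the names `toy_docs` and `toy_df` are made up for illustration.

```python
from collections import Counter

# Toy tokenized documents, for illustration only
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["a", "quick", "brown", "fox"],
]

toy_df = Counter()
for doc in toy_docs:
    toy_df.update(set(doc))  # set() so each document counts a word at most once

print(toy_df["the"], toy_df["cat"], toy_df["fox"])
```

This visits each token once, which matters when both the vocabulary and the collection are large.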

### Exercise 1 — Implement TF-IDF (4 points)

Implement the TF-IDF score for each word per document yourself using the formula:

$$
w_{t,d} = \log(1 + \text{tf}_{t,d}) \times \log_{10}\!\left(\frac{N}{\text{df}_t}\right)
$$

Use `numpy` for `log` and `log10`. Store the result as a list of dictionaries called `tfidf`, where each dictionary maps words to their TF-IDF scores for that document.

> **Hint**: Use `tf[i]` to get the term frequency dictionary for document `i`, `df[word]` for the document frequency of a word, and `len(documents)` for $N$.

In [None]:
import numpy as np

N = len(documents)

### BEGIN SOLUTION
tfidf = []
for i in range(N):
    tfidf_doc = {}
    for word, count in tf[i].items():
        tfidf_doc[word] = np.log(1 + count) * np.log10(N / df[word])
    tfidf.append(tfidf_doc)
### END SOLUTION

# Check: print top 5 TF-IDF terms for the first document
sorted_first = sorted(tfidf[0].items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 TF-IDF terms for document 1:")
for word, score in sorted_first:
    print(f"  {word}: {score:.4f}")

In [None]:
### BEGIN HIDDEN TESTS
# Test that tfidf is a list of dictionaries with correct length
assert isinstance(tfidf, list), "tfidf should be a list"
assert len(tfidf) == len(documents), f"tfidf should have {len(documents)} entries, got {len(tfidf)}"
assert isinstance(tfidf[0], dict), "Each element should be a dictionary"

# Test that common words get low TF-IDF scores
# "the" appears in almost all documents, so its IDF should be near 0
if "the" in tfidf[0]:
    assert tfidf[0]["the"] < 0.1, "Common words like 'the' should have very low TF-IDF"

# Test that all values are non-negative
for doc_tfidf in tfidf[:5]:
    for word, score in doc_tfidf.items():
        assert score >= 0, f"TF-IDF scores should be non-negative, got {score} for '{word}'"

print("All Exercise 1 tests passed!")
### END HIDDEN TESTS

### Exercise 2 — Analyze TF-IDF (2 points)

**a.** Using your TF-IDF scores, compute the **average TF-IDF score** for each word across all **positive** reviews and all **negative** reviews separately. Then get the top 50 words by average TF-IDF for each sentiment.

Store the top 50 positive words as `top_positive` and top 50 negative words as `top_negative` (each as a list of `(word, score)` tuples, sorted descending by score).

**b.** What do you notice? Write your observations in the text cell below. Are there differences between the two lists? Could we train a classifier based on these TF-IDF features?

In [None]:
### BEGIN SOLUTION
# Separate positive and negative documents
pos_indices = [i for i, (_, cat) in enumerate(documents) if 'pos' in cat]
neg_indices = [i for i, (_, cat) in enumerate(documents) if 'neg' in cat]

# Average TF-IDF per word for positive reviews
pos_avg = {}
for i in pos_indices:
    for word, score in tfidf[i].items():
        pos_avg[word] = pos_avg.get(word, 0) + score
pos_avg = {w: s / len(pos_indices) for w, s in pos_avg.items()}

# Average TF-IDF per word for negative reviews
neg_avg = {}
for i in neg_indices:
    for word, score in tfidf[i].items():
        neg_avg[word] = neg_avg.get(word, 0) + score
neg_avg = {w: s / len(neg_indices) for w, s in neg_avg.items()}

top_positive = sorted(pos_avg.items(), key=lambda x: x[1], reverse=True)[:50]
top_negative = sorted(neg_avg.items(), key=lambda x: x[1], reverse=True)[:50]
### END SOLUTION

print("Top 20 words in POSITIVE reviews:")
for word, score in top_positive[:20]:
    print(f"  {word}: {score:.4f}")

print("\nTop 20 words in NEGATIVE reviews:")
for word, score in top_negative[:20]:
    print(f"  {word}: {score:.4f}")

### BEGIN SOLUTION
**Observations:** Both lists share many words related to movies (film, movie, character, story). However, some sentiment-specific words differ — positive reviews may feature words like "best", "great", "excellent" while negative reviews may include "bad", "worst", "boring". Despite overlap, a classifier could leverage the different TF-IDF weight distributions to distinguish sentiment. The overlap occurs because both types of reviews discuss the same domain (movies).
### END SOLUTION

### Visualizing TF-IDF Weights

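Let's visualize how TF-IDF assigns different weights to words. We use scikit-learn's `TfidfVectorizer` to compute TF-IDF on a small set of example documents and display the result as a heatmap. Note that `TfidfVectorizer` uses a different TF-IDF variant by default (raw term counts, a smoothed IDF with 1 added, and L2 normalization of each document vector), so the numbers differ from the formula above. Its default tokenizer also drops single-character tokens such as "a".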

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Small example corpus
example_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the fox jumped over the lazy dog",
    "a quick brown fox"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(example_docs)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names,
                        index=[f"Doc {i+1}" for i in range(len(example_docs))])

plt.figure(figsize=(12, 4))
sns.heatmap(tfidf_df, annot=True, fmt=".2f", cmap="YlOrRd", linewidths=0.5)
plt.title("TF-IDF Weights per Document")
plt.ylabel("Documents")
plt.xlabel("Terms")
plt.tight_layout()
plt.show()

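print("\nNotice that 'the' occurs in three of the four documents, so it has the lowest IDF.")
print("It appears twice in Docs 1-3, though, so its weight there is still high;")
print("words found in a single document ('mat', 'jumped', 'quick') weigh more per occurrence.")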

## 5. Cosine Similarity

Once we have vector representations of documents (whether BoW, TF-IDF, or embeddings), we need a way to measure **how similar** two documents are. The most commonly used measure in IR is **cosine similarity**.

For two vectors $\mathbf{a}$ and $\mathbf{b}$, cosine similarity is defined as:

$$
\text{cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i \, b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

- The result ranges from $-1$ (opposite) to $1$ (identical direction), with $0$ meaning orthogonal (unrelated).
- For non-negative vectors (like TF-IDF), the range is $[0, 1]$.
- Cosine similarity measures **angle**, not magnitude — a long document and a short document with the same word proportions will have high similarity.
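
The formula and the angle-versus-magnitude property can be sketched with plain NumPy. The count vectors below are toy values, and `cosine_sim` is a name introduced for this illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (both must be non-zero)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = [1, 2, 0]    # toy count vector
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 3]    # no terms in common with the other two

print(cosine_sim(short_doc, long_doc))   # same direction
print(cosine_sim(short_doc, other_doc))  # orthogonal
```

The first pair points in the same direction and scores (up to floating-point rounding) 1, while the second pair shares no terms and scores 0.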

Let's compute the cosine similarity between our example TF-IDF documents:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity on TF-IDF vectors
cos_sim = cosine_similarity(tfidf_matrix)
cos_df = pd.DataFrame(cos_sim, 
                       index=[f"Doc {i+1}" for i in range(len(example_docs))],
                       columns=[f"Doc {i+1}" for i in range(len(example_docs))])

plt.figure(figsize=(6, 5))
sns.heatmap(cos_df, annot=True, fmt=".3f", cmap="Blues", vmin=0, vmax=1, linewidths=0.5)
plt.title("Cosine Similarity between Documents (TF-IDF)")
plt.tight_layout()
plt.show()

print("Documents:")
for i, doc in enumerate(example_docs):
    print(f"  Doc {i+1}: \"{doc}\"")
print("\nDoc 1 and Doc 2 share 'the' and 'cat' → moderate similarity")
print("Doc 3 and Doc 4 share 'fox' → some similarity")
print("Doc 1 and Doc 4 share no terms → zero similarity")

## 6. Word2Vec — Dense Word Embeddings

So far, our representations have been **sparse** and **high-dimensional** (one dimension per vocabulary word). Word2Vec (Mikolov et al., 2013) learns **dense, low-dimensional** vectors (typically 100-300 dimensions) where **semantically similar words are close together** in vector space.

The core idea is the **distributional hypothesis**: *"You shall know a word by the company it keeps."* (Firth, 1957)

Word2Vec has two training architectures:

1. **Skip-gram**: Given a center word $w_t$, predict context words $w_{t+j}$ within a window of size $c$. The objective maximizes:

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, \, j \neq 0} \log P(w_{t+j} \mid w_t)
$$

where 

$$
P(w_O \mid w_I) = \frac{\exp(\mathbf{v}'_{w_O} \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{|V|} \exp(\mathbf{v}'_w \cdot \mathbf{v}_{w_I})}
$$

2. **CBOW (Continuous Bag of Words)**: Given context words, predict the center word. Faster to train but less effective on rare words.

In practice, **negative sampling** is used instead of the full softmax to make training tractable.
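
To illustrate what the skip-gram objective sums over, the sketch below enumerates the (center, context) pairs for a window of size $c$. It is a simplified illustration, not gensim's implementation: it ignores details such as subsampling of frequent words and negative sampling. The helper `skipgram_pairs` is a name introduced here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every context position within `window` of the center."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)                # clip the window at the sentence start
        hi = min(len(tokens), t + window + 1)  # and at the sentence end
        for j in range(lo, hi):
            if j != t:                         # j != 0 in the formula: skip the center itself
                pairs.append((center, tokens[j]))
    return pairs

toy_sentence = "the quick brown fox".split()
for pair in skipgram_pairs(toy_sentence, window=1):
    print(pair)
```

Each pair contributes one $\log P(w_{t+j} \mid w_t)$ term to the objective. CBOW uses the same windows but predicts the center word from all of its context words at once.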

The most common implementation for Word2Vec in Python is [gensim](https://radimrehurek.com/gensim/models/word2vec.html). Let's train Word2Vec on our movie reviews:

In [None]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=[doc for doc, cat in documents])
word_vectors = model.wv
word_vectors['the']

We can find the words whose vectors are closest to that of a given word with `most_similar`.

In [None]:
word_vectors.most_similar('king')

We can even do arithmetic with the vectors. The most famous example of this is the `king - man + woman = queen` analogy. By adding the vectors of king and woman, and subtracting the vector of man, we should end up close to the queen vector. Let's try!

In [None]:
word_vectors.most_similar(positive=['king','woman'],negative=['man'])

In our run, queen comes out as the second most similar word. The exact ranking can differ between runs, and our reviews dataset is small by Word2Vec standards, so an imperfect result is to be expected.
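
To see what the analogy query computes, here is a minimal sketch with hand-made 2-d vectors. These are not learned embeddings, and gensim's actual implementation differs in details such as normalization. The names `toy_vectors` and `analogy` are introduced for this illustration.

```python
import numpy as np

# Hand-made 2-d toy vectors (axis 0: royalty, axis 1: gender); not learned embeddings
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # leave the query words out of the ranking
        scores[word] = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))
```

In this toy space the gender offset is identical for both word pairs, so "queen" ranks first. In learned embeddings the offsets are only approximately parallel, which is why the expected answer does not always come out on top.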

Lastly, let's plot the data. To do so, we need to project our vectors into a 2-d space with a dimensionality reduction technique such as PCA or t-SNE. We use t-SNE (invented by someone who did the same master's as you are doing!). Computing the projection below might take a while:

In [None]:
from sklearn.manifold import TSNE
import numpy as np
tsne = TSNE(n_components=2, random_state=0)
vectors = tsne.fit_transform(np.asarray(model.wv.vectors))
x, y = zip(*vectors)

In [None]:
len(x), len(y)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
plt.scatter(x, y)

## 7. Pretrained Word Embeddings (GloVe)

Word embeddings work best when they are trained on massive corpora. In practice we therefore often use **pretrained embeddings**: vectors trained on such corpora by researchers and made publicly available.

**GloVe** (Global Vectors, Pennington et al., 2014) is a popular alternative to Word2Vec. While Word2Vec uses local context windows, GloVe combines:
- **Global co-occurrence statistics** (how often words appear together across the entire corpus)  
- **Local context** (word-word co-occurrence within windows)

GloVe optimizes the objective:

$$
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

where $X_{ij}$ is the co-occurrence count and $f$ is a weighting function that prevents very common co-occurrences from dominating.
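
As a sketch of how a single term of this objective behaves, the code below uses the weighting function proposed in the GloVe paper, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 0.75$. The vectors, biases and count are hypothetical values, and the function names are ours.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: grows as (x / x_max)^alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of J for a word pair with co-occurrence count x_ij > 0."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

# Hypothetical 3-d vectors, biases and count, for illustration only
w_i = np.array([0.1, 0.2, 0.3])
w_j = np.array([0.3, 0.1, 0.2])

print(glove_weight(1), glove_weight(50), glove_weight(500))
print(glove_pair_loss(w_i, w_j, b_i=0.0, b_j=0.0, x_ij=10))
```

Rare co-occurrences get a small weight, and anything at or above `x_max` gets the same weight of 1, so very frequent pairs cannot dominate the sum.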

Let's download pretrained GloVe vectors (trained on Wikipedia + Gigaword, 6B tokens):

In [None]:
import gensim.downloader
glove = gensim.downloader.load('glove-wiki-gigaword-50')

In [None]:
glove["king"]

### Exercise 3 — Sentiment Classification with GloVe (4 points)

Using our `documents`, perform sentiment classification:

1. For each document, get the GloVe pretrained word vector for every word (skip words not in the vocabulary)
2. **Average** all word vectors for each document to create a single document embedding
3. Split the data into **80/20 train/test** (use `train_test_split` with `random_state=42`)
4. Train a `LogisticRegression` classifier on the averaged vectors
5. Store predictions as `y_pred` and compute accuracy as `glove_accuracy`

> **Hint**: Use `glove[word]` to get vectors. Use `word in glove` to check if a word exists. Use `np.mean(vectors, axis=0)` to average.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### BEGIN SOLUTION
# Create averaged GloVe vectors for each document
X_glove = []
y_labels = []
for tokens, categories in documents:
    # Use fresh names so we do not overwrite `words` (NLTK corpus) or `word_vectors` (Word2Vec)
    token_vectors = [glove[w] for w in tokens if w in glove]
    if token_vectors:
        X_glove.append(np.mean(token_vectors, axis=0))
    else:
        X_glove.append(np.zeros(50))  # GloVe-50d
    y_labels.append(1 if 'pos' in categories else 0)

X_glove = np.array(X_glove)
y_labels = np.array(y_labels)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_glove, y_labels, test_size=0.2, random_state=42)

# Train classifier
clf_glove = LogisticRegression(max_iter=1000, random_state=42)
clf_glove.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf_glove.predict(X_test)
glove_accuracy = accuracy_score(y_test, y_pred)
### END SOLUTION

print(f"GloVe + LogisticRegression Accuracy: {glove_accuracy:.4f}")

In [None]:
### BEGIN HIDDEN TESTS
assert glove_accuracy > 0.55, f"Accuracy should be above chance (50%), got {glove_accuracy:.4f}"
assert len(y_pred) == len(y_test), "Predictions should match test set size"
assert X_glove.shape[0] == len(documents), "Should have one vector per document"
assert X_glove.shape[1] == 50, "GloVe-50d should produce 50-dimensional vectors"
print(f"All Exercise 3 tests passed! Accuracy: {glove_accuracy:.4f}")
### END HIDDEN TESTS

## Bias in Word2Vec
One of the problems with word embeddings such as Word2Vec and GloVe (and with machine learning in general) is that the model picks up many biases from its training data. Examples of biases that can be harmful when using these algorithms include gender bias and ethnicity bias. Let's check, for example, what happens if we take the female equivalent of `doctor`:

In [None]:
glove.most_similar(positive=['doctor','woman'],negative=['man'])

### Exercise 4 — Bias in Word Embeddings (2 points)

Think of **at least 2 other examples** of bias in Word2Vec/GloVe (e.g., gender bias, racial bias, professional stereotypes). Use the `glove.most_similar()` function to demonstrate them with code, and explain why these biases are harmful in real-world applications.

> **Hint**: Try analogies like `man:programmer :: woman:?` or `white:wealthy :: black:?`

### BEGIN SOLUTION
**Examples of bias:**

1. **Gender bias in professions**: `man:computer_programmer :: woman:homemaker` — the model associates technical professions with men and domestic roles with women. This is harmful because it can reinforce stereotypes in hiring systems or recommendation engines.

2. **Racial/ethnic bias**: Word embeddings trained on internet text encode societal biases about race, associating certain ethnicities with negative attributes. This is harmful in applications like resume screening, criminal justice risk assessment, or content moderation.

These biases exist because the training data (web text, news articles) reflects historical and societal prejudices. When these embeddings are used in downstream applications (search engines, chatbots, hiring tools), they can perpetuate and amplify discrimination.
### END SOLUTION

In [None]:
### BEGIN SOLUTION
# Example 1: Gender bias in professions
print("man:programmer :: woman:?")
print(glove.most_similar(positive=['programmer', 'woman'], negative=['man'])[:5])

print("\nman:doctor :: woman:?")
print(glove.most_similar(positive=['doctor', 'woman'], negative=['man'])[:5])

# Example 2: Try another analogy
print("\nfather:doctor :: mother:?")
print(glove.most_similar(positive=['doctor', 'mother'], negative=['father'])[:5])
### END SOLUTION

## 8. Sentence Transformers — Contextual Embeddings

A key limitation of Word2Vec and GloVe is that they produce **static embeddings** — each word has exactly one vector regardless of context. The word "bank" gets the same vector whether it means "river bank" or "financial bank".

**Transformer-based models** (like BERT) solve this by producing **contextual embeddings** — the vector for each word depends on its surrounding context. These models use the **self-attention mechanism**:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

where $Q$, $K$, $V$ are query, key, and value matrices derived from input embeddings, and $d_k$ is the dimension of the keys.
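
The attention formula can be sketched in a few lines of NumPy. This is a single-head illustration on random matrices, without the learned projections, masking or multiple heads of a real transformer. The sizes and names below are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (n_queries, n_keys), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))

output, weights = attention(Q, K, V)
print(output.shape)
print(weights.sum(axis=1))
```

Each output row is a weighted average of the value vectors, which is how the representation of one token comes to depend on all other tokens in the sequence.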

**Sentence Transformers** (Reimers & Gurevych, 2019) extend BERT by applying **mean pooling** over all token embeddings to produce a single fixed-size vector for an entire sentence. This is efficient and well-suited for tasks like:
- Semantic search
- Sentence similarity
- Clustering
- Sentiment classification
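
The mean pooling step mentioned above can be sketched as follows. The token embeddings are made-up numbers, and the sketch assumes an attention mask that marks padded positions with 0, as is common when sentences are batched. `mean_pool` is a name introduced here.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)  # shape (n_tokens, 1), broadcast over dimensions
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Hypothetical token embeddings: 4 positions, 3 dimensions; the last position is padding
toy_token_embeddings = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding, should not influence the result
])
toy_mask = np.array([1, 1, 1, 0])

print(mean_pool(toy_token_embeddings, toy_mask))
```

Whatever the sentence length, the result has the same dimensionality as a single token embedding, which is what makes sentence vectors directly comparable.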

| Feature | Word2Vec / GloVe | Sentence Transformers |
|:--------|:-----------------|:---------------------|
| Type | Static | Contextual |
| Granularity | Word-level | Sentence-level |
| Polysemy | One vector per word | Context-dependent |
| Training signal | Co-occurrence | Masked language model pretraining + sentence-pair fine-tuning |
| Typical dims | 50–300 | 384–1024 |

Let's load a sentence transformer model:

In [None]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

We can encode any sentence like this:

In [None]:
sentence_embedding = sentence_model.encode("the quick brown fox jumps over the lazy dog")
sentence_embedding.shape

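We now get a single 384-dimensional vector for the whole sentence. A BERT-style encoder on its own would instead return one vector per token, for example an 11 × 768 matrix for this sentence with BERT-base (9 word tokens plus 2 special tokens, assuming each word maps to a single token). A single vector is much easier to work with.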

### Exercise 5 — Sentiment Classification with Sentence Transformers (4 points)

Now repeat the sentiment classification task using **Sentence Transformers** instead of GloVe:

1. Convert all documents to sentence strings (join the word lists with spaces)
2. Use `sentence_model.encode()` to compute embeddings for all documents
3. Use the same 80/20 split with `random_state=42`
4. Train a `LogisticRegression` classifier
5. Store predictions as `y_pred_st` and accuracy as `st_accuracy`

Then answer: Do you see a difference compared to Exercise 3? Why might that be?

In [None]:
### BEGIN SOLUTION
# Convert documents to sentence strings
doc_strings = [" ".join(words) for words, cat in documents]

# Encode with sentence transformers
X_st = sentence_model.encode(doc_strings, show_progress_bar=True)

# Same labels as before
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(
    X_st, y_labels, test_size=0.2, random_state=42
)

# Train classifier
clf_st = LogisticRegression(max_iter=1000, random_state=42)
clf_st.fit(X_train_st, y_train_st)

# Predict and evaluate
y_pred_st = clf_st.predict(X_test_st)
st_accuracy = accuracy_score(y_test_st, y_pred_st)
### END SOLUTION

print(f"Sentence Transformers + LogisticRegression Accuracy: {st_accuracy:.4f}")
print(f"GloVe + LogisticRegression Accuracy:                 {glove_accuracy:.4f}")
print(f"Improvement: {(st_accuracy - glove_accuracy)*100:+.1f} percentage points")

In [None]:
### BEGIN HIDDEN TESTS
assert st_accuracy > 0.55, f"Sentence Transformer accuracy should be above chance, got {st_accuracy:.4f}"
assert len(y_pred_st) == len(y_test_st), "Predictions should match test set size"
assert X_st.shape[0] == len(documents), "Should have one embedding per document"
assert X_st.shape[1] == 384, "all-MiniLM-L6-v2 should produce 384-dimensional vectors"
print(f"All Exercise 5 tests passed! Accuracy: {st_accuracy:.4f}")
### END HIDDEN TESTS

Do you see a difference between the accuracy at Exercise 3 (GloVe) and Exercise 5 (Sentence Transformers)? Why do you think this is? How could we further improve accuracy?

### BEGIN SOLUTION
Sentence Transformers typically achieve higher accuracy than GloVe because:
1. **Contextual embeddings** capture word meaning in context, while GloVe uses static vectors
2. **Sentence-level representations** capture the overall meaning instead of just averaging word vectors
3. **Training on sentence pairs** (e.g., NLI and paraphrase data) makes sentence transformers better at capturing semantic relationships between whole sentences

To further improve accuracy, we could:
- Fine-tune the sentence transformer on our specific dataset
- Use a larger transformer model (e.g., `all-mpnet-base-v2`)
- Use a more powerful classifier (e.g., fine-tuned BERT end-to-end)
- Augment the training data
### END SOLUTION

## 9. Summary — Comparison of Document Representation Methods

| Method | Type | Dimensions | Captures Semantics? | Handles Polysemy? | Key Advantage | Key Limitation |
|:-------|:-----|:-----------|:--------------------|:-----------------|:-------------|:--------------|
| **One-Hot** | Sparse, binary | $\lvert V \rvert$ (10K+) | No | No | Simple, interpretable | No similarity between words |
| **Bag-of-Words** | Sparse, count | $\lvert V \rvert$ | No | No | Captures word frequency | Ignores word order and importance |
| **TF-IDF** | Sparse, weighted | $\lvert V \rvert$ | Partially | No | Weights by importance | Still high-dimensional, no semantics |
| **Word2Vec** | Dense, static | 100–300 | Yes | No | Captures analogies and similarity | One vector per word |
| **GloVe** | Dense, static | 50–300 | Yes | No | Combines global + local statistics | One vector per word |
| **Sentence Transformers** | Dense, contextual | 384–1024 | Yes | Yes | Context-aware, sentence-level | Computationally expensive |

**The evolution**: From sparse, high-dimensional, context-free representations → dense, low-dimensional, context-aware embeddings. Each step captures more linguistic information but requires more computational resources.