# Vector Space Semantics and Word Embeddings

Word embeddings are a powerful concept in natural language processing (NLP): they represent words as vectors in a continuous vector space. These vectors capture semantic information, so that words with similar meanings are represented by vectors that are close to each other in the space.

In this assignment, we'll look at more traditional, *sparse* representations of words (i.e., where a word is represented by a long vector of counts, such as a vector of size $|V|$ with one count per vocabulary index), and then play with more advanced, *dense* representations of words, obtained via dimensionality reduction or pre-training.

## Part 1: Sparse Vector Representations

We'll implement TF-IDF and PPMI, two methods to represent words and documents in a sparse matrix. We have provided a *term-document matrix* for the BLT corpus, treating each review as a document. Each row of the matrix corresponds to a word in the vocabulary, and each column corresponds to a review. Each cell gives the number of times that word occurs in that review (its term frequency).

Because most words do not appear in most reviews, the matrix will be mostly 0s. This is why we call it sparse.
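
To make the idea concrete, here is a minimal sketch that builds a tiny term-document matrix and measures how sparse it is. The three "reviews" and the name `toy_termxdoc` are made up for illustration, and the whitespace tokenization is an assumption; the provided BLT matrix was not necessarily built this way.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy "reviews" (not taken from the BLT corpus)
toy_docs = {
    "doc1": "the grill works great",
    "doc2": "the blender is loud",
    "doc3": "great grill great price",
}

# Count whitespace-separated tokens per document.
# Rows are terms, columns are documents; missing terms become 0.
toy_termxdoc = (
    pd.DataFrame({name: dict(Counter(text.split())) for name, text in toy_docs.items()})
    .fillna(0)
    .astype(int)
)

# Fraction of cells that are zero
sparsity = (toy_termxdoc == 0).to_numpy().mean()

print(toy_termxdoc)
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents, more than half of the cells are zero; with thousands of reviews and a large vocabulary the proportion is far higher.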

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Get the term x doc matrix of the BLT corpus
# Note: the matrix does not include reviews that were flagged for rejection in the corpus
termxdoc = pd.read_csv('blt_termxdoc.csv', index_col=0)

In [None]:
termxdoc # checking it out...

In [None]:
# Our term x doc matrix provides a representation of the BLT corpus in terms of the 
# frequency of each term in each document. The rows represent the terms and the columns 
# represent the documents. The values in the matrix represent the frequency of each term 
# in each document.
#
# The matrix is sparse, as most terms do not appear in most documents.
# For example, let's find the frequency of the term 'grill' in each document:
termxdoc.loc['grill']

In [None]:
# Summing across axis 1 (columns) gives the term frequencies across the entire corpus
termxdoc.sum(axis=1).sort_values(ascending=False).head(10) # Top 10 terms by frequency; 'the' is the most frequent term

### Implement TF-IDF

Let's implement TF-IDF using the term-document matrix. As a reminder, TF-IDF produces a $|V| \times N$ matrix in which the weighted value $w_{t,d}$ for word $t$ in document $d$ combines term frequency $\text{tf}_{t,d}$ with the inverse document frequency $\text{idf}_t$:

$$
w_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t
$$

We will use logarithmically scaled term frequency:

$$
\text{tf}_{t,d} = \begin{cases}
                     1 +\log\text{count}(t,d) & \text{if count}(t,d) > 0 \\
                     0 & \text{otherwise}
                   \end{cases}
$$

The $\text{idf}$ is defined using the fraction $N/\text{df}_t$, where $N$ is the total number of documents in the collection, and $\text{df}_t$ is the number of documents in which term $t$ occurs. The fewer documents in which a term occurs, the higher this weight. The fraction takes its lowest value, 1, for terms that occur in all the documents.

We'll also logarithmically scale the inverse document frequency in our implementation:

$$
\text{idf}_t = \log \frac{N}{\text{df}_t} 
$$
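
As a worked example of the two formulas, the sketch below computes TF-IDF weights for a small hypothetical count matrix with NumPy. The counts and variable names are invented for illustration, and the natural logarithm is an assumption (it matches `np.log`); it is a sketch, not the reference solution.

```python
import numpy as np

# Hypothetical toy counts: 3 terms (rows) x 4 documents (columns)
counts = np.array([
    [2, 0, 1, 0],   # term occurs in 2 of the 4 documents
    [1, 1, 1, 1],   # term occurs in every document
    [0, 0, 0, 3],   # term occurs in a single document
], dtype=float)

# tf = 1 + log(count) where count > 0, and 0 otherwise
mask = counts > 0
tf = np.zeros_like(counts)
tf[mask] = 1 + np.log(counts[mask])

# idf = log(N / df), one value per term
n_docs = counts.shape[1]
df = mask.sum(axis=1)
idf = np.log(n_docs / df)

# Multiply every row of tf by that term's idf
weights = tf * idf[:, np.newaxis]
print(np.round(weights, 3))
```

Note that the term occurring in every document gets an idf of $\log 1 = 0$, so its whole row is zeroed out, while the rare term in the last row receives the largest weight.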

In [None]:
def tfidf(termxdoc: pd.DataFrame) -> pd.DataFrame:
    """This function takes a raw term x doc matrix and returns a term x doc matrix with tfidf values.
    Remember, the rows are the terms and the columns are the documents."""
    
    # Get the term frequency
    # np.log(0) evaluates to -inf (with a RuntimeWarning), so reset those cells to 0
    log_counts = np.log(termxdoc)
    log_counts[np.isinf(log_counts)] = 0.0
    # Note: as written, zero-count cells end up with tf = 1 rather than the 0 in the formula above
    tfs = 1 + log_counts
    
    # your code here
    raise NotImplementedError

    return tfidf

In [None]:
blt_tfidf = tfidf(termxdoc)

In [None]:
blt_tfidf.loc['grill'] # Check to see how the tfidf values for the term 'grill' have changed

In [None]:
assert np.isclose(blt_tfidf.loc['grill'].iloc[0], 10.725269122772522) # compare floats with a tolerance
print('TF-IDF implementation seems to be working!')

### Implement PPMI

In NLP, the pointwise mutual information (PMI) between a target word $i$ and a context word $j$ is defined as the log of the ratio between how often we observe $i$ and $j$ co-occurring and how often we would expect them to co-occur if they occurred independently:

$$
PMI(i, j) = \log\frac{\text{observed}(i, j)}{\text{expected}(i, j)} = \log\frac{P(i, j)}{P(i)P(j)}
$$

Given a word co-occurrence matrix $W$ where $W_{i,j}$ gives the number of times target word $w_i$ occurs with context word $w_j$, we have

$$
\text{observed}(i, j) = W_{i,j}
$$

and our $\text{expected}(i,j)$ value can be defined as:

$$
\text{expected}(i,j) = \frac{\text{rowsum}(i)\cdot\text{colsum}(j)}{\text{sum}(W)}
$$

Finally, we define *positive* PMI (PPMI) as follows:

$$
PPMI_{ij} = \max\left(\log\frac{\text{observed}(i,j)}{\text{expected}(i,j)},\ 0\right)
$$


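PMI is not usable for 0-count cells, since $\log 0 = -\infty$, and negative PMI values tend to be unreliable unless the corpus is very large. PPMI fixes both problems by clipping these values to 0.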
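
The definitions above can be traced on a small hypothetical co-occurrence matrix. The counts and names below (`W`, `ppmi_toy`) are made up for illustration, and the natural logarithm is an assumption; this is a sketch of the arithmetic rather than the reference solution.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for 3 words
W = np.array([
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
], dtype=float)

total = W.sum()
row_sums = W.sum(axis=1, keepdims=True)  # shape (3, 1): target-word totals
col_sums = W.sum(axis=0, keepdims=True)  # shape (1, 3): context-word totals

# expected(i, j) = rowsum(i) * colsum(j) / sum(W), computed for all cells at once
expected = row_sums * col_sums / total

# log(0) is -inf for zero-count cells; silence the warning and clip afterwards
with np.errstate(divide="ignore"):
    pmi = np.log(W / expected)
ppmi_toy = np.maximum(pmi, 0.0)

print(np.round(ppmi_toy, 3))
```

Here the zero-count diagonal cells come out as 0 instead of $-\infty$, and a cell whose observed count equals its expected count (such as row 0, column 2) also gets 0.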

TF-IDF weights term-document counts, whereas PPMI weights term-term co-occurrence counts. So, we've provided a term-term matrix for the BLT corpus. The counts are simple: $W_{i,j}$ is the number of times context word $w_j$ appears in the same review as target word $w_i$. To keep computation fast, we've truncated the matrix to the 3,000 most frequent terms (on both axes).

In [None]:
# Get the term x term matrix
termxterm = pd.read_csv('blt_termxterm.csv', index_col=0)

In [None]:
termxterm

In [None]:
termxterm.loc['grill'].sort_values(ascending=False)

In [None]:
def ppmi(termxterm: pd.DataFrame) -> pd.DataFrame:
    """This function takes a raw term x term matrix and returns a term x term matrix with PPMI values."""
    context_word_sums = termxterm.sum(axis=0) # column sums: total count for each context word
    sum_W = context_word_sums.sum() # sum of all co-occurrence counts in W
    target_word_sums = termxterm.sum(axis=1) # row sums: total count for each target word

    # your code here
    raise NotImplementedError
    return ppmi

In [None]:
blt_ppmi = ppmi(termxterm)

In [None]:
blt_ppmi.loc['grill'].sort_values(ascending=False) # Check to see how the PPMI values for the term 'grill' have changed

In [None]:
assert blt_ppmi.loc['grill'].iloc[0] == 0.0 # first cell of the 'grill' row, selected by position
print('PPMI implementation seems to be working!')

### Implement nearest neighbors

Now that we have a couple of different vector spaces for the BLT corpus, let's play around with them.

Implement a `nearest_neighbors` function that returns the `k` nearest neighbors for a term, given a vector space (matrix). Use cosine distance as the distance function:

$$
\text{cosine distance}(u,v) = 1 - \frac{u \cdot v}{\|u\|_2 \cdot \|v\|_2}
$$
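
For reference, the formula can be checked on a pair of small vectors. The helper name `cosine_distance` and the example vectors are chosen for illustration only; this sketch covers the distance itself, not the neighbor search.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = [1.0, 2.0]
print(cosine_distance(a, [2.0, 4.0]))    # same direction as a
print(cosine_distance(a, [11.0, 20.0]))  # nearly the same direction
print(cosine_distance(a, [2.0, 1.0]))    # a different direction
```

Cosine distance depends only on direction, not on length, which is why a long vector pointing roughly the same way as `a` counts as closer than a short vector pointing elsewhere.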

In [None]:
def nearest_neighbors(term: str, matrix: pd.DataFrame, k: int) -> pd.DataFrame:
    """This function takes a term, a matrix whose rows are term vectors, and a number of neighbors k, 
    and returns the k nearest neighbors of the term (excluding the term itself) using cosine distance."""
    # your code here
    raise NotImplementedError
    return neighbors

In [None]:
sample_matrix = pd.DataFrame(
    [[1.0, 2.0],
     [2.0, 1.0],
     [11.0, 20.0],
     [18.0, 1.0]],
    index=['a', 'b', 'c', 'd'],
    columns=['x', 'y'])
sample_matrix

In [None]:
neighbors_to_check = nearest_neighbors('a', sample_matrix, 1).index
assert 'c' in neighbors_to_check
neighbors_to_check = nearest_neighbors('a', sample_matrix, 2).index
assert 'b' in neighbors_to_check
print('Nearest neighbors implementation is working!')

In [None]:
nearest_neighbors('grill', blt_tfidf, 10) # Get the 10 nearest neighbors of the term 'grill' using the tfidf matrix

In [None]:
neighbors_to_check = nearest_neighbors('grill', blt_tfidf, 10).index
nns_tfidf = ['george', 'foreman', 'steaks', 'college', 'roommates', 'coming', 'usability', 'deceiving', 'hum', 'humid']
for nn in nns_tfidf:
    assert nn in neighbors_to_check

In [None]:
nearest_neighbors('grill', blt_ppmi, 10) # Get the 10 nearest neighbors of the term 'grill' using the PPMI matrix

In [None]:
neighbors_to_check = nearest_neighbors('grill', blt_ppmi, 10).index
nns_ppmi = ['george', 'foreman', 'plants', 'forty', 'steaks', 'bear', 'peninsula', 'freshly', 'housekeepers', 'culture']
for nn in nns_ppmi:
    assert nn in neighbors_to_check

## Part 2: Dimensionality reduction

We can capture latent relations among the words in our vocabulary by reducing the dimensions of our sparse matrices and getting dense embeddings. We'll explore a technique called Latent Semantic Analysis (LSA) to do this with our term-document matrix.

### Implement LSA

LSA uses SVD (singular value decomposition) to factor a matrix into three matrices:

$$
X_{m \times n} = T_{m \times r} S_{r \times r} V^T_{r \times n}
$$

$S$ is a diagonal matrix containing what are called the singular values. The entries along the diagonal of $S$ are non-negative and sorted in descending order.

For LSA, given a term-document matrix $X_{m \times n}$ with $m$ terms and $n$ documents, we keep only the $k$ largest singular values in $S$, along with the first $k$ columns of $T$:

$$
\text{LSA}(X_{m \times n}) = T_{m \times k} S_{k \times k}
$$

(We can throw away $V^T_{r \times n}$.)


Feel free to use a library for SVD, e.g. https://numpy.org/doc/2.1/reference/generated/numpy.linalg.svd.html.
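
As a small illustration of the truncation step, the sketch below runs `np.linalg.svd` on a hypothetical random count matrix and keeps the top $k$ components. The matrix, the seed, and the name `X_lsa` are assumptions made for the example; it is not the reference solution for the assignment.

```python
import numpy as np

# Hypothetical toy matrix: 6 terms x 4 documents of small counts
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 4)).astype(float)

# With full_matrices=False, T is 6 x 4, s is a 1-D array of 4 singular values
# (already sorted in descending order), and V_T is 4 x 4
T, s, V_T = np.linalg.svd(X, full_matrices=False)

# Keep the first k columns of T, scaled by the k largest singular values.
# Broadcasting T[:, :k] * s[:k] is equivalent to T[:, :k] @ np.diag(s[:k]).
k = 2
X_lsa = T[:, :k] * s[:k]

print(X_lsa.shape)  # one k-dimensional dense vector per term
```

Because NumPy returns the singular values as a 1-D array rather than a diagonal matrix, scaling the columns by broadcasting avoids building $S$ explicitly.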

In [None]:
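def lsa(matrix: pd.DataFrame, k: int) -> pd.DataFrame:
    """This function takes a term x doc matrix and a number of components k, and returns the LSA matrix."""
    # With full_matrices=False, T is m x r, S is a 1-D array of the r singular values
    # (sorted in descending order), and V_T is r x n
    T, S, V_T = np.linalg.svd(matrix, full_matrices=False)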
    # your code here
    raise NotImplementedError
    return lsa_matrix

In [None]:
# This may take a few minutes...
blt_lsa_50 = lsa(termxdoc, 50) # Get the LSA matrix with 50 components

In [None]:
# This one will take a while too...
blt_lsa_100 = lsa(termxdoc, 100) # Get the LSA matrix with 100 components

In [None]:
# And this one will take even longer. Hang in there!
blt_lsa_300 = lsa(termxdoc, 300) # Get the LSA matrix with 300 components

In [None]:
blt_lsa_100 # Our term x doc matrix is not so sparse anymore!

In [None]:
blt_lsa_100.loc['grill'] # The 'grill' term vector in the LSA matrix...

In [None]:
# Let's look at the nearest neighbors of the term 'grill' in our LSA matrices
nearest_neighbors('grill', blt_lsa_50, 10)

In [None]:
neighbors_to_check = nearest_neighbors('grill', blt_lsa_50, 10).index
nns_lsa50 = ['george', 'foreman', 'boxing', 'consummate', 'instantaneously', 'slapped', 'stupendous', 'champ', 'unto', 'steaks']
for nn in nns_lsa50:
    assert nn in neighbors_to_check
print('LSA (n=50) implementation seems to be working!')

In [None]:
nearest_neighbors('grill', blt_lsa_100, 10)

In [None]:
neighbors_to_check = nearest_neighbors('grill', blt_lsa_100, 10).index
nns_lsa100 = ['george', 'foreman', 'boxing', 'consummate', 'instantaneously', 'slapped', 'stupendous', 'champ', 'unto', 'steaks']
for nn in nns_lsa100:
    assert nn in neighbors_to_check
print('LSA (n=100) implementation seems to be working!')

In [None]:
nearest_neighbors('grill', blt_lsa_300, 10)

In [None]:
neighbors_to_check = nearest_neighbors('grill', blt_lsa_300, 10).index
nns_lsa300 = ['george', 'foreman', 'instantaneously', 'slapped', 'consummate', 'boxing', 'stupendous', 'scare', 'unto', 'champ']
for nn in nns_lsa300:
    assert nn in neighbors_to_check
print('LSA (n=300) implementation seems to be working!')

## Part 3: Sentiment Analysis

Let's see how we do on sentiment analysis on the BLT corpus using our LSA matrices. We can think of our LSA matrices as 50-, 100-, and 300-dimension word embeddings. Similar to our ngram language modeling approach to sentiment analysis, we can try to classify the sentiment of unseen reviews by finding reviews that are similar to known positive reviews or known negative reviews.

We'll represent our positive and negative reviews using our LSA word embeddings, and then, given an unseen review, we'll take a vote: compare it against the known reviews, and classify it as positive if the majority of the most similar reviews are positive, and as negative otherwise.
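
To illustrate the idea of voting, here is a minimal sketch of a plain k-nearest-neighbor majority vote over cosine similarity. The function name `knn_vote` and the toy vectors and labels are hypothetical, and this generic scheme may differ in detail from the `classify_sentiment` function provided later in the notebook.

```python
import numpy as np

def knn_vote(query, vectors, labels, k=3):
    """Label a query vector by majority vote among its k most similar vectors."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every labeled vector
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Indices of the k most similar vectors, most similar first
    nearest = np.argsort(sims)[::-1][:k]
    n_pos = sum(labels[i] == "pos" for i in nearest)
    return "pos" if n_pos > k / 2 else "neg"

# Hypothetical 2-D "review vectors" with known sentiment
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.8, 0.3]]
train_labels = ["pos", "pos", "neg", "neg", "pos"]

print(knn_vote([1.0, 0.0], train_vectors, train_labels, k=3))
print(knn_vote([0.0, 1.0], train_vectors, train_labels, k=3))
```

Using an odd `k` avoids ties in a two-class vote; with an even `k`, this sketch resolves ties as negative.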

### Representing reviews with LSA embeddings

Let's create a function `review2vec` that takes the text of a review and a matrix of embeddings, and turns it into a vector based on those embeddings.

We will tokenize the review and then find word embeddings for every word in the review. Then, we'll take the centroid of all the embeddings in the review: that is to say, we'll simply average all the word embeddings.
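
The averaging step can be sketched on a tiny hypothetical embedding table. The words, vectors, and names below (`toy_embeddings`, `centroid`) are invented for illustration, and skipping tokens that have no embedding is an assumption about how unknown words are handled.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-dimensional embeddings for three words
toy_embeddings = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    index=["good", "grill", "bad"],
)

tokens = ["good", "grill", "unknownword"]

# Keep only the tokens that have an embedding, then average their vectors
known = [t for t in tokens if t in toy_embeddings.index]
centroid = toy_embeddings.loc[known].to_numpy().mean(axis=0)

print(centroid)
```

If none of the tokens are in the table, there is nothing to average, so a real implementation needs a fallback such as returning a vector of zeros.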

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

def review2vec(review: str, matrix: pd.DataFrame) -> np.ndarray:
    """This function takes a review and a matrix of word embeddings, 
    and returns the vector representation of the review.
    The review vector is the average of the vectors of the 
    words in the review that are in the matrix."""
    # Get words in review
    # We're tokenizing the review using the same tokenizer used to build the term x doc matrix
    # Note: build_tokenizer() only splits the text into tokens (it does not lowercase them),
    # so some tokens may still be missing from the matrix; per the docstring, skip those
    words = vectorizer.build_tokenizer()(review)

    # Initialize review vector as zeros
    review_vector = np.zeros(matrix.shape[1])

    # your code here
    raise NotImplementedError

    return review_vector

In [None]:
vector_to_check = review2vec("This is a fake review", blt_lsa_50)
assert np.isclose(vector_to_check[0], -18.86910966512524) # compare floats with a tolerance
print("Review2vec implementation seems to be working!")

In [None]:
# Let's test our review2vec technique for sentiment classification
# First we need to read in all the reviews and their sentiment labels
# This may take a minute or two
import csv

# Map each embedding size to its LSA matrix (instead of looking names up via globals())
lsa_models = {50: blt_lsa_50, 100: blt_lsa_100, 300: blt_lsa_300}

data_and_annotations = {
    'train': {
        'pos': {'ids': [], 'vectors': {50: [], 100: [], 300: []}},
        'neg': {'ids': [], 'vectors': {50: [], 100: [], 300: []}}
    },
    'test': {'ids': [], 'sentiment': [], 'vectors': {50: [], 100: [], 300: []}}
}

# Training files: column 0 is the review id, column 1 is the review text
with open('positive_reviews_train.tsv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        id_, review = row[0], row[1]
        for n, model in lsa_models.items():
            data_and_annotations['train']['pos']['vectors'][n].append(review2vec(review, model))
        data_and_annotations['train']['pos']['ids'].append(id_)

with open('negative_reviews_train.tsv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        id_, review = row[0], row[1]
        for n, model in lsa_models.items():
            data_and_annotations['train']['neg']['vectors'][n].append(review2vec(review, model))
        data_and_annotations['train']['neg']['ids'].append(id_)

# Test file: column 2 additionally holds the gold sentiment label
with open('test.tsv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        id_, review, sentiment = row[0], row[1], row[2]
        for n, model in lsa_models.items():
            data_and_annotations['test']['vectors'][n].append(review2vec(review, model))
        data_and_annotations['test']['ids'].append(id_)
        data_and_annotations['test']['sentiment'].append(sentiment)

In [None]:
# The vector representation of the first positive review
data_and_annotations['train']['pos']['ids'][0], data_and_annotations['train']['pos']['vectors'][100][0]

In [None]:
def classify_sentiment(
    review_vector: np.ndarray, 
    positive_review_vectors: list, 
    negative_review_vectors: list, 
    k: int = 10) -> str:
    """This function takes a review vector and two lists of positive and negative review vectors, 
    and returns the sentiment of the review ('pos' or 'neg').
    The sentiment is determined by a vote between the k most similar positive reviews 
    and the k most similar negative reviews."""
    # Calculate the cosine similarity between the review vector and the positive and negative review vectors
    positive_sim = np.dot(positive_review_vectors, review_vector) / (np.linalg.norm(positive_review_vectors, axis=1) * np.linalg.norm(review_vector))
    negative_sim = np.dot(negative_review_vectors, review_vector) / (np.linalg.norm(negative_review_vectors, axis=1) * np.linalg.norm(review_vector))
    # Find the k most similar reviews in each class, sorted from most to least similar
    topk_positive = positive_sim.argsort()[-k:][::-1]
    topk_negative = negative_sim.argsort()[-k:][::-1]
    # Compare the two rankings position by position: the i-th most similar positive review
    # wins a vote if it is more similar than the i-th most similar negative review
    vote = np.sum(positive_sim[topk_positive] > negative_sim[topk_negative])
    # Return 'pos' only if the positive reviews win a strict majority of the k votes
    if vote > k/2:
        return 'pos'
    else:
        return 'neg'

In [None]:
# Let's define experimental conditions
ks = [1, 3, 5, 7, 10]
n_components = [50, 100, 300]
results = {}

In [None]:
# This may take a few minutes as well!
from copy import deepcopy
test_ids = data_and_annotations['test']['ids']
test_sentiment = data_and_annotations['test']['sentiment']
for n in n_components:
    print('Running experiments for LSA with {} components...'.format(n))
    results[n] = {}
    test_review_vectors = deepcopy(data_and_annotations['test']['vectors'][n])
    positive_review_vectors = deepcopy(data_and_annotations['train']['pos']['vectors'][n])
    negative_review_vectors = deepcopy(data_and_annotations['train']['neg']['vectors'][n])

    for k in ks:
        print('Running experiments for k = {}...'.format(k))
        results[n][k] = {}
        
        tp = 0
        fp = 0
        tn = 0
        fn = 0

        for t_i, t_v, t_s in zip(test_ids, test_review_vectors, test_sentiment):
            predicted_sentiment = classify_sentiment(t_v, positive_review_vectors, negative_review_vectors, k)
            if predicted_sentiment == 'pos':
                if t_s == 'pos':
                    tp += 1
                else:
                    fp += 1
            else:
                if t_s == 'neg':
                    tn += 1
                else:
                    fn += 1

        accuracy = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp) if tp + fp > 0 else 0
        recall = tp / (tp + fn) if tp + fn > 0 else 0
        f1 = 2 * ((precision * recall) / (precision + recall)) if precision + recall > 0 else 0

        results[n][k]['accuracy'] = accuracy
        results[n][k]['precision'] = precision
        results[n][k]['recall'] = recall
        results[n][k]['f1'] = f1
print('Done!')

In [None]:
for n in n_components:
    for k in ks:
        fscore = results[n][k]['f1']
        assert fscore > .6
print('The method seems to be working as it should!')

In [None]:
# plot the results
import matplotlib.pyplot as plt

ks_50 = list(results[50].keys())
ks_100 = list(results[100].keys())
ks_300 = list(results[300].keys())

f1s_50 = [results[50][k]['f1'] for k in ks]
f1s_100 = [results[100][k]['f1'] for k in ks]
f1s_300 = [results[300][k]['f1'] for k in ks]

plt.figure(figsize=(10, 5))
plt.plot(ks_50, f1s_50, marker='o', label='n = 50')
plt.plot(ks_100, f1s_100, marker='o', label='n = 100')
plt.plot(ks_300, f1s_300, marker='o', label='n = 300')
plt.xticks(ks)
plt.title('LSA with n components')
plt.xlabel('k')
plt.ylabel('F1')
plt.legend()
plt.show()

### Representing reviews with GloVe embeddings

Now, let's do the exact same thing, but use pre-trained embeddings. We'll use 100-dimensional [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings, but feel free to explore independently using embeddings from other sources.

How do you think our LSA vectors of the same size (n=100) will compare to GloVe? What about our larger (n=300) LSA vectors?

In [None]:
%pip install gensim

In [None]:
# Get pre-trained embeddings
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")

In [None]:
word_vectors['grill'] # just taking a peek

In [None]:
# Let's put the glove embeddings into a pd.DataFrame for compatibility with our previous functions
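# Build the frame directly from the embedding array (one row per word); this avoids
# creating a very wide intermediate DataFrame and transposing it
glove_matrix = pd.DataFrame(word_vectors.vectors, index=word_vectors.index_to_key)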

In [None]:
glove_matrix.loc['grill'] # The 'grill' term vector in the GloVe matrix - looks like our LSA embeddings now

In [None]:
# Let's redo our experiments with GloVe embeddings
data_and_annotations['train']['pos']['glove'] = []
data_and_annotations['train']['neg']['glove'] = []
data_and_annotations['test']['glove'] = []
with open('positive_reviews_train.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for i, row in enumerate(reader):
        review = row[1]
        glove_review_vector = review2vec(review, glove_matrix)
        data_and_annotations['train']['pos']['glove'].append(glove_review_vector)
with open('negative_reviews_train.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for i, row in enumerate(reader):
        review = row[1]
        glove_review_vector = review2vec(review, glove_matrix)
        data_and_annotations['train']['neg']['glove'].append(glove_review_vector)
with open('test.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for i, row in enumerate(reader):
        review = row[1]
        glove_review_vector = review2vec(review, glove_matrix)
        data_and_annotations['test']['glove'].append(glove_review_vector)

In [None]:
# Let's define experimental conditions
ks = [1, 3, 5, 7, 10]
results['glove-100'] = {}

print('Running experiments for GloVe embeddings...')
test_ids = data_and_annotations['test']['ids']
test_sentiment = data_and_annotations['test']['sentiment']

test_review_vectors = deepcopy(data_and_annotations['test']['glove'])
positive_review_vectors = deepcopy(data_and_annotations['train']['pos']['glove'])
negative_review_vectors = deepcopy(data_and_annotations['train']['neg']['glove'])

for k in ks:
    print('Running experiments for k = {}...'.format(k))
    results['glove-100'][k] = {}
    
    tp = 0
    fp = 0
    tn = 0
    fn = 0

    for t_i, t_v, t_s in zip(test_ids, test_review_vectors, test_sentiment):
        predicted_sentiment = classify_sentiment(t_v, positive_review_vectors, negative_review_vectors, k)
        if predicted_sentiment == 'pos':
            if t_s == 'pos':
                tp += 1
            else:
                fp += 1
        else:
            if t_s == 'neg':
                tn += 1
            else:
                fn += 1

    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * ((precision * recall) / (precision + recall)) if precision + recall > 0 else 0

    results['glove-100'][k]['accuracy'] = accuracy
    results['glove-100'][k]['precision'] = precision
    results['glove-100'][k]['recall'] = recall
    results['glove-100'][k]['f1'] = f1
print('Done!')

In [None]:
# plot the results, along with the LSA results
ks_glove = list(results['glove-100'].keys())
f1s_glove = [results['glove-100'][k]['f1'] for k in ks]

plt.figure(figsize=(10, 5))
plt.plot(ks_50, f1s_50, marker='o', label='LSA n = 50')
plt.plot(ks_100, f1s_100, marker='o', label='LSA n = 100')
plt.plot(ks_300, f1s_300, marker='o', label='LSA n = 300')
plt.plot(ks_glove, f1s_glove, marker='o', label='GloVe n = 100')
plt.xticks(ks)
plt.title('LSA vs GloVe embeddings')
plt.xlabel('k')
plt.ylabel('F1')
plt.legend()
plt.show()

## Conclusion

Looks like our LSA embeddings couldn't keep up with the pre-trained GloVe embeddings. To close out the homework, take some time to reflect and maybe speculate on why that would be: is it the way GloVe was trained? Or maybe it's the data GloVe was trained on?