# CS4765/6765 NLP Assignment 3: Word vectors

**Due 4 November at 23:59**

In this two-part assignment you will first examine and interact with word vectors. (This part of the assignment is adapted from a CS224N assignment at Stanford.) You will then implement a new approach to sentiment analysis.

In this assignment we will use [gensim](https://radimrehurek.com/gensim/) to access and interact with word embeddings. In gensim we’ll be working with a KeyedVectors object, which represents word embeddings. [Documentation for KeyedVectors is available.](https://radimrehurek.com/gensim/models/keyedvectors.html) However, this assignment description and the sample code in it might be sufficient to show you how to use a KeyedVectors object. We will use [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) that have been trained on Wikipedia and the Gigaword corpus.


In [2]:
import gensim.downloader
model = gensim.downloader.load('glove-wiki-gigaword-300')

# Part 1: Examining word vectors (8 marks)

## Polysemy and homonymy

Polysemy and homonymy are the phenomena of words having multiple meanings/senses. The nearest neighbours (under cosine similarity) for a given word can indicate whether it has multiple senses.
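
To make "nearest neighbours under cosine similarity" concrete, here is a minimal sketch in plain Python. The two-dimensional vectors and the names `toy_vectors` and `nearest_neighbours` are invented for illustration and are not part of gensim or of the GloVe model; real embeddings have hundreds of dimensions with no such tidy interpretation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d "embeddings" (assumption for illustration only):
# axis 0 stands for animal-like contexts, axis 1 for computer-like contexts.
toy_vectors = {
    'mouse':    [0.7, 0.7],    # a word used in both kinds of context
    'rat':      [1.0, 0.1],
    'keyboard': [0.2, 1.0],
    'treaty':   [-1.0, -0.2],  # unrelated to either sense
}

def nearest_neighbours(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = nearest_neighbours('mouse', toy_vectors)
print(neighbours)
```

In this toy setup the top neighbours of *mouse* come from both the animal axis and the computer axis, which is the pattern to look for in the real model's output.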

Consider the following example which shows the top-10 most similar words for *mouse*. The "input device" and "animal" senses of *mouse* are clearly visible from the top-10 most similar words. 


In [4]:
# Find the words most similar to "mouse" under cosine similarity.
# restrict_vocab=100000 limits the results to the 100000 most
# frequent words. This avoids rare words in the output. For this
# assignment, whenever you call most_similar, also pass
# restrict_vocab=100000.
model.most_similar('mouse', restrict_vocab=100000)

[('mice', 0.6210127472877502),
 ('rat', 0.5267991423606873),
 ('keyboard', 0.5248469114303589),
 ('rabbit', 0.5081881880760193),
 ('rodent', 0.49729210138320923),
 ('monkey', 0.4925020933151245),
 ('joystick', 0.4715430736541748),
 ('rats', 0.4617359936237335),
 ('cursor', 0.4608822166919708),
 ('cat', 0.45379096269607544)]

*keyboard*, *joystick*, and *cursor* correspond to the input device sense. *mice*, *rat*, *rabbit*, *rodent*, *monkey*, *rats*, and *cat* correspond to the animal sense. (You can observe something similar for the different senses of the word *leaves*.)

Find a new example that exhibits polysemy/homonymy, show its top-10 most similar words, and explain why they show that this word has multiple senses. Write your answer in the code and text boxes below.

In [6]:
# Write your code here

similar_words_bank = model.most_similar('bank', restrict_vocab=100000)
print(similar_words_bank)

[('banks', 0.7039026618003845), ('banking', 0.6014179587364197), ('central', 0.5375901460647583), ('credit', 0.5313779711723328), ('bankers', 0.5164543390274048), ('financial', 0.49996110796928406), ('investment', 0.49821463227272034), ('lending', 0.497078537940979), ('citibank', 0.4939170181751251), ('monetary', 0.4813266098499298)]


Write your answer here

All ten nearest neighbours of *bank* (*banks*, *banking*, *credit*, *bankers*, *lending*, *citibank*, and so on) belong to the financial-institution sense. The other common sense of the word, the land alongside a river, does not surface in the top 10, most likely because the financial sense dominates in the Wikipedia and Gigaword text the embeddings were trained on. *bank* is a homonym, but this output on its own shows only one of its senses, so unlike the *mouse* example the neighbours here do not make the multiple senses visible.

## Synonyms and antonyms

Find three words (w1, w2, w3) such that w1 and w2 are synonyms (i.e., have roughly the same meaning), and w1 and w3 are antonyms (i.e., have opposite meanings), but the similarity between w1 and w3 is greater than the similarity between w1 and w2. Note that this should be counter to your expectations, because synonyms (which mean roughly the same thing) would be expected to be more similar than antonyms (which have opposite meanings). Explain why you think this unexpected situation might have occurred.

Here is an example. w1 = *happy*, w2 = *cheerful*, and w3 = *sad*. (You will need to find a different example for your report.) Notice that the antonyms *happy* and *sad* are more similar than the (near) synonyms *happy* and *cheerful*.


In [9]:
# Find the cosine similarity between "happy" and "cheerful"
model.similarity('happy', 'cheerful')

0.44031656

In [10]:
# and between "happy" and "sad".
model.similarity('happy', 'sad')

0.5652857

In [11]:
# Write your code here

similarity_strong_powerful = model.similarity('strong', 'powerful')
print("Similarity between 'strong' and 'powerful':", similarity_strong_powerful)

similarity_strong_weak = model.similarity('strong', 'weak')
print("Similarity between 'strong' and 'weak':", similarity_strong_weak)

Similarity between 'strong' and 'powerful': 0.54768085
Similarity between 'strong' and 'weak': 0.6232478


Write your answer here

The similarity between "strong" and "weak" (0.623) is higher than the similarity between "strong" and "powerful" (0.548). This occurs because word embeddings are learned from the contexts in which words appear rather than from their definitions. Antonyms such as "strong" and "weak" tend to occur in very similar contexts (e.g., describing the same kinds of things, as in a strong or weak economy), so their vectors end up close together. The embeddings therefore reflect similarity of usage, not sameness of meaning.
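
The context-based explanation can be illustrated with a small sketch that builds raw co-occurrence count vectors. The corpus `toy_corpus` and the helper `context_counts` are invented for this example, and simple window counts are only a stand-in: GloVe is trained differently, so this shows the general intuition rather than how the model was built.

```python
import math
from collections import Counter

# Invented toy corpus (assumption): the antonyms share sentence frames,
# while the near-synonym "cheerful" happens to occur in a different one.
toy_corpus = [
    "i feel happy today",
    "i feel sad today",
    "she was happy about it",
    "she was sad about it",
    "a cheerful song played",
]

def context_counts(target, sentences, window=1):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

happy = context_counts('happy', toy_corpus)
sad = context_counts('sad', toy_corpus)
cheerful = context_counts('cheerful', toy_corpus)

sim_antonyms = cosine(happy, sad)
sim_synonyms = cosine(happy, cheerful)
print(sim_antonyms, sim_synonyms)
```

Because *happy* and *sad* fill the same slots in this corpus, their context vectors are closer to each other than either is to *cheerful*, even though the meanings are opposite.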

## Analogies

Analogies such as man is to king as woman is to X can be solved using word embeddings. This analogy can be expressed as X = woman + king − man. The following code snippet shows how to solve this analogy with gensim. Notice that the model gets it correct! I.e., *queen* is the most similar word.

In [14]:
# Find the model's predictions for the solution to the analogy
# "man" is to "king" as "woman" is to X
model.most_similar(positive=['woman', 'king'],
                   negative=['man'],
                   restrict_vocab=100000)

[('queen', 0.6713276505470276),
 ('princess', 0.5432624220848083),
 ('throne', 0.5386104583740234),
 ('monarch', 0.5347574949264526),
 ('daughter', 0.498025119304657),
 ('mother', 0.4956442713737488),
 ('elizabeth', 0.483265221118927),
 ('kingdom', 0.47747090458869934),
 ('prince', 0.4668239951133728),
 ('wife', 0.46473270654678345)]
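
The vector arithmetic X = woman + king − man can be sketched without gensim as follows. The two-dimensional vectors in `toy_vectors` and the function `solve_analogy` are hypothetical and chosen so that the arithmetic is easy to follow; this is not gensim's implementation, only an illustration of the idea of combining unit-length vectors and ranking the remaining words by cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors (assumption): axis 0 ~ royalty,
# axis 1 ~ gender (negative = male, positive = female).
toy_vectors = {
    'king':   [1.0, -1.0],
    'queen':  [1.0, 1.0],
    'man':    [0.0, -1.0],
    'woman':  [0.0, 1.0],
    'prince': [1.0, -0.8],
    'apple':  [-1.0, 0.0],
}

def solve_analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, and rank
    all other words by cosine similarity to the resulting vector."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[word]))]
    excluded = set(positive) | set(negative)  # input words are not candidates
    scores = [(w, cosine(target, v))
              for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = solve_analogy(['woman', 'king'], ['man'], toy_vectors)
print(ranking[0][0])
```

Excluding the input words from the candidates matters: without that step, the nearest word to the combined vector is often one of the inputs themselves.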

### Correct analogy

Find a new analogy that the model is able to answer correctly (i.e., the most-similar word is the solution to the analogy). Explain briefly why the analogy holds. For the above example, this explanation would be something along the lines of a king is a ruler who is a man and a queen is a ruler who is a woman.


In [16]:
# Write your code here

result = model.most_similar(positive=['woman', 'brother'], negative=['man'],
                            topn=1, restrict_vocab=100000)
print(result)

[('daughter', 0.7871028184890747)]


Write your answer here

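The intended analogy is man is to brother as woman is to *sister*: a brother is a male sibling and a sister is a female sibling, just as a man is a male adult and a woman is a female adult. Note, however, that the top result in the output above is *daughter*, not *sister*, so as shown this example is not one the model answers correctly.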

### Incorrect analogy

Find a new analogy that the model is not able to answer correctly. Again explain briefly why the analogy holds. For example, here is an analogy that the model does not answer correctly:


In [19]:
# Find the model's predictions for the solution to the analogy
# "finger" is to "hand" as "toe" is to X
model.most_similar(positive=['toe', 'hand'],
                   negative=['finger'],
                   restrict_vocab=100000)

[('boots', 0.45490798354148865),
 ('hands', 0.45022204518318176),
 ('shoes', 0.4483660161495209),
 ('wear', 0.44443702697753906),
 ('right', 0.4407408833503723),
 ('wearing', 0.4199027717113495),
 ('shoulder', 0.4070238471031189),
 ('back', 0.40581828355789185),
 ('legs', 0.40501439571380615),
 ('put', 0.4037577211856842)]

A finger is part of a hand, and a toe is part of a foot, but the model does not predict *foot*, or a similar term, as the most similar word.

In [21]:
# Write your code here

result = model.most_similar(positive=['plate', 'drink'], negative=['cup'], restrict_vocab=100000)
print(result)

[('drinks', 0.5382144451141357), ('plates', 0.5021848082542419), ('eat', 0.4757000207901001), ('beverage', 0.42327576875686646), ('cocktails', 0.40894368290901184), ('drinking', 0.4086286127567291), ('dessert', 0.4039856493473053), ('drank', 0.4038785994052887), ('beverages', 0.4010713994503021), ('bottles', 0.39608481526374817)]


Write your answer here

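This analogy holds because a cup is used for drink, while a plate is used for food, so the expected answer is *food*. The model instead returns *drinks* as the most similar word, and *food* does not appear in the top 10. Most of the predictions are simply close to *drink* (e.g., *drinks*, *beverage*, *drinking*), which suggests that the container-contents relation is not captured as a consistent direction in the embedding space.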

## Bias

Consider the examples below. The first shows the words that are most similar to *man* and *worker* and least similar to *woman*. The second shows the words that are most similar to *woman* and *worker* and least similar to *man*.

In [24]:
# Find the words that are most similar to "man" and "worker" and
# least similar to "woman".
model.most_similar(positive=['man', 'worker'],
                   negative=['woman'],
                   restrict_vocab=100000)

[('workers', 0.5640615820884705),
 ('employee', 0.5365461707115173),
 ('laborer', 0.48308447003364563),
 ('working', 0.4746786653995514),
 ('factory', 0.4493158757686615),
 ('mechanic', 0.4380266070365906),
 ('work', 0.4276600182056427),
 ('unemployed', 0.4274265766143799),
 ('worked', 0.4222966730594635),
 ('job', 0.42074185609817505)]

In [25]:
# Find the words that are most similar to "woman" and "worker" and
# least similar to "man".
model.most_similar(positive=['woman', 'worker'],
                   negative=['man'],
                   restrict_vocab=100000)

[('employee', 0.591515839099884),
 ('workers', 0.5560789108276367),
 ('nurse', 0.514857828617096),
 ('pregnant', 0.48975226283073425),
 ('mother', 0.48388367891311646),
 ('female', 0.46243950724601746),
 ('child', 0.4448588192462921),
 ('teacher', 0.44152435660362244),
 ('waitress', 0.44121506810188293),
 ('employer', 0.4378712773323059)]

The output shows that *man* is associated with some stereotypically male jobs (e.g., *laborer*, *mechanic*) while *woman* is associated with some stereotypically female jobs (e.g., *nurse*, *waitress*). This indicates that there is gender bias in the word embeddings.

Find a new example, using the same approach as above, that indicates that there is bias in the word embeddings. Briefly explain how the model output indicates that there is bias in the word embeddings. (You are by no means restricted to considering gender bias here. You are encouraged to explore other ways that embeddings might indicate bias.)

In [27]:
# Write your code here

young_worker_bias = model.most_similar(positive=['young', 'worker'], negative=['old'], restrict_vocab=100000)
print("Young worker bias:", young_worker_bias)

old_worker_bias = model.most_similar(positive=['old', 'worker'], negative=['young'], restrict_vocab=100000)
print("Old worker bias:", old_worker_bias)

Young worker bias: [('workers', 0.5862953662872314), ('migrant', 0.4609434902667999), ('employees', 0.46069473028182983), ('employee', 0.45043832063674927), ('working', 0.44794416427612305), ('skilled', 0.43947598338127136), ('child', 0.43197545409202576), ('female', 0.4251863360404968), ('unemployed', 0.42373546957969666), ('employment', 0.4184603989124298)]
Old worker bias: [('55-year', 0.5496258735656738), ('60-year', 0.5478719472885132), ('47-year', 0.5407528281211853), ('35-year', 0.5403686165809631), ('50-year', 0.5390866994857788), ('43-year', 0.5381680727005005), ('42-year', 0.5341634154319763), ('39-year', 0.5174708962440491), ('employee', 0.5164209008216858), ('27-year', 0.5144258141517639)]


Write your answer here

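With *young*, *worker* is associated with words such as *migrant*, *skilled*, *child*, *female*, and *unemployed*. With *old*, the neighbours are almost entirely age expressions such as *55-year* and *60-year* (as in "55-year-old"), with *employee* the only job-related word. The two lists differ, but the second is dominated by the literal use of *old* in age descriptions, so this query gives only weak evidence of age-related bias; the clearest asymmetry is that terms like *migrant*, *child*, and *unemployed* appear only on the *young* side.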

# Part 2: Sentiment Analysis (2 marks)

## Background and data

In this part of the assignment you will revisit sentiment analysis from assignment 2. You will need the data provided for that assignment. We will consider sentiment analysis using an average of word embeddings document representation and a logistic regression classifier and compare this to the approaches from assignment 2.



## Approach

We will consider sentiment analysis using an average of word embeddings document representation and a multinomial logistic regression classifier. We will compare this approach to the approaches from assignment 2.

Complete the function `vec_for_doc` below. (You should not modify other parts of the
code.) This function takes a list consisting of the tokens in a document $d$. It then returns a vector $\vec{v}$ representing the document as the average of the embeddings for the words in the document as follows:

\begin{equation}
d = w_1, w_2, \ldots, w_n
\end{equation}
\begin{equation}
\vec{v} = \dfrac{\vec{w_1} + \vec{w_2} + \cdots + \vec{w_n}}{n}
\end{equation}

If a word in a document does not occur in the word embedding model, you can simply ignore it. (Note that we would normally need to deal with the case of a document that consists entirely of words that don't occur in the embedding model, but for this dataset and embedding model, that situation does not occur, and so for now we won't worry about it.)
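
As a rough illustration of the averaging formula, here is a sketch that uses a plain dictionary in place of the gensim model. The names `toy_embeddings` and `average_embedding` and the three-dimensional vectors are invented for this example. One assumption is made explicit: tokens without an embedding are skipped, and the sum is divided by the number of tokens that were actually found.

```python
# Hypothetical 3-d embeddings (assumption); a real model has 300 dimensions.
toy_embeddings = {
    'good':  [1.0, 0.0, 2.0],
    'movie': [0.0, 1.0, 0.0],
    'bad':   [-1.0, 0.0, 2.0],
}

def average_embedding(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # The case the assignment says does not arise for its dataset.
        raise ValueError("no token in the document has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# 'zzz' has no embedding, so it is ignored.
doc_vector = average_embedding(['good', 'movie', 'zzz'], toy_embeddings)
print(doc_vector)
```

The result has the same length as a single word vector, which matches the hint in the starter code that the returned vector should have the same shape as a word vector.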

In [30]:
# TODO: Implement this function. tokenized_doc is a list of tokens in
# a document. Return a vector representation of the document as
# described above.
# Hints: 
# -You can get the vector for a word w using model[w] or
#  model.get_vector(w)
# -You can add vectors using + and sum, e.g.,
#  model['cat'] + model['dog']
#  sum([model['cat'], model['dog']])
# -You can see the shape of a vector using model['cat'].shape
# -The vector you return should have the same shape as a word vector 
# -This should be a very short function. If you're writing lots of
#  code, you are likely off track.
def vec_for_doc(tokenized_doc):
    # TODO: Add your code here

    # Delete the line below. It's only here so the starter code runs without error.
    return model['cat']

Once you've completed `vec_for_doc` above, run the code below to train logistic regression on the training data and evaluate on the dev data.

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Same tokenize function from A2
# A very simple tokenizer. Applies case folding. 
# (The documents we are working with have already been tokenized and each token is separated by whitespace.)
def tokenize(s):
    return s.lower().split()

# Load the A2 training and dev data
train_texts_fname = 'a2data/movie_reviews_train_docs.txt'
train_klasses_fname = 'a2data/movie_reviews_train_classes.txt'
dev_texts_fname = 'a2data/movie_reviews_dev_docs.txt'
dev_klasses_fname = 'a2data/movie_reviews_dev_classes.txt'

train_texts = [x.strip() for x in open(train_texts_fname, encoding='utf8')]
train_klasses = [x.strip() for x in open(train_klasses_fname, encoding='utf8')]
dev_texts = [x.strip() for x in open(dev_texts_fname, encoding='utf8')]
dev_klasses = [x.strip() for x in open(dev_klasses_fname, encoding='utf8')]

# A helper function from A2 to print out macro-average P,R,F1 and accuracy.
# Uses implementations of evaluation metrics from sklearn.
def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels, 
                                              predicted_labels, 
                                              average='macro', 
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()

# train_vecs and dev_vecs are lists; each element is a vector
# representing a (train or dev) document
train_vecs = [vec_for_doc(tokenize(x)) for x in train_texts]
dev_vecs = [vec_for_doc(tokenize(x)) for x in dev_texts]

# Train logistic regression, same as A2
lr = LogisticRegression(multi_class='multinomial',
                        solver='sag',
                        penalty='l2',
                        max_iter=2000,
                        random_state=0)
clf = lr.fit(train_vecs, train_klasses)
dev_predictions = clf.predict(dev_vecs)

print_results(dev_klasses, dev_predictions)




Precision:  0.24375
Recall:  0.5
F1:  0.3277310924369748
Accuracy:  0.4875
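
A macro recall of exactly 0.5 together with a macro precision equal to half the accuracy is the pattern produced when every document receives the same predicted class in a two-class problem. The sketch below reproduces that pattern on invented labels; `macro_precision_recall`, `toy_gold`, and `toy_pred` are hypothetical names, and the function is a simplified stand-in for the sklearn metric, not its implementation.

```python
def macro_precision_recall(gold, predicted):
    """Macro-averaged precision and recall over all labels seen."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted)
                 if g == label and p == label)
        n_pred = sum(1 for p in predicted if p == label)
        n_gold = sum(1 for g in gold if g == label)
        # Score 0 when a label is never predicted (like zero_division=0).
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_gold if n_gold else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Invented labels (assumption): a classifier that always answers 'pos'.
toy_gold = ['pos', 'neg', 'pos', 'neg', 'neg']
toy_pred = ['pos'] * len(toy_gold)

precision, recall = macro_precision_recall(toy_gold, toy_pred)
accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
print(precision, recall, accuracy)
```

Here the always-predicted class has recall 1 and the other class has recall 0, which averages to 0.5, while the never-predicted class contributes a precision of 0.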





Finally, evaluate on the test data.

In [34]:
test_texts_fname = 'a2data/movie_reviews_test_docs.txt'
test_klasses_fname = 'a2data/movie_reviews_test_classes.txt'

test_texts = [x.strip() for x in open(test_texts_fname, encoding='utf8')]
test_klasses = [x.strip() for x in open(test_klasses_fname, encoding='utf8')]

test_vecs = [vec_for_doc(tokenize(x)) for x in test_texts]
test_predictions = clf.predict(test_vecs)

print_results(test_klasses, test_predictions)

Precision:  0.24125
Recall:  0.5
F1:  0.3254637436762226
Accuracy:  0.4825



Run the code below to replicate the test results for logistic regression from A2.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(analyzer=tokenize)
train_counts = count_vectorizer.fit_transform(train_texts)
test_counts = count_vectorizer.transform(test_texts)

lr_A2 = LogisticRegression(multi_class='multinomial',
                           solver='sag',
                           penalty='l2',
                           max_iter=2000,
                           random_state=0)
clf_A2 = lr_A2.fit(train_counts, train_klasses)

A2_test_predictions = clf_A2.predict(test_counts)

print_results(test_klasses, A2_test_predictions)



Precision:  0.8634259259259259
Recall:  0.8615428900402993
F1:  0.8620438825868026
Accuracy:  0.8625
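
For comparison with the averaged-embedding representation, the word-count representation can be sketched in a few lines of plain Python. The documents in `toy_docs` and the helper `count_vector` are invented for illustration; this mimics the idea behind a count-based document representation and is not the sklearn implementation.

```python
from collections import Counter

def tokenize(s):
    """Same simple tokenizer as in the assignment: lowercase, then split."""
    return s.lower().split()

# Invented example documents (assumption).
toy_docs = ["a great great film", "a dull film"]

# One dimension per distinct token observed in the documents.
vocab = sorted({tok for doc in toy_docs for tok in tokenize(doc)})

def count_vector(doc, vocab):
    """Represent a document by how often each vocabulary word occurs."""
    counts = Counter(tokenize(doc))
    return [counts[tok] for tok in vocab]

doc_vectors = [count_vector(doc, vocab) for doc in toy_docs]
print(vocab)
print(doc_vectors)
```

Each document becomes a vector as long as the vocabulary, so individual words such as *great* or *dull* get their own feature, whereas the embedding average compresses the whole document into a single dense vector.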



Compare the results on the test data here to the results using logistic regression on the test data for assignment 2. The difference between these two approaches is the document representation. In this assignment we used a document representation based on the average of word embeddings. In assignment 2 we used a document representation based on word counts. Which method performs better?

TODO: Write your answer here

Comparing the two test-set results above, logistic regression with the word-count representation from assignment 2 performs far better than logistic regression with the average-of-embeddings representation as run here. The word-count model achieves a precision of 0.8634, recall of 0.8615, F1 of 0.8620, and accuracy of 0.8625, whereas the embedding-based model achieves a precision of 0.24125, recall of 0.5, F1 of 0.3255, and accuracy of 0.4825. The embedding-based scores are what a classifier that predicts a single class for every document produces with two classes (macro recall of exactly 0.5, macro precision equal to half the accuracy). This is consistent with `vec_for_doc` as shown above still returning the placeholder `model['cat']`, which gives every document the same vector, so these numbers do not yet measure the averaged-embedding approach itself.

# Submitting your work

When you're done, submit a3.ipynb to the assignment 3 folder on D2L by the deadline.