# CS4765/6765 NLP Assignment 3: Word vectors

**Due 4 November at 23:59**

In this two-part assignment you will first examine and interact with word vectors. (This part of the assignment is adapted from a CS224N assignment at Stanford.) You will then implement a new approach to sentiment analysis.

In this assignment we will use [gensim](https://radimrehurek.com/gensim/) to access and interact with word embeddings. In gensim we’ll be working with a KeyedVectors object, which represents word embeddings. [Documentation for KeyedVectors is available.](https://radimrehurek.com/gensim/models/keyedvectors.html) However, this assignment description and the sample code in it might be sufficient to show you how to use a KeyedVectors object. We will use [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) that have been trained on Wikipedia and the Gigaword corpus.


In [None]:
import gensim.downloader
model = gensim.downloader.load('glove-wiki-gigaword-300')

# Part 1: Examining word vectors (8 marks)

## Polysemy and homonymy

Polysemy and homonymy are the phenomena of words having multiple meanings/senses. The nearest neighbours (under cosine similarity) for a given word can indicate whether it has multiple senses.
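
As a rough illustration of what "nearest neighbours under cosine similarity" means, here is a minimal sketch in plain Python. The 2-dimensional vectors and the helper names `cosine_similarity` and `nearest_neighbours` are invented for this example; they are not part of gensim, and the real GloVe vectors used in this assignment have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D "embeddings", made up for illustration only.
# Dimension 0 loosely means "computing", dimension 1 loosely means "animal".
toy_vectors = {
    "mouse": [1.0, 1.0],
    "keyboard": [1.0, 0.1],
    "rat": [0.1, 1.0],
    "banana": [-1.0, 0.2],
}

def nearest_neighbours(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = nearest_neighbours("mouse", toy_vectors)
print(neighbours)
```

In this toy setup the vector for *mouse* sits between a "computing" direction and an "animal" direction, so its nearest neighbours come from both groups. This is the kind of pattern to look for in real model output.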

Consider the following example which shows the top-10 most similar words for *mouse*. The "input device" and "animal" senses of *mouse* are clearly visible from the top-10 most similar words. 


In [None]:
# Find words most similar using cosine similarity to "mouse". 
# restrict_vocab=100000 limits the results to the 100000 most
# frequent words. This avoids rare words in the output. For this
# assignment, whenever you call most_similar, also pass
# restrict_vocab=100000.
model.most_similar('mouse', restrict_vocab=100000)

*keyboard*, *joystick*, and *cursor* correspond to the input device sense. *mice*, *rat*, *rabbit*, *rodent*, *monkey*, *rats*, and *cat* correspond to the animal sense. (You can observe something similar for the different senses of the word *leaves*.)

Find a new example that exhibits polysemy/homonymy, show its top-10 most similar words, and explain why they show that this word has multiple senses. Write your answer in the code and text boxes below.

In [None]:
# Write your code here

Write your answer here

## Synonyms and antonyms

Find three words (w1, w2, w3) such that w1 and w2 are synonyms (i.e., have roughly the same meaning), and w1 and w3 are antonyms (i.e., have opposite meanings), but the similarity between w1 and w3 is greater than the similarity between w1 and w2. Note that this should be counter to your expectations, because synonyms (which mean roughly the same thing) would be expected to be more similar than antonyms (which have opposite meanings). Explain why you think this unexpected situation might have occurred.

Here is an example. w1 = *happy*, w2 = *cheerful*, and w3 = *sad*. (You will need to find a different example for your report.) Notice that the antonyms *happy* and *sad* are more similar than the (near) synonyms *happy* and *cheerful*.


In [None]:
# Find the cosine similarity between "happy" and "cheerful"
model.similarity('happy', 'cheerful')


In [None]:
# and between "happy" and "sad".
model.similarity('happy', 'sad')


In [None]:
# Write your code here

Write your answer here

## Analogies

Analogies such as man is to king as woman is to X can be solved using word embeddings. This analogy can be expressed as X = woman + king − man. The following code snippet shows how to solve this analogy with gensim. Notice that the model gets it correct! I.e., *queen* is the most similar word.

In [None]:
# Find the model's predictions for the solution to the analogy
# "man" is to "king" as "woman" is to X
model.most_similar(positive=['woman', 'king'],
                   negative=['man'],
                   restrict_vocab=100000)
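
To make the vector arithmetic X = woman + king − man concrete, here is a minimal sketch in plain Python. The 2-dimensional vectors and the helper `solve_analogy` are hypothetical and invented for illustration. gensim's `most_similar` is more involved than this, so treat the sketch as the idea rather than the library's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D vectors, made up for illustration only.
# Dimension 0 loosely means "royalty", dimension 1 loosely means "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}

def solve_analogy(a, b, c, vectors):
    """Solve "a is to b as c is to X": return the word closest to b - a + c.

    The three query words are excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates,
               key=lambda w: cosine_similarity(target, candidates[w]))

answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

Here king − man + woman gives (1, −1), which points in exactly the same direction as the toy vector for *queen*, so *queen* is returned.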


### Correct analogy

Find a new analogy that the model is able to answer correctly (i.e., the most-similar word is the solution to the analogy). Explain briefly why the analogy holds. For the above example, this explanation would be something along the lines of a king is a ruler who is a man and a queen is a ruler who is a woman.


In [None]:
# Write your code here

Write your answer here

### Incorrect analogy

Find a new analogy that the model is not able to answer correctly. Again explain briefly why the analogy holds. For example, here is an analogy that the model does not answer correctly:


In [None]:
# Find the model's predictions for the solution to the analogy
# "finger" is to "hand" as "toe" is to X
model.most_similar(positive=['toe', 'hand'],
                   negative=['finger'],
                   restrict_vocab=100000)


A finger is part of a hand, and a toe is part of a foot, but the model does not predict *foot*, or a similar term, as the most similar word.

In [None]:
# Write your code here

Write your answer here

## Bias

Consider the examples below. The first shows the words that are most similar to *man* and *worker* and least similar to *woman*. The second shows the words that are most similar to *woman* and *worker* and least similar to *man*.

In [None]:
# Find the words that are most similar to "man" and "worker" and
# least similar to "woman".
model.most_similar(positive=['man', 'worker'],
                   negative=['woman'],
                   restrict_vocab=100000)



In [None]:
# Find the words that are most similar to "woman" and "worker" and
# least similar to "man".
model.most_similar(positive=['woman', 'worker'],
                   negative=['man'],
                   restrict_vocab=100000)



The output shows that *man* is associated with some stereotypically male jobs (e.g., *mechanic*) while *woman* is associated with some stereotypically female jobs (e.g., *nurse*, *receptionist*, *housewife*, *registered_nurse*). This indicates that there is gender bias in the word embeddings.

Find a new example, using the same approach as above, that indicates that there is bias in the word embeddings. Briefly explain how the model output indicates that there is bias in the word embeddings. (You are by no means restricted to considering gender bias here. You are encouraged to explore other ways that embeddings might indicate bias.)

In [None]:
# Write your code here

Write your answer here

# Part 2: Sentiment Analysis (2 marks)

## Background and data

In this part of the assignment you will revisit sentiment analysis from assignment 2. You will need the data provided for that assignment. We will consider sentiment analysis using an average of word embeddings document representation and a logistic regression classifier and compare this to the approaches from assignment 2.



## Approach

We will consider sentiment analysis using an average of word embeddings document representation and a multinomial logistic regression classifier. We will compare this approach to the approaches from assignment 2.

Complete the function `vec_for_doc` below. (You should not modify other parts of the code.) This function takes a list consisting of the tokens in a document $d$. It then returns a vector $\vec{v}$ representing the document as the average of the embeddings for the words in the document as follows:

\begin{equation}
d = w_1, w_2, \ldots, w_n
\end{equation}
\begin{equation}
\vec{v} = \dfrac{\vec{w_1} + \vec{w_2} + \cdots + \vec{w_n}}{n}
\end{equation}

If a word in a document does not occur in the word embedding model, you can simply ignore it. (Note that we would normally need to deal with the case of a document that consists entirely of words that don't occur in the embedding model, but for this dataset and embedding model, that situation does not occur, and so for now we won't worry about it.)
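
As a small worked example of the averaging formula, assume a hypothetical two-word document $d = \text{good}, \text{movie}$ with made-up 2-dimensional embeddings $\vec{w_1} = (1, 3)$ and $\vec{w_2} = (3, 1)$. (These numbers are invented for illustration; the GloVe vectors used here have 300 dimensions.)

\begin{equation}
\vec{v} = \dfrac{(1, 3) + (3, 1)}{2} = \dfrac{(4, 4)}{2} = (2, 2)
\end{equation}

Note that the result has the same shape as a single word vector.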

In [None]:
# TODO: Implement this function. tokenized_doc is a list of tokens in
# a document. Return a vector representation of the document as
# described above.
# Hints: 
# -You can get the vector for a word w using model[w] or
#  model.get_vector(w)
# -You can add vectors using + and sum, e.g.,
#  model['cat'] + model['dog']
#  sum([model['cat'], model['dog']])
# -You can see the shape of a vector using model['cat'].shape
# -The vector you return should have the same shape as a word vector 
# -This should be a very short function. If you're writing lots of
#  code, you are likely off track.
def vec_for_doc(tokenized_doc):
    # TODO: Add your code here

    
    # Delete the line below. It's only here so the starter code runs without error.
    return model['cat']    


Once you've completed `vec_for_doc` above, run the code below to train logistic regression on the training data and evaluate on the dev data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Same tokenize function from A2
# A very simple tokenizer. Applies case folding. 
# (The documents we are working with have already been tokenized and each token is separated by whitespace.)
def tokenize(s):
    return s.lower().split()

# Load the A2 training and dev data
train_texts_fname = 'a2data/movie_reviews_train_docs.txt'
train_klasses_fname = 'a2data/movie_reviews_train_classes.txt'
dev_texts_fname = 'a2data/movie_reviews_dev_docs.txt'
dev_klasses_fname = 'a2data/movie_reviews_dev_classes.txt'

train_texts = [x.strip() for x in open(train_texts_fname, encoding='utf8')]
train_klasses = [x.strip() for x in open(train_klasses_fname, encoding='utf8')]
dev_texts = [x.strip() for x in open(dev_texts_fname, encoding='utf8')]
dev_klasses = [x.strip() for x in open(dev_klasses_fname, encoding='utf8')]

# A helper function from A2 to print out macro-average P,R,F1 and accuracy.
# Uses implementations of evaluation metrics from sklearn.
def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels, 
                                              predicted_labels, 
                                              average='macro', 
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()

# train_vecs and dev_vecs are lists; each element is a vector
# representing a (train or dev) document
train_vecs = [vec_for_doc(tokenize(x)) for x in train_texts]
dev_vecs = [vec_for_doc(tokenize(x)) for x in dev_texts]

# Train logistic regression, same as A2
lr = LogisticRegression(multi_class='multinomial',
                        solver='sag',
                        penalty='l2',
                        max_iter=2000,
                        random_state=0)
clf = lr.fit(train_vecs, train_klasses)
dev_predictions = clf.predict(dev_vecs)

print_results(dev_klasses, dev_predictions)


Finally, evaluate on the test data.

In [None]:
test_texts_fname = 'a2data/movie_reviews_test_docs.txt'
test_klasses_fname = 'a2data/movie_reviews_test_classes.txt'

test_texts = [x.strip() for x in open(test_texts_fname, encoding='utf8')]
test_klasses = [x.strip() for x in open(test_klasses_fname, encoding='utf8')]

test_vecs = [vec_for_doc(tokenize(x)) for x in test_texts]
test_predictions = clf.predict(test_vecs)

print_results(test_klasses, test_predictions)

Run the code below to replicate the test results for logistic regression from A2.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(analyzer=tokenize)
train_counts = count_vectorizer.fit_transform(train_texts)
test_counts = count_vectorizer.transform(test_texts)

lr_A2 = LogisticRegression(multi_class='multinomial',
                           solver='sag',
                           penalty='l2',
                           max_iter=2000,
                           random_state=0)
clf_A2 = lr_A2.fit(train_counts, train_klasses)

A2_test_predictions = clf_A2.predict(test_counts)

print_results(test_klasses, A2_test_predictions)
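
The cell above builds count-based document vectors with `CountVectorizer`. As a rough picture of what such a representation looks like, here is a minimal sketch in plain Python. The two-document corpus and the helper `count_vector` are hypothetical and invented for illustration. This is not how `CountVectorizer` is implemented, and it returns plain lists rather than sparse matrices.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical two-document corpus, made up for illustration only.
docs = [
    "a good good movie".split(),
    "a bad movie".split(),
]

# The vocabulary is every distinct token seen in the corpus.
vocabulary = sorted({token for doc in docs for token in doc})
vectors = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

A count-based vector has one dimension per vocabulary word, so its length grows with the vocabulary. An averaged-embedding vector always has the dimensionality of the embedding model (300 for `glove-wiki-gigaword-300`).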

Compare the results on the test data here to the results using logistic regression on the test data for assignment 2. The difference between these two approaches is the document representation. In this assignment we used a document representation based on the average of word embeddings. In assignment 2 we used a document representation based on word counts. Which method performs better?

TODO: Write your answer here

# Submitting your work

When you're done, submit a3.ipynb to the assignment 3 folder on D2L by the deadline.