1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help readers understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also makes use of co-occurrence counts) have demonstrated better performance. Here, we explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
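To make the ranking step concrete, here is a minimal sketch of what "rank all other words by cosine similarity" means. It does not use GloVe or GenSim: `top_similar` and `toy_vectors` are hypothetical names, and the 2-D vectors are invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(word, vectors, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-D vectors (assumption: made-up values, not real GloVe data).
toy_vectors = {
    "mouse": [1.0, 1.0],
    "rat": [0.9, 0.3],
    "keyboard": [0.2, 0.9],
    "summer": [-1.0, 0.1],
}

ranking = top_similar("mouse", toy_vectors, topn=2)
print(ranking)
```

In this toy space "mouse" sits between an "animal" axis and a "computer" axis, so neighbors from both senses can make the top of the ranking, which is the situation the exercise asks you to find in the real vectors.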

In [4]:
### CODE HERE
# Function to get the top 10 similar words using cosine similarity
def get_top_similar_words(word):
    similar_words = wv_from_bin.most_similar(word, topn=10)
    return [word[0] for word in similar_words]

In [11]:
# Find a word with multiple meanings in the top 10 similar words
word = "bat"
print(get_top_similar_words(word))

No word with multiple meanings found in the top 10 similar words.
['bats', 'batting', 'balls', 'batted', 'toss', 'wicket', 'pitch', 'bowled', 'hitter', 'batsman']


In [12]:
word = "bank"
print(get_top_similar_words(word))

['banks', 'banking', 'central', 'financial', 'credit', 'lending', 'monetary', 'bankers', 'loans', 'investment']


In [16]:
word = "mouse"
print(get_top_similar_words(word))

['mice', 'keyboard', 'rat', 'rabbit', 'cat', 'cursor', 'trackball', 'joystick', 'mickey', 'clicks']


In [17]:
word = "run"
print(get_top_similar_words(word))

['running', 'runs', 'ran', 'went', 'start', 'allowed', 'out', 'go', 'going', 'first']


In [15]:
word = "spring"
print(get_top_similar_words(word))

['summer', 'autumn', 'winter', 'fall', 'beginning', 'starting', 'year', 'start', 'next', 'during']


### SOLUTION

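I tested these words:

*   **bat:** a wooden club used in sports; a flying mammal
*   **bank:** a financial institution; the side of a river
*   **mouse:** a computer peripheral; a small rodent
*   **run:** to move quickly on foot; to operate or manage
*   **spring:** a season; a natural source of water; to jump

The word that works is **mouse**: its top 10 contains both the animal sense (`mice`, `rat`, `rabbit`, `cat`) and the computer-peripheral sense (`keyboard`, `cursor`, `trackball`, `joystick`, `clicks`).

The other words did not work because each word has a single vector, which is shaped by all the contexts in which the word occurs. When one sense dominates the training corpus (for example, the financial sense of `bank` or the cricket and baseball sense of `bat`), the neighbors of that sense crowd out the others. Inflected forms such as `banks` and `banking` also take up slots in the top 10.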








### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.
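As a quick illustration of the definition above (Cosine Distance = 1 - Cosine Similarity), the following sketch computes the distance for three hand-picked vector pairs. The function name `cosine_distance` and the example vectors are assumptions for illustration; the assignment itself uses `wv_from_bin.distance`.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Parallel vectors: similarity 1, so distance is (approximately) 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0, so distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
# Opposite vectors: similarity -1, so distance is 2.
opposite = cosine_distance([1.0, 0.0], [-2.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Cosine distance therefore ranges from 0 (same direction) to 2 (opposite direction), and it ignores vector length.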

In [19]:
w1 = "large"
w2 = "big"
w3 = "small"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms large, big have cosine distance: 0.3340132236480713
Antonyms large, small have cosine distance: 0.1497916579246521


### SOLUTION

This counter-intuitive result follows from how the vectors are trained. GloVe learns from co-occurrence statistics, so words that appear in similar contexts end up close together, whether or not they mean the same thing. Antonyms such as "large" and "small" are almost interchangeable in context ("a large number of", "a small number of", "large and small businesses"), so their vectors are very similar and their cosine distance is low (about 0.15).

"large" and "big" are synonyms, but they are used somewhat differently. "big" is more informal and appears in many idiomatic phrases (for example, "big deal" or "big brother") where "large" would not be used, so their context distributions overlap less and their cosine distance is higher (about 0.33).

In short, word vectors capture distributional similarity rather than sameness of meaning, so antonymy is not encoded as a large distance.

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [20]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION

We are maximizing the cosine similarity between $x$ and the vector

$g - m + w$

That is, we subtract the vector for `man` from the vector for `grandfather` and then add the vector for `woman`. The difference $g - m$ captures the "grandparent" offset, and adding it to $w$ should land near the female counterpart of `grandfather`. The `most_similar` function returns the words whose vectors have the highest cosine similarity to this target vector.

**Drawing a 2D example:**

Let's assume that `man` lies at coordinates (1, 1) and `grandfather` lies at (1, 3) in the 2D coordinate plane. The arrow from `man` to `grandfather` is $g - m$ $=$ (0, 2), which represents the relationship between them.

Suppose `woman` lies at (3, 1), so that the arrow from `man` to `woman`, $w - m$ $=$ (2, 0), represents the gender difference. The answer $x$ should relate to `woman` as `grandfather` relates to `man`, so we start at `woman` and follow the same arrow $g - m$. The four words then form a parallelogram.

So, in this case, the target vector is:

$g - m + w$ $=$ (1, 3) $-$ (1, 1) $+$ (3, 1) $=$ (3, 3)

The word whose vector has the highest cosine similarity to (3, 3) is the answer to the analogy.
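The vector arithmetic can also be sketched in a few lines of plain Python. This is an illustrative sketch, not GenSim's implementation: `solve_analogy` and `toy_vectors` are hypothetical, the coordinates are made up, and, as far as I understand, GenSim's `most_similar` additionally length-normalizes the input vectors, which this sketch skips.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """For a : b :: c : x, return the word closest to b - a + c.

    The three query words are excluded from the candidates.
    """
    target = [bi - ai + ci
              for ai, bi, ci in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates,
               key=lambda w: cosine_similarity(candidates[w], target))

# Hypothetical 2-D vectors (assumption: invented coordinates, not GloVe data).
toy_vectors = {
    "man": [1.0, 1.0],
    "woman": [3.0, 1.0],
    "grandfather": [1.0, 3.0],
    "grandmother": [3.0, 3.0],
    "daughter": [3.0, 0.5],
}

answer = solve_analogy("man", "grandfather", "woman", toy_vectors)
print(answer)  # grandmother
```

Here the target vector is (1, 3) - (1, 1) + (3, 1) = (3, 3), which coincides with the toy `grandmother` vector, while `daughter` points in a noticeably different direction.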

### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother".

### SOLUTION

The `most_similar` function in word vector models like `Word2Vec` or `GloVe` uses cosine similarity to find words that are most similar to a given vector. When we use the expression `man:grandfather::woman:x` and calculate $x$ $=$ $g$ $-$ $m$ $+$ $w$, the resulting vector $x$ represents the relationship between `woman` and the missing word in the analogy.

In this case, the vector $x$ $=$ $g$ $-$ $m$ $+$ $w$ captures the relationship `woman is to x` that is analogous to `man is to grandfather`. The resulting vector $x$ preserves some semantic and directional information based on the relationship between `man` and `grandfather`.

However, the `most_similar` function is not limited to finding exact matches or strict analogies. It simply returns the words whose vectors have the highest cosine similarity with the target vector, whether or not they complete the analogy.

In the example, `grandmother` is the most intuitive word to complete the analogy. However, words like `granddaughter`, `daughter`, and `mother` are also returned because their vectors lie close to the target vector. They are all female family terms, so they share both the "female" and the "family" components of $x$ and differ from `grandmother` mainly in generation. They also occur in similar contexts in the training corpus, which leads to high cosine similarity scores.

It also suggests that the word vectors capture broader semantic associations related to family relationships and gender. These associations can influence the cosine similarity scores and result in those words being considered as similar to the vector `x`.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [21]:
x, y, a, b = "king", "queen", "man", "woman"
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

In [23]:
# Define the analogy
x, y, a, b = "dog", "puppy", "cat", "kitten"

# Find the word that completes the analogy
result = wv_from_bin.most_similar(positive=[a, y], negative=[x])

# Check if the intended word is ranked at the top
if result[0][0] == b:
    print("The analogy holds: {}:{} :: {}:{}".format(x, y, a, b))
else:
    print("The analogy does not hold.")

# Print the full result for reference
print(result)

The analogy does not hold.
[('puppies', 0.6142244935035706), ('kitten', 0.5919069647789001), ('kittens', 0.5758378505706787), ('scaredy', 0.5365902185440063), ('adorable', 0.5015590786933899), ('skinny', 0.47961774468421936), ('cute', 0.4787493348121643), ('poodle', 0.4751545190811157), ('purr', 0.471113383769989), ('feline', 0.468252956867218)]


In [26]:
# Define the analogy
x, y, a, b = "dog", "puppies", "cat", "kittens"

# Find the word that completes the analogy
result = wv_from_bin.most_similar(positive=[a, y], negative=[x])

# Check if the intended word is ranked at the top
if result[0][0] == b:
    print("The analogy holds: {}:{} :: {}:{}".format(x, y, a, b))
else:
    print("The analogy does not hold.")

# Print the full result for reference
print(result)

The analogy holds: dog:puppies :: cat:kittens
[('kittens', 0.7058785557746887), ('kitten', 0.5104931592941284), ('puppy', 0.5103816390037537), ('hush', 0.5096200108528137), ('adorable', 0.47415581345558167), ('purr', 0.46291735768318176), ('cats', 0.4579334855079651), ('newborn', 0.4374859631061554), ('catnip', 0.4326137900352478), ('ferrets', 0.43254637718200684)]


In [22]:
# Define the analogy
x, y, a, b = "king", "queen", "man", "woman"

# Find the word that completes the analogy
result = wv_from_bin.most_similar(positive=[a, y], negative=[x])

# Check if the intended word is ranked at the top
if result[0][0] == b:
    print("The analogy holds: {}:{} :: {}:{}".format(x, y, a, b))
else:
    print("The analogy does not hold.")

# Print the full result for reference
print(result)

The analogy holds: king:queen :: man:woman
[('woman', 0.7250837087631226), ('girl', 0.5886719226837158), ('she', 0.5709657669067383), ('her', 0.5615235567092896), ('mother', 0.553316056728363), ('person', 0.5315726399421692), ('boy', 0.5261016488075256), ('teenager', 0.5241419076919556), ('beautiful', 0.5178192257881165), ('men', 0.515008270740509)]


### SOLUTION

In this analogy, `king` is to `queen` as `man` is to `woman` (king:queen :: man:woman). Here `x` is `king`, `y` is `queen`, `a` is `man`, and `b` is the intended answer `woman`. Calling `most_similar` with `positive=[a, y]` and `negative=[x]` ranks `woman` first, so the analogy holds. The analogy dog:puppies :: cat:kittens also holds, whereas dog:puppy :: cat:kitten does not, because `puppies` is ranked above `kitten`.

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [27]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION

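In this case, the vectors treat `foot` as a unit of measurement rather than as a part of the body. The top results are all terms such as `45,000-square`, which presumably occur next to `foot` in phrases like "45,000-square foot" in the training corpus. The measurement sense therefore dominates the vector for `foot`, and the nearest neighbors are area terms instead of `sock`.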

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [28]:
x, y, a, b = "car", "engine", "bicycle", "pedals"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('engines', 0.5682656764984131),
 ('bike', 0.5660476684570312),
 ('two-stroke', 0.5309532880783081),
 ('turbojet', 0.4797455370426178),
 ('inline', 0.47928866744041443),
 ('bikes', 0.47836801409721375),
 ('four-stroke', 0.4714185297489166),
 ('powerplant', 0.4571566879749298),
 ('powered', 0.44949251413345337),
 ('4-stroke', 0.44715797901153564)]


In [31]:
x, y, a, b = "gallon", "water", "bank", "river"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('banks', 0.6095024347305298),
 ('central', 0.5487160682678223),
 ('the', 0.5328366160392761),
 ('also', 0.5115862488746643),
 ('which', 0.5080747008323669),
 ('banking', 0.5048969388008118),
 ('financial', 0.5025336742401123),
 ('taken', 0.49411171674728394),
 ('west', 0.49251824617385864),
 ('part', 0.4886169731616974)]


### SOLUTION

In [32]:
x, y, a, b = "car", "engine", "bicycle", "pedals"
result = wv_from_bin.most_similar(positive=[a, y], negative=[x])

incorrect_b = result[0][0]

print("The analogy does not hold: {}:{} :: {}:{}".format(x, y, a, incorrect_b))
print(result)

The analogy does not hold: car:engine :: bicycle:engines
[('engines', 0.5682656764984131), ('bike', 0.5660476684570312), ('two-stroke', 0.5309532880783081), ('turbojet', 0.4797455370426178), ('inline', 0.47928866744041443), ('bikes', 0.47836801409721375), ('four-stroke', 0.4714185297489166), ('powerplant', 0.4571566879749298), ('powered', 0.44949251413345337), ('4-stroke', 0.44715797901153564)]


In [33]:
x, y, a, b = "gallon", "water", "bank", "river"
result = wv_from_bin.most_similar(positive=[a, y], negative=[x])

incorrect_b = result[0][0]

print("The analogy does not hold: {}:{} :: {}:{}".format(x, y, a, incorrect_b))
print(result)

The analogy does not hold: gallon:water :: bank:banks
[('banks', 0.6095024347305298), ('central', 0.5487160682678223), ('the', 0.5328366160392761), ('also', 0.5115862488746643), ('which', 0.5080747008323669), ('banking', 0.5048969388008118), ('financial', 0.5025336742401123), ('taken', 0.49411171674728394), ('west', 0.49251824617385864), ('part', 0.4886169731616974)]


### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "man" and "profession" and most dissimilar to "woman", and (b) which terms are most similar to "woman" and "profession" and most dissimilar to "man". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias.

In [34]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION

The results of the word vector analysis reflect gender bias in the embedding model. Note that the first list printed above comes from `positive=['man', 'profession'], negative=['woman']`, and the second from `positive=['woman', 'profession'], negative=['man']`.

Male-associated words:
- reputation
- professions
- skill
- skills
- ethic
- business
- respected
- practice
- regarded
- life

Female-associated words:
- professions
- practitioner
- teaching
- nursing
- vocation
- teacher
- practicing
- educator
- physicians
- professionals

**Observations:**

1. **Male-associated words:** The list consists of words related to reputation, skill, ethic, and business, along with `respected` and `regarded`. These words do not name any particular job. They describe competence, status, and standing, so the vectors associate men with professional success in general.

2. **Female-associated words:** The list includes specific occupations and fields such as `teaching`, `nursing`, `teacher`, and `educator`. These are occupations stereotypically associated with women, particularly education and caregiving.

The gender bias in the word vectors is evident in the difference between the lists. The male-associated words describe skill and reputation, while the female-associated words point to a narrow set of stereotypically female occupations. This bias can have real-world implications when these word vectors are used in applications such as resume screening or automated decision-making, potentially reinforcing stereotypes and discrimination against women in certain professions.

These associations come from how men and women are described in the training corpus. They do not express any truth about men, women, or the jobs they hold.

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [37]:
A = "run"
B = "april"
word = "spring"
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('runs', 0.5651021003723145),
 ('running', 0.5456951260566711),
 ('walk', 0.4959622025489807),
 ('start', 0.49415791034698486),
 ('going', 0.49090850353240967),
 ('go', 0.4824409782886505),
 ('throw', 0.4803808331489563),
 ('come', 0.46764037013053894),
 ('break', 0.46636030077934265),
 ('summer', 0.4599555730819702)]

[('june', 0.740088939666748),
 ('july', 0.7383700013160706),
 ('february', 0.7344872355461121),
 ('october', 0.7323588132858276),
 ('september', 0.7309871912002563),
 ('august', 0.7274510264396667),
 ('january', 0.7268126606941223),
 ('december', 0.7199898362159729),
 ('november', 0.714846670627594),
 ('march', 0.6990302801132202)]


### SOLUTION

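- When we combine `run` and `spring` and subtract `april`, we get verbs of motion (`runs`, `running`, `walk`, `go`, `throw`), which matches the "jump" sense of `spring`.

- When we combine `april` and `spring` and subtract `run`, we get the names of the months, so the other meaning of `spring` (one of the seasons) is used.

This pair shows that the result depends on which sense of a word the other query words select. It is a word-sense effect rather than a social bias such as the gender bias examined in the guided analysis.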

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION

One explanation of how bias gets into word vectors is through the biases present in the training data used to train the models. Word vectors are typically trained on large corpora of text, which can reflect the biases present in society and language usage.

For example, let's consider a scenario where the training data used for word vector models contains biased portrayals of gender roles in certain professions. If the data predominantly associates men with certain professions, it can lead to gender bias in the resulting word vectors. As a result, the vectors may associate male terms more strongly with those professions, reinforcing societal stereotypes.

For instance, if the training data predominantly contains sentences like "The doctor was a man" or "The engineer was a man," the model may learn to associate the words "doctor" and "engineer" more closely with male terms. Consequently, when performing similarity queries using word vectors, the model may rank male-associated terms higher for these professions, potentially perpetuating gender bias in the results.

This bias can have real-world implications. For example, if these word vectors are used in automated resume screening systems, the biased associations can lead to unfair outcomes where qualified female candidates may be overlooked for certain professions based on the gender associations learned by the model.

It is important to carefully curate and diversify training data, apply debiasing techniques, and conduct regular audits to mitigate and address bias in word vectors and ensure fair and unbiased representation of language and concepts.
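The mechanism described above can be illustrated with a toy co-occurrence count. This is only a sketch: the corpus below is hypothetical and deliberately skewed, and `cooccurrence_counts` is an illustrative helper, not part of GloVe or GenSim. Real GloVe training uses weighted co-occurrence counts within a context window over a very large corpus.

```python
from collections import Counter

# Hypothetical toy corpus (assumption: invented and deliberately skewed).
toy_corpus = [
    "the doctor was a man",
    "the doctor was a man",
    "the doctor was a woman",
    "the nurse was a woman",
    "the nurse was a woman",
    "the nurse was a man",
]

def cooccurrence_counts(sentences):
    """Count how often each unordered word pair shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for i, first in enumerate(words):
            for second in words[i + 1:]:
                counts[(first, second)] += 1
    return counts

counts = cooccurrence_counts(toy_corpus)

# Keys are alphabetically ordered pairs.
print(counts[("doctor", "man")], counts[("doctor", "woman")])  # 2 1
print(counts[("man", "nurse")], counts[("nurse", "woman")])    # 1 2
```

Because the embedding objective is fit to statistics like these, a skew in the counts carries over into the geometry of the learned vectors.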

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.


### SOLUTION

One method to mitigate bias exhibited by word vectors is through the use of debiasing techniques. These techniques aim to neutralize or reduce the biased associations present in word vectors, promoting fairness and mitigating the reinforcement of stereotypes.

One such debiasing method is called "Hard Debiasing," which modifies the word vectors to remove gender associations from words that should be gender-neutral. It first identifies a gender direction from definitional pairs such as `he`/`she` and `man`/`woman`. It then removes the component along that direction from gender-neutral words such as profession names (the "neutralize" step) and makes gendered pairs such as `grandfather`/`grandmother` equidistant from the neutralized words (the "equalize" step).

For example, Bolukbasi et al. (2016) introduced this method in the paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings."

In their research, they applied the method to word2vec vectors trained on Google News text. After debiasing, stereotypical analogies such as man : computer programmer :: woman : homemaker were substantially reduced, while the vectors still performed well on standard evaluation tasks. This approach helps ensure fairer representations in applications that use the debiased word vectors.

Debiasing techniques, like "Hard Debiasing," are important tools in mitigating biases in word vectors. By consciously addressing and reducing biased associations, we can promote more inclusive and unbiased applications that rely on these models, fostering equal opportunities and reducing the reinforcement of stereotypes.
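A minimal sketch of the central projection step of hard debiasing is shown below. It covers only the removal of a single bias direction from one vector. The full method estimates the direction from several definitional pairs and adds an equalization step, both omitted here. All vectors are hypothetical 3-D values chosen for illustration, and `neutralize` is an illustrative name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def neutralize(vector, bias_direction):
    """Remove the component of `vector` that lies along `bias_direction`."""
    scale = dot(vector, bias_direction) / dot(bias_direction, bias_direction)
    return [x - scale * d for x, d in zip(vector, bias_direction)]

# Hypothetical 3-D vectors (assumption: invented values, not real embeddings).
he = [1.0, 0.0, 0.2]
she = [-1.0, 0.0, 0.2]
nurse = [-0.4, 0.8, 0.3]

# Simplification: use one definitional pair to define the gender direction.
gender_direction = [a - b for a, b in zip(he, she)]

nurse_debiased = neutralize(nurse, gender_direction)

print(dot(nurse, gender_direction))           # nonzero: leans toward "she"
print(dot(nurse_debiased, gender_direction))  # zero after neutralizing
```

After the projection, the toy `nurse` vector has no component along the gender direction, so it is equally similar to `he` and `she` with respect to that axis while its other coordinates are unchanged.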