1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also exploits co-occurrence counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both the "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
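
To make the ranking concrete, here is a minimal sketch of what "rank all other words by cosine similarity" means. It is not GenSim's implementation: the helper names `cosine_similarity` and `top_k_similar` and the 2-D toy vectors are assumptions made up for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def top_k_similar(vectors, query, k=10):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is not a candidate
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D vectors (hypothetical values, NOT taken from GloVe).
# Axis 0 loosely stands for "animal", axis 1 for "computer hardware".
toy_vectors = {
    "mouse": np.array([1.0, 1.0]),
    "rat": np.array([1.0, 0.2]),
    "keyboard": np.array([0.2, 1.0]),
    "banana": np.array([-1.0, 0.1]),
}

top_two = top_k_similar(toy_vectors, "mouse", k=2)
print(top_two)
```

In this toy setup the "mouse" vector sits between the two axes, so one neighbor from each meaning makes it into the top results, which is the situation the exercise asks you to find in the real vectors.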

In [2]:
wv_from_bin.most_similar('mouse')[:10]

[('mice', 0.6580958962440491),
 ('keyboard', 0.5548278093338013),
 ('rat', 0.5433949828147888),
 ('rabbit', 0.5192376971244812),
 ('cat', 0.5077415704727173),
 ('cursor', 0.5058691501617432),
 ('trackball', 0.5048902630805969),
 ('joystick', 0.49841049313545227),
 ('mickey', 0.47242844104766846),
 ('clicks', 0.4722806215286255)]

### SOLUTION

I tried several words (bat, bank, saw, rock, ...), but none of them worked except for "mouse".

The algorithm used here is GloVe, a context-independent model that assigns a single vector to each word; this vector stays the same regardless of the context in which the word appears. The model therefore does not adapt to different usages of the same word, and its representation cannot differentiate between the multiple meanings of polysemous words like "bank".

In the case of "bank", GloVe creates a single, averaged representation that amalgamates all the contexts in which the word appears, resulting in a generalized semantic representation. This representation may not capture a specific meaning of "bank", such as "river bank" or "financial bank". Instead, it tends to capture only the dominant meaning of the word, which in this case is the financial one. This explanation also holds for the other words I listed above.

For "mouse", however, the corpus that GloVe was trained on seems to contain a good balance of both meanings, so the output includes words related to each of them. For example: mouse -> rat (the animal) and mouse -> keyboard (the computer device).

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.

In [3]:
w1 ='glad'
w2 ='joyous'
w3 ='sad'
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms glad, joyous have cosine distance: 0.7180908620357513
Antonyms glad, sad have cosine distance: 0.49190449714660645
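
Since several candidate triples usually have to be tried, the check above can be wrapped in a small helper. This is only a sketch: `find_counterintuitive_triples` is a hypothetical name, and the lookup table stands in for the real model so the snippet runs without downloading GloVe. The "glad" distances are rounded from the output above; the "hot" distances are made-up placeholders.

```python
def find_counterintuitive_triples(distance, triples):
    """Keep the (w1, synonym, antonym) triples where the antonym is closer.

    `distance` is any callable taking two words and returning their cosine
    distance, for example `wv_from_bin.distance`.
    """
    hits = []
    for w1, w2, w3 in triples:
        d_syn = distance(w1, w2)  # w1 vs. its synonym
        d_ant = distance(w1, w3)  # w1 vs. its antonym
        if d_ant < d_syn:
            hits.append((w1, w2, w3, d_syn, d_ant))
    return hits


# Stand-in distance table so the sketch is self-contained.
toy_distances = {
    ("glad", "joyous"): 0.718,  # rounded from the notebook output above
    ("glad", "sad"): 0.492,     # rounded from the notebook output above
    ("hot", "warm"): 0.30,      # made-up placeholder, not a GloVe value
    ("hot", "cold"): 0.60,      # made-up placeholder, not a GloVe value
}


def toy_distance(w1, w2):
    return toy_distances[(w1, w2)]


candidates = [("glad", "joyous", "sad"), ("hot", "warm", "cold")]
hits = find_counterintuitive_triples(toy_distance, candidates)
print(hits)
```

With the vectors loaded, passing `wv_from_bin.distance` instead of `toy_distance` would run the same check against GloVe.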


### SOLUTION
A possible explanation for this counter-intuitive result, where an antonym is closer to a word than a synonym in the embedding space, lies in the way word embeddings are trained. GloVe embeddings are trained on co-occurrence statistics from a corpus, so words that often appear in similar contexts end up closer in the embedding space. Antonyms often appear in similar contexts because they are used in comparable situations, even though they have opposite meanings. For example, "glad" and "sad" might both be used frequently in contexts discussing people's emotions, which could make their embeddings closer than expected.

Another contributing factor could be polysemy, where a word has multiple meanings. If "glad" is used in different contexts that align more closely with "sad" than with some of the contexts of "joyous", this could also bring "glad" and "sad" closer together in the embedding space.

Lastly, the training data itself and the frequency of certain words can influence the resulting embeddings. If the corpus used to train the embeddings has more instances where "glad" and "sad" are used together compared to "glad" and "joyous", this could also result in a smaller cosine distance between "glad" and "sad".

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [4]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION

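We maximize the cosine similarity between $x$ and the following expression:

> w + g - m

In other words, $x \approx w + g - m$: starting from `grandfather`, we subtract `man` and add `woman`.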
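
The vector arithmetic above can be sketched with a few hand-picked 2-D vectors. Everything here is an assumption for illustration: `solve_analogy` is a hypothetical helper, not GenSim's `most_similar`, and the toy coordinates (axis 0 for gender, axis 1 for generation) are invented. Like `most_similar`, the sketch leaves the three input words out of the candidates.

```python
import numpy as np


def solve_analogy(vectors, a, b, c, k=3):
    """For the analogy a : b :: c : x, rank candidates for x.

    Candidates are scored by cosine similarity to the target c + b - a.
    """
    target = vectors[c] + vectors[b] - vectors[a]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the input words themselves
        unit = vec / np.linalg.norm(vec)
        scores.append((word, float(np.dot(target, unit))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (hypothetical): axis 0 = gender, axis 1 = generation.
toy = {
    "man": np.array([1.0, 0.2]),
    "woman": np.array([-1.0, 0.2]),
    "grandfather": np.array([1.0, 2.0]),
    "grandmother": np.array([-1.0, 2.0]),
    "mother": np.array([-1.0, 1.0]),
    "daughter": np.array([-1.0, -0.5]),
}

ranking = solve_analogy(toy, "man", "grandfather", "woman")
print(ranking)
```

In this toy space "grandmother" coincides with the target, while "mother" also scores highly because it points in a similar direction. That mirrors why related family words appear in the real output.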



### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother".

### SOLUTION

The words "granddaughter", "daughter", and "mother" are also close to "grandmother" because they all denote female family members. They appear in similar contexts and frequently occur together in text, so their vectors also lie near the target vector.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [5]:
x, y, a, b = 'son', 'daughter', 'prince', 'princess'
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

### SOLUTION

The analogy in the cell above is son:daughter :: prince:princess. A prince is the son of a king, and a princess is the daughter of a king.

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [6]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION

As stated above, a polysemous word such as "foot" may have neighbors related to only one of its meanings. Here, it seems that the training corpus contained more text about measurements (as in "square foot") than about body parts.

In addition, the corpus may not have contained enough data about the desired outcome, or it may simply have been poorly balanced, with a very diverse vocabulary.

It is also possible that there were simply not enough examples of the relationship between "foot" and "sock" in the training data for the word vector model to recognize the analogy.

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [7]:
x, y, a, b = 'ambassador', 'embassy', 'priest', 'church'
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('priests', 0.6184417605400085),
 ('church', 0.5651061534881592),
 ('catholic', 0.5578794479370117),
 ('clergy', 0.5117530822753906),
 ('archdiocese', 0.5048693418502808),
 ('parishioners', 0.5043907761573792),
 ('nuns', 0.4974152743816376),
 ('outside', 0.46567755937576294),
 ('priesthood', 0.46334171295166016),
 ('ordained', 0.46231380105018616)]


### SOLUTION

Intended analogy: ambassador:embassy :: priest:church
Each person is mapped to the place where they are based.

Incorrect value of b according to the word vectors: priests
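
To see how far off a failed analogy is, it can help to look up where the intended word ended up in the returned list. The sketch below uses a hypothetical helper, `rank_of`, applied to the first three entries of the output above.

```python
def rank_of(results, word):
    """1-based position of `word` in a most_similar-style list, or None."""
    for position, (candidate, _score) in enumerate(results, start=1):
        if candidate == word:
            return position
    return None


# First three entries copied from the output of the cell above.
results = [
    ("priests", 0.6184417605400085),
    ("church", 0.5651061534881592),
    ("catholic", 0.5578794479370117),
]

church_rank = rank_of(results, "church")
print(church_rank)
```

Here the intended answer "church" is ranked second, so this analogy fails narrowly. In the "hand : glove :: foot : sock" example, by contrast, "sock" does not appear in the top 10 at all.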

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "man" and "profession" and most dissimilar to "woman", and (b) which terms are most similar to "woman" and "profession" and most dissimilar to "man". Point out the difference between the list of male-associated words and the list of female-associated words, and explain how it reflects gender bias.

In [8]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION

The first list shows the words most similar to "man" and "profession"; these are mostly quality-related words such as "reputation", "skill", and "respected". The second list, by contrast, mostly contains the names of jobs themselves. These jobs are related to education or care, and so are connected with social sectors that possibly have lower incomes.

We can conclude that the word embedding model has learned a gender stereotype: that men are more likely than women to possess these qualities, and that women are associated with a specific set of jobs. If we used this kind of embedding for an allocation task, the output could be unfair and could lead to women being overlooked for higher-paid professional jobs.

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [9]:

A = 'american'
B = 'african'
word = 'personality'
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('character', 0.5096344351768494),
 ('personalities', 0.4690902531147003),
 ('mind', 0.4690796434879303),
 ('show', 0.4587042033672333),
 ('sense', 0.45704707503318787),
 ('reality', 0.4549425542354584),
 ('attitude', 0.44896262884140015),
 ('persona', 0.4450281262397766),
 ('comedian', 0.43983012437820435),
 ('kind', 0.435565710067749)]

[('personalities', 0.5307511687278748),
 ('traits', 0.49145638942718506),
 ('temperament', 0.46246960759162903),
 ('mbeki', 0.46054622530937195),
 ('schizotypal', 0.4567224681377411),
 ('trait', 0.4478680193424225),
 ('africa', 0.4334438443183899),
 ('antisocial', 0.43069028854370117),
 ('mandela', 0.4262062609195709),
 ('dynamic', 0.40384382009506226)]


### SOLUTION

Here we compare "american" and "african" with respect to personality. The words associated with "american" include positive ones such as "kind" or "comedian", while the words associated with "african" include terms such as "schizotypal" or "antisocial". These terms are not inherently bad, but when they are attached to a group of people, any judgment based on the model's output becomes biased and may suggest that members of that group have such personality disorders.

### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION

The most likely cause of such bias is the dataset used for training. The training corpus may itself have contained this bias, so the model learned its implicit biases and stereotypes.

It also turns out that embeddings do not just reflect the statistics of their input but also amplify bias. Several papers reach this conclusion, for example [(Zhao et al. 2017)](https://doi.org/10.18653/v1/D17-1323): gendered terms become more gendered in embedding space than they were in the input text statistics.

Another real-world example is from the paper [(Bolukbasi et al. 2016)](https://arxiv.org/abs/1607.06520), which finds that the closest occupation to 'computer programmer' - 'man' + 'woman' in word2vec embeddings trained on news text is 'homemaker', and that the embeddings similarly suggest the analogy 'father':'doctor' :: 'mother':'nurse'. An analogy of this kind, learned by the embedding model, may cause allocational harm, which occurs when a system allocates resources (jobs or credit) unfairly to different groups.

These explanations are taken from the reference book (Jurafsky).

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.


### SOLUTION

One method is to debias the trained embeddings in a post-processing step. A real-world example that demonstrates this method is the work by [Bolukbasi et al. (2016b)](https://arxiv.org/abs/1607.06520), who proposed a post-processing debiasing method known as "Hard Debiasing". It modifies the word vectors they obtained from the news corpus described above in order to reduce gender bias for all words that are not inherently gendered.

They achieved this by first identifying gender-specific words and concepts: they looked at the cosine similarity between pairs of words and analyzed which pairs showed the largest differences between male and female pronouns. They then zeroed the projection of each neutral word onto a predefined gender direction and ensured that all neutral words are equally close to each of the two gendered words in an equality set.

In this way, they developed a transformation of the embedding space that removes gender stereotypes but preserves definitional gender. However, although these sorts of debiasing may reduce bias in embeddings, they do not eliminate it [(Gonen and Goldberg, 2019)](https://aclanthology.org/W19-3621/), and this remains an open problem.

This part was answered using ChatGPT and the reference book (Jurafsky).
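
The "zero the gender projection" step can be sketched as a simple vector projection. This is a simplified illustration under stated assumptions, not the authors' code: `neutralize` is a hypothetical helper, the 3-D vectors are invented, and the gender direction is taken from a single he/she pair rather than estimated from several pairs. The equalization step is left out.

```python
import numpy as np


def neutralize(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, unit) * unit


# Toy 3-D vectors (hypothetical values, not real embeddings).
he = np.array([1.0, 0.5, 0.0])
she = np.array([-1.0, 0.5, 0.0])
gender_direction = he - she  # assumption: one pair defines the direction

programmer = np.array([0.4, 0.3, 0.8])  # leans toward "he" on axis 0
debiased = neutralize(programmer, gender_direction)

print(debiased)
print(np.dot(debiased, gender_direction))
```

After neutralizing, the toy "programmer" vector has no component along the gender direction, so it is equally similar to the toy "he" and "she" vectors, which here have equal norms.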