1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

# Prediction-Based Word Vectors

More recently, prediction-based word vectors such as word2vec and GloVe (which also utilizes the benefit of counts) have demonstrated better performance. Here, we shall explore the embeddings produced by GloVe.

Run the following cell to load the GloVe vectors into memory.

In [1]:
import gensim.downloader as api
import pprint
wv_from_bin = api.load("glove-wiki-gigaword-200")



### Words with Multiple Meanings
Polysemes and homonyms are words that have more than one meaning (see this [wiki page](https://en.wikipedia.org/wiki/Polysemy) to learn more about the difference between polysemes and homonyms). Find a word with *at least two different meanings* such that the top-10 most similar words (according to cosine similarity) contain related words from *both* meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meanings in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?

**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__.
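To make the ranking in the note concrete, here is a minimal sketch of a cosine-similarity ranking over a tiny vocabulary. The helper `most_similar_toy` and the 3-D vectors are hypothetical, invented for illustration only; they are not GloVe values and this is not GenSim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(word, vectors, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-D vectors (assumption: made up, not real GloVe values).
toy_vectors = {
    "left":    [0.9, 0.8, 0.1],
    "right":   [1.0, 0.1, 0.0],
    "leaving": [0.1, 1.0, 0.2],
    "banana":  [0.0, 0.1, 1.0],
}

for neighbour, score in most_similar_toy("left", toy_vectors, topn=3):
    print(f"{neighbour}: {score:.3f}")
```

In this toy setup "left" sits between a direction-like axis and a departure-like axis, so neighbours from both senses rank above the unrelated word.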

In [2]:
### CODE HERE
word = 'left'
similar_words = wv_from_bin.most_similar(word)
similar_words
# len(similar_words) = 10

[('leaving', 0.8048902153968811),
 ('right', 0.7165082097053528),
 ('back', 0.7087967991828918),
 ('went', 0.6952051520347595),
 ('out', 0.6867292523384094),
 ('leave', 0.6815778017044067),
 ('when', 0.677730917930603),
 ('returned', 0.6693609356880188),
 ('came', 0.6677366495132446),
 ('but', 0.6625413298606873)]

### SOLUTION
**Question :** Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain **one** of the meanings of the words)?<br>

**Answer :**<br>
There can be several reasons why many polysemous or homonymic words didn't work. One is that a word may co-occur with the context of one of its meanings much more often than with the other, as with the word "bank".<br>
If we test this word, we see words about finance.<br>
Because each word has a single vector, the dominant meaning shapes that vector, and `most_similar` then reflects only one meaning of the word.<br>
**Example :**<br>
For "left", we have two meanings: 1) a direction, and 2) the past tense of "leave".<br>
In the output we can observe similar words for both meanings: "right" for the direction, and "leaving", "leave", and "went" for the past tense of "leave".

### Synonyms & Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.

As an example, $w_1$="happy" is closer to $w_3$="sad" than to $w_2$="cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance.
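As a small worked example of the definition Cosine Distance = 1 - Cosine Similarity, the sketch below computes the distance for three hand-picked 2-D vectors. The function `cosine_distance` and the vectors are hypothetical illustrations, not the GenSim implementation.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 2-D vectors chosen so the results are easy to check by hand.
w1 = [1.0, 0.0]
same_direction = [3.0, 0.0]   # same direction, different length
orthogonal = [0.0, 2.0]       # at a right angle to w1
opposite = [-1.0, 0.0]        # pointing the other way

print(cosine_distance(w1, same_direction))  # 0.0
print(cosine_distance(w1, orthogonal))      # 1.0
print(cosine_distance(w1, opposite))        # 2.0
```

The distance therefore ranges from 0 (same direction) to 2 (opposite direction), and vector length plays no role.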

In [3]:
w1 = 'advantage'
w2 = 'benefit'
w3 = 'disadvantage'
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

Synonyms advantage, benefit have cosine distance: 0.48551487922668457
Antonyms advantage, disadvantage have cosine distance: 0.36569714546203613


### SOLUTION

**Question :** Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

**Answer :**<br>
My example: "advantage" and "disadvantage" are antonyms with a cosine distance of about 0.366, which is less than the cosine distance of about 0.486 between "advantage" and "benefit", which are near-synonyms.<br>

There are several possible reasons why some antonyms have a smaller cosine distance than some synonyms:<br>
**1)** Word co-occurrence: GloVe is trained on a co-occurrence matrix, so two antonyms that are observed together, or in the same contexts, more often than two synonyms can end up with more similar vectors and a smaller cosine distance.<br>

**2)** Another effect that sometimes appears in GloVe vectors, although it is not related to my example, is bias in the training text, such as gender bias.<br>

**Result :** For GloVe, the more frequent co-occurrence of these antonyms is the likely cause of this result.

### Analogies with Word Vectors
Word vectors have been shown to *sometimes* exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar)__. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy will have the highest cosine similarity (largest returned numerical value).

In [4]:
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200808525085449),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005682587623596),
 ('father', 0.6659887433052063),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767976760864),
 ('grandparents', 0.644661009311676),
 ('wife', 0.6445354223251343)]


Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using **only** vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?

Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?

### SOLUTION
**Question :** Using only vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, to what expression are we maximizing $x$'s cosine similarity?<br>

**Answer :**<br>
We are maximizing the cosine similarity between $x$ and the expression **g + (w - m)**, which is the same as **w + g - m**.
This yields "grandmother", which is ranked first in our code output.<br>

Cosine similarity measures how similar two vectors are in direction, regardless of their magnitude. In our case, we want the answer vector $x$ to point in a direction as close as possible to that of $g + (w - m)$.<br>

For the 2D sketch, we can take **gender** as the x-axis and another arbitrary attribute, such as age, as the y-axis.
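The 2D picture can be sketched numerically. In the example below the x-axis stands for gender and the y-axis for generation; all coordinates are hypothetical values chosen by hand, not GloVe vectors. As in `most_similar`, the input words are left out of the candidates.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D vectors: x-axis = gender, y-axis = generation (assumption).
toy = {
    "man":           [1.0, 1.0],
    "woman":         [-1.0, 1.0],
    "grandfather":   [1.0, 3.0],
    "grandmother":   [-1.0, 3.0],
    "mother":        [-1.0, 2.0],
    "granddaughter": [-1.0, 0.2],
}

m, g, w = toy["man"], toy["grandfather"], toy["woman"]
# Target direction for the analogy: g + (w - m)
target = [gi + wi - mi for gi, wi, mi in zip(g, w, m)]

# Rank all words except the three inputs by cosine similarity to the target.
inputs = {"man", "grandfather", "woman"}
ranked = sorted(
    ((word, cosine_similarity(target, vec)) for word, vec in toy.items() if word not in inputs),
    key=lambda pair: pair[1],
    reverse=True,
)

print(target)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy space the gender shift `w - m` moves "grandfather" straight onto "grandmother", while other female family words still score fairly high because they share the gender coordinate.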

### Finding Analogies
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?

### SOLUTION
**Question :** For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?<br>
**Answer :**<br>
The most_similar function considers overall vector similarity, not just directional shifts. Words like "daughter" or "mother" might be closer to "woman" in the vector space due to various semantic factors, even though they don't perfectly capture the gender and generational shift like "grandmother".
Word vectors can encode complex relationships. "Daughter" might be close to "woman" because they both relate to family, even though it doesn't fulfill the specific analogy requirements.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

**Note**: You may have to try many analogies to find one that works!

In [7]:
x, y, a, b = "man", "woman", "king", "queen"
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

### SOLUTION
**Question :** In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.<br>
**Answer :**<br>
Analogy : man:woman :: king:queen

### Incorrect Analogy
a. Below, we expect to see the intended analogy "hand : glove :: foot : **sock**", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?

In [8]:
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922032654285431),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.4544755816459656),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.444133460521698),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.43486514687538147),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]


### SOLUTION
**Question :**<br> Below, we expect to see the intended analogy "hand : glove :: foot : sock", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did?<br>
**Answer :**<br>
**1)** Polysemy and Word Embeddings: Words like "foot" and "glove" have multiple meanings (polysemy). Word embeddings often capture the most frequent or dominant meaning for a word. In this case, the word "foot" might be primarily associated with units of area (square footage) in the word vector space you're using. This overshadows the intended meaning related to the body part.<br>
**2)** Dataset Bias: The training data used to create the word vectors might influence the resulting embeddings. If the data primarily uses "foot" in the context of area measurements, the resulting vector will likely reflect that dominant meaning.<br>

b. Find another example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the **incorrect** value of b according to the word vectors (in the previous example, this would be **'45,000-square'**).

In [9]:
x, y, a, b = "sun", "hot", "snow","cold"
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))

[('snowfall', 0.5256644487380981),
 ('heavy', 0.463622123003006),
 ('slippery', 0.45968130230903625),
 ('snowfalls', 0.4593408405780792),
 ('wet', 0.4542512893676758),
 ('snows', 0.4535849094390869),
 ('stuck', 0.45239710807800293),
 ('cold', 0.44439494609832764),
 ('ice', 0.44195353984832764),
 ('mud', 0.44139257073402405)]


### SOLUTION
**Question :** Find another example of analogy that does not hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the incorrect value of b according to the word vectors<br>

**Answer :**<br>
Analogy : sun:hot :: snow:cold <br>
The output is "snowfall", "heavy", "slippery", and so on, so the incorrect value of b according to the word vectors is **'snowfall'**.<br>
These results are not what we intended: the sun is a source of warmth, and "hot" describes the weather it causes, just as snow goes with cold weather.<br>
However, "cold" only appears in eighth place in the output, so the analogy does not hold.
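To check where the intended word lands, one option is a small helper that looks up its position in a `most_similar`-style result list. The function `rank_of` is a hypothetical helper, not part of GenSim; the list below copies the output of the sun:hot :: snow:x query above, with similarities rounded to four decimals.

```python
def rank_of(word, results):
    """Return the 1-based rank of `word` in a list of (word, score) pairs, or None."""
    for position, (candidate, _score) in enumerate(results, start=1):
        if candidate == word:
            return position
    return None

# Output of the sun:hot :: snow:x query, scores rounded to 4 decimals.
results = [
    ("snowfall", 0.5257), ("heavy", 0.4636), ("slippery", 0.4597),
    ("snowfalls", 0.4593), ("wet", 0.4543), ("snows", 0.4536),
    ("stuck", 0.4524), ("cold", 0.4444), ("ice", 0.4420), ("mud", 0.4414),
]

print(rank_of("cold", results))  # 8
print(rank_of("warm", results))  # None
```

An analogy "holds" in the sense used here only when this rank is 1.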

### Guided Analysis of Bias in Word Vectors

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Run the cell below to examine (a) which terms are most similar to "woman" and "profession" and most dissimilar to "man", and (b) which terms are most similar to "man" and "profession" and most dissimilar to "woman". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.

In [10]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176787376404),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.4875852167606354),
 ('respected', 0.485920250415802),
 ('practice', 0.482104629278183),
 ('regarded', 0.4778572618961334),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.49884122610092163),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211804032325745),
 ('vocation', 0.4788965880870819),
 ('teacher', 0.47160351276397705),
 ('practicing', 0.46937814354896545),
 ('educator', 0.46524327993392944),
 ('physicians', 0.4628995358943939),
 ('professionals', 0.4601394236087799)]


### SOLUTION

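**Question :** Explain how it is reflecting gender bias.<br>
**Answer :**<br>
The male-associated list contains words about standing and ability, such as "reputation", "skill", "business", and "respected", while the female-associated list contains specific occupations such as "teaching", "nursing", "teacher", and "educator".<br>
This bias likely arises from the training data used to create the word embeddings. If the data primarily contains text where professions are associated with specific genders, the word vectors will reflect those biases. For example, if there are many mentions of "doctor" and "he" together, the word vector for "doctor" might lean towards male doctors.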

### Independent Analysis of Bias in Word Vectors

Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [41]:

A = "american"
B = "asian"
word = "food"
pprint.pprint(wv_from_bin.most_similar(positive=[A, word], negative=[B]))
print()
pprint.pprint(wv_from_bin.most_similar(positive=[B, word], negative=[A]))


[('foods', 0.5010936260223389),
 ('care', 0.4917422831058502),
 ('nutrition', 0.4827004373073578),
 ('for', 0.47490009665489197),
 ('u.s.', 0.47357606887817383),
 ('americans', 0.4723329544067383),
 ('supplies', 0.4687662422657013),
 ('products', 0.4552023708820343),
 ('canadian', 0.4544133245944977),
 ('medicine', 0.45267990231513977)]

[('vegetables', 0.5511559844017029),
 ('asia', 0.5381689071655273),
 ('foods', 0.4933627247810364),
 ('fresh', 0.47798046469688416),
 ('supplies', 0.4774106442928314),
 ('prices', 0.47332215309143066),
 ('thailand', 0.46969538927078247),
 ('seafood', 0.463384747505188),
 ('markets', 0.4559096097946167),
 ('eaten', 0.45554429292678833)]


### SOLUTION
**Question :** Use the most_similar function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.<br>
**Answer :**<br>
I chose the words "american" and "asian" and looked at the bias that appears with the word "food". The two lists share some words, such as "foods" and "supplies", but each also contains words that are similar to one group and dissimilar to the other.<br>
For example, "seafood" and "vegetables" appear only in the Asian list; they are common in Asian meals but less associated with American food.<br>
This example reflects a bias related to race or location.
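The comparison between the two lists can be done mechanically. The sketch below copies the words from the two outputs above (scores omitted) and separates the shared words from those unique to each list; the variable names are illustrative.

```python
# Words returned by the two queries above (similarity scores omitted).
american_food = ["foods", "care", "nutrition", "for", "u.s.", "americans",
                 "supplies", "products", "canadian", "medicine"]
asian_food = ["vegetables", "asia", "foods", "fresh", "supplies", "prices",
              "thailand", "seafood", "markets", "eaten"]

# Words present in both lists, and words unique to each list (order preserved).
shared = sorted(set(american_food) & set(asian_food))
only_american = [w for w in american_food if w not in asian_food]
only_asian = [w for w in asian_food if w not in american_food]

print("shared:", shared)
print("only american:", only_american)
print("only asian:", only_asian)
```

The words unique to each list are the ones worth inspecting for bias.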


### Thinking About Bias

a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias.

### SOLUTION

**Answer :**<br>
One primary way bias gets into word vectors is through data bias. This happens when the training data used to create the word embeddings itself reflects societal biases. Word embeddings are trained on massive amounts of text data, and the statistical patterns within this data are captured in the resulting vectors.<br>
**Real-world example :**<br>
Imagine a word embedding model trained on a dataset consisting primarily of news articles. These articles might frequently mention male doctors and female nurses. As a result, the word vector for "doctor" might end up closer to words like "male" and "surgeon" in the vector space, while the word vector for "nurse" might be closer to words like "female" and "caring." This reflects the existing societal bias in these professions within the training data, perpetuating the stereotype in the word embeddings.
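A rough sketch of how such statistical patterns arise: count how often words appear near each other in a deliberately skewed mini-corpus. The corpus, the window size, and the helper `cooccurrence_counts` are all hypothetical and invented for illustration; real embeddings are trained on far larger text.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count ordered (center, context) pairs within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

# Hypothetical, deliberately skewed mini-corpus (assumption: made up).
corpus = [
    "he is a doctor",
    "he became a doctor",
    "she is a nurse",
    "she became a nurse",
    "she is a doctor",
]

counts = cooccurrence_counts(corpus, window=3)
print(counts[("doctor", "he")], counts[("doctor", "she")])  # 2 1
print(counts[("nurse", "he")], counts[("nurse", "she")])    # 0 2
```

A model trained on counts like these has no way to tell a fact about language from a skew in the sample, so the skew ends up in the vectors.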

b. What is one method you can use to mitigate bias exhibited by word vectors?  Briefly describe a real-world example that demonstrates this method.

### SOLUTION

**Answer :**<br>
**1) Debiasing Techniques :** Analyze the vector space to identify directions that encode specific biases (e.g., gender, race). This can be done by studying known biased word pairs. Several debiasing techniques exist, such as projecting out the bias direction or rotating the word vectors in a way that reduces bias.<br>

**2) Balanced Training Data :** Ensure the training data used to create the word embeddings is balanced and representative of the population you want to capture. This means including text from diverse sources and avoiding datasets that reinforce stereotypes.<br>
Techniques like data augmentation can be used to artificially create more balanced datasets from existing data. This can help reduce bias if the original data is skewed.
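Projecting out a bias direction can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the 3-D vectors are hypothetical, the bias direction is estimated from a single word pair, and `remove_component` is an illustrative helper rather than a library function.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def remove_component(v, direction):
    """Subtract from v its projection onto the unit vector `direction`."""
    scale = dot(v, direction)
    return [x - scale * d for x, d in zip(v, direction)]

# Hypothetical 3-D vectors (assumption: made up, not real GloVe values).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
doctor = [0.4, 0.5, 0.8]

# Estimate a gender direction from one known word pair, then project it out.
gender_direction = normalize([a - b for a, b in zip(he, she)])
debiased_doctor = remove_component(doctor, gender_direction)

print(dot(doctor, gender_direction))           # 0.4
print(dot(debiased_doctor, gender_direction))  # 0.0
print(debiased_doctor)                         # [0.0, 0.5, 0.8]
```

After the projection the vector has no component along the estimated direction, while its other coordinates are unchanged. In practice the direction would be estimated from many word pairs rather than one.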