<center><div style="direction:ltr;font-family:B Lotus, B Nazanin, Tahoma">In the name of God</div></center>
<img src="./logo.png" alt="class.vision" style="width: 200px;"/>
<h1><center><div style="direction:ltr;font-family:B Lotus, B Nazanin, Tahoma">Word analogies</div></center></h1>

<div style="direction:ltr;text-align:left;font-family:Tahoma">
The code is adapted, with modifications, from Professor Andrew Ng's Sequence Models course.
</div>

[https://www.coursera.org/learn/nlp-sequence-models](https://www.coursera.org/learn/nlp-sequence-models)

<div style="direction:ltr;text-align:left;font-family:Tahoma">
You can download the pre-trained vectors here:</div>

https://nlp.stanford.edu/projects/glove/

http://nlp.stanford.edu/data/glove.6B.zip


In [1]:
import numpy as np
import os

In [2]:
glove_dir = './temp'
embedding_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), 'r', encoding='utf-8') as file:
    # Iterate over the file lazily instead of loading every line with readlines()
    for line in file:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vector
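
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory buffer, so it runs without the download. The tokens, the 4-dimensional values, and the name `toy_index` are invented for illustration.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: "<token> <v1> <v2> ...".
sample = io.StringIO(
    "hello 0.1 -0.2 0.3 0.4\n"
    "world 0.5 0.6 -0.7 0.8\n"
)

toy_index = {}
for line in sample:
    values = line.split()
    # First field is the token, the rest are the vector components.
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_index["hello"].shape, toy_index["hello"].dtype)
```
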

## 1 - Cosine similarity

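To measure how similar two words are, we need a way to measure the similarity between their embedding vectors. Cosine similarity does this by comparing the angle between the two vectors: it is 1 when they point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. For two vectors $u$ and $v$ (written $\mathbf{A}$ and $\mathbf{B}$ in the definition below), it is defined as follows: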

### Definition
The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:
$$
\begin{equation}
    \mathbf{A} \cdot \mathbf{B} = \lVert \mathbf{A} \rVert \lVert \mathbf{B} \rVert \cos(\theta)
\end{equation}
$$
Given two $n$-dimensional vectors of attributes, $\mathbf{A}$ and $\mathbf{B}$, the cosine similarity, $\cos(\theta)$, is expressed in terms of the dot product and the magnitudes as:
$$
\begin{align}
    \text{Cosine Similarity} = S_C(\mathbf{A}, \mathbf{B}) &= \cos(\theta)\\
    &= \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \lVert \mathbf{B} \rVert}\\
    &= \frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^2} \cdot \sqrt{\sum_{i = 1}^{n}B_i^2}}
\end{align}
$$

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
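
As a quick sanity check of the formula, the sketch below evaluates it on a few small hand-picked vectors (illustrative values, not real embeddings). The helper name `cosine` is an assumption made for this example; it uses `np.linalg.norm` for the magnitudes.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine(a, 2 * a)    # parallel vectors
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(a, -a)             # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values should be 1, 0 and -1, and scaling a vector does not change the result.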


In [3]:
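def similarity(u, v):
    """Cosine similarity between u and v (u may also be a matrix of row vectors)."""
    dot = np.dot(u, v.T)
    # Product of the Euclidean norms, taken along the last axis
    length_prod = np.sqrt(np.sum(np.square(u), axis=-1)) * np.sqrt(np.sum(np.square(v), axis=-1))
    # The small constant guards against division by zero
    return dot / (length_prod + 1e-12)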
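
Because the norms are taken along the last axis, `similarity` also works when `u` is a matrix of row vectors and `v` is a single vector, which is how it is used in the analogy task later. Here is a minimal sketch with made-up 3-dimensional vectors (the function body is repeated so the snippet runs on its own; `matrix` and `query` are illustrative names):

```python
import numpy as np

def similarity(u, v):
    # Same definition as in the notebook cell, repeated for self-containment.
    dot = np.dot(u, v.T)
    length_prod = np.sqrt(np.sum(np.square(u), axis=-1)) * np.sqrt(np.sum(np.square(v), axis=-1))
    return dot / (length_prod + 1e-12)

# Three made-up row vectors (not real embeddings) compared against one query.
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.0]])
query = np.array([2.0, 0.0, 0.0])

scores = similarity(matrix, query)
print(scores.shape)  # one score per row of the matrix
print(scores)
```
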

In [4]:
father = embedding_index["father"]
mother = embedding_index["mother"]
ball = embedding_index["ball"]
crocodile = embedding_index["crocodile"]
france = embedding_index["france"]
tehran = embedding_index["tehran"]
paris = embedding_index["paris"]
iran = embedding_index["iran"]
print("cosine_similarity(father, mother) = ", similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ", similarity(ball, crocodile))
print("cosine_similarity(france - paris, tehran - iran) = ", similarity(france - paris, tehran - iran))

cosine_similarity(father, mother) =  0.8656660919003184
cosine_similarity(ball, crocodile) =  0.1520657243738079
cosine_similarity(france - paris, tehran - iran) =  -0.6854124336246552

The last value is negative because the two differences point in roughly opposite directions: `france - paris` is a country minus its capital, while `tehran - iran` is a capital minus its country.


## 2 - Word analogy task

In the word analogy task, we complete the sentence <font color='brown'>"*a* is to *b* as *c* is to **____**"</font>. An example is <font color='brown'> '*man* is to *woman* as *king* is to *queen*' </font>. In detail, we are trying to find a word *d*, such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity. 
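
To make the relation $e_b - e_a \approx e_d - e_c$ concrete, here is a minimal sketch on a hand-made 2-dimensional vocabulary. The vectors in `toy` are invented for illustration (one axis loosely standing for gender, the other for royalty) and are not GloVe values; `cosine` is a helper defined only for this example.

```python
import numpy as np

# Hypothetical 2-D embeddings: [gender, royalty]. Not real GloVe vectors.
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

target = toy["woman"] - toy["man"]  # e_b - e_a
scores = {
    word: cosine(vec - toy["king"], target)  # e_d - e_c for each candidate d
    for word, vec in toy.items()
    if word not in ("man", "woman", "king")  # skip the three query words
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup `queen - king` equals `woman - man` exactly, so `queen` gets a similarity of about 1 and is selected.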


In [5]:
embedding_index["father"]

array([ 0.64706 , -0.068067,  0.15468 , -0.17408 , -0.29134 ,  0.76999 ,
       -0.3192  , -0.25663 , -0.25082 , -0.036737, -0.25509 ,  0.29636 ,
        0.5776  ,  0.49641 ,  0.19167 , -0.83888 ,  0.58482 , -0.38717 ,
       -0.71591 ,  0.9519  , -0.37966 , -0.1131  ,  0.47154 ,  0.20921 ,
        0.38197 ,  0.067582, -0.92879 , -1.1237  ,  0.84831 ,  0.68744 ,
       -0.15472 ,  0.92714 ,  0.53371 , -0.037392, -0.856   ,  0.19056 ,
       -0.014594,  0.15186 ,  0.53514 , -0.20306 , -0.35164 ,  0.33152 ,
        1.1306  , -0.72787 , -0.19724 ,  0.031659, -0.24041 , -0.057617,
        0.60473 , -0.49233 , -0.24405 , -0.3184  ,  0.96156 ,  1.0895  ,
        0.21534 , -2.0542  , -1.0615  ,  0.052439,  0.57958 ,  0.2748  ,
        0.91587 ,  0.85195 ,  0.36113 , -0.31901 ,  0.7784  , -0.36865 ,
        0.64387 ,  0.33104 , -0.27181 ,  0.58524 , -0.15143 ,  0.11121 ,
        0.2126  , -0.60345 ,  0.16148 ,  0.32952 , -0.1354  , -0.30629 ,
       -0.89143 ,  0.091912,  0.49753 ,  0.55932 , 

In [6]:
def analogy(word_a, word_b, word_c, embedding_index):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Convert the words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = embedding_index[word_a], embedding_index[word_b], embedding_index[word_c]
    # Get all the words in an array
    words = np.array(list(embedding_index.keys()))

    # Get the index of each input word so that it can be
    # excluded from the candidates for word_d
    index_a = np.argmax(words == word_a)
    index_b = np.argmax(words == word_b)
    index_c = np.argmax(words == word_c)

    # Stack all embeddings into a (num_words, dim) matrix
    embedding_matrix = np.array(list(embedding_index.values()), dtype='float32')
    # Calculate e_b - e_a
    part_1 = e_b - e_a
    # Calculate e_d - e_c for every candidate word d
    part_2 = embedding_matrix - e_c
    # Calculate the similarity for each word (array of shape (num_words,))
    similarities = similarity(part_2, part_1)
    # Prevent word_d from being word_a, word_b or word_c
    similarities[[index_a, index_b, index_c]] = -100
    # Select the most similar word
    selected_word = words[np.argmax(similarities)]
    return selected_word

In [7]:
analogy('china', 'chinese', 'iran', embedding_index)

'iranian'

In [8]:
analogy('india', 'delhi', 'iran', embedding_index)

'tehran'

In [9]:
analogy('man', 'woman', 'boy', embedding_index)

'girl'

In [10]:
analogy('small', 'smaller', 'big', embedding_index)

'bigger'

In [11]:
analogy('iran', 'farsi', 'canada', embedding_index)

'inuktitut'
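
The single best match does not show how close the other candidates were. One way to inspect the runners-up is to return the top `k` candidates with their scores. The sketch below is a hypothetical variant (`top_k_analogies` is not part of the original notebook) demonstrated on invented 2-D vectors so it runs without the GloVe file; with the real data you would pass `embedding_index` instead.

```python
import numpy as np

def top_k_analogies(word_a, word_b, word_c, embedding_index, k=3):
    """Return the k best (word, score) candidates for 'a is to b as c is to ?'."""
    words = list(embedding_index.keys())
    matrix = np.array([embedding_index[w] for w in words], dtype='float32')
    target = embedding_index[word_b] - embedding_index[word_a]  # e_b - e_a
    diffs = matrix - embedding_index[word_c]                     # e_d - e_c for every d
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    scores = diffs @ target / (norms + 1e-12)
    # Exclude the three query words from the candidates.
    for w in (word_a, word_b, word_c):
        scores[words.index(w)] = -np.inf
    order = np.argsort(scores)[::-1][:k]
    return [(words[i], float(scores[i])) for i in order]

# Invented 2-D vectors for illustration only (not GloVe values).
toy = {
    "man": np.array([1.0, 0.0]), "woman": np.array([-1.0, 0.0]),
    "king": np.array([1.0, 1.0]), "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}
results = top_k_analogies("man", "woman", "king", toy, k=2)
print(results)
```
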

<div class="alert alert-block alert-info">
<div style="direction:ltr;text-align:left;font-family:B Lotus, B Nazanin, Tahoma">Advanced Deep Learning Course<br>Alireza Akhavanpour<br>Aban and Azar 1399 (October to December 2020)<br>
</div>
<a href="http://class.vision">Class.Vision</a> - <a href="http://AkhavanPour.ir">AkhavanPour.ir</a> - <a href="https://github.com/Alireza-Akhavan/">GitHub</a>

</div>