## A4-Q3

A Word2Vec model captures semantic relationships between words by representing them as dense embedding vectors in a continuous vector space. It is a powerful tool for finding analogous words by comparing the similarity of their embedding vectors. A common metric for the similarity of two vectors is cosine similarity: the cosine of the angle between the vectors, which indicates whether they point in roughly the same direction. The formula for cosine similarity is as follows:
$$\mathrm{CosSim}(A,B)=\frac{A \cdot B}{|A|\,|B|}$$
where $A \cdot B$ is the dot product of the vectors $A$ and $B$, and $|\cdot|$ is the Euclidean norm of a vector.
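
As a quick sanity check of the formula, the sketch below evaluates it on a few hand-picked 2-dimensional vectors. The helper name `cos_sim` and the example vectors are illustrative assumptions, not part of the assignment, and the code uses only the standard library so it runs without the trained model.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))       # A . B
    norm_a = math.sqrt(sum(x * x for x in a))    # |A|
    norm_b = math.sqrt(sum(y * y for y in b))    # |B|
    return dot / (norm_a * norm_b)

# Hypothetical vectors chosen to cover the three typical cases
same_direction = cos_sim([1.0, 2.0], [2.0, 4.0])    # parallel: close to 1
orthogonal = cos_sim([1.0, 0.0], [0.0, 3.0])        # perpendicular: 0
opposite = cos_sim([1.0, 2.0], [-1.0, -2.0])        # anti-parallel: close to -1

print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction, not on length: scaling `[1.0, 2.0]` to `[2.0, 4.0]` leaves the similarity at (approximately) 1.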

We trained a Word2Vec model on a corpus of S&P 500 earnings call transcripts. It is loaded into the variable `model`. As covered in class, you can use a Word2Vec model like a dictionary that maps each word to its embedding vector; for example, `model['hello']` returns the vector for the word `hello`.

Use cosine similarity to calculate the similarities among the following words:
```Python
['fall', 'loss', 'reduction', 'success', 'process', 'pleased', 'confident']
```

Which two words are most similar? What is the similarity score?


In [2]:
# Download the trained word2vec model
import urllib.request

urllib.request.urlretrieve("https://storage.googleapis.com/rotman-ncs-data-buket/word2vec_300.model", 
                           "word2vec_300.model")
urllib.request.urlretrieve("https://storage.googleapis.com/rotman-ncs-data-buket/word2vec_300.model.syn1neg.npy",
                           "word2vec_300.model.syn1neg.npy")
urllib.request.urlretrieve("https://storage.googleapis.com/rotman-ncs-data-buket/word2vec_300.model.wv.vectors.npy", 
                           "word2vec_300.model.wv.vectors.npy") 

('word2vec_300.model.wv.vectors.npy',
 <http.client.HTTPMessage at 0x1b74ad737f0>)

In [3]:
# Load the word2vec model
from gensim.models import Word2Vec
import numpy as np

model = Word2Vec.load("word2vec_300.model").wv

In [4]:
# Word list to be compared
words_list = ['fall', 'loss', 'reduction', 'success', 'process', 'pleased', 'confident']

In [5]:
# Retrieve each word's vector by looking the word up as in a dictionary
word_vectors = [model[word] for word in words_list]

In [9]:
# Write your code here to calculate cosine similarity between the word vectors
def cosine_similarity(vector1, vector2):
    """Return the cosine of the angle between two vectors."""
    dot_product = np.dot(vector1, vector2)
    norm_1 = np.linalg.norm(vector1)
    norm_2 = np.linalg.norm(vector2)
    # Use a distinct local name so the function name is not shadowed
    similarity = dot_product / (norm_1 * norm_2)
    return similarity

In [11]:
# Find the most similar pair of words in the list, and get the cosine similarity score
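max_sim = -np.inf  # lower than any possible cosine similarity
max_pair = None    # defined up front so the print below cannot hit an unset name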

for i in range(len(words_list)):
    for j in range(i + 1, len(words_list)):
        sim = cosine_similarity(word_vectors[i], word_vectors[j])
        if sim > max_sim:
            max_sim = sim
            max_pair = (words_list[i], words_list[j])
print("The most similar pair of words are: ", max_pair)
print("The similarity score is: ", max_sim)

The most similar pair of words are:  ('pleased', 'confident')
The similarity score is:  0.47998467
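
The nested loop above can also be written with `itertools.combinations`, which yields each unordered pair exactly once. The following is a minimal sketch under stated assumptions: `toy_vectors` holds made-up 3-dimensional vectors rather than embeddings from the trained model, and `cos_sim` and `most_similar_pair` are hypothetical helper names introduced only for illustration.

```python
import math
from itertools import combinations

def cos_sim(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", NOT taken from the trained model
toy_vectors = {
    "up": [1.0, 0.2, 0.0],
    "rise": [0.9, 0.3, 0.1],
    "down": [-1.0, 0.1, 0.0],
}

def most_similar_pair(vectors):
    """Return ((word_a, word_b), score) for the highest-scoring unordered pair."""
    scored = (
        ((w1, w2), cos_sim(vectors[w1], vectors[w2]))
        for w1, w2 in combinations(vectors, 2)   # each pair once, no self-pairs
    )
    return max(scored, key=lambda item: item[1])

best_pair, best_score = most_similar_pair(toy_vectors)
print(best_pair, best_score)
```

With the real model, the same function could presumably be applied to a dictionary built as `{word: model[word] for word in words_list}`, since it only needs a mapping from words to numeric vectors.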
