# Cosine distance or Euclidean distance?

Both cosine distance and Euclidean distance are commonly used in NLP to measure how similar, or how far apart, two word vectors are. Cosine distance is often preferred because it tends to capture semantic similarity better. It is based on the cosine of the angle between two vectors, so it compares their directions regardless of their magnitudes. This makes it more robust to differences in word frequency and length.
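To make the "direction, not magnitude" point concrete, here is a minimal sketch using made-up 3-dimensional vectors (not real embeddings). The helper names `cosine_similarity` and `euclidean_distance` are illustrative only and are not part of gensim or scipy.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return float(np.linalg.norm(u - v))

# Toy vectors invented for this example
a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction as a, ten times the magnitude

cos_sim = cosine_similarity(a, b)
euc_dist = euclidean_distance(a, b)

print("cosine similarity:", cos_sim)
print("Euclidean distance:", euc_dist)
```

Because `b` is just a rescaled copy of `a`, the cosine similarity stays at 1 (up to floating-point rounding), while the Euclidean distance grows with the scaling factor.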

Moreover, the two measures differ in how the nearest neighbour is found: cosine similarity should be maximized, whereas Euclidean distance should be minimized.
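The following sketch illustrates the maximize/minimize distinction on a tiny hypothetical vocabulary. The words and 2-D vectors are invented for illustration. They are chosen so that one candidate points in almost the same direction as the query but has a large norm, while another is unrelated in direction but sits close to the origin.

```python
import numpy as np

# Hypothetical 2-D "embeddings", invented for this example
words = ["query", "same_direction", "small_norm", "opposite"]
vectors = np.array([
    [2.0, 0.0],   # query
    [6.0, 0.5],   # almost the same direction as the query, large norm
    [0.0, 0.3],   # orthogonal to the query, but very close to the origin
    [-2.0, 0.0],  # opposite direction
])

query = vectors[0]
candidates = vectors[1:]
candidate_words = words[1:]

# Cosine similarity: higher means closer, so take the argmax
cos_sims = candidates @ query / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
nearest_by_cosine = candidate_words[int(np.argmax(cos_sims))]

# Euclidean distance: lower means closer, so take the argmin
euc_dists = np.linalg.norm(candidates - query, axis=1)
nearest_by_euclidean = candidate_words[int(np.argmin(euc_dists))]

print("nearest by cosine similarity:", nearest_by_cosine)
print("nearest by Euclidean distance:", nearest_by_euclidean)
```

In this toy setup the two measures pick different neighbours: cosine similarity selects the vector with the closest direction, while Euclidean distance selects the small-norm vector simply because it lies near the origin.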

In [1]:
!pip install gensim



Pre-trained Word2Vec vectors, trained on part of the Google News dataset (about 100 billion words).

In [6]:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)

C:\Users\33629/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz


In [7]:
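# Import KeyedVectors explicitly: "import gensim.downloader as api" does not bind the name "gensim"
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(path, binary=True)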

In [9]:
from scipy.spatial.distance import cosine, euclidean

In [12]:
# Get the word vector for "left"
left_vec = model.get_vector("left")

# Compute the cosine similarities and Euclidean distances between "left" and all other words in the vocabulary
cosine_sims = [1 - cosine(left_vec, model.get_vector(word)) for word in model.index_to_key]
euclidean_dists = [euclidean(left_vec, model.get_vector(word)) for word in model.index_to_key]

# Sort by descending cosine similarity and by ascending Euclidean distance
cosine_similarities_sorted = sorted(zip(model.index_to_key, cosine_sims), key=lambda x: x[1], reverse=True)
euclidean_distances_sorted = sorted(zip(model.index_to_key, euclidean_dists), key=lambda x: x[1])

# Get the top 10 nearest neighbors using cosine similarity (index 0 is "left" itself, so skip it)
cosine_neighbors = cosine_similarities_sorted[1:11]
print("Nearest neighbors of 'left' using cosine similarity:")
for word, similarity in cosine_neighbors:
    print(word, similarity)

# Get the top 10 nearest neighbors using Euclidean distance (index 0 is "left" itself, so skip it)
euclidean_neighbors = euclidean_distances_sorted[1:11]
print("\nNearest neighbors of 'left' using Euclidean distance:")
for word, distance in euclidean_neighbors:
    print(word, distance)

Nearest neighbors of 'left' using cosine similarity:
leaving 0.6707000732421875
leave 0.525093138217926
leaves 0.522864580154419
returned 0.5059226751327515
right 0.49213990569114685
departed 0.49109703302383423
limped 0.48599502444267273
went 0.4719873368740082
remaining 0.465037077665329
empty 0.4546155035495758

Nearest neighbors of 'left' using Euclidean distance:
leaving 1.6367921829223633
McCuin_dialed_authorities 1.7510281801223755
trembled_uncontrollably 1.7627252340316772
Illini_Shaun_Pruitt 1.7631374597549438
Our_pomegranate_orchard 1.7666441202163696
ball_richocheted 1.780613660812378
deked_Lundqvist 1.7857111692428589
crease_JVR_punched 1.7858235836029053
redirected_Jason_Pominville 1.7860292196273804
halfback_scampered 1.7880470752716064
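The cell above calls a scipy function once per vocabulary word, which can be slow on a vocabulary this large. A possible vectorized alternative is sketched below with NumPy. The random matrix is only a stand-in for the real embedding matrix (in gensim, `model.vectors` should hold it, and `model.most_similar("left", topn=10)` returns cosine neighbours directly). The sizes and the seed are arbitrary assumptions.

```python
import numpy as np

# Stand-in for the embedding matrix: 1000 random "words" of dimension 300
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300)).astype(np.float32)

query_index = 0  # position of the query word in the matrix
query = vectors[query_index]

# Cosine similarity between the query and every row, in one matrix product
norms = np.linalg.norm(vectors, axis=1)
cosine_sims = vectors @ query / (norms * norms[query_index])

# Euclidean distance between the query and every row
euclidean_dists = np.linalg.norm(vectors - query, axis=1)

# The query itself ranks first in both orderings, so skip position 0
top_cosine = np.argsort(-cosine_sims)[1:11]        # highest similarity first
top_euclidean = np.argsort(euclidean_dists)[1:11]  # smallest distance first

print("top 10 by cosine similarity:", top_cosine)
print("top 10 by Euclidean distance:", top_euclidean)
```

The indices returned here would be mapped back to words with `model.index_to_key` when working with the real model.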


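The results are clearly different: the cosine neighbours are semantically related to "left", while most of the Euclidean neighbours are unrelated phrases.
It is therefore important to choose the distance or similarity measure carefully in NLP. Here, cosine similarity is more robust and deals with unbalanced data better than Euclidean distance.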
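One way to check that vector magnitude drives the difference is to normalize every vector to unit length first. For unit vectors, the squared Euclidean distance equals 2 − 2·cos(θ), so both measures should produce the same ranking. The sketch below verifies this on random data standing in for real embeddings. The sizes, the seed, and the range of norms are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random vectors with deliberately varied norms (stand-in for real embeddings)
vectors = rng.normal(size=(200, 50)) * rng.uniform(0.1, 5.0, size=(200, 1))

# Rescale every row to unit length
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
query = unit[0]

cosine_sims = unit @ query                              # dot product of unit vectors = cosine
euclidean_dists = np.linalg.norm(unit - query, axis=1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.allclose(euclidean_dists ** 2, 2 - 2 * cosine_sims))

# Compare the 10 best-ranked rows under each measure
same_top10 = bool(np.array_equal(
    np.argsort(-cosine_sims)[:10],
    np.argsort(euclidean_dists)[:10],
))

print("identity holds:", identity_holds)
print("same top 10:", same_top10)
```

If the Euclidean neighbours become sensible after normalization, the mismatch seen above likely comes from differences in vector norms rather than from direction.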