# Cosine similarities
This notebook illustrates how to calculate and display cosine similarities between word vectors.
As input, we use a file with embeddings generated by [embiggen](https://pypi.org/project/embiggen/) together
with a file with the corresponding word labels.
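
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms; it ranges from -1 (opposite direction) to 1 (same direction). The following is a minimal sketch on made-up toy vectors. The helper name `cosine_similarity` is ours and is not part of ``kcet``. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus this value.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors (illustrative values, not taken from the embedding file)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
```

Vectors that point in the same direction score 1 regardless of their length, which is why the measure is popular for comparing embeddings.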

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
from scipy.spatial.distance import cosine
from collections import defaultdict

The following code allows us to import the ``kcet`` module from the local repository.

In [2]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from kcet import Wordvec2Cosine

The constructor of ``Wordvec2Cosine`` loads the word embeddings and words into a pandas dataframe.
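
To make the expected input format concrete, here is a rough sketch of how such a dataframe could be assembled. This is not the ``kcet`` implementation: the function `load_embeddings` is hypothetical, and we assume that the embedding file is a NumPy `.npy` matrix with one row per word and that the words file lists one word per line in the same order.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_embeddings(embedding_path, words_path):
    """Hypothetical loader: one row per word, one column per embedding dimension."""
    vectors = np.load(embedding_path)
    with open(words_path) as fh:
        words = [line.strip() for line in fh if line.strip()]
    # Assumption: row i of the matrix belongs to word i of the words file
    if len(words) != vectors.shape[0]:
        raise ValueError("Number of words and number of embedding rows differ")
    return pd.DataFrame(vectors, index=words)


# Toy demonstration with temporary files (made-up values)
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "toy_embedding.npy")
    words_path = os.path.join(tmp, "toy_words.txt")
    np.save(emb_path, np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
    with open(words_path, "w") as fh:
        fh.write("cell\npatient\nstudy\n")
    toy_df = load_embeddings(emb_path, words_path)
```

Using the words as the index means that a vector can be looked up by label, for example `toy_df.loc["study"]`.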

In [6]:
data_directory = 'data'
if not os.path.isdir(data_directory):
    raise FileNotFoundError("Could not find data directory")
embedding_file = os.path.join(data_directory, "embedding_skipgram_dim100.npy")
words_file = os.path.join(data_directory, "words_before2021_jan3.txt")
w2c = Wordvec2Cosine(embeddings=embedding_file, words=words_file)
df = w2c.get_embeddings()
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
cell,8.434354,-1.301705,-0.096402,8.134256,1.92091,0.926111,-4.443069,1.682222,6.917873,1.928016,...,-4.835361,-2.297898,-1.920775,-0.465857,0.514665,3.127317,3.846459,-0.568883,3.231197,2.706048
patient,8.775193,-2.617628,0.749227,9.040712,1.679668,0.293989,-6.171941,3.743819,7.335784,2.977857,...,-3.589026,-2.087615,-1.168177,0.23211,-0.159556,4.528789,3.694627,0.023134,1.677531,2.749773
meshd009369,8.508118,-4.368992,-0.529461,8.065137,0.859943,1.024093,-5.529982,2.339031,7.453519,2.226591,...,-4.020745,-2.279984,-3.091373,-0.013547,0.403936,4.710061,5.096478,-0.771608,4.10152,2.822914
0,7.366462,-2.015644,0.103874,7.46722,2.352617,0.799381,-7.434361,2.850947,7.761542,3.999322,...,-3.143945,-2.406357,-2.340659,-0.079155,0.768798,2.829236,3.35903,-0.015921,3.031416,1.629173
study,8.567193,-2.818339,0.221348,8.043354,1.631235,0.741792,-3.279874,4.170556,6.876228,3.561809,...,-1.841853,-2.17976,-2.473173,-0.025654,0.555928,3.907662,4.981505,-0.832214,3.822674,3.205337


## Top n most similar words
We retrieve the top n most similar words. The function ``n_most_similar_words`` returns a list of tuples,
and ``n_most_similar_words_df`` returns a Pandas dataframe.
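
The ranking itself can be sketched as follows. This is an illustration under our own assumptions rather than the code used by ``Wordvec2Cosine``: the function `most_similar` is hypothetical and the toy dataframe contains made-up two-dimensional vectors. Each row is scaled to unit length so that a single matrix-vector product yields all cosine similarities at once.

```python
import numpy as np
import pandas as pd


def most_similar(df, target_word, n):
    """Return the n rows of df with the highest cosine similarity to target_word."""
    matrix = df.to_numpy(dtype=float)
    # Normalise each row to unit length so that a dot product equals the cosine
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    target = unit[df.index.get_loc(target_word)]
    similarities = unit @ target
    # Indices of the n largest similarities, in descending order
    order = np.argsort(-similarities)[:n]
    return pd.DataFrame({"word": df.index[order], "similarity": similarities[order]})


toy = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=["a", "b", "c", "d"],
)
top = most_similar(toy, target_word="a", n=3)
```

Because the target word is compared with itself as well, it appears first with a similarity of 1.0, just as in the outputs in this section.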

In [7]:
target_word = 'ncbigene695'  # BTK
n = 50
cosine_similarities = w2c.n_most_similar_words_df(target_word=target_word, n=n)
cosine_similarities.head()

Unnamed: 0,word,similarity
0,ncbigene695,1.0
1,ncbigene3718,0.994439
2,ncbigene6850,0.992054
3,ncbigene3717,0.991348
4,ncbigene3716,0.991039


In [7]:
target_word = 'meshd007938'  # Leukemia
n = 50
cosine_similarities = w2c.n_most_similar_words_df(target_word=target_word, n=n)
cosine_similarities.head()

Unnamed: 0,word,similarity
0,meshd007938,1.0
1,meshd007951,0.994456
2,meshd015470,0.993775
3,meshd054198,0.993462
4,meshd007945,0.993184


In [8]:
target_word = 'meshd001943'  # Breast neoplasms
n = 20
cosine_similarities = w2c.n_most_similar_words_df(target_word=target_word, n=n)
cosine_similarities.head()

Unnamed: 0,word,similarity
0,meshd001943,1.0
1,bc,0.993237
2,tnbc,0.991588
3,meshd010051,0.990166
4,meshd016889,0.989668


## Top n least similar words
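The function ``n_least_similar_words_df`` is used below to retrieve the words with the lowest cosine similarity to the target word. One possible way to compute such a ranking is sketched here with pandas `Series.nsmallest`; the function `least_similar` and the toy vectors are hypothetical and only serve as an illustration, not as the ``kcet`` implementation.

```python
import numpy as np
import pandas as pd


def least_similar(df, target_word, n):
    """Return the n words whose vectors have the lowest cosine similarity to target_word."""
    target = df.loc[target_word].to_numpy(dtype=float)
    matrix = df.to_numpy(dtype=float)
    # Cosine similarity of every row with the target vector
    sims = (matrix @ target) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target))
    lowest = pd.Series(sims, index=df.index).nsmallest(n)
    return pd.DataFrame({"word": lowest.index, "similarity": lowest.to_numpy()})


toy = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=["a", "b", "c", "d"],
)
bottom = least_similar(toy, target_word="a", n=2)
```

Negative values indicate vectors that point away from the target, which matches the negative similarities in the outputs of this section.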

In [9]:
target_word = 'meshd007938'  # Leukemia
n = 5
cosine_similarities = w2c.n_least_similar_words_df(target_word=target_word, n=n)
cosine_similarities

Unnamed: 0,word,similarity
0,bacu,-0.270391
1,canthu,-0.230543
2,apropo,-0.21805
3,famou,-0.206321
4,atadenoviru,-0.178634


In [11]:
target_word = 'meshd001943'  # Breast neoplasms
n = 5
cosine_similarities = w2c.n_least_similar_words_df(target_word=target_word, n=n)
cosine_similarities

Unnamed: 0,word,similarity
0,bacu,-0.309134
1,famou,-0.21533
2,atadenoviru,-0.212366
3,canthu,-0.207619
4,apropo,-0.206231
