#### Word Embeddings
* Similar words have similar vectors
* Low-dimensional (a common size is 300)
* Dense vectors: most values are non-zero
* Different techniques: Word2Vec, GloVe, fastText, Transformer-based (BERT, GPT), LSTM-based (ELMo)
* Word2Vec: a neural network is trained on a "fake" prediction task; we do not care about the task itself, because the word vectors we want are the weights the network learns as a side effect
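The "fake problem" in Word2Vec is usually predicting a word's neighbours (skip-gram) or predicting a word from its neighbours (CBOW). As a rough illustration, the sketch below builds only the skip-gram training pairs; it does not train a network. The function name `skipgram_pairs` and the `window` parameter are assumptions made for this example, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look at up to `window` tokens on each side of the center word.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

A network trained to predict the context word from the center word ends up placing words that share contexts close together, which is where the "similar words have similar vectors" property comes from.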

### Cosine Similarity
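* We compare two vectors by measuring the angle between them.
* The smaller the angle, the more similar the vectors.
* Taking the cosine of the angle turns it into a score between -1 and 1: cos(0°) = 1 for vectors pointing the same way, 0 for perpendicular vectors, and -1 for opposite ones.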
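To make the formula concrete, here is a minimal pure-Python sketch of cosine similarity: the dot product divided by the product of the vector lengths. The helper name `cosine_sim` is an assumption for this example, and it does not guard against zero-length vectors.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_sim([1, 0], [2, 0])       # same direction, different length
orthogonal = cosine_sim([1, 0], [0, 3])  # perpendicular
opposite = cosine_sim([1, 0], [-1, 0])   # opposite direction
print(same, orthogonal, opposite)
```

Note that only the direction matters: `[1, 0]` and `[2, 0]` differ in length but still score as identical.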

In [1]:
import spacy

# pre-trained model
nlp = spacy.load("en_core_web_lg")

In [4]:
doc = nlp("dog cat banana afskfsd")

for token in doc:
    print(token.text, "|", "Vector:", token.has_vector, "|", "OOV:", token.is_oov)

dog | Vector: True | OOV: False
cat | Vector: True | OOV: False
banana | Vector: True | OOV: False
afskfsd | Vector: False | OOV: True


In [5]:
# the spaCy model ships with built-in word vectors for the words in its vocabulary

doc[0].vector.shape

(300,)

In [8]:
base_token = nlp("bread")
base_token.vector.shape

(300,)

In [9]:
# compare "bread" with each word in the text below

doc = nlp("bread sandwich burger car tiger human wheat")

for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

bread <-> bread: 0.9999999744752309
sandwich <-> bread: 0.634106782477101
burger <-> bread: 0.4752069113758708
car <-> bread: 0.06451533308853552
tiger <-> bread: 0.04764611675903374
human <-> bread: 0.2151154210812192
wheat <-> bread: 0.6150360888607199


In [10]:
def print_similarity(base_word, words_to_compare):
    """Print the similarity between base_word and each space-separated word in words_to_compare."""
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)

    for token in doc:
        print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

In [12]:
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone: 0.4387907748060368
samsung <-> iphone: 0.670859081425417
iphone <-> iphone: 0.9999999983096304
dog <-> iphone: 0.08211864228011527
kitten <-> iphone: 0.10222318459666081


In [14]:
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

result = king - man + woman

In [15]:
# using cosine similarity to compare vectors

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([result], [queen])

array([[0.6178014]], dtype=float32)
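A single score of about 0.62 does not show whether "queen" is the closest word to `king - man + woman`; for that, the result has to be ranked against several candidates. The sketch below shows the ranking step using made-up 2-dimensional vectors (axes loosely meaning "royalty" and "gender"), so the numbers are purely illustrative and are not taken from the spaCy model. The `cosine_sim` helper and the `vectors` dictionary are assumptions for this example.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT real embeddings: [royalty, gender]
vectors = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
}

# king - man + woman, computed element by element
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Sort candidate words from most to least similar to the result vector
ranked = sorted(vectors, key=lambda word: cosine_sim(result, vectors[word]), reverse=True)
print(ranked)
```

With real 300-dimensional vectors the same ranking could be run over `nlp.vocab` entries instead of the toy dictionary. Analogy evaluations commonly exclude the input words (here "king", "man", "woman") from the candidates, since the result often stays close to one of them.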