## Word Vectors with Spacy

Spacy provide us with packages that have pre-computed some `word embeddings` that we can use. For example one such package is **en_core_web_lg**.

To compute those embeddings Space is using `GloVe` embedding technique.

**Note**: Spacy is not only providing us with word embeddings over a document, but also computes the `embedding` for the entire `document` and also for a `sub-document` originate from the original document. It can provide that functionality by simple taking the average vector from all words vectors appear in the (sub)document.

In [1]:
!python -m spacy download en_core_web_lg -q

2023-04-09 08:27:12.064513: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


In [2]:
import spacy

## Initalizing the NLP Object

In [3]:
nlp = spacy.load("en_core_web_lg")

## We can see If a token has word embeddings pre-computed in Spacy or If it is OOV

In [None]:
doc = nlp("orange, apple, dog cat and banana with asdasfag")

for token in doc:
    print(token.text, " | ", token.has_vector, " | ", token.is_oov)

orange  |  True  |  False
,  |  True  |  False
apple  |  True  |  False
,  |  True  |  False
dog  |  True  |  False
cat  |  True  |  False
and  |  True  |  False
banana  |  True  |  False
with  |  True  |  False
asdasfag  |  False  |  True


## Getting the Word Embedding Vectors

In [None]:
# Let's say we want to get the word vector of the token banana
banana = doc[3]

print(banana.vector.shape)
banana.vector[:36]

(300,)


array([-3.3899 , -4.7034 , -0.56101,  1.2291 ,  4.3298 , -1.0775 ,
       -1.3006 ,  8.7939 , -0.16669, -4.3738 ,  2.3697 ,  2.6438 ,
       -5.4589 ,  3.3491 ,  4.0331 ,  5.1368 , -3.0016 ,  4.3627 ,
       -3.1921 , -4.6624 ,  6.065  ,  1.0278 , -2.302  ,  2.6546 ,
       -1.9866 , -0.21586, -4.6756 , -4.2126 ,  4.552  ,  0.77829,
       -2.3145 , -5.2688 , -0.83724,  1.5414 , -3.5657 , -2.157  ],
      dtype=float32)

In [None]:
print(doc[:3].vector.shape)
doc[:3].vector[:36]

(300,)


array([-2.4707    , -2.3475866 , -1.41162   ,  2.8259335 ,  1.18327   ,
       -3.123     , -2.2702668 ,  5.885967  , -2.48483   , -0.41277325,
        4.2376    , -0.5593333 , -3.8496    ,  2.0623567 ,  1.6014701 ,
        0.80606   , -0.5474667 ,  0.34885335,  0.56983334, -3.9805    ,
        1.1141567 ,  2.6033332 , -2.3343668 , -0.87796664, -1.6004766 ,
       -2.0258865 , -3.4727333 , -1.1013669 ,  1.8469567 ,  1.14383   ,
       -1.56239   , -2.29303   , -1.4446467 , -1.1601    , -0.01396668,
       -1.768     ], dtype=float32)

## Comparing those Words Vectors

In [None]:
# Creating a token embedding for the word `apple`
base_token = nlp("apple")

for token in doc:
    if token.has_vector:
        print(f"{token.text} <-> {base_token.text}: {token.similarity(base_token)}")

orange <-> apple: 0.6135187212109956
, <-> apple: 0.12710916405242928
apple <-> apple: 1.0
, <-> apple: 0.12710916405242928
dog <-> apple: 0.22881005140483499
cat <-> apple: 0.20368060357742446
and <-> apple: 0.13160242794047908
banana <-> apple: 0.6646700599790674
with <-> apple: 0.12418893150611857


Because those embeddings are being claculated suing GloVe, this similarity has to do with how many times those two words are beaing captured close to each other in a large corpus.

## Arithmetic using those Vectors

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


king_emb = nlp("king").vector
queen_emb = nlp("queen").vector
man_emb = nlp("man").vector
woman_emb = nlp("woman").vector

result = king_emb - man_emb + woman_emb

print(f"Similarity of result and queen:{cosine_similarity([result], [queen_emb])[0][0] * 100: .2f}%")

Similarity of result and queen: 61.78%
