## Word Embeddings with Gensim

spaCy only provides word embeddings computed with the GloVe technique. Gensim, on the other hand, lets us choose among several algorithms for creating vector representations.

For the full list of models that gensim provides, see: https://github.com/RaRe-Technologies/gensim-data

In [2]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy
import gensim.downloader as api             # For downloading an existing model to compute word embeddings
from gensim.utils import simple_preprocess  # For tokenizing a document into words

## Initializing the NLP object

In [4]:
nlp = spacy.load("en_core_web_sm")

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

## Initializing the Embedding object

In [5]:
wv = api.load("word2vec-google-news-300")

This model was trained on Google News data and produces embeddings of size 300.

## Similarity of Vector Representation

By `similarity` we don't mean how closely the meanings of two words are related, but rather how often the words appear in similar contexts. This holds at least for Word2Vec and GloVe, because both are window-based approaches.

An example of two words that appear in a similar context is `good` and `bad`. Consider the following sentences:
* I was feeling _good_ as it was holiday.
* I was feeling _bad_ as it was Monday.

Another example is `cat` and `dog`.
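
To make the idea of a shared context concrete, here is a minimal sketch in plain Python that extracts the words surrounding a target word. The helper `context_window` and the window size of 2 are assumptions made for illustration; they are not part of gensim, and real Word2Vec training uses its own configurable window.

```python
def context_window(tokens, target, size=2):
    """Return the words within `size` positions of `target` (target itself excluded)."""
    i = tokens.index(target)
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left + right

s1 = "i was feeling good as it was holiday".split()
s2 = "i was feeling bad as it was monday".split()

ctx_good = context_window(s1, "good")
ctx_bad = context_window(s2, "bad")

print(ctx_good)             # ['was', 'feeling', 'as', 'it']
print(ctx_bad)              # ['was', 'feeling', 'as', 'it']
print(ctx_good == ctx_bad)  # True
```

Both words see exactly the same neighbours, which is why a window-based model tends to place them close together even though their meanings are opposite.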

In [7]:
# We can get the similarity of two elements using:
wv.similarity(w1="good", w2="bad")

0.7190051
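
The score returned by `similarity` is the cosine similarity between the two word vectors. The following sketch shows that computation in plain Python on hand-made 2-dimensional vectors; the vectors `a`, `b`, `c` and the helper `cosine_similarity` are hypothetical and only serve to illustrate the formula.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0]
b = [1.0, 1.0]
c = [-1.0, 0.0]

print(round(cosine_similarity(a, a), 4))  # 1.0    (same direction)
print(round(cosine_similarity(a, b), 4))  # 0.7071 (45 degrees apart)
print(round(cosine_similarity(a, c), 4))  # -1.0   (opposite direction)
```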

In [8]:
# We can also get a list of the words closest to a given word
wv.most_similar("good")

[('great', 0.7291510105133057),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348341941833),
 ('nice', 0.6836092472076416),
 ('excellent', 0.644292950630188),
 ('fantastic', 0.6407778263092041),
 ('better', 0.6120728850364685),
 ('solid', 0.5806034803390503),
 ('lousy', 0.576420247554779)]

## Arithmetic using Vector Embeddings

In [9]:
# France - Paris + Berlin = Germany

wv.most_similar(positive=["France", "Berlin"], negative=["Paris"])

[('Germany', 0.7901254892349243),
 ('Austria', 0.6026812195777893),
 ('German', 0.6004959940910339),
 ('Germans', 0.5851002931594849),
 ('Poland', 0.5847075581550598),
 ('Hungary', 0.5271855592727661),
 ('BBC_Tristana_Moore', 0.5249711275100708),
 ('symbol_RSTI', 0.5245768427848816),
 ('Belgium', 0.5221248269081116),
 ('Germnay', 0.5199405550956726)]

In [10]:
# King - man + woman = Queen

wv.most_similar(positive=["King", "woman"], negative=["man"])

[('Queen', 0.5515626668930054),
 ('Oprah_BFF_Gayle', 0.47597548365592957),
 ('Geoffrey_Rush_Exit', 0.46460166573524475),
 ('Princess', 0.4533674716949463),
 ('Yvonne_Stickney', 0.4507041573524475),
 ('L._Bonauto', 0.4422135353088379),
 ('gal_pal_Gayle', 0.4408389925956726),
 ('Alveda_C.', 0.4402790665626526),
 ('Tupou_V.', 0.4373864233493805),
 ('K._Letourneau', 0.4351031482219696)]
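
The arithmetic behind these queries can be sketched with toy vectors: add the `positive` vectors, subtract the `negative` ones, then rank the remaining words by cosine similarity to the result. Everything below (`toy_vectors`, `analogy`, the 3-dimensional values) is made up for illustration. gensim's own `most_similar` also normalizes the input vectors before combining them, which this simplified version skips.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hypothetical features: [is_french, is_german, is_capital_city]
toy_vectors = {
    "France":  [1.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 1.0],
    "Germany": [0.0, 1.0, 0.0],
    "Berlin":  [0.0, 1.0, 1.0],
    "Austria": [0.3, 0.8, 0.0],
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = [(w, cosine_similarity(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = analogy(positive=["France", "Berlin"], negative=["Paris"], vectors=toy_vectors)
print(ranked[0][0])  # Germany
```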

## Matching with the Context

Another feature of gensim is that, given a list of words, it can tell us which one doesn't match the others.

More precisely, it returns the word whose vector is farthest from the mean of all the vectors, measured by cosine similarity.

In [11]:
wv.doesnt_match(["google", "apple", "dog", "twitter"])

'dog'

In [12]:
wv.doesnt_match(["good", "great", "bad", "terrific"])

'bad'
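
A rough sketch of this idea in plain Python: normalize every vector to unit length, average them, and return the word least similar to that average. The 2-dimensional `toy_vectors` and the helper names `unit` and `odd_one_out` are assumptions for illustration, not gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector has the lowest cosine similarity to the mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical vectors: three "tech company" words point one way, "dog" another.
toy_vectors = {
    "google":  [1.0, 0.1],
    "apple":   [0.9, 0.2],
    "twitter": [1.0, 0.0],
    "dog":     [0.1, 1.0],
}

outlier = odd_one_out(["google", "apple", "dog", "twitter"], toy_vectors)
print(outlier)  # dog
```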

## Getting the Embedding

In [13]:
print(wv["great"].shape)

wv["great"][:20]

(300,)


array([ 0.07177734,  0.20800781, -0.02844238,  0.17871094,  0.1328125 ,
       -0.09960938,  0.09619141, -0.11669922, -0.00854492,  0.1484375 ,
       -0.03344727, -0.18554688,  0.04101562, -0.08984375,  0.02172852,
        0.06933594,  0.18066406,  0.22265625, -0.10058594, -0.06933594],
      dtype=float32)

## Vector Representation of Sentences

Unlike `spacy`, `gensim` does not automatically provide an embedding for a whole sentence.

To create one, we take the `mean` of the vectors of all the tokens in the sentence.

In [23]:
sentence = nlp("Thor's finding a new job as we got fired from his previous one")

# get_mean_vector expects a list of keys; a bare string would be averaged character by character
tokens = [token.lemma_ for token in sentence if (not token.is_stop) and (not token.is_punct)]
sentence_v = wv.get_mean_vector(tokens)

In [21]:
print(len(tokens))
print(sentence_v.shape)

7
(300,)
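
The averaging step itself is simple enough to sketch without gensim. The version below takes a plain element-wise mean over the tokens that have a vector and skips unknown tokens; `mean_vector` and the 2-dimensional `toy_vectors` are hypothetical. Note that gensim's `get_mean_vector` additionally scales each vector to unit length by default (`pre_normalize=True`), so its numbers would differ from an unnormalized mean.

```python
def mean_vector(tokens, vectors):
    """Element-wise mean of the vectors of all tokens found in `vectors`."""
    found = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary tokens
    if not found:
        raise ValueError("none of the tokens has a vector")
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

toy_vectors = {
    "new":  [1.0, 0.0],
    "job":  [0.0, 1.0],
    "fire": [1.0, 1.0],
}

sentence_vec = mean_vector(["new", "job", "fire", "unknown_word"], toy_vectors)
print([round(x, 4) for x in sentence_vec])  # [0.6667, 0.6667]
```

The result has the same dimensionality as a single word vector, so it can be compared with other sentence or word vectors using cosine similarity.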
