# Using word embeddings


In this recipe we switch gears and learn how to represent words using word embeddings,
which are powerful because they are a result of training a neural network that predicts
a word from all other words in the sentence. The resulting vector embeddings are
similar for words that occur in similar contexts. We will use the embeddings to show
these similarities.

# How to do itâ€¦
We will load the model, demonstrate some features of the gensim package, and then
compute a sentence vector using the word embeddings.

IMPORTATION

In [13]:
!pip install gensim



In [2]:
from gensim.models import keyedvectors
import numpy as np

ASSIGN THE MODEL PATH TO A VARIABLE

In [3]:
import gensim.downloader as api
w2v_model = api.load("word2vec-google-news-300")



In [8]:
word2vec_model_path = r"/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz"

LOAD THE PRETRAINED MODEL:

In [12]:
model = keyedvectors.load_word2vec_format(word2vec_model_path, binary = True) # binary = True, this takes care of UnicodeDecodeError

Using the pretrained model, we can now load individual word vectors:

In [18]:
print(model["hello"])

[-0.05419922  0.01708984 -0.00527954  0.33203125 -0.25       -0.01397705
 -0.15039062 -0.265625    0.01647949  0.3828125  -0.03295898 -0.09716797
 -0.16308594 -0.04443359  0.00946045  0.18457031  0.03637695  0.16601562
  0.36328125 -0.25585938  0.375       0.171875    0.21386719 -0.19921875
  0.13085938 -0.07275391 -0.02819824  0.11621094  0.15332031  0.09082031
  0.06787109 -0.0300293  -0.16894531 -0.20800781 -0.03710938 -0.22753906
  0.26367188  0.012146    0.18359375  0.31054688 -0.10791016 -0.19140625
  0.21582031  0.13183594 -0.03515625  0.18554688 -0.30859375  0.04785156
 -0.10986328  0.14355469 -0.43554688 -0.0378418   0.10839844  0.140625
 -0.10595703  0.26171875 -0.17089844  0.39453125  0.12597656 -0.27734375
 -0.28125     0.14746094 -0.20996094  0.02355957  0.18457031  0.00445557
 -0.27929688 -0.03637695 -0.29296875  0.19628906  0.20703125  0.2890625
 -0.20507812  0.06787109 -0.43164062 -0.10986328 -0.2578125  -0.02331543
  0.11328125  0.23144531 -0.04418945  0.10839844 -0.28

We can also get words that are most similar to a given word. For example, let's print
out the words most similar to hello (lowercase, since all the words are lowercased
in the training process):

In [24]:
print(model.most_similar(["hello"], topn=5))

[('hi', 0.6548984050750732), ('goodbye', 0.6399056315422058), ('howdy', 0.6310956478118896), ('goodnight', 0.5920578241348267), ('greeting', 0.5855877995491028)]


We can now also compute a sentence vector by averaging all the word vectors in the
sentence. We will use the sentence It was not that he felt any emotion akin to love for
Irene Adler:

In [20]:
sentence = "It was not that he felt any emotion akin to love for Irene Adler."

Let's define a function that will take a sentences and a model and will return a list of the sentence word vectors:

In [21]:
def get_word_vectors(sentence, model):
  word_vectors = []
  for word in sentence:
    try:
      word_vector = model.get_vector(word.lower())
      word_vectors.append(word_vector)
    except KeyError:
      continue
  return word_vectors

Now, let's defne a function that will take the word vector list and compute the
sentence vector:

In [22]:
def get_sentence_vector(word_vectors):
  matrix = np.array(word_vectors)
  centroid = np.mean(matrix[:, :], axis = 0)
  return centroid

We can now compute the sentence vector:

In [23]:
word_vectors = get_word_vectors(sentence, model)
sentence_vector = get_sentence_vector(word_vectors)
print(sentence_vector)

[-0.1625751   0.117904   -0.03327743  0.13714865 -0.0290534   0.04307888
 -0.07734017 -0.02747113 -0.0376985   0.03732565 -0.04847514 -0.07030454
 -0.23939314  0.00413978 -0.11796238  0.10221531  0.11549311  0.17339823
 -0.01006516  0.01300944 -0.2710863  -0.04825393  0.14016724  0.04992394
 -0.10000245  0.04447595 -0.2545962   0.06052366 -0.0249793  -0.02394436
 -0.00060902  0.03851567 -0.05589493 -0.10330465 -0.12221361  0.09696181
 -0.21117567  0.09080704 -0.04517754  0.07360575 -0.05247696 -0.06609842
  0.06061587  0.08914715  0.04468835 -0.04243366 -0.06161698 -0.1939538
 -0.1545702   0.09573762 -0.20171323  0.24152938 -0.04674497  0.25169572
  0.05233366  0.13106106 -0.16313371 -0.08871593 -0.0040522  -0.16744332
 -0.15011995 -0.08387823 -0.17923239 -0.04825065 -0.03933384 -0.17564325
 -0.11914261  0.12546705 -0.07503078  0.06912629  0.04439644 -0.05143406
  0.05790644  0.0139797  -0.05246701 -0.01477847  0.14195849  0.02867591
 -0.00335229 -0.09226393 -0.13522206 -0.02071812 -0.