# Word Embeddings with Gensim

## Word2vec by Google and GloVe by Stanford

Word embeddings are an **improvement over simpler bag-of-words encoding schemes** such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words. Embeddings instead provide a **dense vector representation of words that captures something about their meaning**.

An embedding learns something about the meaning of a word by **defining the word by the company that it keeps**, that is, by the words that tend to appear around it. The resulting vector space is a projection in which words with similar meanings cluster together.
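To make the contrast concrete, here is a minimal sketch of a bag-of-words encoding in plain Python. The three toy documents are invented for illustration; the point is only that most entries of each count vector are zero and that nothing in the encoding relates one word to another.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]

# Vocabulary: one vector position per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(f"{doc!r}: {vec.count(0)} of {len(vec)} entries are zero")
```

With a realistic vocabulary of tens of thousands of words the vectors become far sparser, and "cat" and "dog" still occupy unrelated positions. A dense embedding addresses both problems.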

We are going to look at how to use two different word embedding methods called **word2vec by researchers at Google and GloVe by researchers at Stanford**.

## Gensim library

Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. For word embeddings it offers two things:

1. An implementation of the word2vec algorithm for **learning new word vectors** from text.
2. Tools for **loading pre-trained word embeddings** in a few formats and for querying a loaded embedding.

Research paper (word2vec): https://arxiv.org/pdf/1301.3781.pdf

In [None]:
!pip install nltk
!pip install gensim



In [None]:
from gensim.models import Word2Vec

In [None]:
# Data requirement:
# Gensim's Word2Vec expects the training data as a 'list of lists':
# the corpus is a list of sentences (or documents), and each
# sentence is itself a list of string tokens.
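The 'list of lists' format can be produced from raw text in a few lines. The sketch below uses a deliberately simple regular-expression tokenizer; the `raw_documents` strings and the `tokenize` helper are assumptions made for illustration, and a tokenizer from a library such as NLTK could be swapped in.

```python
import re

# Hypothetical raw documents, invented for illustration.
raw_documents = [
    "This is the first sentence for word2vec.",
    "Yet another sentence!",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# One inner list of tokens per document: the shape Word2Vec expects.
tokenized = [tokenize(doc) for doc in raw_documents]
print(tokenized)
```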

In [None]:
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

The **Word2Vec()** constructor takes many parameters; a few noteworthy arguments you may wish to configure are:
* **size**: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). In Gensim 4.0 and later this argument is named **vector_size**.
* **window**: (default 5) The maximum distance between a target word and the words around it.
* **min_count**: (default 5) The minimum number of occurrences for a word to be included in training; words that occur fewer times than this are ignored.
* **sg**: (default 0) The training algorithm, either CBOW (0) or skip-gram (1).
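The effect of **min_count** and **window** can be illustrated without training anything. The sketch below is a simplified, plain-Python illustration and not Gensim's internal code; the helper names `filter_by_min_count` and `context_pairs` are hypothetical.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count):
    """Drop tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count]
            for sentence in sentences]

def context_pairs(tokens, window):
    """Yield (target, context) pairs at most `window` positions apart."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

corpus = [['this', 'is', 'the', 'second', 'sentence'],
          ['and', 'the', 'final', 'sentence']]

# Only 'the' and 'sentence' occur at least twice in this tiny corpus.
print(filter_by_min_count(corpus, min_count=2))

# With window=1 each word is paired only with its immediate neighbours.
print(list(context_pairs(corpus[0], window=1)))
```

This is why the toy example above passes `min_count=1`: with the default of 5, almost every word in such a small corpus would be discarded.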

https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
# summarize the loaded model
print(model)

Word2Vec(vocab=14, size=100, alpha=0.025)


After the model is trained, the learned word vectors are accessible via the “wv” attribute. This is the actual word vector model against which queries can be made.

In [None]:
# summarize vocabulary
# (wv.vocab is the Gensim 3.x API; in Gensim 4.0+ use model.wv.index_to_key)
words = list(model.wv.vocab)
print(words)

['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']


In [None]:
# access vector for one word
print(model.wv['this'])

[-4.3893862e-03  3.7961050e-03  4.6180272e-03 -1.2384416e-03
 -3.7356431e-03  4.2216345e-03 -2.9612479e-03 -1.8302110e-03
  3.6823978e-03  4.0696221e-03  1.3373365e-03 -8.3006092e-04
 -9.2130020e-04  2.2681216e-03  2.7022394e-03  3.1971852e-03
 -2.6186912e-03 -3.1495015e-03  3.9227861e-03 -1.9460836e-03
 -2.3027512e-03  4.9813343e-03 -2.3493753e-03 -1.3224503e-03
  4.8802290e-03  6.9500098e-04 -4.8406135e-05 -9.4956590e-04
 -6.0881230e-05 -1.5883758e-03 -2.6180863e-03  2.0279612e-03
  4.9604066e-03 -4.0868670e-03 -4.7215814e-04 -1.6543757e-03
 -4.7288765e-03 -7.8805478e-04  5.1260885e-04 -2.8562951e-03
  1.4741262e-03  4.4005169e-03  4.9622855e-03  3.6938949e-03
 -1.0272657e-03 -4.8746476e-03  2.9128657e-03  2.4387762e-03
 -4.8734355e-03 -1.6346725e-03 -2.7242315e-03  4.3639466e-03
 -2.3159257e-03  9.5855270e-04  2.4707962e-03  6.1542093e-04
 -1.7018864e-03 -8.5802545e-04 -1.3015893e-03  2.5290588e-03
  2.2975812e-03  2.8553384e-03 -2.8669008e-03  1.7357068e-03
 -1.9255104e-03 -2.25432

In [None]:
# check the dimensionality of a word vector
print(len(model.wv['this']))

100


In [None]:
# cosine similarity between two words
model.wv.similarity('more', 'second')

-0.1077418
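The similarity score is the cosine of the angle between the two word vectors, so it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). A minimal plain-Python sketch of the formula, using made-up 2-dimensional vectors rather than the trained ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0]) # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # opposite direction
print(same, orthogonal, opposite)
```

A score close to zero, like the one above for 'more' and 'second', is expected here: a model trained on five short sentences has seen far too little text to learn meaningful relationships.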

In [None]:
# cosine similarity between two sentences
# (n_similarity compares the mean vector of each list of words)
model.wv.n_similarity(sentences[0], sentences[1])

0.66410255
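Comparing two sentences this way amounts to averaging the word vectors of each sentence and taking the cosine similarity of the two averages. The sketch below shows that idea in plain Python; the 2-dimensional `toy_vectors` and the helper names are assumptions for illustration, not Gensim's code.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-dimensional word vectors, invented for illustration.
toy_vectors = {
    'first': [1.0, 0.0],
    'second': [0.0, 1.0],
    'sentence': [1.0, 1.0],
}

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    mean_a = mean_vector([vectors[w] for w in words_a])
    mean_b = mean_vector([vectors[w] for w in words_b])
    return cosine_similarity(mean_a, mean_b)

score = set_similarity(['first', 'sentence'], ['second', 'sentence'], toy_vectors)
print(score)
```

Because the vectors are averaged, word order is ignored, and sentences that share many words (as the first two training sentences do) score high regardless of how well the individual vectors are trained.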

In [None]:
model.wv.most_similar("first")

[('final', 0.15396729111671448),
 ('the', 0.13147643208503723),
 ('yet', 0.09594659507274628),
 ('second', 0.0774758830666542),
 ('sentence', 0.06391753256320953),
 ('and', 0.06266756355762482),
 ('for', 0.058620207011699677),
 ('one', 0.05060993880033493),
 ('more', 0.045300088822841644),
 ('is', -0.013383306562900543)]

In [None]:
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Word2Vec(vocab=14, size=100, alpha=0.025)


## Load Google’s Word2Vec Embedding
Training your own word vectors may be the best approach for a given NLP problem. But it can require a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in finessing the input data and the training algorithm.

An alternative is to simply use an existing pre-trained word embedding. Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the <a href='https://code.google.com/archive/p/word2vec/'>Word2Vec Google Code Project</a> page.

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google News data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. The compressed file is 1.53 gigabytes and can be downloaded here: <a href='https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing'>GoogleNews-vectors-negative300.bin.gz</a>. Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 gigabytes.
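To show how simple such a file is, the sketch below parses the plain-text variant of the word2vec format: a header line with the vocabulary size and vector dimension, followed by one line per token. The sample data and the `read_text_embeddings` helper are invented for illustration. The Google News file itself uses the binary variant, which Gensim can read with `KeyedVectors.load_word2vec_format(path, binary=True)`.

```python
import io

# A tiny made-up file in the word2vec text format.
sample = """3 4
king 0.1 0.2 0.3 0.4
queen 0.2 0.1 0.4 0.3
man 0.5 0.5 0.1 0.0
"""

def read_text_embeddings(handle):
    """Parse word2vec text format into a dict of token -> list of floats."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        token, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {token!r}")
        embeddings[token] = values
    if len(embeddings) != vocab_size:
        raise ValueError("header vocabulary size does not match the file")
    return embeddings

embeddings = read_text_embeddings(io.StringIO(sample))
print(embeddings['queen'])
```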

We can also load it directly through the Gensim downloader API:

In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
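# Download the "word2vec-google-news-300" embeddings
# (note: despite the variable name, these are word2vec vectors;
#  the same name is reused for the GloVe vectors later on)
glove_vectors = gensim.downloader.load('word2vec-google-news-300')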



An interesting thing you can do with word vectors is a little linear algebra. A popular example from lectures and introductory papers is: queen = (king - man) + woman

That is, queen is the word closest to the vector obtained by subtracting the notion of man from king and adding the notion of woman. The “man-ness” in king is replaced with “woman-ness” to give us queen. A very cool concept.

Gensim exposes these operations through the **most_similar()** method of the trained or loaded model. Called with a single word, it returns that word's nearest neighbours:

In [None]:
glove_vectors.most_similar('king')

[('kings', 0.7138046026229858),
 ('queen', 0.6510956883430481),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204220056533813),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864822864532471),
 ('ruler', 0.5797567367553711),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.543294370174408),
 ('throne', 0.5422104597091675)]
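The king/queen arithmetic itself is requested by passing `positive` and `negative` word lists, for example `glove_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)`. The sketch below shows, in simplified plain Python, the kind of computation behind such a query: add and subtract unit-length vectors, then rank the remaining words by cosine similarity. The 2-dimensional `toy` vectors and the `analogy` helper are invented for illustration and are not taken from the loaded model.

```python
import math

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.1],
}

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items() if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy)
print(ranking)
```

In this toy space 'queen' ranks first and 'apple' last. Excluding the input words matters: without it, the nearest word to the query vector is often one of the inputs themselves.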

In [None]:
glove_vectors['queen']

array([ 0.00524902, -0.14355469, -0.06933594,  0.12353516,  0.13183594,
       -0.08886719, -0.07128906, -0.21679688, -0.19726562,  0.05566406,
       -0.07568359, -0.38085938,  0.10400391, -0.00081635,  0.1328125 ,
        0.11279297,  0.07275391, -0.046875  ,  0.06591797,  0.09423828,
        0.19042969,  0.13671875, -0.23632812, -0.11865234,  0.06542969,
       -0.05322266, -0.30859375,  0.09179688,  0.18847656, -0.16699219,
       -0.15625   , -0.13085938, -0.08251953,  0.21289062, -0.35546875,
       -0.13183594,  0.09619141,  0.26367188, -0.09472656,  0.18359375,
        0.10693359, -0.41601562,  0.26953125, -0.02770996,  0.17578125,
       -0.11279297, -0.00411987,  0.14550781,  0.15625   ,  0.26757812,
       -0.01794434,  0.09863281,  0.05297852, -0.03125   , -0.16308594,
       -0.05810547, -0.34375   , -0.17285156,  0.11425781, -0.09033203,
        0.13476562,  0.27929688, -0.04980469,  0.12988281,  0.17578125,
       -0.22167969, -0.01190186,  0.140625  , -0.18164062,  0.11

In [None]:
len(glove_vectors["queen"])

300

## Load Stanford’s GloVe Embedding
Stanford researchers have their own word embedding algorithm, similar in purpose to word2vec, called Global Vectors for Word Representation, or GloVe for short. Some NLP practitioners report better results with GloVe than with word2vec, although which one works better depends on the task and the training corpus.

As with word2vec, the GloVe researchers provide pre-trained word vectors, in this case with a wide selection of corpora and dimensions to choose from.

In [None]:
# Download the 'glove-wiki-gigaword-50' embeddings
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')



In [None]:
glove_vectors['queen']

array([ 0.37854  ,  1.8233   , -1.2648   , -0.1043   ,  0.35829  ,
        0.60029  , -0.17538  ,  0.83767  , -0.056798 , -0.75795  ,
        0.22681  ,  0.98587  ,  0.60587  , -0.31419  ,  0.28877  ,
        0.56013  , -0.77456  ,  0.071421 , -0.5741   ,  0.21342  ,
        0.57674  ,  0.3868   , -0.12574  ,  0.28012  ,  0.28135  ,
       -1.8053   , -1.0421   , -0.19255  , -0.55375  , -0.054526 ,
        1.5574   ,  0.39296  , -0.2475   ,  0.34251  ,  0.45365  ,
        0.16237  ,  0.52464  , -0.070272 , -0.83744  , -1.0326   ,
        0.45946  ,  0.25302  , -0.17837  , -0.73398  , -0.20025  ,
        0.2347   , -0.56095  , -2.2839   ,  0.0092753, -0.60284  ],
      dtype=float32)

In [None]:
len(glove_vectors['queen'])

50

In [None]:
glove_vectors.most_similar("queen")

[('princess', 0.851516604423523),
 ('lady', 0.805060863494873),
 ('elizabeth', 0.787304162979126),
 ('king', 0.7839042544364929),
 ('prince', 0.7821860909461975),
 ('coronation', 0.7692778706550598),
 ('consort', 0.7626097202301025),
 ('royal', 0.7442864775657654),
 ('crown', 0.7382649779319763),
 ('victoria', 0.7285771369934082)]

In [None]:
glove_vectors.most_similar("man")

[('woman', 0.8860337734222412),
 ('boy', 0.8564431071281433),
 ('another', 0.8452839851379395),
 ('old', 0.8372182846069336),
 ('one', 0.8276063203811646),
 ('who', 0.8244696259498596),
 ('him', 0.8194693922996521),
 ('turned', 0.8154467940330505),
 ('whose', 0.8119741678237915),
 ('himself', 0.807725727558136)]