# Word2Vec Algorithm

Word2Vec is one of the most popular algorithms for computing word embeddings. It consists of a small neural network that tries to learn a language model. Remember how we tried to generate text by probabilistically picking the next word? In its simplest form, the network learns which word comes next after a given input word. The results of such a model are rather simplistic: to learn good embeddings, we need more information about the context of a word.
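
To make the "simplest form" concrete, here is a minimal sketch of next-word prediction. It is a count-based stand-in, not a neural network, and the toy corpus and the `most_likely_next` helper are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only)
sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

# Count how often each word follows each other word
next_word_counts = defaultdict(Counter)
for sentence in sentences:
    for current_word, next_word in zip(sentence, sentence[1:]):
        next_word_counts[current_word][next_word] += 1


def most_likely_next(word):
    """Return the most frequent follower of `word`, or None if it was never followed."""
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None


print(most_likely_next("the"))  # cat
```

A model like this only ever looks at one preceding word, which is why a wider context is needed.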
## CBOW vs Skip-Gram

**CBOW (Continuous Bag-Of-Words)** trains a network to predict the word in the middle given some surrounding words: [W[-3], W[-2], W[-1], W[1], W[2], W[3]] => W[0]

**Skip-Gram** is the opposite of CBOW: it tries to predict the surrounding words given the word in the middle: W[0] => [W[-3], W[-2], W[-1], W[1], W[2], W[3]]

The computed network weights are actually the word embeddings we were looking for. If you don’t have any neural network experience, don’t worry, it’s not needed for doing the practical exercises in this tutorial.
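
The two training setups described above can be sketched in plain Python. This is only an illustration of how the (input, target) examples are formed, under the assumption of a fixed symmetric window; the helper names are hypothetical, and real implementations add refinements such as subsampling frequent words:

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_pairs(tokens, window):
    """(context words, centre word): predict the middle word from its surroundings."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_pairs(tokens, window):
    """(centre word, context word): predict each surrounding word from the middle word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]


tokens = ["the", "quick", "brown", "fox", "jumps"]

print(cbow_pairs(tokens, window=2)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')

print(skipgram_pairs(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Note that CBOW produces one example per position, while Skip-Gram produces one example per (centre, context) pair.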
## Word2Vec with Gensim

Gensim provides a quality implementation of the Word2Vec model. Let’s see it in action on the Brown Corpus:

In [4]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\rzouga\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [5]:
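from nltk.corpus import brown
from gensim.models import Word2Vec

# Each sentence is a list of tokens, which is the input format Word2Vec expects
print(brown.sents())
# Note: Gensim 4.0 renamed the `size` parameter to `vector_size`; use `size=128` on Gensim 3.x
w2v_model = Word2Vec(brown.sents(), vector_size=128, window=5, min_count=3, workers=4)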

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]


Let’s now take the model for a spin:

In [6]:
# Getting the vector for a word
print(w2v_model.wv['Italy'], w2v_model.wv['France'])

[ 2.72964776e-01 -4.32864279e-02 -8.88026059e-02  1.18742704e-01
  1.23760151e-02  2.07855478e-01 -7.17213675e-02 -1.23750448e-01
  4.34991382e-02  3.04020971e-01  2.05373392e-01  2.18561757e-02
  1.36980131e-01 -7.64713716e-03 -1.53590009e-01 -1.13585025e-01
  2.60129631e-01 -2.14413181e-01  5.43992519e-01 -4.22305129e-02
 -5.32292202e-02  2.98417270e-01  1.18801601e-01 -6.26715198e-02
  1.38511822e-01  1.14757903e-02  2.63060689e-01  3.70042413e-01
 -1.95784763e-01  3.46381664e-01  1.17206506e-01  2.66734362e-01
  3.11203927e-01  4.96482626e-02  1.34677754e-03 -2.59095728e-01
 -6.33635046e-03 -6.47251680e-02 -5.11556491e-02 -7.06180409e-02
  4.45125774e-02  9.72890295e-03 -1.68661684e-01  4.98738401e-02
  1.75717562e-01  3.97206992e-02  1.61752731e-01 -1.80850789e-01
 -2.39074558e-01  6.04641512e-02  1.51380375e-02 -4.40965556e-02
  1.18253222e-02 -4.47812304e-02 -4.36309929e-04 -1.12051353e-01
 -4.52137589e-02  2.38479957e-01 -7.89065734e-02 -6.74944073e-02
 -2.59511888e-01 -1.35544

In [7]:
# Getting most similar vectors
print(w2v_model.wv.most_similar('Paris'))

[('Italy', 0.975581169128418), ('France', 0.9638313055038452), ('Rome', 0.9636051058769226), ('Eugene', 0.9622185230255127), ('headquarters', 0.9616745114326477), ('London', 0.9605855345726013), ('dancing', 0.9596837759017944), ('breakfast', 0.9589269757270813), ('Harvard', 0.957659125328064), ('Vienna', 0.9567881226539612)]
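
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of what that ranking involves, here is a pure-Python version on made-up 3-dimensional vectors (the `vectors` dictionary and both helpers are hypothetical and are not Gensim's implementation):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only
vectors = {
    "paris": [0.9, 0.1, 0.3],
    "rome": [0.8, 0.2, 0.35],
    "breakfast": [0.1, 0.9, 0.2],
}


def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("paris"))
```

With these toy vectors, "rome" ranks above "breakfast" because its vector points in almost the same direction as the one for "paris".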


In [8]:
# "King" - "Man" + "Woman" == "Queen"
print(w2v_model.wv.most_similar(positive=['woman', 'king'], negative=['man']))
print(w2v_model.wv.most_similar(positive=["Rome", "France"], negative=["Italy"]))

[('extracting', 0.96043461561203), ('united', 0.9594190120697021), ('duties', 0.9566020965576172), ('savage', 0.9526326060295105), ('stem', 0.9486423134803772), ('measuring', 0.944673478603363), ('patronage', 0.9441121816635132), ('Colonial', 0.9437551498413086), ('trigger', 0.9437280297279358), ('original', 0.9435445666313171)]
[('reputed', 0.9475245475769043), ('earth', 0.9457082152366638), ('beach', 0.9398365020751953), ('sink', 0.9354729652404785), ('grounds', 0.9338986873626709), ('planks', 0.9321511387825012), ('Church', 0.9311625361442566), ('transformed', 0.9308785200119019), ('airfield', 0.9303730130195618), ('plantation', 0.9300895929336548)]


If you’ve been coding along, you are probably pretty disappointed with the results. The Brown Corpus, at roughly one million words, is small for this kind of model. Let’s try a bigger corpus:

In [None]:
from gensim.models.word2vec import Text8Corpus

# Download and unzip the Text8 Corpus first: http://mattmahoney.net/dc/text8.zip
# min_count=150 keeps only words that appear at least 150 times, which makes the later visualization easier
# Note: Gensim 4.0 renamed `size` to `vector_size`; use `size=100` on Gensim 3.x
w2v_model2 = Word2Vec(Text8Corpus('C:/Users/rzouga/Downloads/text8/text8'), vector_size=100, window=5, min_count=150, workers=4)
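
The effect of `min_count` can be pictured as a simple frequency filter on the vocabulary. The following is a sketch of the idea only, with a hypothetical `build_vocabulary` helper and a toy corpus; it is not how Gensim builds its vocabulary internally:

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep only the words whose total frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus (hypothetical): 'a' occurs 3 times, 'b' twice, 'c' once
sentences = [["a", "b", "a"], ["a", "c", "b"]]

print(sorted(build_vocabulary(sentences, min_count=2)))  # ['a', 'b']
```

Words dropped this way get no vector at all, so looking them up in the trained model fails.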
 

We opted to use only the most frequent words so that it’s easier to make a visualization later. Let’s see how the new model performs (your words and values will differ a bit):

In [None]:
# Getting most similar vectors
print(w2v_model2.wv.most_similar('paris'))
# [('louvre', 0.7243613004684448), 
#  ('venice', 0.7047281265258789), 
#  ('vienna', 0.7043783068656921),
#  ('montparnasse', 0.7016372680664062), 
#  ('le', 0.6870340704917908), 
#  ('sur', 0.6818796396255493), 
#  ('chapelle', 0.6787714958190918), 
#  ('rodin', 0.6766049265861511), 
#  ('bologna', 0.6761612892150879), 
#  ('munich', 0.6749240159988403)]
 
# "King" - "Man" + "Woman" == "Queen"
print(w2v_model2.wv.most_similar(['woman', 'king'], ['man'], topn=3))
# [('queen', 0.6777610778808594), ('throne', 0.6143913269042969), ('elizabeth', 0.593910813331604)]

# "Father" - "Boy" + "Girl" == "Mother"
print(w2v_model2.wv.most_similar(['girl', 'father'], ['boy'], topn=3))
# [('mother', 0.7972878813743591), ('wife', 0.7469687461853027), ('grandmother', 0.7419005632400513)]

# "Paris" - "France" + "Italy" == "Rome"
print(w2v_model2.wv.most_similar(['paris', 'italy'], ['france'], topn=3))
# [('venice', 0.7461134195327759), ('vienna', 0.7134193778038025), ('florence', 0.7019181251525879)]
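
The analogy queries above boil down to vector arithmetic followed by a nearest-neighbour search. Here is a rough sketch on made-up 2-dimensional vectors, assuming the usual recipe of adding the normalized positive vectors, subtracting the normalized negative ones, and ranking the remaining words by cosine similarity. The `vectors` dictionary and the `analogy` helper are hypothetical:

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Made-up vectors: axis 0 loosely stands for "royalty", axis 1 for "gender"
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    query = [0.0] * len(next(iter(vectors.values())))
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[word]))]
    excluded = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# "King" - "Man" + "Woman"
print(analogy(["woman", "king"], ["man"])[0][0])  # queen
```

Excluding the input words matters: without it, the nearest neighbour of the query is often one of the words that went into it.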