## Training Embeddings Using Gensim

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Define the training data.
# Gensim's Word2Vec expects a list of lists: each document is one inner list,
# and each inner list holds that document's tokens.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0 selects the CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1 selects the skip-gram architecture
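
The list-of-lists input described in the comments above is usually built from raw text. The following is a minimal sketch of one way to do that, assuming simple lowercasing and whitespace splitting; `raw_sentences` and `to_token_lists` are hypothetical names introduced only for illustration, and real text would normally need a proper tokenizer.

```python
# Hypothetical raw sentences; any iterable of strings would work here.
raw_sentences = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

def to_token_lists(sentences):
    """Lowercase each sentence and split it on whitespace (illustrative only)."""
    return [sentence.lower().split() for sentence in sentences]

corpus_from_text = to_token_lists(raw_sentences)
print(corpus_from_text)
```

The result has the same shape as the hand-written `corpus` and could be passed to `Word2Vec` in the same way.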

## Continuous Bag of Words (CBOW)
In CBOW, the model is trained to predict the center word from the context words that surround it.
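
To make the prediction task concrete, here is a rough sketch of the (context, center) training examples that CBOW works from. It is not Gensim's implementation: `cbow_pairs` is a hypothetical helper, and the fixed `window=1` is an assumption chosen to keep the output short (Gensim's default `window` is 5).

```python
def cbow_pairs(tokens, window=1):
    """Yield (context_words, center_word) pairs for one tokenized sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the center
        yield left + right, center

pairs = list(cbow_pairs(['dog', 'bites', 'man'], window=1))
for context, center in pairs:
    print(context, '->', center)
```

Each pair asks the model to recover the word on the right from the words on the left.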

In [4]:
# Summarize the trained model
print(model_cbow)

# Summarize the vocabulary
words = list(model_cbow.wv.index_to_key)
print(words)

# Get the index of a word
print('Index of [man] -->', model_cbow.wv.get_index('man'))
print('Index of [dog] -->', model_cbow.wv.get_index('dog'))
print('Index of [eats] -->', model_cbow.wv.get_index('eats'))

# Access the vector for one word
model_cbow.wv['dog']

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
Index of [man] --> 0
Index of [dog] --> 1
Index of [eats] --> 2


array([-8.6196875e-03,  3.6657380e-03,  5.1898835e-03,  5.7419371e-03,
        7.4669169e-03, -6.1676763e-03,  1.1056137e-03,  6.0472824e-03,
       -2.8400517e-03, -6.1735227e-03, -4.1022300e-04, -8.3689503e-03,
       -5.6000138e-03,  7.1045374e-03,  3.3525396e-03,  7.2256685e-03,
        6.8002464e-03,  7.5307419e-03, -3.7891555e-03, -5.6180713e-04,
        2.3483753e-03, -4.5190332e-03,  8.3887316e-03, -9.8581649e-03,
        6.7646410e-03,  2.9144168e-03, -4.9328329e-03,  4.3981862e-03,
       -1.7395759e-03,  6.7113829e-03,  9.9648498e-03, -4.3624449e-03,
       -5.9933902e-04, -5.6956387e-03,  3.8508223e-03,  2.7866268e-03,
        6.8910765e-03,  6.1010956e-03,  9.5384959e-03,  9.2734173e-03,
        7.8980681e-03, -6.9895051e-03, -9.1558648e-03, -3.5575390e-04,
       -3.0998420e-03,  7.8943158e-03,  5.9385728e-03, -1.5456629e-03,
        1.5109634e-03,  1.7900396e-03,  7.8175711e-03, -9.5101884e-03,
       -2.0553112e-04,  3.4691954e-03, -9.3897345e-04,  8.3817719e-03,
      

The trained word vectors are stored in a KeyedVectors instance, as model.wv:
[models.word2vec – Word2vec embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

In [5]:
#Compute similarity 
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.013497097
Similarity between eats and man: -0.052354384


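From the scores above, eats is closer to bites than to man, but both values are slightly negative and very close to zero. With a corpus of only four short sentences the vectors have likely moved little from their random initialization, so the gap between the two scores should not be read as a reliable semantic signal.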
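
The scores returned by `similarity` are cosine similarities, which range from -1 to 1. As a small sketch of what that measure computes, the following uses plain Python on toy two-dimensional vectors; `cosine_similarity` and the vectors `a`, `b`, `c` are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not model output
a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

print(cosine_similarity(a, c))  # same direction, different length
print(cosine_similarity(a, b))  # orthogonal directions
```

Because only the direction matters, vectors of different lengths can still have a similarity of 1, while unrelated (orthogonal) vectors score 0.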

In [6]:
# Words most similar to 'meat'
model_cbow.wv.most_similar('meat')

[('food', 0.13887985050678253),
 ('bites', 0.13149003684520721),
 ('eats', 0.06422408670186996),
 ('dog', 0.009391186758875847),
 ('man', -0.05987628176808357)]
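
The list above ranks every other vocabulary word by its cosine similarity to the query word. A simplified sketch of that ranking step, assuming a plain dictionary of toy vectors rather than a trained model, might look like this; `rank_by_similarity` and `toy_vectors` are hypothetical names and values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def rank_by_similarity(word, vectors, topn=3):
    """Return up to `topn` (other_word, score) pairs, highest score first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Toy vectors chosen by hand, not model output
toy_vectors = {
    'meat': [1.0, 0.0],
    'food': [0.9, 0.1],
    'man': [-1.0, 0.2],
    'dog': [0.0, 1.0],
}

ranking = rank_by_similarity('meat', toy_vectors)
for other, score in ranking:
    print(other, round(score, 3))
```

The query word itself is excluded from the candidates, which is why the model output above lists five words for a six-word vocabulary.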

In [7]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## SkipGram
In skip-gram, the task is the reverse of CBOW: predict the context words from the center word.
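
As a rough illustration of this objective, the sketch below lists the (center, context) pairs that skip-gram would try to predict for one sentence. It is not Gensim's implementation: `skipgram_pairs` is a hypothetical helper, and `window=1` is an assumption made to keep the example small.

```python
def skipgram_pairs(tokens, window=1):
    """Yield one (center_word, context_word) pair per neighbor inside the window."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the center word itself
                yield center, tokens[j]

sg_pairs = list(skipgram_pairs(['dog', 'bites', 'man'], window=1))
for center, context in sg_pairs:
    print(center, '->', context)
```

Each center word produces a separate training example for every neighbor, rather than one example built from all neighbors combined.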

In [8]:
# Summarize the trained model
print(model_skipgram)

# Summarize the vocabulary
words = list(model_skipgram.wv.index_to_key)
print(words)

# Access the vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419371e-03
  7.4669169e-03 -6.1676763e-03  1.1056137e-03  6.0472824e-03
 -2.8400517e-03 -6.1735227e-03 -4.1022300e-04 -8.3689503e-03
 -5.6000138e-03  7.1045374e-03  3.3525396e-03  7.2256685e-03
  6.8002464e-03  7.5307419e-03 -3.7891555e-03 -5.6180713e-04
  2.3483753e-03 -4.5190332e-03  8.3887316e-03 -9.8581649e-03
  6.7646410e-03  2.9144168e-03 -4.9328329e-03  4.3981862e-03
 -1.7395759e-03  6.7113829e-03  9.9648498e-03 -4.3624449e-03
 -5.9933902e-04 -5.6956387e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384959e-03  9.2734173e-03
  7.8980681e-03 -6.9895051e-03 -9.1558648e-03 -3.5575390e-04
 -3.0998420e-03  7.8943158e-03  5.9385728e-03 -1.5456629e-03
  1.5109634e-03  1.7900396e-03  7.8175711e-03 -9.5101884e-03
 -2.0553112e-04  3.4691954e-03 -9.3897345e-04  8.3817719e-03
  9.0107825e-03  6.5365052e-03 -7.1162224e-04  7.7


In [9]:
# Words most similar to 'meat'
model_skipgram.wv.most_similar('meat')

[('food', 0.13887986540794373),
 ('bites', 0.1314900517463684),
 ('eats', 0.06406084448099136),
 ('dog', 0.009391188621520996),
 ('man', -0.059876274317502975)]

In [10]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
