<a href="https://www.kaggle.com/code/pawankumargunjan/training-embeddings-using-gensim?scriptVersionId=112271467" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Training Embeddings Using Gensim

In [1]:
import os
import requests
import bz2
import tqdm
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define the training data.
# Gensim's Word2Vec expects a 'list of lists': one inner list per document,
# and each inner list holds the tokens of that document.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)      # CBOW architecture (sg=0)
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # SkipGram architecture (sg=1)
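
The 'list of lists' input can be produced from raw text with any tokenizer. The sketch below is a minimal illustration, assuming lowercase whitespace splitting is good enough; `raw_docs`, `tokenize` and `toy_corpus` are hypothetical names introduced here and are not part of Gensim.

```python
# Hypothetical raw documents; in practice these would come from files or a dataset.
raw_docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


# One inner list of tokens per document, the shape Word2Vec expects.
toy_corpus = [tokenize(doc) for doc in raw_docs]
print(toy_corpus)
```

A real pipeline would usually also strip punctuation and handle casing more carefully; this only shows the expected shape of the data.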

## Continuous Bag of Words (CBOW)
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
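
To make the CBOW task concrete, the sketch below lists the (context words, center word) examples that a fixed window would produce for one sentence. It is a simplified illustration only: the `cbow_examples` helper and its `window` argument are assumptions made here, and Gensim's real training loop adds details (such as negative sampling) that are left out.

```python
def cbow_examples(tokens, window=1):
    """Return (context_words, center_word) pairs for one tokenized sentence."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the center
        examples.append((left + right, center))
    return examples


for context, center in cbow_examples(["dog", "bites", "man"], window=1):
    print(context, "->", center)
```

Each line printed is one prediction problem: the model sees the context words and is trained to output the center word.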

In [3]:
#Summarize the loaded model
print(model_cbow)

#Summarize vocabulary
words = list(model_cbow.wv.index_to_key)
print(words)

# Get the index of a word
print('Index of [man] -->',model_cbow.wv.get_index('man'))
print('Index of [dog] -->',model_cbow.wv.get_index('dog'))
print('Index of [eats] -->',model_cbow.wv.get_index('eats'))

# Access the vector for one word
model_cbow.wv['dog']

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
Index of [man] --> 0
Index of [dog] --> 1
Index of [eats] --> 2


array([-8.6196875e-03,  3.6657380e-03,  5.1898835e-03,  5.7419371e-03,
        7.4669169e-03, -6.1676763e-03,  1.1056137e-03,  6.0472824e-03,
       -2.8400517e-03, -6.1735227e-03, -4.1022300e-04, -8.3689503e-03,
       -5.6000138e-03,  7.1045374e-03,  3.3525396e-03,  7.2256685e-03,
        6.8002464e-03,  7.5307419e-03, -3.7891555e-03, -5.6180713e-04,
        2.3483753e-03, -4.5190332e-03,  8.3887316e-03, -9.8581649e-03,
        6.7646410e-03,  2.9144168e-03, -4.9328329e-03,  4.3981862e-03,
       -1.7395759e-03,  6.7113829e-03,  9.9648498e-03, -4.3624449e-03,
       -5.9933902e-04, -5.6956387e-03,  3.8508223e-03,  2.7866268e-03,
        6.8910765e-03,  6.1010956e-03,  9.5384959e-03,  9.2734173e-03,
        7.8980681e-03, -6.9895051e-03, -9.1558648e-03, -3.5575390e-04,
       -3.0998420e-03,  7.8943158e-03,  5.9385728e-03, -1.5456629e-03,
        1.5109634e-03,  1.7900396e-03,  7.8175711e-03, -9.5101884e-03,
       -2.0553112e-04,  3.4691954e-03, -9.3897345e-04,  8.3817719e-03,
      

The trained word vectors are stored in a KeyedVectors instance, as model.wv:
[models.word2vec – Word2vec embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

In [4]:
#Compute similarity 
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.013497097
Similarity between eats and man: -0.052354384


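From the similarity scores above, eats is more similar to bites (about -0.013) than to man (about -0.052). Both scores are close to zero, as expected for a corpus of only four sentences, so the gap between them is small.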
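
The score returned by `similarity` is the cosine similarity of the two word vectors. As a minimal sketch of that formula in plain Python (the `cosine_similarity` helper and the two-dimensional example vectors are illustrative assumptions, not Gensim code):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Values range from -1 to 1, so scores near zero such as those above indicate vectors that are almost unrelated in direction.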

In [5]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('food', 0.13887985050678253),
 ('bites', 0.13149003684520721),
 ('eats', 0.06422408670186996),
 ('dog', 0.009391186758875847),
 ('man', -0.05987628176808357)]

In [6]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## SkipGram
In SkipGram, the task is to predict the context words from the center word.
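
As a rough illustration of that task, the sketch below enumerates the (center word, context word) pairs that a fixed window yields for one sentence. The `skipgram_pairs` helper and its `window` argument are hypothetical and simplified; Gensim's actual training procedure is more involved.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


for center, context in skipgram_pairs(["man", "eats", "food"], window=1):
    print(center, "->", context)
```

Here every pair is a separate prediction problem, so a sentence produces more training examples than in the CBOW setup.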

In [7]:
#Summarize the loaded model
print(model_skipgram)

#Summarize vocabulary
words = list(model_skipgram.wv.index_to_key)
print(words)

# Access the vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419371e-03
  7.4669169e-03 -6.1676763e-03  1.1056137e-03  6.0472824e-03
 -2.8400517e-03 -6.1735227e-03 -4.1022300e-04 -8.3689503e-03
 -5.6000138e-03  7.1045374e-03  3.3525396e-03  7.2256685e-03
  6.8002464e-03  7.5307419e-03 -3.7891555e-03 -5.6180713e-04
  2.3483753e-03 -4.5190332e-03  8.3887316e-03 -9.8581649e-03
  6.7646410e-03  2.9144168e-03 -4.9328329e-03  4.3981862e-03
 -1.7395759e-03  6.7113829e-03  9.9648498e-03 -4.3624449e-03
 -5.9933902e-04 -5.6956387e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384959e-03  9.2734173e-03
  7.8980681e-03 -6.9895051e-03 -9.1558648e-03 -3.5575390e-04
 -3.0998420e-03  7.8943158e-03  5.9385728e-03 -1.5456629e-03
  1.5109634e-03  1.7900396e-03  7.8175711e-03 -9.5101884e-03
 -2.0553112e-04  3.4691954e-03 -9.3897345e-04  8.3817719e-03
  9.0107825e-03  6.5365052e-03 -7.1162224e-04  7.7

In [8]:
# Most similar words
model_skipgram.wv.most_similar('meat')

[('food', 0.13887986540794373),
 ('bites', 0.1314900517463684),
 ('eats', 0.06406084448099136),
 ('dog', 0.009391188621520996),
 ('man', -0.059876274317502975)]

In [9]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus
The corpus download page : [enwiki dump progress on 20221101](https://dumps.wikimedia.org/enwiki/20221101/)

The entire wiki corpus as of 20/10/2022 is just over 19.9GB in size. Because of computation constraints, we train our Word2Vec and FastText embeddings on only one part of this corpus.

That file is 234MB, so it can take a while to download.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [10]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2


In [11]:
#from gensim.corpora.wikicorpus import WikiCorpus

from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import WikiCorpus, MmCorpus

from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

### `corpora.wikicorpus` [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html)

In [12]:
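%%time
# datapath() returns an absolute path unchanged, so this is simply the Kaggle input file.
fname = datapath('/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2')

# Prepare the training data.
# Passing dictionary={} skips the word->word_id vocabulary scan (~8h on the full wiki),
# so creating the corpus object is near-instant.
wiki = WikiCorpus(fname, dictionary={})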

CPU times: user 29 µs, sys: 4 µs, total: 33 µs
Wall time: 38.1 µs


In [13]:
%%time
print(wiki.fname)
print(wiki.token_max_len)
print(wiki.token_min_len)

/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2
15
2
CPU times: user 332 µs, sys: 0 ns, total: 332 µs
Wall time: 322 µs


In [14]:
%%time
start = time.time()

sentences = list(wiki.get_texts())

end = time.time()
print('Time consumed :',(end-start)/3600.0)  # elapsed time in hours

Time consumed : 0.09144315719604493
CPU times: user 31.9 s, sys: 8.08 s, total: 39.9 s
Wall time: 5min 29s
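
Materialising every article with `list(...)` keeps the whole tokenized corpus in memory. Word2Vec iterates over its input more than once (a vocabulary pass plus one pass per epoch), so a one-shot generator cannot be passed directly. One possible alternative, sketched below under the assumption that the texts can be regenerated on demand, is a small restartable iterable; `RestartableCorpus` and `toy_texts` are hypothetical names used only for illustration.

```python
class RestartableCorpus:
    """Iterable that re-creates its underlying generator on every pass."""

    def __init__(self, make_generator):
        # make_generator is a zero-argument callable returning a fresh iterator
        self.make_generator = make_generator

    def __iter__(self):
        return iter(self.make_generator())


def toy_texts():
    """Stand-in for a function that streams tokenized documents."""
    yield ["dog", "bites", "man"]
    yield ["man", "eats", "food"]


corpus_stream = RestartableCorpus(toy_texts)
first_pass = list(corpus_stream)
second_pass = list(corpus_stream)  # works again because the generator is rebuilt
print(first_pass == second_pass)
```

With the objects in this notebook the equivalent would presumably be `RestartableCorpus(wiki.get_texts)`, at the cost of re-parsing the dump on every pass; that variant is untested here.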


In [15]:
print(len(sentences))
print(len(sentences[0]))
sentences[-1][:10]

77542
973


['the',
 'japanese',
 'motorcycle',
 'grand',
 'prix',
 'was',
 'the',
 'fifteenth',
 'round',
 'of']

## Hyperparameters
1. sg - Selects the training algorithm: 1 for SkipGram, 0 for CBOW. The default is 0 (CBOW).
2. min_count - Ignores all words whose total frequency is lower than this value.

There are many more hyperparameters; the full list can be found in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
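
The effect of `min_count` can be sketched without Gensim by counting token frequencies and keeping only the words that reach the threshold. The `surviving_vocab` helper below is a hypothetical illustration of that rule applied to the toy corpus from earlier, not Gensim's internal code.

```python
from collections import Counter

toy_corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
              ["dog", "eats", "meat"], ["man", "eats", "food"]]


def surviving_vocab(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


print(sorted(surviving_vocab(toy_corpus, min_count=1)))
print(sorted(surviving_vocab(toy_corpus, min_count=2)))
```

With `min_count=2` the words that occur only once ('meat' and 'food') drop out, which is why the toy models above use `min_count=1` while the Wikipedia models can afford `min_count=10`.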

### CBOW

In [16]:
%%time
#CBOW
start = time.time()
Word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.06 hrs 
CPU times: user 9min 18s, sys: 2.05 s, total: 9min 20s
Wall time: 3min 26s


In [17]:
%%time

#Summarize the loaded model
print(Word2vec_cbow)
print("-"*30)

#Summarize vocabulary
words = list(Word2vec_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(Word2vec_cbow.wv['film'])}")
print(Word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",Word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",Word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[ 3.1694582  -4.454948    0.0878444   1.2660654   0.8283105   0.9658423
  3.7621875  -0.30470476 -4.6764154  -2.3791826   0.08099426 -0.15254942
 -1.8295522   2.6030166   1.2766672   0.9889377   1.8710304  -1.9499241
  1.4759715   0.7074961  -0.10871384  0.14381945  1.0873828  -2.0123026
 -3.5031784   1.9438506  -2.9056702  -0.07010394  2.1229343  -0.6083821
 -0.96965057 -2.6384518   1.0665815  -2.6463773   1.675288    0.87991565
  0.8272501  -0.7926914  -4.431944   -2.889108    0.8323046  -0.14915292
 -0.8471119   1.0418599  -0.99331754 -0.20755775  0.44330654  1.2130358
  1.7086853  -2.85

#### Save and load the model

In [18]:
# save the word vectors in binary word2vec format
Word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

### SkipGram

In [19]:
#SkipGram
start = time.time()
Word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.20 hrs 


In [20]:
#Summarize the loaded model
print(Word2vec_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(Word2vec_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(Word2vec_skipgram.wv['film'])}")
print(Word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",Word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",Word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[ 0.26736256  0.03360388 -0.14563836  0.05526635  0.14444895 -0.29273304
  0.487039    0.10935839 -0.7665564   0.04546815 -0.03544531 -0.16210724
  0.37799376 -0.16219042  0.11787659  0.12785837  0.26526293 -0.10306424
 -0.03573714  0.24010913  0.14967093 -0.28675485  0.35209003 -0.11923608
 -0.94598454 -0.07384407  0.37476546  0.0744867  -0.50035137 -0.04362003
  0.25012168 -0.1710057  -0.11080901 -0.08486956  0.71094215 -0.6680979
  0.31692755  0.3558242  -0.10730179 -0.3143575  -0.49486795 -0.0287613
 -0.13663754  0.19066735 -0.3746497  -0.19040103  0.5039522   0.17239863
  0.7895494  -0

### Save and load the Word2vec_skipgram model

In [21]:
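# save the word vectors in binary word2vec format
Word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load model
# 'model_skipgram.bin' is the toy SkipGram model saved in cell [9], not the file written above,
# which is why the summary below reports vocab=6. The Wikipedia vectors can be reloaded with
# KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True).
word2vec_skipgram = Word2Vec.load('model_skipgram.bin')
print(word2vec_skipgram)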

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## FastText

### CBOW

In [22]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.21 hrs 


In [23]:
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[-1.8614615   0.5741876   3.5077143  -2.157831   -0.25449887 -4.500002
  2.91234     0.12153244  4.4217067  -0.07569537  0.66587853 -0.6200798
 -0.06538694  1.9388207  -1.4861209   1.7248286   4.161297   -6.194644
  2.66238     4.4324136  -2.3266153   6.002615   -2.6028821  -1.1359528
 -4.8820834  -2.0344772   3.7973008  -0.10506229  3.3753996  -2.0619798
  3.229767    0.88663155 -1.5092486  -4.86947    -1.9937739  -3.3753724
  1.3583407  -3.5202637  -1.0740303  -0.6337754  -2.0431354  -4.586764
 -2.8852248  -2.7304537   1.4096444  -3.5508888   0.86877847  1.7016754
 -0.48791644  4.3738213 

In [24]:
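# save the word vectors
# Note: the word2vec format stores only full-word vectors, not FastText's subword n-grams.
from gensim.models import Word2Vec, KeyedVectors
fasttext_cbow.wv.save_word2vec_format('fasttext_cbow.bin', binary=True)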

### SkipGram

In [25]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.36 hrs 


In [26]:
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[-0.1518332  -0.17759429  0.03001726 -0.10886565  0.02227775 -0.34577343
 -0.0374246   0.70050097  0.5571169   0.13575003 -0.23983465 -0.10684963
  0.38629514  0.7093717  -0.24469987  0.15505223 -0.18050985 -0.13012356
  0.21945451  0.0971472  -0.18258832  0.6405899  -0.1332966   0.78917503
 -0.32681963 -0.3499852  -0.05584091 -0.5915512   0.09944031 -0.57027274
 -0.04627981 -0.12266276  0.11577561 -0.155383   -0.17567195 -0.46998903
 -0.06239457  0.87981665  0.3904618   0.36565825 -0.0072217  -0.4131936
 -0.10504191 -0.42271093 -0.2555753   0.04925466 -0.33578447 -0.1272126
 -0.24456462 -0

In [27]:
# save model
from gensim.models import Word2Vec, KeyedVectors   
fasttext_skipgram.wv.save_word2vec_format('fasttext_skipgram.bin', binary=True)

#### CBOW trains faster than SkipGram in both cases: 0.06 hrs vs 0.20 hrs for Word2Vec, and 0.21 hrs vs 0.36 hrs for FastText.
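
To put the reported training times side by side, the short calculation below divides each SkipGram time by the corresponding CBOW time. The figures are the rounded values printed by the training cells above, so the ratios are only approximate; `reported_hours` and `slowdown` are names introduced here for illustration.

```python
# Training times in hours, as printed by the training cells (rounded to 2 decimals).
reported_hours = {
    "Word2Vec": {"cbow": 0.06, "skipgram": 0.20},
    "FastText": {"cbow": 0.21, "skipgram": 0.36},
}

# Ratio of SkipGram time to CBOW time for each model family.
slowdown = {
    name: times["skipgram"] / times["cbow"]
    for name, times in reported_hours.items()
}

for name, ratio in slowdown.items():
    print(f"{name}: SkipGram took about {ratio:.1f}x as long as CBOW")
```

These are single runs on one machine, so the ratios indicate a trend rather than a benchmark.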