
## Training Embeddings Using Gensim

In [1]:
import os
import requests
import bz2
import tqdm
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define training data.
# Gensim's Word2Vec expects a list of lists for training:
# each inner list holds the tokens of one document.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: SkipGram architecture

## Continuous Bag of Words (CBOW)
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
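
To make the prediction task concrete, here is a minimal sketch that lists the (context words, center word) examples a CBOW model would see on the toy corpus. The helper name `cbow_examples` and the `window=1` setting are assumptions chosen for illustration; Gensim's real training loop adds details (such as negative sampling) that are left out here.

```python
# Toy corpus from the cell above
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
          ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

def cbow_examples(sentences, window=1):
    """Yield (context_words, center_word) for each token position."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            left = tokens[max(0, i - window):i]       # up to `window` words before
            right = tokens[i + 1:i + 1 + window]      # up to `window` words after
            context = left + right
            if context:                               # skip one-word sentences
                yield context, center

examples = list(cbow_examples(corpus, window=1))
for context, center in examples[:3]:
    print(context, '->', center)
```

Each example pairs the surrounding words (input) with the word in the middle (prediction target).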

In [3]:
#Summarize the loaded model
print(model_cbow)

#Summarize vocabulary
words = list(model_cbow.wv.index_to_key)
print(words)

# Get the index of a word
print('Index of [man] -->',model_cbow.wv.get_index('man'))
print('Index of [dog] -->',model_cbow.wv.get_index('dog'))
print('Index of [eats] -->',model_cbow.wv.get_index('eats'))

#Access vector for one word
model_cbow.wv['dog']

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
Index of [man] --> 0
Index of [dog] --> 1
Index of [eats] --> 2


array([-8.6196875e-03,  3.6657380e-03,  5.1898835e-03,  5.7419371e-03,
        7.4669169e-03, -6.1676763e-03,  1.1056137e-03,  6.0472824e-03,
       -2.8400517e-03, -6.1735227e-03, -4.1022300e-04, -8.3689503e-03,
       -5.6000138e-03,  7.1045374e-03,  3.3525396e-03,  7.2256685e-03,
        6.8002464e-03,  7.5307419e-03, -3.7891555e-03, -5.6180713e-04,
        2.3483753e-03, -4.5190332e-03,  8.3887316e-03, -9.8581649e-03,
        6.7646410e-03,  2.9144168e-03, -4.9328329e-03,  4.3981862e-03,
       -1.7395759e-03,  6.7113829e-03,  9.9648498e-03, -4.3624449e-03,
       -5.9933902e-04, -5.6956387e-03,  3.8508223e-03,  2.7866268e-03,
        6.8910765e-03,  6.1010956e-03,  9.5384959e-03,  9.2734173e-03,
        7.8980681e-03, -6.9895051e-03, -9.1558648e-03, -3.5575390e-04,
       -3.0998420e-03,  7.8943158e-03,  5.9385728e-03, -1.5456629e-03,
        1.5109634e-03,  1.7900396e-03,  7.8175711e-03, -9.5101884e-03,
       -2.0553112e-04,  3.4691954e-03, -9.3897345e-04,  8.3817719e-03,
      

The trained word vectors are stored in a KeyedVectors instance, as model.wv:
[models.word2vec – Word2vec embeddings](https://radimrehurek.com/gensim/models/word2vec.html)

In [4]:
#Compute similarity 
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.013497097
Similarity between eats and man: -0.052354384


From the above similarity scores, eats is slightly more similar to bites than to man. Both scores are close to zero, though, so a corpus this small does not support a strong conclusion.
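
The `similarity` method returns the cosine similarity between two word vectors. The sketch below computes that quantity by hand. The helper `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]    # points in the same direction as vec_a
vec_c = [-2.0, 1.0, 0.0]   # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1: same direction
sim_ac = cosine_similarity(vec_a, vec_c)  # 0: unrelated directions
print(sim_ab, sim_ac)
```

Scores near zero, like the ones printed above, mean the two vectors are almost orthogonal.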

In [5]:
#Most similar words
model_cbow.wv.most_similar('meat')

[('food', 0.13887985050678253),
 ('bites', 0.13149003684520721),
 ('eats', 0.06422408670186996),
 ('dog', 0.009391186758875847),
 ('man', -0.05987628176808357)]

In [6]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
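
As a rough illustration, the sketch below enumerates the (center word, context word) pairs that skip-gram would train on for the toy corpus. The helper `skipgram_pairs` and `window=1` are assumptions for this example; it shows only how the pairs are formed, not Gensim's actual training code.

```python
# Toy corpus from the first cell
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
          ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

def skipgram_pairs(sentences, window=1):
    """Yield (center_word, context_word) for every neighbour within the window."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:                  # a word is not its own context
                    yield center, tokens[j]

pairs = list(skipgram_pairs(corpus, window=1))
print(pairs[:4])
```

Compared with CBOW, every context word becomes its own training pair, which is one reason skip-gram usually takes longer to train.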

In [7]:
#Summarize the loaded model
print(model_skipgram)

#Summarize vocabulary
words = list(model_skipgram.wv.index_to_key)
print(words)

#Access vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, vector_size=100, alpha=0.025)
['man', 'dog', 'eats', 'bites', 'food', 'meat']
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419371e-03
  7.4669169e-03 -6.1676763e-03  1.1056137e-03  6.0472824e-03
 -2.8400517e-03 -6.1735227e-03 -4.1022300e-04 -8.3689503e-03
 -5.6000138e-03  7.1045374e-03  3.3525396e-03  7.2256685e-03
  6.8002464e-03  7.5307419e-03 -3.7891555e-03 -5.6180713e-04
  2.3483753e-03 -4.5190332e-03  8.3887316e-03 -9.8581649e-03
  6.7646410e-03  2.9144168e-03 -4.9328329e-03  4.3981862e-03
 -1.7395759e-03  6.7113829e-03  9.9648498e-03 -4.3624449e-03
 -5.9933902e-04 -5.6956387e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384959e-03  9.2734173e-03
  7.8980681e-03 -6.9895051e-03 -9.1558648e-03 -3.5575390e-04
 -3.0998420e-03  7.8943158e-03  5.9385728e-03 -1.5456629e-03
  1.5109634e-03  1.7900396e-03  7.8175711e-03 -9.5101884e-03
 -2.0553112e-04  3.4691954e-03 -9.3897345e-04  8.3817719e-03
  9.0107825e-03  6.5365052e-03 -7.1162224e-04  7.7

In [8]:
#Most similar words
model_skipgram.wv.most_similar('meat')

[('food', 0.13887986540794373),
 ('bites', 0.1314900517463684),
 ('eats', 0.06406084448099136),
 ('dog', 0.009391188621520996),
 ('man', -0.059876274317502975)]

In [9]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus
The corpus download page: [enwiki dump progress on 20221101](https://dumps.wikimedia.org/enwiki/20221101/)

The entire wiki corpus as of 20/10/2022 is just over 19.9GB in size. Because of compute constraints, we train our Word2Vec and FastText embeddings on only one part of this corpus.

The file is 234MB, so it can take a while to download.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [10]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2


In [11]:
#from gensim.corpora.wikicorpus import WikiCorpus

from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import WikiCorpus, MmCorpus

from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

### `corpora.wikicorpus` [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html)

In [12]:
%%time
fname = datapath('/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2')


#Preparing the Training data
wiki = WikiCorpus(fname, dictionary={})  # dictionary={} skips building the word->word_id mapping (~8h on full wiki), so this call returns immediately

CPU times: user 74 µs, sys: 11 µs, total: 85 µs
Wall time: 89.2 µs


In [13]:
%%time
print(wiki.fname)
print(wiki.token_max_len)
print(wiki.token_min_len)

/kaggle/input/wikicorpus/enwiki-20221020-pages-articles-multistream19.xml-p30121851p31308442.bz2
15
2
CPU times: user 94 µs, sys: 13 µs, total: 107 µs
Wall time: 104 µs


In [14]:
%%time
start = time.time()

sentences = list(wiki.get_texts())

end = time.time()
print('Time consumed :',(end-start)/3600.0)

Time consumed : 0.16808209935824076
CPU times: user 39 s, sys: 9.09 s, total: 48.1 s
Wall time: 10min 5s
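
Calling `list(...)` keeps every tokenized article in memory. Word2Vec also accepts any iterable that can be iterated more than once, so a larger corpus could be streamed from disk instead. The sketch below shows the idea with a hypothetical `LineSentences` class that reads one whitespace-tokenized sentence per line; the class name and the file layout are assumptions for illustration (Gensim ships a similar `LineSentence` helper in `gensim.models.word2vec`).

```python
import os
import tempfile

class LineSentences:
    """Restartable iterable: each pass re-opens the file and yields token lists."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                tokens = line.split()
                if tokens:              # ignore blank lines
                    yield tokens

# Small demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'corpus.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('dog bites man\nman bites dog\n')

    stream = LineSentences(path)
    first_pass = list(stream)   # one full pass over the file
    second_pass = list(stream)  # a second pass works because __iter__ restarts

print(first_pass)
```

A plain generator would be exhausted after the first pass, which is why the sketch uses a class with `__iter__`.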


In [15]:
print(len(sentences))
print(len(sentences[0]))
sentences[-1][:10]

77542
973


['the',
 'japanese',
 'motorcycle',
 'grand',
 'prix',
 'was',
 'the',
 'fifteenth',
 'round',
 'of']

## Hyperparameters
1. sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2. min_count - Ignores all words with a total frequency lower than this.

Many more hyperparameters are listed in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
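
To show what `min_count` does, the sketch below counts token frequencies in the toy corpus and keeps only the words that reach the threshold. The helper `build_vocab` is a hypothetical stand-in for the vocabulary step, not Gensim's implementation.

```python
from collections import Counter

# Toy corpus from the first cell
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
          ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

def build_vocab(sentences, min_count=1):
    """Return (kept_words, counts), dropping words rarer than min_count."""
    counts = Counter(token for tokens in sentences for token in tokens)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return kept, counts

vocab_all, counts = build_vocab(corpus, min_count=1)
vocab_pruned, _ = build_vocab(corpus, min_count=2)

print(dict(counts))
print('min_count=1 keeps', len(vocab_all), 'words')
print('min_count=2 keeps', len(vocab_pruned), 'words:', vocab_pruned)
```

With `min_count=2`, the words that occur only once (`meat` and `food`) are dropped, which is why the toy models above were trained with `min_count=1`.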

### CBOW

In [16]:
%%time
#CBOW
start = time.time()
Word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.07 hrs 
CPU times: user 11min 20s, sys: 2.4 s, total: 11min 23s
Wall time: 4min 23s


In [17]:
%%time

#Summarize the loaded model
print(Word2vec_cbow)
print("-"*30)

#Summarize vocabulary
words = list(Word2vec_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(Word2vec_cbow.wv['film'])}")
print(Word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",Word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",Word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[ 4.73307    -2.600825   -0.0720249   0.93540573  0.23687458 -0.8208761
  4.7425876  -1.6461378  -3.4990501  -0.8347007  -1.0722163  -2.8576753
 -1.8255085  -0.8302489   0.61604726  2.7242968  -1.504793   -2.6115117
 -1.1911737  -0.7272145  -2.3321075   1.1135737   1.5169348  -2.1729434
 -2.1434288   1.3702474   0.36419854 -1.016192   -0.8031033   2.228342
 -1.4918561  -3.8824546   2.5115235  -2.333802    2.8952913   0.9667327
 -0.509641   -1.5418124  -2.280423   -2.324807    1.2353895   0.95467395
 -1.1565114   0.12283167 -1.535814    1.0802101  -1.5903248   0.9770953
  5.1761537  -2.78755

#### Save the model

In [18]:
# save the word vectors in word2vec binary format
Word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

### skipGram

In [19]:
#SkipGram
start = time.time()
Word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.22 hrs 


In [20]:
#Summarize the loaded model
print(Word2vec_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(Word2vec_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(Word2vec_skipgram.wv['film'])}")
print(Word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",Word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",Word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[ 0.29663208 -0.25254762 -0.22182879  0.22640647  0.45253375 -0.8291155
  0.5757798   0.3504657  -0.7144307   0.33901218  0.24383788 -0.5427319
  0.50206876  0.5776825   0.43836883  0.10569337  0.15244819  0.08242504
  0.0724417  -0.40258744 -0.00862465  0.05370342  0.01245706 -0.05606331
 -0.23122866  0.08549467  0.3597803   0.33796063 -0.60475636 -0.21971364
 -0.06565281 -0.12912393  0.0014826   0.05125794  0.82695746 -0.25070658
  0.33139658 -0.08331545 -0.10246328 -0.56432843 -0.7249759   0.0236499
 -0.01887291  0.07544398 -0.03165641 -0.21970482  0.42034775  0.14738277
  0.8683433  -0.

### Save and load Word2vec_skipgram model

In [21]:
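# save the wiki-trained vectors in word2vec binary format
Word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load model
# Note: this reloads the toy SkipGram model saved earlier as 'model_skipgram.bin' (hence vocab=6 below),
# not the wiki vectors; those can be read back with KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True).
word2vec_skipgram = Word2Vec.load('model_skipgram.bin')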
print(word2vec_skipgram)

Word2Vec(vocab=6, vector_size=100, alpha=0.025)


## FastText
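
FastText differs from Word2Vec in that it also learns vectors for character n-grams, so a word's vector is built from its subwords. The sketch below shows how such n-grams can be extracted from a word wrapped in `<` and `>` boundary markers. The helper `char_ngrams` is hypothetical and omits details of the real implementation, such as hashing n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f'<{word}>'                       # mark word boundaries
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

film_grams = char_ngrams('film', min_n=3, max_n=4)
print(film_grams)
# ['<fi', 'fil', 'ilm', 'lm>', '<fil', 'film', 'ilm>']
```

Because related words such as `film` and `films` share most of their n-grams, they end up with related vectors.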

### CBOW

In [22]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.24 hrs 


In [23]:
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[-4.4085     -0.2758395   3.831173   -0.56286997 -1.8058575  -4.9290104
  2.697067   -0.19324198  2.3154976  -3.823425    3.1586633  -2.2600644
  3.9083784  -1.3863469  -2.9755433   2.646758   -0.6507824  -1.181202
 -1.2064092   4.0462046   0.9284435   7.340335   -0.9425789   2.7071877
 -1.5736026   2.3323107   5.957654    3.9484968  -1.9216007   1.8548813
  2.312521   -5.524084   -2.1530185  -5.1539245  -1.154922    0.6869037
  4.005543   -1.9270756  -0.5720548   2.3177302   2.0693724  -2.866289
 -0.57111967 -4.7139707   2.2145097  -2.250022    1.0474919  -1.7315602
  4.144138   -1.3406186

In [24]:
# save the full-word vectors in word2vec binary format (this format does not store FastText's subword n-grams)
from gensim.models import Word2Vec, KeyedVectors
fasttext_cbow.wv.save_word2vec_format('fasttext_cbow.bin', binary=True)

### skipGram

In [25]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.41 hrs 


In [26]:
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111673, vector_size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111673
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'for', 'is', 'as', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'were', 'also', 'which', 'are', 'first', 'this', 'be', 'new', 'had', 'has', 'or']
------------------------------
Length of vector: 100
[ 0.09975784  0.3225717   0.20232396 -0.5837296   0.10650757 -0.2609484
 -0.01321004  0.48489895  0.38979766 -0.30038863 -0.25607148  0.1537441
 -0.610517    0.6641616  -0.56465584 -0.26840204  0.26675722 -0.2473016
 -0.01118858 -0.06004076 -0.2400133   0.5134029  -0.40039787  0.42534608
  0.16175415 -0.17682227 -0.26693296  0.3300141  -0.13781406 -0.72802293
  0.366928   -0.2063148   0.32246718 -0.30360916 -0.1968363  -0.49228448
  0.07265243  0.71337456  0.4101354   0.18095018 -0.28502285  0.37981156
 -0.01572184 -0.28634146 -0.05004412 -0.40868586  0.02008785 -0.14703867
  0.5373546  -0.

In [27]:
# save the full-word vectors in word2vec binary format (this format does not store FastText's subword n-grams)
from gensim.models import Word2Vec, KeyedVectors
fasttext_skipgram.wv.save_word2vec_format('fasttext_skipgram.bin', binary=True)

#### CBOW trains faster than SkipGram in both cases: 0.07 vs 0.22 hrs for Word2Vec and 0.24 vs 0.41 hrs for FastText.