# Wolaytta Word Embedding with Gensim

## Word embeddings 

A word embedding is an approach to provide a dense vector representation of words that capture something about their meaning. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space. It is modern approach for representing text in natural language processing. Embedding algorithms like Word2Vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.

Wolaytta is poorly resourced and highly inflected language from Afro-asiatic language family. This is the first attempt to share a Wolaytta small text corpus, Word Embedding, and wolaytta wordsim100. The Wolaytta-word-embedding is a pre-trained distributed word representation, wordsim100 - provides human annotated scores of relatedness between term pairs collected form potential users which was used to evaluate word embedding model.

In [1]:
## Gensim is an open source Python library for natural language processing
import gensim, logging
import os
## from gensim.models import FastText
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.fasttext import FastText
from gensim.models import Word2Vec
import re

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/xgebt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
!pip3 install sentencepiece



In [4]:
import os, re
from nltk.tokenize import sent_tokenize
RAW_DATA_DIR = './data'
CLEAN_DATA_DIR = './clean'

In [5]:
import os, re
from nltk.tokenize import sent_tokenize
file_count = 1
saved_files = os.listdir(CLEAN_DATA_DIR)
if len(saved_files)>0:
    file_count = len(saved_files)
else:        
    for fname in os.listdir(RAW_DATA_DIR):
        try:
            if fname.endswith(".txt"):
                txt_file = open(os.path.join(RAW_DATA_DIR, fname),encoding='utf8').readlines()
                sen = []
                for s in txt_file:
                    sent = sent_tokenize(s)
                    for st in sent:
                        st = st.lower()
                        b = re.sub(r'\w+7\w+', '\'', st)
                        a = ('\n'+ b.replace('”', '\'\'').replace('’', '\'').replace('7', '\'').replace('_', '').replace('-', ''))
                        c = re.sub(r'\d+', 'NUM', a)
                        words = re.findall(r'\w+\'*\w+', c)
                        ss = ' '.join(x for x in words if x)
                        if ss not in sen:
                            sen.append(ss.encode("ascii", "ignore").decode())
                new_fname = f"{file_count}.txt"
                clean_file = open(os.path.join(CLEAN_DATA_DIR, new_fname),'w+',encoding='utf8')
                sen_per_line_test = ''
                for s in sen:
                    sen_per_line_test = sen_per_line_test +'\n'+s
                clean_file.write(sen_per_line_test)

                file_count+=1
        except:
            continue
f"{file_count} files created to {CLEAN_DATA_DIR}"

'2880 files created to ./clean'

In [6]:
if not os.path.exists('./subword-model/all_in_one.txt'):
    for fname in os.listdir(CLEAN_DATA_DIR):
        txt_file = open(os.path.join(CLEAN_DATA_DIR, fname),encoding='utf8').readlines()
        clean_file = open(os.path.join('./subword-model', 'all_in_one.txt'),'a+',encoding='utf8')
        for s in txt_file:
            clean_file.write(s)

In [None]:
#SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based 
#text generation systems where the vocabulary size is predetermined prior to the neural model training. 
#SentencePiece implements subword units BPE and unigram language model.
#It treats the sentences just as sequences of Unicode characters. 
import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=' + os.path.join('./subword-model', 'all_in_one.txt') +
            ' --model_prefix=wol_sp --vocab_size=16000 --hard_vocab_limit=false')

#It is a unigram_model_trainer.cc(138) LOG(INFO) Making suffix array...
#unigram_model_trainer extracts frequent sub strings...
# Then save file at /Wolaytta_Word_Embedding.ipynb

In [7]:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("./subword-model/wol_sp.model")
sp.EncodeAsPieces("higgiyyappe")

['▁hi', 'ggi', 'yya', 'ppe']

In [8]:
#Open Wolaytta sentence list
file = open("./subword-model/all_in_one.txt", "r", encoding="utf8")
number_of_lines = 0
number_of_words = 0
number_of_characters = 0
for line in file:
    line = line.strip("\n")
    words = line.split()
    number_of_lines += 1
    number_of_words += len(words)
    number_of_characters += len(line)
file.close()

print("lines:", number_of_lines, "words:", number_of_words, "characters:", number_of_characters)

lines: 863006 words: 6310966 characters: 50965677


In [49]:
import logging
import os
from gensim.models.fasttext import FastText

EMBEDDING_DIR='./emb_models'
PREPROCESSED_DIR='./clean'
class WEConfig(object):
    """Training parameters"""
    window=5 #Maximum skip length window between words
    emb_dim=100 # Set size of word vectors
    emb_lr=0.05 #learning rate for SGD estimation.
    nepoach=20 #number of training epochs
    nthread=20 #number of training threads
    sample = 0 #Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled
    negative = 15 #negative sampling is used with defined negative example
    hs = 0 #0 Use Hierarchical Softmax; default is 0 (not used)
    binary=0 # 0 means not saved as .bin. Change to 1 if allowed to binary format
    sg=1 # 0 means CBOW model is used. Change to 1 to use Skip-gram model
    iterate=10 # Run more training iterations
    minFreq=2 #This will discard words that appear less than minFreq times 
    WORD_VECTOR_CACHE=EMBEDDING_DIR+'wol_word_vectors_sts.npy'
    if sg==0:
      model_name='wol_fasttext_cbow_'+str(emb_dim)+'D'
    elif sg==1:
      model_name='wol_fasttext_sg_'+str(emb_dim)+'D'
    
class corpus_sentences(object):# accept sentence stored one per line in list of files inside defined directory
    def __init__(self, dirname, sub_word = True):
        self.dirname = dirname
        self.sub_word = sub_word
        
        if self.sub_word:
            self.sp=spm.SentencePieceProcessor()
            self.sp_model=self.sp.Load("./subword-model/wol_sp.model")
    
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname),encoding='utf8'):
                 if self.sub_word:
                    yield self.sp.EncodeAsPieces(line)
                 else:
                    yield line.split()

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)


In [50]:
# wol_model=load_wol_word_vectors(sub_word = True)

In [51]:
def train_w2v_model(sub_word = True):
    print('Loading Sentences with memory freindly iterator ...\n')
    sentences = corpus_sentences(PREPROCESSED_DIR, sub_word) # a memory-friendly iterator 
    if WEConfig.sg==0:
        model_type='CBOW'
    else:
        model_type='Skip-gram'
    if sub_word:
        model_name=f'subword_wol_W2V_{model_type}_{WEConfig.emb_dim}D'
    else:
        model_name=f'wol_W2V_{model_type}_{WEConfig.emb_dim}D'
    print('Training Sentence Piece Word2Vec '+model_type+' with '+str(WEConfig.emb_dim)+' dimension\n') 
    _model = Word2Vec(sentences, vector_size=WEConfig.emb_dim, window=WEConfig.window, 
                            min_count=WEConfig.minFreq, workers=WEConfig.nthread,sg=WEConfig.sg,
                            epochs=WEConfig.iterate,negative=WEConfig.negative,
                            hs=WEConfig.hs,sorted_vocab=1)  
    _model.build_vocab(sentences)
    #trim unneeded model memory = use (much) less RAM
    _model.init_sims(replace=True)
    
    #Saving model   
    model_path=os.path.join(EMBEDDING_DIR,model_name)
    _model.save(model_path)

    return _model

In [45]:
def train_FastText_model(sub_word = True):
    print('Loading Sentences with memory freindly iterator ...\n')
    sentences = corpus_sentences(PREPROCESSED_DIR, sub_word) # a memory-friendly iterator 
    if WEConfig.sg==0:
        model_type='CBOW'
    else:
        model_type='Skip-gram'
    if sub_word:
        model_name=f'subword_wol_FastText_{model_type}_{WEConfig.emb_dim}D'
    else:
        model_name=f'wol_FastText_{model_type}_{WEConfig.emb_dim}D'
    print('Training Sentence Piece Word2Vec '+model_type+' with '+str(WEConfig.emb_dim)+' dimension\n') 
    _model = FastText(sentences, vector_size=WEConfig.emb_dim, window=WEConfig.window, 
                            min_count=WEConfig.minFreq, workers=WEConfig.nthread,sg=WEConfig.sg,
                            epochs=WEConfig.iterate,negative=WEConfig.negative,
                            hs=WEConfig.hs,sorted_vocab=1)
    _model.build_vocab(sentences)

    #trim unneeded model memory = use (much) less RAM
    _model.init_sims(replace=True)
    
    #Saving model   
    model_path=os.path.join(EMBEDDING_DIR,model_name)
    _model.save(model_path)

    return _model

In [13]:
train_FastText_model() #Fasttext sg=0 CBOW model with 200 dim

2021-12-01 01:58:41,666 : INFO : collecting all words and their counts
2021-12-01 01:58:41,671 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 01:58:41,715 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 01:58:41,739 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 01:58:41,755 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 01:58:41,768 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 01:58:41,781 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 01:58:41,794 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 01:58:41,807 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 01:58:41,819 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 200 dimension



2021-12-01 01:58:41,891 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 01:58:41,929 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 01:58:41,950 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 01:58:41,960 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 01:58:41,976 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 01:58:42,012 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 01:58:42,048 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 01:58:42,082 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 01:58:42,117 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 01:58:44,021 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 01:58:44,058 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 01:58:44,096 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 01:58:44,128 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 01:58:44,151 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 01:58:44,152 : INFO : Creating a fresh vocabulary
2021-12-01 01:58:44,716 : DEBUG : starting a new internal lifecycle event log for FastText
2021-12-01 01:58:44,717 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T01:58:44.687081', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 01:59:46,869 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 01:59:46,880 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 01:59:46,880 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 01:59:46,882 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 01:59:46,883 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 01:59:46,894 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 01:59:46,894 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 01:59:46,928 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 01:59:46,928 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 01:59:46,931 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 01:59:46,931 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 01:59:46,947 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 01:59:46,948 : INFO : worker 

2021-12-01 02:00:31,502 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 02:00:31,513 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:00:31,514 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 02:00:31,530 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:00:31,530 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 02:00:31,542 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 02:00:31,542 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 02:00:31,573 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:00:31,573 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 02:00:31,616 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:00:31,616 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:00:31,629 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:00:31,630 : INFO : worker t

2021-12-01 02:01:16,065 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:01:16,065 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 02:01:16,085 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:01:16,085 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 02:01:16,091 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:01:16,091 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 02:01:16,094 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:01:16,094 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:01:16,146 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:01:16,147 : INFO : EPOCH 3 - PROGRESS: at 99.04% examples, 133244 words/s, in_qsize 8, out_qsize 1
2021-12-01 02:01:16,148 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:01:16,150 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:01:16,1

2021-12-01 02:02:00,603 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:02:00,604 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 02:02:00,604 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:00,605 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:00,606 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 02:02:00,609 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:02:00,649 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:02:00,649 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:02:00,651 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:00,651 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 02:02:00,655 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:00,655 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:02:00,704 : DEBUG : worker exi

2021-12-01 02:02:45,046 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:45,048 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 02:02:45,114 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:02:45,115 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:02:45,134 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:02:45,134 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:02:45,183 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:02:45,183 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 02:02:45,200 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:02:45,200 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:02:45,232 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:02:45,232 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:02:45,279 : DEBUG : worker exit

2021-12-01 02:03:29,564 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:03:29,565 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:03:29,606 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:03:29,606 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:03:29,611 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:03:29,612 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 02:03:29,678 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:03:29,678 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:03:29,745 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:03:29,745 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:03:29,765 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:03:29,765 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:03:29,791 : DEBUG : worker exiti

2021-12-01 02:04:14,037 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:04:14,037 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:04:14,090 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:04:14,091 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 02:04:14,095 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 02:04:14,096 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:04:14,096 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:04:14,097 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:04:14,131 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:04:14,131 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:04:14,178 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:04:14,178 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:04:14,210 : DEBUG : worker exiti

2021-12-01 02:04:58,488 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:04:58,488 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:04:58,527 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:04:58,527 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:04:58,565 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:04:58,565 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:04:58,637 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:04:58,637 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:04:58,680 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:04:58,680 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:04:58,702 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:04:58,702 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:04:58,749 : DEBUG : worker exiti

2021-12-01 02:05:43,072 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:05:43,072 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:05:43,126 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:05:43,126 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:05:43,179 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:05:43,179 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:05:43,247 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:05:43,247 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:05:43,250 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:05:43,250 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:05:43,253 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:05:43,253 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:05:43,255 : INFO : EPOCH - 9 : t

2021-12-01 02:06:27,553 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:06:27,658 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:06:27,658 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:06:27,731 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:06:27,731 : INFO : EPOCH 10 - PROGRESS: at 99.80% examples, 134013 words/s, in_qsize 2, out_qsize 1
2021-12-01 02:06:27,733 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:06:27,735 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:06:27,736 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:06:27,742 : DEBUG : worker exiting, processed 35 jobs
2021-12-01 02:06:27,743 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:06:27,745 : INFO : EPOCH - 10 : training on 6320648 raw words (5978449 effective words) took 44.5s, 134399 effective words/s
2021-12-01 02:06:27,

2021-12-01 02:06:38,797 : INFO : PROGRESS: at sentence #570000, processed 3859930 words, keeping 157746 word types
2021-12-01 02:06:38,829 : INFO : PROGRESS: at sentence #580000, processed 3950524 words, keeping 157746 word types
2021-12-01 02:06:38,867 : INFO : PROGRESS: at sentence #590000, processed 4037925 words, keeping 158361 word types
2021-12-01 02:06:38,909 : INFO : PROGRESS: at sentence #600000, processed 4144770 words, keeping 160540 word types
2021-12-01 02:06:38,944 : INFO : PROGRESS: at sentence #610000, processed 4232026 words, keeping 160927 word types
2021-12-01 02:06:38,985 : INFO : PROGRESS: at sentence #620000, processed 4333626 words, keeping 162542 word types
2021-12-01 02:06:39,016 : INFO : PROGRESS: at sentence #630000, processed 4407095 words, keeping 163281 word types
2021-12-01 02:06:39,045 : INFO : PROGRESS: at sentence #640000, processed 4479264 words, keeping 166015 word types
2021-12-01 02:06:39,076 : INFO : PROGRESS: at sentence #650000, processed 456747

<gensim.models.fasttext.FastText at 0x7fd3cf1fcac0>

In [14]:
train_w2v_model() #w2v sg=0 CBOW model with 200 dim

2021-12-01 02:06:57,203 : INFO : collecting all words and their counts
2021-12-01 02:06:57,207 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 02:06:57,247 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 02:06:57,269 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 02:06:57,281 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 02:06:57,293 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 02:06:57,305 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 02:06:57,317 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 02:06:57,329 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 02:06:57,341 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 200 dimension



2021-12-01 02:06:57,411 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 02:06:57,448 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 02:06:57,469 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 02:06:57,480 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 02:06:57,496 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 02:06:57,531 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 02:06:57,566 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 02:06:57,599 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 02:06:57,635 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 02:06:59,528 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 02:06:59,566 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 02:06:59,603 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 02:06:59,636 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 02:06:59,658 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 02:06:59,659 : INFO : Creating a fresh vocabulary
2021-12-01 02:07:00,199 : DEBUG : starting a new internal lifecycle event log for Word2Vec
2021-12-01 02:07:00,200 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T02:07:00.199798', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 02:07:20,531 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:20,531 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:20,531 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:20,532 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 02:07:20,540 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 02:07:20,541 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 02:07:20,550 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:20,550 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 02:07:20,554 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:07:20,555 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 02:07:20,557 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:20,557 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 02:07:20,565 : DEBUG : worker

2021-12-01 02:07:36,608 : INFO : EPOCH 4 - PROGRESS: at 84.24% examples, 669883 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:07:37,637 : INFO : EPOCH 4 - PROGRESS: at 94.96% examples, 680598 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:07:37,958 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:07:37,962 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:37,962 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:07:37,962 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:37,962 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:37,962 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:37,963 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 02:07:37,963 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:37,971 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 02:07:37,973 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 02:07:37,974

2021-12-01 02:07:50,997 : INFO : EPOCH 6 - PROGRESS: at 50.42% examples, 661070 words/s, in_qsize 0, out_qsize 9
2021-12-01 02:07:52,042 : INFO : EPOCH 6 - PROGRESS: at 62.49% examples, 672058 words/s, in_qsize 0, out_qsize 4
2021-12-01 02:07:53,043 : INFO : EPOCH 6 - PROGRESS: at 74.42% examples, 680093 words/s, in_qsize 0, out_qsize 2
2021-12-01 02:07:54,064 : INFO : EPOCH 6 - PROGRESS: at 88.13% examples, 685316 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:07:55,080 : INFO : EPOCH 6 - PROGRESS: at 97.18% examples, 696680 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:07:55,224 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:07:55,228 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:07:55,228 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:55,228 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:55,228 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:07:55,228 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:07:55,228 : IN

2021-12-01 02:08:04,961 : INFO : EPOCH 8 - PROGRESS: at 17.43% examples, 500466 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:05,977 : INFO : EPOCH 8 - PROGRESS: at 26.65% examples, 595331 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:07,000 : INFO : EPOCH 8 - PROGRESS: at 36.97% examples, 663008 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:08,010 : INFO : EPOCH 8 - PROGRESS: at 49.38% examples, 666769 words/s, in_qsize 0, out_qsize 1
2021-12-01 02:08:09,032 : INFO : EPOCH 8 - PROGRESS: at 60.71% examples, 668742 words/s, in_qsize 0, out_qsize 1
2021-12-01 02:08:10,051 : INFO : EPOCH 8 - PROGRESS: at 72.76% examples, 680390 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:11,073 : INFO : EPOCH 8 - PROGRESS: at 86.51% examples, 681148 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:12,080 : INFO : EPOCH 8 - PROGRESS: at 95.06% examples, 688766 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:12,403 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:08:12,409 : DEB

2021-12-01 02:08:21,030 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:08:21,039 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:08:21,039 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:08:21,040 : INFO : EPOCH - 9 : training on 6320648 raw words (5978991 effective words) took 8.5s, 702342 effective words/s
2021-12-01 02:08:22,051 : INFO : EPOCH 10 - PROGRESS: at 17.10% examples, 483333 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:23,059 : INFO : EPOCH 10 - PROGRESS: at 26.46% examples, 589821 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:24,063 : INFO : EPOCH 10 - PROGRESS: at 36.33% examples, 660639 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:25,083 : INFO : EPOCH 10 - PROGRESS: at 48.73% examples, 660651 words/s, in_qsize 0, out_qsize 0
2021-12-01 02:08:26,109 : INFO : EPOCH 10 - PROGRESS: at 61.43% examples, 678662 words/s, in_qsize 1, out_qsize 0
2021-12-01 02:08:27,111 : INFO : EPOCH 10 -

2021-12-01 02:08:30,311 : INFO : PROGRESS: at sentence #250000, processed 1556607 words, keeping 142632 word types
2021-12-01 02:08:30,350 : INFO : PROGRESS: at sentence #260000, processed 1673284 words, keeping 145137 word types
2021-12-01 02:08:30,394 : INFO : PROGRESS: at sentence #270000, processed 1808369 words, keeping 146165 word types
2021-12-01 02:08:30,437 : INFO : PROGRESS: at sentence #280000, processed 1942649 words, keeping 146739 word types
2021-12-01 02:08:30,467 : INFO : PROGRESS: at sentence #290000, processed 2019712 words, keeping 147220 word types
2021-12-01 02:08:30,487 : INFO : PROGRESS: at sentence #300000, processed 2071405 words, keeping 147230 word types
2021-12-01 02:08:30,508 : INFO : PROGRESS: at sentence #310000, processed 2123470 words, keeping 147237 word types
2021-12-01 02:08:30,529 : INFO : PROGRESS: at sentence #320000, processed 2178214 words, keeping 147261 word types
2021-12-01 02:08:30,550 : INFO : PROGRESS: at sentence #330000, processed 223136

2021-12-01 02:08:33,540 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 5979015.951129102 word corpus (95.1%% of prior 6287015)', 'datetime': '2021-12-01T02:08:33.540014', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
2021-12-01 02:08:35,090 : INFO : estimated required memory for 140478 words and 200 dimensions: 295003800 bytes
2021-12-01 02:08:35,090 : INFO : resetting layer weights
2021-12-01 02:08:35,098 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-12-01T02:08:35.098310', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'build_vocab'}
  _model.init_sims(replace=True)
2021-12-01 02:08:35,160 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/wol_W2V_CBOW_200D'

<gensim.models.word2vec.Word2Vec at 0x7fd3cf2164f0>

In [22]:
train_FastText_model() #FastText sg=1 Skip-gram model with 200 dim  

2021-12-01 02:11:12,528 : INFO : collecting all words and their counts
2021-12-01 02:11:12,534 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 02:11:12,577 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 02:11:12,600 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 02:11:12,615 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 02:11:12,628 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 02:11:12,640 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 02:11:12,652 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 02:11:12,664 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 02:11:12,676 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 200 dimension



2021-12-01 02:11:12,745 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 02:11:12,782 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 02:11:12,804 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 02:11:12,814 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 02:11:12,830 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 02:11:12,866 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 02:11:12,901 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 02:11:12,935 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 02:11:12,971 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 02:11:14,877 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 02:11:14,915 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 02:11:14,952 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 02:11:14,986 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 02:11:15,008 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 02:11:15,009 : INFO : Creating a fresh vocabulary
2021-12-01 02:11:15,543 : DEBUG : starting a new internal lifecycle event log for FastText
2021-12-01 02:11:15,544 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T02:11:15.543836', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 02:12:27,803 : INFO : EPOCH 1 - PROGRESS: at 88.78% examples, 92372 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:29,045 : INFO : EPOCH 1 - PROGRESS: at 90.00% examples, 92366 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:30,195 : INFO : EPOCH 1 - PROGRESS: at 90.89% examples, 92018 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:31,257 : INFO : EPOCH 1 - PROGRESS: at 92.07% examples, 92318 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:32,354 : INFO : EPOCH 1 - PROGRESS: at 92.85% examples, 91902 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:33,441 : INFO : EPOCH 1 - PROGRESS: at 94.70% examples, 92619 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:12:34,323 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:12:34,697 : INFO : EPOCH 1 - PROGRESS: at 95.40% examples, 91657 words/s, in_qsize 38, out_qsize 0
2021-12-01 02:12:36,180 : INFO : EPOCH 1 - PROGRESS: at 97.23% examples, 91883 words/s, in_qsize 22, out_qsize 0
2021-12-01 02:12:36,758 : DEB

2021-12-01 02:13:15,846 : INFO : EPOCH 2 - PROGRESS: at 65.36% examples, 93447 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:16,954 : INFO : EPOCH 2 - PROGRESS: at 66.86% examples, 93995 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:17,959 : INFO : EPOCH 2 - PROGRESS: at 67.94% examples, 93541 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:19,054 : INFO : EPOCH 2 - PROGRESS: at 69.04% examples, 93309 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:20,077 : INFO : EPOCH 2 - PROGRESS: at 70.15% examples, 93087 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:21,122 : INFO : EPOCH 2 - PROGRESS: at 71.59% examples, 93229 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:22,222 : INFO : EPOCH 2 - PROGRESS: at 73.42% examples, 93468 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:23,243 : INFO : EPOCH 2 - PROGRESS: at 76.71% examples, 93714 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:13:24,499 : INFO : EPOCH 2 - PROGRESS: at 78.37% examples, 93445 words/s, in_qsize

2021-12-01 02:14:03,607 : INFO : EPOCH 3 - PROGRESS: at 39.42% examples, 94694 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:04,782 : INFO : EPOCH 3 - PROGRESS: at 42.76% examples, 96021 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:05,856 : INFO : EPOCH 3 - PROGRESS: at 43.86% examples, 93752 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:07,020 : INFO : EPOCH 3 - PROGRESS: at 46.06% examples, 95029 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:08,072 : INFO : EPOCH 3 - PROGRESS: at 47.06% examples, 93772 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:09,246 : INFO : EPOCH 3 - PROGRESS: at 48.54% examples, 94235 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:10,345 : INFO : EPOCH 3 - PROGRESS: at 49.87% examples, 93661 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:11,482 : INFO : EPOCH 3 - PROGRESS: at 51.19% examples, 93653 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:12,511 : INFO : EPOCH 3 - PROGRESS: at 53.79% examples, 94541 words/s, in_qsize

2021-12-01 02:14:52,531 : INFO : EPOCH 4 - PROGRESS: at 21.24% examples, 90295 words/s, in_qsize 38, out_qsize 1
2021-12-01 02:14:53,775 : INFO : EPOCH 4 - PROGRESS: at 23.01% examples, 92437 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:54,825 : INFO : EPOCH 4 - PROGRESS: at 24.51% examples, 93141 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:55,847 : INFO : EPOCH 4 - PROGRESS: at 25.47% examples, 92402 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:56,866 : INFO : EPOCH 4 - PROGRESS: at 26.36% examples, 91775 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:58,197 : INFO : EPOCH 4 - PROGRESS: at 27.43% examples, 90580 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:14:59,201 : INFO : EPOCH 4 - PROGRESS: at 28.80% examples, 93427 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:15:00,305 : INFO : EPOCH 4 - PROGRESS: at 29.39% examples, 90607 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:15:01,317 : INFO : EPOCH 4 - PROGRESS: at 30.51% examples, 91988 words/s, in_qsize

2021-12-01 02:15:46,291 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:15:46,388 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:15:46,388 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:15:46,415 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:15:46,416 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:15:46,453 : DEBUG : worker exiting, processed 35 jobs
2021-12-01 02:15:46,454 : INFO : EPOCH 4 - PROGRESS: at 100.00% examples, 95821 words/s, in_qsize 0, out_qsize 1
2021-12-01 02:15:46,456 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:15:46,457 : INFO : EPOCH - 4 : training on 6320648 raw words (5978672 effective words) took 62.4s, 95816 effective words/s
2021-12-01 02:15:47,861 : INFO : EPOCH 5 - PROGRESS: at 9.71% examples, 75850 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:15:49,018 : INFO : EPOCH 5 - PROGRESS: at 11.75% example

2021-12-01 02:16:47,743 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:16:47,743 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 02:16:47,780 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 02:16:47,780 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 02:16:47,785 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:16:47,785 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 02:16:47,818 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:16:47,819 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 02:16:47,821 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:16:47,821 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 02:16:47,857 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:16:47,857 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 02:16:47,895 : DEBUG : worker ex

2021-12-01 02:17:48,092 : INFO : EPOCH 6 - PROGRESS: at 97.13% examples, 96379 words/s, in_qsize 23, out_qsize 0
2021-12-01 02:17:48,663 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:17:48,663 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 02:17:48,800 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:17:48,801 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 02:17:48,877 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:17:48,877 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 02:17:48,992 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:17:48,993 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 02:17:49,092 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:17:49,092 : INFO : EPOCH 6 - PROGRESS: at 98.07% examples, 96041 words/s, in_qsize 15, out_qsize 1
2021-12-01 02:17:49,094 : INFO : worker thread finished; awaiting 

2021-12-01 02:18:36,191 : INFO : EPOCH 7 - PROGRESS: at 80.08% examples, 97564 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:37,270 : INFO : EPOCH 7 - PROGRESS: at 81.20% examples, 97155 words/s, in_qsize 40, out_qsize 0
2021-12-01 02:18:38,332 : INFO : EPOCH 7 - PROGRESS: at 85.15% examples, 98015 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:39,692 : INFO : EPOCH 7 - PROGRESS: at 86.51% examples, 97079 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:40,748 : INFO : EPOCH 7 - PROGRESS: at 88.29% examples, 97904 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:41,875 : INFO : EPOCH 7 - PROGRESS: at 88.88% examples, 96700 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:43,125 : INFO : EPOCH 7 - PROGRESS: at 90.49% examples, 97305 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:18:44,141 : INFO : EPOCH 7 - PROGRESS: at 91.18% examples, 96723 words/s, in_qsize 40, out_qsize 0
2021-12-01 02:18:45,162 : INFO : EPOCH 7 - PROGRESS: at 92.56% examples, 97363 words/s, in_qsize

2021-12-01 02:19:23,911 : INFO : EPOCH 8 - PROGRESS: at 58.86% examples, 98477 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:25,022 : INFO : EPOCH 8 - PROGRESS: at 61.38% examples, 98650 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:26,120 : INFO : EPOCH 8 - PROGRESS: at 64.27% examples, 98600 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:27,223 : INFO : EPOCH 8 - PROGRESS: at 65.48% examples, 98507 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:28,270 : INFO : EPOCH 8 - PROGRESS: at 66.67% examples, 98323 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:29,382 : INFO : EPOCH 8 - PROGRESS: at 67.95% examples, 98201 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:30,413 : INFO : EPOCH 8 - PROGRESS: at 69.19% examples, 97988 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:31,420 : INFO : EPOCH 8 - PROGRESS: at 70.37% examples, 97915 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:19:32,557 : INFO : EPOCH 8 - PROGRESS: at 72.06% examples, 97957 words/s, in_qsize

2021-12-01 02:20:11,559 : INFO : EPOCH 9 - PROGRESS: at 35.25% examples, 96606 words/s, in_qsize 38, out_qsize 1
2021-12-01 02:20:12,583 : INFO : EPOCH 9 - PROGRESS: at 38.51% examples, 98700 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:13,626 : INFO : EPOCH 9 - PROGRESS: at 41.23% examples, 99210 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:14,844 : INFO : EPOCH 9 - PROGRESS: at 43.89% examples, 98894 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:15,940 : INFO : EPOCH 9 - PROGRESS: at 45.93% examples, 99879 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:16,967 : INFO : EPOCH 9 - PROGRESS: at 46.73% examples, 98072 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:18,071 : INFO : EPOCH 9 - PROGRESS: at 48.42% examples, 98989 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:19,099 : INFO : EPOCH 9 - PROGRESS: at 49.77% examples, 98445 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:20:20,315 : INFO : EPOCH 9 - PROGRESS: at 51.08% examples, 97987 words/s, in_qsize

2021-12-01 02:20:59,145 : INFO : EPOCH 10 - PROGRESS: at 20.11% examples, 95471 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:00,227 : INFO : EPOCH 10 - PROGRESS: at 21.69% examples, 97848 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:01,342 : INFO : EPOCH 10 - PROGRESS: at 23.01% examples, 95331 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:02,407 : INFO : EPOCH 10 - PROGRESS: at 24.62% examples, 97474 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:03,514 : INFO : EPOCH 10 - PROGRESS: at 25.57% examples, 94799 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:04,565 : INFO : EPOCH 10 - PROGRESS: at 26.95% examples, 97513 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:05,781 : INFO : EPOCH 10 - PROGRESS: at 27.63% examples, 93796 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:07,034 : INFO : EPOCH 10 - PROGRESS: at 29.19% examples, 96144 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:21:08,117 : INFO : EPOCH 10 - PROGRESS: at 29.89% examples, 93873 words/s,

2021-12-01 02:21:52,265 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:21:52,290 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:21:52,290 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:21:52,332 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 02:21:52,332 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:21:52,462 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:21:52,462 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:21:52,485 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:21:52,486 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:21:52,488 : INFO : EPOCH - 10 : training on 6320648 raw words (5978923 effective words) took 60.6s, 98714 effective words/s
2021-12-01 02:21:52,488 : INFO : FastText lifecycle event {'msg': 'training on 63206480 raw words (59789576 effective words) took 619.2

2021-12-01 02:22:03,564 : INFO : PROGRESS: at sentence #580000, processed 3950524 words, keeping 157746 word types
2021-12-01 02:22:03,602 : INFO : PROGRESS: at sentence #590000, processed 4037925 words, keeping 158361 word types
2021-12-01 02:22:03,642 : INFO : PROGRESS: at sentence #600000, processed 4144770 words, keeping 160540 word types
2021-12-01 02:22:03,677 : INFO : PROGRESS: at sentence #610000, processed 4232026 words, keeping 160927 word types
2021-12-01 02:22:03,718 : INFO : PROGRESS: at sentence #620000, processed 4333626 words, keeping 162542 word types
2021-12-01 02:22:03,748 : INFO : PROGRESS: at sentence #630000, processed 4407095 words, keeping 163281 word types
2021-12-01 02:22:03,777 : INFO : PROGRESS: at sentence #640000, processed 4479264 words, keeping 166015 word types
2021-12-01 02:22:03,808 : INFO : PROGRESS: at sentence #650000, processed 4567479 words, keeping 166015 word types
2021-12-01 02:22:03,829 : INFO : PROGRESS: at sentence #660000, processed 462021

<gensim.models.fasttext.FastText at 0x7fd3671de940>

In [23]:
train_w2v_model() #w2v sg=1 CBOW model with 200 dim

2021-12-01 02:22:21,484 : INFO : collecting all words and their counts
2021-12-01 02:22:21,487 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 02:22:21,526 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 02:22:21,547 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 02:22:21,561 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 02:22:21,573 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 02:22:21,585 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 02:22:21,598 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 02:22:21,610 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 02:22:21,623 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 200 dimension



2021-12-01 02:22:21,693 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 02:22:21,731 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 02:22:21,753 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 02:22:21,763 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 02:22:21,779 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 02:22:21,815 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 02:22:21,850 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 02:22:21,884 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 02:22:21,920 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 02:22:23,830 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 02:22:23,867 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 02:22:23,905 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 02:22:23,938 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 02:22:23,961 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 02:22:23,961 : INFO : Creating a fresh vocabulary
2021-12-01 02:22:24,507 : DEBUG : starting a new internal lifecycle event log for Word2Vec
2021-12-01 02:22:24,508 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T02:22:24.507154', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 02:22:55,161 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 02:22:55,174 : DEBUG : worker exiting, processed 35 jobs
2021-12-01 02:22:55,174 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 02:22:55,189 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:22:55,189 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:22:55,191 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:22:55,192 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:22:55,216 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:22:55,216 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:22:55,230 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:22:55,230 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:22:55,286 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:22:55,287 : INFO : worker thread

2021-12-01 02:23:31,535 : INFO : EPOCH 3 - PROGRESS: at 42.78% examples, 236730 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:32,667 : INFO : EPOCH 3 - PROGRESS: at 46.49% examples, 232959 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:33,680 : INFO : EPOCH 3 - PROGRESS: at 49.60% examples, 231467 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:34,680 : INFO : EPOCH 3 - PROGRESS: at 55.46% examples, 235240 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:35,762 : INFO : EPOCH 3 - PROGRESS: at 58.33% examples, 233474 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:36,775 : INFO : EPOCH 3 - PROGRESS: at 64.14% examples, 237636 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:37,781 : INFO : EPOCH 3 - PROGRESS: at 67.18% examples, 235931 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:38,784 : INFO : EPOCH 3 - PROGRESS: at 70.01% examples, 234720 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:23:39,826 : INFO : EPOCH 3 - PROGRESS: at 74.80% examples, 235856 words/s,

2021-12-01 02:24:10,313 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 02:24:10,324 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:24:10,324 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 02:24:10,328 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:24:10,329 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 02:24:10,338 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:24:10,338 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 02:24:10,369 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:24:10,369 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 02:24:10,370 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:24:10,371 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 02:24:10,373 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:24:10,374 : INFO : worker 

2021-12-01 02:24:33,485 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:24:33,485 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:24:33,487 : INFO : EPOCH - 5 : training on 6320648 raw words (5979463 effective words) took 22.8s, 261779 effective words/s
2021-12-01 02:24:34,795 : INFO : EPOCH 6 - PROGRESS: at 11.68% examples, 181325 words/s, in_qsize 25, out_qsize 1
2021-12-01 02:24:35,821 : INFO : EPOCH 6 - PROGRESS: at 18.52% examples, 248622 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:24:36,878 : INFO : EPOCH 6 - PROGRESS: at 22.32% examples, 248461 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:24:37,905 : INFO : EPOCH 6 - PROGRESS: at 25.57% examples, 249124 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:24:38,911 : INFO : EPOCH 6 - PROGRESS: at 27.92% examples, 245238 words/s, in_qsize 38, out_qsize 1
2021-12-01 02:24:39,924 : INFO : EPOCH 6 - PROGRESS: at 30.34% examples, 243876 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:24:40

2021-12-01 02:25:16,483 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:25:16,831 : INFO : EPOCH 7 - PROGRESS: at 96.49% examples, 271973 words/s, in_qsize 28, out_qsize 0
2021-12-01 02:25:17,275 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:25:17,276 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 02:25:17,282 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 02:25:17,282 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 02:25:17,334 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:25:17,335 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 02:25:17,340 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:25:17,340 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 02:25:17,341 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:25:17,343 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 02:25:17

2021-12-01 02:25:39,143 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:25:39,143 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 02:25:39,165 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 02:25:39,165 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 02:25:39,168 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:25:39,168 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 02:25:39,192 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 02:25:39,192 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 02:25:39,204 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:25:39,204 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 02:25:39,206 : INFO : EPOCH - 8 : training on 6320648 raw words (5979528 effective words) took 21.5s, 277681 effective words/s
2021-12-01 02:25:40,347 : INFO : EPOCH 9 - PROGRESS: at 11.73% exam

2021-12-01 02:26:16,398 : INFO : EPOCH 10 - PROGRESS: at 78.48% examples, 278265 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:26:17,421 : INFO : EPOCH 10 - PROGRESS: at 83.20% examples, 277909 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:26:18,447 : INFO : EPOCH 10 - PROGRESS: at 88.18% examples, 279230 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:26:19,471 : INFO : EPOCH 10 - PROGRESS: at 90.50% examples, 275225 words/s, in_qsize 37, out_qsize 3
2021-12-01 02:26:20,480 : INFO : EPOCH 10 - PROGRESS: at 93.74% examples, 276681 words/s, in_qsize 39, out_qsize 0
2021-12-01 02:26:20,908 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 02:26:21,508 : INFO : EPOCH 10 - PROGRESS: at 97.23% examples, 276175 words/s, in_qsize 22, out_qsize 0
2021-12-01 02:26:21,665 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 02:26:21,666 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 02:26:21,676 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 02:2

2021-12-01 02:26:23,007 : INFO : PROGRESS: at sentence #300000, processed 2071405 words, keeping 147230 word types
2021-12-01 02:26:23,034 : INFO : PROGRESS: at sentence #310000, processed 2123470 words, keeping 147237 word types
2021-12-01 02:26:23,056 : INFO : PROGRESS: at sentence #320000, processed 2178214 words, keeping 147261 word types
2021-12-01 02:26:23,083 : INFO : PROGRESS: at sentence #330000, processed 2231362 words, keeping 147296 word types
2021-12-01 02:26:23,104 : INFO : PROGRESS: at sentence #340000, processed 2281390 words, keeping 147319 word types
2021-12-01 02:26:23,128 : INFO : PROGRESS: at sentence #350000, processed 2333993 words, keeping 147370 word types
2021-12-01 02:26:23,152 : INFO : PROGRESS: at sentence #360000, processed 2388111 words, keeping 147381 word types
2021-12-01 02:26:23,174 : INFO : PROGRESS: at sentence #370000, processed 2437676 words, keeping 148692 word types
2021-12-01 02:26:23,203 : INFO : PROGRESS: at sentence #380000, processed 248855

2021-12-01 02:26:28,393 : INFO : resetting layer weights
2021-12-01 02:26:28,400 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-12-01T02:26:28.400492', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'build_vocab'}
  _model.init_sims(replace=True)
2021-12-01 02:26:28,465 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/wol_W2V_Skip-gram_200D', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-01T02:26:28.465695', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'saving'}
2021-12-01 02:26:28,466 : INFO : storing np array 'vectors' to ./emb_models/wol_W2V_Skip-gram_200D.wv.vectors.npy
2021-12-01 02:26:28,539 : INFO : storing np array 'syn1neg' to ./emb_models/wol_W2V_S

<gensim.models.word2vec.Word2Vec at 0x7fd3671de820>

In [30]:
train_FastText_model() #FastText sg=0 Skip-gram model with 100 dim  

2021-12-01 10:10:34,714 : INFO : collecting all words and their counts
2021-12-01 10:10:34,719 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 10:10:34,768 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 10:10:34,793 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 10:10:34,809 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 10:10:34,821 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 10:10:34,835 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 10:10:34,848 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 10:10:34,861 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 10:10:34,875 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 100 dimension



2021-12-01 10:10:34,949 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 10:10:34,988 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 10:10:35,011 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 10:10:35,021 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 10:10:35,038 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 10:10:35,076 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 10:10:35,113 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 10:10:35,147 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 10:10:35,184 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 10:10:37,118 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 10:10:37,154 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 10:10:37,190 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 10:10:37,222 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 10:10:37,244 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 10:10:37,245 : INFO : Creating a fresh vocabulary
2021-12-01 10:10:37,786 : DEBUG : starting a new internal lifecycle event log for FastText
2021-12-01 10:10:37,787 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T10:10:37.786529', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 10:11:35,104 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:11:35,104 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:11:35,115 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:11:35,116 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:11:35,155 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:11:35,155 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:11:35,170 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:11:35,170 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:11:35,208 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:11:35,208 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:11:35,209 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:11:35,210 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:11:35,228 : DEBUG : worker e

2021-12-01 10:12:16,496 : DEBUG : worker exiting, processed 29 jobs
2021-12-01 10:12:16,497 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:12:16,505 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:12:16,506 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:12:16,515 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:12:16,516 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:12:16,525 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:12:16,525 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 10:12:16,543 : DEBUG : worker exiting, processed 36 jobs
2021-12-01 10:12:16,543 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 10:12:16,665 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:12:16,665 : INFO : EPOCH 2 - PROGRESS: at 99.51% examples, 144119 words/s, in_qsize 4, out_qsize 1
2021-12-01 10:12:16,666 

2021-12-01 10:12:57,699 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:12:57,701 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:12:57,702 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 10:12:57,776 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 10:12:57,777 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 10:12:57,797 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:12:57,797 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 10:12:57,835 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:12:57,835 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 10:12:57,875 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:12:57,875 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:12:57,897 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:12:57,898 : INFO : worker thread

2021-12-01 10:13:39,006 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:13:39,006 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:13:39,033 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:13:39,033 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 10:13:39,046 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:13:39,047 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 10:13:39,049 : INFO : EPOCH - 4 : training on 6320648 raw words (5978038 effective words) took 41.1s, 145334 effective words/s
2021-12-01 10:13:40,063 : INFO : EPOCH 5 - PROGRESS: at 8.91% examples, 110319 words/s, in_qsize 21, out_qsize 0
2021-12-01 10:13:41,210 : INFO : EPOCH 5 - PROGRESS: at 11.79% examples, 114171 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:13:42,349 : INFO : EPOCH 5 - PROGRESS: at 16.72% examples, 138765 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:13:43,420 : INFO : EPOCH 5 - P

2021-12-01 10:14:21,291 : INFO : EPOCH 6 - PROGRESS: at 9.58% examples, 100162 words/s, in_qsize 21, out_qsize 0
2021-12-01 10:14:22,322 : INFO : EPOCH 6 - PROGRESS: at 11.94% examples, 121298 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:14:23,500 : INFO : EPOCH 6 - PROGRESS: at 16.72% examples, 139146 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:24,651 : INFO : EPOCH 6 - PROGRESS: at 19.29% examples, 143592 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:25,660 : INFO : EPOCH 6 - PROGRESS: at 21.07% examples, 139016 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:26,708 : INFO : EPOCH 6 - PROGRESS: at 23.03% examples, 138491 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:27,777 : INFO : EPOCH 6 - PROGRESS: at 25.09% examples, 139146 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:28,789 : INFO : EPOCH 6 - PROGRESS: at 26.65% examples, 140592 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:14:29,836 : INFO : EPOCH 6 - PROGRESS: at 28.12% examples, 140211 words/s, 

2021-12-01 10:15:07,616 : INFO : EPOCH 7 - PROGRESS: at 22.82% examples, 141360 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:08,630 : INFO : EPOCH 7 - PROGRESS: at 24.80% examples, 142674 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:09,694 : INFO : EPOCH 7 - PROGRESS: at 26.16% examples, 139394 words/s, in_qsize 39, out_qsize 1
2021-12-01 10:15:10,799 : INFO : EPOCH 7 - PROGRESS: at 27.63% examples, 138264 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:11,908 : INFO : EPOCH 7 - PROGRESS: at 29.19% examples, 138248 words/s, in_qsize 40, out_qsize 0
2021-12-01 10:15:12,978 : INFO : EPOCH 7 - PROGRESS: at 30.77% examples, 139582 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:14,022 : INFO : EPOCH 7 - PROGRESS: at 31.96% examples, 138723 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:15,236 : INFO : EPOCH 7 - PROGRESS: at 35.48% examples, 141134 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:16,243 : INFO : EPOCH 7 - PROGRESS: at 39.64% examples, 143778 words/s,

2021-12-01 10:15:54,558 : INFO : EPOCH 8 - PROGRESS: at 31.28% examples, 138912 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:55,644 : INFO : EPOCH 8 - PROGRESS: at 33.39% examples, 140604 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:56,725 : INFO : EPOCH 8 - PROGRESS: at 37.61% examples, 144061 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:57,740 : INFO : EPOCH 8 - PROGRESS: at 42.54% examples, 148220 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:58,747 : INFO : EPOCH 8 - PROGRESS: at 44.64% examples, 145430 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:15:59,814 : INFO : EPOCH 8 - PROGRESS: at 46.49% examples, 144145 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:00,914 : INFO : EPOCH 8 - PROGRESS: at 48.63% examples, 144252 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:01,953 : INFO : EPOCH 8 - PROGRESS: at 51.10% examples, 145497 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:02,984 : INFO : EPOCH 8 - PROGRESS: at 52.61% examples, 143265 words/s,

2021-12-01 10:16:40,714 : INFO : EPOCH 9 - PROGRESS: at 46.41% examples, 145242 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:41,721 : INFO : EPOCH 9 - PROGRESS: at 48.43% examples, 145513 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:42,747 : INFO : EPOCH 9 - PROGRESS: at 50.29% examples, 144269 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:43,775 : INFO : EPOCH 9 - PROGRESS: at 51.94% examples, 143546 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:44,850 : INFO : EPOCH 9 - PROGRESS: at 56.60% examples, 144900 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:46,019 : INFO : EPOCH 9 - PROGRESS: at 58.46% examples, 144200 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:47,056 : INFO : EPOCH 9 - PROGRESS: at 60.47% examples, 143967 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:48,640 : INFO : EPOCH 9 - PROGRESS: at 65.33% examples, 142937 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:16:49,747 : INFO : EPOCH 9 - PROGRESS: at 67.62% examples, 143878 words/s,

2021-12-01 10:17:27,627 : INFO : EPOCH 10 - PROGRESS: at 59.35% examples, 144620 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:28,772 : INFO : EPOCH 10 - PROGRESS: at 63.05% examples, 144530 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:17:29,778 : INFO : EPOCH 10 - PROGRESS: at 65.35% examples, 143335 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:30,884 : INFO : EPOCH 10 - PROGRESS: at 67.51% examples, 144257 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:31,964 : INFO : EPOCH 10 - PROGRESS: at 69.65% examples, 144377 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:32,974 : INFO : EPOCH 10 - PROGRESS: at 71.40% examples, 143892 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:34,076 : INFO : EPOCH 10 - PROGRESS: at 74.03% examples, 143978 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:35,277 : INFO : EPOCH 10 - PROGRESS: at 78.37% examples, 143996 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:17:36,427 : INFO : EPOCH 10 - PROGRESS: at 80.56% examples, 144541

2021-12-01 10:17:55,674 : INFO : PROGRESS: at sentence #190000, processed 942882 words, keeping 125663 word types
2021-12-01 10:17:55,704 : INFO : PROGRESS: at sentence #200000, processed 1016555 words, keeping 127017 word types
2021-12-01 10:17:55,743 : INFO : PROGRESS: at sentence #210000, processed 1113700 words, keeping 129993 word types
2021-12-01 10:17:55,778 : INFO : PROGRESS: at sentence #220000, processed 1202914 words, keeping 132427 word types
2021-12-01 10:17:55,818 : INFO : PROGRESS: at sentence #230000, processed 1320270 words, keeping 136368 word types
2021-12-01 10:17:55,858 : INFO : PROGRESS: at sentence #240000, processed 1438378 words, keeping 139678 word types
2021-12-01 10:17:55,898 : INFO : PROGRESS: at sentence #250000, processed 1556607 words, keeping 142632 word types
2021-12-01 10:17:55,938 : INFO : PROGRESS: at sentence #260000, processed 1673284 words, keeping 145137 word types
2021-12-01 10:17:55,983 : INFO : PROGRESS: at sentence #270000, processed 1808369

2021-12-01 10:17:58,261 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 leaves 6287015 word corpus (99.46788683691925%% of original 6320648, drops 33633)', 'datetime': '2021-12-01T10:17:58.261885', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
2021-12-01 10:17:59,118 : INFO : deleting the raw counts dictionary of 174111 items
2021-12-01 10:17:59,121 : INFO : sample=0.001 downsamples 25 most-common words
2021-12-01 10:17:59,122 : INFO : FastText lifecycle event {'msg': 'downsampling leaves estimated 5979015.951129102 word corpus (95.1%% of prior 6287015)', 'datetime': '2021-12-01T10:17:59.122040', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'prepare_vocab'}
2021-12-01 10:18:02,247 : INFO : estimated required memory for 140478 w

<gensim.models.fasttext.FastText at 0x7fd3cf216790>

In [31]:
train_w2v_model() #w2v sg=0 CBOW model with 100 dim

2021-12-01 10:18:14,052 : INFO : collecting all words and their counts
2021-12-01 10:18:14,056 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 10:18:14,096 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 10:18:14,117 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 10:18:14,131 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 10:18:14,143 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 10:18:14,155 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 10:18:14,167 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 10:18:14,179 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 10:18:14,192 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 100 dimension



2021-12-01 10:18:14,260 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 10:18:14,298 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 10:18:14,319 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 10:18:14,330 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 10:18:14,346 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 10:18:14,381 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 10:18:14,417 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 10:18:14,450 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 10:18:14,488 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 10:18:16,414 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 10:18:16,452 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 10:18:16,490 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 10:18:16,523 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 10:18:16,546 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 10:18:16,547 : INFO : Creating a fresh vocabulary
2021-12-01 10:18:17,095 : DEBUG : starting a new internal lifecycle event log for Word2Vec
2021-12-01 10:18:17,096 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T10:18:17.095301', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 10:18:36,572 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:36,572 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:18:36,572 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:18:36,573 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:18:36,579 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:36,579 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:18:36,579 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:36,582 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:18:36,584 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:18:36,585 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:18:36,587 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:18:36,587 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:18:36,589 : DEBUG : worker

2021-12-01 10:18:52,019 : INFO : EPOCH 4 - PROGRESS: at 88.70% examples, 701596 words/s, in_qsize 0, out_qsize 12
2021-12-01 10:18:52,939 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:18:52,943 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:18:52,943 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:52,943 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:52,944 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:18:52,944 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:18:52,952 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:18:52,953 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:18:52,953 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:18:52,957 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:18:52,957 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:18:52

2021-12-01 10:19:05,293 : INFO : EPOCH 6 - PROGRESS: at 50.71% examples, 690850 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:19:06,312 : INFO : EPOCH 6 - PROGRESS: at 65.36% examples, 705526 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:07,337 : INFO : EPOCH 6 - PROGRESS: at 78.24% examples, 718005 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:08,406 : INFO : EPOCH 6 - PROGRESS: at 89.60% examples, 712760 words/s, in_qsize 0, out_qsize 15
2021-12-01 10:19:09,222 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:19:09,226 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:19:09,226 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:09,226 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:19:09,227 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:09,227 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:19:09,236 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:19:09,236 

2021-12-01 10:19:20,560 : INFO : EPOCH 8 - PROGRESS: at 39.88% examples, 682931 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:19:21,562 : INFO : EPOCH 8 - PROGRESS: at 51.79% examples, 696727 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:22,568 : INFO : EPOCH 8 - PROGRESS: at 66.21% examples, 710195 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:23,573 : INFO : EPOCH 8 - PROGRESS: at 79.21% examples, 725555 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:24,573 : INFO : EPOCH 8 - PROGRESS: at 90.69% examples, 728915 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:25,386 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:19:25,391 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:19:25,391 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:25,391 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:19:25,391 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:25,391 : INFO : worker thread finished; awaiting finish of 19 more threads
2021

2021-12-01 10:19:35,654 : INFO : EPOCH 10 - PROGRESS: at 27.34% examples, 626121 words/s, in_qsize 5, out_qsize 0
2021-12-01 10:19:36,656 : INFO : EPOCH 10 - PROGRESS: at 39.66% examples, 703908 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:37,666 : INFO : EPOCH 10 - PROGRESS: at 50.91% examples, 695031 words/s, in_qsize 0, out_qsize 16
2021-12-01 10:19:38,713 : INFO : EPOCH 10 - PROGRESS: at 66.60% examples, 723831 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:39,715 : INFO : EPOCH 10 - PROGRESS: at 78.76% examples, 726513 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:40,722 : INFO : EPOCH 10 - PROGRESS: at 90.40% examples, 730379 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:19:41,575 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:19:41,580 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:19:41,581 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:41,581 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:19:41,581 : DEBUG : worker exi

2021-12-01 10:19:42,534 : INFO : PROGRESS: at sentence #300000, processed 2071405 words, keeping 147230 word types
2021-12-01 10:19:42,555 : INFO : PROGRESS: at sentence #310000, processed 2123470 words, keeping 147237 word types
2021-12-01 10:19:42,576 : INFO : PROGRESS: at sentence #320000, processed 2178214 words, keeping 147261 word types
2021-12-01 10:19:42,597 : INFO : PROGRESS: at sentence #330000, processed 2231362 words, keeping 147296 word types
2021-12-01 10:19:42,618 : INFO : PROGRESS: at sentence #340000, processed 2281390 words, keeping 147319 word types
2021-12-01 10:19:42,638 : INFO : PROGRESS: at sentence #350000, processed 2333993 words, keeping 147370 word types
2021-12-01 10:19:42,659 : INFO : PROGRESS: at sentence #360000, processed 2388111 words, keeping 147381 word types
2021-12-01 10:19:42,680 : INFO : PROGRESS: at sentence #370000, processed 2437676 words, keeping 148692 word types
2021-12-01 10:19:42,703 : INFO : PROGRESS: at sentence #380000, processed 248855

2021-12-01 10:19:47,178 : INFO : resetting layer weights
2021-12-01 10:19:47,182 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-12-01T10:19:47.182674', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'build_vocab'}
  _model.init_sims(replace=True)
2021-12-01 10:19:47,217 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/wol_W2V_CBOW_100D', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-01T10:19:47.217459', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'saving'}
2021-12-01 10:19:47,218 : INFO : storing np array 'vectors' to ./emb_models/wol_W2V_CBOW_100D.wv.vectors.npy
2021-12-01 10:19:47,269 : INFO : storing np array 'syn1neg' to ./emb_models/wol_W2V_CBOW_100D.s

<gensim.models.word2vec.Word2Vec at 0x7fd400619bb0>

In [36]:
train_FastText_model() #w2v sg=1 Skip-gram model with 100 dim  

2021-12-01 10:22:48,013 : INFO : collecting all words and their counts
2021-12-01 10:22:48,018 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 10:22:48,062 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 10:22:48,085 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 10:22:48,099 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 10:22:48,112 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 10:22:48,125 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 10:22:48,137 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 10:22:48,150 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 10:22:48,163 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 100 dimension



2021-12-01 10:22:48,230 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 10:22:48,268 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 10:22:48,290 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 10:22:48,300 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 10:22:48,317 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 10:22:48,353 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 10:22:48,389 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 10:22:48,423 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 10:22:48,459 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 10:22:50,393 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 10:22:50,431 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 10:22:50,469 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 10:22:50,502 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 10:22:50,526 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 10:22:50,526 : INFO : Creating a fresh vocabulary
2021-12-01 10:22:51,080 : DEBUG : starting a new internal lifecycle event log for FastText
2021-12-01 10:22:51,081 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T10:22:51.080908', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 10:23:58,775 : INFO : EPOCH 1 - PROGRESS: at 88.56% examples, 97449 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:23:59,789 : INFO : EPOCH 1 - PROGRESS: at 89.60% examples, 97376 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:00,867 : INFO : EPOCH 1 - PROGRESS: at 90.59% examples, 97210 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:01,957 : INFO : EPOCH 1 - PROGRESS: at 91.67% examples, 97201 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:03,002 : INFO : EPOCH 1 - PROGRESS: at 92.56% examples, 96925 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:04,036 : INFO : EPOCH 1 - PROGRESS: at 93.95% examples, 97179 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:05,089 : INFO : EPOCH 1 - PROGRESS: at 95.08% examples, 96891 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:05,144 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:24:06,132 : INFO : EPOCH 1 - PROGRESS: at 96.64% examples, 97443 words/s, in_qsize 27, out_qsize 0
2021-12-01 10:24:07,242 : INF

2021-12-01 10:24:44,675 : INFO : EPOCH 2 - PROGRESS: at 65.36% examples, 98485 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:45,762 : INFO : EPOCH 2 - PROGRESS: at 66.73% examples, 98707 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:46,821 : INFO : EPOCH 2 - PROGRESS: at 68.12% examples, 98699 words/s, in_qsize 40, out_qsize 0
2021-12-01 10:24:47,938 : INFO : EPOCH 2 - PROGRESS: at 69.39% examples, 98506 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:24:48,943 : INFO : EPOCH 2 - PROGRESS: at 70.60% examples, 98421 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:49,947 : INFO : EPOCH 2 - PROGRESS: at 72.16% examples, 98534 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:51,028 : INFO : EPOCH 2 - PROGRESS: at 74.68% examples, 98478 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:52,097 : INFO : EPOCH 2 - PROGRESS: at 77.76% examples, 98937 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:24:53,191 : INFO : EPOCH 2 - PROGRESS: at 78.58% examples, 98232 words/s, in_qsize

2021-12-01 10:25:31,921 : INFO : EPOCH 3 - PROGRESS: at 43.86% examples, 99315 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:33,119 : INFO : EPOCH 3 - PROGRESS: at 46.27% examples, 101025 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:34,165 : INFO : EPOCH 3 - PROGRESS: at 47.06% examples, 99089 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:35,244 : INFO : EPOCH 3 - PROGRESS: at 48.73% examples, 100078 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:36,250 : INFO : EPOCH 3 - PROGRESS: at 50.12% examples, 99581 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:37,329 : INFO : EPOCH 3 - PROGRESS: at 51.33% examples, 99194 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:25:38,451 : INFO : EPOCH 3 - PROGRESS: at 54.28% examples, 99645 words/s, in_qsize 36, out_qsize 2
2021-12-01 10:25:40,029 : INFO : EPOCH 3 - PROGRESS: at 56.95% examples, 98911 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:25:41,291 : INFO : EPOCH 3 - PROGRESS: at 58.96% examples, 100095 words/s, in_qs

2021-12-01 10:26:20,631 : INFO : EPOCH 4 - PROGRESS: at 26.65% examples, 97255 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:21,636 : INFO : EPOCH 4 - PROGRESS: at 27.72% examples, 97819 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:22,712 : INFO : EPOCH 4 - PROGRESS: at 28.70% examples, 97145 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:23,937 : INFO : EPOCH 4 - PROGRESS: at 29.68% examples, 95670 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:25,015 : INFO : EPOCH 4 - PROGRESS: at 30.77% examples, 96399 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:26:26,041 : INFO : EPOCH 4 - PROGRESS: at 31.45% examples, 95164 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:27,132 : INFO : EPOCH 4 - PROGRESS: at 33.30% examples, 97297 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:28,186 : INFO : EPOCH 4 - PROGRESS: at 35.70% examples, 98418 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:26:29,368 : INFO : EPOCH 4 - PROGRESS: at 39.66% examples, 101046 words/s, in_qsiz

2021-12-01 10:27:07,377 : INFO : EPOCH - 4 : training on 6320648 raw words (5979455 effective words) took 59.1s, 101119 effective words/s
2021-12-01 10:27:08,712 : INFO : EPOCH 5 - PROGRESS: at 9.71% examples, 79508 words/s, in_qsize 21, out_qsize 0
2021-12-01 10:27:09,848 : INFO : EPOCH 5 - PROGRESS: at 11.80% examples, 99840 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:10,969 : INFO : EPOCH 5 - PROGRESS: at 14.90% examples, 95353 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:12,039 : INFO : EPOCH 5 - PROGRESS: at 17.28% examples, 106205 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:13,157 : INFO : EPOCH 5 - PROGRESS: at 18.00% examples, 93604 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:14,179 : INFO : EPOCH 5 - PROGRESS: at 20.11% examples, 101537 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:15,242 : INFO : EPOCH 5 - PROGRESS: at 20.98% examples, 96196 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:27:16,291 : INFO : EPOCH 5 - PROGRESS: at 23.01% example

2021-12-01 10:28:05,728 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:28:05,739 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:28:05,739 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 10:28:05,755 : DEBUG : worker exiting, processed 36 jobs
2021-12-01 10:28:05,755 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 10:28:05,886 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:28:05,886 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 10:28:05,896 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:28:05,897 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 10:28:05,909 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:28:05,909 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:28:06,009 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:28:06,009 : INFO : worker thread

2021-12-01 10:29:03,980 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:29:03,981 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:29:04,028 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:29:04,029 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:29:04,034 : DEBUG : worker exiting, processed 36 jobs
2021-12-01 10:29:04,034 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:29:04,041 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:29:04,041 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:29:04,097 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:29:04,098 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:29:04,102 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:29:04,102 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:29:04,110 : DEBUG : worker e

2021-12-01 10:30:01,138 : INFO : EPOCH 7 - PROGRESS: at 97.23% examples, 101891 words/s, in_qsize 22, out_qsize 0
2021-12-01 10:30:01,787 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:30:01,787 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:30:01,848 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:30:01,849 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:30:01,904 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:30:01,905 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:30:01,985 : DEBUG : worker exiting, processed 36 jobs
2021-12-01 10:30:01,986 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:30:02,026 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:30:02,027 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:30:02,032 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:30:0

2021-12-01 10:30:48,820 : INFO : EPOCH 8 - PROGRESS: at 85.47% examples, 103438 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:49,880 : INFO : EPOCH 8 - PROGRESS: at 86.53% examples, 102345 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:51,015 : INFO : EPOCH 8 - PROGRESS: at 88.43% examples, 103137 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:52,046 : INFO : EPOCH 8 - PROGRESS: at 88.99% examples, 101950 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:53,139 : INFO : EPOCH 8 - PROGRESS: at 90.50% examples, 102582 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:54,213 : INFO : EPOCH 8 - PROGRESS: at 91.48% examples, 102306 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:55,252 : INFO : EPOCH 8 - PROGRESS: at 92.65% examples, 102474 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:56,268 : INFO : EPOCH 8 - PROGRESS: at 94.04% examples, 102483 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:30:57,226 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:30:57,8

2021-12-01 10:31:36,388 : INFO : EPOCH 9 - PROGRESS: at 67.35% examples, 103693 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:37,392 : INFO : EPOCH 9 - PROGRESS: at 68.62% examples, 103423 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:38,427 : INFO : EPOCH 9 - PROGRESS: at 69.65% examples, 102821 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:39,506 : INFO : EPOCH 9 - PROGRESS: at 71.25% examples, 103126 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:40,764 : INFO : EPOCH 9 - PROGRESS: at 73.13% examples, 102707 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:41,844 : INFO : EPOCH 9 - PROGRESS: at 76.83% examples, 103299 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:42,973 : INFO : EPOCH 9 - PROGRESS: at 78.36% examples, 102810 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:31:44,122 : INFO : EPOCH 9 - PROGRESS: at 80.14% examples, 103407 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:31:45,148 : INFO : EPOCH 9 - PROGRESS: at 81.06% examples, 102751 words/s,

2021-12-01 10:32:23,970 : INFO : EPOCH 10 - PROGRESS: at 49.09% examples, 102829 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:25,145 : INFO : EPOCH 10 - PROGRESS: at 50.88% examples, 103361 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:26,266 : INFO : EPOCH 10 - PROGRESS: at 52.08% examples, 102625 words/s, in_qsize 38, out_qsize 2
2021-12-01 10:32:27,417 : INFO : EPOCH 10 - PROGRESS: at 56.28% examples, 103530 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:28,507 : INFO : EPOCH 10 - PROGRESS: at 57.33% examples, 102668 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:29,524 : INFO : EPOCH 10 - PROGRESS: at 58.81% examples, 103315 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:30,592 : INFO : EPOCH 10 - PROGRESS: at 60.35% examples, 103162 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:31,700 : INFO : EPOCH 10 - PROGRESS: at 63.88% examples, 103505 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:32:32,733 : INFO : EPOCH 10 - PROGRESS: at 65.35% examples, 102909

2021-12-01 10:33:05,055 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 10:33:05,067 : INFO : PROGRESS: at sentence #80000, processed 257135 words, keeping 84909 word types
2021-12-01 10:33:05,083 : INFO : PROGRESS: at sentence #90000, processed 277135 words, keeping 90915 word types
2021-12-01 10:33:05,095 : INFO : PROGRESS: at sentence #100000, processed 296511 words, keeping 95135 word types
2021-12-01 10:33:05,136 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 10:33:05,174 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 10:33:05,195 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 10:33:05,205 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 10:33:05,220 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping

2021-12-01 10:33:07,102 : INFO : PROGRESS: at sentence #790000, processed 5572941 words, keeping 166751 word types
2021-12-01 10:33:07,141 : INFO : PROGRESS: at sentence #800000, processed 5690350 words, keeping 166761 word types
2021-12-01 10:33:07,180 : INFO : PROGRESS: at sentence #810000, processed 5808597 words, keeping 166774 word types
2021-12-01 10:33:07,208 : INFO : PROGRESS: at sentence #820000, processed 5888044 words, keeping 167559 word types
2021-12-01 10:33:07,245 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 10:33:07,282 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 10:33:07,319 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 10:33:07,351 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 10:33:07,374 : INFO : collected 174111 word types from a corpus of 63

<gensim.models.fasttext.FastText at 0x7fd3cf2163a0>

In [37]:
train_w2v_model() #w2v sg=1 CBOW model with 100 dim

2021-12-01 10:33:23,865 : INFO : collecting all words and their counts
2021-12-01 10:33:23,869 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-01 10:33:23,910 : INFO : PROGRESS: at sentence #10000, processed 94367 words, keeping 25860 word types
2021-12-01 10:33:23,931 : INFO : PROGRESS: at sentence #20000, processed 137135 words, keeping 34975 word types
2021-12-01 10:33:23,944 : INFO : PROGRESS: at sentence #30000, processed 157135 words, keeping 43701 word types
2021-12-01 10:33:23,957 : INFO : PROGRESS: at sentence #40000, processed 177135 words, keeping 52299 word types
2021-12-01 10:33:23,970 : INFO : PROGRESS: at sentence #50000, processed 197135 words, keeping 60914 word types
2021-12-01 10:33:23,982 : INFO : PROGRESS: at sentence #60000, processed 217135 words, keeping 69543 word types
2021-12-01 10:33:23,995 : INFO : PROGRESS: at sentence #70000, processed 237135 words, keeping 77508 word types
2021-12-01 10:33:24,007 : INFO : PROGRESS: at s

Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 100 dimension



2021-12-01 10:33:24,074 : INFO : PROGRESS: at sentence #110000, processed 398973 words, keeping 103853 word types
2021-12-01 10:33:24,114 : INFO : PROGRESS: at sentence #120000, processed 495786 words, keeping 109784 word types
2021-12-01 10:33:24,135 : INFO : PROGRESS: at sentence #130000, processed 541473 words, keeping 110742 word types
2021-12-01 10:33:24,146 : INFO : PROGRESS: at sentence #140000, processed 555554 words, keeping 110894 word types
2021-12-01 10:33:24,161 : INFO : PROGRESS: at sentence #150000, processed 582222 words, keeping 111943 word types
2021-12-01 10:33:24,198 : INFO : PROGRESS: at sentence #160000, processed 672309 words, keeping 116135 word types
2021-12-01 10:33:24,233 : INFO : PROGRESS: at sentence #170000, processed 761491 words, keeping 121442 word types
2021-12-01 10:33:24,266 : INFO : PROGRESS: at sentence #180000, processed 845253 words, keeping 123145 word types
2021-12-01 10:33:24,302 : INFO : PROGRESS: at sentence #190000, processed 942882 words, 

2021-12-01 10:33:26,201 : INFO : PROGRESS: at sentence #830000, processed 5987014 words, keeping 167834 word types
2021-12-01 10:33:26,239 : INFO : PROGRESS: at sentence #840000, processed 6091595 words, keeping 170779 word types
2021-12-01 10:33:26,277 : INFO : PROGRESS: at sentence #850000, processed 6188067 words, keeping 172114 word types
2021-12-01 10:33:26,310 : INFO : PROGRESS: at sentence #860000, processed 6268618 words, keeping 173751 word types
2021-12-01 10:33:26,333 : INFO : collected 174111 word types from a corpus of 6320648 raw words and 865689 sentences
2021-12-01 10:33:26,333 : INFO : Creating a fresh vocabulary
2021-12-01 10:33:26,891 : DEBUG : starting a new internal lifecycle event log for Word2Vec
2021-12-01 10:33:26,892 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 140478 unique words (80.68301256095249%% of original 174111, drops 33633)', 'datetime': '2021-12-01T10:33:26.891921', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 202

2021-12-01 10:33:55,878 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:33:55,878 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 10:33:55,920 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:33:55,920 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 10:33:55,939 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:33:55,939 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:33:55,959 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:33:55,959 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 10:33:55,991 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:33:55,991 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 10:33:55,993 : INFO : EPOCH - 1 : training on 6320648 raw words (5978044 effective words) took 26.6s, 224655 effective words/s
2021-12-01 10:33:57,009 : INFO : EPOCH 2 - PROGRESS: at 11.60% exam

2021-12-01 10:34:32,714 : INFO : EPOCH 3 - PROGRESS: at 49.99% examples, 239821 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:34:33,731 : INFO : EPOCH 3 - PROGRESS: at 56.73% examples, 248006 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:34:34,878 : INFO : EPOCH 3 - PROGRESS: at 59.50% examples, 243352 words/s, in_qsize 37, out_qsize 2
2021-12-01 10:34:35,889 : INFO : EPOCH 3 - PROGRESS: at 65.62% examples, 246277 words/s, in_qsize 40, out_qsize 0
2021-12-01 10:34:36,911 : INFO : EPOCH 3 - PROGRESS: at 68.75% examples, 245954 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:34:37,967 : INFO : EPOCH 3 - PROGRESS: at 72.51% examples, 246801 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:34:39,037 : INFO : EPOCH 3 - PROGRESS: at 78.36% examples, 247072 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:34:40,053 : INFO : EPOCH 3 - PROGRESS: at 81.34% examples, 245978 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:34:41,057 : INFO : EPOCH 3 - PROGRESS: at 87.60% examples, 249574 words/s,

2021-12-01 10:35:07,671 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:35:07,691 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:35:07,692 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:35:07,708 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:35:07,708 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:35:07,718 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:35:07,718 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:35:07,723 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:35:07,723 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:35:07,741 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:35:07,742 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:35:07,754 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:35:07,754 : INFO : worker t

2021-12-01 10:35:31,110 : INFO : EPOCH 6 - PROGRESS: at 11.68% examples, 205614 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:32,156 : INFO : EPOCH 6 - PROGRESS: at 17.43% examples, 229663 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:33,237 : INFO : EPOCH 6 - PROGRESS: at 22.04% examples, 256916 words/s, in_qsize 38, out_qsize 1
2021-12-01 10:35:34,250 : INFO : EPOCH 6 - PROGRESS: at 25.27% examples, 249677 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:35,268 : INFO : EPOCH 6 - PROGRESS: at 28.12% examples, 254067 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:36,321 : INFO : EPOCH 6 - PROGRESS: at 30.94% examples, 257281 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:37,345 : INFO : EPOCH 6 - PROGRESS: at 35.06% examples, 261928 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:38,420 : INFO : EPOCH 6 - PROGRESS: at 43.74% examples, 272531 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:35:39,422 : INFO : EPOCH 6 - PROGRESS: at 47.40% examples, 270566 words/s,

2021-12-01 10:36:12,546 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:36:12,549 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:36:12,550 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:36:12,550 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:36:12,548 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:36:12,557 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:36:12,557 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:36:12,572 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:36:12,572 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:36:12,600 : DEBUG : worker exiting, processed 33 jobs
2021-12-01 10:36:12,600 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:36:12,604 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:36:12,604 : INFO : worker 

2021-12-01 10:36:33,933 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:36:33,996 : DEBUG : worker exiting, processed 34 jobs
2021-12-01 10:36:33,996 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 10:36:34,025 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:36:34,025 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 10:36:34,027 : INFO : EPOCH - 8 : training on 6320648 raw words (5979825 effective words) took 21.2s, 282416 effective words/s
2021-12-01 10:36:35,199 : INFO : EPOCH 9 - PROGRESS: at 11.67% examples, 203058 words/s, in_qsize 3, out_qsize 1
2021-12-01 10:36:36,208 : INFO : EPOCH 9 - PROGRESS: at 18.60% examples, 270441 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:36:37,251 : INFO : EPOCH 9 - PROGRESS: at 22.73% examples, 273300 words/s, in_qsize 39, out_qsize 0
2021-12-01 10:36:38,273 : INFO : EPOCH 9 - PROGRESS: at 25.96% examples, 268323 words/s, in_qsize 39, out_qsize 

2021-12-01 10:37:15,001 : DEBUG : job loop exiting, total 633 jobs
2021-12-01 10:37:15,105 : INFO : EPOCH 10 - PROGRESS: at 95.76% examples, 281355 words/s, in_qsize 35, out_qsize 0
2021-12-01 10:37:15,740 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:37:15,740 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:37:15,743 : DEBUG : worker exiting, processed 32 jobs
2021-12-01 10:37:15,744 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:37:15,746 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:37:15,747 : DEBUG : worker exiting, processed 30 jobs
2021-12-01 10:37:15,744 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:37:15,747 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:37:15,778 : DEBUG : worker exiting, processed 31 jobs
2021-12-01 10:37:15,778 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:37:1

2021-12-01 10:37:16,998 : INFO : PROGRESS: at sentence #350000, processed 2333993 words, keeping 147370 word types
2021-12-01 10:37:17,019 : INFO : PROGRESS: at sentence #360000, processed 2388111 words, keeping 147381 word types
2021-12-01 10:37:17,040 : INFO : PROGRESS: at sentence #370000, processed 2437676 words, keeping 148692 word types
2021-12-01 10:37:17,065 : INFO : PROGRESS: at sentence #380000, processed 2488553 words, keeping 149833 word types
2021-12-01 10:37:17,100 : INFO : PROGRESS: at sentence #390000, processed 2570351 words, keeping 150774 word types
2021-12-01 10:37:17,135 : INFO : PROGRESS: at sentence #400000, processed 2651000 words, keeping 151378 word types
2021-12-01 10:37:17,173 : INFO : PROGRESS: at sentence #410000, processed 2746833 words, keeping 153443 word types
2021-12-01 10:37:17,206 : INFO : PROGRESS: at sentence #420000, processed 2835932 words, keeping 154045 word types
2021-12-01 10:37:17,237 : INFO : PROGRESS: at sentence #430000, processed 292058

2021-12-01 10:37:21,556 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/wol_W2V_Skip-gram_100D', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-01T10:37:21.556496', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'saving'}
2021-12-01 10:37:21,557 : INFO : storing np array 'vectors' to ./emb_models/wol_W2V_Skip-gram_100D.wv.vectors.npy
2021-12-01 10:37:21,588 : INFO : storing np array 'syn1neg' to ./emb_models/wol_W2V_Skip-gram_100D.syn1neg.npy
2021-12-01 10:37:21,622 : INFO : not storing attribute cum_table
2021-12-01 10:37:21,623 : DEBUG : {'uri': './emb_models/wol_W2V_Skip-gram_100D', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 10:37:21,662 : INFO : saved ./emb_models/wol_W2V_Skip-gram_

<gensim.models.word2vec.Word2Vec at 0x7fd2fb8ca610>

In [47]:
train_w2v_model(sub_word = True) #w2v sg=0 Skip-gram model with 100 dim  

2021-12-01 10:49:52,796 : INFO : collecting all words and their counts
2021-12-01 10:49:52,799 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 100 dimension



2021-12-01 10:49:53,091 : INFO : PROGRESS: at sentence #10000, processed 133237 words, keeping 11491 word types
2021-12-01 10:49:53,229 : INFO : PROGRESS: at sentence #20000, processed 195405 words, keeping 12682 word types
2021-12-01 10:49:53,296 : INFO : PROGRESS: at sentence #30000, processed 231684 words, keeping 13190 word types
2021-12-01 10:49:53,372 : INFO : PROGRESS: at sentence #40000, processed 269413 words, keeping 13591 word types
2021-12-01 10:49:53,439 : INFO : PROGRESS: at sentence #50000, processed 305238 words, keeping 14007 word types
2021-12-01 10:49:53,514 : INFO : PROGRESS: at sentence #60000, processed 342336 words, keeping 14400 word types
2021-12-01 10:49:53,580 : INFO : PROGRESS: at sentence #70000, processed 376490 words, keeping 14727 word types
2021-12-01 10:49:53,655 : INFO : PROGRESS: at sentence #80000, processed 409865 words, keeping 14941 word types
2021-12-01 10:49:53,724 : INFO : PROGRESS: at sentence #90000, processed 440464 words, keeping 15112 wor

2021-12-01 10:50:08,289 : INFO : PROGRESS: at sentence #740000, processed 7409767 words, keeping 15986 word types
2021-12-01 10:50:08,416 : INFO : PROGRESS: at sentence #750000, processed 7467746 words, keeping 15986 word types
2021-12-01 10:50:08,713 : INFO : PROGRESS: at sentence #760000, processed 7619446 words, keeping 15986 word types
2021-12-01 10:50:09,002 : INFO : PROGRESS: at sentence #770000, processed 7754852 words, keeping 15986 word types
2021-12-01 10:50:09,321 : INFO : PROGRESS: at sentence #780000, processed 7909310 words, keeping 15986 word types
2021-12-01 10:50:09,650 : INFO : PROGRESS: at sentence #790000, processed 8070188 words, keeping 15986 word types
2021-12-01 10:50:09,984 : INFO : PROGRESS: at sentence #800000, processed 8231320 words, keeping 15986 word types
2021-12-01 10:50:10,318 : INFO : PROGRESS: at sentence #810000, processed 8393485 words, keeping 15986 word types
2021-12-01 10:50:10,543 : INFO : PROGRESS: at sentence #820000, processed 8502067 words,

2021-12-01 10:50:33,137 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:50:33,138 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:50:33,138 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:50:33,139 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:50:33,139 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:50:33,140 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 10:50:33,140 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 10:50:33,140 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 10:50:33,141 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 10:50:33,141 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:50:33,145 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:50:33,

2021-12-01 10:51:06,599 : INFO : EPOCH 3 - PROGRESS: at 60.93% examples, 382451 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:07,608 : INFO : EPOCH 3 - PROGRESS: at 66.30% examples, 381057 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:08,608 : INFO : EPOCH 3 - PROGRESS: at 69.92% examples, 380876 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:09,626 : INFO : EPOCH 3 - PROGRESS: at 74.33% examples, 382745 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:10,631 : INFO : EPOCH 3 - PROGRESS: at 79.77% examples, 382766 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:51:11,633 : INFO : EPOCH 3 - PROGRESS: at 84.45% examples, 385078 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:12,643 : INFO : EPOCH 3 - PROGRESS: at 89.68% examples, 386875 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:13,652 : INFO : EPOCH 3 - PROGRESS: at 93.05% examples, 388374 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:51:14,680 : INFO : EPOCH 3 - PROGRESS: at 97.34% examples, 391050 words/s, in_qsiz

2021-12-01 10:51:36,696 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:51:36,697 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:51:36,697 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:51:36,698 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:51:36,698 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:51:36,699 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:51:36,699 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:51:36,700 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:51:36,700 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:51:36,701 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:51:36,702 : INFO : worker thread finished; awaiting finish of 6 more 

2021-12-01 10:52:06,025 : INFO : EPOCH 6 - PROGRESS: at 44.23% examples, 382750 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:07,030 : INFO : EPOCH 6 - PROGRESS: at 48.51% examples, 386067 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:08,051 : INFO : EPOCH 6 - PROGRESS: at 52.67% examples, 388848 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:09,052 : INFO : EPOCH 6 - PROGRESS: at 57.86% examples, 385301 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:10,078 : INFO : EPOCH 6 - PROGRESS: at 61.79% examples, 384803 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:52:11,120 : INFO : EPOCH 6 - PROGRESS: at 67.45% examples, 385030 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:12,125 : INFO : EPOCH 6 - PROGRESS: at 71.14% examples, 385165 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:13,136 : INFO : EPOCH 6 - PROGRESS: at 75.35% examples, 385602 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:52:14,139 : INFO : EPOCH 6 - PROGRESS: at 80.67% examples, 384492 words/s, in_qsiz

2021-12-01 10:52:39,824 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:52:39,824 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:52:39,830 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 10:52:39,841 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:52:39,845 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:52:39,848 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:52:39,848 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:52:39,849 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:52:39,849 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:52:39,849 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:52:39,850 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:52:39,850 : INFO : worker thread finished; awa

2021-12-01 10:53:04,268 : INFO : EPOCH 9 - PROGRESS: at 19.64% examples, 328902 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:53:05,287 : INFO : EPOCH 9 - PROGRESS: at 23.60% examples, 339577 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:53:06,316 : INFO : EPOCH 9 - PROGRESS: at 27.13% examples, 349602 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:53:07,320 : INFO : EPOCH 9 - PROGRESS: at 30.43% examples, 359246 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:53:08,330 : INFO : EPOCH 9 - PROGRESS: at 34.40% examples, 367861 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:53:09,340 : INFO : EPOCH 9 - PROGRESS: at 40.49% examples, 362039 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:53:10,367 : INFO : EPOCH 9 - PROGRESS: at 46.08% examples, 362434 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:53:11,387 : INFO : EPOCH 9 - PROGRESS: at 50.12% examples, 366693 words/s, in_qsize 1, out_qsize 1
2021-12-01 10:53:12,410 : INFO : EPOCH 9 - PROGRESS: at 54.25% examples, 371782 words/s, in_qsiz

2021-12-01 10:53:43,241 : DEBUG : worker exiting, processed 47 jobs
2021-12-01 10:53:43,242 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:53:43,242 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:53:43,242 : DEBUG : worker exiting, processed 50 jobs
2021-12-01 10:53:43,242 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 10:53:43,243 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:53:43,243 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:53:43,244 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 10:53:43,244 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:53:43,244 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 10:53:43,244 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:53:43,255 : DEBUG : worker exiting, processed 41 jobs
2021-12-01 10:53:43,261 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:53:43,269 : INFO : worker thread finished; awaiting fi

2021-12-01 10:53:50,342 : INFO : PROGRESS: at sentence #410000, processed 3867189 words, keeping 15966 word types
2021-12-01 10:53:50,569 : INFO : PROGRESS: at sentence #420000, processed 3996547 words, keeping 15966 word types
2021-12-01 10:53:50,800 : INFO : PROGRESS: at sentence #430000, processed 4124472 words, keeping 15966 word types
2021-12-01 10:53:51,011 : INFO : PROGRESS: at sentence #440000, processed 4243535 words, keeping 15966 word types
2021-12-01 10:53:51,268 : INFO : PROGRESS: at sentence #450000, processed 4383057 words, keeping 15966 word types
2021-12-01 10:53:51,472 : INFO : PROGRESS: at sentence #460000, processed 4500960 words, keeping 15966 word types
2021-12-01 10:53:51,737 : INFO : PROGRESS: at sentence #470000, processed 4649225 words, keeping 15966 word types
2021-12-01 10:53:51,809 : INFO : PROGRESS: at sentence #480000, processed 4691062 words, keeping 15966 word types
2021-12-01 10:53:51,856 : INFO : PROGRESS: at sentence #490000, processed 4716395 words,

2021-12-01 10:54:00,031 : INFO : not storing attribute cum_table
2021-12-01 10:54:00,032 : DEBUG : {'uri': './emb_models/subword_wol_W2V_CBOW_100D', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 10:54:00,107 : INFO : saved ./emb_models/subword_wol_W2V_CBOW_100D


<gensim.models.word2vec.Word2Vec at 0x7fd2fb8ca400>

In [48]:
train_w2v_model(sub_word = True) #w2v sg=0 Skip-gram model with 100 dim  

2021-12-01 10:54:00,142 : INFO : collecting all words and their counts
2021-12-01 10:54:00,145 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec CBOW with 100 dimension



2021-12-01 10:54:00,389 : INFO : PROGRESS: at sentence #10000, processed 133237 words, keeping 11491 word types
2021-12-01 10:54:00,505 : INFO : PROGRESS: at sentence #20000, processed 195405 words, keeping 12682 word types
2021-12-01 10:54:00,565 : INFO : PROGRESS: at sentence #30000, processed 231684 words, keeping 13190 word types
2021-12-01 10:54:00,627 : INFO : PROGRESS: at sentence #40000, processed 269413 words, keeping 13591 word types
2021-12-01 10:54:00,687 : INFO : PROGRESS: at sentence #50000, processed 305238 words, keeping 14007 word types
2021-12-01 10:54:00,747 : INFO : PROGRESS: at sentence #60000, processed 342336 words, keeping 14400 word types
2021-12-01 10:54:00,805 : INFO : PROGRESS: at sentence #70000, processed 376490 words, keeping 14727 word types
2021-12-01 10:54:00,863 : INFO : PROGRESS: at sentence #80000, processed 409865 words, keeping 14941 word types
2021-12-01 10:54:00,919 : INFO : PROGRESS: at sentence #90000, processed 440464 words, keeping 15112 wor

2021-12-01 10:54:13,484 : INFO : PROGRESS: at sentence #740000, processed 7409767 words, keeping 15986 word types
2021-12-01 10:54:13,589 : INFO : PROGRESS: at sentence #750000, processed 7467746 words, keeping 15986 word types
2021-12-01 10:54:13,849 : INFO : PROGRESS: at sentence #760000, processed 7619446 words, keeping 15986 word types
2021-12-01 10:54:14,096 : INFO : PROGRESS: at sentence #770000, processed 7754852 words, keeping 15986 word types
2021-12-01 10:54:14,372 : INFO : PROGRESS: at sentence #780000, processed 7909310 words, keeping 15986 word types
2021-12-01 10:54:14,656 : INFO : PROGRESS: at sentence #790000, processed 8070188 words, keeping 15986 word types
2021-12-01 10:54:14,942 : INFO : PROGRESS: at sentence #800000, processed 8231320 words, keeping 15986 word types
2021-12-01 10:54:15,229 : INFO : PROGRESS: at sentence #810000, processed 8393485 words, keeping 15986 word types
2021-12-01 10:54:15,424 : INFO : PROGRESS: at sentence #820000, processed 8502067 words,

2021-12-01 10:54:37,826 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:54:37,827 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:54:37,827 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 10:54:37,828 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 10:54:37,828 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 10:54:37,829 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 10:54:37,829 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 10:54:37,830 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 10:54:37,831 : DEBUG : worker exiting, processed 50 jobs
2021-12-01 10:54:37,831 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 10:54:37,835 : DEBUG : worker exiting, processed 50 jobs
2021-12-01 10:54:37,835 : INFO : worker thre

2021-12-01 10:55:10,485 : INFO : EPOCH 3 - PROGRESS: at 55.63% examples, 374250 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:55:11,512 : INFO : EPOCH 3 - PROGRESS: at 60.87% examples, 379053 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:55:12,551 : INFO : EPOCH 3 - PROGRESS: at 66.50% examples, 379088 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:55:13,570 : INFO : EPOCH 3 - PROGRESS: at 70.46% examples, 381083 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:55:14,575 : INFO : EPOCH 3 - PROGRESS: at 74.33% examples, 379678 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:55:15,590 : INFO : EPOCH 3 - PROGRESS: at 80.13% examples, 381880 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:55:16,598 : INFO : EPOCH 3 - PROGRESS: at 85.59% examples, 383536 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:55:17,637 : INFO : EPOCH 3 - PROGRESS: at 89.54% examples, 382354 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:55:18,676 : INFO : EPOCH 3 - PROGRESS: at 91.48% examples, 374295 words/s, in_qsiz

2021-12-01 10:55:42,572 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:55:42,574 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:55:42,574 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:55:42,575 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:55:42,575 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 10:55:42,576 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 10:55:42,576 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 10:55:42,577 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 10:55:42,577 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 10:55:42,578 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 10:55:42,578 : INFO : worker thread finished; awaiting finish of 8 mor

2021-12-01 10:56:09,070 : INFO : EPOCH 6 - PROGRESS: at 27.58% examples, 358731 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:10,139 : INFO : EPOCH 6 - PROGRESS: at 30.31% examples, 351246 words/s, in_qsize 1, out_qsize 0
2021-12-01 10:56:11,204 : INFO : EPOCH 6 - PROGRESS: at 34.03% examples, 357957 words/s, in_qsize 1, out_qsize 1
2021-12-01 10:56:12,210 : INFO : EPOCH 6 - PROGRESS: at 41.35% examples, 361545 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:13,215 : INFO : EPOCH 6 - PROGRESS: at 46.31% examples, 360680 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:14,230 : INFO : EPOCH 6 - PROGRESS: at 50.53% examples, 367020 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:15,249 : INFO : EPOCH 6 - PROGRESS: at 56.11% examples, 372979 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:16,279 : INFO : EPOCH 6 - PROGRESS: at 60.86% examples, 376989 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:56:17,291 : INFO : EPOCH 6 - PROGRESS: at 66.05% examples, 374592 words/s, in_qsiz

2021-12-01 10:56:46,118 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 49 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 50 jobs
2021-12-01 10:56:46,119 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 10:56:46,119 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 10:56:46,143 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 10:56:46,144 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 10:56:46,144 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 10:56:46,145 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 10:56:46,145 : INFO : worker thread finished; awaitin

2021-12-01 10:57:08,026 : INFO : EPOCH 9 - PROGRESS: at 6.85% examples, 279967 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:57:09,030 : INFO : EPOCH 9 - PROGRESS: at 13.79% examples, 309036 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:57:10,042 : INFO : EPOCH 9 - PROGRESS: at 19.79% examples, 333650 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:57:11,149 : INFO : EPOCH 9 - PROGRESS: at 23.85% examples, 340403 words/s, in_qsize 0, out_qsize 2
2021-12-01 10:57:12,158 : INFO : EPOCH 9 - PROGRESS: at 27.43% examples, 353292 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:57:13,162 : INFO : EPOCH 9 - PROGRESS: at 30.72% examples, 363261 words/s, in_qsize 0, out_qsize 0
2021-12-01 10:57:14,188 : INFO : EPOCH 9 - PROGRESS: at 34.93% examples, 368298 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:57:15,192 : INFO : EPOCH 9 - PROGRESS: at 42.40% examples, 371810 words/s, in_qsize 0, out_qsize 1
2021-12-01 10:57:16,200 : INFO : EPOCH 9 - PROGRESS: at 46.87% examples, 371509 words/s, in_qsize

2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 42 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 49 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 50 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 10:57:49,911 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 10:57:49,912 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:57:49,912 : DEBUG : worker exiting, processed 51 jobs
2021-12-01 10:57:49,912 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 10:57:49,912 : DEBUG : worker exiting, processed 47 jobs
2021-12-01 10:57:49,912 : INFO : worker thread f

2021-12-01 10:57:56,275 : INFO : PROGRESS: at sentence #380000, processed 3495144 words, keeping 15965 word types
2021-12-01 10:57:56,491 : INFO : PROGRESS: at sentence #390000, processed 3611566 words, keeping 15966 word types
2021-12-01 10:57:56,698 : INFO : PROGRESS: at sentence #400000, processed 3720008 words, keeping 15966 word types
2021-12-01 10:57:56,953 : INFO : PROGRESS: at sentence #410000, processed 3867189 words, keeping 15966 word types
2021-12-01 10:57:57,179 : INFO : PROGRESS: at sentence #420000, processed 3996547 words, keeping 15966 word types
2021-12-01 10:57:57,408 : INFO : PROGRESS: at sentence #430000, processed 4124472 words, keeping 15966 word types
2021-12-01 10:57:57,618 : INFO : PROGRESS: at sentence #440000, processed 4243535 words, keeping 15966 word types
2021-12-01 10:57:57,873 : INFO : PROGRESS: at sentence #450000, processed 4383057 words, keeping 15966 word types
2021-12-01 10:57:58,075 : INFO : PROGRESS: at sentence #460000, processed 4500960 words,

2021-12-01 10:58:06,594 : INFO : not storing attribute cum_table
2021-12-01 10:58:06,594 : DEBUG : {'uri': './emb_models/subword_wol_W2V_CBOW_100D', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 10:58:06,733 : INFO : saved ./emb_models/subword_wol_W2V_CBOW_100D


<gensim.models.word2vec.Word2Vec at 0x7fd3cf1fc550>

In [52]:
train_w2v_model(sub_word = True) #w2v sg=1 Skip-gram model with 100 dim  

2021-12-01 11:05:46,448 : INFO : collecting all words and their counts
2021-12-01 11:05:46,451 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 100 dimension



2021-12-01 11:05:46,704 : INFO : PROGRESS: at sentence #10000, processed 133237 words, keeping 11491 word types
2021-12-01 11:05:46,820 : INFO : PROGRESS: at sentence #20000, processed 195405 words, keeping 12682 word types
2021-12-01 11:05:46,880 : INFO : PROGRESS: at sentence #30000, processed 231684 words, keeping 13190 word types
2021-12-01 11:05:46,953 : INFO : PROGRESS: at sentence #40000, processed 269413 words, keeping 13591 word types
2021-12-01 11:05:47,018 : INFO : PROGRESS: at sentence #50000, processed 305238 words, keeping 14007 word types
2021-12-01 11:05:47,079 : INFO : PROGRESS: at sentence #60000, processed 342336 words, keeping 14400 word types
2021-12-01 11:05:47,137 : INFO : PROGRESS: at sentence #70000, processed 376490 words, keeping 14727 word types
2021-12-01 11:05:47,195 : INFO : PROGRESS: at sentence #80000, processed 409865 words, keeping 14941 word types
2021-12-01 11:05:47,250 : INFO : PROGRESS: at sentence #90000, processed 440464 words, keeping 15112 wor

2021-12-01 11:05:59,853 : INFO : PROGRESS: at sentence #740000, processed 7409767 words, keeping 15986 word types
2021-12-01 11:05:59,958 : INFO : PROGRESS: at sentence #750000, processed 7467746 words, keeping 15986 word types
2021-12-01 11:06:00,219 : INFO : PROGRESS: at sentence #760000, processed 7619446 words, keeping 15986 word types
2021-12-01 11:06:00,467 : INFO : PROGRESS: at sentence #770000, processed 7754852 words, keeping 15986 word types
2021-12-01 11:06:00,746 : INFO : PROGRESS: at sentence #780000, processed 7909310 words, keeping 15986 word types
2021-12-01 11:06:01,031 : INFO : PROGRESS: at sentence #790000, processed 8070188 words, keeping 15986 word types
2021-12-01 11:06:01,317 : INFO : PROGRESS: at sentence #800000, processed 8231320 words, keeping 15986 word types
2021-12-01 11:06:01,607 : INFO : PROGRESS: at sentence #810000, processed 8393485 words, keeping 15986 word types
2021-12-01 11:06:01,805 : INFO : PROGRESS: at sentence #820000, processed 8502067 words,

2021-12-01 11:06:42,638 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 11:06:42,638 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 11:06:42,763 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:06:42,763 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 11:06:42,767 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:06:42,767 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 11:06:42,893 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:06:42,893 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 11:06:42,924 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:06:42,924 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 11:06:42,946 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:06:42,946 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 11:06:42,998 : DEBUG : worker

2021-12-01 11:07:21,386 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:07:21,387 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 11:07:21,422 : DEBUG : worker exiting, processed 47 jobs
2021-12-01 11:07:21,422 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 11:07:21,427 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:07:21,427 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 11:07:21,429 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:07:21,429 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 11:07:21,438 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:07:21,438 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 11:07:21,439 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:07:21,440 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 11:07:21,475 : DEBUG : worker 

2021-12-01 11:07:59,116 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:07:59,116 : INFO : EPOCH 3 - PROGRESS: at 99.22% examples, 216068 words/s, in_qsize 9, out_qsize 1
2021-12-01 11:07:59,118 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 11:07:59,136 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:07:59,136 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 11:07:59,163 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:07:59,163 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 11:07:59,182 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:07:59,183 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 11:07:59,189 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 11:07:59,189 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 11:07:59,197 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:07:59,197 

2021-12-01 11:08:36,773 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:08:36,774 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 11:08:36,788 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:08:36,789 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 11:08:36,810 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:08:36,811 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 11:08:36,823 : DEBUG : worker exiting, processed 47 jobs
2021-12-01 11:08:36,823 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 11:08:36,847 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:08:36,847 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 11:08:36,880 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:08:36,881 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 11:08:36,882 : INFO : EPOCH - 4 : t

2021-12-01 11:09:14,143 : DEBUG : worker exiting, processed 48 jobs
2021-12-01 11:09:14,143 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 11:09:14,145 : INFO : EPOCH - 5 : training on 9115766 raw words (8182358 effective words) took 37.3s, 219636 effective words/s
2021-12-01 11:09:15,177 : INFO : EPOCH 6 - PROGRESS: at 2.31% examples, 106357 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:09:16,296 : INFO : EPOCH 6 - PROGRESS: at 11.77% examples, 183148 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:09:17,301 : INFO : EPOCH 6 - PROGRESS: at 13.77% examples, 193909 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:09:18,310 : INFO : EPOCH 6 - PROGRESS: at 18.02% examples, 201010 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:09:19,412 : INFO : EPOCH 6 - PROGRESS: at 20.47% examples, 201728 words/s, in_qsize 4, out_qsize 0
2021-12-01 11:09:20,417 : INFO : EPOCH 6 - PROGRESS: at 22.73% examples, 205247 words/s, in_qsize 3, out_qsize 0
2021-12-01 11:09:21,417 : 

2021-12-01 11:09:57,600 : INFO : EPOCH 7 - PROGRESS: at 22.62% examples, 204607 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:09:58,646 : INFO : EPOCH 7 - PROGRESS: at 25.02% examples, 207528 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:09:59,810 : INFO : EPOCH 7 - PROGRESS: at 26.98% examples, 207053 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:10:00,816 : INFO : EPOCH 7 - PROGRESS: at 28.93% examples, 210171 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:01,842 : INFO : EPOCH 7 - PROGRESS: at 30.67% examples, 210984 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:10:02,909 : INFO : EPOCH 7 - PROGRESS: at 32.14% examples, 210309 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:03,960 : INFO : EPOCH 7 - PROGRESS: at 36.48% examples, 215249 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:10:04,976 : INFO : EPOCH 7 - PROGRESS: at 41.85% examples, 220267 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:05,977 : INFO : EPOCH 7 - PROGRESS: at 45.15% examples, 220169 words/s, in_qsiz

2021-12-01 11:10:42,879 : INFO : EPOCH 8 - PROGRESS: at 44.11% examples, 216823 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:43,900 : INFO : EPOCH 8 - PROGRESS: at 46.49% examples, 217114 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:44,919 : INFO : EPOCH 8 - PROGRESS: at 48.49% examples, 216235 words/s, in_qsize 1, out_qsize 1
2021-12-01 11:10:45,922 : INFO : EPOCH 8 - PROGRESS: at 50.93% examples, 217222 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:46,922 : INFO : EPOCH 8 - PROGRESS: at 52.94% examples, 216194 words/s, in_qsize 0, out_qsize 2
2021-12-01 11:10:47,961 : INFO : EPOCH 8 - PROGRESS: at 57.12% examples, 217098 words/s, in_qsize 4, out_qsize 0
2021-12-01 11:10:48,967 : INFO : EPOCH 8 - PROGRESS: at 59.31% examples, 218291 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:10:49,967 : INFO : EPOCH 8 - PROGRESS: at 61.67% examples, 218686 words/s, in_qsize 2, out_qsize 0
2021-12-01 11:10:51,022 : INFO : EPOCH 8 - PROGRESS: at 65.65% examples, 218421 words/s, in_qsiz

2021-12-01 11:11:28,508 : INFO : EPOCH 9 - PROGRESS: at 66.30% examples, 219752 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:11:29,546 : INFO : EPOCH 9 - PROGRESS: at 68.44% examples, 218883 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:11:30,628 : INFO : EPOCH 9 - PROGRESS: at 70.44% examples, 218412 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:11:31,656 : INFO : EPOCH 9 - PROGRESS: at 73.21% examples, 219544 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:11:32,665 : INFO : EPOCH 9 - PROGRESS: at 77.10% examples, 220994 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:11:33,670 : INFO : EPOCH 9 - PROGRESS: at 79.20% examples, 219130 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:11:34,671 : INFO : EPOCH 9 - PROGRESS: at 81.14% examples, 218707 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:11:35,688 : INFO : EPOCH 9 - PROGRESS: at 85.80% examples, 220866 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:11:36,743 : INFO : EPOCH 9 - PROGRESS: at 87.75% examples, 219707 words/s, in_qsiz

2021-12-01 11:12:14,163 : INFO : EPOCH 10 - PROGRESS: at 87.75% examples, 218149 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:12:15,266 : INFO : EPOCH 10 - PROGRESS: at 89.54% examples, 216832 words/s, in_qsize 5, out_qsize 1
2021-12-01 11:12:16,267 : INFO : EPOCH 10 - PROGRESS: at 91.27% examples, 216743 words/s, in_qsize 4, out_qsize 0
2021-12-01 11:12:17,356 : INFO : EPOCH 10 - PROGRESS: at 93.05% examples, 216361 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:12:18,368 : INFO : EPOCH 10 - PROGRESS: at 95.22% examples, 216325 words/s, in_qsize 4, out_qsize 0
2021-12-01 11:12:19,403 : INFO : EPOCH 10 - PROGRESS: at 96.87% examples, 215164 words/s, in_qsize 1, out_qsize 4
2021-12-01 11:12:19,811 : DEBUG : job loop exiting, total 913 jobs
2021-12-01 11:12:19,815 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:12:19,815 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 11:12:19,819 : INFO : worker thread finished; awaiting finish of 18 more threa

2021-12-01 11:12:25,758 : INFO : PROGRESS: at sentence #300000, processed 2961849 words, keeping 15965 word types
2021-12-01 11:12:25,889 : INFO : PROGRESS: at sentence #310000, processed 3027438 words, keeping 15965 word types
2021-12-01 11:12:26,024 : INFO : PROGRESS: at sentence #320000, processed 3095684 words, keeping 15965 word types
2021-12-01 11:12:26,156 : INFO : PROGRESS: at sentence #330000, processed 3162882 words, keeping 15965 word types
2021-12-01 11:12:26,284 : INFO : PROGRESS: at sentence #340000, processed 3227133 words, keeping 15965 word types
2021-12-01 11:12:26,414 : INFO : PROGRESS: at sentence #350000, processed 3293194 words, keeping 15965 word types
2021-12-01 11:12:26,549 : INFO : PROGRESS: at sentence #360000, processed 3360895 words, keeping 15965 word types
2021-12-01 11:12:26,673 : INFO : PROGRESS: at sentence #370000, processed 3424511 words, keeping 15965 word types
2021-12-01 11:12:26,815 : INFO : PROGRESS: at sentence #380000, processed 3495144 words,

2021-12-01 11:12:37,274 : INFO : resetting layer weights
2021-12-01 11:12:37,276 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-12-01T11:12:37.276066', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'build_vocab'}
  _model.init_sims(replace=True)
2021-12-01 11:12:37,280 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/subword_wol_W2V_Skip-gram_100D', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-01T11:12:37.280817', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'saving'}
2021-12-01 11:12:37,281 : INFO : not storing attribute cum_table
2021-12-01 11:12:37,282 : DEBUG : {'uri': './emb_models/subword_wol_W2V_Skip-gram_100D', 'mode': 'wb', 'buffering': -1, 'encod

<gensim.models.word2vec.Word2Vec at 0x7fd3cf1fcdc0>

In [53]:
train_w2v_model(sub_word = True) #w2v sg=1 Skip-gram model with 100 dim  

2021-12-01 11:12:37,334 : INFO : collecting all words and their counts
2021-12-01 11:12:37,337 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Loading Sentences with memory freindly iterator ...

Training Sentence Piece Word2Vec Skip-gram with 100 dimension



2021-12-01 11:12:37,581 : INFO : PROGRESS: at sentence #10000, processed 133237 words, keeping 11491 word types
2021-12-01 11:12:37,697 : INFO : PROGRESS: at sentence #20000, processed 195405 words, keeping 12682 word types
2021-12-01 11:12:37,756 : INFO : PROGRESS: at sentence #30000, processed 231684 words, keeping 13190 word types
2021-12-01 11:12:37,818 : INFO : PROGRESS: at sentence #40000, processed 269413 words, keeping 13591 word types
2021-12-01 11:12:37,878 : INFO : PROGRESS: at sentence #50000, processed 305238 words, keeping 14007 word types
2021-12-01 11:12:37,938 : INFO : PROGRESS: at sentence #60000, processed 342336 words, keeping 14400 word types
2021-12-01 11:12:37,996 : INFO : PROGRESS: at sentence #70000, processed 376490 words, keeping 14727 word types
2021-12-01 11:12:38,053 : INFO : PROGRESS: at sentence #80000, processed 409865 words, keeping 14941 word types
2021-12-01 11:12:38,108 : INFO : PROGRESS: at sentence #90000, processed 440464 words, keeping 15112 wor

2021-12-01 11:12:50,655 : INFO : PROGRESS: at sentence #740000, processed 7409767 words, keeping 15986 word types
2021-12-01 11:12:50,761 : INFO : PROGRESS: at sentence #750000, processed 7467746 words, keeping 15986 word types
2021-12-01 11:12:51,023 : INFO : PROGRESS: at sentence #760000, processed 7619446 words, keeping 15986 word types
2021-12-01 11:12:51,270 : INFO : PROGRESS: at sentence #770000, processed 7754852 words, keeping 15986 word types
2021-12-01 11:12:51,544 : INFO : PROGRESS: at sentence #780000, processed 7909310 words, keeping 15986 word types
2021-12-01 11:12:51,837 : INFO : PROGRESS: at sentence #790000, processed 8070188 words, keeping 15986 word types
2021-12-01 11:12:52,126 : INFO : PROGRESS: at sentence #800000, processed 8231320 words, keeping 15986 word types
2021-12-01 11:12:52,415 : INFO : PROGRESS: at sentence #810000, processed 8393485 words, keeping 15986 word types
2021-12-01 11:12:52,612 : INFO : PROGRESS: at sentence #820000, processed 8502067 words,

2021-12-01 11:13:33,450 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 11:13:33,573 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:13:33,573 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 11:13:33,580 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:13:33,581 : INFO : worker thread finished; awaiting finish of 17 more threads
2021-12-01 11:13:33,634 : DEBUG : worker exiting, processed 43 jobs
2021-12-01 11:13:33,634 : INFO : worker thread finished; awaiting finish of 16 more threads
2021-12-01 11:13:33,645 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:13:33,645 : INFO : worker thread finished; awaiting finish of 15 more threads
2021-12-01 11:13:33,660 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:13:33,660 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 11:13:33,677 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:13:33,677 : INFO : worker 

2021-12-01 11:14:12,173 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:14:12,173 : INFO : worker thread finished; awaiting finish of 14 more threads
2021-12-01 11:14:12,187 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:14:12,187 : INFO : worker thread finished; awaiting finish of 13 more threads
2021-12-01 11:14:12,296 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:14:12,296 : INFO : EPOCH 2 - PROGRESS: at 98.89% examples, 210385 words/s, in_qsize 12, out_qsize 1
2021-12-01 11:14:12,298 : INFO : worker thread finished; awaiting finish of 12 more threads
2021-12-01 11:14:12,306 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:14:12,306 : INFO : worker thread finished; awaiting finish of 11 more threads
2021-12-01 11:14:12,307 : DEBUG : worker exiting, processed 44 jobs
2021-12-01 11:14:12,308 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 11:14:12,311 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:14:1

2021-12-01 11:14:50,054 : INFO : worker thread finished; awaiting finish of 10 more threads
2021-12-01 11:14:50,058 : INFO : worker thread finished; awaiting finish of 9 more threads
2021-12-01 11:14:50,085 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:14:50,085 : INFO : worker thread finished; awaiting finish of 8 more threads
2021-12-01 11:14:50,114 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:14:50,114 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-12-01 11:14:50,123 : DEBUG : worker exiting, processed 47 jobs
2021-12-01 11:14:50,123 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-12-01 11:14:50,128 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:14:50,129 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-12-01 11:14:50,150 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:14:50,150 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 11:14:50,15

2021-12-01 11:15:27,537 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-12-01 11:15:27,540 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:15:27,541 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-12-01 11:15:27,573 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:15:27,573 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-12-01 11:15:27,578 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:15:27,578 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-12-01 11:15:27,589 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:15:27,589 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-12-01 11:15:27,591 : INFO : EPOCH - 4 : training on 9115766 raw words (8182212 effective words) took 37.4s, 218906 effective words/s
2021-12-01 11:15:28,619 : INFO : EPOCH 5 - PROGRESS: at 2.87% examples, 132117 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:15:29,701

2021-12-01 11:16:05,783 : INFO : EPOCH 6 - PROGRESS: at 2.07% examples, 107861 words/s, in_qsize 0, out_qsize 2
2021-12-01 11:16:06,828 : INFO : EPOCH 6 - PROGRESS: at 11.69% examples, 185152 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:07,839 : INFO : EPOCH 6 - PROGRESS: at 14.05% examples, 200945 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:08,947 : INFO : EPOCH 6 - PROGRESS: at 17.95% examples, 197315 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:16:10,007 : INFO : EPOCH 6 - PROGRESS: at 20.55% examples, 203836 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:11,013 : INFO : EPOCH 6 - PROGRESS: at 22.56% examples, 204073 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:16:12,094 : INFO : EPOCH 6 - PROGRESS: at 24.84% examples, 206050 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:13,156 : INFO : EPOCH 6 - PROGRESS: at 26.83% examples, 207170 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:14,175 : INFO : EPOCH 6 - PROGRESS: at 28.71% examples, 209029 words/s, in_qsize

2021-12-01 11:16:51,333 : INFO : EPOCH 7 - PROGRESS: at 28.71% examples, 210391 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:52,336 : INFO : EPOCH 7 - PROGRESS: at 30.49% examples, 211825 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:53,354 : INFO : EPOCH 7 - PROGRESS: at 31.95% examples, 211868 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:54,355 : INFO : EPOCH 7 - PROGRESS: at 35.45% examples, 215314 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:16:55,356 : INFO : EPOCH 7 - PROGRESS: at 40.67% examples, 219984 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:16:56,397 : INFO : EPOCH 7 - PROGRESS: at 44.12% examples, 217496 words/s, in_qsize 2, out_qsize 0
2021-12-01 11:16:57,400 : INFO : EPOCH 7 - PROGRESS: at 46.43% examples, 217441 words/s, in_qsize 2, out_qsize 0
2021-12-01 11:16:58,401 : INFO : EPOCH 7 - PROGRESS: at 48.49% examples, 217309 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:16:59,410 : INFO : EPOCH 7 - PROGRESS: at 50.84% examples, 217636 words/s, in_qsiz

2021-12-01 11:17:35,689 : INFO : EPOCH 8 - PROGRESS: at 48.86% examples, 218453 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:17:36,696 : INFO : EPOCH 8 - PROGRESS: at 51.06% examples, 218221 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:17:37,739 : INFO : EPOCH 8 - PROGRESS: at 53.24% examples, 217601 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:17:38,808 : INFO : EPOCH 8 - PROGRESS: at 57.41% examples, 218113 words/s, in_qsize 2, out_qsize 0
2021-12-01 11:17:39,811 : INFO : EPOCH 8 - PROGRESS: at 59.54% examples, 218431 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:17:40,834 : INFO : EPOCH 8 - PROGRESS: at 62.16% examples, 219391 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:17:41,849 : INFO : EPOCH 8 - PROGRESS: at 66.02% examples, 219871 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:17:42,930 : INFO : EPOCH 8 - PROGRESS: at 68.12% examples, 218939 words/s, in_qsize 1, out_qsize 0
2021-12-01 11:17:43,938 : INFO : EPOCH 8 - PROGRESS: at 70.08% examples, 218413 words/s, in_qsiz

2021-12-01 11:18:21,343 : INFO : EPOCH 9 - PROGRESS: at 70.46% examples, 217866 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:18:22,388 : INFO : EPOCH 9 - PROGRESS: at 72.72% examples, 217136 words/s, in_qsize 0, out_qsize 3
2021-12-01 11:18:23,398 : INFO : EPOCH 9 - PROGRESS: at 77.22% examples, 219675 words/s, in_qsize 0, out_qsize 1
2021-12-01 11:18:24,419 : INFO : EPOCH 9 - PROGRESS: at 79.38% examples, 219015 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:18:25,584 : INFO : EPOCH 9 - PROGRESS: at 81.41% examples, 217678 words/s, in_qsize 4, out_qsize 1
2021-12-01 11:18:26,595 : INFO : EPOCH 9 - PROGRESS: at 86.17% examples, 219622 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:18:27,685 : INFO : EPOCH 9 - PROGRESS: at 87.88% examples, 218263 words/s, in_qsize 4, out_qsize 0
2021-12-01 11:18:28,750 : INFO : EPOCH 9 - PROGRESS: at 89.76% examples, 217468 words/s, in_qsize 4, out_qsize 2
2021-12-01 11:18:29,765 : INFO : EPOCH 9 - PROGRESS: at 91.91% examples, 218870 words/s, in_qsiz

2021-12-01 11:19:06,686 : INFO : EPOCH 10 - PROGRESS: at 90.84% examples, 216851 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:19:07,717 : INFO : EPOCH 10 - PROGRESS: at 92.55% examples, 216565 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:19:08,805 : INFO : EPOCH 10 - PROGRESS: at 94.81% examples, 216559 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:19:09,808 : INFO : EPOCH 10 - PROGRESS: at 96.79% examples, 216565 words/s, in_qsize 0, out_qsize 0
2021-12-01 11:19:10,597 : DEBUG : job loop exiting, total 913 jobs
2021-12-01 11:19:10,607 : DEBUG : worker exiting, processed 45 jobs
2021-12-01 11:19:10,612 : INFO : worker thread finished; awaiting finish of 19 more threads
2021-12-01 11:19:10,694 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:19:10,694 : INFO : worker thread finished; awaiting finish of 18 more threads
2021-12-01 11:19:10,716 : DEBUG : worker exiting, processed 46 jobs
2021-12-01 11:19:10,716 : INFO : worker thread finished; awaiting finish of 17 more threa

2021-12-01 11:19:16,611 : INFO : PROGRESS: at sentence #310000, processed 3027438 words, keeping 15965 word types
2021-12-01 11:19:16,745 : INFO : PROGRESS: at sentence #320000, processed 3095684 words, keeping 15965 word types
2021-12-01 11:19:16,875 : INFO : PROGRESS: at sentence #330000, processed 3162882 words, keeping 15965 word types
2021-12-01 11:19:17,001 : INFO : PROGRESS: at sentence #340000, processed 3227133 words, keeping 15965 word types
2021-12-01 11:19:17,131 : INFO : PROGRESS: at sentence #350000, processed 3293194 words, keeping 15965 word types
2021-12-01 11:19:17,266 : INFO : PROGRESS: at sentence #360000, processed 3360895 words, keeping 15965 word types
2021-12-01 11:19:17,389 : INFO : PROGRESS: at sentence #370000, processed 3424511 words, keeping 15965 word types
2021-12-01 11:19:17,530 : INFO : PROGRESS: at sentence #380000, processed 3495144 words, keeping 15965 word types
2021-12-01 11:19:17,749 : INFO : PROGRESS: at sentence #390000, processed 3611566 words,

  _model.init_sims(replace=True)
2021-12-01 11:19:27,967 : INFO : Word2Vec lifecycle event {'fname_or_handle': './emb_models/subword_wol_W2V_Skip-gram_100D', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-12-01T11:19:27.967244', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'saving'}
2021-12-01 11:19:27,967 : INFO : not storing attribute cum_table
2021-12-01 11:19:27,968 : DEBUG : {'uri': './emb_models/subword_wol_W2V_Skip-gram_100D', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 11:19:28,080 : INFO : saved ./emb_models/subword_wol_W2V_Skip-gram_100D


<gensim.models.word2vec.Word2Vec at 0x7fd2fa5eb4f0>

In [55]:
import gensim

#w2v 
w2v_CBOW_100 = gensim.models.Word2Vec.load("./emb_models/wol_W2V_CBOW_100D")
w2v_SG_100 = gensim.models.Word2Vec.load("./emb_models/wol_W2V_Skip-gram_100D")
w2v_CBOW_200 = gensim.models.Word2Vec.load("./emb_models/wol_W2V_CBOW_200D")
w2v_SG_200 = gensim.models.Word2Vec.load("./emb_models/wol_W2V_Skip-gram_200D")

#FastText
ft_CBOW_100 = gensim.models.Word2Vec.load("./emb_models/wol_FastText_CBOW_100D")
ft_SG_100 = gensim.models.Word2Vec.load("./emb_models/wol_FastText_Skip-gram_100D")
ft_CBOW_200 = gensim.models.Word2Vec.load("./emb_models/wol_FastText_CBOW_200D")
ft_SG_200 = gensim.models.Word2Vec.load("./emb_models/wol_FastText_Skip-gram_200D")


#w2v with subword
sw_w2v_CBOW_100 = gensim.models.Word2Vec.load("./emb_models/subword_wol_W2V_CBOW_100D")
sw_w2v_SG_100 = gensim.models.Word2Vec.load("./emb_models/subword_wol_W2V_Skip-gram_100D")
sw_w2v_CBOW_200 = gensim.models.Word2Vec.load("./emb_models/subword_wol_W2V_CBOW_200D")
sw_w2v_SG_200 = gensim.models.Word2Vec.load("./emb_models/subword_wol_W2V_Skip-gram_200D")





2021-12-01 11:27:35,855 : INFO : loading Word2Vec object from ./emb_models/wol_W2V_CBOW_100D
2021-12-01 11:27:35,856 : DEBUG : {'uri': './emb_models/wol_W2V_CBOW_100D', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 11:27:35,910 : INFO : loading wv recursively from ./emb_models/wol_W2V_CBOW_100D.wv.* with mmap=None
2021-12-01 11:27:35,911 : INFO : loading vectors from ./emb_models/wol_W2V_CBOW_100D.wv.vectors.npy with mmap=None
2021-12-01 11:27:35,929 : INFO : loading syn1neg from ./emb_models/wol_W2V_CBOW_100D.syn1neg.npy with mmap=None
2021-12-01 11:27:35,948 : INFO : setting ignored attribute cum_table to None
2021-12-01 11:27:37,422 : INFO : Word2Vec lifecycle event {'fname': './emb_models/wol_W2V_CBOW_100D', 'datetime': '2021-12-01T11:27:37.422304', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.1

2021-12-01 11:28:25,192 : INFO : FastText lifecycle event {'fname': './emb_models/wol_FastText_CBOW_200D', 'datetime': '2021-12-01T11:28:25.192272', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.0.17-200.fc29.x86_64-x86_64-with-glibc2.10', 'event': 'loaded'}
2021-12-01 11:28:25,193 : INFO : loading Word2Vec object from ./emb_models/wol_FastText_Skip-gram_200D
2021-12-01 11:28:25,193 : DEBUG : {'uri': './emb_models/wol_FastText_Skip-gram_200D', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2021-12-01 11:28:25,241 : INFO : loading wv recursively from ./emb_models/wol_FastText_Skip-gram_200D.wv.* with mmap=None
2021-12-01 11:28:25,242 : INFO : loading vectors_vocab from ./emb_models/wol_FastText_Skip-gram_200D.wv.vectors_vocab.npy with mmap=None
2021-12-01 11:28:25,344 : INFO : loading vectors_ngrams from ./emb_mode

In [62]:
w2v_CBOW_100.wv.most_similar("asaa")  # 

[('naagana', 0.6747323274612427),
 ('takkanawu', 0.5704615712165833),
 ('seera', 0.5234895944595337),
 ('xoossainne', 0.5154280662536621),
 ('woosata', 0.5134694576263428),
 ('hegaadan', 0.5133533477783203),
 ('ba', 0.5102196931838989),
 ('ezggiiddi', 0.5014573335647583),
 ('dabaabai', 0.48715874552726746),
 ('intteyyo', 0.4784954786300659)]

In [63]:
w2v_SG_100.wv.most_similar("asaa")  # 

[('naagana', 0.8576710224151611),
 ('butti', 0.669225811958313),
 ('xeesidaagaa', 0.6680731773376465),
 ('waissiyaageetuyyookka', 0.6626163125038147),
 ('addaamedan', 0.6619529724121094),
 ('takkanawu', 0.6604024171829224),
 ('yeggiyai', 0.6585956811904907),
 ('qarettabaakka', 0.6578243970870972),
 ('qoncciyonne', 0.6572359800338745),
 ('aridi', 0.6541458368301392)]

In [64]:
w2v_CBOW_200.wv.most_similar("asaa")

[('naagana', 0.6281840801239014),
 ('woosata', 0.5079213380813599),
 ('takkanawu', 0.49026358127593994),
 ('intteyyo', 0.48888808488845825),
 ('koyro', 0.47974514961242676),
 ('ba', 0.4761030673980713),
 ('immees', 0.4622330665588379),
 ('gidiyoogaa', 0.45942002534866333),
 ('medhidobawu', 0.45883095264434814),
 ('xoossainne', 0.453715980052948)]

In [65]:
w2v_SG_200.wv.most_similar("asaa")

[('naagana', 0.6656869649887085),
 ('yeggiyai', 0.58135986328125),
 ('addaamedan', 0.5732384324073792),
 ('hanaaniyaayyo', 0.569291353225708),
 ("dom''ettiyoogee", 0.5651791095733643),
 ('geeddarettana', 0.5644726753234863),
 ('doommettees', 0.5621622800827026),
 ('xeesidaagaa', 0.5612590909004211),
 ('beettaadan', 0.5582201480865479),
 ('yaratuyyoonne', 0.5569989085197449)]

In [123]:
data = open("./wol-WordSim-100.txt", "r", encoding="utf8")
for d in data:
    words = d.split()
    #print(words[0], words[1], w2v_CBOW_100.wv.similarity(words[0], words[1]))

In [124]:
# Cosine Similarity for Vector Space Models
from sklearn.metrics.pairwise import cosine_similarity
def get_cosine_similarity(feature_vec_1, feature_vec_2):    
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]

In [128]:
list1 = []
list2 = []
data = open("./wol-WordSim-100.txt", "r", encoding="utf8")
for d in data:
    words = d.split()
    a = ft_SG_200.wv[words[0]]
    #print(a)
    b = ft_SG_200.wv[words[1]]
    #print(b)
    list1.append(words[2])
    c = get_cosine_similarity(a, b)
    list2.append(c)
    #print (words[0], words[1], words[2], c)
#print("list1", list1)
#print("list2", list2)
#W2V_CBOW.wv.similar_by_vector("oduwa")  # 


In [129]:
from scipy import stats
stats.spearmanr(list1, list2)

SpearmanrResult(correlation=0.24860916798967522, pvalue=0.01881227511533736)

# Spearman Correlation Coefficient
Spearman rank correlation: 
Spearman rank correlation is a non-parametric test 
that is used to measure the degree of association between two variables. 
Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. 
Correlations of -1 or +1 imply an exact monotonic relationship. 
Positive correlations imply that as x increases, so does y. 
Negative correlations imply that as x increases, y decreases.

In [130]:
# calculate the spearman's correlation between two variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr
# calculate spearman's correlation
coef, p = spearmanr(list1, list2)
print('Spearmans correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.5
if p > alpha:
    print('Samples are uncorrelated')
else:
    print('Samples are correlated')

Spearmans correlation coefficient: 0.249
Samples are correlated


In [None]:
W2V_SG.build_vocab(sentences=common_texts)

# Result comparison



| Model Type |embedding Dimension | Sperasman Correlation | pvalue | Status |
| --- | --- | --- |--- |--- |
| word2vec_CBOW  | 100 | 0.022866288573785833 | 0.831562224446891 |uncorrelated |
| --- | --- | --- |--- |
| word2vec_SG | 100 | 0.03671716404271716 | 0.7326462858242914 | uncorrelated |
| --- | --- | --- |--- |--- |
| word2vec_CBOW | 200 | -0.029617206979970556 | 0.7829177288456539|uncorrelated |
| --- | --- | --- |--- |
| word2vec_SG | 200 | 0.049912527888349344 | 0.6422845369597203 | uncorrelated |
| --- | --- | --- |--- |--- |
| fasttext_CBOW | 100 | 0.1407562231120532 | 0.18828129431121884 |correlated |
| --- | --- | --- |--- |--- |
| fasttext_SG | 100 | 0.09833525663283697 | 0.35924580321854915 |correlated |
| --- | --- | --- |--- |--- |
| fasttext_CBOW | 200 | 0.09833525663283697 | 0.35924580321854915 |correlated |
| --- | --- | --- |--- |--- |
| fasttext_SG | 200 | 0.24860916798967522 | 0.01881227511533736 |correlated |
| --- | --- | --- |--- |--- |

