# Word2Vec model training


**Description**

The goal of this notebook is to train and evaluate word2vec models with different hyperparameter settings using the gensim library. The task is to identify differences between the models' evaluation results and comment on the findings. The dataset is a corpus of English news articles extracted from http://www.setimes.com, originally consisting of 100,000 sentences.


In [1]:
pip install --upgrade gensim


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

Collecting gensim
  Downloading gensim-4.2.0-cp39-cp39-win_amd64.whl (23.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.2.0


In [4]:
import gensim

**1. Pre-Processing**

In [5]:
import re
import pandas as pd

def read_file (filepath):
    with open (filepath, 'r', encoding='utf8', errors='ignore') as reader:
        lines = [line for line in reader]
        return lines

def cleaning(lines):
    cleaned_sentences = []
    for text in lines:
        text = re.sub(r'[^\w\s]', '', text)  # remove punctuation and other special characters
        text = re.sub(r'(\w\s)\1{2,}', '', text)  # remove a "character + whitespace" pair repeated three or more times in a row (e.g. "a a a ")
        text = re.sub(r'[0-9]+', '', text)  # remove digits
        text = text.lower().rstrip()  # lowercase and strip trailing whitespace (incl. the newline)
        words = text.split(' ')
        if text != '' and len(words) > 2:  # keep only sentences with more than two space-separated tokens
            cleaned_sentences.append(text)
    return cleaned_sentences

def tokenize (list_of_sentences):
    tokenized_sent = []
    for sent in list_of_sentences:
        tokenized_sent.append(sent.split())
    return tokenized_sent

def drop_duplicates (cleaned_sentences):
    df_clean = pd.DataFrame({'sentences': cleaned_sentences})
    df_clean = df_clean.dropna().drop_duplicates()
    return df_clean['sentences'].to_list()
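
To make the effect of these steps concrete, the sketch below applies the same regular expressions as `cleaning` to a single made-up line. The sample sentence and the helper name `clean_line` are illustrative assumptions; they are not part of the dataset or of the notebook's pipeline.

```python
import re

def clean_line(text):
    """Apply the same four steps as cleaning() to one line of text."""
    text = re.sub(r'[^\w\s]', '', text)       # drop punctuation
    text = re.sub(r'(\w\s)\1{2,}', '', text)  # drop "x x x "-style repetitions
    text = re.sub(r'[0-9]+', '', text)        # drop digits
    return text.lower().rstrip()              # lowercase, strip trailing whitespace

# Hypothetical input line, made up for illustration.
sample = "The EU's 27 members met on 12 May, 2008.\n"

cleaned = clean_line(sample)
tokens = cleaned.split()            # what tokenize() would produce
space_tokens = cleaned.split(' ')   # what the length check in cleaning() counts

print(tokens)
print(len(tokens), len(space_tokens))
```

Removing digits leaves double spaces behind, so `split(' ')` also counts empty strings. The length check in `cleaning` is therefore slightly more lenient than a count of actual words would be.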

**Load Dataset**

In [6]:
# Load dataset
dataset = read_file("setimes_english.txt")
# Apply pre-processing
sentences = tokenize(drop_duplicates(cleaning(dataset)))
print (sentences[1:3])

[['croatia', 'farming', 'and', 'agriculture', 'sectors', 'could', 'reap', 'benefits', 'from', 'eu', 'membership'], ['on', 'friday', 'the', 'eu', 'confirmed', 'its', 'commitment', 'to', 'admitting', 'bulgaria', 'and', 'romania', 'together', 'in', 'although', 'the', 'two', 'countries', 'are', 'at', 'different', 'stages', 'of', 'their', 'accession', 'preparations']]


**2. Set parameters**

In [54]:
# model 1 
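model_1 = gensim.models.Word2Vec (
    vector_size=300,   # dimensionality of the word vectors
    window=10,         # maximum distance between the current and the predicted word
    min_count=10,      # ignore words with a total frequency lower than this
    workers=10,        # number of worker threads
    sg=0,              # training algorithm: 0 = CBOW, 1 = Skip-gram
    hs=0,              # 0 = no hierarchical softmax (negative sampling is used instead)
    negative=5         # number of "noise words" drawn for negative sampling
)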

In [55]:
# model 2
model_2 = gensim.models.Word2Vec (
    vector_size=300,   
    window=10,   
    min_count=10, 
    workers=10,  
    sg=1,        
    hs=0,
    negative=5
)

In [56]:
# model 3
model_3 = gensim.models.Word2Vec (
    vector_size = 100,   
    window=2,   
    min_count=2, 
    workers=10,  
    sg=0,        
    hs=0,
    negative=5
)

In [57]:
# model 4
model_4 = gensim.models.Word2Vec (
    vector_size=100,   
    window=2,   
    min_count=2, 
    workers=10,  
    sg=1,        
    hs=0,
    negative=5
)

**Parameters' selection criteria**

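Two pairs of hyperparameter sets were defined in order to spot any significant differences between the two architectures (CBOW and Skip-gram). Within each pair only the architecture (`sg`) differs; the two pairs differ from each other in `vector_size`, `window` and `min_count`.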

Pair 1: Model 1 (CBOW) & Model 2 (Skip-gram)

Pair 2: Model 3 (CBOW) & Model 4 (Skip-gram)
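
As a quick sanity check of this design, the sketch below restates the four configurations as plain dictionaries and lists which hyperparameters differ between two models. The values are copied from the model definitions above; the helper `differing_keys` is a hypothetical name introduced only for this illustration.

```python
# Hyperparameters copied from the four model definitions (workers omitted: identical everywhere).
configs = {
    "model_1": {"vector_size": 300, "window": 10, "min_count": 10, "sg": 0, "hs": 0, "negative": 5},
    "model_2": {"vector_size": 300, "window": 10, "min_count": 10, "sg": 1, "hs": 0, "negative": 5},
    "model_3": {"vector_size": 100, "window": 2, "min_count": 2, "sg": 0, "hs": 0, "negative": 5},
    "model_4": {"vector_size": 100, "window": 2, "min_count": 2, "sg": 1, "hs": 0, "negative": 5},
}

def differing_keys(a, b):
    """Return the sorted names of the hyperparameters whose values differ."""
    return sorted(key for key in a if a[key] != b[key])

within_pair_1 = differing_keys(configs["model_1"], configs["model_2"])
within_pair_2 = differing_keys(configs["model_3"], configs["model_4"])
across_pairs = differing_keys(configs["model_1"], configs["model_3"])

print(within_pair_1)  # only the architecture flag
print(within_pair_2)  # only the architecture flag
print(across_pairs)   # everything that separates Pair 1 from Pair 2
```

Because three hyperparameters change at once between the pairs, differences between Pair 1 and Pair 2 cannot be attributed to any single one of them.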

**3. Train models & save to disk**

In [58]:
import os

os.makedirs("models", exist_ok=True)  # make sure the output directory exists

for i, model in enumerate([model_1, model_2, model_3, model_4], start=1):

  model.build_vocab (sentences, progress_per = 1000)

  print (f'model_{i} is training...')

  model.train(sentences, total_examples = model.corpus_count, epochs = model.epochs)

  model.save (f"models/model_{i}.model")

  print (f'model_{i} trained and saved successfully')


model_1 is training...
model_1 trained and saved successfully
model_2 is training...
model_2 trained and saved successfully
model_3 is training...
model_3 trained and saved successfully
model_4 is training...
model_4 trained and saved successfully


**4. Models' Evaluation**

**Load models**

In [59]:

model_1 = gensim.models.Word2Vec.load("models/model_1.model")
model_2 = gensim.models.Word2Vec.load("models/model_2.model")
model_3 = gensim.models.Word2Vec.load("models/model_3.model")
model_4 = gensim.models.Word2Vec.load("models/model_4.model")

**Similar words to a given one**

In [60]:

i = 1

for model in [model_1, model_2, model_3, model_4]:
  
  print (f'\nmodel_{i}:\n')

  words = ['important', 'police', 'good']
  for word in words:
    
    print (f'words similar to {word}:\n', model.wv.most_similar(word))
  i += 1



model_1:

words similar to important:
 [('essential', 0.686482310295105), ('obstacle', 0.6447576880455017), ('successful', 0.6429343819618225), ('crucial', 0.6400361061096191), ('ideal', 0.6318754553794861), ('priority', 0.6281493902206421), ('difficult', 0.6200161576271057), ('factor', 0.6175982356071472), ('goal', 0.6050266027450562), ('importance', 0.6047756671905518)]
words similar to police:
 [('officers', 0.7701618671417236), ('kfor', 0.7123222351074219), ('prosecutors', 0.6956273913383484), ('eulex', 0.6943183541297913), ('soldiers', 0.6774521470069885), ('courts', 0.672831654548645), ('kps', 0.6711850762367249), ('army', 0.6609285473823547), ('peacekeepers', 0.6510785222053528), ('force', 0.6365142464637756)]
words similar to good:
 [('normal', 0.7690985798835754), ('very', 0.7430737614631653), ('great', 0.7209277749061584), ('always', 0.6905218362808228), ('bad', 0.6854774355888367), ('absolutely', 0.6430114507675171), ('hard', 0.634160041809082), ('really', 0.626529216766357

**Similarity between pairs of words**

In [61]:
i = 1

for model in [model_1, model_2, model_3, model_4]:
  print (f'\nmodel_{i}:\n')
  similarity = model.wv.similarity (w1='girl',w2='boy')
  diversity = model.wv.similarity (w1='country',w2='rock')  
  print (f'\n similarity between words "girl" and "boy":\t{similarity*100}%')
  print (f'\n similarity between words "country" and "rock":\t{diversity*100}%')

  i+=1



model_1:


 similarity between words "girl" and "boy":	85.75131893157959%

 similarity between words "country" and "rock":	-13.031578063964844%

model_2:


 similarity between words "girl" and "boy":	88.66469264030457%

 similarity between words "country" and "rock":	13.84054720401764%

model_3:


 similarity between words "girl" and "boy":	92.05288290977478%

 similarity between words "country" and "rock":	0.5180047824978828%

model_4:


 similarity between words "girl" and "boy":	89.61658477783203%

 similarity between words "country" and "rock":	16.82732403278351%
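
The percentages above are cosine similarities multiplied by 100. Cosine similarity ranges from -1 to 1, which is why a negative value such as the one model_1 reports for "country" and "rock" is possible. The sketch below computes it by hand on made-up two-dimensional vectors (the vectors are assumptions for illustration, not taken from the trained models).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen to hit the three characteristic cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

A value near zero therefore indicates unrelated words, and a clearly negative value indicates vectors pointing in roughly opposite directions.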


**Words that do not match**

In [62]:
i = 1
for model in [model_1, model_2, model_3, model_4]:
  word = model.wv.doesnt_match("breakfast paris lunch dinner".split())
  print (f'the irrelevant word in group [breakfast paris lunch dinner] is {word} (model_{i})')

  i+=1

the irrelevant word in group [breakfast paris lunch dinner] is paris (model_1)
the irrelevant word in group [breakfast paris lunch dinner] is paris (model_2)
the irrelevant word in group [breakfast paris lunch dinner] is paris (model_3)
the irrelevant word in group [breakfast paris lunch dinner] is paris (model_4)
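
The idea behind this kind of test can be sketched as follows: normalise every word vector, average them, and report the word that is least similar to that average. This is a simplified pure-Python illustration with made-up vectors; the function name `odd_one_out` is hypothetical, and gensim's actual `doesnt_match` implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Toy 2-d vectors: the three meals point one way, the city another.
toy_vectors = {
    "breakfast": [1.0, 0.1],
    "paris": [0.1, 1.0],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.2],
}

result = odd_one_out(toy_vectors)
print(result)
```

With three closely related words and one clear outlier the task is easy, which may explain why all four models agree on it.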


**Word analogies**

In [63]:
i = 1
for model in [model_1, model_2, model_3, model_4]:
  
  # n_similarity returns the cosine similarity between the mean vectors of the two word sets
  similarity = model.wv.n_similarity(['paris', 'france'], ['berlin', 'germany'])
  
  print(f"analogy paris:france;berlin:germany is {similarity*100}% true (model_{i})")

  i += 1

analogy paris:france;berlin:germany is 94.58596706390381% true (model_1)
analogy paris:france;berlin:germany is 80.85355758666992% true (model_2)
analogy paris:france;berlin:germany is 96.04023694992065% true (model_3)
analogy paris:france;berlin:germany is 89.81183767318726% true (model_4)
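
A caveat on reading these numbers: `n_similarity` compares the mean vectors of the two word sets, so a high score shows that the sets are similar overall, not that the relation paris:france equals berlin:germany. Analogies are more commonly tested with the vector-offset method, which gensim exposes through `most_similar(positive=[...], negative=[...])`. The pure-Python sketch below contrasts the two ideas on made-up three-dimensional vectors. All vectors and helper names are assumptions for illustration, and the offset computation is a simplified version of the method.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(words, vectors):
    dims = len(vectors[words[0]])
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word sets."""
    return cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d (vector-offset method)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors: axis 0 = "France", axis 1 = "Germany", axis 2 = "is a capital".
toy = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 1.0],
    "lunch":   [0.5, 0.5, -1.0],
}

matched = set_similarity(["paris", "france"], ["berlin", "germany"], toy)
scrambled = set_similarity(["paris", "germany"], ["berlin", "france"], toy)
answer = solve_analogy("france", "paris", "germany", toy)

print(round(matched, 3), round(scrambled, 3), answer)
```

In this toy setting the scrambled pairing scores higher than the correct one under the mean-vector measure, while the offset method still recovers the expected capital.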


**Comments**

Based on the selected evaluation tests, we can make the following observations:

**1. A smaller window size and a lower minimum word-frequency threshold (`min_count`) led to better results in most tests (Pair 2, `window=2`, `min_count=2` > Pair 1, `window=10`, `min_count=10`).**

Although a **large window size** means the training is less constrained by syntax, because a wider margin of context words is included, it also makes it highly likely that many words that are **semantically irrelevant to the given word are taken into account by the model** (e.g. model 2 gave the word "neighbourly" as similar to "good"). Moreover, a **high minimum-frequency threshold** (`min_count`), which drops every word that occurs fewer times than the threshold from the training vocabulary, **leads to the exclusion of less frequent words** that may nonetheless be more semantically related to the given word (e.g. model 1 ranked "very" as more similar to "good" than "great" or "bad", which was not the case with model 3). Note that the two pairs also differ in `vector_size` (300 vs. 100), so these effects cannot be attributed to `window` and `min_count` alone.
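
The effect of the frequency threshold on the vocabulary can be illustrated with a few lines of plain Python. This is a minimal sketch on a made-up toy corpus; `build_vocabulary` is a hypothetical helper that only mimics the rule "keep a word if it occurs at least `min_count` times", not gensim's full vocabulary-building logic.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences, min_count):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}

# Toy corpus, made up for illustration.
toy_corpus = [
    ["the", "police", "arrested", "the", "suspect"],
    ["the", "police", "questioned", "witnesses"],
    ["police", "officers", "left"],
]

vocab_low = build_vocabulary(toy_corpus, min_count=1)
vocab_high = build_vocabulary(toy_corpus, min_count=3)

print(len(vocab_low), sorted(vocab_high))
```

With the higher threshold, words such as "officers" disappear from the vocabulary and can never be returned as neighbours of "police", however related they are.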

**2. CBOW architecture performed better with more frequent words than skip-gram but worse with rarer words or word connections.**

In the first test, the CBOW models' predictions for the word "important" were slightly worse than those of the Skip-gram models (e.g. CBOW: essential, obstacle, difficult, sensitive / Skip-gram: crucial, sensitive, vital, essential), in contrast with the more frequent word "good", where the CBOW models' predictions were better (e.g. CBOW: bad, normal, very, great / Skip-gram: neighbourly, bad, excellent, normal). In the fourth test (word analogy), where the word pairs relate capitals to countries, the CBOW models scored higher than the Skip-gram models, plausibly due to the high frequency of these words in the news-domain training dataset.

The above observations can be explained by the nature of the two architectures. **CBOW learns to predict a word from its context**, so a rare word gets much less of the model's attention, because the model is designed to predict the most probable word. **The skip-gram model, on the other hand, is designed to predict the context from a word**, which gives more attention to the given word regardless of its rarity.
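
The difference between the two objectives can be made concrete by listing the training examples each one derives from a single sentence. The sketch below is a simplified illustration: the helper names are hypothetical, and real implementations add details such as negative sampling that are left out here.

```python
def context_of(tokens, index, window):
    """Words within `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: one example per position, (all context words) -> target word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """Skip-gram: one example per (centre word, context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_of(tokens, i, window)]

sentence = ["the", "eu", "confirmed", "its", "commitment"]

cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])        # context words used to predict "eu"
print(skipgram[1:3])  # the pairs generated with "eu" as the centre word
print(len(cbow), len(skipgram))
```

Each occurrence of a word yields several skip-gram examples in which that word is the input, but only one CBOW example in which it is the target. This is consistent with the observation that skip-gram gives rare words more attention.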