# Semantically related words using embeddings

In this notebook, word embeddings using word2vec or GloVe on 1 Nillion corpus, is utilized to arrive at words with similar meanings.

**Gensim word2vec APIs**: https://radimrehurek.com/gensim/models/word2vec.html

**1 Billion corpus**: http://www.statmt.org/lm-benchmark/

**Evaluation**

The evaluation is done using the following human scored word pair datasets:  
- SimLex-999: (999 word pairs)  https://fh295.github.io//simlex.html    
- WordSim-353: (133+200 word pairs)  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
- MEN: 3000 (word pairs)  https://staff.fnwi.uva.nl/e.bruni/MEN 

SimLex dataset is particularly challenging due to its differentiation between semantic similarity and semantic relatedness.

In [1]:
import gensim                     # implements word2vec model infrastructure and provides interfacing APIs 
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import get_tmpfile, datapath
from gensim.models.word2vec import *
import pandas as pd
import numpy as np
import scipy.stats
import codecs
from tqdm import tqdm_notebook
import warnings
warnings.filterwarnings('ignore')



In [2]:
models = ['word2vec']                                # list of models to eval: 'word2vec' 
                                                     # and 'glove' (trainng not implemented yet)

path_model       = '../pretrained'
path_corpus      = '../datasets/1-billion-word-benchmark/training-monolingual.tokenized.shuffled'
path_corpus_proc = '../datasets/1-billion-word-benchmark/training-monolingual.tokenized.shuffled_proc'
path_results     = './results'

name_model_w2v   = 'model_gensim_word2vec'
name_Slex        = 'w2v_scoring_Slex.csv'
name_Wsim        = 'w2v_scoring_Wsim.csv'
name_Mens        = 'w2v_scoring_Mens.csv'
name_Spmr        = 'w2v_scoring_Spmr.csv'

file_prefix = 'news.en-'
file_suffix = '-of-00100'
file_indices = range(1,100)

# generate filenames
corpus_files = list()
for i in file_indices:
    corpus_files.append('{}{:05d}{}'.format(file_prefix, i, file_suffix))

In [3]:
# check for text decoding errors in a single file

def decode_check(file_path, verbose=0x00):
    dec_errors = 0
    line_cnt = 0
    line = 1
        
    with codecs.open(file_path, mode='rb') as file:
        buff = 1
        while(buff):
            try:
                buff = file.read()
                line = buff.decode('utf-8', 'strict').encode('utf-8', 'strict')
            except:
                dec_errors += 1
            line_cnt += 1
            
    if verbose & 0x01:
        print("{:25s} --> {:20s}: {:10d}, {:20s}: {:10d}"
              .format(file_path[-20:], "Total line count", line_cnt, "Line decoding errors", dec_errors))
    
    return line_cnt, dec_errors

In [4]:
# iterate over all corpus files and check for any decoding errors

def decode_check_corpus(path_corpus_, corpus_files, idx_files, verbosity=0x2):
    total_line_cnt = 0
    total_dec_errors = 0
    for i, file in zip(tqdm_notebook(file_indices), corpus_files):
        lineCnt, errorCnt = decode_check(str(path_corpus_ + '/' + file), verbose=verbosity)
        total_line_cnt += lineCnt
        total_dec_errors += errorCnt
    
    if verbosity & 0x2:
        print("{:20s}: {}".format("\nTotal line count", total_line_cnt))
        print("{:20s}: {}".format("Line decoding errors", total_dec_errors))
    
    return total_line_cnt, total_dec_errors    

In [6]:
total_lines, total_err = decode_check_corpus(
    path_corpus, corpus_files, file_indices, verbosity=0x2)  # 0x3 for per-file status



Total line count   : 198
Line decoding errors: 0


In [7]:
# inpsect and remove decoding errors from the corpus
# the whole sentence needs to be removed, otherwise wrong context 

input_corpus = None

if total_err != 0:
    print('\nPre-processing corpus...')
   
    errorCnt = 0
    for i, file in zip(tqdm_notebook(file_indices), corpus_files):
        new_file = open(str(path_corpus_proc + '/' + file), 'w')
 
        #with open(str(path_corpus + '/' + file), mode='r') as file:
        #    line = 1
        #    while(line):
        #        try:
        #            line = file.readline()
        #            print(line, file=new_file, end='')
        #        except:
        #            errorCnt += 1
        
        with codecs.open(str(path_corpus + '/' + file), mode='rb') as file:
            buff = 1
            while(buff):
                try:
                    buff = file.read()
                    line = buff.decode('utf-8', 'strict').encode('utf-8', 'strict')
                    new_file.write(line)
                except:
                    errorCnt += 1
            
        new_file.close()
    
    print("Dropped {} lines with decoding errors.".format(errorCnt))
    print('Pre-processing done.')

    lineCnt, errorCnt = decode_check_corpus(
        path_corpus_proc, corpus_files, file_indices, verbosity=0x02)
    
    assert errorCnt == 0
    input_corpus = path_corpus_proc

else:
    input_corpus = path_corpus

Approaches to train the word2vec model with, and the tradeoffs, as per *Mikolov*:

**Skip-gram**: works well with small amount of the training data, represents well even rare words or phrases.  
**CBOW**: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

In [8]:
# train models
TRAIN_MODEL = False

if 'word2vec' in models:
    # initialize the model
    
    if TRAIN_MODEL == True:
        print("Training word2vec model...")
        input_sents = PathLineSentences(input_corpus)
        w2v = Word2Vec(input_sents, size=300, window=5, min_count=1, workers=16, sg=0)
        print("Finished training.")
    else:
        w2v = Word2Vec.load(path_model + '/' + name_model_w2v) 
        print('Loaded saved model from disk.')
        
if 'glove' in models:
    # load pre-trained GloVe model
    glove_vectors = '../pretrained/glove.twitter.27B.200d.txt'
    tmp_file = get_tmpfile("test_word2vec.txt")
    
    glove2word2vec(glove_input_file=glove_vectors, word2vec_output_file=tmp_file)
    glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file)
    print("Loaded GloVe model.")

Loaded saved model from disk.


In [9]:
# save trained model
if TRAIN_MODEL == True:
    w2v.save(path_model + '/' + name_model_w2v)
    print('Saved trained word2vec model to disk.')

In [10]:
w2v_embed = w2v.wv                # assigns embedding matrix

In [11]:
# similarity 
pair1 = ['coffee','cup']
pair2 = ['coffee','tea']

if 'word2vec' in models:
    cos_dist1_w = w2v_embed.similarity(pair1[0], pair1[1])
    cos_dist2_w = w2v_embed.similarity(pair2[0], pair2[1])
    print('word2vec cosine similarity of {}: {}'.format(pair1, cos_dist1_w) )
    print('word2vec cosine similarity of {}: {}'.format(pair2, cos_dist2_w) )

if 'glove' in models:
    cos_dist1_g = glove.similarity(pair1[0], pair1[1])
    cos_dist2_g = glove.similarity(pair2[0], pair2[1])
    print('\nGloVe cosine similarity of {}: {}'.format(pair1, cos_dist1_g) )
    print('GloVe cosine similarity of {}: {}'.format(pair2, cos_dist2_g) )

word2vec cosine similarity of ['coffee', 'cup']: 0.4078576862812042
word2vec cosine similarity of ['coffee', 'tea']: 0.7131128907203674


The problem above is that similarity doesn't always translate to synonyms - the target word 'minor' is closer to 'major' than to 'small'.

In [12]:
# vector representation of the word

if 'word2vec' in models:
    vec_pair1_0_w = w2v_embed.get_vector(pair1[0])
    print("word2vec Vector embedding dimension: ",vec_pair1_0_w.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_w[1:20])

if 'glove' in models:
    vec_pair1_0_g = glove.get_vector(pair1[0])
    print("\nGloVe vector embedding dimension: ",vec_pair1_0_g.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_g[1:20])

word2vec Vector embedding dimension:  (300,)

Printing a subset of the whole vector for the word 'coffee':
[ 0.8398364  -1.1771388   1.2766795  -0.71031374 -0.5955065  -1.0504907
  1.0739919  -0.41915712 -2.3491497   0.7334186   0.40302852  1.115876
 -0.0078924   0.01063586 -1.1075275   0.9948525  -0.1625037  -0.20667532
  1.984322  ]


In [13]:
# most similar words - by word
n_similar = 15
thisWord = 'coffee'

if 'word2vec' in models:
    print("Most similar {} words (by word) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v.similar_by_word(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by word) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_word(thisWord, n_similar))

Most similar 15 words (by word) for 'coffee' by word2vec model:


[('cappuccino', 0.7374070286750793),
 ('cappuccinos', 0.7249612808227539),
 ('espresso', 0.7171907424926758),
 ('tea', 0.7131129503250122),
 ('coffees', 0.7073582410812378),
 ('latte', 0.7022086381912231),
 ('beer', 0.6977545022964478),
 ('chai', 0.6865086555480957),
 ('lattes', 0.6777160167694092),
 ('chocolate', 0.6607286334037781),
 ('croissants', 0.6590141654014587),
 ('gelato', 0.657051682472229),
 ('decaf', 0.6520746946334839),
 ('smoothie', 0.6502355337142944),
 ('pastries', 0.6489415168762207)]

There are some odd words in the list of candidates as 'tea' and 'noodle', which ideally should be screened out, based on context.

In [14]:
# most similar words - by vector

if 'word2vec' in models:
    print("Most similar {} words (by vector) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v.similar_by_vector(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by vector) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_vector(thisWord, n_similar))

Most similar 15 words (by vector) for 'coffee' by word2vec model:


[('cappuccino', 0.7374070286750793),
 ('cappuccinos', 0.7249612808227539),
 ('espresso', 0.7171907424926758),
 ('tea', 0.7131129503250122),
 ('coffees', 0.7073582410812378),
 ('latte', 0.7022086381912231),
 ('beer', 0.6977545022964478),
 ('chai', 0.6865086555480957),
 ('lattes', 0.6777160167694092),
 ('chocolate', 0.6607286334037781),
 ('croissants', 0.6590141654014587),
 ('gelato', 0.657051682472229),
 ('decaf', 0.6520746946334839),
 ('smoothie', 0.6502355337142944),
 ('pastries', 0.6489415168762207)]

In this case, the candidates and their order is exactly the same as using the similarity() metric.

## Evaluation:

In [15]:
# load evaluation datsets
path_evalsetSlex = '..\\datasets\\SimLex-999\\SimLex-999.txt'
path_evalsetWsim = '..\\datasets\\WordSim-353\\combined.csv'
path_evalsetMens = '..\\datasets\\MEN\\MEN_dataset_natural_form_full'

evalsetSlex = pd.read_csv(path_evalsetSlex, sep='\t')
evalsetWsim = pd.read_csv(path_evalsetWsim, sep=',')
evalsetMens = pd.read_csv(path_evalsetMens, sep=' ', header=None)
evalsetMens.columns = ['word1', 'word2', 'MEN_score']

print("\nSimLex eval set:")
display(evalsetSlex.head())
print("\nWordSim eval set:")
display(evalsetWsim.head())
print("\nMEN eval set:")
display(evalsetMens.head())

evalset_name_list = ['SimLex-999', 'WordSim-353', 'MEN (3000)']


SimLex eval set:


Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93



WordSim eval set:


Unnamed: 0,Word 1,Word 2,Human (mean)
0,love,sex,6.77
1,tiger,cat,7.35
2,tiger,tiger,10.0
3,book,paper,7.46
4,computer,keyboard,7.62



MEN eval set:


Unnamed: 0,word1,word2,MEN_score
0,sun,sunlight,50.0
1,automobile,car,50.0
2,river,water,49.0
3,stairs,staircase,49.0
4,morning,sunrise,49.0


Based on the above structure for SimLex-999 as evaluation dataset1, we'll be using the word pairs (columns `word1` and `word2`) as well as the `SimLex999` column for score on scale 1-10. Additionally, the `SD(SimLex)` column for standard deviation indicative of agreement between the human annotators for the given word pair, could be used for further investigation and possible waiving, in case discrepancy of the output from the model(s) to be evaluated.

**Spearman correlation**

Taken from [scipy.stats.spearman()](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.spearmanr.html)

The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation

In [16]:
# ranking the eval datasets' scores

Slex_sorted = evalsetSlex.copy()
print("\nSimLex-999 eval dataset sample:")
display(Slex_sorted.head())
Slex_sorted = Slex_sorted.filter(items=['word1', 'word2', 'SimLex999', 'SD(SimLex)'], axis=1)
Slex_sorted = Slex_sorted.sort_index(by=['SimLex999'], ascending=False)
Slex_sorted = Slex_sorted.set_index(pd.Index(range(0,Slex_sorted.shape[0])))

Wsim_sorted = evalsetWsim.copy()
print("\nWordSim-353 eval dataset sample:")
display(Wsim_sorted.head())
Wsim_sorted = Wsim_sorted.sort_index(by=['Human (mean)'], ascending=False)
Wsim_sorted = Wsim_sorted.set_index(pd.Index(range(0,Wsim_sorted.shape[0])))

Mens_sorted = evalsetMens.copy()
print("\nMEN (3000) eval dataset sample:")
display(Mens_sorted.head())
Mens_sorted = Mens_sorted.sort_index(by=['MEN_score'], ascending=False)
Mens_sorted = Mens_sorted.set_index(pd.Index(range(0,Mens_sorted.shape[0])))

# add the rank column
print('*'*80)
rank = 0
Slex_sorted['rank_simlex'] = Slex_sorted.index + 1
print("\nTop and bottom rows of ranked SimLex eval dataset:")
display(pd.concat([Slex_sorted.head(), Slex_sorted.tail()], axis=0))

rank = 0
Wsim_sorted['rank_wordsim'] = Wsim_sorted.index + 1
print("\nTop and bottom rows of ranked WordSim eval dataset:")
display(pd.concat([Wsim_sorted.head(), Wsim_sorted.tail()], axis=0))

rank = 0
Mens_sorted['rank_men'] = Mens_sorted.index + 1
print("\nTop and bottom rows of ranked MEN eval dataset:")
display(pd.concat([Mens_sorted.head(), Mens_sorted.tail()], axis=0))


SimLex-999 eval dataset sample:


Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93



WordSim-353 eval dataset sample:


Unnamed: 0,Word 1,Word 2,Human (mean)
0,love,sex,6.77
1,tiger,cat,7.35
2,tiger,tiger,10.0
3,book,paper,7.46
4,computer,keyboard,7.62



MEN (3000) eval dataset sample:


Unnamed: 0,word1,word2,MEN_score
0,sun,sunlight,50.0
1,automobile,car,50.0
2,river,water,49.0
3,stairs,staircase,49.0
4,morning,sunrise,49.0


********************************************************************************

Top and bottom rows of ranked SimLex eval dataset:


Unnamed: 0,word1,word2,SimLex999,SD(SimLex),rank_simlex
0,vanish,disappear,9.8,0.46,1
1,quick,rapid,9.7,1.14,2
2,creator,maker,9.62,1.4,3
3,stupid,dumb,9.58,1.48,4
4,insane,crazy,9.57,0.92,5
994,gun,fur,0.3,1.8,995
995,chapter,tail,0.3,1.57,996
996,dirty,narrow,0.3,0.89,997
997,new,ancient,0.23,0.46,998
998,shrink,grow,0.23,1.2,999



Top and bottom rows of ranked WordSim eval dataset:


Unnamed: 0,Word 1,Word 2,Human (mean),rank_wordsim
0,tiger,tiger,10.0,1
1,fuck,sex,9.44,2
2,journey,voyage,9.29,3
3,midday,noon,9.29,4
4,dollar,buck,9.22,5
348,rooster,voyage,0.62,349
349,noon,string,0.54,350
350,chord,smile,0.54,351
351,professor,cucumber,0.31,352
352,king,cabbage,0.23,353



Top and bottom rows of ranked MEN eval dataset:


Unnamed: 0,word1,word2,MEN_score,rank_men
0,sun,sunlight,50.0,1
1,automobile,car,50.0,2
2,river,water,49.0,3
3,stairs,staircase,49.0,4
4,morning,sunrise,49.0,5
2995,feathers,truck,1.0,2996
2996,festival,whiskers,1.0,2997
2997,muscle,tulip,1.0,2998
2998,bikini,pizza,1.0,2999
2999,bakery,zebra,0.0,3000


In [19]:
# Evaluating the model's scoring on eval datasets

# Simlex

w2v_Slex = Slex_sorted.copy()
w2v_Slex['w2v_score'] = Slex_sorted.apply(lambda row: w2v.similarity(row['word1'], row['word2']), axis=1 )
w2v_Slex_sorted = w2v_Slex.sort_values(by=['w2v_score'], ascending=False)
w2v_Slex_sorted = w2v_Slex_sorted.set_index(pd.Index(range(0,w2v_Slex_sorted.shape[0])))
w2v_Slex_sorted['rank_w2v'] = w2v_Slex_sorted.index + 1
w2v_Slex_sorted = w2v_Slex_sorted.sort_values(by=['SimLex999'], ascending=False)
w2v_Slex['rank_w2v'] = w2v_Slex_sorted['rank_w2v'].values

# Wordsim

w2v_Wsim = Wsim_sorted.copy()
w2v_Wsim['w2v_score'] = Wsim_sorted.apply(lambda row: w2v.similarity(row['Word 1'], row['Word 2']), axis=1 )
w2v_Wsim_sorted = w2v_Wsim.sort_values(by=['w2v_score'], ascending=False)
w2v_Wsim_sorted = w2v_Wsim_sorted.set_index(pd.Index(range(0,w2v_Wsim_sorted.shape[0])))
w2v_Wsim_sorted['rank_w2v'] = w2v_Wsim_sorted.index + 1
w2v_Wsim_sorted = w2v_Wsim_sorted.sort_values(by=['Human (mean)'], ascending=False)
w2v_Wsim['rank_w2v'] = w2v_Wsim_sorted['rank_w2v'].values

# MEN

w2v_Mens = Mens_sorted.copy()
w2v_Mens['w2v_score'] = Mens_sorted.apply(lambda row: w2v.similarity(row['word1'], row['word2']), axis=1 )
w2v_Mens_sorted = w2v_Mens.sort_values(by=['w2v_score'], ascending=False)
w2v_Mens_sorted = w2v_Mens_sorted.set_index(pd.Index(range(0,w2v_Mens_sorted.shape[0])))
w2v_Mens_sorted['rank_w2v'] = w2v_Mens_sorted.index + 1
w2v_Mens_sorted = w2v_Mens_sorted.sort_values(by=['MEN_score'], ascending=False)
w2v_Mens['rank_w2v'] = w2v_Mens_sorted['rank_w2v'].values

print("\nEval datasets as ranked by word2vec:")
display(w2v_Slex.head())
display(w2v_Wsim.head())
display(w2v_Mens.head())

w2v_spearman = list()
w2v_spearman.append(scipy.stats.spearmanr(w2v_Slex['rank_simlex'], w2v_Slex['rank_w2v'])[0])
w2v_spearman.append(scipy.stats.spearmanr(w2v_Wsim['rank_wordsim'], w2v_Wsim['rank_w2v'])[0])
w2v_spearman.append(scipy.stats.spearmanr(w2v_Mens['rank_men'], w2v_Mens['rank_w2v'])[0])

print('*'*70)
w2v_Spmr = pd.DataFrame({'Dataset': evalset_name_list, 'Spearman rank coeff.': w2v_spearman})
display(w2v_Spmr)


Eval datasets as ranked by word2vec:


Unnamed: 0,word1,word2,SimLex999,SD(SimLex),rank_simlex,w2v_score,rank_w2v
0,vanish,disappear,9.8,0.46,1,0.910748,4
1,quick,rapid,9.7,1.14,2,0.414103,555
2,creator,maker,9.62,1.4,3,0.307668,758
3,stupid,dumb,9.58,1.48,4,0.804416,20
4,insane,crazy,9.57,0.92,5,0.616115,187


Unnamed: 0,Word 1,Word 2,Human (mean),rank_wordsim,w2v_score,rank_w2v
0,tiger,tiger,10.0,1,1.0,1
1,fuck,sex,9.44,2,0.120425,297
2,journey,voyage,9.29,3,0.777584,5
3,midday,noon,9.29,4,0.76694,7
4,dollar,buck,9.22,5,0.249064,200


Unnamed: 0,word1,word2,MEN_score,rank_men,w2v_score,rank_w2v
0,sun,sunlight,50.0,1,0.714696,418
1,automobile,car,50.0,2,0.56477,90
2,river,water,49.0,3,0.610263,482
3,stairs,staircase,49.0,4,0.701168,289
4,morning,sunrise,49.0,5,0.42083,114


**********************************************************************


Unnamed: 0,Dataset,Spearman rank coeff.
0,SimLex-999,0.367297
1,WordSim-353,0.591474
2,MEN (3000),0.676167


In [20]:
# save the ranked word pairs and scores for further analysis
w2v_Slex.to_csv(path_results + '/' + name_Slex)
w2v_Wsim.to_csv(path_results + '/' + name_Wsim)
w2v_Mens.to_csv(path_results + '/' + name_Mens)
w2v_Spmr.to_csv(path_results + '/' + name_Spmr)

The Spearman coefficient for word2vec is very close with its value reported on SimLex-999 page:  
https://fh295.github.io//simlex.html  

As expected SimLex dataset scores the lowest Spearman rank value, since it is the only one among the three that differentiates semantic similarity with semantic relatedness, which is difficult concept to be captured by the models.