# Semantically related words using embeddings

In this notebook, word embeddings from word2vec or GloVe are used to find words with similar meanings. The word2vec model is trained on the 1 Billion Word corpus.

**Gensim word2vec APIs**: https://radimrehurek.com/gensim/models/word2vec.html

**1 Billion corpus**: http://www.statmt.org/lm-benchmark/

**Evaluation**

The evaluation uses the SimLex-999 dataset, which is particularly challenging because it distinguishes semantic similarity from semantic relatedness.

In [1]:
import gensim                     # implements word2vec model infrastructure and provides interfacing APIs 
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import get_tmpfile, datapath
from gensim.models.word2vec import *
import pandas as pd
import scipy.stats
import codecs
import warnings
warnings.filterwarnings('ignore')



In [2]:
models = ['word2vec']                                # list of models to eval: 'word2vec' 
                                                     # and 'glove' (training not implemented yet)

path_corpus = '../datasets/1-billion-word-LM_corpus/1-billion-word-LM_corpus'
path_corpus_proc = '../datasets/1-billion-word-LM_corpus/1-billion-word-LM_corpus_processed'

Two architectures are available for training the word2vec model. Their trade-offs, as described by *Mikolov*:

**Skip-gram**: works well with small amount of the training data, represents well even rare words or phrases.  
**CBOW**: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
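
To make the difference between the two architectures concrete, the sketch below lists the training examples each one would derive from a single sentence. It is a simplified illustration in plain Python, not gensim's implementation: the helper names (`context_of`, `skipgram_pairs`, `cbow_pairs`) and the example sentence are made up for this purpose, and real word2vec training also applies steps such as random window shrinking and subsampling that are omitted here. In gensim, `sg=1` selects skip-gram and `sg=0` selects CBOW.

```python
# Sketch: training examples each architecture derives from one sentence.
# `window` is the maximum distance between the target and a context word.

def context_of(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding it."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context word) example per context word."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

def cbow_pairs(tokens, window=2):
    """CBOW: one (all context words, target) example per position."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]

sentence = "i drink coffee every morning".split()   # made-up example sentence
sg_examples = skipgram_pairs(sentence)
cbow_examples = cbow_pairs(sentence)
print(len(sg_examples), "skip-gram examples;", len(cbow_examples), "CBOW examples")
```

In this sketch skip-gram yields one example per (target, context) pair while CBOW yields one per position, which is consistent with CBOW being the faster of the two to train.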

In [3]:
# check for text decoding errors

def decode_check(file_path):
    dec_errors = 0
    line_cnt = 0
    with open(file_path, mode='r') as file:
        line = True                          # sentinel so the loop runs at least once
        while line:
            try:
                line = file.readline()       # an empty string signals end of file
            except UnicodeDecodeError:
                dec_errors += 1              # line could not be decoded
            line_cnt += 1
    
    print("{:20s}: {}".format("\nTotal line count", line_cnt))
    print("{:20s}: {}".format("Line decoding errors", dec_errors))
    
    return line_cnt, dec_errors

In [4]:
# inspect the corpus and remove lines with decoding errors
# the whole sentence needs to be removed, otherwise the wrong context might be captured 

lineCnt, errorCnt = decode_check(path_corpus)

if errorCnt != 0:
    print('\nPre-processing corpus...')
    new_file = open(path_corpus_proc, 'w')
    errorCnt = 0
    
    #with codecs.open(path_corpus, 'rb') as file:
    #    try:
    #        line = file.read().decode('utf-8', 'strict').encode('utf-8', 'strict')
    #        new_file.write(line)
    #    except:
    #        line = None
    #        errorCnt += 1
    #    
    #    while(line):
    #        try:
    #            line = file.read().decode('utf-8', 'strict').encode('utf-8', 'strict')
    #            new_file.write(line)
    #        except:
    #            line = None
    #            errorCnt += 1
                
    with open(path_corpus, mode='r') as file:
        line = True                          # sentinel so the loop runs at least once
        while line:
            try:
                line = file.readline()
                print(line, file=new_file, end='')
            except UnicodeDecodeError:
                errorCnt += 1                # skip the whole undecodable line

    new_file.close()
    print("Dropped {} lines with decoding errors.".format(errorCnt))
    print('Pre-processing done.')

    lineCnt, errorCnt = decode_check(path_corpus_proc)
    assert errorCnt == 0
    input_corpus = path_corpus_proc

else:
    input_corpus = path_corpus


Total line count   : 30726609
Line decoding errors: 3247

Pre-processing corpus...
Dropped 3247 lines with decoding errors.
Pre-processing done.

Total line count   : 30723362
Line decoding errors: 0
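
An alternative way to drop undecodable sentences is to read the corpus in binary mode and attempt a strict decode of each line, which avoids depending on how a text-mode reader recovers after an error. The sketch below is only an illustration: the function name `drop_undecodable_lines` is hypothetical, and it assumes the corpus is meant to be UTF-8 encoded, which the notebook does not state.

```python
import os
import tempfile

def drop_undecodable_lines(src_path, dst_path, encoding="utf-8"):
    """Copy src to dst, skipping every line that is not valid `encoding`.

    Returns (lines_kept, lines_dropped).
    """
    kept = dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for raw in src:                      # iterate over raw bytes, line by line
            try:
                raw.decode(encoding)         # strict decoding raises on bad bytes
            except UnicodeDecodeError:
                dropped += 1                 # drop the whole sentence
                continue
            dst.write(raw)
            kept += 1
    return kept, dropped

# Tiny demonstration on a throwaway file: the second line holds an invalid byte.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "corpus.txt")
    dst_path = os.path.join(tmp, "corpus_clean.txt")
    with open(src_path, "wb") as f:
        f.write(b"good line\n" + b"bad \xff line\n" + b"caf\xc3\xa9 au lait\n")
    kept, dropped = drop_undecodable_lines(src_path, dst_path)
    with open(dst_path, "rb") as f:
        cleaned = f.read()

print("kept:", kept, "dropped:", dropped)
```

Writing the original bytes back unchanged keeps the valid lines byte-for-byte identical to the source.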


In [6]:
# train models

if 'word2vec' in models:
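    # stream the corpus (one sentence per line), then build the vocabulary and train
    input_sents = LineSentence(input_corpus)
    print("Training word2vec model...")
    # sg=0 selects CBOW (sg=1 would select skip-gram); min_count=1 keeps every word
    # note: `size` is named `vector_size` in gensim 4.0 and later
    w2v = Word2Vec(input_sents, size=300, window=5, min_count=1, workers=16, sg=0)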
    print("Finished training.")

if 'glove' in models:
    # load pre-trained GloVe model
    glove_vectors = '../pretrained/glove.twitter.27B.200d.txt'
    tmp_file = get_tmpfile("test_word2vec.txt")
    
    glove2word2vec(glove_input_file=glove_vectors, word2vec_output_file=tmp_file)
    glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file)
    print("Loaded GloVe model.")

Training word2vec model...
Finished training.


In [7]:
w2v_embed = w2v.wv                # KeyedVectors object holding the trained word vectors

In [8]:
# similarity 
pair1 = ['coffee','cup']
pair2 = ['coffee','tea']

if 'word2vec' in models:
    cos_sim1_w = w2v_embed.similarity(pair1[0], pair1[1])
    cos_sim2_w = w2v_embed.similarity(pair2[0], pair2[1])
    print('word2vec cosine similarity of {}: {}'.format(pair1, cos_sim1_w) )
    print('word2vec cosine similarity of {}: {}'.format(pair2, cos_sim2_w) )

if 'glove' in models:
    cos_sim1_g = glove.similarity(pair1[0], pair1[1])
    cos_sim2_g = glove.similarity(pair2[0], pair2[1])
    print('\nGloVe cosine similarity of {}: {}'.format(pair1, cos_sim1_g) )
    print('GloVe cosine similarity of {}: {}'.format(pair2, cos_sim2_g) )

word2vec cosine similarity of ['coffee', 'cup']: 0.4199226200580597
word2vec cosine similarity of ['coffee', 'tea']: 0.7134822607040405


The problem illustrated above is that a high similarity score does not always indicate synonymy: 'coffee' and 'tea' score 0.71 without being synonyms. In the same way, the target word 'minor' is closer to 'major' than to 'small'.
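
The scores printed above are cosine similarities, i.e. the cosine of the angle between two word vectors. As a minimal sketch of that computation in plain Python (the three-dimensional vectors are invented toy values, not taken from the trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, chosen so that 'coffee' points roughly the same way as 'tea'.
coffee = [0.9, 0.8, 0.1]
tea = [0.8, 0.9, 0.2]
cup = [0.1, 0.3, 0.9]

print("coffee/tea:", round(cosine_similarity(coffee, tea), 3))
print("coffee/cup:", round(cosine_similarity(coffee, cup), 3))
```

Because the measure only compares directions, words that occur in similar contexts (such as two beverages, or a pair of antonyms) can score highly without being synonyms.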

In [9]:
# vector representation of the word

if 'word2vec' in models:
    vec_pair1_0_w = w2v_embed.get_vector(pair1[0])
    print("word2vec Vector embedding dimension: ",vec_pair1_0_w.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_w[1:20])

if 'glove' in models:
    vec_pair1_0_g = glove.get_vector(pair1[0])
    print("\nGloVe vector embedding dimension: ",vec_pair1_0_g.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_g[1:20])

word2vec Vector embedding dimension:  (300,)

Printing a subset of the whole vector for the word 'coffee':
[ 0.19272955 -0.7465307  -0.5118991  -0.13203195 -0.803349   -0.89371943
  0.59276193 -0.71725786  1.1546805   0.4209137   0.594539   -1.625906
 -1.5279709  -0.95023376  0.6368995  -0.3045696  -0.1427788  -2.0711997
 -1.2850178 ]


In [10]:
# most similar words - by word
n_similar = 15
thisWord = 'coffee'

if 'word2vec' in models:
    print("Most similar {} words (by word) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v_embed.similar_by_word(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by word) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_word(thisWord, n_similar))

Most similar 15 words (by word) for 'coffee' by word2vec model:


[('cappuccino', 0.7345283031463623),
 ('coffees', 0.719455897808075),
 ('latte', 0.7135310173034668),
 ('tea', 0.7134822607040405),
 ('cappuccinos', 0.7123081684112549),
 ('lattes', 0.7047884464263916),
 ('beer', 0.6945380568504333),
 ('espresso', 0.691644549369812),
 ('snack', 0.6696176528930664),
 ('croissants', 0.6683788895606995),
 ('chai', 0.6663810014724731),
 ('chocolate', 0.6636378765106201),
 ('granola', 0.6541717052459717),
 ('gelato', 0.6520758867263794),
 ('mocha', 0.6493180990219116)]

Some candidates in the list, such as 'tea', 'beer' and 'snack', are related to 'coffee' rather than similar in meaning. Ideally these would be screened out based on context.

In [11]:
# most similar words - by vector

if 'word2vec' in models:
    print("Most similar {} words (by vector) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v_embed.similar_by_vector(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by vector) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_vector(thisWord, n_similar))

Most similar 15 words (by vector) for 'coffee' by word2vec model:


[('cappuccino', 0.7345283031463623),
 ('coffees', 0.719455897808075),
 ('latte', 0.7135310173034668),
 ('tea', 0.7134822607040405),
 ('cappuccinos', 0.7123081684112549),
 ('lattes', 0.7047884464263916),
 ('beer', 0.6945380568504333),
 ('espresso', 0.691644549369812),
 ('snack', 0.6696176528930664),
 ('croissants', 0.6683788895606995),
 ('chai', 0.6663810014724731),
 ('chocolate', 0.6636378765106201),
 ('granola', 0.6541717052459717),
 ('gelato', 0.6520758867263794),
 ('mocha', 0.6493180990219116)]

In this case, the candidates and their order are exactly the same as those returned by `similar_by_word()`.
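
Identical results are to be expected when both lookups rank the vocabulary by cosine similarity to the same vector. The sketch below reproduces the idea on a toy vocabulary in plain Python. The helper names (`neighbours_by_word`, `neighbours_by_vector`) and the vectors are hypothetical and are not gensim's implementation. In this sketch the only difference is that a lookup by word can exclude the query word, whereas a lookup by raw vector has no word to exclude.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(vectors, query, topn, exclude=()):
    """Rank all words (except `exclude`) by cosine similarity to `query`."""
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

def neighbours_by_word(vectors, word, topn):
    # look the word up, then exclude it from its own neighbour list
    return most_similar(vectors, vectors[word], topn, exclude={word})

def neighbours_by_vector(vectors, vector, topn):
    # a raw vector carries no word identity, so nothing is excluded
    return most_similar(vectors, vector, topn)

# Made-up toy vectors for illustration only.
toy_vectors = {
    "coffee": [0.9, 0.8, 0.1],
    "tea": [0.8, 0.9, 0.2],
    "latte": [0.9, 0.7, 0.2],
    "cup": [0.1, 0.3, 0.9],
}

by_word = neighbours_by_word(toy_vectors, "coffee", topn=2)
by_vector = neighbours_by_vector(toy_vectors, toy_vectors["coffee"], topn=3)
print([word for word, _ in by_word])
print([word for word, _ in by_vector])
```

Apart from the query word itself appearing first in the by-vector list, both lookups return the same candidates in the same order.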

## Evaluation:

In [12]:
# load evaluation datasets
path_evalset1 = '../datasets/SimLex-999/SimLex-999.txt'
evalset1 = pd.read_csv(path_evalset1, sep='\t')
evalset1.head()

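| | word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
| 1 | smart | intelligent | A | 9.2 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
| 2 | hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
| 3 | happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
| 4 | hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |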


Based on the structure shown above, the evaluation uses the word pairs (columns `word1` and `word2`) together with the `SimLex999` column, which holds the human similarity score on a scale of 0-10. In addition, the `SD(SimLex)` column gives the standard deviation of the annotators' ratings for each pair, which indicates how closely they agreed. It could be used for further investigation, and possibly to waive pairs where the output of the evaluated model(s) deviates from an uncertain human score.
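
As a rough sketch of how the `SD(SimLex)` column might be used to set aside pairs with low annotator agreement, the snippet below splits rows at a cutoff. The function name `split_by_agreement` and the cutoff of 1.5 are purely illustrative assumptions. The sample rows are the five shown in the `head()` output, reduced to the columns needed and tab-separated like the original file.

```python
import csv
import io

# The five rows from the head() output above, restricted to the columns used here.
sample = (
    "word1\tword2\tSimLex999\tSD(SimLex)\n"
    "old\tnew\t1.58\t0.41\n"
    "smart\tintelligent\t9.2\t0.67\n"
    "hard\tdifficult\t8.77\t1.19\n"
    "happy\tcheerful\t9.55\t2.18\n"
    "hard\teasy\t0.95\t0.93\n"
)

def split_by_agreement(rows, max_sd):
    """Separate pairs whose annotator SD is at most `max_sd` from the rest."""
    agreed, disputed = [], []
    for row in rows:
        target = agreed if float(row["SD(SimLex)"]) <= max_sd else disputed
        target.append((row["word1"], row["word2"], float(row["SimLex999"])))
    return agreed, disputed

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
agreed, disputed = split_by_agreement(rows, max_sd=1.5)   # 1.5 is an arbitrary cutoff
print(len(agreed), "pairs kept; set aside:", disputed)
```

Pairs in `disputed` would not be discarded outright. They are simply the ones where a mismatch between model and human score deserves a closer look.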

**Spearman correlation**

Taken from [scipy.stats.spearmanr()](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.spearmanr.html)

The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation.
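
The coefficient can be computed as the Pearson correlation of the ranks of the two datasets, with tied values sharing the mean of their ranks. The following plain-Python sketch shows that calculation. The helper names and the toy scores are made up for illustration, and in the notebook itself `scipy.stats.spearmanr` does this work.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up toy scores: human ratings versus model similarities for four pairs.
human = [9.2, 8.77, 1.58, 0.95]
model = [0.9, 0.5, 0.6, 0.1]
print(round(spearman(human, model), 4))
```

Since only the ordering matters, the different scales of the two score columns (0-10 for the human ratings, -1 to 1 for cosine similarity) do not affect the result.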

In [13]:
# ranking the SimLex-999 scores

# sort the eval dataset by human similarity score, highest first
evalset1_sorted = evalset1.copy()
display(evalset1_sorted.head())
evalset1_sorted = evalset1_sorted.filter(items=['word1', 'word2', 'SimLex999', 'SD(SimLex)'], axis=1)
evalset1_sorted = evalset1_sorted.sort_values(by='SimLex999', ascending=False)
evalset1_sorted = evalset1_sorted.reset_index(drop=True)

# add the rank column (1 = most similar pair)
evalset1_sorted['rank'] = evalset1_sorted.index + 1
print("\nTop rows of ranked eval dataset:")
display(evalset1_sorted.head())
print("\nBottom rows of ranked eval dataset:")
display(evalset1_sorted.tail())

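| | word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
| 1 | smart | intelligent | A | 9.2 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
| 2 | hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
| 3 | happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
| 4 | hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |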



Top rows of ranked eval dataset:


| | word1 | word2 | SimLex999 | SD(SimLex) | rank |
|---|---|---|---|---|---|
| 0 | vanish | disappear | 9.8 | 0.46 | 1 |
| 1 | quick | rapid | 9.7 | 1.14 | 2 |
| 2 | creator | maker | 9.62 | 1.4 | 3 |
| 3 | stupid | dumb | 9.58 | 1.48 | 4 |
| 4 | insane | crazy | 9.57 | 0.92 | 5 |



Bottom rows of ranked eval dataset:


| | word1 | word2 | SimLex999 | SD(SimLex) | rank |
|---|---|---|---|---|---|
| 994 | gun | fur | 0.3 | 1.8 | 995 |
| 995 | chapter | tail | 0.3 | 1.57 | 996 |
| 996 | dirty | narrow | 0.3 | 0.89 | 997 |
| 997 | new | ancient | 0.23 | 0.46 | 998 |
| 998 | shrink | grow | 0.23 | 1.2 | 999 |


In [16]:
# Evaluating the models scoring on eval datasets

w2v_scores = evalset1_sorted.copy()
w2v_scores['w2v'] = evalset1_sorted.apply(lambda row: 
                                          w2v_embed.similarity(row['word1'], row['word2']), axis=1 )
display(w2v_scores.head())
w2v_spearman = scipy.stats.spearmanr(w2v_scores['SimLex999'], w2v_scores['w2v'])
print('Spearman correlation for word2vec model: {:6.4f} '.format(w2v_spearman[0]) )

| | word1 | word2 | SimLex999 | SD(SimLex) | rank | w2v |
|---|---|---|---|---|---|---|
| 0 | vanish | disappear | 9.8 | 0.46 | 1 | 0.906844 |
| 1 | quick | rapid | 9.7 | 1.14 | 2 | 0.42493 |
| 2 | creator | maker | 9.62 | 1.4 | 3 | 0.295135 |
| 3 | stupid | dumb | 9.58 | 1.48 | 4 | 0.803944 |
| 4 | insane | crazy | 9.57 | 0.92 | 5 | 0.614515 |


Spearman correlation for word2vec model: 0.3683 


This may be a coincidence, but **the Spearman coefficient above is very close to the values reported on the SimLex-999 page:**  
https://fh295.github.io//simlex.html

### Next steps:

- Evaluate GloVe embedding model
- Use other eval datasets as well, such as MEN and WordSim-353