# Semantically related words using pre-trained embeddings

This notebook uses pre-trained word embeddings (word2vec trained on the Google News corpus, or GloVe trained on Twitter data) to derive synsets (synonym sets), i.e. groups of words with similar meanings.

**Gensim word2vec APIs**: https://radimrehurek.com/gensim/models/word2vec.html

**Pre-trained word2vec model on Google News**: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

**Pre-trained GloVe model on Twitter 2B tweets**: https://nlp.stanford.edu/projects/glove/

The above models are in the form of binary/text files that can be loaded into the environment at runtime.

**Evaluation**

The evaluation uses the SimLex-999 dataset. It is particularly challenging because it distinguishes semantic similarity from semantic relatedness: pairs that are related but not similar, such as the antonyms `hard`/`easy`, receive low scores.


In [11]:
import gensim                     # implements word2vec model infrastructure and provides interfacing APIs 
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import get_tmpfile

import pandas as pd
import scipy.stats
import warnings
warnings.filterwarnings('ignore')

In [2]:
models = ['word2vec']                                # list of models to eval: 'word2vec' and 'glove'

In [3]:
if 'word2vec' in models:
    # load pre-trained word2vec model
    word2vec_vectors = '../pretrained/GoogleNews-vectors-negative300.bin'
    w2v = gensim.models.KeyedVectors.load_word2vec_format(word2vec_vectors, binary=True)
    print("Loaded word2vec model.")

if 'glove' in models:
    # load pre-trained GloVe model
    glove_vectors = '../pretrained/glove.twitter.27B.200d.txt'
    tmp_file = get_tmpfile("test_word2vec.txt")
    
    glove2word2vec(glove_input_file=glove_vectors, word2vec_output_file=tmp_file)
    glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file)
    print("Loaded GloVe model.")

Loaded word2vec model.


In [4]:
# cosine similarity between the two words of each pair
pair1 = ['coffee','cup']
pair2 = ['coffee','tea']

if 'word2vec' in models:
    # similarity() returns a cosine similarity (higher = closer), not a distance
    cos_sim1_w = w2v.similarity(pair1[0], pair1[1])
    cos_sim2_w = w2v.similarity(pair2[0], pair2[1])
    print('word2vec cosine similarity of {}: {}'.format(pair1, cos_sim1_w) )
    print('word2vec cosine similarity of {}: {}'.format(pair2, cos_sim2_w) )

if 'glove' in models:
    cos_sim1_g = glove.similarity(pair1[0], pair1[1])
    cos_sim2_g = glove.similarity(pair2[0], pair2[1])
    print('\nGloVe cosine similarity of {}: {}'.format(pair1, cos_sim1_g) )
    print('GloVe cosine similarity of {}: {}'.format(pair2, cos_sim2_g) )

word2vec cosine similarity of ['coffee', 'cup']: 0.3560178279876709
word2vec cosine similarity of ['coffee', 'tea']: 0.5635291934013367


The scores above point to a limitation: cosine similarity captures relatedness and does not always translate to synonymy. For example, the target word 'minor' is closer to 'major' than to 'small'.
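For reference, the score returned by `similarity` is the cosine of the angle between the two word vectors. The sketch below computes it in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; they are not part of gensim or of the pre-trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only
# (the word2vec vectors in this notebook have 300 dimensions).
toy_a = [1.0, 2.0, 0.0]
toy_b = [2.0, 4.0, 0.0]   # same direction as toy_a
toy_c = [0.0, 0.0, 3.0]   # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # 0: no shared direction
```

The measure depends only on direction, not on vector length, which is why two words that occur in similar contexts (synonyms, but also antonyms or associated words) can all score high.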

In [5]:
# vector representation of the word

if 'word2vec' in models:
    vec_pair1_0_w = w2v.get_vector(pair1[0])
    print("word2vec Vector embedding dimension: ",vec_pair1_0_w.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_w[1:20])

if 'glove' in models:
    vec_pair1_0_g = glove.get_vector(pair1[0])
    print("\nGloVe vector embedding dimension: ",vec_pair1_0_g.shape)
    print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
    print(vec_pair1_0_g[1:20])

word2vec Vector embedding dimension:  (300,)

Printing a subset of the whole vector for the word 'coffee':
[-0.13671875 -0.37304688  0.6171875   0.10839844  0.02722168  0.10009766
 -0.15136719 -0.01660156  0.38085938  0.06542969 -0.13183594  0.25390625
  0.09082031  0.02868652  0.25390625 -0.20507812  0.1640625   0.22070312
 -0.17480469]


In [6]:
# most similar words - by word
n_similar = 15
thisWord = 'coffee'

if 'word2vec' in models:
    print("Most similar {} words (by word) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v.similar_by_word(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by word) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_word(thisWord, n_similar))

Most similar 15 words (by word) for 'coffee' by word2vec model:


[('coffees', 0.721267819404602),
 ('gourmet_coffee', 0.7057087421417236),
 ('Coffee', 0.6900455355644226),
 ('o_joe', 0.6891065835952759),
 ('Starbucks_coffee', 0.6874972581863403),
 ('coffee_beans', 0.6749703884124756),
 ('latté', 0.664122462272644),
 ('cappuccino', 0.6625496745109558),
 ('brewed_coffee', 0.6621608734130859),
 ('espresso', 0.6616826057434082),
 ('java', 0.6504806876182556),
 ('iced_coffee', 0.6272041201591492),
 ('freshly_brewed_coffee', 0.6258745193481445),
 ('coffe', 0.6254313588142395),
 ('decaf', 0.619594931602478)]

With both models enabled, the list of similar words returned by word2vec differs from the one returned by GloVe (only the word2vec output is shown above, since `models` is set to `['word2vec']` in this run).

This is expected, as the two pre-trained models were trained on different corpora.
This variety can be used to capture more 'potential' candidates, but it also puts more weight on the next step, which has to screen out the less relevant ones.

**APSyn/APSynP** could serve as a more decisive similarity metric for that screening. Another approach would be to rule out weak candidates using outlier detection techniques.
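One possible way to pool candidates from several models before screening is sketched below. It is only an illustration under stated assumptions: `merge_candidates` is a hypothetical helper, the word2vec entries are rounded values from the output above, and the GloVe entries are invented placeholders, since GloVe was not run in this notebook. Scores from different models are not directly comparable, so the sketch ranks first by how many models proposed a word and only then by score.

```python
def merge_candidates(*ranked_lists):
    """Union several [(word, score), ...] lists.

    Returns [(word, best_score, n_lists), ...], with words proposed by more
    lists first and ties broken by the best score seen.
    """
    best = {}
    votes = {}
    for ranked in ranked_lists:
        seen = set()
        for word, score in ranked:
            best[word] = max(score, best.get(word, float("-inf")))
            if word not in seen:       # count each list at most once per word
                votes[word] = votes.get(word, 0) + 1
                seen.add(word)
    return sorted(((w, best[w], votes[w]) for w in best),
                  key=lambda item: (-item[2], -item[1]))


# Rounded word2vec neighbours of 'coffee' taken from the output above.
w2v_candidates = [("coffees", 0.7213), ("cappuccino", 0.6625), ("espresso", 0.6617)]
# Invented placeholder values standing in for a GloVe result.
glove_candidates = [("tea", 0.80), ("espresso", 0.75)]

for word, score, n_models in merge_candidates(w2v_candidates, glove_candidates):
    print(word, score, n_models)
```

Words suggested by only one model end up lower in the merged list, which gives the later screening step a natural place to start pruning.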

In [7]:
# most similar words - by vector

if 'word2vec' in models:
    print("Most similar {} words (by vector) for '{}' by word2vec model:".format(n_similar, thisWord))
    display(w2v.similar_by_vector(thisWord, n_similar))

if 'glove' in models:
    print("\nMost similar {} words (by vector) for '{}' by GloVe model:".format(n_similar, thisWord))
    display(glove.similar_by_vector(thisWord, n_similar))

Most similar 15 words (by vector) for 'coffee' by word2vec model:


[('coffees', 0.721267819404602),
 ('gourmet_coffee', 0.7057087421417236),
 ('Coffee', 0.6900455355644226),
 ('o_joe', 0.6891065835952759),
 ('Starbucks_coffee', 0.6874972581863403),
 ('coffee_beans', 0.6749703884124756),
 ('latté', 0.664122462272644),
 ('cappuccino', 0.6625496745109558),
 ('brewed_coffee', 0.6621608734130859),
 ('espresso', 0.6616826057434082),
 ('java', 0.6504806876182556),
 ('iced_coffee', 0.6272041201591492),
 ('freshly_brewed_coffee', 0.6258745193481445),
 ('coffe', 0.6254313588142395),
 ('decaf', 0.619594931602478)]

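One analysis still to be done is to compare the similar words returned *by word* with those returned *by vector*. The two lists above are identical because `similar_by_vector` was called with the word string rather than with its vector, so the call appears to have been resolved as the same by-word lookup. A genuine comparison would pass `w2v.get_vector(thisWord)` instead; a raw-vector query is not expected to exclude the query word itself from the results.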

## Evaluation:

In [8]:
# load evaluation datasets
# forward slashes keep the path portable and match the model paths above
path_evalset1 = '../datasets/SimLex-999/SimLex-999.txt'
evalset1 = pd.read_csv(path_evalset1, sep='\t')
evalset1.head()

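|   | word1 | word2 | POS | SimLex999 | conc(w1) | conc(w2) | concQ | Assoc(USF) | SimAssoc333 | SD(SimLex) |
|---|-------|-------|-----|-----------|----------|----------|-------|------------|-------------|------------|
| 0 | old | new | A | 1.58 | 2.72 | 2.81 | 2 | 7.25 | 1 | 0.41 |
| 1 | smart | intelligent | A | 9.2 | 1.75 | 2.46 | 1 | 7.11 | 1 | 0.67 |
| 2 | hard | difficult | A | 8.77 | 3.76 | 2.21 | 2 | 5.94 | 1 | 1.19 |
| 3 | happy | cheerful | A | 9.55 | 2.56 | 2.34 | 1 | 5.85 | 1 | 2.18 |
| 4 | hard | easy | A | 0.95 | 3.76 | 2.07 | 2 | 5.82 | 1 | 0.93 |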


Given this structure, the evaluation uses the word pairs (columns `word1` and `word2`) together with the `SimLex999` column, which holds the human similarity score on a 0-10 scale. The `SD(SimLex)` column gives the standard deviation of the ratings and so indicates how much the human annotators agreed on a given pair. It could be used for further investigation, for example to set aside pairs with low annotator agreement when a model's output diverges from the human score.
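The idea of setting aside low-agreement pairs can be sketched as follows, using only the five preview rows shown above. The helper `split_by_agreement` and the threshold `MAX_SD = 1.5` are hypothetical choices for illustration; the notebook does not prescribe a cut-off.

```python
# The five preview rows shown above: (word1, word2, SimLex999, SD(SimLex)).
preview_rows = [
    ("old", "new", 1.58, 0.41),
    ("smart", "intelligent", 9.20, 0.67),
    ("hard", "difficult", 8.77, 1.19),
    ("happy", "cheerful", 9.55, 2.18),
    ("hard", "easy", 0.95, 0.93),
]


def split_by_agreement(rows, max_sd):
    """Separate pairs into (kept, set_aside) by annotator standard deviation."""
    kept = [row for row in rows if row[3] <= max_sd]
    set_aside = [row for row in rows if row[3] > max_sd]
    return kept, set_aside


MAX_SD = 1.5  # hypothetical threshold, chosen only for this example
kept, set_aside = split_by_agreement(preview_rows, MAX_SD)
print("kept:", [(w1, w2) for w1, w2, _, _ in kept])
print("set aside:", [(w1, w2) for w1, w2, _, _ in set_aside])
```

On the full DataFrame, the same split could be written as a boolean mask, e.g. `evalset1[evalset1['SD(SimLex)'] <= MAX_SD]`.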

**Spearman correlation**

Taken from [scipy.stats.spearmanr()](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.spearmanr.html)

The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, it varies between -1 and +1, with 0 implying no correlation.
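To make the definition concrete, the sketch below computes the coefficient as the Pearson correlation of the ranks, in plain Python. It is an illustrative re-implementation, not scipy's code; the helper names are assumptions. The inputs are the five preview rows of SimLex-999 and the scaled word2vec scores reported for them later in this notebook.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [1.58, 9.20, 8.77, 9.55, 0.95]                      # SimLex999 column
model = [2.227803, 6.495278, 6.025748, 3.837738, 4.709633]  # scaled word2vec scores
rho = spearman(human, model)
print(round(rho, 4))  # 0.2 on these five rows only, not on the full dataset
```

Because only the ranks enter the computation, multiplying the cosine scores by a positive constant leaves the coefficient unchanged, so a linear rescaling of the model scores affects readability of the table but not the Spearman result.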

In [9]:
# Scaling cosine score
# This scaling might need a non linear function, but for now, a simple linear scaling is being used

def scale_cosine(score, scale=10.0):
    return scale*score

In [15]:
# Evaluating the models scoring on eval datasets

models_score = evalset1.filter(items=['word1', 'word2', 'SimLex999', 'SD(SimLex)'])
models_score['w2v'] = models_score.apply(lambda row: scale_cosine(
    w2v.similarity(row['word1'], row['word2'])), axis=1 )
display(models_score.head())
w2v_spearman = scipy.stats.spearmanr(models_score['SimLex999'], models_score['w2v'])
print('Spearman correlation for word2vec model: {:6.4f} '.format(w2v_spearman[0]) )

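|   | word1 | word2 | SimLex999 | SD(SimLex) | w2v |
|---|-------|-------|-----------|------------|-----|
| 0 | old | new | 1.58 | 0.41 | 2.227803 |
| 1 | smart | intelligent | 9.2 | 0.67 | 6.495278 |
| 2 | hard | difficult | 8.77 | 1.19 | 6.025748 |
| 3 | happy | cheerful | 9.55 | 2.18 | 3.837738 |
| 4 | hard | easy | 0.95 | 0.93 | 4.709633 |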


Spearman correlation for word2vec model: 0.4420 


## Next steps:
- Calculate the Spearman score for GloVe.
- Try other forms of embeddings that may improve upon word2vec, e.g.
    - GloVe
    - fastText
- Inspect the performance on less frequent words (fastText should perform better in this scenario).

## Other resources
- http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
- https://www.quora.com/Where-can-I-find-some-pre-trained-word-vectors-for-natural-language-processing-understanding
- https://textminingonline.com/getting-started-with-word2vec-and-glove-in-python
