# Semantically related words using pre-trained embeddings

This notebook uses pre-trained word embeddings (word2vec trained on the Google News corpus and GloVe trained on Twitter data) to find candidate synsets (synonym sets), i.e. groups of words with similar meanings.

**Gensim word2vec APIs**: https://radimrehurek.com/gensim/models/word2vec.html

**Pre-trained word2vec model on Google News**: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

**Pre-trained GloVe model on Twitter 2B tweets**: https://nlp.stanford.edu/projects/glove/

The models above are distributed as files (binary for word2vec, plain text for GloVe) that are loaded into the environment at runtime.

*Notebook runtime: ~ 4 mins*

In [1]:
import gensim                     # implements word2vec model infrastructure and provides interfacing APIs 
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import get_tmpfile
import warnings
warnings.filterwarnings('ignore')



In [2]:
# load pre-trained word2vec model
word2vec_vectors = '../pretrained/GoogleNews-vectors-negative300.bin'

w2v = gensim.models.KeyedVectors.load_word2vec_format(word2vec_vectors, binary=True)

In [3]:
# load pre-trained GloVe model
glove_vectors = '../pretrained/glove.twitter.27B.200d.txt'
tmp_file = get_tmpfile("test_word2vec.txt")

glove2word2vec(glove_input_file=glove_vectors, word2vec_output_file=tmp_file)
glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file)
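
The conversion step above is needed because the GloVe text file has no header, while the word2vec text format starts with a `<vocab_size> <dimension>` line. The sketch below illustrates that difference on made-up toy vectors; `glove_to_word2vec_text` is a hypothetical helper written for this illustration, not the gensim implementation. Recent gensim releases (4.x) also document a `no_header=True` argument to `load_word2vec_format` that may make the temporary file unnecessary, if your installed version supports it.

```python
import io


def glove_to_word2vec_text(glove_lines):
    """Return word2vec-text lines: a '<vocab_size> <dim>' header plus the input rows."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # The first token of each row is the word; the rest are vector components.
    dim = len(rows[0].split(" ")) - 1
    return ["{} {}".format(len(rows), dim)] + rows


# Toy input with made-up 3-dimensional vectors (not real GloVe values).
toy_glove = io.StringIO("minor 0.1 0.2 0.3\nmajor 0.2 0.1 0.4\n")
converted = glove_to_word2vec_text(toy_glove)
print("\n".join(converted))
```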

In [4]:
# cosine similarity between word pairs (higher means more similar)
pair1 = ['minor','small']
pair2 = ['minor','major']
cos_sim1_w = w2v.similarity(pair1[0], pair1[1])
cos_sim2_w = w2v.similarity(pair2[0], pair2[1])

print('word2vec cosine similarity of {}: {}'.format(pair1, cos_sim1_w) )
print('word2vec cosine similarity of {}: {}'.format(pair2, cos_sim2_w) )

cos_sim1_g = glove.similarity(pair1[0], pair1[1])
cos_sim2_g = glove.similarity(pair2[0], pair2[1])

print('\nGloVe cosine similarity of {}: {}'.format(pair1, cos_sim1_g) )
print('GloVe cosine similarity of {}: {}'.format(pair2, cos_sim2_g) )

word2vec cosine similarity of ['minor', 'small']: 0.3416362702846527
word2vec cosine similarity of ['minor', 'major']: 0.47539088129997253

GloVe cosine similarity of ['minor', 'small']: 0.42706066370010376
GloVe cosine similarity of ['minor', 'major']: 0.703789472579956


The problem above is that a high similarity score does not imply synonymy: in both models the target word 'minor' is closer to its antonym 'major' than to its near-synonym 'small'.
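
To make the metric concrete, here is a minimal sketch of cosine similarity, the quantity that `similarity` reports. The three-dimensional vectors are made up for illustration and are not taken from either model; they are chosen so that 'minor' and 'major' point in nearly the same direction, which mimics how antonyms that occur in similar contexts can end up close together.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption): not real word2vec or GloVe values.
toy = {
    "minor": [1.0, 0.2, 0.0],
    "major": [0.9, 0.3, 0.1],
    "small": [0.2, 1.0, 0.3],
}

sim_major = cosine_similarity(toy["minor"], toy["major"])
sim_small = cosine_similarity(toy["minor"], toy["small"])
print("minor/major:", round(sim_major, 3))
print("minor/small:", round(sim_small, 3))
```

Cosine similarity only measures direction in the embedding space, so it cannot by itself separate synonyms from antonyms or other related words.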

In [5]:
# vector representation of the word
vec_pair1_0_w = w2v.get_vector(pair1[0])
print("word2vec Vector embedding dimension: ",vec_pair1_0_w.shape)
print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
print(vec_pair1_0_w[1:20])

vec_pair1_0_g = glove.get_vector(pair1[0])
print("\nGloVe vector embedding dimension: ",vec_pair1_0_g.shape)
print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
print(vec_pair1_0_g[1:20])

word2vec Vector embedding dimension:  (300,)

Printing a subset of the whole vector for the word 'minor':
[ 0.06640625 -0.00228882  0.00402832 -0.28710938 -0.21972656  0.34765625
 -0.00494385 -0.01757812  0.12988281 -0.15917969 -0.15527344 -0.16992188
  0.06933594 -0.14257812 -0.07958984  0.16992188  0.12109375  0.125
 -0.06494141]

GloVe vector embedding dimension:  (200,)

Printing a subset of the whole vector for the word 'minor':
[-0.81609   -0.10689   -0.53273   -0.20412   -0.37599    0.12386
 -0.12322   -0.80024   -0.017576   0.30317   -0.068888  -1.0975
 -0.56645    0.37651   -0.46615   -0.42359   -0.076921  -0.012701
 -0.0067806]


In [6]:
# most similar words - by word
n_similar = 15
thisWord = 'major'

print("Most similar {} words (by word) for '{}' by word2vec model:".format(n_similar, thisWord))
display(w2v.similar_by_word(thisWord, n_similar))
print("\nMost similar {} words (by word) for '{}' by GloVe model:".format(n_similar, thisWord))
display(glove.similar_by_word(thisWord, n_similar))

Most similar 15 words (by word) for 'major' by word2vec model:


[('biggest', 0.657293975353241),
 ('significant', 0.619140088558197),
 ('big', 0.6057686805725098),
 ('main', 0.5380213856697083),
 ('key', 0.5354758501052856),
 ('huge', 0.5329675674438477),
 ('signficant', 0.5157025456428528),
 ('amajor', 0.49914824962615967),
 ('largest', 0.49542921781539917),
 ('greatest', 0.49444860219955444),
 ('Major', 0.4887048006057739),
 ('massive', 0.4786103069782257),
 ('minor', 0.47539088129997253),
 ('substantial', 0.46729937195777893),
 ('monumental', 0.46554115414619446)]


Most similar 15 words (by word) for 'major' by GloVe model:


[('minor', 0.7037895321846008),
 ('huge', 0.6762630939483643),
 ('massive', 0.655586838722229),
 ('big', 0.6330057382583618),
 ('biggest', 0.6215412020683289),
 ('another', 0.6144845485687256),
 ('third', 0.6137520670890808),
 ('any', 0.6084322333335876),
 ('serious', 0.6081491112709045),
 ('issues', 0.6023706197738647),
 ('first', 0.5963584780693054),
 ('having', 0.5878738164901733),
 ('two', 0.5866069197654724),
 ('other', 0.5805503129959106),
 ('many', 0.5805253982543945)]

The word2vec and GloVe models return different lists of similar words for the same query.

This is expected, as the two models were trained with different algorithms on different corpora (Google News vs. Twitter).
This variety can be used to capture more potential candidates, but it also makes the next step, screening out the less relevant ones, harder. Note that both lists include the antonym 'minor' (it ranks first for GloVe), and the word2vec list includes misspelled tokens such as 'signficant' and 'amajor'.

One option is to use **APSyn/APSynP** as a more decisive similarity metric. Another is to rule out weak candidates using outlier detection techniques.
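
One simple way to combine the two candidate lists, sketched below, is to keep only the words that both models return and rank them by their mean similarity. This is an illustrative heuristic, not APSyn/APSynP and not an outlier detection method; `shared_candidates` is a hypothetical helper, and the scores are copied from the two outputs above.

```python
# Neighbour lists for 'major', copied from the outputs above.
w2v_neighbors = [
    ('biggest', 0.657293975353241), ('significant', 0.619140088558197),
    ('big', 0.6057686805725098), ('main', 0.5380213856697083),
    ('key', 0.5354758501052856), ('huge', 0.5329675674438477),
    ('signficant', 0.5157025456428528), ('amajor', 0.49914824962615967),
    ('largest', 0.49542921781539917), ('greatest', 0.49444860219955444),
    ('Major', 0.4887048006057739), ('massive', 0.4786103069782257),
    ('minor', 0.47539088129997253), ('substantial', 0.46729937195777893),
    ('monumental', 0.46554115414619446),
]
glove_neighbors = [
    ('minor', 0.7037895321846008), ('huge', 0.6762630939483643),
    ('massive', 0.655586838722229), ('big', 0.6330057382583618),
    ('biggest', 0.6215412020683289), ('another', 0.6144845485687256),
    ('third', 0.6137520670890808), ('any', 0.6084322333335876),
    ('serious', 0.6081491112709045), ('issues', 0.6023706197738647),
    ('first', 0.5963584780693054), ('having', 0.5878738164901733),
    ('two', 0.5866069197654724), ('other', 0.5805503129959106),
    ('many', 0.5805253982543945),
]


def shared_candidates(list_a, list_b):
    """Words present in both lists, sorted by mean similarity (descending)."""
    scores_a, scores_b = dict(list_a), dict(list_b)
    shared = set(scores_a) & set(scores_b)
    merged = [(w, (scores_a[w] + scores_b[w]) / 2) for w in shared]
    return sorted(merged, key=lambda item: item[1], reverse=True)


ranked = shared_candidates(w2v_neighbors, glove_neighbors)
for word, score in ranked:
    print(word, round(score, 4))
```

The intersection drops the misspellings and function words, but 'minor' survives because both models rank it highly, so an antonym filter would still be needed.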

In [7]:
# most similar words - by vector
print("Most similar {} words (by vector) for '{}' by word2vec model:".format(n_similar, thisWord))
display(w2v.similar_by_vector(thisWord, n_similar))
print("\nMost similar {} words (by vector) for '{}' by GloVe model:".format(n_similar, thisWord))
display(glove.similar_by_vector(thisWord, n_similar))

Most similar 15 words (by vector) for 'major' by word2vec model:


[('biggest', 0.657293975353241),
 ('significant', 0.619140088558197),
 ('big', 0.6057686805725098),
 ('main', 0.5380213856697083),
 ('key', 0.5354758501052856),
 ('huge', 0.5329675674438477),
 ('signficant', 0.5157025456428528),
 ('amajor', 0.49914824962615967),
 ('largest', 0.49542921781539917),
 ('greatest', 0.49444860219955444),
 ('Major', 0.4887048006057739),
 ('massive', 0.4786103069782257),
 ('minor', 0.47539088129997253),
 ('substantial', 0.46729937195777893),
 ('monumental', 0.46554115414619446)]


Most similar 15 words (by vector) for 'major' by GloVe model:


[('minor', 0.7037895321846008),
 ('huge', 0.6762630939483643),
 ('massive', 0.655586838722229),
 ('big', 0.6330057382583618),
 ('biggest', 0.6215412020683289),
 ('another', 0.6144845485687256),
 ('third', 0.6137520670890808),
 ('any', 0.6084322333335876),
 ('serious', 0.6081491112709045),
 ('issues', 0.6023706197738647),
 ('first', 0.5963584780693054),
 ('having', 0.5878738164901733),
 ('two', 0.5866069197654724),
 ('other', 0.5805503129959106),
 ('many', 0.5805253982543945)]

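The *by word* and *by vector* results above are identical. A likely reason is that the cell passes the word string `thisWord` to `similar_by_vector` rather than its embedding, so gensim appears to resolve it as a word lookup in both cases. A remaining analysis is to pass the actual vector (e.g. `w2v.get_vector(thisWord)`) and contrast those results with the *by word* ones.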
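
As a rough illustration of what a true by-vector query changes, the sketch below runs a nearest-neighbour search over made-up toy vectors. `nearest_by_vector` and `nearest_by_word` are hypothetical helpers, not gensim functions. They mimic the behaviour one would expect: a raw-vector query has no notion of the query word, so that word comes back as its own best match, while a by-word query excludes it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_by_vector(vectors, query_vec, topn):
    """Rank every word in the vocabulary against a raw query vector."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def nearest_by_word(vectors, word, topn):
    """Same search, but the query word itself is removed from the results."""
    hits = nearest_by_vector(vectors, vectors[word], topn + 1)
    return [(w, s) for w, s in hits if w != word][:topn]


# Made-up toy vectors (assumption): not real embedding values.
toy = {
    "major": [0.9, 0.3, 0.1],
    "minor": [1.0, 0.2, 0.0],
    "big": [0.7, 0.6, 0.2],
    "small": [0.2, 1.0, 0.3],
}

by_vec = nearest_by_vector(toy, toy["major"], 2)
by_word = nearest_by_word(toy, "major", 2)
print("by vector:", [w for w, _ in by_vec])
print("by word:  ", [w for w, _ in by_word])
```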

## Next steps
- Try other forms of embeddings that may improve upon word2vec, e.g.
    - GloVe
    - fastText
- Inspect the performance on less frequent words (fastText should perform better in this scenario)
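
The expectation that fastText handles rare words better comes from its use of character n-grams: a word's vector is built from the vectors of its subword pieces, so a rare or misspelled token shares pieces with frequent words. The sketch below only extracts the n-grams and does not train or load any model. `char_ngrams` is a hypothetical helper, and the 3 to 6 length range is assumed here as the commonly cited default.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    token = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# 'amajor' is one of the misspelled tokens in the word2vec output above.
grams_typo = set(char_ngrams("amajor"))
grams_word = set(char_ngrams("major"))
shared = grams_typo & grams_word
print(sorted(shared))
```

Because the two tokens share most of their n-grams, a subword model can place 'amajor' near 'major' even if the misspelling is rare in the training data.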

## Other resources
- http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
- https://www.quora.com/Where-can-I-find-some-pre-trained-word-vectors-for-natural-language-processing-understanding
- https://textminingonline.com/getting-started-with-word2vec-and-glove-in-python
