# Synonyms using pre-trained word2vec embeddings

This notebook uses word2vec embeddings pre-trained on the Google News corpus to derive synsets (synonym sets): groups of words with similar meanings.

**Gensim word2vec APIs**: https://radimrehurek.com/gensim/models/word2vec.html

**Pre-trained word2vec model on Google News**: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

The above model is a binary file that can be loaded into the environment.


In [20]:
import gensim                     # implements word2vec model infrastructure and provides interfacing APIs 
import warnings
warnings.filterwarnings('ignore')

In [2]:
# load pre-trained vectors
w2v = gensim.models.KeyedVectors.load_word2vec_format('./pretrained/GoogleNews-vectors-negative300.bin', binary=True)  

In [15]:
# cosine similarity between word pairs (higher means more similar)
pair1 = ['minor','small']
pair2 = ['minor','major']
cos_sim1 = w2v.similarity(pair1[0], pair1[1])
cos_sim2 = w2v.similarity(pair2[0], pair2[1])

print('Cosine similarity of {}: {}'.format(pair1, cos_sim1))
print('Cosine similarity of {}: {}'.format(pair2, cos_sim2))

Cosine similarity of ['minor', 'small']: 0.3416362702846527
Cosine similarity of ['minor', 'major']: 0.47539088129997253


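The problem above is that high similarity does not always indicate synonymy: the target word 'minor' is closer to its antonym 'major' (0.475) than to 'small' (0.342). This is likely because word2vec learns from the contexts in which words appear, and antonyms tend to share contexts.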
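For reference, the score reported by `similarity` is the cosine of the angle between the two word vectors. Below is a minimal sketch of that computation in plain Python. The `cosine_similarity` helper and the 3-dimensional `toy` vectors are illustrative assumptions; they are neither gensim's implementation nor real GoogleNews embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for illustration only.
toy = {
    'minor': [0.9, 0.1, -0.3],
    'major': [0.8, 0.2, 0.4],
    'small': [0.2, 0.9, -0.4],
}

for other in ('small', 'major'):
    score = cosine_similarity(toy['minor'], toy[other])
    print('minor vs {}: {:.3f}'.format(other, score))
```

The score depends only on the direction of the vectors, not their length, so it measures how alike the contexts of two words are rather than whether the words mean the same thing.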

In [27]:
# vector representation of the word
vec_pair1_0 = w2v.get_vector(pair1[0])
print("Vector embedding dimension: ",vec_pair1_0.shape)
print("\nPrinting a subset of the whole vector for the word '{}':".format(pair1[0]))
print(vec_pair1_0[1:20])  # elements at indices 1-19; index 0 is skipped

Vector embedding dimension:  (300,)

Printing a subset of the whole vector for the word 'minor':
[ 0.06640625 -0.00228882  0.00402832 -0.28710938 -0.21972656  0.34765625
 -0.00494385 -0.01757812  0.12988281 -0.15917969 -0.15527344 -0.16992188
  0.06933594 -0.14257812 -0.07958984  0.16992188  0.12109375  0.125
 -0.06494141]


In [12]:
# most similar words - by word
w2v.similar_by_word('major')

[('biggest', 0.657293975353241),
 ('significant', 0.619140088558197),
 ('big', 0.6057686805725098),
 ('main', 0.5380213856697083),
 ('key', 0.5354758501052856),
 ('huge', 0.5329675674438477),
 ('signficant', 0.5157025456428528),
 ('amajor', 0.49914824962615967),
 ('largest', 0.49542921781539917),
 ('greatest', 0.49444860219955444)]

In [19]:
# most similar words - by vector
w2v.similar_by_vector(vec_pair1_0)

[('minor', 1.0),
 ('serious', 0.5410230159759521),
 ('slight', 0.530189573764801),
 ('media_minHeight_=', 0.5136477947235107),
 ('Dr._Silvia_Priori', 0.5083508491516113),
 ('Minor', 0.5080995559692383),
 ('Soaked_hillsides_gave', 0.49214568734169006),
 ('minimal', 0.4815067946910858),
 ('WBO_lightweight_belts', 0.4774216115474701),
 ('major', 0.4753909111022949)]

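At first glance, the similarity query *by word* appears to return cleaner neighbors than the query *by vector*. However, the two queries above use different words ('major' by word, 'minor' by vector), so they are not directly comparable. This observation rests on a single sample and needs further analysis with more samples.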
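One way to make the comparison fairer is to run both lookups on the same word. The sketch below does this with a toy nearest-neighbor search in plain Python. It assumes that the by-word lookup ranks neighbors by the same cosine similarity but leaves out the query word itself, which is consistent with 'minor' appearing at 1.0 in the by-vector output above. The `nearest` helper and the `toy_vectors` values are hypothetical and are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, vectors, topn=3, exclude=()):
    """Return the topn (word, score) pairs ranked by cosine similarity."""
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy embeddings, made up for illustration only.
toy_vectors = {
    'minor':   [0.9, 0.1, -0.3],
    'slight':  [0.8, 0.3, -0.2],
    'major':   [0.8, 0.2, 0.4],
    'serious': [0.5, -0.4, 0.6],
}

query = toy_vectors['minor']
by_vector = nearest(query, toy_vectors)                  # query word is kept
by_word = nearest(query, toy_vectors, exclude={'minor'}) # query word is dropped

print('by vector:', [word for word, _ in by_vector])
print('by word:  ', [word for word, _ in by_word])
```

Under that assumption the two rankings differ only by the presence of the query word, so any quality difference between them should be checked on the same input word before drawing conclusions.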

## Next steps:
- Try other embeddings, e.g. GloVe or fastText, that may improve upon word2vec
- Inspect the performance on less frequent words (fastText should perform better in this scenario)

## Other resources
- http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
- https://www.quora.com/Where-can-I-find-some-pre-trained-word-vectors-for-natural-language-processing-understanding
- https://textminingonline.com/getting-started-with-word2vec-and-glove-in-python
