# Word Embeddings using Word2Vec



In [1]:
import gensim.downloader as gen_api

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

import numpy as np


In [2]:
gen_info = gen_api.info()

In [3]:
for model_name, model_data in gen_info["models"].items():
  print(model_name)
  print(model_data["description"])
  print()

fasttext-wiki-news-subwords-300
1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

conceptnet-numberbatch-17-06-300
ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.

word2vec-ruscorpora-300
Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.

word2vec-google-news-300
Pre-trained vectors trained on a part of the Google N

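Gensim's downloader offers four main families of pre-trained embeddings: fastText, ConceptNet Numberbatch, Word2Vec, and GloVe. Note that the algorithm is one thing and the training corpus is another. The same algorithm trained on a different corpus gives a somewhat different embedding, because each word is learned from the contexts it appears in within that corpus.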
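To make the algorithm-versus-corpus split concrete, here is a minimal sketch that pulls the two apart from the model names printed above. It assumes the names follow a `family-corpus-dimension` pattern, which holds for the four names in the listing but is not a documented guarantee; `split_model_name` is a hypothetical helper, not part of Gensim.

```python
# Model names copied from the listing above.
MODEL_NAMES = [
    "fasttext-wiki-news-subwords-300",
    "conceptnet-numberbatch-17-06-300",
    "word2vec-ruscorpora-300",
    "word2vec-google-news-300",
]


def split_model_name(name):
    """Split an assumed 'family-corpus-dimension' name into its three parts."""
    family, _, rest = name.partition("-")   # text before the first hyphen
    corpus, _, dim = rest.rpartition("-")   # text after the last hyphen is the size
    return family, corpus, int(dim)


parsed = [split_model_name(n) for n in MODEL_NAMES]
for family, corpus, dim in parsed:
    print(f"{family:12s} {corpus:22s} {dim}")
```

Two of the four names share the `word2vec` family but differ in corpus (`ruscorpora` versus `google-news`), which is exactly the distinction made above.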

FYI: "uncased" means all text was converted to lower case before training (uncased == no case distinction).

First we need to load the word embeddings. Because the vectors have already been trained, this is called a pre-trained model.

You can download the pre-trained Word2Vec model from Google here https://code.google.com/p/word2vec/

In [4]:
# Load the Word2Vec Google News 300 model. The download is about 1.6 GB, so this takes a while.
w2v = gen_api.load('word2vec-google-news-300')

After loading, let's see how the model represents words.

In [5]:
w2v.most_similar("coffee")

[('coffees', 0.721267819404602),
 ('gourmet_coffee', 0.7057086825370789),
 ('Coffee', 0.6900454759597778),
 ('o_joe', 0.6891065835952759),
 ('Starbucks_coffee', 0.6874972581863403),
 ('coffee_beans', 0.6749704480171204),
 ('latté', 0.664122462272644),
 ('cappuccino', 0.662549614906311),
 ('brewed_coffee', 0.6621608138084412),
 ('espresso', 0.6616826057434082)]

In [6]:
w2v['coffee']  # What is the vector for this word?

array([-1.61132812e-01, -1.36718750e-01, -3.73046875e-01,  6.17187500e-01,
        1.08398438e-01,  2.72216797e-02,  1.00097656e-01, -1.51367188e-01,
       -1.66015625e-02,  3.80859375e-01,  6.54296875e-02, -1.31835938e-01,
        2.53906250e-01,  9.08203125e-02,  2.86865234e-02,  2.53906250e-01,
       -2.05078125e-01,  1.64062500e-01,  2.20703125e-01, -1.74804688e-01,
       -2.01171875e-01,  1.30859375e-01, -3.22265625e-02, -2.41210938e-01,
       -3.19824219e-02,  2.48046875e-01, -2.37304688e-01,  2.89062500e-01,
        1.64794922e-02,  1.29394531e-02,  1.72119141e-02, -3.53515625e-01,
       -1.66992188e-01, -5.90820312e-02, -2.81250000e-01,  9.94873047e-03,
       -1.94091797e-02, -3.22265625e-01,  1.73339844e-02, -5.83496094e-02,
       -2.59765625e-01,  1.42669678e-03,  5.81054688e-02,  1.13769531e-01,
       -8.64257812e-02,  3.54003906e-02, -4.29687500e-01,  2.86865234e-03,
        6.98852539e-03,  1.80664062e-01, -1.79687500e-01,  2.95410156e-02,
       -1.56250000e-01, -

In [7]:
w2v['brexit']  # What about a word that is outside the model's vocabulary?

KeyError: "Key 'brexit' not present"
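Indexing a word that is not in the vocabulary raises `KeyError`, as shown above. A membership check before the lookup avoids the exception. The sketch below uses a plain dict with made-up two-number vectors as a stand-in for the loaded model, and `lookup` is a hypothetical helper. As far as I know, Gensim's `KeyedVectors` also supports the `in` operator, so the same pattern should work with `w2v`, but treat that as an assumption to verify.

```python
# Stand-in for the loaded model: a plain dict from word to vector.
# The two-number vectors are made up for illustration only.
toy_vectors = {
    "coffee": [0.2, 0.7],
    "Coffee": [0.3, 0.6],  # lookups are case-sensitive, so this is a separate key
}


def lookup(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return default


print(lookup(toy_vectors, "coffee"))
print(lookup(toy_vectors, "brexit"))
```

Returning a default keeps a long loop over many words from stopping at the first unknown one; whether to skip such words or substitute something else depends on the task.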

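Using cosine similarity to measure how similar two words are

With cosine_similarity, higher means more similar.

1.0 = the vectors point in the same direction (identical)
0.0 = the vectors are orthogonal (unrelated)
-1.0 = the vectors point in opposite directions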
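The three reference values follow directly from the definition: the cosine similarity of vectors u and v is their dot product divided by the product of their lengths. The sketch below computes it by hand on made-up 2-d vectors (not real embeddings) chosen to hit each case. `cosine` is an illustrative helper, not the scikit-learn function.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-d vectors, one for each reference case.
a = [1.0, 2.0]
same_direction = [2.0, 4.0]   # a scaled by 2
orthogonal = [-2.0, 1.0]      # dot product with a is zero
opposite = [-1.0, -2.0]       # a scaled by -1

print(cosine(a, same_direction))  # approximately 1
print(cosine(a, orthogonal))      # 0
print(cosine(a, opposite))        # approximately -1
```

Note that `same_direction` is twice as long as `a` and still scores about 1: cosine similarity ignores vector length and only compares direction.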

In [8]:
w2v.distance("coffee", "cream")  # cosine distance = 1 - cosine similarity, so similar words get a smaller value

0.7535746246576309

In [31]:
text1 = w2v["coffee"]
text2 = w2v["cream"]
text3 = w2v["basketball"]
cosine_similarity([text1], [text2])

array([[0.24642539]], dtype=float32)
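The two numbers for coffee and cream look unrelated at first, but they are the same measurement seen from two sides: `distance` reports cosine distance, and cosine distance plus cosine similarity equals 1. A quick check using the values copied from the outputs above:

```python
import math

# Values copied from the notebook outputs above.
distance_coffee_cream = 0.7535746246576309  # w2v.distance("coffee", "cream")
similarity_coffee_cream = 0.24642539        # cosine_similarity([text1], [text2])

total = distance_coffee_cream + similarity_coffee_cream
print(total)

# Equal to 1 up to the rounding of the printed float32 value.
print(math.isclose(total, 1.0, abs_tol=1e-6))
```

So a larger distance and a smaller similarity say the same thing; just keep track of which of the two a given call returns.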

In [25]:
w2v.distance("coffee", "basketball")  # unrelated words should be farther apart (larger distance)

0.9099433422088623

In [33]:
cosine_similarity([text1], [text3])

array([[0.09005667]], dtype=float32)

In [34]:
# Want to try king - man + woman? Note that the negative term below is 'men' (plural), not 'man'.
w2v.most_similar_cosmul(positive=['king', 'woman'], negative=['men'])  # ranks words with the multiplicative combination objective, a score built from cosine similarities

[('queen', 0.9834331274032593),
 ('monarch', 0.954605221748352),
 ('princess', 0.9464353322982788),
 ('ruler', 0.9094336628913879),
 ('prince', 0.8853203654289246),
 ('crown_prince', 0.8740110993385315),
 ('maharaja', 0.8640182614326477),
 ('sultan', 0.8614116311073303),
 ('King_Ahasuerus', 0.8606166839599609),
 ('Queen_Consort', 0.845000147819519)]
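To show roughly what the ranking above is doing, here is a minimal sketch of a multiplicative scoring rule in the style of 3CosMul: a candidate scores high when it is close to every positive word and far from every negative word. This is my reading of the approach, not Gensim's actual code. The 3-d vectors are hypothetical (axes loosely meaning royalty, male, female), and `cosmul` and `TOY` are illustrative names. The sketch uses 'man' as the negative term, whereas the cell above passes 'men'.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical 3-d vectors: [royalty, male, female]. Not real embeddings.
TOY = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "prince": [0.8, 0.7, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.3, 0.3, 0.3],
}


def cosmul(vectors, positive, negative, eps=1e-6):
    """Rank candidate words by a 3CosMul-style score, best first."""
    inputs = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in inputs:
            continue  # the query words themselves are not candidates
        # Shift each cosine from [-1, 1] to [0, 1] so the products stay non-negative.
        pos = math.prod((1 + cosine(vec, vectors[p])) / 2 for p in positive)
        neg = math.prod((1 + cosine(vec, vectors[n])) / 2 for n in negative)
        scores.append((word, pos / (neg + eps)))  # eps guards against division by zero
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = cosmul(TOY, positive=["king", "woman"], negative=["man"])
print(ranking)
```

With these toy vectors 'queen' comes out on top because it is close to both 'king' and 'woman' while staying far from 'man'. The scores are ratios, not cosine similarities, so they are not bounded by 1.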

In [35]:
# Guess the answer before you run this one: restaurant - dinner + cocktail
w2v.most_similar_cosmul(positive=['restaurant', 'cocktail'], negative=['dinner'])  # same multiplicative combination objective as above

[('eatery', 0.8693193793296814),
 ('bartender', 0.8536876440048218),
 ('bartenders', 0.8526809811592102),
 ('nightspot', 0.8493297100067139),
 ('Buddha_Bar', 0.8486438393592834),
 ('Pegu_Club', 0.8456864953041077),
 ('brewpub', 0.8379691243171692),
 ('La_Floridita', 0.836773157119751),
 ('cafe', 0.8341168165206909),
 ('Tres_Agaves', 0.830361545085907)]