<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/master/embeddings/Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec

Word2Vec is one of the most popular word embedding methods, developed at Google. The pretrained Word2Vec vectors that Google released were trained on the Google News dataset (about 100 billion words).

The architecture of Word2Vec is simple: a feed-forward neural network with just one hidden layer. Hence, it is sometimes described as a shallow neural network.
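To make the "one hidden layer" idea concrete, here is a minimal sketch of a single forward pass in plain Python. The toy vocabulary, the 3-dimensional vectors, and the names `w_in`, `w_out`, and `forward` are my own assumptions for illustration. Real implementations avoid the full softmax shown here by using hierarchical softmax or negative sampling.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_index, w_in, w_out):
    """Look up the hidden vector for one word, then score every vocabulary word."""
    # Multiplying a one-hot input by w_in is the same as picking one row.
    hidden = w_in[word_index]
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in w_out]
    return hidden, softmax(scores)


random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (hypothetical)
dim = 3                                      # toy embedding size
w_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

hidden, probs = forward(vocab.index("cat"), w_in, w_out)
print(len(hidden), len(probs))
```

After training, the rows of `w_in` are what we use as the word embeddings.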

Depending on the way the embeddings are learned, Word2Vec is classified into two approaches:

- Continuous Bag-of-Words (CBOW)
- Skip-gram model

The Continuous Bag-of-Words (CBOW) model learns to predict the focus word from its neighboring words, whereas the Skip-gram model learns to predict the neighboring words from the focus word.
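The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The sketch below is only an illustration; the helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the window sizes are my own assumptions, not part of any library.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the input is the list of context words, the target is the focus word."""
    return [(context_window(tokens, i, window), focus)
            for i, focus in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the input is the focus word, the target is one context word."""
    return [(focus, ctx)
            for i, focus in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens, window=1)[1])
print(skipgram_examples(tokens, window=1)[:3])
```

CBOW yields one example per position, while Skip-gram yields one example per (focus, neighbor) pair.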

There is a lot of online material that explains the concepts behind word embeddings, and I can't explain them any better. So my focus here is on how to use pre-trained word embeddings. The resources below cover the details.

## Resources

- [Word Embeddings - Sebastian Ruder](https://ruder.io/word-embeddings-1/)
- [Skip Gram Model - Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [Learning Word Embeddings Andrew NG](https://www.youtube.com/watch?v=xtPXjvwCt64)
- [Word2Vec Andrew NG](https://www.youtube.com/watch?v=jak0sKPoKu8)
- [Stanford NLP Lecture 1](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=1)
- [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [Google Word2Vec](https://code.google.com/archive/p/word2vec/)


Many libraries support Word2Vec-based models natively, and pretrained models are also available. Covering everything would be overwhelming, so I will show the usage with one prominent library:
- [Gensim](https://radimrehurek.com/gensim/models/word2vec.html)


# Gensim

Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

Gensim includes streamed parallelized implementations of fastText, word2vec and doc2vec algorithms, as well as latent semantic analysis (LSA, LSI, SVD), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), tf-idf and random projections. [source](https://en.wikipedia.org/wiki/Gensim)

References:

- [How to download Pretrained word embeddings in gensim](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html)

- [Using Fasttext in gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-auto-examples-tutorials-run-fasttext-py)

### Initial Setup

In [0]:
import gensim

In [0]:
from gensim.models import KeyedVectors

In [31]:
!ls

sample_data  wiki-news-300d-1M.vec  wiki-news-300d-1M.vec.zip


### Loading the Pretrained Word2Vec model

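I am using a model downloaded from fastText. Its `.vec` files are stored in the word2vec text format, which is why `KeyedVectors.load_word2vec_format` can read them. You can download any of the English word vectors provided by fastText [here](https://fasttext.cc/docs/en/english-vectors.html).

fastText also provides pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia. According to the fastText site, those models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives, and three word analogy datasets (for French, Hindi and Polish) are distributed alongside them. The `wiki-news-300d-1M.vec` file used here comes from the English vectors page linked above; as its name suggests, it holds 1 million 300-dimensional word vectors.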

In [0]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

In [0]:
!unzip wiki-news-300d-1M.vec.zip

In [0]:
!ls

In [0]:
# loading the model
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')



In [0]:
model['hello'].shape

(300,)

In [0]:
model['hello'][:50]

array([-0.192 ,  0.1544,  0.0467,  0.0592,  0.1369, -0.0772, -0.0384,
        0.0537,  0.1435, -0.1353, -0.053 , -0.0668,  0.0185,  0.0873,
        0.0903,  0.1663,  0.0035, -0.2102,  0.201 , -0.0249, -0.0279,
       -0.3241, -0.0066, -0.0264, -0.1628, -0.1094, -0.0882,  0.0097,
        0.1228,  0.0059, -0.051 ,  0.0649,  0.1577,  0.0174,  0.0991,
        0.1328, -0.0586,  0.1814, -0.0098,  0.1877,  0.0518, -0.0697,
       -0.0629, -0.1981, -0.1373, -0.0811, -0.0631, -0.0639,  0.1244,
       -0.0247], dtype=float32)
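The `.vec` file loaded above is plain text: a header line with the word count and the vector dimension, followed by one word and its values per line. The sketch below parses that layout in plain Python so the format is visible; `read_vec` and the sample data are hypothetical and only for illustration. In practice I would stay with `load_word2vec_format`, which, as far as I know, also accepts a `limit` argument to load only the first N vectors.

```python
import io


def read_vec(handle, limit=None):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 v2 ...' per line."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early, similar in spirit to a load limit
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return dim, vectors


# A tiny made-up file in the same layout (the values are not real embeddings).
sample = io.StringIO(
    "3 4\n"
    "hello 0.1 0.2 0.3 0.4\n"
    "world 0.5 0.6 0.7 0.8\n"
    "night -0.1 0.0 0.1 0.2\n"
)
dim, vectors = read_vec(sample, limit=2)
print(dim, sorted(vectors))
```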

### Word Similarity

Here, we will see how similar two words are to each other. The `similarity` method returns the cosine similarity between the two word vectors.

In [0]:
print(f'Similarity between night and nights: {model.similarity("night", "nights")}')
print(f'Similarity between red and blue: {model.similarity("red", "blue")}')
print(f'Similarity between hello and heyy: {model.similarity("hello", "heyy")}')
print(f'Similarity between king and queen: {model.similarity("king", "queen")}')
print(f'Similarity between london and moscow: {model.similarity("london", "moscow")}')
print(f'Similarity between car and bike: {model.similarity("car", "bike")}')

Similarity between night and nights: 0.7854782938957214
Similarity between red and blue: 0.8833013772964478
Similarity between hello and heyy: 0.56700599193573
Similarity between king and queen: 0.7638539671897888
Similarity between london and moscow: 0.5399850606918335
Similarity between car and bike: 0.6203061938285828
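The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. Here is a small sketch on toy 2-dimensional vectors (the vectors and the function name `cosine_similarity` are my own, chosen for illustration).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 2.0]
print(cosine_similarity(a, b))  # 45 degrees apart
print(cosine_similarity(a, c))  # orthogonal
```

Because the lengths are divided out, only the direction of the vectors matters, not their magnitude.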




### Most Similar Words

Here, we will ask our model to find the words that are most similar to a given word.

In [0]:
similar = model.most_similar("january")
for i in similar:
    print(i)

('december', 0.8822824954986572)
('november', 0.8583322763442993)
('october', 0.857945442199707)
('july', 0.8488935232162476)
('february', 0.8297372460365295)
('june', 0.8287782669067383)
('september', 0.817267656326294)
('feb', 0.7887735366821289)
('april', 0.7763540744781494)
('jan', 0.7581983804702759)




### Odd-One-Out

Here, we ask our model to give us the word that does not belong to the list!

In [0]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
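As I understand it, `doesnt_match` works roughly like this: normalize every word vector, average them, and return the word whose vector is least similar to that average. The sketch below reproduces the idea on made-up 2-dimensional vectors; `odd_one_out` and the toy values are assumptions for illustration, not gensim's actual code.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """vectors: dict of word -> list of floats. Return the word farthest from the mean."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(v[i] for v in normed.values()) / len(normed) for i in range(dim)])
    # Dot product of unit vectors is their cosine similarity.
    scores = {w: sum(a * b for a, b in zip(v, mean)) for w, v in normed.items()}
    return min(scores, key=scores.get)


toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.9],  # deliberately points in a different direction
}
print(odd_one_out(toy))
```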




### Analogy difference

Which word is to women as king is to queen? In vector terms, we look for the word whose vector is closest to `king - queen + women`.

In [0]:
model.most_similar(positive=["women", "king"], negative=["queen"])



[('men', 0.7986569404602051),
 ('people', 0.6009145975112915),
 ('woman', 0.5940523147583008),
 ('Men', 0.5867083668708801),
 ('minorities', 0.5796009302139282),
 ('soldiers', 0.5783532857894897),
 ('man', 0.5754462480545044),
 ('Women', 0.5716798305511475),
 ('husbands', 0.5706273913383484),
 ('persons', 0.5616658926010132)]

In [0]:
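def analogy(x1, x2, y1):
    """Return the word that is to y1 as x2 is to x1 (x1 : x2 :: y1 : ?)."""
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]  # top-ranked word, without its similarity score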

In [0]:
analogy('japan', 'japanese', 'china')



'chinese'
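The analogy lookups above boil down to vector arithmetic followed by a nearest-neighbor search. Below is a minimal sketch on made-up 3-dimensional vectors; `toy_analogy` and the toy values are hypothetical, and gensim's `most_similar` differs in details (for example, it averages the normalized vectors rather than summing them).

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of u and v."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def toy_analogy(vectors, x1, x2, y1):
    """x1 is to x2 as y1 is to ?  Search near x2 - x1 + y1, skipping the inputs."""
    a, b, c = unit(vectors[x1]), unit(vectors[x2]), unit(vectors[y1])
    target = [q - p + r for p, q, r in zip(a, b, c)]
    candidates = [w for w in vectors if w not in {x1, x2, y1}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


toy = {
    "japan": [1.0, 0.0, 0.0],
    "japanese": [1.0, 0.0, 1.0],
    "china": [0.0, 1.0, 0.0],
    "chinese": [0.0, 1.0, 1.0],
    "tokyo": [1.0, 0.2, 0.0],
}
print(toy_analogy(toy, "japan", "japanese", "china"))
```

Excluding the input words matters: without that step, the nearest neighbor of the target is often one of the query words themselves.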

# Training Word2Vec Models

You can train your own custom Word2Vec model on your data using Gensim or fastText. Please refer to the following links for training Word2Vec:

- [Training using Gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py)

- [Training Using Fasttext](https://fasttext.cc/docs/en/unsupervised-tutorial.html)

- [Blog using Gensim for training](https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XsK3oRMzaRs)

# Visualizations (Will be added soon !!)