# Word2Vec

Word2vec is a group of models used to transform text data into word embeddings. It is often more useful than count-based vectorizers such as CountVectorizer or TfidfVectorizer because:

1. It captures the context in which words appear, so the vectors encode some semantic information
2. Its vectors have a fixed dimensionality (a hyperparameter), which helps with computation
3. The embeddings are usually pre-trained on vast amounts of data and can be used directly in other models

The word vectors created by the models are positioned in the vector space such that words sharing common contexts in the corpus are located close to each other.
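"Closer" is commonly measured with cosine similarity. The sketch below illustrates the idea with three made-up 3-dimensional vectors; the words and values in `toy_vectors` are assumptions chosen for illustration, not output from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, invented for this example only
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Words assumed to share contexts ("cat", "dog") score higher than unrelated ones
print(sim_cat_dog > sim_cat_car)
```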

The two main model architectures in word2vec are:

1. CBOW (Continuous Bag of Words) - Predicts the current word from the surrounding context words
2. SG (Skip-gram) - Predicts the surrounding context words from the current word
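The two architectures differ mainly in which side of the (current word, context) relationship is the input. A minimal sketch of the training examples each one would see, assuming a tokenized sentence and a symmetric window (the helper names `context_indices`, `cbow_pairs` and `skipgram_pairs` are hypothetical):

```python
def context_indices(i, length, window):
    """Positions within `window` of position i, excluding i itself."""
    start, stop = max(0, i - window), min(length, i + window + 1)
    return [j for j in range(start, stop) if j != i]

def cbow_pairs(tokens, window):
    """One (context words -> current word) example per position."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
        for i in range(len(tokens))
    ]

def skipgram_pairs(tokens, window):
    """One (current word -> context word) example per context position."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in context_indices(i, len(tokens), window)
    ]

tokens = "hello world word".split()
print(cbow_pairs(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

CBOW produces one example per word, while skip-gram produces one example per (word, context word) pair, so skip-gram sees more training examples from the same text.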

#### Parameters

<u>Training algo</u> - Hierarchical softmax uses a Huffman tree to approximate the conditional log-likelihood that the model seeks to maximize. It works better for infrequent words and a lower number of epochs. Negative sampling instead approaches the maximization problem by minimizing the log-likelihood of sampled negative instances. It works better for frequent words, lower-dimensional vectors and a higher number of epochs.
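To make the negative sampling idea more concrete, here is a rough sketch of the per-pair loss in its commonly cited form: the true context word is pushed towards the center word, and each sampled negative word is pushed away. The function name and the random vectors are assumptions for illustration; a real implementation would sample the negatives from a word-frequency distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with k sampled negative words.

    loss = -log sigmoid(context . center) - sum_k log sigmoid(-negative_k . center)
    """
    positive_term = np.log(sigmoid(np.dot(context_vec, center_vec)))
    negative_term = np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return float(-(positive_term + negative_term))

# Random stand-ins for embeddings: 5 dimensions, 10 negative samples
rng = np.random.default_rng(0)
center = rng.normal(size=5)
context = rng.normal(size=5)
negatives = rng.normal(size=(10, 5))

loss = negative_sampling_loss(center, context, negatives)
print(loss)
```

Only 1 + k output vectors are touched per training pair, instead of one per vocabulary word as in a full softmax.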

<u>Sub-Sampling</u> - High-frequency words carry less information, so words whose frequency exceeds this threshold are randomly subsampled to increase training speed.
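A small sketch of how such a threshold can be applied, assuming the discard rule from the original word2vec paper, P(discard) = 1 - sqrt(t / f(w)), where f(w) is the relative frequency of the word and t is the threshold. Actual implementations may use a slightly different variant, and the threshold of 0.1 below is unrealistically high so that the effect is visible on a ten-word corpus.

```python
import math
from collections import Counter

def discard_probability(word_freq, threshold=1e-5):
    """P(discard) = 1 - sqrt(threshold / word_freq), clipped at 0."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))

tokens = "have you heard the word the word about the bird".split()
counts = Counter(tokens)
total = len(tokens)

# Frequent words get a higher probability of being dropped from training
for word, count in counts.most_common(3):
    print(word, round(discard_probability(count / total, threshold=0.1), 3))
```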

<u>Dimensionality</u> - Quality of embeddings increases with dimensionality but after a point the marginal gain will diminish.

<u>Context window</u> - The context window determines how many words before and after the current word are treated as its context words. The recommended value is 10 for SG and 5 for CBOW.


<img src='https://lilianweng.github.io/lil-log/assets/images/word2vec-skip-gram.png'></img>

**SKIP GRAM ARCHITECTURE**

## Training Word2Vec using Gensim

In [44]:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
import numpy as np
from collections import defaultdict

In [5]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [31]:
# gensim >= 4.0 names this parameter vector_size (older versions used size)
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)
model

<gensim.models.word2vec.Word2Vec at 0x1a1e9fdb38>

In [32]:
# Online training: add the new words to the vocabulary, then train on the new sentences
new_sentences = [["hello", "world"], ["hey", "world"]]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=1)

(0, 4)

In [33]:
model.wv['world']

array([-1.9072831e-03, -5.0078036e-04, -1.2205215e-03,  1.7031934e-03,
       -8.7878905e-04,  3.2642719e-03,  1.6955030e-03,  1.8635712e-03,
        6.0884620e-04,  3.6294644e-03, -5.1687384e-05, -6.7350105e-04,
       -1.3855461e-03,  3.8039340e-03, -2.2583394e-03,  2.9560386e-03,
       -3.5702975e-03,  2.1870460e-03,  2.5708969e-03,  1.3285197e-03,
        1.8306220e-03,  2.4980409e-03, -2.2381789e-03,  4.3474101e-03,
       -2.2699737e-03,  2.1978179e-03, -3.5987915e-03, -1.4745519e-03,
        8.9427346e-04,  2.5238441e-03, -3.7145237e-03,  2.1708685e-03,
       -3.0585675e-04, -1.9012406e-03,  1.7093649e-03,  3.2643725e-03,
       -2.7753536e-03,  2.0877942e-03,  8.5415441e-04, -2.3626110e-03,
       -1.7397344e-03, -2.4701545e-03, -4.7957557e-03, -2.7853139e-03,
        1.9269293e-03,  5.1363495e-05,  2.0468340e-04, -1.8245459e-03,
        5.4087868e-04, -3.0424343e-03,  2.0268066e-03,  4.8809272e-04,
       -1.6237929e-03,  3.8381990e-03, -4.0810690e-03,  1.5460260e-03,
      

## Using pre-trained Word Embeddings

In [None]:
# First download the pre-trained word embeddings
# Link: https://github.com/RaRe-Technologies/gensim-data

# Load the pre-trained vectors (word2vec binary format) as KeyedVectors
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model['computer']

These vectors can now be used to create sentence vectors by averaging them (or taking a TF-IDF-weighted average), or fed into a sequential model to create sequence representations with context.
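A minimal sketch of the averaging approach. A plain dict of made-up 2-dimensional vectors stands in for the loaded embeddings, and the `weights` argument is a hypothetical hook for per-word weights such as TF-IDF scores; out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim, weights=None):
    """Weighted average of the vectors of all in-vocabulary tokens."""
    vecs, wts = [], []
    for tok in tokens:
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            # Default to uniform weights, i.e. a plain average
            wts.append(1.0 if weights is None else weights.get(tok, 1.0))
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.average(np.vstack(vecs), axis=0, weights=wts)

# Toy embeddings, invented for this example
toy_vectors = {"hello": np.array([1.0, 0.0]), "world": np.array([0.0, 1.0])}

plain = sentence_vector(["hello", "world", "unknown"], toy_vectors, dim=2)
weighted = sentence_vector(["hello", "world"], toy_vectors, dim=2,
                           weights={"hello": 3.0, "world": 1.0})
print(plain, weighted)
```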

## Word2Vec from Scratch

In [37]:
settings = {}
settings['n'] = 5                     #Dimensionality of word vectors
settings['window_size'] = 2           #Context window
settings['min_count'] = 0             #min word count
settings['epochs'] = 5000             #training epochs
settings['neg_samp'] = 10             #number of negative words to use during training
settings['learning_rate'] = 0.01      #learning rate
np.random.seed(0)

In [207]:
corpus = ['have you heard the word the word about the bird', 'hello world word']
corpus

['have you heard the word the word about the bird', 'hello world word']

In [208]:
from sklearn.feature_extraction.text import CountVectorizer
cnt = CountVectorizer(stop_words=None)
cnt.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
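The `get_training_data` function that follows calls a `get_onehot` helper whose definition is not shown in this notebook. The sketch below is one possible stand-in, not the original helper: it assumes whitespace tokenization and builds the index from the alphabetically sorted vocabulary, which for this corpus gives the same indices as `cnt.vocabulary_`. The names `vocab` and `word_to_index` are hypothetical.

```python
import numpy as np

corpus = ['have you heard the word the word about the bird', 'hello world word']

# Alphabetically sorted vocabulary -> column index
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def get_onehot(word):
    """Return a 1-D one-hot vector for `word` (assumed helper, not from the source)."""
    onehot = np.zeros(len(word_to_index), dtype=int)
    onehot[word_to_index[word]] = 1
    return onehot

print(get_onehot('have'))
```

A word that is not in the vocabulary raises a `KeyError` here; a fuller version might map such words to an all-zero vector instead.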

In [213]:
def get_training_data(corpus):
    X, y = [],[]
    for sentence in corpus:
        sentence = sentence.split()
        len_sent = len(sentence)
        
        for i in range(len_sent):
            
            #Get current word
            current_word = get_onehot(sentence[i])
            
            #get context words
            context_words = []
            for j in range(i-settings['window_size'], i+settings['window_size']+1):
                if j!=i and j<=len_sent-1 and j>=0:
                    context_words.append(get_onehot(sentence[j]))
            X.append(current_word)
            y.append(context_words)
            
    return X, y

In [214]:
X, y = get_training_data(corpus)

In [215]:
[cnt.inverse_transform(i) for i in X]

[[array(['have'], dtype='<U5')],
 [array(['you'], dtype='<U5')],
 [array(['heard'], dtype='<U5')],
 [array(['the'], dtype='<U5')],
 [array(['word'], dtype='<U5')],
 [array(['the'], dtype='<U5')],
 [array(['word'], dtype='<U5')],
 [array(['about'], dtype='<U5')],
 [array(['the'], dtype='<U5')],
 [array(['bird'], dtype='<U5')],
 [array(['hello'], dtype='<U5')],
 [array(['world'], dtype='<U5')],
 [array(['word'], dtype='<U5')]]

In [216]:
y

[[array([0, 0, 0, 0, 0, 0, 0, 0, 1]), array([0, 0, 0, 1, 0, 0, 0, 0, 0])],
 [array([0, 0, 1, 0, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 1, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0])],
 [array([0, 0, 1, 0, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 0, 0, 1]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 1, 0, 0])],
 [array([0, 0, 0, 0, 0, 0, 0, 0, 1]),
  array([0, 0, 0, 1, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 1, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0])],
 [array([0, 0, 0, 1, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 1, 0, 0])],
 [array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 1, 0, 0]),
  array([0, 0, 0, 0, 0, 0, 1, 0, 0]),
  array([1, 0, 0, 0, 0, 0, 0, 0, 0])],
 [array([0, 0, 0, 0, 0, 0, 1, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array([1, 0, 0, 0, 0, 0, 0, 0, 0]),
  array([0, 0, 0, 0, 0, 1, 0, 0, 0])],
 [array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
  array

In [179]:
cnt.inverse_transform(X[2])

[array(['heard'], dtype='<U5')]

In [180]:
[cnt.inverse_transform(i) for i in y[2]]

[[array(['have'], dtype='<U5')],
 [array(['you'], dtype='<U5')],
 [array(['the'], dtype='<U5')],
 [array(['word'], dtype='<U5')]]

In [212]:
cnt.vocabulary_

{'have': 2,
 'you': 8,
 'heard': 3,
 'the': 5,
 'word': 6,
 'about': 0,
 'bird': 1,
 'hello': 4,
 'world': 7}

In [218]:
yy = y[2]

In [217]:
import keras.backend as K

Using TensorFlow backend.


In [220]:
[i for i in yy]

[array([0, 0, 1, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1]),
 array([0, 0, 0, 0, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 1, 0, 0])]