# Word2Vec practice (PyTorch)

If you don't have data for word2vec, you can download the dataset
from https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip,  
or fetch it with `urllib.request` as follows.  
The preprocessing cell reads the file from the `dataset/` directory, so either move the file there or adjust the path.

    import urllib.request
    urllib.request.urlretrieve("https://raw.githubusercontent.com/GaoleMeng/RNN-and-FFNN-textClassification/master/ted_en-20160408.xml", filename="ted_en-20160408.xml")

In [1]:
# Packages for preprocessing
import re
import math
import json
import random
import itertools
import numpy as np
from lxml import etree
from collections import Counter
from numpy.random import multinomial

# Packages for training
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd

## Preprocess dataset  
The preprocessing of the .xml file follows this tutorial:  
https://wikidocs.net/60855  
  
1. Load the dataset: open()  
2. Extract the text inside the `<content>` ... `</content>` tags
3. Replace parenthesized text (non-speech annotations) with ' '
4. Split the text into sentences.
5. Convert capital letters to lowercase
   & replace punctuation marks with blanks
6. Tokenize the preprocessed sentences

In [2]:
dataset = open('dataset/ted_en-20160408.xml', 'r', encoding='UTF8')

text = '\n'.join(etree.parse(dataset).xpath('//content/text()'))
text = re.sub(r'\([^)]*\)', ' ', text)
print("*Print one sentence in text:\n\n{}".format(text[:95]))

sentences = text.split('.')
print("\n*Print one sentence in sentences:\n\n{}".format(sentences[0]))

pre_sentences = []
for sentence in sentences:
    pre_sentences.append(re.sub(r"[^a-z0-9]+", " ", sentence.lower()))

print("\n*Print one sentence in pre_sentences:\n\n{}".format(pre_sentences[0]))

Tokenized_sentence = [sentence.split(" ") for sentence in pre_sentences]
tokenized_sentence = []
for sentence in Tokenized_sentence:
    if len(sentence) < 5: continue
    tokenized_sentence.append([w for w in sentence if w != ''])
print("\n*Print one sentence in tokenized_sentence:\n\n{}".format(tokenized_sentence[0]))

print("\nNumber of tokenized sentences: {}".format(len(tokenized_sentence)))

*Print one sentence in text:

Here are two reasons companies fail: they only do more of the same, or they only do what's new.

*Print one sentence in sentences:

Here are two reasons companies fail: they only do more of the same, or they only do what's new

*Print one sentence in pre_sentences:

here are two reasons companies fail they only do more of the same or they only do what s new

*Print one sentence in tokenized_sentence:

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']

Number of tokenized sentences: 242019


### Sample sentences
  
Since this dataset is very large, we shrink the size of the corpus.

In [3]:
nSample = 10000
tokenized_sentence = random.sample(tokenized_sentence, nSample)
print("Number of tokenized sentences: {}".format(len(tokenized_sentence)))

Number of tokenized sentences: 10000


## Word to index & Index to Word
I follow the instructions from https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb    

1. Make vocabulary from tokenized_sentence
2. Count the word frequencies and drop words that appear fewer than min_freq times
3. Subsample frequent words
4. Create dictionaries for mapping between word and index 

### Min frequency
  
Words below the minimum frequency are dropped before training.
So, before training starts, I remove the words that appear fewer than `min_freq` times.

### Sub sampling

The Word2Vec authors probabilistically reduce how often frequent words are used for training. Frequent words occur in many training pairs, so they already receive many updates.  
For the i-th word $w_i$ with relative frequency $f(w_i)$, the probability of excluding it from training is defined as  
  
$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$  
  
They recommend a threshold t of around 0.00001.
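
To get a feel for the formula, the sketch below evaluates the discard probability for a few relative frequencies. The function name `discard_prob` and the example frequencies are illustrative assumptions, not values measured from the TED corpus. Clamping negative results to zero is also an addition here: words rarer than t are simply never dropped.

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of dropping a word whose relative frequency is `freq`."""
    # Words with freq <= t would give a negative value, so clamp at 0.
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Hypothetical relative frequencies (illustrative only).
for freq in (1e-1, 1e-3, 1e-5, 1e-6):
    print("f = {:g} -> P(discard) = {:.3f}".format(freq, discard_prob(freq)))
```

With t = 0.00001, a word that makes up 10% of all tokens is dropped about 99% of the time, which likely explains why so few sentences survive subsampling on a small corpus.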

In [4]:
min_freq = 5

vocabulary = {}

for sentence in tokenized_sentence:
    for token in sentence:
        if token not in vocabulary:
            vocabulary[token] = 1
        else:
            vocabulary[token] += 1

In [5]:
# CUT
VOCAB = {word:cnt for (word,cnt) in vocabulary.items() if cnt >= min_freq}

In [6]:
# Sub Sampling
sum_word_counts = sum(list(VOCAB.values()))
words_prob = {word: cnt/float(sum_word_counts) for word, cnt in VOCAB.items()}

filtered = []
for sentence in tokenized_sentence:
    filtered.append([])
    for token in sentence:
        if token not in VOCAB: continue
        prob = 1 - math.sqrt(0.00001/words_prob[token])
        if random.random() >= prob:
            filtered[-1].append(token)

In [7]:
SENTENCE = []

for sentence in filtered:
    if len(sentence) < 5: continue
    SENTENCE.append(sentence)

word2index = {word: idx for idx, (word, cnt) in enumerate(VOCAB.items())}
index2word = {idx: word for idx, (word, cnt) in enumerate(VOCAB.items())}

vocab_size = len(VOCAB)

print("Total vocabulary size: {}".format(vocab_size))
print("Total number of sentences: {}".format(len(SENTENCE)))

Total vocabulary size: 3134
Total number of sentences: 914


## Skip gram
  
We will use Skip gram, not CBOW.  
For a single (center, context) pair, the model defines the probability  
  
$$ P(context|center;\theta) $$  
  
We then maximize the product of this probability over all center/context pairs.  
  
$$ \max_\theta \prod_{center} \prod_{context} P(context|center;\theta) $$  
  
Taking the negative log turns this into minimizing the average negative log-likelihood, where T is the number of center words in the corpus:  
  
$$ \min_\theta -\frac{1}{T} \sum_{center} \sum_{context} \log P(context|center;\theta) $$  
  
### Define P
  
We have to define the probability distribution. Assume each word is represented by two vectors:  
1. v : used when the word is the center word
2. u : used when the word is a context word
  
Then, we can write P as follows:  
  
$$ P(context|center;\theta) = \frac{\exp(u^T_{context} v_{center})}{\sum_{w \in vocab} \exp(u^T_{w} v_{center})}$$
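
As a rough illustration of this softmax, the sketch below computes P(w | center) for every word of a toy vocabulary with NumPy. The sizes, the random vectors, and the name `context_distribution` are assumptions made for the example; they are not part of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4                    # toy sizes, chosen for illustration
V = rng.normal(size=(vocab_size, emb_dim))    # center vectors v
U = rng.normal(size=(vocab_size, emb_dim))    # context vectors u

def context_distribution(center_idx):
    """Return P(w | center) for every word w in the vocabulary."""
    scores = U @ V[center_idx]        # u_w^T v_center for all w
    scores = scores - scores.max()    # stabilise exp(); the result is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = context_distribution(center_idx=2)
print(p, p.sum())
```

The denominator touches every word in the vocabulary for each training pair, which is the cost that negative sampling avoids.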

## Get pairs of words that exist within the window size

1. Negative Sampling
2. Get pairs of words that exist within the window size.  
  
We will use them to train the word2vec embedding model.  
Ref.: https://rguigoures.github.io/word2vec_pytorch/
  
### Negative Sampling
  
Because the softmax is slow to compute over a large vocabulary, the word2vec authors suggested the negative sampling algorithm.  
Instead of normalizing over the whole vocabulary, it samples a few words that are not in the context of the center word and trains the model to tell them apart from the true context word.  
You can find the paper here: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
  
  
For each word, the negative words are drawn from outside its context with probability  
$$ P(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum_{j=0}^{n}f(w_j)^{\frac{3}{4}}}$$
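
The effect of the 3/4 exponent is easier to see on numbers. The sketch below compares the plain unigram distribution with the smoothed one; the word counts are hypothetical and chosen only to span a few orders of magnitude.

```python
import numpy as np

# Hypothetical word counts (illustrative only).
counts = {"the": 1000, "data": 100, "embedding": 10}

freq = np.array(list(counts.values()), dtype=float)
unigram = freq / freq.sum()             # plain relative frequency
smoothed = freq ** 0.75
smoothed = smoothed / smoothed.sum()    # P(w_i) from the formula above

for word, p_raw, p_neg in zip(counts, unigram, smoothed):
    print("{:<10} unigram = {:.3f}  negative-sampling = {:.3f}".format(word, p_raw, p_neg))
```

Raising the counts to the 3/4 power moves probability mass from the most frequent word toward the rarer ones, so rare words are picked as negatives more often than their raw frequency would suggest.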
  
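With negative sampling, the objective function of the unsupervised word2vec model changes as follows, where $\sigma$ is the sigmoid function, $v_c$ is the center vector, $u_o$ is the true context vector, and $u_j$ are the sampled negative vectors:  
  
$$ J_t(\theta) = \log \sigma (u_o^Tv_c) + \sum_{j \sim P(w)}[\log\sigma(-u_j^Tv_c)]$$  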
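
A minimal NumPy sketch of this objective for a single (center, context) pair is shown below. The helper names and the random toy vectors are assumptions for illustration; the notebook's own model computes the same quantity over a batch in PyTorch.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def negative_sampling_objective(v_c, u_o, u_neg):
    """J_t for one (center, context) pair and K negative samples.

    v_c: (d,) center vector, u_o: (d,) context vector, u_neg: (K, d) negatives.
    """
    positive = log_sigmoid(u_o @ v_c)               # log sigma(u_o^T v_c)
    negative = log_sigmoid(-(u_neg @ v_c)).sum()    # sum_j log sigma(-u_j^T v_c)
    return positive + negative

rng = np.random.default_rng(0)
d, K = 4, 5   # toy embedding size and number of negatives
J = negative_sampling_objective(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(K, d)))
print(J)
```

J_t is a sum of log-probabilities, so it is never positive. Training maximizes it, which is the same as minimizing -J_t.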


The window size is adjustable; here we use a window size of 5.
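
To make the role of the window concrete, the sketch below lists the (center, context) pairs for a toy sentence. The function `window_pairs` and the sentence are hypothetical; a window of 1 is used only to keep the output short.

```python
def window_pairs(tokens, window_size):
    """Yield (center, context) pairs for every position in `tokens`."""
    for center_idx, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, center_idx - window_size)
        end = min(len(tokens), center_idx + window_size + 1)
        for context_idx in range(start, end):
            if context_idx != center_idx:
                yield center, tokens[context_idx]

# Toy sentence with a small window so the output stays readable.
pairs = list(window_pairs(["we", "train", "word", "vectors"], window_size=1))
print(pairs)
# [('we', 'train'), ('train', 'we'), ('train', 'word'),
#  ('word', 'train'), ('word', 'vectors'), ('vectors', 'word')]
```

A larger window produces more pairs per sentence, and every pair also needs its own set of negative samples.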

In [8]:
# Negative Sampling
sample_prob = {}
word_counts = dict(Counter(list(itertools.chain.from_iterable(SENTENCE))))
norm_factor = sum([v**0.75 for v in word_counts.values()])
for word in word_counts:
    sample_prob[word] = word_counts[word]**0.75/norm_factor
words = np.array(list(word_counts.keys()))
probs = list(sample_prob.values())

In [9]:
def negative_sampling(ng_size, word_pair):
    select = []
    sample = []
    while True:
        select = list(multinomial(ng_size, probs))
        selected_words = [words[idx] for idx, cnt in enumerate(select) if cnt > 0]
        if not(set(selected_words) & set(word_pair)):
            break
    for idx, cnt in enumerate(select):
        for _ in range(cnt):
            sample.append(word2index[words[idx]])
    return sample

In [10]:
window_size = 5
negative_size = 5
word_pairs = []

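# Load cached word_pairs (with negative samples) if the file exists;
# otherwise build them and save them to disk
try:
    with open('word_pairs.json','r') as f:
        word_pairs = json.load(f)
except (FileNotFoundError, json.JSONDecodeError):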
    for sentence in SENTENCE:
        indices = [word2index[word] for word in sentence]

        for center_idx in range(len(indices)):
            # save window
            for context_idx in range(center_idx - window_size, center_idx + window_size + 1):
                if context_idx < 0 or context_idx >= len(indices) or context_idx == center_idx: continue
                ng_sample = negative_sampling(negative_size, [index2word[indices[center_idx]], index2word[indices[context_idx]]])
                word_pairs.append((indices[center_idx], indices[context_idx], ng_sample))
    with open('word_pairs.json','w') as f:
        json.dump(word_pairs,f)

## Train Word2Vec model

Now, we are ready to train the word2vec model.  

1. Get Batch  
To speed up word2vec training, we train on mini-batches rather than on one pair at a time.  
Batching makes training faster and smooths the gradient updates.
2. Define Model Word2Vec  
Initialize the weights of the center & context embeddings.  
3. Criterion, Optimizer  
The model's forward pass returns the negative-sampling loss directly, so the `nn.CrossEntropyLoss` criterion created in the training cell is never used.  
We use Adam as the optimizer.

In [11]:
batch_size = 100
emb_dim = 100
n_epochs = 100

In [12]:
def get_batches(word_pairs, batch_size = batch_size):
    random.shuffle(word_pairs)
    batches = []
    batch_target, batch_context, batch_negative = [], [], []
    for idx, (target, context, negative) in enumerate(word_pairs):
        batch_target.append(target)
        batch_context.append(context)
        batch_negative.append(list(negative))
        # Emit a batch when it is full or when the last pair is reached
        if (idx + 1) % batch_size == 0 or idx == len(word_pairs)-1:
            # autograd.Variable is deprecated; plain tensors track gradients themselves
            tensor_target = torch.tensor(batch_target, dtype=torch.long)
            tensor_context = torch.tensor(batch_context, dtype=torch.long)
            tensor_negative = torch.tensor(batch_negative, dtype=torch.long)
            batches.append((tensor_target, tensor_context, tensor_negative))
            batch_target, batch_context, batch_negative = [], [], []
    return batches

In [25]:
class Word2Vec(nn.Module):
    
    def __init__(self, emb_dim, vocab_size):
        super(Word2Vec, self).__init__()
        self.emb_target = nn.Embedding(vocab_size, emb_dim)
        self.emb_context = nn.Embedding(vocab_size, emb_dim)
        self.emb_target.weight = nn.Parameter(torch.cat([torch.zeros(1, emb_dim), torch.FloatTensor(vocab_size - 1, emb_dim).uniform_(-0.5 / emb_dim, 0.5 / emb_dim)]))
        self.emb_context.weight = nn.Parameter(torch.cat([torch.zeros(1, emb_dim), torch.FloatTensor(vocab_size - 1, emb_dim).uniform_(-0.5 / emb_dim, 0.5 / emb_dim)]))
        self.emb_target.weight.requires_grad = True
        self.emb_context.weight.requires_grad = True
        
    def forward(self, target, context, negative):
        emb_target = self.emb_target(target)
        emb_context = self.emb_context(context)
        positive = torch.mul(emb_target, emb_context)
        positive = torch.sum(positive, dim=1)
        out = torch.sum(F.logsigmoid(positive))
        
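        emb_negative = self.emb_context(negative)
        # (batch, K, dim) x (batch, dim, 1) -> (batch, K): one score per negative sample
        _negative = torch.bmm(emb_negative, emb_target.unsqueeze(2)).squeeze(2)
        # Apply log-sigmoid to each negative score before summing, as in the objective J_t
        # (summing the scores first would compute log sigma of the summed score instead)
        out += torch.sum(F.logsigmoid(-_negative))
        return -out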

In [26]:
criterion = nn.CrossEntropyLoss()  # defined but unused: model.forward() already returns the loss
model = Word2Vec(emb_dim = emb_dim, vocab_size = vocab_size)
optimizer = optim.Adam(model.parameters())
for epoch in range(1, n_epochs+1):
    batch_word_pairs = get_batches(word_pairs, batch_size = batch_size)
    losses = []
    for i in range(len(batch_word_pairs)):
        model.train()
        optimizer.zero_grad()
        tt, ct, nt = batch_word_pairs[i]
        loss = model(tt, ct, nt)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    print("Epoch: {}, Loss: {}".format(epoch,np.mean(losses)))

Epoch: 1, Loss: 126.79674339954416
Epoch: 2, Loss: 103.63983894796932
Epoch: 3, Loss: 98.82628349119405
Epoch: 4, Loss: 89.61861005868879
Epoch: 5, Loss: 74.51335606030527
Epoch: 6, Loss: 57.016483164988585
Epoch: 7, Loss: 41.264257170337295
Epoch: 8, Loss: 29.11978720628679
Epoch: 9, Loss: 20.49834604329304
Epoch: 10, Loss: 14.584076025081753
Epoch: 11, Loss: 10.553599129086134
Epoch: 12, Loss: 7.783419966285204
Epoch: 13, Loss: 5.8418169009231775
Epoch: 14, Loss: 4.458371350921974
Epoch: 15, Loss: 3.450731998908891
Epoch: 16, Loss: 2.7037047952104194
Epoch: 17, Loss: 2.1405258636573605
Epoch: 18, Loss: 1.7112004232035376
Epoch: 19, Loss: 1.3785165563792918
Epoch: 20, Loss: 1.1175237812591672
Epoch: 21, Loss: 0.9117798532994148
Epoch: 22, Loss: 0.7475192889210263
Epoch: 23, Loss: 0.615669179591753
Epoch: 24, Loss: 0.5089675539711355
Epoch: 25, Loss: 0.4222492705837253
Epoch: 26, Loss: 0.3513898245964496
Epoch: 27, Loss: 0.29313624789454945
Epoch: 28, Loss: 0.24506533078875095
Epoch: 2

In [15]:
def closest_word(word, topn = 5):
    word_distance = []
    emb = model.emb_target
    dist = nn.PairwiseDistance()
    idx = word2index[word]
    lookup_i = torch.tensor([idx], dtype=torch.long)
    v_i = emb(lookup_i)
    for j in range(len(VOCAB)):
        if j != idx:
            lookup_j = torch.tensor([j], dtype=torch.long)
            v_j = emb(lookup_j)
            word_distance.append((index2word[j], float(dist(v_i, v_j))))
    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]

In [36]:
closest_word("guy")

[('properties', 3.9740302562713623),
 ('different', 5.2601752281188965),
 ('older', 5.26859712600708),
 ('career', 5.358475208282471),
 ('economic', 5.413461208343506)]

Because the model was trained on only a few sentences, it could not learn good word vectors.  
  
Many sentences were removed because it takes too long to draw negative samples that are in the corpus but not in the context of the center word.  
  
With a faster negative-sampling procedure, we could train on more sentences and likely get more plausible word vectors.
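
One possible speed-up, sketched below under stated assumptions, is to set the probability of the excluded words to zero and renormalize, so that a single draw is enough and no retry loop is needed. The function `sample_negatives` is hypothetical, it has not been benchmarked against the notebook's `negative_sampling`, and it assumes that position i of `probs` corresponds to word index i.

```python
import numpy as np

def sample_negatives(probs, exclude, k, rng):
    """Draw k word indices from `probs`, never returning an index in `exclude`.

    Zeroing the excluded entries and renormalising gives the same
    distribution as redrawing the whole sample until no collision occurs.
    """
    p = np.array(probs, dtype=float)   # copy, so the caller's array is untouched
    p[list(exclude)] = 0.0
    p /= p.sum()
    return rng.choice(len(p), size=k, replace=True, p=p)

rng = np.random.default_rng(0)
toy_probs = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical distribution
negatives = sample_negatives(toy_probs, exclude={0, 1}, k=5, rng=rng)
print(negatives)
```

Here the center and context indices would be passed as `exclude`. The per-pair work stays inside NumPy rather than in a Python loop over the vocabulary, which is where most of the time is probably spent.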