# Word2Vec practice (PyTorch)

If you don't have data for Word2Vec, you can download the dataset
from https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip,  
or you can fetch it with urllib.request as follows.  
The loading cell below reads the file from the dataset/ directory, so move the downloaded file there or adjust the path.

import urllib.request  
urllib.request.urlretrieve("https://raw.githubusercontent.com/GaoleMeng/RNN-and-FFNN-textClassification/master/ted_en-20160408.xml", filename="ted_en-20160408.xml")

In [1]:
# Packages for preprocessing
import re
import math
import json
import random
import pickle
import itertools
import numpy as np
from lxml import etree
from collections import Counter
from numpy.random import multinomial
from nltk.tokenize import word_tokenize, sent_tokenize

# Packages for training
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader

## Preprocess dataset  
I follow the steps for preprocessing the .xml file described at the following site.  
https://wikidocs.net/60855  
  
1. Load the dataset: open()  
2. Extract the text inside the content tags (xpath //content/text()) and remove parenthesized annotations
3. Divide the corpus into sentences with a tokenizer (nltk.sent_tokenize)
4. Remove the punctuation marks and convert uppercase letters to lowercase
5. Tokenize the preprocessed sentences using nltk.word_tokenize
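
The cleaning steps can be tried on a short string before running them on the whole corpus. The following is a minimal sketch, not the notebook's pipeline: the regex sentence split and the whitespace split are simplified stand-ins for nltk.sent_tokenize and nltk.word_tokenize, and the function name preprocess and the sample text are made up for illustration.

```python
import re

def preprocess(raw_text):
    """Toy version of the cleaning steps; the splitting rules are simplified stand-ins."""
    # Drop parenthesized annotations, e.g. "(Laughter)".
    text = re.sub(r"\([^)]*\)", "", raw_text)
    # Stand-in for nltk.sent_tokenize: split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Lowercase, then replace every run of non-alphanumeric characters with a space.
    cleaned = [re.sub(r"[^a-z0-9]+", " ", s.lower()) for s in sentences]
    # Stand-in for nltk.word_tokenize: plain whitespace split.
    return [s.split() for s in cleaned if s.strip()]

sample = "Here's an idea. (Laughter) It works, doesn't it?"
tokens = preprocess(sample)
print(tokens)
```

Note that apostrophes are replaced by spaces, so "what's" becomes the two tokens "what" and "s", as in the notebook output below.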

In [2]:
dataset = open('dataset/ted_en-20160408.xml', 'r', encoding='UTF8')

text = '\n'.join(etree.parse(dataset).xpath('//content/text()'))
text = re.sub(r'\([^)]*\)', '', text)
print("*Print one sentence in text:\n\n{}".format(text[:95]))

sentences = sent_tokenize(text)
print("\n*Print one sentence in sentences:\n\n{}".format(sentences[0]))

pre_sentences = []
for sentence in sentences:
    pre_sentences.append(re.sub(r"[^a-z0-9]+", " ", sentence.lower()))

print("\n*Print one sentence in pre_sentences:\n\n{}".format(pre_sentences[0]))

tokenized_sentence = [word_tokenize(sentence) for sentence in pre_sentences]

print("\n*Print one sentence in tokenized_sentence:\n\n{}".format(tokenized_sentence[0]))

print("\nNumber of tokenized sentences: {}".format(len(tokenized_sentence)))

*Print one sentence in text:

Here are two reasons companies fail: they only do more of the same, or they only do what's new.

*Print one sentence in sentences:

Here are two reasons companies fail: they only do more of the same, or they only do what's new.

*Print one sentence in pre_sentences:

here are two reasons companies fail they only do more of the same or they only do what s new 

*Print one sentence in tokenized_sentence:

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']

Number of tokenized sentences: 273424


## Word to index & Index to Word
I follow the code & instructions from https://github.com/theeluwin/pytorch-sgns  

  
I preprocessed the dataset in the following sequence.
1. Count how often each word appears in the dataset and save the counts in the word_count variable
2. Define the idx2word (index to word) and word2idx (word to index) variables
3. Build the vocabulary from the words kept in word2idx
4. Define the skipgram function. It returns the center word and its context words. Context lists are padded with the 'unk' word.
5. Build the training dataset, which is composed of center and context words.
6. Define the word_frequency variable.
7. Define the subsampling probability of each word.
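
Steps 1 to 3 can be sketched on a toy corpus as follows. This is an illustration under assumptions: it uses collections.Counter.most_common instead of the notebook's sorted dictionary, and toy_sentences and toy_max_vocab are hypothetical values chosen to keep the output small.

```python
from collections import Counter

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
toy_max_vocab = 4  # hypothetical small cap, standing in for max_vocab
unk = "unk"

# Step 1: count every token in the corpus.
counts = Counter(token for sentence in toy_sentences for token in sentence)

# Step 2: index 0 is reserved for 'unk'; the rest are the most frequent words.
idx2word = [unk] + [word for word, _ in counts.most_common(toy_max_vocab - 1)]
word2idx = {word: idx for idx, word in enumerate(idx2word)}

# Step 3: words outside the vocabulary fall back to the 'unk' index.
encoded = [[word2idx.get(tok, word2idx[unk]) for tok in s] for s in toy_sentences]
print(idx2word)
print(encoded)
```

Here "dog" and "down" do not make it into the four-word vocabulary, so both are encoded as index 0.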
  
Variables:  
* t : subsampling threshold
* window_size : number of context words taken on each side of the center word
* num_negs : number of negative words drawn for each (center, context) pair
* max_vocab : maximum vocabulary size; only the most frequent words are kept
* emb_dim : dimension of the word embeddings
* padding_idx : padding index
* n_epochs : number of epochs
* batch_size : mini-batch size
* device : 'cuda' if a GPU is available, else 'cpu'

### Skip gram
  
We will use Skip gram, not CBOW.  
The following is the probability of a single (center, context) pair.  
  
$$ P(context|center;\theta) $$  
  
The Skip gram model maximizes the product of this probability over all center/context pairs.  
  
$$ \max_\theta \prod_{center} \prod_{context} P(context|center;\theta) $$  
  
Taking the negative logarithm turns this into minimizing the average negative log likelihood, where T is the number of center words.  
  
$$ \min_\theta -\frac{1}{T} \sum_{center} \sum_{context} \log P(context|center;\theta) $$  
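
To make the objective concrete, here is a small sketch that evaluates the negative log likelihood on a three-word vocabulary. It assumes P(context|center) is a full softmax over dot products of context and center vectors, which the text above does not spell out; the vectors and pairs are made-up values, and for simplicity the loss is averaged over pairs.

```python
import math

# Hypothetical 2-d vectors for a 3-word vocabulary (values chosen for illustration).
center_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "sat": [1.0, 1.0]}
context_vecs = {"cat": [0.5, 0.0], "dog": [0.0, 0.5], "sat": [1.0, 1.0]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_prob(context, center):
    """log P(context | center) under a full softmax over the vocabulary."""
    scores = {w: dot(u, center_vecs[center]) for w, u in context_vecs.items()}
    log_norm = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[context] - log_norm

pairs = [("cat", "sat"), ("sat", "cat"), ("dog", "sat")]  # (center, context)
nll = -sum(log_prob(ctx, c) for c, ctx in pairs) / len(pairs)
print(round(nll, 4))
```

The normalizer sums over the whole vocabulary for every pair, which is the cost that negative sampling avoids later in this notebook.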


### Sub sampling

The Word2Vec researchers reduce, in a probabilistic way, the amount of training on words that appear frequently in the corpus. Frequent words get update opportunities in proportion to how often they appear, so skipping some of their occurrences loses little information.  
The probability of excluding a word occurrence from training is defined below, where f(w_i) is the relative frequency of the word and t is a threshold.  
  
$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$  
  
In practice, however, the researchers use the following probability, which this notebook also adopts.  
  
$$ P(w_i) = \frac{f(w_i)-t}{f(w_i)} - \sqrt{\frac{t}{f(w_i)}} $$

They recommend a value of t around 0.00001.
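
The two discard probabilities above can be compared numerically. Since (f - t)/f = 1 - t/f, the second formula equals the first minus t/f, so it always discards slightly less. The sketch below clips both to [0, 1] as the notebook does; the function names and the sample frequencies are illustrative assumptions.

```python
import math

t = 1e-5  # subsampling threshold

def paper_discard_prob(f):
    """1 - sqrt(t / f), clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(t / f))

def notebook_discard_prob(f):
    """(f - t) / f - sqrt(t / f), clipped to [0, 1]."""
    return min(1.0, max(0.0, (f - t) / f - math.sqrt(t / f)))

# Relative frequencies from very rare to very frequent words.
for f in (1e-6, 1e-5, 1e-4, 1e-2):
    print(f, round(paper_discard_prob(f), 4), round(notebook_discard_prob(f), 4))
```

Words whose relative frequency is at or below t are never discarded under either formula, which is why a rare word such as "picking" gets probability 0.0 later in the notebook.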

In [3]:
t = 0.00001 # sub sampling threshold
window_size = 5
num_negs = 20
max_vocab = 20000
emb_dim = 300
padding_idx = 0
n_epochs = 20
batch_size = 4096
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [4]:
unk = 'unk'
word_count = {}
word_count[unk] = 1

for sentence in tokenized_sentence:
    for token in sentence:
        if token not in word_count:
            word_count[token] = 1
        else:
            word_count[token] += 1

In [5]:
idx2word = [unk] + [key for key, _ in sorted(word_count.items(), key = (lambda x: x[1]), reverse=True)][:max_vocab-1]
word2idx = {idx2word[idx]: idx for idx, _ in enumerate(idx2word)}

In [6]:
vocab = set([word for word in word2idx])
print("Vocabulary size: {}".format(len(vocab)))

Vocabulary size: 20000


In [7]:
# Skip gram

def skipgram(sentence, index):
    left = sentence[max(0, index-window_size):index]
    right = sentence[index+1:min(len(sentence), index+window_size) +1]
    
    return sentence[index], [unk for _ in range(window_size - len(left))] + left + right + [unk for _ in range(window_size - len(right))]
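
To see what the padded window looks like, here is a self-contained variant of the function above run on a toy sentence. The name skipgram_window, the explicit window_size and pad arguments, and the window of 2 are assumptions made so the example runs on its own.

```python
def skipgram_window(sentence, index, window_size=2, pad="unk"):
    """Return (center, contexts) with contexts padded to 2 * window_size tokens."""
    left = sentence[max(0, index - window_size):index]
    # Slicing past the end of a list is safe in Python, so no min() is needed here.
    right = sentence[index + 1:index + window_size + 1]
    left_pad = [pad] * (window_size - len(left))
    right_pad = [pad] * (window_size - len(right))
    return sentence[index], left_pad + left + right + right_pad

toy = ["we", "had", "to", "get", "out"]
for i in range(len(toy)):
    print(skipgram_window(toy, i))
```

Every context list has the same length, which is what lets the DataLoader stack them into one tensor per batch.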

In [8]:
train_data = []
for sentence in tokenized_sentence:
    sent = []
    for word in sentence:
        sent.append(word if word in vocab else unk)
    for idx in range(len(sent)):
        center, contexts = skipgram(sent, idx)
        train_data.append((word2idx[center], [word2idx[context] for context in contexts]))

In [9]:
print("Training data size: {}".format(len(train_data)))
train_example_idx = random.choice(range(0, len(train_data)))

print("Randomly chosen index of train_data: {}".format(train_example_idx))
print("Training data example: {}".format(train_data[train_example_idx]))
print("The words of example:")
center, contexts = train_data[train_example_idx]
print("center: {}".format(idx2word[center]))
print("contexts:", end = " ")
for word in contexts:
    print(idx2word[word], end = " ")

Training data size: 4475758
Randomly chosen index of train_data: 2324351
Training data example: (4, [0, 93, 8, 5, 431, 394, 11, 60, 3, 72])
The words of example:
center: of
contexts: unk well in a couple months we had to get 

In [10]:
word_frequency = np.array([word_count[word] for word in idx2word])
word_frequency = word_frequency/word_frequency.sum()

In [11]:
# Sub sampling
subsample_prob = (word_frequency - t)/(word_frequency) - np.sqrt(t/word_frequency)
subsample_prob = np.clip(subsample_prob, 0, 1)

In [12]:
random_idx = random.choice(range(0, len(list(subsample_prob))))
print("Random index: {}".format(random_idx))
print("The probability to exclude training the word {} is {}".format(idx2word[random_idx],subsample_prob[random_idx]))

Random index: 3231
The probability to exclude training the word picking is 0.0


# Define Model

## Word2Vec
  
The Word2Vec class takes the maximum vocabulary size (max_vocab), the embedding dimension (emb_dim), and the padding index as parameters.
  
This class consists of an input layer and an output layer.
* input layer: it takes center words as a LongTensor  
    The weights of this layer are initialized uniformly ~ U(-0.5/embedding dim, 0.5/embedding dim), except for the padding row, which is set to zero
* output layer: it takes context words and negative words as a LongTensor  
    The weights of this layer are initialized uniformly ~ U(-0.5/embedding dim, 0.5/embedding dim), except for the padding row, which is set to zero

## Skip gram with Negative Sampling

### Negative Sampling
  
Since the full softmax is slow to compute over a large vocabulary, the Word2Vec researchers suggested the Negative Sampling algorithm.  
For each (center, context) pair, this algorithm draws a few random "negative" words and trains the model to tell the true context word apart from them with a sigmoid, instead of normalizing over the whole vocabulary.  
You can find the paper here: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
  
  
Negative words are drawn from the vocabulary (with replacement) with the following probability, where f(w_i) is the frequency of word w_i.  
$$ P(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum_{j=0}^{n}f(w_j)^{\frac{3}{4}}}$$
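
The effect of the 3/4 power is easiest to see on a few numbers. The sketch below uses made-up counts (toy_counts) and the standard library's random.choices for sampling with replacement; the notebook itself uses torch.multinomial.

```python
import random

toy_counts = {"the": 900, "cat": 90, "zebra": 10}  # hypothetical counts

def normalize(weights):
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

unigram = normalize(toy_counts)
# Raising counts to the 3/4 power flattens the distribution before normalizing.
noise = normalize({w: c ** 0.75 for w, c in toy_counts.items()})

for w in toy_counts:
    print(w, round(unigram[w], 3), round(noise[w], 3))

# Draw five negative words with replacement from the noise distribution.
random.seed(0)
negatives = random.choices(list(noise), weights=list(noise.values()), k=5)
print(negatives)
```

Compared with the raw unigram distribution, rare words such as "zebra" are sampled more often and very frequent words such as "the" less often.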
  
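With negative sampling, the objective function of the unsupervised Word2Vec model changes as follows, where v_c is the center word vector, u_o is the context word vector, and u_j are the vectors of the sampled negative words:  
  
$$ J_t(\theta) = \log \sigma (u_o^Tv_c) + \sum_{j \sim P(w)}[\log\sigma(-u_j^Tv_c)]$$  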
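
As a rough illustration of this objective, the sketch below computes the negated J_t (the quantity to minimize) for a single pair in plain Python. The vectors are made-up values and the helper names are assumptions; the notebook's SGNS class does the same computation in batches with torch.bmm and F.logsigmoid.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(v_c, u_o, negatives):
    """Negative of J_t for one (center, context) pair and its negative samples."""
    positive_term = log_sigmoid(dot(u_o, v_c))
    negative_term = sum(log_sigmoid(-dot(u_j, v_c)) for u_j in negatives)
    return -(positive_term + negative_term)

v_c = [0.5, -0.2]                      # center word vector
u_o = [0.4, 0.1]                       # true context word vector
negatives = [[-0.3, 0.2], [0.1, 0.6]]  # vectors of two sampled negative words
print(sgns_loss(v_c, u_o, negatives))
```

When all vectors are zero, every sigmoid equals 0.5, so the loss is (1 + k) * log(2) for k negative words; this is a handy sanity check for a freshly initialized model.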


You can change the window size; here we set it to 5.

In [13]:
# Word2Vec model
class Word2Vec(nn.Module):
    def __init__(self, vocab_size=max_vocab, emb_dim = emb_dim, padding_idx = 0):
        super(Word2Vec, self).__init__()
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.centers = nn.Embedding(self.vocab_size, self.emb_dim, padding_idx = padding_idx)
        self.contexts = nn.Embedding(self.vocab_size, self.emb_dim, padding_idx = padding_idx)
        self.centers.weight = nn.Parameter(torch.cat([torch.zeros(1, self.emb_dim),
                                                     torch.FloatTensor(self.vocab_size -1, self.emb_dim).uniform_(-0.5/self.emb_dim, 0.5/self.emb_dim)]))
        self.contexts.weight = nn.Parameter(torch.cat([torch.zeros(1, self.emb_dim),
                                                     torch.FloatTensor(self.vocab_size -1, self.emb_dim).uniform_(-0.5/self.emb_dim, 0.5/self.emb_dim)]))
        self.centers.weight.requires_grad = True
        self.contexts.weight.requires_grad = True
        
    def forward(self, data):
        return self.forward_input(data)
    
    def forward_input(self, data):
        vector = torch.LongTensor(data).to(device)
        return self.centers(vector)
    
    def forward_output(self, data):
        vector = torch.LongTensor(data).to(device)
        return self.contexts(vector)

In [14]:
# SkipGram with Negative Sampling
class SGNS(nn.Module):
    
    def __init__(self, emb_model, vocab_size = max_vocab, num_negs = num_negs, weights = None):
        super(SGNS, self).__init__()
        self.emb_model = emb_model
        self.vocab_size = vocab_size
        self.num_negs = num_negs
        
        word_frequency = np.power(weights, 0.75)
        word_frequency = word_frequency / word_frequency.sum()
        self.weights = torch.FloatTensor(word_frequency)
        
    def forward(self, center, contexts):
        batch_size = center.size()[0]
        context_size = contexts.size()[1]
        negative = torch.multinomial(self.weights, batch_size * context_size * self.num_negs, replacement = True).view(batch_size, -1)
        
        centerV = self.emb_model.forward_input(center).unsqueeze(2)
        contextsV = self.emb_model.forward_output(contexts)
        negativeV = self.emb_model.forward_output(negative).neg()
        
        # squeeze(2) drops only the trailing dimension, so a batch of size 1 keeps its batch axis
        context_loss = F.logsigmoid(torch.bmm(contextsV, centerV).squeeze(2)).mean(1)
        negative_loss = F.logsigmoid(torch.bmm(negativeV, centerV).squeeze(2)).view(-1, context_size, self.num_negs).sum(2).mean(1)
        
        return -(context_loss + negative_loss).mean()

In [15]:
model = Word2Vec(vocab_size = max_vocab, emb_dim = emb_dim)
model.to(device)

Word2Vec(
  (centers): Embedding(20000, 300, padding_idx=0)
  (contexts): Embedding(20000, 300, padding_idx=0)
)

In [16]:
sgns = SGNS(emb_model = model, vocab_size=max_vocab, num_negs=num_negs, weights=word_frequency)
sgns.to(device)

SGNS(
  (emb_model): Word2Vec(
    (centers): Embedding(20000, 300, padding_idx=0)
    (contexts): Embedding(20000, 300, padding_idx=0)
  )
)

In [17]:
optimization = torch.optim.Adam(sgns.parameters())
print(optimization)

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)


## Train Word2Vec model
  
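Before training the model, we must subsample the words.  
  
### PermutedSubsampledCorpus
  
Since we have the discard probability of each word (subsample_prob), we can simply drop each training pair at random according to the probability of its center word.  
This class returns the subsampled dataset. The permutation is done by the DataLoader (shuffle = True), and a new subsample is drawn at the start of every epoch.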
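
The keep/drop rule can be checked on toy data without PyTorch. In this sketch the pairs, the per-word discard probabilities, and the function name subsample are hypothetical; only the comparison of a uniform draw against the center word's discard probability mirrors the class below.

```python
import random

def subsample(pairs, discard_prob, rng):
    """Keep a pair when a uniform draw exceeds the center word's discard probability."""
    return [(c, ctx) for c, ctx in pairs if rng.random() > discard_prob[c]]

# 1000 pairs for each of three center words (indices 0, 1, 2).
toy_pairs = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1])] * 1000
toy_discard = [0.9, 0.0, 0.5]  # hypothetical per-word discard probabilities

rng = random.Random(0)
kept = subsample(toy_pairs, toy_discard, rng)
kept_per_center = [sum(1 for c, _ in kept if c == w) for w in range(3)]
print(kept_per_center)
```

Roughly 10% of the pairs centered on word 0 and 50% of those centered on word 2 should survive, while word 1 keeps essentially all of its pairs. Because the draw is random, the dataset size changes slightly between epochs, which matches the varying batch counts in the training log below.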
  
### Train
  
We use mini-batch training.  
Using the DataLoader class, you can split the dataset into mini-batches easily.  
To show the training progress, we use tqdm.

In [18]:
# Get dataset
# Now we apply the sub sampling method

class PermutedSubsampledCorpus(Dataset):
    def __init__(self, train_data = None, subsample_prob = None):
        self.data = []
        for center, contexts in train_data:
            if random.random() > subsample_prob[center]:
                self.data.append((center, contexts))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        center, contexts = self.data[idx]
        return center, np.array(contexts)

In [19]:
# train
model.train()
for epoch in range(1, n_epochs + 1):
    dataset = PermutedSubsampledCorpus(train_data = train_data, subsample_prob = subsample_prob)
    dataloader = DataLoader(dataset, batch_size = batch_size , shuffle = True)
    total_batches = int(np.ceil(len(dataset)/batch_size))
    pbar = tqdm(dataloader)
    pbar.set_description("[Epoch {}]".format(epoch))
    for center, contexts in pbar:
        loss = sgns(center, contexts)
        optimization.zero_grad()
        loss.backward()
        optimization.step()
        pbar.set_postfix(loss=loss.item())

[Epoch 1]: 100%|██████████| 262/262 [00:35<00:00,  7.34it/s, loss=4.27]
[Epoch 2]: 100%|██████████| 262/262 [00:34<00:00,  7.55it/s, loss=4.08]
[Epoch 3]: 100%|██████████| 263/263 [00:35<00:00,  7.49it/s, loss=5.42]
[Epoch 4]: 100%|██████████| 262/262 [00:36<00:00,  7.08it/s, loss=4.11]
[Epoch 5]: 100%|██████████| 262/262 [00:37<00:00,  6.95it/s, loss=4.06]
[Epoch 6]: 100%|██████████| 262/262 [00:37<00:00,  7.02it/s, loss=3.97]
[Epoch 7]: 100%|██████████| 262/262 [00:36<00:00,  7.17it/s, loss=4]   
[Epoch 8]: 100%|██████████| 262/262 [00:35<00:00,  7.37it/s, loss=4.03]
[Epoch 9]: 100%|██████████| 262/262 [00:39<00:00,  6.60it/s, loss=3.99]
[Epoch 10]: 100%|██████████| 262/262 [00:38<00:00,  6.81it/s, loss=3.98]
[Epoch 11]: 100%|██████████| 262/262 [00:36<00:00,  7.16it/s, loss=3.98]
[Epoch 12]: 100%|██████████| 262/262 [00:36<00:00,  7.25it/s, loss=3.99]
[Epoch 13]: 100%|██████████| 262/262 [00:36<00:00,  7.18it/s, loss=3.97]
[Epoch 14]: 100%|██████████| 262/262 [00:35<00:00,  7.37it/s

In [20]:
# Save the model
idx2vec = model.centers.weight.data.cpu().numpy()
# Use a context manager so the file is closed after writing
with open('idx2vec.dat', 'wb') as f:
    pickle.dump(idx2vec, f)
torch.save(sgns.state_dict(), 'word2vec.pt')
torch.save(optimization.state_dict(), 'word2vec.optim.pt')

## Get closest word
  
Using the trained model's lookup table, we can find similar words. The function below ranks words by Euclidean distance (the default of nn.PairwiseDistance).  
If the learned vector representations are poor, the nearest neighbors of a given word will not be meaningful.

In [21]:
def closest_word(word, topn = 5):
    i = word2idx[word]
    word_distance = []
    dist = nn.PairwiseDistance()
    v_i = idx2vec[i]
    tensor_i = torch.FloatTensor([v_i])
    for j in range(len(vocab)):
        if j != i:
            v_j = idx2vec[j]
            tensor_j = torch.FloatTensor([v_j])
            word_distance.append((idx2word[j], float(dist(tensor_i, tensor_j))))
    word_distance.sort(key=lambda x: x[1])
    print(word_distance[:topn])
    return

In [22]:
closest_word('woman')

[('girl', 1.0328195095062256), ('man', 1.0591298341751099), ('child', 1.2242408990859985), ('mother', 1.265128493309021), ('faiza', 1.293604850769043)]


In [23]:
closest_word('man')

[('woman', 1.0591306686401367), ('studs', 1.1442370414733887), ('max', 1.1473325490951538), ('hale', 1.1486926078796387), ('guy', 1.1503463983535767)]


In [24]:
closest_word('free')

[('renew', 1.2119112014770508), ('horizons', 1.231445550918579), ('barter', 1.2369675636291504), ('support', 1.2469265460968018), ('certification', 1.2520084381103516)]
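
The results above use Euclidean distance. Cosine similarity is a common alternative for comparing word vectors; the following is a minimal sketch of that swapped-in measure, not the method used in this notebook. The function name cosine_neighbors and the three-dimensional toy_vectors are made up for illustration. If you apply it to idx2vec, skip row 0: the 'unk' row is all zeros, so its norm is zero and the division would fail.

```python
import math

def cosine_neighbors(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word` (highest first)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    query = vectors[word]
    scored = []
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = sum(a * b for a, b in zip(query, vec)) / (norm(query) * norm(vec))
        scored.append((other, sim))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors; real ones would come from idx2vec.
toy_vectors = {
    "woman": [0.9, 0.8, 0.1],
    "girl": [0.8, 0.9, 0.2],
    "barter": [0.1, 0.2, 0.9],
}
print(cosine_neighbors("woman", toy_vectors))
```

Unlike Euclidean distance, cosine similarity ignores vector length, so a larger value means a closer match.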
