<a href="https://colab.research.google.com/github/Mjh9122/ML_lit_review/blob/main/word2vec/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Efficient Estimation of Word Representations in Vector Space
## Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
### Notes: Michael Holtz

### Abstract

The authors propose two neural network architectures for embedding words into a continuous vector space. Embedding quality is measured with a word similarity task and compared against earlier neural-network-based techniques; the new models achieve higher accuracy at much lower computational cost. The resulting vectors also provide state-of-the-art performance on the authors' test set of syntactic and semantic word similarities.

### Intro

Many earlier NLP systems treat words as atomic units: each word is simply an index into a vocabulary, with no notion of similarity between words. These simple techniques hit limits in many tasks. They may need trillions of words to perform well, while tasks such as automatic speech recognition or machine translation often have corpora of only millions or billions of words. In these scenarios, a more sophisticated strategy is needed.

### Paper Goals

The goal is to learn high-quality vector embeddings for a vocabulary of millions of words from corpora of a billion or more words. Similar words should lie close to one another in the embedding space, and words can be similar in multiple ways (for example, sharing a similar ending). More surprisingly, simple algebraic operations on these vectors preserve meaning: for example, vector("king") - vector("man") + vector("woman") lies close to vector("queen"). The authors also develop a test set for syntactic and semantic regularities and discuss how training time and accuracy depend on the embedding dimension and the amount of training data.


### Previous work

Previous work learned word embeddings with neural network language models (NNLMs). The first proposed models jointly learned a word vector representation and a statistical language model. Later architectures first learned the word vectors using a network with a single hidden layer and then used those vectors to train the NNLM. Other work also found that NLP tasks became easier when working with word vectors. The architectures in this paper aim to learn these vectors in a much more computationally efficient way.

### NNLM Architectures

#### Feedforward NNLM
This model takes the N previous words as input, each encoded with 1-of-V coding, where V is the vocabulary size. The input layer is mapped to a projection layer P through a shared projection matrix. The projection layer feeds a non-linear hidden layer, which in turn predicts a probability distribution over all V words in the output layer.
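The layer stack described above can be sketched in PyTorch. This is a minimal illustration and not the paper's implementation: the class name `FeedforwardNNLM`, the `tanh` activation, and all sizes are assumptions chosen for the example, and the output uses a full softmax.

```python
import torch
import torch.nn as nn


class FeedforwardNNLM(nn.Module):
    """Illustrative feedforward NNLM: projection -> hidden -> output."""

    def __init__(self, vocab_size, context_size, proj_dim, hidden_dim):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a 1-of-V vector
        # by a shared projection matrix.
        self.projection = nn.Embedding(vocab_size, proj_dim)
        self.hidden = nn.Linear(context_size * proj_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the N previous words
        projected = self.projection(context_ids).flatten(start_dim=1)
        hidden = torch.tanh(self.hidden(projected))  # assumed non-linearity
        return self.output(hidden)  # unnormalized scores over the vocabulary


nnlm = FeedforwardNNLM(vocab_size=50, context_size=4, proj_dim=8, hidden_dim=16)
nnlm_logits = nnlm(torch.randint(0, 50, (3, 4)))
nnlm_probs = torch.softmax(nnlm_logits, dim=1)  # distribution over V words
```

The concatenated projection layer (`context_size * proj_dim` values) feeding a dense hidden layer is the part of this design that the log-linear models later remove.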

#### Recurrent NNLM (RNNLM)
This model removes the need to specify the context length for the input. The RNNLM has no projection layer; it consists of only input, hidden, and output layers. It is a recurrent architecture because there are time-delayed connections from the hidden layer to itself. These connections theoretically give the model a form of short-term memory, allowing past words to influence future predictions.
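A rough sketch of this input, hidden, output structure, using PyTorch's built-in `nn.RNN` for the hidden-to-hidden connection. The class name `RecurrentNNLM` and the sizes are assumptions for illustration, and `nn.RNN` uses a `tanh` activation by default, which may differ from the original RNNLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentNNLM(nn.Module):
    """Illustrative RNNLM: 1-of-V input -> recurrent hidden -> output."""

    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.vocab_size = vocab_size
        # No projection layer: the recurrent layer reads 1-of-V vectors directly.
        self.rnn = nn.RNN(vocab_size, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, hidden=None):
        # word_ids: (batch, seq_len) word indices
        one_hot = F.one_hot(word_ids, self.vocab_size).float()
        # `hidden` carries information from earlier words to later ones.
        states, hidden = self.rnn(one_hot, hidden)
        return self.output(states), hidden


rnnlm = RecurrentNNLM(vocab_size=50, hidden_dim=16)
rnn_logits, rnn_hidden = rnnlm(torch.randint(0, 50, (2, 5)))
```

One-hot inputs are only practical for a toy vocabulary like this one; they are used here to mirror the description, not as a recommendation.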


### New log-linear models
These models are the main focus of the paper. They avoid the non-linear hidden layer of the neural nets above, which allows for much more efficient training. The word vectors they produce can then serve as a much lower-dimensional input when training the architectures above.
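To make "more efficient" concrete, the paper compares the per-example training complexity $Q$ of each architecture. The formulas below are transcribed from the paper as I understand them, with $N$ the number of context words, $D$ the word vector dimensionality, $H$ the hidden layer size, $V$ the vocabulary size, and $C$ the maximum context distance for Skip-gram. The $\log_2(V)$ terms assume a hierarchical softmax output.

$$
\begin{aligned}
Q_{\text{NNLM}} &= N \times D + N \times D \times H + H \times V \\
Q_{\text{RNNLM}} &= H \times H + H \times V \\
Q_{\text{CBOW}} &= N \times D + D \times \log_2(V) \\
Q_{\text{Skip-gram}} &= C \times (D + D \times \log_2(V))
\end{aligned}
$$

With hierarchical softmax, the $H \times V$ terms shrink to roughly $H \times \log_2(V)$, so the dominant cost becomes $N \times D \times H$ for the NNLM and $H \times H$ for the RNNLM. Both come from the non-linear hidden layer that the log-linear models drop.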

#### CBOW
Continuous bag of words (CBOW) is similar to the feedforward NNLM, but there is no non-linear hidden layer. Instead, the projection layer is shared across all words, so the context word vectors are averaged and word order does not matter. The input contains words from both the past and the future, and the goal is to classify the word in the middle.

#### Imports

In [1]:
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import random
import re

from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
writer = SummaryWriter('runs')
print(f'Using: {device}')

Using: cuda


#### Import corpus and build datasets

In [2]:
with open('text8', 'r') as f:
    text = f.read()

tokens = re.findall(r'\b[a-zA-Z]+\b', text)
corpus = [word.lower() for word in tokens]

In [None]:
# Create vocab
vocab = set(corpus)
vocab_size = len(vocab)
print(f'vocab size: {vocab_size}')

# Map words to embedding indices and back.
# Sorting makes the index assignment reproducible across runs.
ix_to_word = dict(enumerate(sorted(vocab)))
word_to_ix = {word: ix for ix, word in ix_to_word.items()}

# Create dataset by taking four words on either side of a target word
context_length = 4
contexts, targets = [], []

for i in tqdm(range(context_length, len(corpus) - context_length)):
    context = corpus[i - context_length: i] + corpus[i + 1 : i + context_length + 1]
    target = corpus[i]
    targets.append(target)
    contexts.append(context)


CBOW_Xs = torch.tensor([[word_to_ix[w] for w in x] for x in contexts])
CBOW_ys = torch.tensor([word_to_ix[y] for y in targets])
SKIP_Xs, SKIP_ys = [], []

# for context, target in tqdm(zip(contexts, targets), total = len(targets)):
#     for ctxt_word in context:
#         SKIP_Xs.append(word_to_ix[target])
#         SKIP_ys.append(word_to_ix[ctxt_word])
# 
# SKIP_Xs = torch.tensor(SKIP_Xs)
# SKIP_ys = torch.tensor(SKIP_ys)
# 
# print(f'CBOW training length: {CBOW_Xs.shape} Skipgram training length: {SKIP_ys.shape}')

vocab size: 833184


  0%|          | 663/124301818 [00:02<129:18:32, 267.02it/s]
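The Python loop above is slow on a corpus of this size. One possible alternative is to build all windows at once with `Tensor.unfold`. This is a sketch under the assumption that the corpus has already been mapped to integer ids (for example with `word_to_ix`); the function name `build_cbow_windows` is hypothetical.

```python
import torch


def build_cbow_windows(token_ids, context_length):
    """Return (contexts, targets) for every full window in the id sequence."""
    ids = torch.as_tensor(token_ids, dtype=torch.long)
    window = 2 * context_length + 1
    # Sliding windows of size `window` with stride 1: (num_windows, window)
    windows = ids.unfold(0, window, 1)
    targets = windows[:, context_length]  # the middle word of each window
    contexts = torch.cat(
        [windows[:, :context_length], windows[:, context_length + 1:]], dim=1
    )
    return contexts, targets


# Toy example: ids 0..9 with two context words on each side
demo_contexts, demo_targets = build_cbow_windows(list(range(10)), context_length=2)
```

This yields the same contexts and targets as the loop (one example per position that has a full window on both sides), but as tensors rather than lists of strings.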

#### Define dataset class

In [None]:
# Quick dataset class for dataloading
class Simple_Dataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Returns (label, features); the training loops unpack in this order
        return self.Y[idx], self.X[idx]

#### Define CBOW and Skipgram classes

In [None]:
# CBOW model proper. No non-linear activations, as in the paper.
# Note: context vectors are summed here, whereas the paper averages them.
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def forward(self, inputs):
        embeds = self.embeddings(inputs).sum(dim=1)
        out = self.linear(embeds)
        return out

    def get_word_embedding(self, word):
        # Build the index on whichever device the embedding weights live on
        word = torch.tensor(word_to_ix[word], device=self.embeddings.weight.device)
        return self.embeddings(word).view(1,-1)

In [None]:
# SKIPGRAM model. No non-linear activations. No negative sampling. 
class Skipgram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Skipgram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size, bias=False)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

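    def forward(self, inputs):
        # inputs: (batch,) center-word indices -> embeds: (batch, embedding_dim)
        embeds = self.embeddings(inputs)
        out = self.linear(embeds)
        return out

    def get_word_embedding(self, word):
        # Build the index on whichever device the embedding weights live on
        word = torch.tensor(word_to_ix[word], device=self.embeddings.weight.device)
        return self.embeddings(word).view(1,-1)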
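The Skipgram model predicts one context word from one center word, so each CBOW example expands into several (center, context) pairs. Here is a sketch of a vectorized version of the commented-out pairing loop, assuming `contexts` and `targets` are already integer tensors; the function name `build_skipgram_pairs` is hypothetical.

```python
import torch


def build_skipgram_pairs(contexts, targets):
    """Expand CBOW examples into (center word, context word) pairs."""
    # contexts: (num_examples, window), targets: (num_examples,)
    # Repeat each center word once per context word in its window.
    centers = targets.repeat_interleave(contexts.shape[1])
    context_words = contexts.reshape(-1)
    return centers, context_words


toy_contexts = torch.tensor([[0, 1, 3, 4], [1, 2, 4, 5]])
toy_targets = torch.tensor([2, 3])
skip_centers, skip_context_words = build_skipgram_pairs(toy_contexts, toy_targets)
```

The pairs come out in the same order as the nested loop: all context words of the first example, then all of the second, and so on.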

#### Instantiate CBOW, loss, optim, etc.

In [None]:
# Create cbow instance, loss, optimizer, dataset, and dataloader
batch_size = 256

cbow_model = CBOW(vocab_size, 256)
cbow_model.to(device)
cbow_loss = nn.CrossEntropyLoss(reduction='sum')
cbow_optim = torch.optim.Adam(cbow_model.parameters())

CBOW_X_train, CBOW_X_test, CBOW_y_train, CBOW_y_test = train_test_split(CBOW_Xs, CBOW_ys, test_size = .1)
cbow_train = Simple_Dataset(CBOW_X_train, CBOW_y_train)
cbow_test = Simple_Dataset(CBOW_X_test, CBOW_y_test)
cbow_train_dataloader = torch.utils.data.DataLoader(cbow_train, batch_size=batch_size, shuffle=True)
cbow_test_dataloader = torch.utils.data.DataLoader(cbow_test, batch_size=batch_size, shuffle=False)

len(cbow_train)

899992

#### CBOW Training

In [None]:
for epoch in range(5):
    running_correct = 0
    running_loss = 0

    cbow_model.train()
    for i, (labels, features) in tqdm(enumerate(cbow_train_dataloader), total = len(cbow_train_dataloader)):
        labels, features = labels.to(device), features.to(device)
        cbow_optim.zero_grad()
        y_pred = cbow_model(features)
        loss = cbow_loss(y_pred, labels)
        loss.backward()
        cbow_optim.step()

        preds = torch.argmax(y_pred, dim=1)
        running_correct += (preds == labels).sum().item()
        running_loss += loss.item()

        # Log the mean per-example loss over each full block of 100 batches
        if i % 100 == 99:
            writer.add_scalar('training batch loss', running_loss/(batch_size * 100), epoch * len(cbow_train_dataloader) + i)
            running_loss = 0

    writer.add_scalar('train accuracy', running_correct/len(cbow_train), epoch)
    running_correct = 0

    cbow_model.eval()
    with torch.no_grad():
        for i, (labels, features) in tqdm(enumerate(cbow_test_dataloader), total = len(cbow_test_dataloader)):
            labels, features = labels.to(device), features.to(device)
            y_pred = cbow_model(features)
            preds = torch.argmax(y_pred, dim=1)
            running_correct += (preds == labels).sum().item()
    
    writer.add_scalar('test accuracy', running_correct/len(cbow_test), epoch)

100%|██████████| 3516/3516 [00:55<00:00, 63.42it/s]
 11%|█         | 391/3516 [00:01<00:08, 353.86it/s]
100%|██████████| 3516/3516 [00:58<00:00, 59.73it/s]
 11%|█         | 391/3516 [00:01<00:08, 358.49it/s]
  5%|▌         | 180/3516 [00:02<00:53, 61.92it/s]

In [None]:
from sklearn.metrics.pairwise import cosine_similarity 

In [None]:
king = cbow_model.get_word_embedding('king').cpu().detach().numpy()
woman = cbow_model.get_word_embedding('woman').cpu().detach().numpy() 
man = cbow_model.get_word_embedding('man').cpu().detach().numpy()
queen = cbow_model.get_word_embedding('queen').cpu().detach().numpy() 
cosine_similarity(king, queen), cosine_similarity(man, woman)

(array([[0.06784402]], dtype=float32), array([[0.0668553]], dtype=float32))

In [None]:
cosine_similarity(queen, king - man + woman)

array([[0.03422299]], dtype=float32)
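The cell above only compares the analogy vector against `queen`. A check closer in spirit to the paper's evaluation searches the whole vocabulary for the nearest word. The helper below is a sketch with a hypothetical name (`analogy`) and hand-made toy vectors; with the trained model one could pass something like `cbow_model.embeddings.weight.detach().cpu()` as the embedding matrix.

```python
import torch
import torch.nn.functional as F


def analogy(embeddings, word_to_ix, ix_to_word, a, b, c, top_k=1):
    """Words closest (by cosine) to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = embeddings[word_to_ix[a]] - embeddings[word_to_ix[b]] + embeddings[word_to_ix[c]]
    # Cosine similarity between the query and every row of the matrix
    sims = F.cosine_similarity(query.unsqueeze(0), embeddings, dim=1)
    for w in (a, b, c):
        sims[word_to_ix[w]] = float('-inf')  # never return an input word
    best = torch.topk(sims, top_k).indices.tolist()
    return [ix_to_word[i] for i in best]


# Toy 2-D vectors built so that king - man + woman equals queen exactly
toy_words = ['king', 'man', 'woman', 'queen', 'apple']
toy_w2i = {w: i for i, w in enumerate(toy_words)}
toy_i2w = {i: w for w, i in toy_w2i.items()}
toy_vectors = torch.tensor([[3., 0.], [1., 0.], [1., 1.], [3., 1.], [-1., -2.]])

toy_result = analogy(toy_vectors, toy_w2i, toy_i2w, 'king', 'man', 'woman')
```

Excluding the three input words matters in practice, because the query vector is usually most similar to one of them.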