# 4. Pretrain Word2Vec

We now implement the skip-gram model defined in [Section 1](./readme.md) and then pretrain word2vec with negative sampling on the PTB dataset. First, we obtain the data iterator and the vocabulary for this dataset by calling the `load_data_ptb` function, which was described in [Section 3](./readme.md).

In [8]:
import math
import torch
from torch import nn
from C14_3_Dataset import *
batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = load_data_ptb(batch_size, max_window_size, num_noise_words)

## 4.1 The Skip-Gram Model

We implement the skip-gram model by using embedding layers and batch matrix multiplications. 

First, let us review how embedding layers work.

### 4.1.1 Embedding Layer

An embedding layer *maps* a **token's index** to its **feature vector**. The weight of this layer is a matrix whose number of rows equals the dictionary size (`num_embeddings`) and whose number of columns equals the vector dimension of each token (`embedding_dim`). After a word embedding model is trained, this weight matrix is what we need.

In [6]:
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
print(f'Parameter embedding_weight ({embed.weight.shape}, '
      f'dtype={embed.weight.dtype})')

Parameter embedding_weight (torch.Size([20, 4]), dtype=torch.float32)


The input of an embedding layer is the index of a token (word). For any token index $i$, its vector representation is the $i^{th}$ row of the weight matrix in the embedding layer. Since the vector dimension (`embedding_dim`) was set to 4, the embedding layer returns vectors with shape (2, 3, 4) for a minibatch of token indices with shape (2, 3).

In [7]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])  # a minibatch of token indices with shape (2, 3)
embed(x)  # returns vectors with shape (2, 3, 4)

tensor([[[-0.1306,  1.0344,  1.8776, -0.1079],
         [ 0.2219, -0.7160,  0.8888, -1.7895],
         [ 2.2135, -1.2377,  1.2388, -0.8718]],

        [[ 0.2856,  1.7473, -0.3993, -0.8712],
         [ 0.4307,  0.4617, -0.3798, -0.5880],
         [ 1.2015, -0.0852, -1.1266,  0.8604]]], grad_fn=<EmbeddingBackward>)
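
The lookup described above can be checked directly. The following is a minimal sketch, assuming a freshly initialized layer of the same size as `embed`; the seed and the names `looked_up` and `indexed` are illustrative and not part of the original notebook. It compares the layer's output with plain row indexing into the weight matrix.

```python
import torch
from torch import nn

torch.manual_seed(0)  # illustrative seed so the sketch is repeatable
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
x = torch.tensor([[1, 2, 3], [4, 5, 6]])  # token indices, shape (2, 3)

looked_up = embed(x)       # forward pass of the embedding layer
indexed = embed.weight[x]  # index the rows of the weight matrix directly

print(looked_up.shape)                               # shape (2, 3, 4)
print(torch.equal(looked_up, indexed))               # the two results match
print(torch.equal(looked_up[0, 0], embed.weight[1])) # token 1 -> row 1
```

Under these assumptions, the embedding layer behaves like a trainable lookup table: each index selects one row of `embed.weight`.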

### 4.1.2 Defining the Forward Propagation
In the forward propagation, the input of the skip-gram model consists of **the center word indices** `center` of shape (*batch size*, *1*) and **the concatenated context and noise word indices** `contexts_and_negatives` of shape (*batch size*, `max_len`), where `max_len` is defined in [Section 3.5](./readme.md).

These two variables are first transformed from **token indices** into **vectors** via the embedding layers. Their batch matrix multiplication then returns an output of shape (*batch size*, *1*, `max_len`).

Each element in the output is the dot product of a center word vector and a context or noise word vector.

In [11]:
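def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    """
    Return the dot products of each center word vector with its context
    and noise word vectors.
    :param center: center word indices, size = (batch, 1)
    :param contexts_and_negatives: context + noise word indices, size = (batch, max_len)
    :param embed_v: embedding layer for center words; its output has size (batch, 1, v)
    :param embed_u: embedding layer for context and noise words; its output has size (batch, max_len, v)
    :return: batch matrix product of size (batch, 1, max_len)
    """
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    # Batch matrix multiplication: (b, 1, v) @ (b, v, max_len) -> (b, 1, max_len)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred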

Let us print the output shape of this `skip_gram` function for some example inputs.

In [12]:
skip_gram(torch.ones((2, 1), dtype=torch.long), 
          torch.ones((2, 4), dtype=torch.long), embed, embed).shape

torch.Size([2, 1, 4])
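
To see that each output element really is a dot product, the sketch below repeats the same computation outside the function and compares one entry against `torch.dot`. The two separate embedding layers, their sizes, the seed, and the example indices (with 0 standing in for padding) are assumptions made for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)  # illustrative seed
embed_v = nn.Embedding(num_embeddings=20, embedding_dim=4)  # center words
embed_u = nn.Embedding(num_embeddings=20, embedding_dim=4)  # context/noise words

center = torch.tensor([[1], [2]])                      # (batch=2, 1)
contexts_and_negatives = torch.tensor([[3, 4, 5, 0],
                                       [6, 7, 0, 0]])  # (batch=2, max_len=4)

v = embed_v(center)                      # (2, 1, 4)
u = embed_u(contexts_and_negatives)      # (2, 4, 4)
pred = torch.bmm(v, u.permute(0, 2, 1))  # (2, 1, 4)

# Entry [b, 0, j] should equal the dot product of the b-th center vector
# with the j-th context/noise vector of the same example.
b, j = 1, 2
manual = torch.dot(v[b, 0], u[b, j])
print(pred.shape)
print(torch.allclose(pred[b, 0, j], manual))
```

Permuting `u` to shape (batch, vector dim, `max_len`) is what lets a single `torch.bmm` call compute all `max_len` dot products per example at once.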