# L1: Language modelling

In this lab you will implement and train two neural language models: the fixed-window model and the recurrent neural network model. You will evaluate these models by computing their perplexity on a benchmark dataset.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

For this lab, you should use the GPU if you have one:

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
print(torch.cuda.is_available())

True


## Data

The data for this lab is [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100 million tokens extracted from the “Good” and “Featured” articles on Wikipedia. We will use the small version of the dataset, which contains slightly more than 2.5 million tokens.

The next cell contains code for an object that will act as a container for the “training” and the “validation” section of the data. We fill this container by reading the corresponding text files. The only processing we do is to whitespace-tokenise, and to enclose each non-empty line within `<bos>` (beginning-of-sentence) and `<eos>` (end-of-sentence) tokens.

In [4]:
class WikiText(object):
    
    def __init__(self):
        self.vocab = {}
        self.train = self.read_data('wiki.train.tokens')
        self.valid = self.read_data('wiki.valid.tokens')
    
    def read_data(self, path):
        ids = []
        with open(path, encoding='utf-8') as source:
            for line in source:
                line = line.rstrip()
                if line:
                    for token in ['<bos>'] + line.split() + ['<eos>']:
                        if token not in self.vocab:
                            self.vocab[token] = len(self.vocab)
                        ids.append(self.vocab[token])
        return ids

The cell below loads the data and prints the total number of tokens and the size of the vocabulary.

In [5]:
wikitext = WikiText()
print('Tokens in train:', len(wikitext.train))
print('Tokens in valid:', len(wikitext.valid))
print('Vocabulary size:', len(wikitext.vocab))

Tokens in train: 2099444
Tokens in valid: 218808
Vocabulary size: 33279


## Problem 1: Fixed-window model

In this section, you will implement and train the fixed-window neural language model proposed by [Bengio et al. (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and presented in the lectures. Recall that an input to the network takes the form of a vector of $n-1$ integers representing the preceding words. Each integer is mapped to a vector via an embedding layer. (All positions share the same embedding.) The embedding vectors are then concatenated and sent through a two-layer feed-forward network with a non-linearity in the form of a rectified linear unit (ReLU) and a final softmax layer.

### Problem 1.1: Vectorise the data (1&nbsp;point)

Your first task is to write code for transforming the data in the WikiText container into a vectorised form that can be fed to the fixed-window model. Concretely, you will implement a [collate function](https://pytorch.org/docs/stable/data.html#dataloader-collate-fn) in the form of a callable vectoriser object. Complete the skeleton code in the cell below:

In [6]:
class FixedWindowVectorizer(object):
    def __init__(self, n):
        # n-gram order
        self.n = n

    def __call__(self, data):
        # TODO: Replace the following line with your own code
        N = len(data) - (self.n-1) # calculate how many n-grams in the dataset 

        X = [data[i:i+self.n-1] for i in range(N)]
        y = [data[i+self.n-1] for i in range(N) ]                               # [ , )
        return torch.tensor(X), torch.tensor(y)

Your code should implement the following specification:

**__init__** (*self*, *n*)

> Creates a new vectoriser with n-gram order $n$. Your code should be able to handle arbitrary n-gram orders $n \geq 1$.

**__call__** (*self*, *data*)

> Transforms WikiText *data* (a list of word ids) into a pair of tensors $\mathbf{X}$, $\mathbf{y}$ that can be used to train the fixed-window model. Let $N$ be the total number of $n$-grams from the token list; then $\mathbf{X}$ is a matrix with shape $(N, n-1)$ and $\mathbf{y}$ is a vector with length $N$.

#### 🤞 Test your code

Test your implementation by running the code in the next cell. Does the output match your expectation?

In [7]:
valid_x, valid_y = FixedWindowVectorizer(2)(wikitext.valid)

In [8]:
train_x, train_y = FixedWindowVectorizer(2)(wikitext.train)

### Problem 1.2: Implement the model (2&nbsp;points)

Your next task is to implement the fixed-window model based on the graphical specification given in the lecture.

In [9]:
class FixedWindowModel(nn.Module):

    def __init__(self, n, n_words, embedding_dim=50, hidden_dim=50):
        # n_words -- number of words in vocabulary
        # n       -- n-gram
        
        super().__init__()
        self.embedding = nn.Embedding(n_words,embedding_dim)
        self.fc = nn.Linear((n - 1) * embedding_dim, hidden_dim) 
        self.out = nn.Linear(hidden_dim, n_words)   
        # TODO: Add your own code

    def forward(self, x):
        embeds = self.embedding(x)
        embeds = embeds.view(x.size(0), -1)
        hidden = self.fc(embeds)
        output = self.out(hidden) # output is one-hot code for every word in vocablary
        probs = F.softmax(output, dim=1)        
        return output

Here is the specification of the two methods:

**__init__** (*self*, *n*, *n_words*, *embedding_dim*=50, *hidden_dim*=50)

> Creates a new fixed-window neural language model. The argument *n* specifies the model&rsquo;s $n$-gram order. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the hidden layer of the feedforward network, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, n-1)$, where $B$ is the batch size. The output of the forward pass is a tensor of shape $(B, V)$ where $V$ is the number of words in the vocabulary.

**Hint:** The most efficient way to implement the vector concatenation in this model is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

In [10]:
n_words = len(wikitext.vocab)  
model = FixedWindowModel(2, n_words)

batch_size = 5
train_x_batch = train_x[:batch_size]

output = model(train_x_batch)
print(output)

tensor([[ 0.2684,  0.2154,  0.1707,  ..., -0.0739,  0.2872,  0.1453],
        [-0.5042,  0.4012,  0.1209,  ...,  0.0705,  0.1328,  0.1931],
        [ 0.0322,  0.3868, -0.0400,  ..., -0.3120,  0.3930, -0.0578],
        [ 0.4440, -0.4100,  0.5210,  ...,  0.0647, -0.8711, -0.1738],
        [-0.6703,  0.4635,  0.4008,  ..., -0.2705, -0.1936,  0.2240]],
       grad_fn=<AddmmBackward0>)


### Problem 1.3: Train the model (3&nbsp;points)

Your final task is to write code to train the fixed-window model using minibatch gradient descent and the cross-entropy loss function. This should be a straightforward generalisation of the training loops that you have seen so far. Complete the skeleton code in the cell below:

In [11]:
def get_batches(data, batch_size):
    for i in range(0, len(data) - batch_size + 1, batch_size):
        yield data[i:i + batch_size]

In [14]:
def train_fixed_window(n, n_epochs=2, batch_size=3200, lr=0.01):
    n_words = len(wikitext.vocab)
    model = FixedWindowModel(n, n_words)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    total_loss = 0

    for epoch in range(n_epochs):
        model.train()  # 设置模型为训练模式
        for batch_x, batch_y in zip(get_batches(train_x, batch_size), get_batches(train_y, batch_size)):
            # batch_x 和 batch_y 是配对的输入和目标

            # 清空梯度
            optimizer.zero_grad()

            # 前向传播
            outputs = model(batch_x)
            
            # 计算损失
            loss = F.cross_entropy(outputs, batch_y)
            print("train_loss: ",loss)
            # 反向传播和优化
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            
    train_loss = total_loss/len(train_x)

    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch_x,batch_y in zip(get_batches(valid_x, batch_size), get_batches(valid_y, batch_size)):
            batch_y = torch.tensor(batch_y, dtype=torch.long)
            outputs = model(batch_x)
            total_loss += F.cross_entropy(outputs, batch_y, reduction='sum').item()

        # 计算验证集上的平均损失
    valid_loss = total_loss / len(valid_x)
    perplexity = torch.exp(torch.tensor(valid_loss))

    print(f'Epoch {epoch+1}, Train Loss: {train_loss}, Validation Perplexity: {perplexity}') 

    return model


Here is the specification of the training function:

**train_fixed_window** (*n*, *n_epochs* = 1, *batch_size* = 3200, *lr* = 0.01)

> Trains a fixed-window neural language model of order *n* using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

The code in the cell below trains a bigram model.

In [15]:
model_fixed_window = train_fixed_window(2)

train_loss:  tensor(10.4926, grad_fn=<NllLossBackward0>)
train_loss:  tensor(10.3766, grad_fn=<NllLossBackward0>)
train_loss:  tensor(10.2626, grad_fn=<NllLossBackward0>)
train_loss:  tensor(10.1719, grad_fn=<NllLossBackward0>)
train_loss:  tensor(10.0996, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.8421, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.8001, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.5943, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.3518, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.2322, grad_fn=<NllLossBackward0>)
train_loss:  tensor(9.0240, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.9626, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.9888, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.8243, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.6283, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.3033, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.3118, grad_fn=<NllLossBackward0>)
train_loss:  tensor(8.2905, grad_fn=<NllLos

  batch_y = torch.tensor(batch_y, dtype=torch.long)


Epoch 2, Train Loss: 0.0038611886944710174, Validation Perplexity: 344.9189147949219


#### Performance goal

**Your submitted notebook must contain output demonstrating a validation perplexity of at most 350.** If you do not reach this perplexity after the first epoch, try training for a second epoch.

⚠️ Computing the validation perplexity in one go (for the full validation set) will probably exhaust your computer’s memory and/or take a lot of time. If you run into this problem, do the computation at the minibatch level and aggregate the results.

#### 🤞 Test your code

To see whether your network is learning something, print or plot the loss and/or the perplexity on the training data. If the two values do not decrease during training, try to find the problem before wasting time (and electricity) on useless computation.

Training and even evaluation will take some time – on a CPU, you should expect several minutes per epoch, depending on hardware. Our reference implementation uses a GPU and runs in less than 30 seconds per epoch on [Colab](http://colab.research.google.com).

## Problem 2: Recurrent neural network model

In this section, you will implement the recurrent neural network language model. Recall that an input to this model is a vector of word ids. Each integer is mapped to an embedding vector. The sequence of embedded vectors is then fed into an unrolled LSTM. At each position $i$ in the sequence, the hidden state of the LSTM at that position is sent through a linear transformation into a final softmax layer representing the probability distribution over the words at position $i+1$. In theory, the input vector could represent the complete training data; for practical reasons, however, we will truncate the input to some fixed value *bptt_len*. This length is called the **backpropagation-through-time horizon**.

### Problem 2.1: Vectorise the data (1&nbsp;point)

As in the previous problem, your first task is to transform the data in the WikiText container into a vectorised form that can be fed to the model.

In [12]:
class RNNVectorizer(object):
    def __init__(self, bptt_len):
        # backpropagation-through-time horizon
        self.bptt_len = bptt_len

    def __call__(self, data):
        # TODO: Replace the following line with your own code
        num_sequences = len(data) // self.bptt_len

        # Initialize X and Y tensors
        X = torch.zeros((num_sequences, self.bptt_len), dtype=torch.long)
        Y = torch.zeros((num_sequences, self.bptt_len), dtype=torch.long)

        for i in range(num_sequences):
            start_idx = i * self.bptt_len
            end_idx = start_idx + self.bptt_len

            # Populate X with the current sequence
            X[i] = torch.tensor(data[start_idx:end_idx])

            # Populate Y with the sequence offset by one position
            Y[i] = torch.tensor(data[start_idx + 1:end_idx + 1] if end_idx < len(data) else data[start_idx + 1:] + [0])

        return X, Y

Your vectoriser should meet the following specification:

**__init__** (*self*, *bptt_len*)

> Creates a new vectoriser. The parameter *bptt_len* specifies the backpropagation-through-time horizon.

**__call__** (*self*, *data*)

> Transforms a list of token indexes *data* into a pair of tensors $\mathbf{X}$, $\mathbf{Y}$ that can be used to train the recurrent neural language model. The rows of both tensors represent contiguous subsequences of token indexes of length *bptt_len*. Compared to the sequences in $\mathbf{X}$, the corresponding sequences in $\mathbf{Y}$ are shifted one position to the right. More precisely, if the $i$th row of $\mathbf{X}$ is the sequence that starts at token position $j$, then the same row of $\mathbf{Y}$ is the sequence that starts at position $j+1$.

#### 🤞 Test your code

Test your implementation by running the following code:

In [13]:
valid_x, valid_y = RNNVectorizer(32)(wikitext.valid)

print(valid_x.size(), valid_y.size())

torch.Size([6837, 32]) torch.Size([6837, 32])


In [14]:
train_x, train_y = RNNVectorizer(32)(wikitext.train)

In [15]:
train_x[0]

tensor([ 0,  1,  2,  3,  4,  1,  5,  0,  6,  7,  2,  8,  9, 10,  3, 11, 12,  9,
        13, 14, 15, 16,  2, 17, 18, 19,  8, 20, 14, 21, 22, 23])

In [18]:
print(train_y[0])

tensor([ 1,  2,  3,  4,  1,  5,  0,  6,  7,  2,  8,  9, 10,  3, 11, 12,  9, 13,
        14, 15, 16,  2, 17, 18, 19,  8, 20, 14, 21, 22, 23, 24])


### Problem 2.2: Implement the model (2&nbsp;points)

Your next task is to implement the recurrent neural network model based on the graphical specification.

In [21]:
class RNNModel(nn.Module):

    def __init__(self, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim, batch_first = True)
        self.fc = nn.Linear(hidden_dim, n_words)

    def forward(self, x):
        # TODO: Replace the next line with your own code
        x = self.embedding(x)
        lstm_out, _ = self.LSTM(x)
        out = self.fc(lstm_out)
        #out = F.softmax(out, dim = -1)
        return out

Your implementation should follow this specification:

**__init__** (*self*, *n_words*, *embedding_dim* = 50, *hidden_dim* = 50)

> Creates a new recurrent neural network language model based on an LSTM. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the LSTM hidden layer, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, H)$, where $B$ is the batch size and $H$ is the length of each input sequence. The shape of the output tensor is $(B, H, V)$, where $V$ is the size of the vocabulary.

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

In [22]:
print(valid_x[0])
model = RNNModel(n_words=len(wikitext.vocab))
output = model(valid_x[0])
print(output)

tensor([    0,     1, 32967, 32968,     1,     5,     0, 32967, 32968,    14,
          407,    24,    18,  6254, 19903,   311,  1445, 19903,    14,    27,
           28,  2577,    17,    10, 19903,   116,    18,  4930,  4122,  9612,
           14,  4855])
tensor([[ 0.0219, -0.0266, -0.0625,  ...,  0.0937,  0.1366, -0.1284],
        [-0.0493, -0.1852, -0.1740,  ...,  0.1004,  0.2059, -0.0193],
        [-0.0632, -0.0537,  0.0171,  ..., -0.0079,  0.0997, -0.0614],
        ...,
        [-0.1008,  0.0491, -0.1663,  ..., -0.0189,  0.1513, -0.1380],
        [ 0.0257,  0.0202, -0.0204,  ..., -0.0980,  0.0571, -0.1446],
        [ 0.0385,  0.0463, -0.1642,  ..., -0.0316,  0.0994,  0.0009]],
       grad_fn=<AddmmBackward0>)


### Problem 2.3: Train the model (3&nbsp;points)

The training loop for the recurrent neural network model is essentially identical to the loop that you wrote for the feed-forward model. The only thing to note is that the cross-entropy loss function expects its input to be a two-dimensional tensor; you will therefore have to re-shape the output tensor from the LSTM as well as the gold-standard output tensor in a suitable way. The most efficient way to do so is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.

In [23]:
train_x

tensor([[    0,     1,     2,  ...,    21,    22,    23],
        [   24,     2,     3,  ...,    44,    26,    14],
        [   47,    27,    18,  ...,    23,    18,    60],
        ...,
        [  349,     7,   428,  ...,   285, 23961,    27],
        [  495,   490,   152,  ...,  4855,  2491,    16],
        [   84,  9617,    27,  ...,   565,  1363,   152]])

In [26]:
def get_batches(data, batch_size):
    for i in range(0, len(data) - batch_size + 1, batch_size):
        yield data[i:i + batch_size]

def train_rnn(n_epochs=1, batch_size=1000, lr=0.01):
    # Initialize the model
    model = RNNModel(n_words=len(wikitext.vocab))
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        model.train()
        total_loss = 0

        # Iterate over batches
        for batch_x, batch_y in zip(get_batches(train_x, batch_size), get_batches(train_y, batch_size)):
            optimizer.zero_grad()
            outputs = model(batch_x)

            # Reshape outputs and targets to fit CrossEntropyLoss expectations
            outputs = outputs.view(-1, outputs.shape[-1])
            batch_y = batch_y.view(-1)

            loss = criterion(outputs, batch_y)# why is better than F.cross_entropy
            print("train_loss: ", loss.item())

            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        train_loss = total_loss/len(train_x)

        # Calculate perplexity on validation data
        # perplexity = calculate_perplexity(model, validation_data, batch_size, ...)
        # print(f'Epoch {epoch+1}, Perplexity: {perplexity}')
        model.eval()
        total_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in zip(get_batches(valid_x, batch_size), get_batches(valid_y, batch_size)):
                batch_y = torch.tensor(batch_y, dtype=torch.long)
                outputs = model(batch_x)

                outputs = outputs.view(-1, outputs.shape[-1])
                batch_y = batch_y.view(-1)

                loss = criterion(outputs, batch_y)# why is better than F.cross_entropy

                total_loss += loss.item()

        # 计算验证集上的平均损失
    valid_loss = total_loss / len(valid_x)
    perplexity = torch.exp(torch.tensor(valid_loss))

    print(f'Epoch {epoch+1}, Train Loss: {train_loss}, Validation Perplexity: {perplexity}') 
    

    return model


# Additional functions like 'create_batches' and 'calculate_perplexity' need to be defined.

Here is the specification of the training function:

**train_rnn** (*n_epochs* = 1, *batch_size* = 1000, *bptt_len* = 32, *lr* = 0.01)

> Trains a recurrent neural network language model on the WikiText data using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. The parameter *bptt_len* specifies the length of the backpropagation-through-time horizon, that is, the length of the input and output sequences. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

Evaluate your model by running the following code cell:

In [27]:
model_rnn = train_rnn()

train_loss:  10.433974266052246
train_loss:  10.363835334777832
train_loss:  10.27663803100586
train_loss:  10.114789009094238
train_loss:  9.857285499572754
train_loss:  9.405230522155762
train_loss:  8.930143356323242
train_loss:  8.54808235168457
train_loss:  8.156769752502441
train_loss:  7.964465618133545
train_loss:  7.716857433319092
train_loss:  7.578598499298096
train_loss:  7.436985015869141
train_loss:  7.399197101593018
train_loss:  7.502132415771484
train_loss:  7.4681572914123535
train_loss:  7.534809112548828


#### Performance goal

**Your submitted notebook must contain output demonstrating a validation perplexity of at most 280.** If you do not reach this perplexity after the first epoch, try training for a second epoch.

## Problem 3: Parameter initialisation (3&nbsp;points)

The error surfaces explored when training neural networks can be very complex. Because of this, it is important to choose “good” initial values for the parameters. In PyTorch, the weights of the embedding layer are initialised by sampling from the standard normal distribution $\mathcal{N}(0, 1)$. Test how changing the initialisation affects the perplexity of your feed-forward language model. Find research articles that propose different initialisation strategies.

Write a short (150&nbsp;words) report about your experiments and literature search. Use the following prompts:

* What different initialisation did you try? What results did you get?
* How do your results compare to what was suggested by the research articles?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?

You are allowed to consult sources for this problem if you appropriately cite them. If in doubt, please read the [Academic Integrity Policy](https://www.ida.liu.se/~TDDE09/logistics/policies.html#academic-integrity-policy).

*TODO: Enter your text here*