# 9.3 Language Models

The goal of language models is to estimate the joint probability of the whole sequence:

$$
P(x_1, x_2, ..., x_T)
$$

where statistical tools in Section 9.1 can be applied.

In [1]:
import torch
from d2l import torch as d2l

## 9.3.1 Learning Language Models

The obvious question is how we should model a document, or even a sequence of tokens. Suppose that we tokenize text data at the word level. Let's start by applying basic probability rules:

$$
P(x_1, x_2, ..., x_T) = \prod_{t=1}^T P(x_t|x_{t-1}, ..., x_1)
$$

For example, the probability of a text sequence containing four words would be given as:

$$
P(\text{deep}, \text{learning}, \text{is}, \text{fun}) = P(\text{deep}) P(\text{learning}|\text{deep}) P(\text{is}|\text{deep}, \text{learning}) P(\text{fun}|\text{deep}, \text{learning}, \text{is})
$$
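
As a minimal sketch of this factorization, the snippet below multiplies four conditional probabilities for "deep learning is fun". The numbers are made up purely for illustration (they are not estimated from any corpus). The product is also accumulated in log space, a common way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities, invented for illustration only.
cond_probs = [
    0.01,  # P(deep)
    0.20,  # P(learning | deep)
    0.10,  # P(is | deep, learning)
    0.05,  # P(fun | deep, learning, is)
]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Equivalent computation in log space: sum the logs, then exponentiate.
log_joint = sum(math.log(p) for p in cond_probs)

print(f"joint = {joint:.2e}, exp(log_joint) = {math.exp(log_joint):.2e}")
```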

### Markov Models and n-grams

Among the sequence modeling approaches in Section 9.1, let's apply Markov models to language modeling. A distribution over sequences satisfies the Markov property of first order if $P(x_{t+1}|x_t, ..., x_1) = P(x_{t+1}|x_t)$. Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:

$$
\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1)P(x_2)P(x_3)P(x_4) \\
P(x_1, x_2, x_3, x_4) &= P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_3) \\
P(x_1, x_2, x_3, x_4) &= P(x_1)P(x_2|x_1)P(x_3|x_2, x_1)P(x_4|x_3, x_2)
\end{aligned}
$$

The probability formulae that involve one, two, and three variables are typically referred to as unigram, bigram, and trigram models, respectively. In order to compute the language model, we need to calculate the probability of words and the conditional probability of a word given the previous few words.
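
The counts that unigram, bigram, and trigram models rely on can be gathered with a few lines of Python. This is a sketch only: the helper name `ngram_counts` and the toy sentence are assumptions made for illustration and are not part of `d2l`.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, invented for illustration.
tokens = "deep learning is fun and deep learning is useful".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams[("deep",)])                   # count of "deep"
print(bigrams[("deep", "learning")])         # count of "deep learning"
print(trigrams[("deep", "learning", "is")])  # count of "deep learning is"
```

Keying each n-gram by a tuple keeps the three tables uniform, so the same function serves any order $n$.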

### Word Frequency

The probability of words can be calculated from the relative word frequency of a given word in the training dataset. For example, the estimate $\hat{P}(\text{deep})$ can be calculated as the probability of any sentence starting with the word "deep". A slightly less accurate approach would be to count all occurrences of the word "deep" and divide that count by the total number of words in the corpus. This works fairly well, particularly for frequent words. Moving on, we could attempt to estimate

$$
\hat{P}(\text{learning}|\text{deep}) = \cfrac{n(\text{deep}, \text{learning})}{n(\text{deep})}
$$

where $n(x)$ and $n(x, x')$ are the number of occurrences of singletons and consecutive word pairs, respectively. Unfortunately, estimating the probability of a word pair is somewhat more difficult, since the occurrences of “deep learning” are a lot less frequent. In particular, for some unusual word combinations it may be tricky to find enough occurrences to get accurate estimates. As suggested by the empirical results in Section 9.2.5, things take a turn for the worse for three-word combinations and beyond. There will be many plausible three-word combinations that we likely will not see in our dataset. If the dataset is small or if the words are very rare, we might not find even a single occurrence of them. Unless we provide some solution to assign such word combinations a nonzero count, we will not be able to use them in a language model.
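
The count-based estimates above can be sketched as follows. The toy corpus and the helper names `p_unigram` and `p_bigram` are assumptions made for illustration. Note how an unseen pair receives probability zero, which is exactly the problem described above.

```python
from collections import Counter

# Toy corpus, invented for illustration.
tokens = "deep learning is fun and deep networks are deep".split()

unigram = Counter(tokens)                  # n(x)
bigram = Counter(zip(tokens, tokens[1:]))  # n(x, x')

def p_unigram(word):
    """Relative frequency: n(x) / total number of tokens."""
    return unigram[word] / len(tokens)

def p_bigram(word, prev):
    """Estimate of P(word | prev) = n(prev, word) / n(prev); 0 if prev is unseen."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

print(p_unigram("deep"))
print(p_bigram("learning", "deep"))
print(p_bigram("fun", "deep"))  # unseen pair: the estimate collapses to zero
```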

### Laplace Smoothing

A common strategy is to perform some form of Laplace smoothing: add a small constant to all counts. Denote by $n$ the total number of words in the training set and by $m$ the number of unique words. This solution helps with singletons, e.g., via

$$
\begin{aligned}
\hat{P}(x) &= \cfrac{n(x) + \epsilon_1/m}{n + \epsilon_1} \\
\hat{P}(x'|x) &= \cfrac{n(x, x') + \epsilon_2\hat{P}(x')}{n(x) + \epsilon_2} \\
\hat{P}(x''|x, x') &= \cfrac{n(x, x', x'') + \epsilon_3\hat{P}(x'')}{n(x, x') + \epsilon_3}
\end{aligned}
$$

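Here $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are hyperparameters. With $\epsilon_i = 0$, no smoothing is applied. As $\epsilon_i$ approaches infinity, each estimate falls back to its lower-order term: $\hat{P}(x)$ approaches the uniform probability $1/m$, and $\hat{P}(x'|x)$ approaches $\hat{P}(x')$, i.e., the model treats consecutive words as independent.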
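
A minimal sketch of the first two smoothed estimates is given below. The toy corpus, the values $\epsilon_1 = \epsilon_2 = 1$, and the helper names `p_hat` and `p_hat_cond` are arbitrary choices for illustration, not recommended settings.

```python
from collections import Counter

# Toy corpus and smoothing constants, both chosen arbitrarily for illustration.
tokens = "deep learning is fun and deep networks are deep".split()
eps1, eps2 = 1.0, 1.0

unigram = Counter(tokens)                  # n(x)
bigram = Counter(zip(tokens, tokens[1:]))  # n(x, x')
n = len(tokens)    # total number of words
m = len(unigram)   # number of unique words

def p_hat(x):
    """Smoothed unigram estimate: (n(x) + eps1/m) / (n + eps1)."""
    return (unigram[x] + eps1 / m) / (n + eps1)

def p_hat_cond(x_next, x):
    """Smoothed bigram estimate, pulled toward p_hat(x_next) by eps2."""
    return (bigram[(x, x_next)] + eps2 * p_hat(x_next)) / (unigram[x] + eps2)

print(p_hat_cond("learning", "deep"))  # pair seen in the corpus
print(p_hat_cond("fun", "deep"))       # unseen pair, now nonzero
```

The smoothed unigram estimates still sum to one over the vocabulary, because the added mass $\epsilon_1/m$ per word is matched by the extra $\epsilon_1$ in the denominator.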

## 9.3.2 Perplexity

How do we measure the quality of a language model? One way is to check how surprising the text is. A good language model is able to predict, with high accuracy, the tokens that come next. Consider the following continuations of the phrase "It is raining", as proposed by different language models:

1. It is raining outside
2. It is raining banana tree
3. It is raining awepoifjawepoifj

In terms of quality, Example 1 is the best. We might measure the quality of the model by computing the likelihood of the sequence. Unfortunately, this number is hard to interpret and difficult to compare across texts. After all, shorter sequences are much more likely to occur than longer ones, so evaluating the model on Tolstoy's magnum opus *War and Peace* will inevitably produce a much smaller likelihood than evaluating it on a short text. What is missing is the equivalent of an average.

Thus, we take the geometric mean of the per-token probabilities, which accounts for the length of the text, and define perplexity as its reciprocal:

$$
\text{Perplexity} = \left( \cfrac{1}{\prod_{i=1}^n P(x_i|x_{<i})} \right)^{1/n}
$$

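Note that perplexity is the exponential of the average per-token cross-entropy (negative log-likelihood) of the model on the sequence, $\exp\left(-\frac{1}{n}\sum_{i=1}^n \log P(x_i|x_{<i})\right)$. This average can be read as a Monte Carlo estimate of the model's cross-entropy based on a single sample sequence.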
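
The definition can be checked numerically with a short sketch. The function name `perplexity` and the probability lists are assumptions made for illustration. The computation runs in log space, which matches the geometric-mean formula while avoiding underflow.

```python
import math

def perplexity(probs):
    """Perplexity from the model's probability for each observed token."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)  # per-token cross-entropy
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, invented for illustration.
print(perplexity([1.0, 1.0, 1.0]))       # a model that is always certain and correct
print(perplexity([0.25] * 8))            # a uniform guess over 4 tokens at every step
print(perplexity([0.5, 0.1, 0.2, 0.4]))  # a mixed case
```

A perfect model reaches the lower bound of 1, while a model that guesses uniformly over $k$ tokens has perplexity $k$, so the value can be read as an effective number of equally likely choices per step.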

## 9.3.3 Partitioning Sequences

We will design language models using neural networks and use perplexity to evaluate how good the model is at predicting the next token given the current set of tokens in text sequences. Before introducing the model, let's assume that it processes a minibatch of sequences with predefined length at a time. Now the question is how to read minibatches of input sequences and target sequences at random.

Suppose that the dataset takes the form of a sequence of $T$ token indices in `corpus`. We will partition it into subsequences, where each subsequence has $n$ tokens (time steps). To iterate over all the tokens of the entire dataset for each epoch and obtain all possible length-$n$ subsequences, we can introduce randomness. More concretely, at the beginning of each epoch, discard the first $d$ tokens, where $d \in [0, n)$ is uniformly sampled at random.

The rest of the sequence is then partitioned into $m = \lfloor (T-d)/n \rfloor$ subsequences. Denote by $\mathbf{x}_t = [x_t, ..., x_{t+n-1}]$ the length-$n$ subsequence starting from token $x_t$ at time step $t$. The resulting $m$ partitioned subsequences are $\mathbf{x}_{d}, \mathbf{x}_{d+n}, ..., \mathbf{x}_{d+n(m-1)}$. Each subsequence will be used as an input sequence into the language model.

For language modeling, the goal is to predict the next token based on the tokens we have seen so far; hence the targets are the original sequence, shifted by one token. The target sequence for any input sequence $\mathbf{x}_t$ is $\mathbf{x}_{t+1}$ with length $n$.
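
The random-offset partitioning described above can be sketched in plain Python. The function name `partition` and the integer stand-in corpus are assumptions made for illustration. The sketch also reserves one extra token at the end so that the last input still has a full shifted target, which means it can yield one subsequence fewer than $\lfloor (T-d)/n \rfloor$.

```python
import random

def partition(corpus, num_steps, rng=random):
    """Split corpus into non-overlapping (input, target) pairs after a random offset."""
    d = rng.randrange(num_steps)              # d sampled uniformly from [0, n)
    # Reserve one extra token so the last input still has a shifted target.
    m = (len(corpus) - d - 1) // num_steps
    starts = [d + i * num_steps for i in range(m)]
    rng.shuffle(starts)                       # visit subsequences in random order
    return [(corpus[s:s + num_steps], corpus[s + 1:s + num_steps + 1])
            for s in starts]

corpus = list(range(35))  # stand-in for a sequence of token indices
for X, Y in partition(corpus, num_steps=5):
    print("X:", X, "Y:", Y)
```

Because the offset is redrawn on every call, different epochs see different subsequence boundaries, even though each single pass uses non-overlapping inputs.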

In [3]:
@d2l.add_to_class(d2l.TimeMachine) #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    # Every window of num_steps + 1 consecutive tokens, one per start index
    array = torch.tensor([corpus[i:i+num_steps+1]
                          for i in range(len(corpus)-num_steps)])
    # Inputs drop the last token; targets are the inputs shifted by one
    self.X, self.Y = array[:,:-1], array[:,1:]

@d2l.add_to_class(d2l.TimeMachine) #@save
def get_dataloader(self, train):
    # The first num_train windows are for training, the next num_val for validation
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)

data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break

X: tensor([[24,  2, 20,  0,  7, 13, 22, 20,  9,  6],
        [ 8, 19,  6,  6, 12,  0,  7, 19, 16, 14]]) 
Y: tensor([[ 2, 20,  0,  7, 13, 22, 20,  9,  6,  5],
        [19,  6,  6, 12,  0,  7, 19, 16, 14,  0]])
