# **Language Models**

In this Jupyter Notebook, we will explore techniques for the development and implementation of a trigram language model, such as model training, evaluation, and testing to ensure accuracy and performance. This is a hands-on approach to understanding how basic statistical language models operate and how they can be applied to predict the probability of sequences of words in a language.

In [1]:
%pip install -q pytest ipytest

Note: you may need to restart the kernel to use updated packages.


In [2]:
%precision 4

import numpy as np
import gzip
from cytoolz import concat, sliding_window
from collections import Counter

In [3]:
import pytest

try:
    get_ipython()

    import ipytest

    ipytest.autoconfig()

    def init_test():
        ipytest.clean()

    def run_test():
        ipytest.run()

except NameError:

    def init_test():
        pass

    def run_test():
        pass

***

## **1. Trigram Language Model**

We load some data and use it to train a simple trigram language model.

In [4]:
def read_corpus(filename):
    return [line.lower().split() for line in gzip.open(filename)]


sentences = read_corpus("../data/bnc_train.txt.gz")
sentences_train = sentences[:175000]
sentences_test = sentences[175000:]

In [5]:
class TrigramLM:
    def __init__(self, alpha):
        self.alpha = alpha

    def preprocess(self, sentence):
        """Normalize sentence and add filler tokens <s> and </s>"""
        return ["<s>", "<s>"] + [w.lower() for w in sentence] + ["</s>","</s>"]

    def get_unigram_counts(self, train_corpus):
        self.unigrams = Counter(concat(train_corpus))

    def get_bigram_counts(self, train_corpus):
        self.bigrams = Counter(sliding_window(2, concat(train_corpus)))

    def get_trigram_counts(self, train_corpus):
        self.trigrams = Counter(sliding_window(3, concat(train_corpus)))

    def train(self, train_corpus):
        """Count bigram and unigram frequencies in the training corpus."""
        train_corpus = [self.preprocess(sentence) for sentence in train_corpus]
        self.get_unigram_counts(train_corpus)
        self.get_bigram_counts(train_corpus)
        self.get_trigram_counts(train_corpus)
        self.V = len(self.unigrams)

    def log_prob(self, sentence):
        """Calculate the log_2 probability of a sentence given the model."""
        p = 0.0
        try:
            for (w1, w2, w3) in sliding_window(3, self.preprocess(sentence)):
                p = (
                        p
                        + np.log2(self.trigrams[w1, w2, w3] + self.alpha)
                        - np.log2(self.bigrams[w1, w2] + self.alpha * self.V)
                )
            return p
        except ZeroDivisionError:
            return 0.0

In [6]:
lm = TrigramLM(alpha=0.1)
lm.train(sentences_train)
lm.log_prob('This is a test'.split())

-104.2725

***

## **2. Perplexity**

We will define a function that calculates the perplexity of a model on a corpus (= a list of sentences).

In [7]:
def perplexity(model, sentences):
    total_log_prob = 0.0
    N = 0  # Total number of words
    for sentence in sentences:
        total_log_prob += model.log_prob(sentence)
        N += len(model.preprocess(sentence)) - 3
    # Calculate average log probability per word and convert to perplexity
    return 2 ** (-(total_log_prob / N))

In [8]:
lm = TrigramLM(alpha=0.1)
lm.train(sentences_train)
perplexity(lm, sentences_test[:100])

15418.3348

In [9]:
%%ipytest

@pytest.fixture(scope="module")
def my_trigram_lm():
    lm = TrigramLM(alpha=0.1)
    lm.train(sentences_train)
    return lm


@pytest.mark.parametrize(
    "sentence,logprob", [(sentences_train[0], -211.8753), (sentences_test[0], -86.1634)]
)
def test_trigram_logprob(my_trigram_lm, sentence, logprob):
    assert my_trigram_lm.log_prob(sentence) == pytest.approx(logprob, rel=1e-3)


@pytest.mark.parametrize(
    "sentences,perplex",
    [(sentences_train[:100], 3773.9770), (sentences_test[:100], 15418.3348)],
)
def test_trigram_perplexity(my_trigram_lm, sentences, perplex):
    assert perplexity(my_trigram_lm, sentences) == pytest.approx(perplex, rel=1e-3)

[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                                                         [100%][0m
[32m[32m[1m4 passed[0m[32m in 7.08s[0m[0m


***

## **3. Smoothing**

In the definition for `TrigramLM`, `alpha` is the smoothing parameter. In order to find out what the best value to use is, we will try building models with different values for `alpha` and then compute their perplexity on both `sentences_train[:500]` and `sentences_test[:500]`. For `alpha` values, we will try different powers of 10 (e.g., `[1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`).

In [10]:
# Define range of alpha values for testing
alpha_values = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
train_perplexities = []
test_perplexities = []

# Iterate over alpha values to calculate perplexities
for alpha in alpha_values:
    lm = TrigramLM(alpha=alpha)
    lm.train(sentences_train[:500])

    # Calculate perplexity on training subset
    train_perplexity = perplexity(lm, sentences_train[:500])
    train_perplexities.append(train_perplexity)

    # Calculate perplexity on testing subset
    test_perplexity = perplexity(lm, sentences_test[:500])
    test_perplexities.append(test_perplexity)

# Output results
for i, alpha in enumerate(alpha_values):
    print(f"Alpha: {alpha:.5f}, Train Perplexity: {train_perplexities[i]:.4f}, Test Perplexity: {test_perplexities[i]:.4f}")

Alpha: 0.00001, Train Perplexity: 1.8123, Test Perplexity: 6568.7423
Alpha: 0.00010, Train Perplexity: 2.2011, Test Perplexity: 4095.1210
Alpha: 0.00100, Train Perplexity: 5.7043, Test Perplexity: 2918.1654
Alpha: 0.01000, Train Perplexity: 35.6928, Test Perplexity: 2620.1673
Alpha: 0.10000, Train Perplexity: 294.2772, Test Perplexity: 2798.6211


The results we obtained from Problem #2 provide us with several patterns. These patterns tells us how the smoothing parameter `alpha` affects the perplexity on both the training and testing datasets for the Trigram Language Model.

1. **Low `alpha` values** `(1e-5)` result in low training perplexity but high test perplexity. This pattern tells us that when we are dealing with very small `alpha` values, the model is most likely overfitting in regards to the training data. The model performs well on the training data because it closely captures the specific trigram frequencies within that dataset. However, it generalizes poorly in regards to the testing data, which leads to high perplexity scores when encountering new or unseen trigrams.

2. **Medium `alpha` values** `(1e-4, 1e-3, 1e-2)` result in the test perplexity initially decreasing, reaching a minimum around `alpha=0.01`. This pattern tells us that when we introduce more smoothing to the model, it helps us to mitigate overfitting by allocating a small probability to any unseen trigrams. Thus, improving the model's overall ability to generalize in regards to new data. It is at this point that the optimal balance between underfitting and overfitting occurs, where the model still retains the ability to distinguish between common and rare trigrams effectively, but is robust enough to handle unseen data.

3. **High `alpha` values** `(1e-1)` result in an increase of both training and testing perplexities. This pattern tells us that too much smoothing causes the language model to underfit the data. When the value of `alpha` becomes too large, the model begins to treat all trigrams the same, this results in the loss of its ability to differentiate trigrams based on their actual frequencies in the training data. Such over-smoothing considerably weakens the predictive power of the model, leading to overall poorer performance on both the training and testing datasets.

Based on these observations, the best value of `alpha` seems to be `alpha=0.001`. This value not only minimizes the testing perplexity, but also indicates an optimal balance between giving enough probability to unseen trigrams and maintaining the model's ability to differentiate between trigrams based on their respective frequencies, thus avoiding zero probabilities. This increase in perplexity for values larger than `0.001` shows us the principle of diminishing returns, where additional smoothing begins to harm the model's performance instead of helping it.

***

## **4. Random Sampling**

Now, we write a function that generates a random sentence by sampling from a trigram language model.

Here's the basic approach we will take: Every sentence will start with the start symbols `<s> <s>`. The language model gives us the conditional probability of each possible word given that context:

$$P(w_1|\texttt{<s> <s>})=\frac{C(\texttt{<s> <s> } w_1)+\alpha}{C(\texttt{<s> <s>})+\alpha V}$$

We pick word $w_1$ at random by drawing from this distribution. Let's say the word we pick is `disgruntled`. Now, the probability of any word being the next word in the sentence is:

$$P(w_2|\texttt{<s> disgruntled})=\frac{C(\texttt{<s> disgruntled } w_2)+\alpha}{C(\texttt{<s> disgruntled})+\alpha V}$$

We pick word $w_2$ at random by drawing from this new distribution and keep going like this until we've picked 50 words or the next word is `</s>` (and the sentence is finished), whichever comes first.

To do the random picking, we will use the function `multinomial` defined below. It takes a dictionary mapping words to probabilities and chooses one word at random.

In [11]:
rng = np.random.default_rng()

def multinomial(probs):
    X, p = zip(*probs.items())
    return X[rng.multinomial(1, p).argmax()]

P = {'a': 0.25, 'b': 0.25, 'c': 0.5}
Counter(multinomial(P) for _ in range(500))

Counter({'c': 249, 'a': 135, 'b': 116})

In [12]:
def generate(lm):
    sentence = ['<s>', '<s>']  # Start with initial context
    while len(sentence) < 52:  # Allow up to 50 words plus 2 starting tokens
        context = (sentence[-2], sentence[-1])
        possible_words = {}
        
        # Calculate conditional probability distribution for next word
        for w3 in lm.unigrams:  # Consider all possible next words
            trigram_C = lm.trigrams.get((context[0], context[1], w3), 0)
            bigram_C = lm.bigrams.get((context[0], context[1]), 0)
            possible_words[w3] = (trigram_C + lm.alpha) / (bigram_C + (lm.alpha * lm.V))
        
        # Sample next word
        next_word = multinomial(possible_words)
        if next_word == '</s>':  # Stop if end of sentence is reached
            break
        if isinstance(next_word, bytes):
            next_word = next_word.decode('utf-8')
        sentence.append(next_word)
    
    # Return generated sentence, excluding initial starting tokens
    return ' '.join(sentence[2:])

For us to observe the effect of alpha on the generated sentences, we can create language models with different alpha values and train them on the same dataset, allowing us to generate sentences from each model. The parameter of alpha affects the smoothness of the distribution as follows:

1. **Lower `alpha` values** mean less smoothing because they make the model more sensitive, and over-reliant, to the observed frequences in the training data. With a small alpha, generated sentences might be gramatically coherent and structured realistically, but they are less diverse in their vocabulary because the model prefers trigrams that are more common. As a result, generated sentences can be repetitive since the model has a hard time dealing with new or unseen data.

2. **Medium `alpha` values** introduce moderate levels of smoothing. By providing a more balanced alpha value, we can help the model to generalize better between new or unseen data without deviating too far from realistic sentence structures. As a result, this leads to generated sentences that are both diverse in vocabulary and grammatically coherent. Having a balance like this allows the model to capture a wider range of language patterns found in the training data.

3. **Higher `alpha` values** introduce larger levels of smoothing. This causes the model to start treating all trigrams more equally, resulting in rare words being more likely to appear. High values reduce the impact of the training data frequencies on the probability distributions, leading to generated sentences becoming more diverse in their vocabulary, but also less gramatically coherent and realistic since both rare and common trigrams are given similar probabilities. This happens because too much smoothing dilutes the linguistic patterns learned from the training data.

By comparing sentences generated with different alpha values, we can see how smoothing impacts the balance between vocabulary diversity and grammatical coherence in the output. It is important to experiment with different alpha values to get a sense of how smoothing influences the performance of the language model, helping us to identify the optimal `alpha` in order to balance vocabulary diversity and grammatical coherence.

### Low Alpha Value: [1e-5]

In [13]:
# Test alpha value
lm = TrigramLM(alpha=0.00001)
lm.train(sentences_train[:500])

# Generate sentence with language model
generated_sentence = generate(lm)

# Print generated sentence
print(generated_sentence)

‘ gain friendly gordon feeling wanted fight legion antrim penrith e.g. between colon customers seats burn then odd wild co-operation difference businesses has important include joining calls redirecting concerning constant additional star necessary somewhat much sediments eating motivated day drinking machines wales girl military al-islamiyya mecca covalent ? parts contacts


### Mid Alpha Value: [1e-4]

In [14]:
# Test alpha value
lm = TrigramLM(alpha=0.0001)
lm.train(sentences_train[:500])

# Generate sentence with language model
generated_sentence = generate(lm)

# Print generated sentence
print(generated_sentence)

across 1841–1931 recent demonstration vision co-operation ; wayne concerning john cancel rock somewhat convictions square-well chambers solicitor export c.b.n.s. introduce febru role she belonged levels aback funny fore humble-hearted interest bupacare exhibition 1724 low close 's dave park sofa appearances target bring salmonella citations windows comes etching benefits faith scheme


### Mid Alpha Value: [1e-3]

In [15]:
# Test alpha value
lm = TrigramLM(alpha=0.001)
lm.train(sentences_train[:500])

# Generate sentence with language model
generated_sentence = generate(lm)

# Print generated sentence
print(generated_sentence)

throughout content too objective nosocomial goods resisted ideological yoof earnach definitional sugar education federation independent see febru thomas fritillaries docile duo up sitchensis real percent away braintree flirt adrenalin accompanied i'd ball chilled uses sitka particular widow reid surveyed what lead hold bristol integral divorce occasions moreover mirdita ten rude


### Mid Alpha Value: [1e-2]

In [16]:
# Test alpha value
lm = TrigramLM(alpha=0.01)
lm.train(sentences_train[:500])

# Generate sentence with language model
generated_sentence = generate(lm)

# Print generated sentence
print(generated_sentence)

in want property fire people equity success sharp giles historian lines shot " content event providing joy indoor enhance begin having thus vale area-wide send gordon themselves cambridge spinning fergie soaked growth source blaming difficult fred quartic yet mm felt ski showcase heard chambers leavers volpi 8.5 salad daniels produced


### High Alpha Value: [1e-1]

In [17]:
# Test alpha value
lm = TrigramLM(alpha=0.1)
lm.train(sentences_train[:500])

# Generate sentence with language model
generated_sentence = generate(lm)

# Print generated sentence
print(generated_sentence)

the ibm reward your after appeal equipment window-boxes direct g.p. history front looted strayed brushing conceived pride lock joy flat bending assumed girls loyalties peaceful bruckner b sought 10 poles vital peter setting father case galleries recent year experiments tremayne skills nope negotiations harold born anna ulcers eldest wide consider
