[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/7.lm/ExploreLM.ipynb)

# Language modeling

In this notebook, we construct an n-gram language model and use it to generate sequences.

In [1]:
import copy
from collections import Counter

import nltk
import numpy as np
from nltk import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [2]:
def read_file(filename):
    sequences = []
    with open(filename) as file:
        data = file.read()
        sents = sent_tokenize(data)
        for sent in sents:
            tokens = word_tokenize(sent)
            sequences.append(tokens)
    return sequences

In [3]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/1342_pride_and_prejudice.txt

--2025-10-02 23:45:34--  https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/1342_pride_and_prejudice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 691804 (676K) [text/plain]
Saving to: ‘1342_pride_and_prejudice.txt’


2025-10-02 23:45:34 (15.7 MB/s) - ‘1342_pride_and_prejudice.txt’ saved [691804/691804]



In [4]:
# Read the data from the file and tokenize it into sequences of tokens (one sequence per sentence).

# Pride and Prejudice (Jane Austen)
sequences = read_file("1342_pride_and_prejudice.txt")

max_sequences = 10000

In [5]:
class NgramModel():

    def __init__(self, sequences, order):

        # For this exercise we're going to encode the LM as a sparse dictionary (trading less storage for more compute)
        # We'll store the LM as a dictionary with the conditioning context as keys; each value is a
        # Counter object that keeps track of the number of times we see a word following that context.
        self.counts = {}

        # Markov order (order 1 = conditioning on previous 1 word; order 2 = previous 2 words, etc.)
        self.order = order

        vocab = {"[END]": 0}

        for s_idx, tokens in enumerate(sequences):
            # We'll add [START] and [END] tokens to encode the beginning/end of sentences
            tokens = ["[START]"] * order + tokens + ["[END]"]

            if s_idx == 0:
                print(tokens)

            for i in range(order, len(tokens)):
                context = " ".join(tokens[i - order:i])
                word = tokens[i]

                if word not in vocab:
                    vocab[word] = len(vocab)

                # For just the first sentence, print the conditioning context + word
                if s_idx == 0:
                    print("Context: %s Next: %s" % (context.ljust(50), word))

                if context not in self.counts:
                    self.counts[context] = Counter()
                self.counts[context][word] += 1



    def sample(self, context):
        total = sum(self.counts[context].values())

        dist = []
        vocab = []

        # Create a probability distribution for this conditioning context, over the words we've observed following it.
        for word in self.counts[context]:
            prob = self.counts[context][word]/total
            dist.append(prob)
            vocab.append(word)

        index = np.argmax(np.random.multinomial(1, pvals=dist))
        return vocab[index]

    def generate_sequence(self, keep_ends=True):
        generated = ["[START]"] * (self.order)
        word = None
        while word != "[END]":
            # With order 0 there is no conditioning context, so the key is the empty string.
            context = " ".join(generated[-self.order:]) if self.order > 0 else ""
            word = self.sample(context)
            generated.append(word)
        if not keep_ends:
            generated = generated[self.order:-1]
        return " ".join(generated)
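
The sampling step in `sample` can be illustrated in isolation. The sketch below is not the notebook's implementation: it swaps `np.random.multinomial` for the standard library's `random.choices`, and the helper name `sample_next` and the toy counts are made up for illustration. It draws a next word in proportion to how often each word followed a context.

```python
import random
from collections import Counter


def sample_next(followers, rng):
    """Draw one word with probability proportional to its count.

    `followers` plays the role of self.counts[context] in the notebook;
    `sample_next` itself is a hypothetical helper, not part of NgramModel.
    """
    words = list(followers)
    weights = [followers[w] for w in words]
    # random.choices normalizes the weights, so raw counts can be passed directly.
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
followers = Counter({"was": 6, "and": 3, "are": 1})  # invented counts
draws = Counter(sample_next(followers, rng) for _ in range(1000))
print(draws.most_common())
```

Over many draws the empirical frequencies should approach 0.6, 0.3, and 0.1, which is the relative-frequency distribution the model samples from.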



In [6]:
ngram1 = NgramModel(sequences[:max_sequences], order=1)

['[START]', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', '[END]']
Context: [START]                                            Next: Chapter
Context: Chapter                                            Next: 1
Context: 1                                                  Next: It
Context: It                                                 Next: is
Context: is                                                 Next: a
Context: a                                                  Next: truth
Context: truth                                              Next: universally
Context: universally                                        Next: acknowledged
Context: acknowledged                                       Next: ,
Context: ,                                                  Next: that
Context: that                                   

In [7]:
print(ngram1.generate_sequence())

[START] why she practised more to admire , and form the young man . [END]


In [8]:
ngram0 = NgramModel(sequences[:max_sequences], order=0)

['Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', '[END]']
Context:                                                    Next: Chapter
Context:                                                    Next: 1
Context:                                                    Next: It
Context:                                                    Next: is
Context:                                                    Next: a
Context:                                                    Next: truth
Context:                                                    Next: universally
Context:                                                    Next: acknowledged
Context:                                                    Next: ,
Context:                                                    Next: that
Context:                                                   

In [9]:
ngram0.generate_sequence()

'read incumbent and Mrs. had He [END]'

In [10]:
ngram2 = NgramModel(sequences[:max_sequences], order=2)

['[START]', '[START]', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', '[END]']
Context: [START] [START]                                    Next: Chapter
Context: [START] Chapter                                    Next: 1
Context: Chapter 1                                          Next: It
Context: 1 It                                               Next: is
Context: It is                                              Next: a
Context: is a                                               Next: truth
Context: a truth                                            Next: universally
Context: truth universally                                  Next: acknowledged
Context: universally acknowledged                           Next: ,
Context: acknowledged ,                                     Next: that
Context: , that                      

In [11]:
ngram2.generate_sequence(keep_ends=False)

'We _will_ know where I can assure you in such a proof of its animation , and she read it , as if engaged in the North , just when she rose as she chooses. ” “ Their conduct has been the work of a common and transient liking , which Elizabeth wondered Lady Catherine .'

**Q1**. Explore sampling sequences from LMs of different orders above; what do you notice about the structure of the generated texts (and how they differ by order)? Explore LMs trained on different datasets as well.

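The Markov order is the size of the context window: the number of preceding tokens the LM conditions on when choosing the next word. In the samples above, the order-0 model (`ngram0`) is the least coherent, since each word is drawn independently of its neighbors, while the order-2 model (`ngram2`) reads the most like English because each word fits its two-word context. The order-0 sample is also the shortest and the order-2 sample the longest here, though length varies from sample to sample.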
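
To make the idea of a context window concrete, here is a minimal sketch that mirrors the padding and slicing logic of `NgramModel.__init__` on a toy sentence. The function name `context_windows` is an assumption introduced for illustration; it is not part of the notebook.

```python
def context_windows(tokens, order):
    """Yield (context, next_word) pairs for a given Markov order.

    Follows the notebook's convention: `order` [START] tokens are
    prepended and a single [END] token is appended.
    """
    padded = ["[START]"] * order + list(tokens) + ["[END]"]
    for i in range(order, len(padded)):
        yield " ".join(padded[i - order:i]), padded[i]


toy = ["It", "is", "a", "truth"]
for order in (0, 1, 2):
    # Show the first three (context, next_word) pairs for each order.
    print(order, list(context_windows(toy, order))[:3])
```

With order 0 every context is the empty string, so all words share one distribution; each increase in order makes the contexts longer and more specific.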

**Q2.** In a second-order LM estimated from `1342_pride_and_prejudice.txt` above, what's $P(\textrm{are} | \textrm{Lady Lucas})$?

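It should be nonzero rather than 0. The model does not judge whether 'are' is a plausible continuation; it uses the relative-frequency estimate $P(\textrm{are} | \textrm{Lady Lucas}) = \textrm{count}(\textrm{Lady Lucas are}) / \textrm{count}(\textrm{Lady Lucas})$, which is 0 only if 'are' never follows 'Lady Lucas' in the training text. Chapter 1 contains "Sir William and Lady Lucas are determined to go", so the numerator is at least 1. The exact value can be read from the model as `ngram2.counts["Lady Lucas"]["are"] / sum(ngram2.counts["Lady Lucas"].values())`.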
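
One way to check an answer like this is to read the estimate directly off the counts. The sketch below rebuilds the same context-to-`Counter` structure on an invented three-sentence corpus; `build_counts`, `cond_prob`, and the toy sentences are assumptions for illustration and do not reflect the real counts in the novel.

```python
from collections import Counter


def build_counts(sequences, order):
    """Map each context string to a Counter of the words that follow it."""
    counts = {}
    for tokens in sequences:
        padded = ["[START]"] * order + list(tokens) + ["[END]"]
        for i in range(order, len(padded)):
            context = " ".join(padded[i - order:i])
            counts.setdefault(context, Counter())[padded[i]] += 1
    return counts


def cond_prob(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 for unseen contexts."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


# Invented sentences, used only to exercise the arithmetic.
toy_corpus = [
    ["Lady", "Lucas", "was", "kind"],
    ["Lady", "Lucas", "are", "coming"],
    ["Lady", "Lucas", "was", "pleased"],
]
toy_counts = build_counts(toy_corpus, order=2)
print(cond_prob(toy_counts, "Lady Lucas", "are"))
```

In this toy corpus 'are' follows 'Lady Lucas' once out of three occurrences, so the estimate is 1/3. Applied to the notebook's model, the same ratio would use `ngram2.counts` in place of `toy_counts`.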

**Q3.** Keep increasing the order of LMs (well past 3); compare the text that's generated to the original dataset (in the files above); are the LMs simply memorizing the source material?
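
One rough way to probe memorization is to measure how many long n-grams of a generated sequence also occur verbatim in the source. The sketch below is only one possible check; the helper names `ngram_set` and `seen_fraction` and the toy data are assumptions for illustration. An order-k model can only emit (k+1)-grams it has seen, so the informative comparison uses n noticeably larger than k+1.

```python
def ngram_set(token_lists, n):
    """Collect every n-gram (as a tuple) that occurs within any token list."""
    grams = set()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def seen_fraction(generated_tokens, source_sequences, n):
    """Fraction of the generated n-grams that also occur in the source.

    Returns None when the generated sequence is shorter than n tokens.
    """
    source_grams = ngram_set(source_sequences, n)
    generated = [
        tuple(generated_tokens[i:i + n])
        for i in range(len(generated_tokens) - n + 1)
    ]
    if not generated:
        return None
    return sum(g in source_grams for g in generated) / len(generated)


# Toy data: the generated sequence stitches together two source sentences.
source = [["a", "b", "c", "d"], ["b", "c", "e"]]
generated = ["a", "b", "c", "e"]
for n in (2, 3, 4):
    print(n, seen_fraction(generated, source, n))
```

In the toy case every 2-gram and 3-gram is found in the source, but the full 4-gram is not, which is the pattern of a model recombining memorized fragments. With the notebook's objects, a call along the lines of `seen_fraction(model.generate_sequence(keep_ends=False).split(), sequences, 10)` for some `NgramModel` instance `model` would give the analogous number; values near 1.0 for large n suggest the model is mostly copying.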

In [14]:
ngram5 = NgramModel(sequences[:max_sequences], order=5)
ngram5.generate_sequence()

['[START]', '[START]', '[START]', '[START]', '[START]', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', '[END]']
Context: [START] [START] [START] [START] [START]            Next: Chapter
Context: [START] [START] [START] [START] Chapter            Next: 1
Context: [START] [START] [START] Chapter 1                  Next: It
Context: [START] [START] Chapter 1 It                       Next: is
Context: [START] Chapter 1 It is                            Next: a
Context: Chapter 1 It is a                                  Next: truth
Context: 1 It is a truth                                    Next: universally
Context: It is a truth universally                          Next: acknowledged
Context: is a truth universally acknowledged                Next: ,
Context: a truth universally acknowledged ,                 Next: that
Cont

"[START] [START] [START] [START] [START] I assure you it is much larger than Sir William Lucas's. ” “ This must be a most inconvenient sitting room for the evening , in summer ; the windows are full west. ” Mrs. Bennet assured her that they never sat there after dinner , and then added : “ May I take the liberty of asking your ladyship whether you left Mr. and Mrs. Collins well. ” “ Yes , very indifferent indeed , ” said Elizabeth , “ and how much is settled on his side on our sister , we shall exactly know what Mr. Gardiner has done for them , because Wickham has not sixpence of his own . [END]"