# From Statistical Models to LLMs
## Learning Goals
- Understand how statistical n-gram models work.
- Compare their limitations with modern LLMs.
- See how context window size impacts predictions.
- Experience emergent behaviors in transformer-based models.

This notebook connects with Section *1.2 From Statistical Language Models to Neural-based LLMs* of the lecture notes.

## Step 1: Load a sample text corpus
We’ll use the `trade` category of the **Reuters corpus** from NLTK, which contains short news articles.  
This will serve as training data for our n-gram models.

In [1]:
import nltk
from nltk.util import ngrams
from collections import Counter, defaultdict
import random

nltk.download('reuters')
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.corpus import reuters
sentences = reuters.sents(categories='trade')
tokens = [t.lower() for sent in sentences for t in sent]
print('Number of tokens:', len(tokens))
print('Sample tokens:', tokens[:30])

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/ebezerra/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ebezerra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ebezerra/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Number of tokens: 142731
Sample tokens: ['asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan', 'rift', 'mounting', 'trade', 'friction', 'between', 'the', 'u', '.', 's', '.', 'and', 'japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'asia', "'"]


## Step 2: Build a simple n-gram model
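- A **bigram** model (n = 2) predicts the next token from the previous token only.
- A **trigram** model (n = 3) predicts the next token from the previous 2 tokens.
- In general, an n-gram model conditions on the previous n-1 tokens and ignores everything before them.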

We’ll count frequencies and use them to predict the next word.
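
To make the counting concrete, here is a minimal sketch on a made-up toy corpus. The names `toy_tokens`, `toy_counts`, and `next_token_probs` are illustrative assumptions, not part of the notebook. It estimates the probability of each next token as its relative frequency after a given context.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only)
toy_tokens = "the cat sat on the mat . the cat ate .".split()

# Count how often each token follows each one-token context
toy_counts = defaultdict(Counter)
for prev, nxt in zip(toy_tokens, toy_tokens[1:]):
    toy_counts[(prev,)][nxt] += 1

def next_token_probs(counts, context):
    """Relative-frequency estimate of P(next token | context)."""
    followers = counts[context]
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

toy_probs = next_token_probs(toy_counts, ("the",))
print(toy_probs)
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so the estimates are 2/3 and 1/3. A greedy predictor would always choose "cat".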

In [2]:
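def build_ngram_model(tokens, n=2):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    # +1 so that the final n-gram in the corpus is counted as well
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i+n-1])
        next_word = tokens[i+n-1]
        model[context][next_word] += 1
    return model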

bigram_model = build_ngram_model(tokens, n=2)
trigram_model = build_ngram_model(tokens, n=3)

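def generate_text(model, n=2, length=20, seed=None):
    """Greedy generation: always append the most frequent continuation."""
    # seed must be a tuple of n-1 tokens; pick a random known context if none is given
    if not seed:
        seed = random.choice(list(model.keys()))
    output = list(seed)
    for _ in range(length):
        context = tuple(output[-(n-1):])
        if context not in model:
            break  # unseen context: the model has nothing to predict
        # Deterministic choice, so the output can fall into repeating loops
        next_word = model[context].most_common(1)[0][0]
        output.append(next_word)
    return ' '.join(output)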

print('Bigram sample:', generate_text(bigram_model, n=2))
print('Trigram sample:', generate_text(trigram_model, n=3))

Bigram sample: printers , the u . s . s . s . s . s . s . s . s .
Trigram sample: switch from export to domestic - led growth , and the united states , the central bank said . " the u


### Reflection
- Does the output feel natural or broken?
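- Notice how **short context** (bigram/trigram) limits coherence: the bigram sample falls into an `s . s .` loop because greedy decoding always picks the most frequent continuation of a one-token context.
- Try changing the `length` parameter and observe where the output starts to loop or drift off topic.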
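
One way to see how much of the repetition comes from greedy decoding rather than from the model itself is to sample the next token in proportion to its count. The sketch below is an assumption-laden variant, not the notebook's method: `sample_text`, `demo_tokens`, and `demo_model` are hypothetical names, and the tiny stand-in model only keeps the block self-contained.

```python
import random
from collections import Counter, defaultdict

def sample_text(model, n=2, length=20, seed=None, rng=None):
    """Like greedy generation, but draw the next token in proportion to its count."""
    rng = rng or random.Random()
    output = list(seed) if seed else list(rng.choice(list(model.keys())))
    for _ in range(length):
        context = tuple(output[-(n - 1):])
        if context not in model:
            break  # unseen context: stop generating
        words, counts = zip(*model[context].items())
        output.append(rng.choices(words, weights=counts, k=1)[0])
    return ' '.join(output)

# Tiny stand-in model (hypothetical); in the notebook, pass bigram_model instead
demo_tokens = "a b a c a b a d a b".split()
demo_model = defaultdict(Counter)
for prev, nxt in zip(demo_tokens, demo_tokens[1:]):
    demo_model[(prev,)][nxt] += 1

sampled = sample_text(demo_model, n=2, length=10, seed=("a",), rng=random.Random(0))
print(sampled)
```

With the notebook's models, a call such as `sample_text(bigram_model, n=2)` should produce more varied text than the greedy version, but it is still limited by the one-token context.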

## Step 3: Use a Transformer-based LLM (GPT-2)
Now let’s compare with a pretrained Hugging Face model.  
Unlike our n-gram models, which see only the previous one or two tokens, GPT-2 has a context window of up to **1024 tokens** and was pretrained on WebText, roughly 40 GB of web text.

In [3]:
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2', device=-1)  # device=-1 = CPU

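prompt = 'The future of artificial intelligence in databases'
# Note: in the run shown below, the pipeline's default max_new_tokens=256 took
# precedence over max_length=50 (see the warning), so the output exceeds 50 tokens.
# To control the length explicitly, pass max_new_tokens instead of max_length.
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output[0]['generated_text'])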

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of artificial intelligence in databases are now an open question, and it is possible that the possibilities for new technologies will grow and may even be explored. However, the fact that we can predict how new technologies will perform and how well they will work in the future is something that is of high importance to the AI community as they need to prepare for the future.

The present state of AI is very different from that of the past. As many of you know, AI has been around for more than a century. It is a new form of intelligence that's new and exciting to the world's population and that is what is driving the development of the AI community. The future of AI is also of great concern, and in that sense this is the most important question we are facing.

It is not easy to predict when the next big change will be in the world of AI. It is important to remember that AI is a process and that every decision that can be made is governed by a set of rules, but if you look at

In [4]:
import numpy as np
print(np.__version__)


1.26.4


In [5]:
from rich.console import Console
from rich.markdown import Markdown

console = Console()
text = output[0]['generated_text']

console.rule("[bold blue] Generated Text [/bold blue]")
console.print(Markdown(text))


### Reflection
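- The GPT-2 output is far more fluent than the n-gram samples, even though `gpt2` is the smallest GPT-2 checkpoint (about 124M parameters). Fluency is not accuracy, though: the sample claims that AI "has been around for more than a century".
- Unlike n-grams, it can capture longer dependencies: the text stays on the topic of the prompt across several paragraphs.
- This is a small-scale illustration of the **scaling → emergent capabilities** phenomenon: with more parameters, more data, and a longer context, the model shows abilities that counting n-grams cannot provide.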
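
The difference in context can be written down explicitly. The notation below is an assumption introduced here for illustration ($w_t$ for the token at position $t$, $n$ for the n-gram order, $L$ for the context length). Both kinds of model factor the probability of a sequence token by token:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$

An n-gram model approximates each factor using only the previous $n-1$ tokens, while GPT-2 conditions on up to $L = 1024$ previous tokens:

$$
P_{\text{n-gram}}(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1}),
\qquad
P_{\text{GPT-2}}(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-L}, \dots, w_{t-1})
$$

The two token definitions differ: the n-gram models here use NLTK word tokens, whereas GPT-2 uses byte-pair-encoded subword tokens, so 1024 GPT-2 tokens typically cover fewer than 1024 words.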

## Exercises
1. Train a 4-gram model and compare with the bigram/trigram. Does it improve coherence?
2. Replace `gpt2` with a different-sized Hugging Face model (e.g., the smaller `distilgpt2` or the larger `gpt2-medium`) and compare.
3. Change the `prompt` and observe how the model continues your text.
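
When comparing n-gram orders in exercise 1, it can help to check how sparse the counts become as n grows. The sketch below is one possible diagnostic, not part of the original notebook: `context_stats` and the toy corpus are hypothetical, and the function reports how many distinct contexts exist and what share of them were seen with only one continuation.

```python
from collections import Counter, defaultdict

def context_stats(tokens, n):
    """Return (number of distinct contexts, share with a single observed continuation)."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        model[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
    if not model:
        return 0, 0.0  # corpus shorter than n: nothing to count
    single = sum(1 for followers in model.values() if len(followers) == 1)
    return len(model), single / len(model)

# Toy stand-in corpus (hypothetical); in the notebook, pass the Reuters `tokens` list
toy = "the cat sat on the mat and the cat ate the fish".split()
for n in (2, 3, 4):
    n_contexts, share = context_stats(toy, n)
    print(f"n={n}: {n_contexts} contexts, {share:.0%} with only one continuation")
```

On this toy corpus the share climbs from 5 of 7 contexts at n=2 to all 9 contexts at n=4. When most contexts have a single continuation, greedy generation mostly replays the training text, so apparent gains in coherence may be memorization rather than better modeling.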