# The Road to GPT
### Generating text with simple Python code
> This lab was inspired by this [blog post](https://benhoyt.com/writings/markov-chain/)

Large Language Models (LLMs) such as ChatGPT, Claude, and Llama have become
some of the most popular machine learning (ML) models. LLMs use vast amounts
of data to learn the relationships between words and to predict which
word is statistically most likely to occur next in a sequence. This notebook
describes a simpler technique for statistical text generation using Markov
chains and n-grams, two important concepts in ML and Natural Language Processing (NLP).


## The Idea
We want to choose the next word to use in a sequence of text.

### Markov Chains
Markov chains are a useful tool for this task. They
describe a sequence of probabilistic events, where
each event depends only on the current state.

<!-- TODO: https://en.wikipedia.org/wiki/Markov_chain  -->
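Markov chains are not strictly an ML technique, but they draw on linear algebra
(a chain can be written as a matrix of transition probabilities) and
are generally a good thing to know about.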
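
To make the "depends only on the current state" idea concrete, here is a minimal sketch of a word-level Markov chain. The `transitions` table and its probabilities are made up for illustration, and the helper names `next_state` and `walk` are hypothetical, not part of this lab's code.

```python
import random

# Hypothetical transition table: for each current word, the probability
# of each possible next word. The numbers are invented for illustration.
transitions = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 0.7, "ran": 0.3},
    "sat": {"the": 1.0},
    "ran": {"the": 1.0},
}

def next_state(state, rng=random):
    """Sample the next word using only the current word."""
    options = transitions[state]
    return rng.choices(list(options), weights=list(options.values()))[0]

def walk(start, steps, rng=random):
    """Return the list of words visited, beginning with `start`."""
    states = [start]
    for _ in range(steps):
        states.append(next_state(states[-1], rng))
    return states

print(" ".join(walk("the", 8)))
```

Note that `next_state` never looks at anything except the current word; that restriction is what makes this a Markov chain.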

### N-Grams
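Something we haven't talked about is how many words should be used to predict the next one.
If this number is too low, the output is close to random; if it's too high, the model tends to
repeat long stretches of the input text word for word, because most long prefixes appear only once.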

<!-- TODO: https://en.wikipedia.org/wiki/N-gram -->
A sequence of n consecutive words taken from a text is called an n-gram.
N-grams are used a lot in NLP, so they are another good thing to know.
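
The effect of the prefix length can be seen on a toy sentence. The sketch below assumes a hypothetical helper, `build_table`, that maps every `n`-word prefix to the words that followed it; here `n` plays the same role as the `n_grams` parameter in this notebook.

```python
import collections

def build_table(words, n):
    """Map each n-word prefix (a tuple) to the list of words that followed it."""
    table = collections.defaultdict(list)
    for i in range(len(words) - n):
        prefix = tuple(words[i:i + n])
        table[prefix].append(words[i + n])
    return table

# Made-up sample text, just large enough to show repeated prefixes.
words = "the cat sat on the mat and the cat ran".split()

for n in (1, 2):
    print(f"prefix length {n}:")
    for prefix, followers in build_table(words, n).items():
        print("  ", prefix, "->", followers)
```

With a one-word prefix, "the" has three recorded followers to choose from; with a two-word prefix most prefixes have only one, which is why longer prefixes drift toward copying the input.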


### Temperature
Another question is about how deterministic our model should be.
That is: given some starting state, should we always generate the same thing?

This concept is known as temperature. At a temperature of 0, the
model will always choose the most likely option, and as the temperature
increases the probabilities of the options become more evenly distributed.

This is sometimes described as representing how "creative" a model
will be.

<!-- TODO: This equation isn't great: I -->
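$$P_i = \frac{e^{y_i / T}}{\sum_{k=1}^{n} e^{y_k / T}}$$

Here $y_i$ is the model's score for option $i$, $T$ is the temperature, $n$ is the
number of options, and $P_i$ is the resulting probability of choosing option $i$.
The formula is undefined at $T = 0$; "temperature 0" refers to the limit as $T$
approaches 0, where all of the probability goes to the highest-scoring option.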
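
As a rough illustration of the formula, the sketch below computes $P_i$ for a few made-up scores at different temperatures. The function name `softmax_with_temperature` and the example scores are assumptions for this illustration only.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores y_i into probabilities P_i at temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [y / temperature for y in scores]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # invented scores for three candidate words
for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(scores, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

At the low temperature nearly all of the probability lands on the top-scoring option, while at the high temperature the three options end up close to equally likely.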

In [7]:
# TODO: This is the full code, we might want to make
# a version of this with fill in the blanks for people
# who might want to do it themselves
import collections, random, textwrap

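# Set some parameters
f_name = "hobbit.txt"  # Your input file
output_size = 50       # How many words you want to generate
n_grams = 2            # TODO: How many words do you want to use to predict?
temperature = 1        # TODO: Could be a nice way to talk about this
# NOTE: n_grams and temperature are not used yet; the code below hard-codes
# a two-word prefix and picks each next word in proportion to its count.
# TODO: Anything else?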

# Build possibles table indexed by pair of prefix words (w1, w2)
w1 = w2 = ''
possibles = collections.defaultdict(list)


with open(f_name, "r", encoding="utf-8") as words:
    for line in words:
        for word in line.split():
            # TODO: Make this use the n_grams parameter
            possibles[w1, w2].append(word)
            w1, w2 = w2, word

# Avoid empty possibles lists at end of input
possibles[w1, w2].append('')
possibles[w2, ''].append('')

# Generate randomized output (start with a random capitalized prefix)
w1, w2 = random.choice([k for k in possibles if k[0][:1].isupper()])
output = [w1, w2]
for _ in range(output_size):
    # TODO: Add temperature.
    word = random.choice(possibles[w1, w2])
    output.append(word)
    w1, w2 = w2, word

# Print output wrapped to 70 columns
print(textwrap.fill(' '.join(output)))


Bilbo could wait no longer. They would stand a siege for weeks, and by
magic. Somebody kicked the sparks up in his hall, and you don’t know
if it was heavy on him. He was lying a great distance. The Lord of the
dragon’s jaws. He circled for a mercy and a
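
The cell above hard-codes a two-word prefix and does not use `temperature`. One possible way to wire in both parameters is sketched below. It is an assumption, not this notebook's method: it treats the log of each follower's count as its score $y_i$, which makes the sampling weights proportional to `count ** (1 / T)`, so `T = 1` matches the behaviour of the original code and `T = 0` always picks the most frequent follower. The function names and the sample text are hypothetical.

```python
import collections
import random

def build_table(words, n):
    """Count how often each word follows each n-word prefix."""
    table = collections.defaultdict(collections.Counter)
    prefix = ("",) * n
    for word in words:
        table[prefix][word] += 1
        prefix = prefix[1:] + (word,)
    return table

def pick(counter, temperature, rng):
    """Choose a follower; weights are proportional to count ** (1 / T)."""
    candidates = list(counter)
    if temperature == 0:
        return max(candidates, key=counter.__getitem__)
    top = max(counter.values())  # normalise so the weights never overflow
    weights = [(counter[w] / top) ** (1.0 / temperature) for w in candidates]
    return rng.choices(candidates, weights=weights)[0]

def generate(words, n=2, temperature=1.0, size=50, seed=None):
    """Generate up to `size` words after a random capitalized prefix."""
    rng = random.Random(seed)
    table = build_table(words, n)
    starts = [k for k in table if k[0][:1].isupper()] or list(table)
    prefix = rng.choice(starts)
    output = [w for w in prefix if w]
    for _ in range(size):
        counter = table.get(prefix)
        if not counter:  # this prefix only appeared at the end of the input
            break
        word = pick(counter, temperature, rng)
        output.append(word)
        prefix = prefix[1:] + (word,)
    return " ".join(output)

# Made-up sample text so the sketch runs without an input file.
sample = "The cat sat on the mat. The cat ran to the dog. The dog sat on the cat."
print(generate(sample.split(), n=2, temperature=0.5, size=20, seed=1))
```

To try it on a real book, the word list could be read with something like `open(f_name, encoding="utf-8").read().split()` and passed to `generate` along with `n_grams` and `temperature`.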


## So what does this have to do with ChatGPT?
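- It performs the same task: predicting the next word from the words that came before.
- It has a much smaller sense of context: only the last few words, rather than the long passages an LLM can take into account.
- It doesn't understand word relations: each word is just a string, so similar words are treated as unrelated.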
    - Talk about classic example of words as vectors
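
The classic words-as-vectors example is the analogy "king - man + woman ≈ queen". The sketch below reproduces it with hand-made three-dimensional vectors. Every number is invented for illustration; real models learn vectors with hundreds of dimensions from data, and the `cosine` helper is a hypothetical name.

```python
import math

# Hand-made toy vectors. The three axes (royalty, maleness, femaleness)
# are invented for this illustration and are not learned from any data.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compute king - man + woman, then find the word whose vector is closest.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(closest)
```

The Markov model in this notebook has nothing like this: to it, "king" and "queen" are two unrelated dictionary keys.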