# How Likely Is a Sentence?

## Your Objective

In this activity, you will examine how a language model assigns probabilities to each token in an input sequence and how those probabilities combine to determine the likelihood of the entire sequence. You will explore both conditional probability and log probability to see why working in log space is essential for numerical stability and meaningful interpretation.

You'll complete the following tasks:

* [Task 1 of 9: Choose a Model](#Task-1-of-9:-Choose-a-Model)

* [Task 2 of 9: Load the Model and Tokenizer](#Task-2-of-9:-Load-the-Model-and-Tokenizer)
  
* [Task 3 of 9: Prepare a Prompt and Tokenize It](#Task-3-of-9:-Prepare-a-Prompt-and-Tokenize-It)

* [Task 4 of 9: Run a Forward Pass of the Language Model](#Task-4-of-9:-Run-a-Forward-Pass-of-the-Language-Model)

* [Task 5 of 9: Convert Logits to Probabilities With Softmax](#Task-5-of-9:-Convert-Logits-to-Probabilities-With-Softmax)

* [Task 6 of 9: Define a Helper Barplot Function](#Task-6-of-9:-Define-a-Helper-Barplot-Function)
  
* [Task 7 of 9: Calculate the Conditional Probability of the Overall Sentence](#Task-7-of-9:-Calculate-the-Conditional-Probability-of-the-Overall-Sentence)
 
* [Task 8 of 9: Transform Probabilities To Make Them Easier to Work With](#Task-8-of-9:-Transform-Probabilities-To-Make-Them-Easier-to-Work-With)

## Import Packages

Run the code cell below to import the packages you will use in this project.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import numpy as np

## Task 1 of 9: Choose a Model

Navigate to the [Hugging Face Hub website](https://huggingface.co/models) and choose a text generation model in the 100M–1B parameter range.

>**Important:** Choose a smaller-sized model (fewer than 1B parameters) to ensure it runs smoothly.
If the model doesn’t load properly or you encounter errors, try selecting a different one. Anyone can upload models to the Hugging Face Hub, and some may not work as expected.
If you continue to have trouble, use `gpt2`, a reliable model that should run without issues.

**In the code cell below, complete the code to specify the model you'll use.** Replace `...` with the name of the model you chose, written as a string (for example, `"gpt2"`).

In [None]:
# Look around on the Hugging Face Hub for an interesting model!
model_name = ...

## Task 2 of 9: Load the Model and Tokenizer 

Run the code cell below to load your chosen model and its associated tokenizer.

Be patient; it may take time to load.

In [None]:
# Your code here!
tokenizer = ...               # Load the tokenizer  
model = ...                   # Load the model 

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## Task 3 of 9: Prepare a Prompt and Tokenize It

Compose a prompt that is a complete sentence. Replace `YOUR PROMPT HERE` with your prompt. 

After you've defined your prompt, run the code cell below to tokenize the prompt and convert the string prompt to a PyTorch tensor containing token IDs.

In [5]:
# Your code here!
prompt = "YOUR PROMPT HERE"
inputs = tokenizer(prompt, return_tensors="pt") 

## Task 4 of 9: Run a Forward Pass of the Language Model

The first step in determining the probability of an input sequence is computing the next-token probabilities, because the model builds up the likelihood of a sentence one token at a time, always conditioning on the words that came before.

The next code cell runs a forward pass of the language model on your input sequence and returns the raw prediction scores, called logits. Remember, logits are the raw scores for every possible next token given the input prompt (context), before they are converted into probabilities.

In [7]:
# Your code here!
with torch.no_grad():
    outputs = model(**inputs) 

## Task 5 of 9: Convert Logits to Probabilities With Softmax

Complete and run the code cell below to extract the logits and apply the `softmax()` function to them. This converts each row of logits into a probability distribution.

In [None]:
logits = ...                           # Get logits from your raw model outputs 
probabilities = ...                    # Use the softmax function to convert logits into probabilities
probabilities.shape                    # Check out the shape

torch.Size([6, 50257])

What does the output mean? It is a matrix of probabilities with shape `(sequence_length, vocab_size)`. (The model returns logits with a leading batch dimension, so this shape assumes you selected the single sequence in the batch, for example with `outputs.logits[0]`.)
> The first dimension is the **sequence length**: the number of tokens in the input after tokenization.
> The second dimension is the **vocabulary size**: the number of possible tokens the model can choose from at each position.

Each row corresponds to one token position in the input sequence and contains a probability distribution over all possible next tokens in the vocabulary, so the probabilities in each row sum to 1. In other words, for each position, the model assigns a probability to every token in its vocabulary as the candidate next token, given all the input tokens up to and including that position.
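
To make the row-wise behavior concrete, here is a minimal pure-Python sketch of softmax applied to a tiny matrix of made-up logits (two positions, a toy vocabulary of three tokens). The names `softmax`, `toy_logits`, and `toy_probabilities` are illustrative only; in the activity itself the equivalent operation is `F.softmax(logits, dim=-1)` from `torch.nn.functional`.

```python
import math

def softmax(row):
    """Convert one row of raw logits into probabilities that sum to 1."""
    m = max(row)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: 2 token positions, toy vocabulary of 3 tokens
toy_logits = [
    [2.0, 1.0, 0.1],
    [0.5, 0.5, 3.0],
]

# Apply softmax independently to each row (each token position)
toy_probabilities = [softmax(row) for row in toy_logits]

for position, row in enumerate(toy_probabilities):
    print(position, [round(p, 3) for p in row], "row sum =", round(sum(row), 6))
```

Larger logits receive larger probabilities, and every row sums to 1 regardless of the scale of its logits.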


## Task 6 of 9: Define a Helper Barplot Function

Run the code cell below to define a function that translates numbers into printable bar plots.

In [10]:
fractions = [ " ", chr(9615), chr(9614), chr(9613), chr(9612), chr(9611), chr(9610), chr(9609), chr(9608) ]

def bar(x, width=15):
    full, remainder = divmod(int(x * 8 / (1.0 / width)), 8)
    bar_string = '█' * full + fractions[remainder]
    return bar_string

## Task 7 of 9: Calculate the Conditional Probability of the Overall Sentence

### How do you calculate the probability of the overall sentence? Multiply the conditional probabilities of its tokens!

We can take the probability matrix and calculate the probability of the specific input sequence by selecting the probability assigned to each token that actually appears in the sequence and multiplying those values together.

The *joint probability* of the full sequence is the product of the *conditional probabilities* of each token. For example, for the phrase "I like to" (treating each word as one token):

$$P(\text{I, like, to}) = P(\text{I}) \, P(\text{like} \mid \text{I}) \, P(\text{to} \mid \text{I, like})$$
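
As a small worked instance of this product, the sketch below uses a made-up probability matrix over a toy three-token vocabulary. All names and numbers (`toy_vocab`, `toy_probabilities`, `toy_input_ids`) are hypothetical. Note the offset: row `position - 1` holds the distribution used to score the token at `position`, so the first token is never scored and the product here is $P(\text{like} \mid \text{I}) \, P(\text{to} \mid \text{I, like})$.

```python
# Made-up next-token distributions over a toy 3-token vocabulary.
# Row i is the distribution over the token at position i + 1.
toy_vocab = ["I", " like", " to"]
toy_probabilities = [
    [0.1, 0.6, 0.3],  # after "I"
    [0.2, 0.1, 0.7],  # after "I like"
    [0.5, 0.3, 0.2],  # after "I like to" (not used: nothing follows)
]
toy_input_ids = [0, 1, 2]  # "I", " like", " to"

sequence_probability = 1.0
for position in range(1, len(toy_input_ids)):
    token_id = toy_input_ids[position]
    # Row (position - 1) scores the token that appears at `position`
    token_prob = toy_probabilities[position - 1][token_id]
    sequence_probability *= token_prob
    print(repr(toy_vocab[token_id]), token_prob, sequence_probability)
```

With these numbers the running product is 0.6 after " like" and about 0.42 after " to".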

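Run the code cell below to display a table showing each token in the sequence, its probability from the model, the cumulative sequence probability up to that point, and a visual probability bar. The loop starts at position 1 because row 0 of the matrix predicts the token at position 1; the first token has no preceding context, so its probability is not included in the product.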

In [11]:
seq_len = inputs["input_ids"].shape[1]

# Print header
print("{:<15s} {:<10s} {:<12s} {:<20s} {}".format(
    "Prob", "Token ID", "Token Prob", "Decoded Token", "Probability Bar"
))

probability = 1.0
for position in range(1, seq_len):
    word_id = inputs["input_ids"][0, position]
    token_prob = probabilities[(position - 1), word_id].item()
    probability *= token_prob
    
    print("{:<15.12f} {:<10d} {:<12.4f} {:<20s} {}".format(
        probability,
        word_id.item(),
        token_prob,
        "'" + tokenizer.decode(word_id) + "'",
        bar(token_prob, 50)
    ))

Prob            Token ID   Token Prob   Decoded Token        Probability Bar
0.000063517400  11698      0.0001       'OUR'                 
0.000000182682  4810       0.0029       ' PR'                ▏
0.000000004905  2662       0.0268       'OM'                 █▎
0.000000000207  11571      0.0423       'PT'                 ██ 
0.000000000000  15698      0.0004       ' HERE'               


**Take Note**:
- What is the cumulative probability after the last token in your prompt? **Probability tends to get *very* small very quickly.** Each additional factor less than 1 shrinks the product, so the sequence probability decays roughly *exponentially* with length. This is normal for sequence probabilities.

- Are there any **single low-probability tokens that sharply reduce the total probability** of the full sequence?

- Compare probability bars across tokens. **Tiny or empty bars mean the model assigned *very low probability* to that token** (it found the token surprising in context); large bars mean the model expected that token.

- **Unlikely sequences can still be meaningful**. Just because the probability is tiny, it doesn't mean the model thinks the text is "wrong"; natural language is full of low-probability token combinations. 

## Task 8 of 9: Transform Probabilities To Make Them Easier to Work With

Calculating the probability of a sentence by multiplying token probabilities results in values that quickly become extremely small, due to the multiplication of many numbers less than 1.  

There is a way to avoid this: 

### Add log probabilities instead of multiplying probabilities.  

By summing the log probabilities instead, we avoid numerical underflow and can track the sequence likelihood without it shrinking exponentially.

For example:
> If `"I like to"` is tokenized into three tokens, then the probability of the whole sequence is:
>
> $$P(\text{I, like, to}) = P(\text{I}) \, P(\text{like} \mid \text{I}) \, P(\text{to} \mid \text{I, like})$$
>
> Taking the logarithm, and using the property $\log(ab) = \log(a) + \log(b)$, we get:
>
> $$\log P(\text{I, like, to}) = \log P(\text{I}) + \log P(\text{like} \mid \text{I}) + \log P(\text{to} \mid \text{I, like})$$
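
The underflow problem can be seen directly in a short sketch. It assumes a made-up sequence of 200 tokens that each receive probability 0.01 (both values are illustrative, not taken from any model). The raw product falls below the smallest representable double and collapses to exactly `0.0`, while the sum of logs stays an ordinary number.

```python
import math

token_prob = 0.01   # made-up probability, reused for every token
num_tokens = 200    # made-up sequence length

product = 1.0
log_sum = 0.0
for _ in range(num_tokens):
    product *= token_prob             # shrinks toward zero
    log_sum += math.log(token_prob)   # grows more negative, one step per token

print(product)   # 0.0 (underflow: the true value, 1e-400, is not representable)
print(log_sum)   # about -921.03
```

Once the product has underflowed, all information about the sequence is lost; the log sum still distinguishes this sequence from a more or less likely one.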

Run the code cell below to display the running sum of log probabilities for the sequence, along with a visual probability bar for each token.

In [12]:
seq_len = inputs["input_ids"].shape[1]

# Print header
print("{:<15s} {:<10s} {:<12s} {:<20s} {}".format(
    "LogP", "Token ID", "LogP(Token)", "Decoded Token", "Probability Bar"
))

log_probability = 0.0
for position in range(1, seq_len):
    word_id = inputs["input_ids"][0, position]

    token_prob = probabilities[(position-1), word_id].item()
    token_logprob = np.log(token_prob)
    log_probability += token_logprob
    
    print("{:<15.1f} {:<10d} {:<12.4f} {:<20s} {}".format(
        log_probability,
        word_id.item(),
        token_logprob,
        "'" + tokenizer.decode(word_id) + "'",
        bar(token_prob, 50)
    ))

LogP            Token ID   LogP(Token)  Decoded Token        Probability Bar
-9.7            11698      -9.6642      'OUR'                 
-15.5           4810       -5.8513      ' PR'                ▏
-19.1           2662       -3.6175      'OM'                 █▎
-22.3           11571      -3.1629      'PT'                 ██ 
-30.2           15698      -7.8836      ' HERE'               
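
To connect this table with the one from Task 7, the per-token log probabilities in the sample output above can be summed and exponentiated by hand. This is only a sketch using the printed (rounded) values; your own numbers will differ with your model and prompt.

```python
import math

# Per-token log probabilities copied from the sample output above
token_logprobs = [-9.6642, -5.8513, -3.6175, -3.1629, -7.8836]

total_logprob = sum(token_logprobs)
sequence_probability = math.exp(total_logprob)

print(round(total_logprob, 1))   # -30.2, matching the last row of the table
print(sequence_probability)      # roughly 7.8e-14
```

The Task 7 table displayed this final probability as `0.000000000000` because twelve decimal places were not enough to show it; the log-space total of about -30.2 carries the same information in a readable form.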


**Take Note:**
- **Log probabilities are never positive.** A high-confidence token with $P$ close to 1 will have a log probability near 0. A low-probability token with $P$ close to 0 will have a large negative log probability.
- **Log probability decreases steadily.** Since each token's log probability is added, the running total keeps getting more negative as the sequence grows. Compare this to the cumulative probability from Task 7.
- Even for long sequences, the numbers remain within a normal range for floating-point computation and there is **no numerical underflow**.
- **Differences between tokens are easier to interpret.** In log space, large negative jumps indicate particularly unlikely tokens relative to the context. 
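
One caveat about the Task 8 cell: it takes `np.log` of a probability that softmax has already computed, so if that probability has underflowed to 0 the log becomes `-inf`. A common alternative, which this activity does not use, is to compute log probabilities directly from the logits with the log-sum-exp trick (in PyTorch, `torch.nn.functional.log_softmax`). The pure-Python sketch below uses made-up logits and a hypothetical helper named `log_softmax` to show the difference.

```python
import math

def log_softmax(logits):
    """Log probabilities computed directly from logits (log-sum-exp trick)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [0.0, -800.0]  # made-up values with one extremely unlikely token

# Naive route: softmax first; exp(-800) underflows to 0.0
exps = [math.exp(z) for z in logits]
naive_probs = [e / sum(exps) for e in exps]
print(naive_probs)   # [1.0, 0.0]; taking the log of 0.0 would fail

# Stable route: stay in log space the whole time
log_probs = log_softmax(logits)
print(log_probs)     # [0.0, -800.0]
```

For the short prompts in this activity both routes give the same numbers; the stable route matters when individual token probabilities are extremely small.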

### **Conclusion**
In this exercise, we explored how to measure the probability a model assigns to each token of a sequence.  
We first calculated the **sequence probability** by multiplying the token probabilities together and saw how the result quickly became extremely small, because every factor is less than 1.  

To address this, we switched to **log probabilities**, which transform multiplication into addition. This not only avoids numerical underflow but also makes it easier to interpret token‑level contributions to the overall sequence likelihood.  

**Key Takeaway:**  
> Working in log space is the standard approach in NLP for measuring sequence likelihood, because it is numerically stable and more interpretable than raw probabilities.