<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 5: Probability and Statistics Applications</a>
## <a name="0">Lab 5.4: Probability applied to LLM</a>

 1. <a href="#1">Generating text</a> 
 2. <a href="#2">LLMs and probability distributions</a> 
 3. <a href="#3">Entropy</a> 
 4. <a href="#4">Language Model Evaluation using Perplexity</a> 
 
 In this notebook, we will explore how probability distributions and conditional probabilities play a crucial role in the functioning of large language models (LLMs). We will use the [GPT-2 model](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as our "toy model" to illustrate these concepts. GPT-2, developed by OpenAI, is a generative pre-trained transformer model that can generate coherent and contextually relevant text given an initial prompt.

What you will learn:
* Probability Distributions: Understand how GPT-2 models the probability distribution of the next token in a sequence.
* Conditional Probabilities: Learn how the model uses the context provided by the input sequence to compute the conditional probability of each possible next token.
* Perplexity: See how perplexity is used to measure the model's uncertainty in predicting the next token.
* Entropy and Sampling: Explore how different sampling strategies (e.g., top-k sampling, nucleus sampling) affect the entropy and diversity of generated text.

By the end of this notebook, you will have a deeper understanding of how LLMs leverage mathematical tools from probability theory to generate text and how these concepts are implemented in practice. 

In [None]:
!pip install transformers --quiet

In [None]:
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch #PyTorch library for tensor operations
import torch.nn.functional as F

In [None]:
%%time
# Load pre-trained model and tokenizer
model_name = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

## <a name="1">1. Generating text</a> 
(<a href="#0">Go to top</a>)

Large language models (LLMs) like GPT-2 work with **tokens**, which are the fundamental units of text. Tokens can represent various elements within a text, such as words, subwords, or punctuation. The model first **encodes** text into tokens, converting the input text into a sequence of token IDs that it can process. A token ID is a unique integer assigned to each token in the vocabulary of a language model; the tokenizer below returns these IDs wrapped in a PyTorch tensor.
After generating text, the token IDs are **decoded** back into human-readable text.

In [None]:
# Input text
input_text = "Machine Learning University"

# Tokenize input text
input_ids = tokenizer.encode(input_text, return_tensors='pt')
print(f"Encoded text: {input_ids}")

# Inspect the tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print("Tokens:", tokens)

# Decode the tokens back to text
decoded_text = tokenizer.decode(input_ids[0])
print("Decoded text:", decoded_text)

In the next cell, we demonstrate how GPT-2 generates text continuations for a given input prompt using different sampling methods. 
Here's a brief overview of the steps:

* Encode Input Text: Convert the input prompt into token IDs.
* Create Attention Mask: a tensor of ones and zeros that tells the model which positions hold real tokens (1) and which hold padding that should be ignored (0).
* Generate Text: Use the model to generate multiple sequences with diverse continuations. See below for an explanation of the main parameters.
* Sampling Methods: Enable sampling to add diversity, and use top-k and top-p sampling to control the range of possible next tokens.
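
To make the attention-mask step concrete, here is a minimal, dependency-free sketch of how token-ID sequences of different lengths could be padded and masked. The helper `pad_batch` and the token IDs are hypothetical and only for illustration, and right-padding is chosen here purely for simplicity. With a single unpadded prompt, the mask is simply all ones.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to a common length and build attention masks.

    Hypothetical helper for illustration: 1 marks a real token, 0 marks padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Made-up token IDs; 50256 is GPT-2's end-of-text ID, reused here as padding
batch = [[464, 7577, 286], [15496]]
padded, masks = pad_batch(batch, pad_id=50256)
print(padded)
print(masks)
```

The second sequence receives two padding positions, and its mask marks only the first position as a real token.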

### Parameters
`do_sample`:

When True, the model uses sampling to generate the next token. This introduces randomness into the text generation process, allowing for more diverse and creative outputs.
Sampling means that instead of always picking the token with the highest probability (which would be deterministic), the model selects tokens based on their probability distribution. This makes the generation process less predictable and can produce a variety of different sequences even when given the same input text.

`temperature`:

Adjusts the randomness of predictions by scaling the logits before applying softmax.
Lower values (e.g., 0.5) make the model more deterministic, while higher values (e.g., 1.5) introduce more randomness.
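
As a rough illustration of temperature scaling, the sketch below divides a made-up logit vector by the temperature before applying softmax. The function name `softmax_with_temperature` and the logit values are assumptions for this example, not part of the `transformers` API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the lower temperature the most likely token takes a larger share of the probability mass, while the higher temperature flattens the distribution.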

`top_k` Sampling:

Limits the sampling pool to the top K highest probability tokens.
For example, with top_k=50, only the 50 most likely next tokens are considered for sampling.

`top_p` (Nucleus) Sampling:

Limits the sampling pool to the smallest set of tokens whose cumulative probability exceeds a threshold $p$.
For example, with top_p=0.95, tokens are considered until their cumulative probability is at least 95%.
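
The two filtering strategies can be sketched on a toy distribution as follows. This is a simplified illustration under stated assumptions, not the library's implementation: `top_k_top_p_filter`, `renormalize`, and the token probabilities are made up for this example.

```python
def renormalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Apply top-k, then top-p filtering to a {token: probability} mapping."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        # Keep only the k most likely tokens
        ranked = renormalize(ranked[:top_k])
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    return dict(renormalize(ranked))


probs = {"red": 0.5, "blue": 0.3, "green": 0.15, "pink": 0.05}  # toy values
print(top_k_top_p_filter(probs, top_k=2))
print(top_k_top_p_filter(probs, top_p=0.9))
```

After filtering, the remaining probabilities are renormalized so that sampling still draws from a valid distribution.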

In [None]:
# Encode input text
input_text = "The colors of the rainbow are"

input_ids = tokenizer.encode(input_text, return_tensors='pt')

attention_mask = torch.ones_like(input_ids)  # this one is telling the model to pay full attention to every token in the input text. 

# Generate text using different sampling methods
sample_outputs = model.generate(
    input_ids,
    max_length=20,                  # Maximum total length in tokens (prompt + continuation)
    attention_mask=attention_mask,  # Pass attention mask
    num_return_sequences=5,         # Number of sequences to generate
    do_sample=True,                 # Enable sampling (adds diversity)
    top_k=50,                       # Consider only the 50 most likely tokens (top-k sampling)
    top_p=0.95,                     # Keep the smallest token set with cumulative prob ≥ 0.95 (nucleus sampling)
    temperature=1,                  # Adjusts randomness; lower values make output more deterministic (default is 1)
    pad_token_id=tokenizer.eos_token_id  # Set pad token ID to EOS token ID
)

# Print generated samples
for i, sample_output in enumerate(sample_outputs):
    print(f"{i+1}) {tokenizer.decode(sample_output, skip_special_tokens=True)}")

The vocabulary of a language model like GPT-2 is the set of all tokens (words, subwords, punctuation marks, and special tokens) that the model can read and generate. It is essentially a predefined list that maps each token to a unique token ID. Here is how you can access it for GPT-2:

In [None]:
# Access vocabulary as a dictionary
vocab = tokenizer.get_vocab()
print(f"Size of gpt-2 vocabulary = {len(vocab)}")
print(f"Random sample from vocabulary: {', '.join(np.random.choice(list(vocab.keys()), 10))}")

## <a name="2">2. LLMs and probability distributions</a> 
(<a href="#0">Go to top</a>)

A large language model (LLM) is a probabilistic system designed to predict the next token in a sequence of text, conditioned on the tokens already present in the input: $P({\rm next\ token} | {\rm context})$. It estimates the probability of each token in its vocabulary being the next token in the sequence, taking into account the context provided by the input string. This predictive capability allows the model to generate coherent and contextually relevant text, making it a versatile tool for various natural language processing tasks.
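
Applying this conditional distribution one token at a time gives the probability of a whole sequence through the chain rule of probability. This is a standard identity; $x_{<i}$ is shorthand introduced here (not used elsewhere in this notebook) for the tokens preceding position $i$:

$$P(x_1, \dots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \dots, x_{i-1}) = \prod_{i=1}^{N} P(x_i \mid x_{<i})$$

Text generation then amounts to sampling one factor at a time and appending the sampled token to the context.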

Let's create a function that, given an input context (a text string), provides the conditional probability distribution of the next token:

In [None]:
# Function to predict probabilities of next token
def next_token_probabilities(input_text):
    # Tokenize input text
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Predict logits for next token
    with torch.no_grad():
        outputs = model(input_ids)
        next_token_logits = outputs.logits[:, -1, :]  # Logits for last token in sequence

    # Apply softmax to logits to get probabilities
    next_token_probs = F.softmax(next_token_logits, dim=-1)

    # Convert token probabilities to Python list for easier manipulation
    next_token_probs = next_token_probs.tolist()[0]

    # Token IDs run from 0 to (vocabulary size - 1), matching the order of the probabilities
    token_ids = range(len(next_token_probs))

    # Decode each ID to readable text (this removes byte-level markers such as "Ġ")
    tokens_decoded = [tokenizer.decode([token_id]) for token_id in token_ids]

    # Create a list of dictionaries for each token and its probability
    token_probabilities = [{'token': token_decoded, 'probability': prob} for token_decoded, prob in zip(tokens_decoded, next_token_probs)]

    # Convert list of dictionaries to DataFrame
    df = pd.DataFrame(token_probabilities).sort_values(by='probability', ascending=False)

    return df

A clarification on the above function.

Logits are the unconstrained output values of a neural network, before applying any activation function. In the context of a classification problem, the logits represent the raw, unscaled scores for each class. Mathematically, the logits can be represented as follows:

$\text{logits} = \mathbf{z} = \mathbf{W}^\top \mathbf{x} + \mathbf{b}$

Where:
$\mathbf{z}$ is the vector of logits, with each element corresponding to a class;
$\mathbf{W}$ is the weight matrix of the neural network;
$\mathbf{x}$ is the input vector;
$\mathbf{b}$ is the bias vector.

#### Softmax
The softmax function is an activation function used to convert logits into probabilities. It takes the logits as input and outputs a probability distribution over the classes.

The softmax function is defined as:

$\text{softmax}({z_i}) = \displaystyle{\frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}}$

Where:
$\text{softmax}({z_i})$ is the probability of the $i$-th class;
$z_i$ is the $i$-th logit value;
$K$ is the number of classes.

The softmax function has the following properties:

* The output values are non-negative and sum up to 1, forming a valid probability distribution.
* The function is differentiable, which is important for training neural networks using gradient-based optimization methods.
* The softmax function can be interpreted as a "soft" version of the argmax function, which selects the class with the highest logit value.

By applying the softmax function to the logits, we obtain a probability distribution over the classes. For a language model, the classes are the tokens of the vocabulary, so the result is the next-token distribution $P({\rm next\ token} | {\rm context})$.
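
The properties listed above can be checked on a small example. The sketch below is a plain-Python softmax written for illustration (the notebook itself relies on `F.softmax`); the logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, -2.0]  # made-up scores for three classes
probs = softmax(logits)
print([round(p, 4) for p in probs])

# Outputs are positive and sum to 1
print(all(p > 0 for p in probs), round(sum(probs), 6))

# The largest logit receives the largest probability ("soft" argmax)
print(probs.index(max(probs)) == logits.index(max(logits)))

# Adding the same constant to every logit leaves the probabilities unchanged
shifted = softmax([z + 100.0 for z in logits])
print(all(abs(a - b) < 1e-12 for a, b in zip(probs, shifted)))
```

The last check shows why only differences between logits matter, which is also what makes the max-subtraction trick safe.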

Let's apply the function:

In [None]:
# Example usage:
input_text = "The capital of France is"
df_probabilities = next_token_probabilities(input_text)

# Print the DataFrame
df_probabilities.head(10)

Let's visualize the probability distribution (mass function) for the next token:

In [None]:
topn = 30
df_top = df_probabilities.head(topn)  # the topn most likely tokens

plt.figure(figsize=(12, 3))
plt.bar(df_top['token'], df_top['probability'], color='skyblue')
plt.xlabel('Token')
plt.ylabel('Probability')
plt.xticks(rotation=60)  # Rotate x labels for better readability
plt.tight_layout()
plt.show()

Check if the 'probability' column represents valid probabilities:

In [None]:
if np.isclose(df_probabilities.probability.sum(), 1, rtol=1e-3):
    print("The 'probability' column represents valid probabilities.")
else:
    print("The 'probability' column does not represent valid probabilities.")

If we plot the distribution of probabilities, we will notice that a few tokens have high probabilities, while the majority of tokens have probabilities close to zero.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,3))
plt.hist(df_probabilities.probability, bins=300, log=True)
plt.xlabel('Probability')
plt.ylabel('Frequency')
plt.title('Histogram of Probabilities (Log Scale)')
plt.show()

## <a name="3">3. Entropy</a> 
(<a href="#0">Go to top</a>)

Can we quantify the uncertainty of the language model when it predicts the next token?
One way to do this is to calculate the entropy of the probability distribution of the next token. Entropy is a measure of the average amount of information produced by a stochastic process, in this case, the distribution of token probabilities predicted by the language model.

$$H(X) = -\sum_i p(x_i)\cdot \log_2 p(x_i)$$

where $p(x_i)$ is the probability of the $i$-th token $x_i$, and the sum is over all tokens in the distribution. With the base-2 logarithm, $H(X)$ is measured in bits.

Let's create a function that calculates the entropy of a probability distribution:

In [None]:
def calc_entropy(df):
    """
    Calculate the entropy of a probability distribution given as a DataFrame.

    Parameters:
    - df (pd.DataFrame): DataFrame with columns 'token' and 'probability'.
                         'probability' should contain probabilities of tokens.

    Returns:
    - float: Entropy value calculated from the probabilities in the DataFrame.
    """
    p = df['probability'].to_numpy()
    p = p[p > 0]  # terms with p = 0 contribute nothing, and log2(0) is undefined
    return -np.sum(p * np.log2(p))

Let's calculate the entropy of the next token for 3 examples of context, from generic to more specific ones:

In [None]:
input_text1 = "Once upon a time"
print(f"Entropy: {calc_entropy(next_token_probabilities(input_text1)):.5f}")

In [None]:
input_text2 = "Guitar is a musical"
print(f"Entropy: {calc_entropy(next_token_probabilities(input_text2)):.5f}")

In [None]:
input_text3 = "one two three four"
print(f"Entropy: {calc_entropy(next_token_probabilities(input_text3)):.5f}")

Interpreting entropy: a higher entropy value indicates greater uncertainty in the predictions. If the model is unsure about which token will come next (more evenly spread probabilities), the entropy will be higher. Conversely, if the model is very confident (one probability near 1 and others near 0), the entropy will be lower.
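
A toy comparison may help build intuition. The sketch below computes entropy in bits for three made-up distributions over four tokens; `entropy_bits` is a small helper written for this example, using the equivalent form $\sum_i p(x_i)\log_2\frac{1}{p(x_i)}$.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # fairly confident
certain = [1.0, 0.0, 0.0, 0.0]      # fully confident

for name, dist in [("uniform", uniform), ("peaked", peaked), ("certain", certain)]:
    print(f"{name}: {entropy_bits(dist):.3f} bits")
```

The uniform distribution over four tokens reaches the maximum of 2 bits, the peaked one is much lower, and the fully certain one has zero entropy.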

## <a name="4">4. Language Model Evaluation using Perplexity</a> 
(<a href="#0">Go to top</a>)

In the previous section, we focused on the entropy associated with the prediction of the next token. We now introduce _perplexity_, another measure of the model's uncertainty, which is closely related to entropy but expressed on a more intuitive scale. Perplexity quantifies how well a language model predicts a sample (a sequence of tokens) based on its probability distribution over the next tokens.

It is calculated by exponentiating the average negative log-likelihood (the cross-entropy) of the model's predictions for a sequence:

$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^N \log p(x_i)\right)= \left(\prod_{i=1}^N p(x_i)\right)^{-\frac{1}{N}}$$
Where 
$N$ is the number of tokens in the generated text; 
$x_i$ is the $i$-th token in the generated text; 
$p(x_i)$ is the conditional probability of the $i$-th token given the previous tokens.

It can be thought of as a measure of how "difficult" it is for the model to predict the next token. A lower perplexity indicates that the model assigns higher probability to the observed tokens, which reflects a better grasp of the patterns and nuances of human language. Equivalently, a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens.
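
To see the formula at work, the sketch below computes perplexity from a list of per-token conditional probabilities, using both forms of the definition. The probability values and the helper names `perplexity` and `perplexity_geometric` are made up for this example.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)


def perplexity_geometric(token_probs):
    """Equivalent form: inverse geometric mean of the token probabilities."""
    n = len(token_probs)
    return math.prod(token_probs) ** (-1.0 / n)


confident = [0.9, 0.8, 0.95, 0.85]  # model assigns high probability to each token
unsure = [0.25, 0.25, 0.25, 0.25]   # as if guessing among 4 equally likely tokens

for name, probs in [("confident", confident), ("unsure", unsure)]:
    print(f"{name}: {perplexity(probs):.3f} vs {perplexity_geometric(probs):.3f}")
```

Both forms agree, and the second case gives a perplexity of about 4, matching the "choosing among 4 tokens" reading.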

Here, we calculate perplexity over an entire sequence of tokens.

In [None]:
def calculate_perplexity(model, tokenizer, text):
    # Encode input text
    input_ids = tokenizer.encode(text, return_tensors='pt')
    attention_mask = torch.ones_like(input_ids)

    # Calculate loss: with labels=input_ids the model returns the cross-entropy
    # averaged over the predicted tokens (mean negative log-likelihood)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss  # Mean negative log-likelihood per token

    # Calculate perplexity: the loss is already a per-token average,
    # so it must not be divided by the number of tokens again
    perplexity = torch.exp(loss)
    
    # Display results
    print(f"Text: {text}")
    print(f"Negative Log-Likelihood (Loss): {loss:.4f}")
    print(f"Perplexity: {perplexity:.2f}")

    return loss.item(), perplexity.item()

### Exercise

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Use the model to generate three different texts from the prompt <i>"The future of AI is"</i> with the following temperature values: $0.1$, $1.0$ and $2.0$. Compute the loss and perplexity in each case, and compare the generated texts as well as these two metrics across the three runs.</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab54_ex1_solutions.txt

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Extra Challenge</b></p>
        <p>Here are some ideas that you can try next.</p>
        <p>It may also be interesting to use the code from the previous exercise to evaluate how perplexity changes when we generate texts of different lengths. We suggest not exceeding <code>max_length=500</code> to keep execution time on the order of a few minutes.</p>
    </span>
</div>

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 5.4: Probability applied to LLM of Lecture 5: Probability and Statistics Applications of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>