In [17]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
#from install import *
#install_requirements()

fatal: destination path 'notebooks' already exists and is not an empty directory.
/content/notebooks/notebooks


In [2]:
#%%capture
!pip install transformers==4.41.2
!pip install datasets==2.20.0

!pip install pyarrow==16.0
!pip install requests==2.32.3

!pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0

!pip install importlib-metadata

!pip install accelerate -U



In [18]:
from utils import *
setup_chapter()

No GPU was detected! This notebook can be *very* slow without a GPU 🐢
Go to Runtime > Change runtime type and select a GPU hardware accelerator.
Using transformers v4.41.2
Using datasets v2.20.0


In [19]:
#%%capture
# Verifying packages installed are now up to date
!pip show pyarrow requests transformers datasets torch torchaudio importlib-metadata

Name: pyarrow
Version: 16.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: 
Author-email: 
License: Apache License, Version 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy
Required-by: bigframes, cudf-cu12, datasets, db-dtypes, ibis-framework, pandas-gbq, tensorflow-datasets
---
Name: requests
Version: 2.32.3
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: bigframes, CacheControl, community, datasets, earthengine-api, fastai, folium, gcsfs, gdown, geocoder, google-api-core, google-cloud-bigquery, google-cloud-storage, google-colab, huggingface-hub, kaggle, kagglehub, moviepy, music21, pandas-datareader, panel, pooch, pymystem3, requests-oauthlib, spacy, Sphinx, tensorboard, tensorflow-datasets, torchtext, tr

In [20]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [21]:
#!cat /proc/cpuinfo

In [22]:
import torch
import transformers
import datasets
import tokenizers

print("PyTorch Version:" + torch.__version__)
print("Transformers Version:" + transformers.__version__)
print("Datasets Version:" + datasets.__version__)
print("Tokenizers Version:" + tokenizers.__version__)

PyTorch Version:2.3.0+cu121
Transformers Version:4.41.2
Datasets Version:2.20.0
Tokenizers Version:0.19.1


In [23]:
# hide_output
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

## Greedy Search Decoding

Greedy search decoding is a simple and common method for generating text with transformer language models. It operates on a straightforward principle: at each step, it selects the token with the highest probability given the tokens generated so far. Here's a breakdown of how it works:

### Mechanism of Greedy Search Decoding

1. **Initialization**: The process begins with an initial input, which can be a start token or a prompt provided by the user. The model uses this input to predict the probabilities of the next possible words.

2. **Word Selection**: Out of the predicted probabilities for the next words, the word with the highest probability is chosen as the next word in the sequence.

3. **Sequence Update**: This chosen word is then appended to the sequence.

4. **Repetition**: The updated sequence (original input plus the new word) is fed back into the model. This process is repeated until a stop condition is met—typically when a maximum sequence length is reached or a special end-of-sequence token is generated.

5. **Output**: The final sequence of words generated through this method forms the completed text.
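
The steps above can be sketched on a toy model, with no neural network involved. In this minimal sketch, `TOY_MODEL` and `greedy_decode` are hypothetical names, and the probabilities are made up for illustration; the "model" conditions only on the last token, whereas a real transformer sees the whole sequence.

```python
# Toy stand-in for a language model: maps the last token to a
# probability distribution over next tokens (made-up values).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.6, "<eos>": 0.4},
    "dog": {"ran": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}


def greedy_decode(model, prompt, max_new_tokens=10, eos="<eos>"):
    """Append the most probable next token until EOS or the length limit."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        next_probs = model[sequence[-1]]             # predict next-token probabilities
        best = max(next_probs, key=next_probs.get)   # select the most probable token
        sequence.append(best)                        # update the sequence
        if best == eos:                              # stop condition
            break
    return sequence


result = greedy_decode(TOY_MODEL, ["<s>"])
print(result)  # ['<s>', 'the', 'cat', 'sat', '<eos>']
```

Because the selection is a plain argmax, running the function twice on the same prompt always returns the same sequence.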

### Advantages and Disadvantages

**Advantages**:
- **Speed**: Greedy search is computationally efficient because it only requires a single forward pass through the model to select the highest probability word at each step.
- **Simplicity**: It is straightforward to implement and understand.

**Disadvantages**:
- **Lack of Diversity**: Since it always chooses the most likely word, greedy search can lead to repetitive and generic text. It often misses more interesting or nuanced combinations of words that might have a slightly lower probability but could contribute to a more coherent or creative overall piece.
- **Risk of Getting Stuck**: Greedy search can sometimes get stuck in suboptimal loops or dead ends where the text becomes nonsensical or overly repetitive, as it does not reconsider its past choices.

### Comparison to Other Decoding Methods

In contrast to greedy search, other decoding methods like beam search or sampling-based approaches (e.g., top-k sampling, nucleus sampling) offer alternatives that balance between the likelihood of words and the diversity of the generated text. Beam search, for instance, keeps track of a number of hypotheses at each step (the "beam width"), and only the best hypotheses according to their cumulative probabilities are extended. This often results in better quality outputs compared to greedy search.
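
To make the contrast concrete, here is a minimal beam search sketch over a toy next-token table. The names `TOY_MODEL` and `beam_search` and all probabilities are assumptions chosen for illustration, and the sketch omits refinements that production implementations may add, such as length normalization. The table is built so that the locally best first token leads to a worse overall sequence.

```python
import math

# Made-up next-token probabilities, conditioned only on the last token.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"w": 0.9, "v": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
    "w": {"<eos>": 1.0},
    "v": {"<eos>": 1.0},
}


def beam_search(model, start, num_beams=2, max_new_tokens=5, eos="<eos>"):
    """Keep the `num_beams` best partial sequences by cumulative log-probability."""
    beams = [([start], 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0]


# With a single beam the procedure reduces to greedy search.
greedy_tokens, greedy_score = beam_search(TOY_MODEL, "<s>", num_beams=1)
best_tokens, best_score = beam_search(TOY_MODEL, "<s>", num_beams=2)
print(greedy_tokens, math.exp(greedy_score))
print(best_tokens, math.exp(best_score))
```

Greedy search commits to `A` (probability 0.6) and ends with a sequence probability of about 0.24, while two beams keep `B` alive long enough to find the sequence with probability of about 0.36.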

Greedy search is often used when a fast, deterministic output is needed, but in cases where quality and diversity of text are more important, other methods are generally preferred.

In [24]:
# hide_output
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device) # tokenize the prompt so that it can be passed to the transformer
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad(): # inference, no need to automatically calculate gradient
    for _ in range(n_steps):
        iteration = dict() # creates empty dictionary
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :] # ':' selects all elements along this dimension
        print(output.logits.size())
        next_token_probs = torch.softmax(next_token_logits, dim=-1) # applies softmax to next_token_logits

        # Sort tokens in descending order of probability
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1) # the extended sequence becomes the input for the next iteration
        print(iteration)
        iterations.append(iteration)

pd.DataFrame(iterations)

torch.Size([1, 4, 50257])
{'Input': 'Transformers are the', 'Choice 1': ' most (8.53%)', 'Choice 2': '
only (4.96%)', 'Choice 3': ' best (4.65%)', 'Choice 4': ' Transformers (4.37%)',
'Choice 5': ' ultimate (2.16%)'}
torch.Size([1, 5, 50257])
{'Input': 'Transformers are the most', 'Choice 1': ' popular (16.78%)', 'Choice
2': ' powerful (5.37%)', 'Choice 3': ' common (4.96%)', 'Choice 4': ' famous
(3.72%)', 'Choice 5': ' successful (3.20%)'}
torch.Size([1, 6, 50257])
{'Input': 'Transformers are the most popular', 'Choice 1': ' toy (10.63%)',
'Choice 2': ' toys (7.23%)', 'Choice 3': ' Transformers (6.60%)', 'Choice 4': '
of (5.46%)', 'Choice 5': ' and (3.76%)'}
torch.Size([1, 7, 50257])
{'Input': 'Transformers are the most popular toy', 'Choice 1': ' line (34.38%)',
'Choice 2': ' in (18.20%)', 'Choice 3': ' of (11.71%)', 'Choice 4': ' brand
(6.10%)', 'Choice 5': 'line (2.69%)'}
torch.Size([1, 8, 50257])
{'Input': 'Transformers are the most popular toy line', 'Choice 1': ' in
(46.29%)', '

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.29%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.27%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |


The variable `output` in this context contains the predictions from the transformer model at each step of the text generation. These predictions are generally in the form of logits, which are the raw, unnormalized outputs of the last layer of the neural network.

### What "output" Contains

1. **Logits**: Each element of the logits represents the raw score for each possible token in the model's vocabulary. The higher the score, the higher the probability of the token being the appropriate next word in the sequence, after applying a softmax function.

2. **Shape of the Output**: The shape of `output.logits` is typically `[batch_size, sequence_length, vocab_size]`.
   - **`batch_size`**: Number of sequences processed together. The code above processes one input sequence at a time, so `batch_size` is 1.
   - **`sequence_length`**: Length of the input token sequence being processed. This grows by one with each iteration, because a new token is appended to `input_ids` after each step.
   - **`vocab_size`**: The total number of tokens in the model's vocabulary. This determines the last dimension's size, representing the score for each possible token.
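
As a small illustration of how raw scores become probabilities, the sketch below applies softmax to a made-up logit vector using only the standard library. The helper name `softmax` and the logit values are assumptions for illustration; the notebook itself uses `torch.softmax`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                      # made-up scores for a 3-token vocabulary
probs = softmax(logits)
best_index = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print("most probable token index:", best_index)
```

Softmax preserves the ordering of the scores, so the token with the largest logit is also the token with the largest probability.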

### Selection `[0, -1, :]`

Here’s why each component of this slicing is used:

- **`[0]`**: Since the `batch_size` is 1, this index selects the output for the first and only sequence in the batch. Using a batch size of 1 is common in generation tasks where sequences are generated one at a time.

- **`-1`**: This selects the output at the last position of the current sequence. In a causal language model, the logits at a given position score the token that comes next, so the logits at the last position are the ones that predict the next token to generate.

- **`:`**: This selects all elements across the last dimension, which correspond to the logits of each token in the vocabulary.

### Practical Implication

Indexing with `[0, -1, :]` keeps only the part of the output needed for generation: the logits at the last position, which score every vocabulary token as a candidate for the next token. The forward pass in this loop still computes logits for every position in the sequence; the slice simply discards the earlier ones, whose predictions concern tokens that are already fixed. Each iteration then extends the sequence by one token and repeats the process.

In [25]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False) # do_sample=False is greedy decoding. The most probable next token is always chosen.
                                                                            # do_sample=True will be explained subsequently
print(tokenizer.decode(output[0]))

Transformers are the most popular toy line in the world,


In [26]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)
print(tokenizer.decode(output_greedy[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The researchers, from the University of California, Davis, and the University of
Colorado, Boulder, were conducting a study on the Andean cloud forest, which is
home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to
communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able


## Beam Search Decoding

In [27]:
import numpy as np
import torch.nn.functional as F

# Converts the raw output scores (logits) from a model into log probabilities
# for the specific tokens identified by `labels`.

def log_probs_from_logits(logits, labels): # logits: raw output scores from the model, before softmax
                                           # labels: indices of the tokens whose log probabilities we want
    logp = F.log_softmax(logits, dim=-1)  # softmax normalises logits to probabilities; the log then turns them into log probabilities
    # At each position, pick the log probability of the labelled token
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

This section looks more closely at the operation performed by `torch.gather` and how it is used to extract log probabilities for specified tokens in a sequence, with a detailed example to illustrate the process.

### Understanding `torch.gather`

`torch.gather` is a PyTorch function used to gather values from a tensor along a specified dimension based on index values provided in another tensor. Here’s the general usage:

```python
torch.gather(input, dim, index)
```
- **input**: The source tensor from which to gather values.
- **dim**: The dimension along which to index.
- **index**: The tensor containing the indices of elements to gather.

### The Specific Case: `logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)`

This line of code is involved in selecting specific log probabilities from a batch of sequences. Let's break it down:

1. **`logp = F.log_softmax(logits, dim=-1)`**:
   - This computes the logarithm of softmax probabilities along the last dimension (dim=-1) of the logits tensor. Assume `logits` has a shape of `[batch_size, sequence_length, vocab_size]`, then `logp` will have the same shape.

2. **`labels.unsqueeze(2)`**:
   - `labels` typically has a shape of `[batch_size, sequence_length]`, where each entry is the index of the true or next token in the sequence.
   - `unsqueeze(2)` adds a third dimension, changing its shape to `[batch_size, sequence_length, 1]`. This is necessary to make it compatible for gathering along the third dimension (vocab_size) of `logp`.

3. **`torch.gather(logp, 2, labels.unsqueeze(2))`**:
   - This gathers values from `logp` based on indices specified in `labels`. Since `labels` now has an extra dimension, it can directly index into the vocab_size dimension of `logp`.
   - The resulting tensor has the same shape as `labels.unsqueeze(2)`, which is `[batch_size, sequence_length, 1]`.

4. **`.squeeze(-1)`**:
   - This removes the last dimension (now redundant because it's of size 1), resulting in a shape of `[batch_size, sequence_length]`. Each element in this tensor is the log probability of the respective token in `labels`.

### Example

Assume:
- `logits` tensor representing logits for a batch of 1 (batch_size=1), a sequence of 3 tokens (sequence_length=3), and a vocabulary size of 5 (vocab_size=5).
- Each token can be any of the 5 vocabulary items.

Python code example:
```python
import torch
import torch.nn.functional as F

# Example logits tensor (batch_size=1, sequence_length=3, vocab_size=5)
logits = torch.tensor([[[1.0, 2.0, 3.0, 4.0, 5.0],
                        [1.5, 2.5, 3.5, 4.5, 5.5],
                        [2.0, 3.0, 4.0, 5.0, 6.0]]])

# Convert logits to log probabilities
logp = F.log_softmax(logits, dim=-1)

# Example labels (indices of actual tokens in the sequence)
labels = torch.tensor([[0, 2, 4]])  # Corresponds to token indices 0, 2, and 4 for each step

# Gather log probabilities for each label
logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)

print("Log probabilities for selected labels:", logp_label)
```

Output explanation:
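- This script calculates the log probabilities for specific tokens in each position of the sequence according to `labels`. For these inputs the printed tensor is approximately `[[-4.4519, -2.4519, -0.4519]]`. Each row of logits is the previous row shifted by a constant, and log-softmax is invariant to such shifts, so all three rows yield the same log probabilities and only the selected index differs.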

This approach is powerful for tasks like computing loss during training, where you need to reference the probability assigned by the model to the actual token that appears at each sequence position.

To summarize, `torch.gather` gathers elements from a tensor along a specified dimension, based on indices provided in another tensor. Here's a more detailed explanation of how it operates:

### Basic Functionality of `torch.gather`
The function `torch.gather` is used to create a new tensor by selecting specific elements from the input tensor. The selection is governed by indices specified in an index tensor, and it operates along a specified dimension.

### Parameters
- **input (Tensor)**: The source tensor from which elements will be gathered.
- **dim (int)**: The dimension along which to index. This dimension will be accessed in the input tensor to select elements.
- **index (LongTensor)**: The indices of elements to gather. This tensor must have the same number of dimensions as the input tensor, and each value must be a valid index along the specified dimension of the input tensor. In every other dimension, its size must not exceed that of the input.

### How It Works
1. **Dimension Selection**: The function looks at the specified dimension (`dim`) in the input tensor.
2. **Indexing**: For each index value in the `index` tensor, `torch.gather` picks the corresponding element from the `input` tensor along the chosen dimension.
3. **Tensor Construction**: The output tensor is constructed using the gathered elements and retains the shape of the `index` tensor.

### Example to Illustrate

Let's visualize this with a simple example. Assume you have a 2D tensor and you want to select elements from each row:

```plaintext
Tensor A:
[
 [a, b, c],
 [d, e, f],
 [g, h, i]
]
```

If you want to select an element from each row using specific column indices, you might specify an index tensor like this:

```plaintext
Index Tensor:
[
 [0],
 [2],
 [1]
]
```

Using `torch.gather` with `dim=1` (selecting along columns within each row), the output will be:

```plaintext
Output Tensor:
[
 [a],
 [f],
 [h]
]
```

Here's how the selection is made:
- From the first row `[a, b, c]`, it selects `a` (column index 0).
- From the second row `[d, e, f]`, it selects `f` (column index 2).
- From the third row `[g, h, i]`, it selects `h` (column index 1).
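
The same selection can be reproduced with nested lists. This is a plain-Python analogue of `torch.gather(input, 1, index)` for the 2D case only; the helper name `gather_dim1` is hypothetical and is not part of PyTorch.

```python
def gather_dim1(rows, index):
    """For 2D nested lists, compute out[i][j] = rows[i][index[i][j]]."""
    return [[row[j] for j in idx_row] for row, idx_row in zip(rows, index)]


tensor_a = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
index = [[0], [2], [1]]

output = gather_dim1(tensor_a, index)
print(output)  # [['a'], ['f'], ['h']]
```

The output has the same shape as `index`, which mirrors the behavior described for `torch.gather`.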

### Practical Uses
This functionality is particularly useful in machine learning tasks where you need to extract specific predictions or outputs corresponding to certain indices. For example, in classification tasks, if you have the logits for multiple classes and you know the actual classes (as indices), you can apply log-softmax to the logits and then use `torch.gather` to pick out the log probabilities of the actual classes, from which the loss is computed.

`torch.gather` is a versatile tool in tensor manipulation, allowing for complex operations that require selective indexing from higher-dimensional data based on dynamically generated indices.

In [28]:
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Align the logits at each position with the token that follows it
        log_probs = log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])
        # Summing log probabilities is equivalent to multiplying probabilities.
        # log_probs holds the log probability of each token in `labels`; summing from
        # `input_len` onward scores the generated continuation without the prompt.
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
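
One reason to sum log probabilities rather than multiply probabilities is numerical: the product of many small numbers underflows in floating point, while the sum of their logarithms stays well within range. The sketch below illustrates this with made-up values, using only the standard library.

```python
import math

# 1100 hypothetical per-token probabilities of 0.5 each
token_probs = [0.5] * 1100

product = 1.0
for p in token_probs:
    product *= p          # eventually underflows to exactly 0.0

log_sum = sum(math.log(p) for p in token_probs)

print("product of probabilities:", product)
print("sum of log probabilities:", log_sum)
```

The product is no longer usable for comparing sequences, whereas the log-probability sum still is.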

In [29]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))    # output_greedy here is previously defined from generate()
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}") # log probability value calculated for the beam generated by greedy decoder

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The researchers, from the University of California, Davis, and the University of
Colorado, Boulder, were conducting a study on the Andean cloud forest, which is
home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to
communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

log-prob: -87.43


The call `log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])` in `sequence_logprob` slices both tensors so that the logits at each position line up with the token that actually follows it. The breakdown below explains what each slice selects, followed by a small example.

### Understanding the Slicing and Indexing

1. **`output.logits[:, :-1, :]`**:
   - **`output.logits`** has a shape of `[batch_size, sequence_length, vocab_size]`. The logits at position `t` are the model's scores for the token at position `t + 1`.
   - **`[:, :-1, :]`** keeps every position except the **last** one. The logits at the last position predict a token beyond the end of the sequence, so there is no label to compare them with.

2. **`labels[:, 1:]`**:
   - **`labels`** is a tensor with dimensions `[batch_size, sequence_length]`, containing the token IDs of the sequence. Here it is the same tensor that is fed to the model.
   - **`[:, 1:]`** drops the first token of each sequence. Nothing precedes it, so the model never predicts it. After both slices, entry `t` of the logits lines up with entry `t` of the labels, which is the token that follows position `t`.

### Example to Illustrate

Let's consider a simple example with a batch size of 1 for simplicity. Suppose we have a vocabulary with five tokens (0 to 4), and a model predicts logits for a sequence of three tokens.

**Logits Tensor (`output.logits`)**:
```plaintext
[
  [[-0.1, -1.5,  0.3,  2.0, -0.5],  # Position 1: scores for the token after token 1
   [ 0.2,  0.0, -0.2, -1.2,  1.8],  # Position 2: scores for the token after token 2
   [ 1.0, -1.0,  0.5,  0.2, -0.4]]  # Position 3: scores for the token after token 3
]
```

**Labels Tensor (`labels`)**:
```plaintext
[
  [0, 2, 4]  # Actual correct labels for the sequence
]
```

#### Operation

- **Selecting Logits**: `output.logits[:, :-1, :]` keeps the logits at the first two positions, `[-0.1, -1.5, 0.3, 2.0, -0.5]` and `[0.2, 0.0, -0.2, -1.2, 1.8]`, and drops the last row.

- **Adjusting Labels**: `labels[:, 1:]` results in `[2, 4]`. This skips the first token and keeps the tokens that follow.

To fetch the log probabilities for these labels (token indices `2` and `4`), we would:

1. Apply softmax to each remaining row of logits to convert it into probabilities.
2. Take the logarithm of these probabilities to obtain log probabilities.
3. Use `torch.gather` to select the log probability at index `2` from the first row and at index `4` from the second row.

### Practical Use

This process is critical in tasks like calculating the loss during training, where you need the model's prediction probabilities for the actual correct tokens to compute something like cross-entropy loss. It efficiently aligns model outputs (logits) with the targets (labels), focusing only on the relevant parts of the output for loss computation.
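
The shift used in `sequence_logprob` (`output.logits[:, :-1, :]` against `labels[:, 1:]`) can be traced by hand on the small example above. The sketch below is a plain-Python illustration for a single sequence; the helper `log_softmax` is written out here for clarity and stands in for `F.log_softmax`.

```python
import math


def log_softmax(logits):
    """Log of the softmax, computed stably via the log-sum-exp trick."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_total for x in logits]


# One sequence of 3 tokens over a 5-token vocabulary (values from the example above)
logits = [
    [-0.1, -1.5, 0.3, 2.0, -0.5],
    [0.2, 0.0, -0.2, -1.2, 1.8],
    [1.0, -1.0, 0.5, 0.2, -0.4],
]
labels = [0, 2, 4]

shifted_logits = logits[:-1]   # drop the last position: nothing follows it
shifted_labels = labels[1:]    # drop the first token: nothing predicts it

# Log probability the model assigns to each token that actually came next
token_log_probs = [
    log_softmax(row)[label]
    for row, label in zip(shifted_logits, shifted_labels)
]
seq_log_prob = sum(token_log_probs)

print(shifted_labels)
print(token_log_probs)
print(seq_log_prob)
```

Two log probabilities come out for a three-token sequence, one for each token that has a preceding context.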

In [30]:
# Now compare with a sequence generated by beam search.
# To activate beam search, specify the `num_beams` parameter.
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}") # the log-prob from beam search is much higher than from greedy decoding

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The discovery of the unicorns was made by a team of scientists from the
University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they
discovered a herd of unicorns living in a remote, previously unexplored valley,
in the Andes Mountains. Even more surprising to the researchers was the fact
that the unicorns spoke perfect English

log-prob: -55.23


In [31]:
# However, this beam search still suffers from text repetitiveness. We now address this by imposing an n-gram penalty, with
# no_repeat_ngram_size parameter, that tracks which n-grams have been seen and sets the next token probability to 0 if it would produce a previously
# seen n-gram.
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a
remote, previously unexplored valley, in the Andes Mountains. Even more
surprising to the researchers was the fact that the unicorns spoke perfect
English.


The discovery was made by a team of scientists from the University of
California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the
area when they came across the herd. They were surprised to find that they were
able to converse with the animals in English, even though they had never seen a
unicorn in person before. The researchers were

log-prob: -93.12
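
The n-gram penalty used above can be sketched in a few lines. The helper below is a hypothetical illustration of the idea, not the library's implementation: given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared, which a decoder would then exclude from the next step.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned


generated = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(generated, ngram_size=2))  # {'cat'}
```

With `ngram_size=2`, the bigram "the cat" has already occurred, so "cat" is banned after the second "the". This blocks repetition, and it may also explain why the log-probability drops compared with unconstrained beam search: the decoder is sometimes forced away from its most likely continuation.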
