# Decoding Strategies
This notebook aims to better understanding decoding strategies by importing a transformer model to determine a probability distribution of possible tokens to complete or continue the sentence. Each decoding strategy will choose a token from the probability distribution of possible tokens in different ways, leading to different completions of the same initial text.

### Imports

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed
import torch
import torch.nn.functional as F
import numpy as np
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

### Print Tokenizer Dictionary
Print 20 different tokens that are defined in the dictionary of the tokenizer

In [2]:
np.random.seed(1)
print("Number of tokens in dictionary = %d"%(tokenizer.vocab_size))
for i in range(20):
  index = np.random.randint(tokenizer.vocab_size)
  print("Token: %d "%(index)+tokenizer.decode(torch.tensor(index), skip_special_tokens=True))

Number of tokens in dictionary = 50257
Token: 33003  Mormons
Token: 12172  cam
Token: 5192  trig
Token: 32511 ojure
Token: 50057  gist
Token: 43723  Petition
Token: 7813  sin
Token: 21440  Witness
Token: 32912  Remy
Token: 20609 isure
Token: 49100  creeps
Token: 7751  fasc
Token: 43757  Alc
Token: 31228  messenger
Token: 36230  SYSTEM
Token: 32025  precipitation
Token: 21758  cores
Token: 45413  Forestry
Token: 35730  guru
Token: 8444  Disc


### Define Sampling Function
Define sampling, which randomly chooses a token from the model's probability distribution

In [3]:
def sample_next_token(input_tokens, model, tokenizer):
  # Run the transformer model to get the prediction over the next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])
  # Compute the probabilities of the prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]
  # Choose random token according to the probabilities
  next_token = [np.random.choice(tokenizer.vocab_size, p = prob_over_tokens)]

  # Append chosen token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]

  return output_tokens

### Define Input Text
Define an input text for the tokenizer

In [4]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokens
Convert the input text to tokens using the tokenizer

In [5]:
input_tokens = tokenizer(input_txt, return_tensors='pt')

### Complete Sentence with Sampling
Run the model using sample_next_token to observe how the sentence is completed

In [6]:
for i in range(10):
    input_tokens = sample_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is that
The best thing about Bath is that they
The best thing about Bath is that they don
The best thing about Bath is that they don't
The best thing about Bath is that they don't even
The best thing about Bath is that they don't even change
The best thing about Bath is that they don't even change or
The best thing about Bath is that they don't even change or shrink
The best thing about Bath is that they don't even change or shrink anymore
The best thing about Bath is that they don't even change or shrink anymore.


### Define Greedy Token Selection
Define greedy token selection, which chooses the token with the highest probability from the model's probability distribution

In [7]:
def get_best_next_token(input_tokens, model, tokenizer):
  # Run the transformer model to get the prediction over the next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])
  # Compute the probabilities of the prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]
  # Compute the token index with the maximum probability
  next_token = [np.argmax(prob_over_tokens)]

  # Append chosen token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  return output_tokens

### Define Input Text
Define input text for the tokenizer

In [8]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokens
Convert the input text into tokens using the tokenizer

In [9]:
input_tokens = tokenizer(input_txt, return_tensors='pt')

### Complete Sentence with Greedy Token Selection
Run the model using get_best_next_token to observe how the sentence is completed

In [10]:
for i in range(10):
    input_tokens = get_best_next_token(input_tokens, model, tokenizer)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is that
The best thing about Bath is that it
The best thing about Bath is that it's
The best thing about Bath is that it's a
The best thing about Bath is that it's a place
The best thing about Bath is that it's a place where
The best thing about Bath is that it's a place where you
The best thing about Bath is that it's a place where you can
The best thing about Bath is that it's a place where you can go
The best thing about Bath is that it's a place where you can go to


### Define Top-K sampling
Define top-k sampling, which randomly chooses the top K most probable tokens from the model's probability distribution as the next token

In [11]:
def get_top_k_token(input_tokens, model, tokenizer, k=20):
  # Run the transformer model to get the prediction over the next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])
  # Compute the probabilities of the prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]

  # Sort the probabilities from largest to smallest
  sorted_prob_over_tokens =  np.sort(prob_over_tokens)[::-1]

  # Find the probability at the k'th position
  kth_prob_value = sorted_prob_over_tokens[k]

  # Set all probabilities below the k'th value to zero
  prob_over_tokens[prob_over_tokens<kth_prob_value] = 0

  # Renormalize the non-zero probabilities so that they sum to one
  prob_over_tokens = prob_over_tokens/np.sum(prob_over_tokens)

  # Draw random token
  next_token = np.random.choice(len(prob_over_tokens), 1, replace=False, p=prob_over_tokens)

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  return output_tokens

### Define Input Text

In [12]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokens

In [13]:
input_tokens = tokenizer(input_txt, return_tensors='pt')

### Complete Sentence using Top-K Sampling
Run the model using get_top_k_token to observe how the sentence is completed

In [14]:
for i in range(10):
    input_tokens = get_top_k_token(input_tokens, model, tokenizer, k=10)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)


The best thing about Bath is that
The best thing about Bath is that you
The best thing about Bath is that you get
The best thing about Bath is that you get to
The best thing about Bath is that you get to see
The best thing about Bath is that you get to see all
The best thing about Bath is that you get to see all the
The best thing about Bath is that you get to see all the beautiful
The best thing about Bath is that you get to see all the beautiful faces
The best thing about Bath is that you get to see all the beautiful faces of


### Define Nucleus Sampling
Define nucleus sampling, which randomly chooses a token from a list of sorted tokens whose cumulative sum doesn't exceed a threshold

In [15]:
def get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh=0.25):
  # Run the transformer model to get the prediction over the next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])
  # Compute the probabilities of the prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]

  # Sort the probabilities in decreasing order
  sorted_probs_decreasing = np.sort(prob_over_tokens)[::-1]

  # Compute the cumulative sum of these probabilities
  cum_sum_probs = np.cumsum(sorted_probs_decreasing)

  # Find index where that the cumulative sum is greater than the threshold
  thresh_index = np.argmax(cum_sum_probs>thresh)
  print("Choosing from %d tokens"%(thresh_index))

  # Compute the probability value at the tresh_index
  thresh_prob = sorted_probs_decreasing[thresh_index]

  # Set any probabilities below the tresh_prob to zero
  prob_over_tokens[prob_over_tokens<thresh_prob] = 0

  # Renormalize the probabilities to sum to 1
  prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)

  # Draw a random token
  next_token = np.random.choice(len(prob_over_tokens), 1, replace=False, p=prob_over_tokens)

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  return output_tokens

### Define Input Text

In [16]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokenizer

In [17]:
input_tokens = tokenizer(input_txt, return_tensors='pt')

### Complete Sentence using Nucleus Sampling
Run the model using get_nucleus_sampling_token to observe how the sentence is completed

In [18]:
for i in range(10):
    input_tokens = get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh = 0.2)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

Choosing from 0 tokens
The best thing about Bath is that
Choosing from 0 tokens
The best thing about Bath is that it
Choosing from 0 tokens
The best thing about Bath is that it's
Choosing from 2 tokens
The best thing about Bath is that it's not
Choosing from 1 tokens
The best thing about Bath is that it's not a
Choosing from 25 tokens
The best thing about Bath is that it's not a city
Choosing from 2 tokens
The best thing about Bath is that it's not a city that
Choosing from 1 tokens
The best thing about Bath is that it's not a city that has
Choosing from 1 tokens
The best thing about Bath is that it's not a city that has been
Choosing from 11 tokens
The best thing about Bath is that it's not a city that has been around


### Define K Sampling
Define K sampling, which returns the k'th most likely next token from the model's probability distribution

In [19]:
def get_kth_most_likely_token(input_tokens, model, tokenizer, k):
  # Run the transformer model to get the prediction over the next output
  outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])
  # Compute the probabilities of the prediction
  prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]

  # Sort the probabilities from largest to smallest
  sorted_prob_over_tokens = np.sort(prob_over_tokens)[::-1]

  # Find the k'th sorted probability
  kth_prob_value = sorted_prob_over_tokens[k]

  # Locate the position of the token with the k'th probability
  next_token = np.where(prob_over_tokens == kth_prob_value)[0]

  # Append token to sentence
  output_tokens = input_tokens
  output_tokens["input_ids"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)
  output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)
  output_tokens['last_token_prob'] = prob_over_tokens[next_token]
  output_tokens['log_prob'] = output_tokens['log_prob'] + np.log(prob_over_tokens[next_token])
  return output_tokens

### Define Input Text

In [20]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokens

In [21]:
input_tokens = tokenizer(input_txt, return_tensors='pt')
input_tokens['log_prob'] = 0.0

### Complete Sentence with K Sampling (k = 1)
Run the model using get_kth_most_likely_token, where k = 1, to observe how the sentence is completed

In [22]:
for i in range(10):
    input_tokens = get_kth_most_likely_token(input_tokens, model, tokenizer, k=1)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is the
The best thing about Bath is the way
The best thing about Bath is the way you
The best thing about Bath is the way you get
The best thing about Bath is the way you get the
The best thing about Bath is the way you get the most
The best thing about Bath is the way you get the most bang
The best thing about Bath is the way you get the most bang out
The best thing about Bath is the way you get the most bang outta
The best thing about Bath is the way you get the most bang outta the


### Define Input Text

In [23]:
input_txt = "The best thing about Bath is"

### Compute Input Tokens

In [24]:
input_tokens = tokenizer(input_txt, return_tensors='pt')
input_tokens['log_prob'] = 0.0

### Complete Sentence with K Sampling (k = 2000)
Run the model using get_kth_most_likely_token, where k = 2000, to observe how the sentence is completed

In [25]:
for i in range(10):
    input_tokens = get_kth_most_likely_token(input_tokens, model, tokenizer, k=2000)
    print(tokenizer.decode(input_tokens["input_ids"][0], skip_special_tokens=True))

The best thing about Bath is mixed
The best thing about Bath is mixed profits
The best thing about Bath is mixed profits partnerships
The best thing about Bath is mixed profits partnerships»
The best thing about Bath is mixed profits partnerships» buy
The best thing about Bath is mixed profits partnerships» buy generic
The best thing about Bath is mixed profits partnerships» buy generic+
The best thing about Bath is mixed profits partnerships» buy generic+drive
The best thing about Bath is mixed profits partnerships» buy generic+drive outlets
The best thing about Bath is mixed profits partnerships» buy generic+drive outlets concentrate


### Define Print Beams Function
Define a function that prints each beam along with its log probability

In [26]:
def print_beams(beams):
  for index,beam in enumerate(beams):
    print("Beam %d, Prob %3.3f: "%(index,beam['log_prob'])+tokenizer.decode(beam["input_ids"][0], skip_special_tokens=True))
  print('---')

### Define Beam Search
Define beam search, which computes n_beams most likely tokens as its initial beams, makes possible continuation of these beams, and keeps only the top n_beams beams

In [27]:
def do_beam_search(input_tokens_in, model, tokenizer, n_beam=5, beam_length=10):
  # Store beams in a list
  input_tokens['log_prob'] = 0.0

  # Initialize the n_beams most likely tokens as the initial beam
  beams = [None] * n_beam
  for c_k in range(n_beam):
    beams[c_k] = dict(input_tokens_in)
    beams[c_k] = get_kth_most_likely_token(beams[c_k], model, tokenizer, c_k)

  print_beams(beams)

  # For each token in the sequence we will add
  for c_pos in range(beam_length-1):
    # For each computed beam, initialize the n_beams most likely tokens as possible continuations of the beam
    beams_all = [None] * (n_beam*n_beam)
    log_probs_all = np.zeros(n_beam*n_beam)
    # For each current hypothesis
    for c_beam in range(n_beam):
      # For each continuation
      for c_k in range(n_beam):
        # Store the continuation and the probability
        beams_all[c_beam * n_beam + c_k] = dict(get_kth_most_likely_token(beams[c_beam], model, tokenizer, c_k))
        log_probs_all[c_beam * n_beam + c_k] = beams_all[c_beam * n_beam + c_k]['log_prob']

    # Keep the best n_beams sequences with the highest probabilities
    sorted_index = np.argsort(np.array(log_probs_all)*-1)
    for c_k in range(n_beam):
      beams[c_k] = dict(beams_all[sorted_index[c_k]])

    # Print the beams
    print_beams(beams)

  return beams[0]

### Define Input Text

In [28]:
set_seed(0)
input_txt = "The best thing about Bath is"

### Compute Input Tokens

In [29]:
input_tokens = tokenizer(input_txt, return_tensors='pt')

### Complete Sentence with Beam Search
Run the model using do_beam_search to observe how the sentence is completed

In [30]:
n_beams = 5
best_beam = do_beam_search(input_tokens,model,tokenizer)
print("Beam search result:")
print(tokenizer.decode(best_beam["input_ids"][0], skip_special_tokens=True))

  print("Beam %d, Prob %3.3f: "%(index,beam['log_prob'])+tokenizer.decode(beam["input_ids"][0], skip_special_tokens=True))
  log_probs_all[c_beam * n_beam + c_k] = beams_all[c_beam * n_beam + c_k]['log_prob']


Beam 0, Prob -0.727: The best thing about Bath is that
Beam 1, Prob -2.161: The best thing about Bath is the
Beam 2, Prob -3.177: The best thing about Bath is it
Beam 3, Prob -3.468: The best thing about Bath is how
Beam 4, Prob -3.536: The best thing about Bath is you
---
Beam 0, Prob -1.899: The best thing about Bath is that it
Beam 1, Prob -3.931: The best thing about Bath is it's
Beam 2, Prob -4.079: The best thing about Bath is that it is
Beam 3, Prob -4.433: The best thing about Bath is the fact
Beam 4, Prob -4.553: The best thing about Bath is you can
---
Beam 0, Prob -2.740: The best thing about Bath is that it's
Beam 1, Prob -4.657: The best thing about Bath is the fact that
Beam 2, Prob -5.331: The best thing about Bath is that it's not
Beam 3, Prob -6.227: The best thing about Bath is it's a
Beam 4, Prob -6.264: The best thing about Bath is that it is a
---
Beam 0, Prob -4.938: The best thing about Bath is that it's a
Beam 1, Prob -6.012: The best thing about Bath is the fac