In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import numpy as np
import torch.nn.functional as F

# Text Generation
We will implement the different decoding strategies.

# Greedy Search Decoding
We load the 1.5-billion-parameter version of GPT-2.

In [2]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
device="cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

**Hugging Face Transformers provides a** `generate()` **function for autoregressive models such as GPT-2**, but in this notebook we will first implement greedy decoding by hand.

In [3]:
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 20
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)

,Input,Choice 1,Choice 2,Choice 3,Choice 4,Choice 5
0,Transformers are the,most (8.53%),only (4.96%),best (4.65%),Transformers (4.37%),ultimate (2.16%)
1,Transformers are the most,popular (16.78%),powerful (5.37%),common (4.96%),famous (3.72%),successful (3.20%)
2,Transformers are the most popular,toy (10.63%),toys (7.23%),Transformers (6.60%),of (5.46%),and (3.76%)
3,Transformers are the most popular toy,line (34.38%),in (18.20%),of (11.71%),brand (6.10%),line (2.69%)
4,Transformers are the most popular toy line,in (46.29%),of (15.09%),", (4.94%)",on (4.40%),ever (2.72%)
5,Transformers are the most popular toy line in,the (65.99%),history (12.42%),America (6.91%),Japan (2.44%),North (1.40%)
6,Transformers are the most popular toy line in the,world (69.27%),United (4.55%),history (4.29%),US (4.23%),U (2.30%)
7,Transformers are the most popular toy line in ...,", (39.73%)",. (30.64%),and (9.87%),with (2.32%),today (1.74%)
8,Transformers are the most popular toy line in ...,and (32.51%),with (18.66%),but (8.06%),so (5.42%),having (2.19%)
9,Transformers are the most popular toy line in ...,the (11.70%),they (7.48%),it (6.24%),for (3.06%),I (2.97%)


In [4]:
list(pd.DataFrame(iterations)["Input"])

['Transformers are the',
 'Transformers are the most',
 'Transformers are the most popular',
 'Transformers are the most popular toy',
 'Transformers are the most popular toy line',
 'Transformers are the most popular toy line in',
 'Transformers are the most popular toy line in the',
 'Transformers are the most popular toy line in the world',
 'Transformers are the most popular toy line in the world,',
 'Transformers are the most popular toy line in the world, and',
 'Transformers are the most popular toy line in the world, and the',
 'Transformers are the most popular toy line in the world, and the Transformers',
 'Transformers are the most popular toy line in the world, and the Transformers are',
 'Transformers are the most popular toy line in the world, and the Transformers are the',
 'Transformers are the most popular toy line in the world, and the Transformers are the most',
 'Transformers are the most popular toy line in the world, and the Transformers are the most popular',
 'T

Now let's use the `generate()` function, which lets us generate sequences in a more sophisticated way.

In [5]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world, and the Transformers are the most popular toy line in the world


Let's try a somewhat more elaborate example.

In [6]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length, do_sample=False)

print(tokenizer.decode(output_greedy[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able


From a certain point on, the same sentence starts repeating over and over. This is one of the main drawbacks of the greedy approach: it tends to produce repetitive output.
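The loop can be reproduced without a language model. The sketch below is only an illustration: the `NEXT_TOKEN_PROBS` table and its numbers are invented, and unlike GPT-2 the toy model conditions only on the last token. Because greedy decoding always takes the most probable continuation, it returns to the same cycle every time.

```python
# Hypothetical next-token probabilities for a tiny vocabulary (invented numbers).
NEXT_TOKEN_PROBS = {
    "the": {"researchers": 0.6, "unicorns": 0.4},
    "researchers": {"were": 0.9, "found": 0.1},
    "were": {"surprised": 0.7, "able": 0.3},
    "surprised": {"that": 0.8, ".": 0.2},
    "that": {"the": 0.9, "unicorns": 0.1},
}

def greedy_decode(start, n_steps):
    """Always append the most probable next token."""
    tokens = [start]
    for _ in range(n_steps):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if candidates is None:  # no known continuation for this token
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens

print(" ".join(greedy_decode("the", 10)))
```

Each individual choice is locally the best one, yet the sequence as a whole never leaves the cycle.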

# Beam Search Decoding
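Beam search scores a sequence by the product of its token probabilities. That product quickly becomes too small to represent reliably, so we compare it with the sum of the log probabilities, which stays in a manageable range.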

In [7]:
# Suppose we have a sequence of 1024 tokens, each with a probability of 0.5
0.5 ** 1024

5.562684646268003e-309

In [8]:
sum([np.log(0.5)] * 1024)

-709.7827128933695
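To see how close the product above is to breaking down, the following sketch doubles the sequence length to 2048 tokens (an arbitrary value chosen for illustration). In double precision the product underflows to zero, while the sum of logs remains an ordinary number.

```python
import math

n_tokens = 2048  # hypothetical sequence length, chosen to force underflow
p = 0.5          # same per-token probability as in the cells above

# Multiplying probabilities: the result falls below the smallest
# representable float64 value and collapses to 0.0.
product = math.prod([p] * n_tokens)

# Summing log probabilities: the magnitude grows only linearly with length.
log_sum = math.fsum([math.log(p)] * n_tokens)

print(product)
print(log_sum)
```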

Hugging Face Transformers models return unnormalized logits, so we write the following function to normalize them and extract the log probability of each label token.

In [9]:
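def log_probs_from_logits(logits, labels):
    # Normalize the logits into log probabilities over the vocabulary
    logp = F.log_softmax(logits, dim=-1)
    # Pick the log probability assigned to each label token
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label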

That function returns the log probability of each individual token; the following one sums them to obtain the log probability of the whole sequence.

In [10]:
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Logits at position t score the token at position t + 1
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:]
        )
        # Sum the log probabilities, skipping the first input_len positions
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
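The one-position shift between logits and labels can be checked without a model. The following plain-Python sketch uses invented logits over a three-token vocabulary; `toy_sequence_logprob` is a hypothetical helper that mirrors the slicing used in `sequence_logprob`, not part of any library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def toy_sequence_logprob(logits_per_position, token_ids, input_len=0):
    # Logits at position t score the token at position t + 1, so drop the
    # last logits and the first token before pairing them up.
    pairs = list(zip(logits_per_position[:-1], token_ids[1:]))
    # Same slice as log_probs[:, input_len:] in sequence_logprob.
    return sum(log_softmax(logits)[label] for logits, label in pairs[input_len:])

# Hypothetical logits for a 4-token sequence over a 3-token vocabulary.
logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
tokens = [0, 1, 2, 0]

print(toy_sequence_logprob(logits, tokens))
print(toy_sequence_logprob(logits, tokens, input_len=2))
```

The logits at the last position are never used, because there is no following token for them to score.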

We use these functions to compute the log probability of the sequence produced by the greedy decoder.

In [11]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

log-prob: -87.43


Let's compare this with a sequence generated using beam search. To enable beam search, we only need to set the number of beams with the `num_beams` parameter. More beams generally give a better result, but generation becomes computationally more expensive.
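The idea behind beam search can be sketched on a toy model. The `PROBS` table and its numbers are invented for illustration, and the sketch leaves out end-of-sequence handling and length normalization, so it should not be read as the `generate()` implementation. With one beam it behaves like greedy search; with two beams it keeps a continuation that looks worse at first but leads to a more probable sequence.

```python
import math

# Hypothetical conditional probabilities p(next | last token); invented numbers.
PROBS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}

def beam_search(num_beams, n_steps, start="<s>"):
    # Each beam is a pair (cumulative log probability, token list).
    beams = [(0.0, [start])]
    for _ in range(n_steps):
        candidates = []
        for score, tokens in beams:
            for token, p in PROBS.get(tokens[-1], {}).items():
                candidates.append((score + math.log(p), tokens + [token]))
        if not candidates:
            break
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, n_steps=2)[0]
best = beam_search(num_beams=2, n_steps=2)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))
print(best[1], round(math.exp(best[0]), 2))
```

Greedy search commits to `A` because it is the most probable first token, and can then reach a sequence probability of at most 0.30. Keeping `B` alive as a second beam allows the search to find `B z`, with probability 0.36.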

In [12]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False)

logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))

print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English

log-prob: -55.23


We get a better log probability (higher is better), but the text still contains repeated passages. We can tell the Hugging Face generator to block repeated n-grams with the `no_repeat_ngram_size` parameter, which prevents any n-gram of that size from appearing twice in the output.

In [13]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))

print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the area when they came across the herd. They were surprised to find that they were able to converse with the animals in English, even though they had never seen a unicorn in person before. The researchers were

log-prob: -93.12


The score is lower, but the text remains coherent. This parameter is useful for finding a good trade-off between favoring high-probability tokens and reducing repetition.

# Sampling methods

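Example with temperature $T=2$. Setting `top_k=0` disables top-k filtering, so tokens are sampled from the full vocabulary.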

In [14]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


Heloff released death assumed Indigenous 2003 Belg work Tweet monkey Secondary 457 centuries cmShort partingEv phil KILL Started worker steps Sm stock stalk Showbirds SpikePref suspension symmmopolitan < morphology CPUs Brand storyhow plang Orchestra Wait Professional specialty thirstIm counted hexoma� mappingHost Third latitudeChe biotech longtime Picacles Crit immature influx 141Advertisements want growing bursts 239Leaks To empoweredonna Nit Possible repe;rum Brand 142 downstream


Such a high temperature has produced meaningless text.
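The effect of the temperature is easy to see on a small distribution. Temperature sampling rescales the logits $z_i$ before the softmax, $P(w_i) = \exp(z_i / T) / \sum_j \exp(z_j / T)$. The sketch below applies this formula to four invented logits; `softmax_with_temperature` is a hypothetical helper written for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical logits for four tokens
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

With $T > 1$ the distribution flattens and unlikely tokens are sampled more often, which explains the gibberish above. With $T < 1$ the probability mass concentrates on the most likely tokens.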

Let's try $T=0.5$.

In [15]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


In fact, the scientists found that the unicorns were able to communicate with each other in English and even taught one another a few words. The scientists also discovered that the unicorns were able to use their voices to communicate with each other.

The scientists also discovered that the unicorns are not only able to communicate with each other, but also with humans. This means that the unicorns are able


This text is considerably more coherent.

# Top-k and Nucleus Sampling
For top-k sampling we simply pass the `top_k` argument, which restricts sampling to the $k$ most probable tokens at each step.

In [16]:
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=50)

print(tokenizer.decode(output_topk[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"We discovered that the unicorns had learned English as infants, which is unusual in the wild. We discovered a herd living in a very isolated location in the mountains, where they could not be contacted by humans," said researcher Dr. Roberta Belda-Dobler, from the University of the Andes in Chile.


The scientists observed the adult unicorns during their entire breeding cycle,


This is arguably the most human-sounding text generated so far.
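What top-k filtering does to a single next-token distribution can be sketched in a few lines. The probabilities below are invented, and `top_k_filter` and `sample` are hypothetical helpers for this illustration rather than library functions.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical next-token distribution
filtered = top_k_filter(probs, k=2)

rng = random.Random(0)
samples = [sample(filtered, rng) for _ in range(1000)]
print(filtered)
print(sorted(set(samples)))
```

Only the two most probable tokens can be sampled after filtering, which removes the long tail of unlikely tokens while keeping some randomness.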

To use nucleus (top-p) sampling, we simply pass the `top_p` parameter.

In [17]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.9)

print(tokenizer.decode(output_topp[0]))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"They seemed to have more human qualities and they spoke in an almost perfect English, "says Juan Carlos Mola, researcher at the University of Oaxaca. "In my whole life I have never seen anything like this before in human beings."


According to Mola, it wasn't too long ago that these mythical creatures only existed in legends and myths. "They have been known as


The `top_k` and `top_p` parameters can also be combined in a single `generate()` call.
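As a rough sketch of how the two filters interact, the example below applies top-k first and then top-p to the renormalized result. This order is an assumption made for the illustration, the probabilities are invented, and `top_k_filter` and `top_p_filter` are hypothetical helpers.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical next-token distribution

nucleus = top_p_filter(probs, p=0.8)
combined = top_p_filter(top_k_filter(probs, k=4), p=0.8)
print([round(x, 3) for x in nucleus])
print([round(x, 3) for x in combined])
```

Unlike top-k, the number of tokens kept by top-p adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many. Combining both puts a hard cap on the number of candidates while retaining that adaptive cutoff.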