# Conditional Text Generation with GPT-2

See: https://huggingface.co/gpt2

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Wrap long generated text to 80 columns for display
import textwrap
wrapper = textwrap.TextWrapper(width=80)

In [2]:
MODEL_NAME = 'gpt2-large'

## Tokenizer

Texts are tokenized with a byte-level version of Byte Pair Encoding (BPE), which can represent any Unicode string, using a vocabulary of 50,257 tokens. The model was trained on sequences of 1,024 consecutive tokens, which is also the longest input it accepts.
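
To make the byte-level idea concrete, here is a minimal sketch of a single BPE merge step over UTF-8 bytes. It is a toy illustration and not GPT-2's actual tokenizer: the helper names `most_frequent_pair` and `merge_pair`, the sample string, and the new token id 256 are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

# Any Unicode string becomes a sequence of bytes in range(256),
# so the base vocabulary never needs an "unknown" token.
text = "low lower lowest"
byte_ids = list(text.encode("utf-8"))

# One merge step: the most frequent adjacent pair becomes a new token (id 256 here).
pair = most_frequent_pair(byte_ids)
merged_ids = merge_pair(byte_ids, pair, new_id=256)

# GPT-2's vocabulary size: 256 byte tokens + 50,000 merges + 1 end-of-text token.
vocab_size = 256 + 50_000 + 1
```

Repeating the merge step many times on a large corpus is what builds up the learned part of the vocabulary; frequent words end up as single tokens while rare strings fall back to shorter pieces.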

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)

In [4]:
sample_text = "Hello world!"
tokenizer.encode(sample_text, return_tensors='pt')

tensor([[15496,   995,     0]])

The three ids correspond to the tokens `Hello`, ` world` (including its leading space) and `!`: byte-level BPE attaches the preceding space to the token that follows it.

## Model

In [5]:
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'h.12.attn.masked_bias', 'h.13.attn.masked_bias', 'h.14.attn.masked_bias', 'h.15.attn.masked_bias', 'h.16.attn.masked_bias', 'h.17.attn.masked_bias', 'h.18.attn.masked_bias', 'h.19.attn.masked_bias', 'h.20.attn.masked_bias', 'h.21.attn.masked_bias', 'h.22.attn.masked_bias', 'h.23.attn.masked_bias', 'h.24.attn.masked_bias', 'h.25.attn.masked_bias', 'h.26.attn.masked_bias', 'h.27.attn.masked_bias', 'h.28.attn.masked_bias', 'h.29.attn.masked_bias', 'h.30.attn.masked_bias', 'h.31.attn.masked_bias', 'h.32.attn.masked_bias', 'h.33.attn.masked_bias', 'h.34.attn.masked_bias', 'h.35.attn.mas

This warning can be ignored here: `attn.masked_bias` is a constant buffer used for attention masking, not a trained weight, so nothing learned is missing from the checkpoint.

## Text generation

In [38]:
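def generate_text(initial_text, model, tokenizer, display=False):
    """Sample a continuation of `initial_text`.

    Prints the prompt and the wrapped result if `display` is True,
    otherwise returns the generated text.
    """
    # Encode the prompt as a tensor of token ids with shape (1, prompt_length)
    encoded_input = tokenizer.encode(initial_text, return_tensors='pt')
    outputs = model.generate(
        encoded_input,
        do_sample=True,        # sample instead of decoding greedily
        max_length=100,        # total length in tokens, prompt included
        top_k=20,              # sample only among the 20 most likely tokens
        top_p=1.0,             # 1.0 disables nucleus (top-p) filtering
        temperature=1.0,       # 1.0 leaves the distribution unscaled
        num_return_sequences=1)

    # Decode each returned sequence; the output starts with the prompt itself
    generated_text = [
        tokenizer.decode(token_ids, skip_special_tokens=True)
        for token_ids in outputs
    ]
    generated_text = '\n'.join(generated_text)

    # Display
    if display:
        print('='*21)
        print('='*6, 'INITIAL', '='*6)
        print(initial_text)

        print('='*21)
        print('='*8, 'TEXT', '='*7)
        print(wrapper.fill(generated_text))
    else:
        return generated_text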
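
The `temperature`, `top_k` and `top_p` arguments all reshape the next-token distribution before a token is drawn. The sketch below shows the idea on a toy vocabulary in plain Python. It is a simplified illustration under stated assumptions, not the library's implementation: the function names (`softmax`, `filter_top_k_top_p`, `sample_next_token`) and the logits are made up for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=20, top_p=1.0):
    """Keep the top_k most likely tokens, renormalize, then keep the smallest
    prefix whose cumulative probability reaches top_p and renormalize again."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

def sample_next_token(logits, top_k=20, top_p=1.0, temperature=1.0, rng=random):
    """Draw one token id from the filtered, renormalized distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Toy vocabulary of 5 tokens with made-up logits
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
kept_top2 = filter_top_k_top_p(softmax(toy_logits), top_k=2, top_p=1.0)
kept_nucleus = filter_top_k_top_p(softmax(toy_logits), top_k=5, top_p=0.5)
token_id = sample_next_token(toy_logits, top_k=2)
```

With the settings used in `generate_text` (`top_k=20`, `top_p=1.0`, `temperature=1.0`), only the top-k filter has any effect: each new token is drawn from the 20 most likely candidates in proportion to their probabilities.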

In [39]:
initial_text = "Last week, many people reported that they saw a unicorn at the park."
generate_text(initial_text, model, tokenizer, display=True)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Last week, many people reported that they saw a unicorn at the park.
Last week, many people reported that they saw a unicorn at the park. That,
however, was a misidentification of a unicorn that was actually an antelope. It
was not the first time that people have mistaken a unicorn for a unicorn, but
it's one that people are not quite used to seeing.  But the confusion was caused
by a misunderstanding about what the unicorn means.  A unicorn is a mythical
creature who lives in the forests of the eastern Himalayas. It


In [40]:
initial_text = "Q: What do you think about the death penalty? A: "
generate_text(initial_text, model, tokenizer, display=True)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Q: What do you think about the death penalty? A: 
Q: What do you think about the death penalty? A:  I am not an advocate for the
death penalty. I am not someone who believes that this life should be taken and
then taken over again, which of course is the case with every case when we see a
murder. It is my belief that a fair penalty should be used, and that if the
evidence was not sufficient to prove that the person committed the actual crime,
then there should be a life sentence. Q: Do you


In [41]:
initial_text = "In the last months, a new virus spread worldwide and changed our daily lifes. This new virus, called"
generate_text(initial_text, model, tokenizer, display=True)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


In the last months, a new virus spread worldwide and changed our daily lifes. This new virus, called
In the last months, a new virus spread worldwide and changed our daily lifes.
This new virus, called Ebola, appeared in three countries, and has killed more
than 3,000 people in the four countries: Guinea, Liberia and Sierra Leone. The
virus is spread through direct contact with the fluids of infected persons,
which can include body fluids, such as blood, sweat and tears…  It is also
spread through contaminated food and water. When Ebola arrives in a country, a
lot of people


In [45]:
initial_text = "The RMS Titanic was visited by divers for the first time in 14 years."
generate_text(initial_text, model, tokenizer, display=True)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


The RMS Titanic was visited by divers for the first time in 14 years.
The RMS Titanic was visited by divers for the first time in 14 years. The
British vessel was damaged to the bow in 1912 and the ship was found to be
completely submerged.  The Titanic is also the only ship that has sunk in
international waters.  But, according to experts, the damage to the Titanic
would have been even worse, if the ship had not been in the water. In fact,
according to reports, the damage could have been caused by a catastrophic
failure of the
