<a href="http://colab.research.google.com/github/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project08%20-%20Text%20Generation%20with%20Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with Transformers

It turns out we don’t need an entire Transformer to get transfer learning and a fine-tunable language model for NLP tasks; the decoder alone is enough. The decoder is a natural fit for language modeling (predicting the next word) because it is built to mask future tokens, a valuable property when generating text word by word.
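To make the idea of masking future tokens concrete, here is a minimal pure-Python sketch (not GPT-2's actual implementation). The function name `causal_attention_weights` and the toy scores are assumptions made for illustration; a real decoder applies the same masking to learned query-key scores inside each attention head.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over `scores`, with future positions masked out.

    `scores[i][j]` is a raw attention score from query position i to key
    position j. Entries with j > i are set to -inf before the softmax, so
    position i can only attend to itself and to earlier positions.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores for a 4-token sequence (all equal, so the mask is easy to see).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is lower-triangular: the first token attends only to itself, while the last token spreads its attention over all four positions.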

Here we will use the GPT-2 Model to generate text based on an input sequence of text.

![](https://i.imgur.com/z4k1IzU.png)

# Install Dependencies

In [1]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/40/b5/2d78e74001af0152ee61d5ad4e290aec9a1e43925b21df2dc74ec100f1ab/pytorch_transformers-1.0.0-py3-none-any.whl (137kB)

# Load GPT2 Model

In [0]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:00<00:00, 2064188.98B/s]
100%|██████████| 456318/456318 [00:00<00:00, 1331602.62B/s]


# Next Word Generation with GPT-2

GPT-2 is the successor to GPT, OpenAI's original generative pre-trained Transformer language model. The full GPT-2 model has 1.5 billion parameters, more than 10 times as many as GPT. At its release, GPT-2 achieved state-of-the-art results on several language modeling benchmarks. The `gpt2` checkpoint loaded in this notebook is the smallest released variant.

The model was pre-trained on text from 8 million web pages collected by following outbound links from Reddit.

![](https://i.imgur.com/TbnGbjX.png)

The architecture of GPT-2 is based on the Transformer, proposed by Google researchers in the paper “Attention Is All You Need”. The original Transformer uses an encoder-decoder structure with attention to model dependencies between input and output; GPT-2 keeps only the decoder stack.

At each step, the model consumes the previously generated symbols as additional input when generating the next output.
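The feedback loop described above can be sketched without any model at all. In the toy example below, `TOY_NEXT_TOKEN` and `generate_greedy` are hypothetical names, and the lookup table only looks at the most recent token, whereas a real language model conditions on the whole context.

```python
# Hypothetical lookup table standing in for a trained model: it maps the
# most recent token to the single most likely next token.
TOY_NEXT_TOKEN = {
    "<s>": "the",
    "the": "model",
    "model": "predicts",
    "predicts": "the",
}

def generate_greedy(prompt_tokens, steps):
    """Append `steps` tokens, feeding each prediction back in as input."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        next_token = TOY_NEXT_TOKEN[tokens[-1]]  # stand-in for a model call
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

print(generate_greedy(["<s>"], steps=5))
# ['<s>', 'the', 'model', 'predicts', 'the', 'model']
```

Because the toy table always returns the same continuation, the sequence starts to cycle after a few steps, which is a common symptom of purely greedy decoding.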

![](https://i.imgur.com/0XSSXBd.png)

Modifications in GPT-2 include:

- The model uses a larger context size (1,024 tokens) and a larger vocabulary (50,257 tokens)
- An additional layer normalization is added after the final self-attention block
- Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, where normalization is applied before the weight layers rather than after them
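The difference between normalizing after a sub-block (as in the original Transformer) and before it (as in GPT-2) can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: `toy_sublayer` is a hypothetical stand-in for self-attention or the MLP, and `layer_norm` omits the learned scale and shift parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or the MLP."""
    return [0.5 * v for v in x]

def post_ln_block(x):
    # Original Transformer ordering: normalize after the residual addition.
    return layer_norm([a + b for a, b in zip(x, toy_sublayer(x))])

def pre_ln_block(x):
    # GPT-2 ordering: normalize the sub-block input and leave the
    # residual path untouched.
    return [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
print(post_ln_block(x))
print(pre_ln_block(x))
```

In the pre-normalization variant the input `x` passes through the residual connection unchanged, so the output keeps the input's mean; in the post-normalization variant the whole sum is renormalized.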

In [4]:
text = "Welcome to the open data science conference it is"
indexed_tokens = tokenizer.encode(text)
indexed_tokens

[14618, 284, 262, 1280, 1366, 3783, 4495, 340, 318]

In [5]:
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor

tensor([[14618,   284,   262,  1280,  1366,  3783,  4495,   340,   318]])

In [6]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
model

100%|██████████| 176/176 [00:00<00:00, 43316.37B/s]
100%|██████████| 548118077/548118077 [00:18<00:00, 29635831.08B/s]


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1)
    (h): ModuleList(
      (0): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (1): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (2): Block(
        (ln_1): BertLayerNorm()
        (att

In [7]:
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1)
    (h): ModuleList(
      (0): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (1): Block(
        (ln_1): BertLayerNorm()
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1)
          (resid_dropout): Dropout(p=0.1)
        )
        (ln_2): BertLayerNorm()
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1)
        )
      )
      (2): Block(
        (ln_1): BertLayerNorm()
        (att

In [0]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [14]:
predictions.shape

torch.Size([1, 9, 50257])

In [15]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
predicted_text

'Welcome to the open data science conference it is a'

In [0]:
start = 'Natural Language Processing is slowly becoming'
indexed_tokens = tokenizer.encode(start)

# Greedy decoding: generate 75 tokens, one per forward pass
for i in range(75):
  tokens_tensor = torch.tensor([indexed_tokens])
  tokens_tensor = tokens_tensor.to('cuda')
  with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]
    # Pick the highest-scoring token at the last position and append it
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    indexed_tokens = indexed_tokens + [predicted_index]

In [42]:
predicted_text = tokenizer.decode(indexed_tokens)
print(predicted_text)

Natural Language Processing is slowly becoming a reality.

The first step is to create a language processing system that can be used to create a language. This is done by using a language processing system that is built on top of a language processing system.

The language processing system is a set of tools that can be used to create a language. The language processing system is a set of tools that can


# Paragraph Generation with GPT-2

Refer to this [source code](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py#L106-L129) for a deeper dive into how generation works.

- `length`: The number of tokens in the generated text. If the length is None, the number of tokens is decided by the model hyperparameters
- `temperature`: Controls randomness by scaling the logits before the softmax (the temperature of a Boltzmann distribution). Lower temperatures give less random completions; as the temperature approaches zero, the model becomes deterministic and repetitive. Higher temperatures give more random completions
- `top_k`: Controls diversity by restricting sampling to the k most likely tokens at each step. With top_k set to 1, only the single most likely token is considered; with top_k set to 40, the 40 most likely tokens are considered. 0 (default) is a special setting meaning no restriction. top_k = 40 is generally a good value
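The effect of `temperature` and `top_k` can be illustrated with a small standalone sampler. This is a sketch using only the standard library, not the code from `run_generation.py`; the function name `sample_next_token` and the toy logits are assumptions made for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, rng=random):
    """Sample a token index from raw logits with temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logit / temperature for logit in logits]
    if top_k > 0:  # top_k == 0 means no restriction
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # Mask everything below the k-th largest score (ties at the cutoff are kept).
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # Softmax over the remaining scores.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)

# With top_k=1 only the most likely token survives, so this is greedy decoding.
print(sample_next_token(toy_logits, top_k=1, rng=rng))
# With top_k=2 the sample is always one of the two most likely tokens.
print(sample_next_token(toy_logits, temperature=0.7, top_k=2, rng=rng))
```

Setting `top_k=1` reproduces the `argmax` choice used earlier in this notebook, while larger values of `top_k` and `temperature` trade determinism for variety.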

In [43]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 5844, done.[K
remote: Total 5844 (delta 0), reused 0 (delta 0), pack-reused 5844[K
Receiving objects: 100% (5844/5844), 3.16 MiB | 5.45 MiB/s, done.
Resolving deltas: 100% (4167/4167), done.


In [45]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=500 \
    --model_name_or_path=gpt2

08/06/2019 07:29:38 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /root/.cache/torch/pytorch_transformers/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
08/06/2019 07:29:38 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /root/.cache/torch/pytorch_transformers/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
08/06/2019 07:29:39 - INFO - pytorch_transformers.modeling_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at /root/.cache/torch/pytorch_transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d8038