
# Text Generation with Transformers

It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. The decoder alone is enough. It is a natural fit for language modeling (predicting the next word) because it is built to mask future tokens, the same property that lets it generate a translation word by word.

Here we use the GPT-2 model to generate text from an input sequence.

![](https://i.imgur.com/z4k1IzU.png)
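
To make the masking idea concrete, here is a minimal pure-Python sketch of a causal (look-ahead) mask. The helper name `causal_mask` is illustrative and not part of any library; real implementations build the same pattern as a tensor and apply it to the attention scores.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # 1 = visible, 0 = masked (a future token)
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each row is one position in the sequence: the first token sees only itself, while the last token sees everything before it.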


# Install Dependencies

In [None]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
Collecting boto3 (from pytorch-transformers)
  Downloading boto3-1.34.132-py3-none-any.whl (139 kB)
Collecting sacremoses (from pytorch-transformers)
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.0.0->pytorch-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.0.0->pytorch-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylin

# Load GPT2 Model

In [None]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:01<00:00, 802992.65B/s]
100%|██████████| 456318/456318 [00:00<00:00, 517970.69B/s]


# Next Word Generation with GPT-2

GPT-2 is the successor to GPT, OpenAI's original generative pre-trained language model. The full GPT-2 model has 1.5 billion parameters, more than 10 times as many as GPT. GPT-2 gave state-of-the-art results on several language modeling benchmarks when it was released, as you might have surmised already (and will soon see when we get into Python).

The model was pre-trained on text from about 8 million web pages, collected by following outbound links from Reddit.

![](https://i.imgur.com/TbnGbjX.png)

The architecture of GPT-2 is based on the Transformer, proposed by Google researchers in the paper “Attention Is All You Need”. The original Transformer uses an encoder-decoder structure with attention to capture dependencies between input and output; GPT-2 keeps only the decoder stack.

At each step, the model consumes the previously generated symbols as additional input when generating the next output.

![](https://i.imgur.com/0XSSXBd.png)
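
The feed-the-output-back-in loop described above can be sketched without any model at all. In this toy example, `generate`, `TOY_TABLE`, and `toy_next_token` are hypothetical names; the lookup table merely stands in for a real language model, which would score every token in its vocabulary instead.

```python
def generate(next_token, prompt, steps):
    """Greedy autoregressive loop: each new token is appended to the
    sequence and becomes part of the input for the next step."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return tokens


# Hypothetical stand-in for a language model: it looks only at the last
# token and returns a fixed successor from a tiny lookup table.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], ".")


result = generate(toy_next_token, ["the"], steps=4)
print(result)
```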

Modifications in GPT-2 include:

- The model uses a larger context (1024 tokens) and a larger vocabulary (50,257 tokens)
- An additional layer normalization is added after the final self-attention block
- Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network. This differs from the original Transformer, which normalizes after each sub-block. GPT-2 uses layer normalization throughout, not batch normalization
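
The change in normalization order is easier to see side by side. The sketch below is a simplified pure-Python illustration, not GPT-2's actual code: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-block.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance.
    The learned scale and shift used in practice are omitted."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """GPT-2 ordering: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP: doubles each value.
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print(post_ln_block(x, toy_sublayer))
print(pre_ln_block(x, toy_sublayer))
```

In the pre-normalization variant the residual path carries `x` through unchanged, so the output keeps the input's mean, whereas the post-normalization output is always re-centered to zero.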




In [None]:
text = "Welcome to the open data science conference it is"
indexed_tokens = tokenizer.encode(text)
indexed_tokens

[19134, 284, 262, 1280, 1366, 3783, 4495, 340, 318]

In [None]:
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor

tensor([[19134,   284,   262,  1280,  1366,  3783,  4495,   340,   318]])

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
model

100%|██████████| 665/665 [00:00<00:00, 1543559.58B/s]
100%|██████████| 548118077/548118077 [00:44<00:00, 12265805.77B/s]


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
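
The printed module tree is enough to estimate how many parameters this checkpoint has. The arithmetic below is a sketch under stated assumptions: the printout does not show the `Conv1D` shapes, so the fused 3x query/key/value projection and the 4x MLP expansion are taken from the standard GPT-2 configuration, and `lm_head` is assumed to share its weights with `wte`. The helper `linear_params` is illustrative.

```python
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12


def linear_params(n_in, n_out):
    """Weights plus biases of one dense (Conv1D) layer."""
    return n_in * n_out + n_out


layer_norm_params = 2 * d_model  # scale and shift

embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe

per_block = (
    layer_norm_params                      # ln_1
    + linear_params(d_model, 3 * d_model)  # attn.c_attn (assumed: fused Q, K, V)
    + linear_params(d_model, d_model)      # attn.c_proj
    + layer_norm_params                    # ln_2
    + linear_params(d_model, 4 * d_model)  # mlp.c_fc (assumed: 4x expansion)
    + linear_params(4 * d_model, d_model)  # mlp.c_proj
)

# lm_head is assumed to share its weights with wte, so it adds nothing.
total = embeddings + n_layers * per_block + layer_norm_params  # + ln_f
print(f"{total:,}")
```

The result is roughly 124 million, which shows that the `'gpt2'` checkpoint loaded here is the smallest released model, not the full 1.5 billion parameter one.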

In [None]:
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [None]:
predictions.shape

torch.Size([1, 9, 50257])

In [None]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
predicted_text

' Welcome to the open data science conference it is a'
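
The cells above take the scores at the last position and pick the largest one. Here is the same step on a small scale, as a pure-Python sketch: `toy_logits` is made-up data with 3 positions and a 4-token vocabulary, and `softmax` is written out by hand only to show how scores relate to probabilities.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits with shape [sequence_length=3][vocab_size=4];
# the real tensor above has shape [1, 9, 50257].
toy_logits = [
    [0.1, 2.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.4],
    [0.0, 0.5, 3.0, 1.0],
]

last_row = toy_logits[-1]  # scores for the token that follows the prompt
probs = softmax(last_row)
# argmax over the vocabulary dimension, as torch.argmax does above
predicted_index = max(range(len(last_row)), key=last_row.__getitem__)

print(predicted_index)
print([round(p, 3) for p in probs])
```

Because softmax preserves ordering, taking the argmax of the raw logits selects the same token as taking the argmax of the probabilities, which is why the notebook skips the softmax.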

In [None]:
start = 'Natural Language Processing is slowly becoming'
indexed_tokens = tokenizer.encode(start)

# Greedy decoding: generate 75 tokens, one at a time
for _ in range(75):
    # Feed the whole sequence so far as a batch of size 1
    tokens_tensor = torch.tensor([indexed_tokens]).to('cuda')
    with torch.no_grad():
        outputs = model(tokens_tensor)
    predictions = outputs[0]
    # Pick the highest-scoring token at the last position and append it
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    indexed_tokens = indexed_tokens + [predicted_index]

In [None]:
# indexed_tokens already ends with the last predicted token, so decode it as is
predicted_text = tokenizer.decode(indexed_tokens)
print(predicted_text)

 Natural Language Processing is slowly becoming more and more popular.

The first step is to create a language processing program that can be used to create a language. This is called a language processing program.

The language processing program is a program that can be used to create a language. This is called a language processing program.

The program is a program that can be used to create a language


# Paragraph Generation with GPT-2

For a deeper dive, see the sampling logic in this [source code](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py#L106-L129). The main generation parameters are:

- `length`: The number of tokens to generate. If the length is None, the number of tokens is decided by the model hyperparameters
- `temperature`: Controls the randomness of sampling by scaling the scores before they are turned into probabilities (a Boltzmann distribution). Lower temperatures give less random completions; as the temperature approaches zero, the model becomes deterministic and repetitive. Higher temperatures give more random completions
- `top_k`: Controls diversity by limiting sampling to the k most likely tokens. With top_k set to 1, only the single most likely token is considered at each step; with top_k set to 40, the 40 most likely tokens are considered. 0 (default) is a special setting meaning no restriction. top_k = 40 is generally a good value
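
How `temperature` and `top_k` interact can be shown in a few lines. This is a minimal pure-Python sketch, not the code from `run_generation.py`: the function `sample_next`, its `rng` argument, and the `toy_logits` values are all illustrative assumptions.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample one token index from raw logits.

    temperature rescales the logits before the softmax; top_k > 0 keeps
    only the k highest-scoring tokens, and top_k == 0 keeps all of them.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Token indices ordered from highest to lowest score
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        candidates = candidates[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    # Unnormalized softmax weights; rng.choices normalizes them
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)              # seeded so repeated runs draw the same samples

greedy_like = [sample_next(toy_logits, top_k=1, rng=rng) for _ in range(5)]
top2 = {sample_next(toy_logits, temperature=1.5, top_k=2, rng=rng) for _ in range(200)}
print(greedy_like)
print(sorted(top2))
```

With `top_k=1` the sampler always returns the highest-scoring token, which matches the greedy `argmax` loop used earlier. With `top_k=2` it can only ever return one of the two best tokens, however high the temperature.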

In [None]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 217070, done.
remote: Counting objects: 100% (684/684), done.
remote: Compressing objects: 100% (313/313), done.
remote: Total 217070 (delta 404), reused 508 (delta 307), pack-reused 216386
Receiving objects: 100% (217070/217070), 227.00 MiB | 13.60 MiB/s, done.
Resolving deltas: 100% (157192/157192), done.


In [None]:
!python pytorch-transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=500 \
    --model_name_or_path=gpt2

python3: can't open file '/content/pytorch-transformers/examples/run_generation.py': [Errno 2] No such file or directory
