<a href="https://colab.research.google.com/github/KRiver28/TIL/blob/master/8_13_KoGPT2(gen_text).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Text generation with KoGPT2 using TFGPT2LMHeadModel
# Reference: https://github.com/SKT-AI/KoGPT2
# TFGPT2LMHeadModel: The GPT2 Model transformer with a language modeling head on top
#                    (linear layer with weights tied to the input embeddings).
# The version spec is quoted so the shell does not treat ">" as an output redirect.
!pip install --upgrade "mxnet>=1.6.0"
!pip install gluonnlp
!pip install transformers
!pip install sentencepiece
!pip install wget

Collecting gluonnlp
  Downloading gluonnlp-0.10.0.tar.gz (344 kB)

In [3]:
import gluonnlp as nlp
from gluonnlp.data import SentencepieceTokenizer, SentencepieceDetokenizer
from transformers import TFGPT2LMHeadModel
import tensorflow as tf
import wget
import zipfile

wget.download('https://github.com/NLP-kr/tensorflow-ml-nlp-tf2/releases/download/v1.0/gpt_ckpt.zip')

with zipfile.ZipFile('gpt_ckpt.zip') as z:
    z.extractall()





In [8]:
# Paths into the extracted gpt_ckpt.zip (unzip and upload it manually if it is not already under /content/).
MY_PATH = '/content/'
MODEL_PATH = MY_PATH + 'gpt_ckpt'
TOKENIZER_PATH = MY_PATH + 'gpt_ckpt/gpt2_kor_tokenizer.spiece'

# References: https://nlp.gluon.ai/api/modules/data.html
#             https://opensourcelibs.com/lib/kogpt2#mxnet-gluon
#             https://github.com/SKT-AI/KoGPT2#user-contributed-examples
# With alpha = 1.0 (default), '안녕 하세요' is split into ['▁', '안', '녕', '하', '세', '요'].
# With alpha = 0, it is split into ['▁안녕', '▁하세요']. Use alpha = 0 for Korean.
tokenizer = SentencepieceTokenizer(TOKENIZER_PATH, num_best=0, alpha=0)
detokenizer = SentencepieceDetokenizer(TOKENIZER_PATH)
vocab = nlp.vocab.BERTVocab.from_sentencepiece(TOKENIZER_PATH,
                                               mask_token = None,
                                               sep_token = None,
                                               cls_token = None,
                                               unknown_token = '<unk>',
                                               padding_token = '<pad>',
                                               bos_token = '<s>',
                                               eos_token = '</s>')
# vocab --> Vocab(size=50000, unk="<unk>", reserved="['<pad>', '<s>', '</s>']")

In [9]:
# Try out the tokenizer
toked = tokenizer('안녕 하세요')
print(toked)


['▁안녕', '▁하세요']


In [10]:
toked_idx = vocab(toked)
print(toked_idx)

[14998, 24155]


In [11]:
toked = vocab.to_tokens(toked_idx)
print(toked)

['▁안녕', '▁하세요']


In [12]:
detoked = detokenizer(toked)
print(detoked)

안녕 하세요


In [13]:
''.join(toked).replace('▁', ' ')[1:]

'안녕 하세요'
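
The one-liner above relies on the first piece starting with the '▁' word-boundary marker, because `[1:]` drops the first character unconditionally. A minimal sketch of a slightly safer variant follows; `detokenize` and `MARK` are hypothetical names introduced only for this illustration and are not part of gluonnlp.

```python
MARK = '▁'  # word-boundary marker seen in the tokenizer output above

def detokenize(pieces):
    """Join subword pieces and turn each boundary marker into a space."""
    text = ''.join(pieces).replace(MARK, ' ')
    # Strip only the single leading space produced by the first marker, if any.
    return text[1:] if text.startswith(' ') else text

print(detokenize([MARK + '안녕', MARK + '하세요']))
```

If the first piece carries no marker (for example when decoding a fragment from the middle of a sentence), this version keeps the first character instead of cutting it off.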

In [14]:
print(len(vocab))
print(vocab.padding_token, ':', vocab[vocab.padding_token])
print(vocab.bos_token, ': ', vocab[vocab.bos_token])
print(vocab.eos_token, ': ', vocab[vocab.eos_token])
print(vocab.unknown_token, ': ', vocab[vocab.unknown_token])



50000
<pad> : 3
<s> :  0
</s> :  1
<unk> :  5


In [15]:
# vocabulary = vocab.token_to_idx
word2idx = {k:v for k, v in vocab.token_to_idx.items()}
idx2word = {v:k for k, v in word2idx.items()}
idx2word[5000]

'▁전세'

In [16]:
print(vocab.token_to_idx)



In [17]:
model = TFGPT2LMHeadModel.from_pretrained(MODEL_PATH)
model.summary()

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at /content/gpt_ckpt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124242432 
 r)                                                              
                                                                 
Total params: 124,242,432
Trainable params: 124,242,432
Non-trainable params: 0
_________________________________________________________________


In [18]:
# Build the seed input for the model
tok = tokenizer('이때')   # tok = ['▁이때']
tok_idx = [vocab[vocab.bos_token]] + vocab[tok]     # tok_idx = [0, 4499]
input_ids = tf.convert_to_tensor(tok_idx)[None, :]  # convert to a tensor and add a batch dimension

input_ids

<tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[   0, 4499]], dtype=int32)>
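
The shape (1, 2) comes from two steps: the BOS id is prepended to the token ids, and `[None, :]` adds a leading batch axis of size 1. The same construction can be sketched with plain lists. `build_input_ids` and `toy_vocab` are hypothetical and stand in for the real `vocab` object; the two ids are the ones shown in the cell above.

```python
def build_input_ids(tokens, token_to_idx, bos_token='<s>'):
    """Prepend the BOS id and wrap the sequence in a batch of size 1."""
    ids = [token_to_idx[bos_token]] + [token_to_idx[t] for t in tokens]
    return [ids]  # the outer list plays the role of the batch dimension

# Toy mapping holding only the two entries needed for this example.
toy_vocab = {'<s>': 0, '▁이때': 4499}
toy_input_ids = build_input_ids(['▁이때'], toy_vocab)
print(toy_input_ids)  # [[0, 4499]]
```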

In [19]:
# Generate with the default settings (greedy decoding: no sampling, a single beam)
output = model.generate(input_ids, max_length=50)

output

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


<tf.Tensor: shape=(1, 50), dtype=int32, numpy=
array([[    0,  4499,  2592,   847,  4558,   181,  1914,  9858,   167,
        47481, 47465, 47443,   528, 47623, 47444, 16684, 17450,  2238,
        26291,   699,  6334,  2041, 47654,   445,  5304, 47440,     1,
            0,   104,   533,   167,  2162, 47443,   809, 47623, 47444,
        15134,   167,  2162, 47623, 47444,   167,  2162, 47623,   107,
         5504,   421,  8327,  6329,  3299]], dtype=int32)>
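
With no sampling and a single beam, generation picks the most probable next token at every step. The sketch below shows that loop on a hand-made bigram table. `TOY_MODEL` and `greedy_decode` are hypothetical stand-ins for the real network, which conditions on the whole prefix rather than only the last token. As in the call above, `max_length` counts the prompt tokens too.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'b': 0.5, 'a': 0.3, '</s>': 0.2},
    'b': {'</s>': 0.7, 'a': 0.3},
}

def greedy_decode(start, max_length, eos='</s>'):
    """Repeatedly append the single most probable next token."""
    seq = [start]
    while len(seq) < max_length and seq[-1] != eos:
        probs = TOY_MODEL[seq[-1]]
        seq.append(max(probs, key=probs.get))
    return seq

greedy_result = greedy_decode('<s>', max_length=10)
print(greedy_result)  # ['<s>', 'a', 'b', '</s>']
```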

In [26]:
# Convert the model output to a string
out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

이때까지 ‘늑대소년’은 흥행성적으로만 따지면, 700만 관객을 돌파했다. 올해 초 열린 ‘제17회 부산국제영화제’에서는 ‘피에타’의 주연배우 조민수, 이정진, 류승룡, 권
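
The warning printed by `generate` reports 50256 as the EOS id, while this vocabulary has 50000 entries and maps `</s>` to id 1. That suggests generation does not stop at `</s>`, and the greedy output array above indeed contains a 1 followed by a 0 before it continues. One possible workaround, sketched here under that assumption, is to cut the id list at the first `</s>` before decoding. `truncate_at_eos` is a hypothetical helper and the id list is a shortened, illustrative sequence.

```python
def truncate_at_eos(token_ids, eos_id=1):
    """Return the ids before the first eos_id; a copy of the input if eos_id is absent."""
    if eos_id in token_ids:
        return token_ids[:token_ids.index(eos_id)]
    return list(token_ids)

# Illustrative ids only: 1 is </s> and 0 is <s> in this vocabulary.
example_ids = [0, 4499, 2592, 5304, 47440, 1, 0, 104, 533]
print(truncate_at_eos(example_ids))  # [0, 4499, 2592, 5304, 47440]
```

The truncated list can then be passed to `vocab.to_tokens` and the detokenizer as in the cell above.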


In [27]:
# Beam search
output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)

out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


이때문에 일부 네티즌들은 “한효주, 한효주, 설경구, 정우성, 한효주, 설경구, 정우성, 한효주, 설경구, 정우성, 한효주, 설경구
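
Beam search keeps the `num_beams` best partial sequences at every step instead of only one, so it can find a sequence with a higher overall probability than greedy decoding does. The sketch below is a simplified version on a hand-made bigram table. `TOY_MODEL` and `beam_search` are hypothetical, and the sketch omits details of the library implementation such as length normalisation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'a': 0.4, 'b': 0.3, '</s>': 0.3},
    'b': {'</s>': 0.9, 'a': 0.1},
}

def beam_search(start, num_beams, max_length, eos='</s>'):
    """Return the highest-scoring sequence found with a fixed beam width."""
    beams = [(0.0, [start])]  # (cumulative log-probability, sequence)
    for _ in range(max_length - 1):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                candidates.append((score, seq))  # finished beams are carried over
                continue
            for tok, p in TOY_MODEL[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep only the num_beams best candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]

print(beam_search('<s>', num_beams=1, max_length=4))  # ['<s>', 'a', 'a', 'a']
print(beam_search('<s>', num_beams=2, max_length=4))  # ['<s>', 'b', '</s>']
```

With one beam the search commits to 'a' (probability 0.6) and ends with total probability 0.096. With two beams it keeps 'b' alive and finds '<s> b </s>' with probability 0.36.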


In [22]:
# Suppress repetition: with no_repeat_ngram_size = 2, no 2-gram may appear twice in the output.
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


이때문에 일부 네티즌들은 “이효리가 이효리의 뒤를 이을 것 같다”는 전망을 내놓기도 했다. 한편, 이날 방송에서는 ‘전설의 주먹’의 황정민, 유준상, 윤제문, 정웅인,
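
The idea behind `no_repeat_ngram_size` can be sketched as follows: before each step, find every token that would complete an n-gram already present in the sequence and exclude it from the candidates. `banned_next_tokens` is a hypothetical helper written for this illustration, not the library's internal function.

```python
def banned_next_tokens(history, ngram_size):
    """Tokens that would recreate an n-gram already present in history."""
    if ngram_size < 1 or len(history) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(history[len(history) - (ngram_size - 1):])
    banned = set()
    for i in range(len(history) - ngram_size + 1):
        ngram = tuple(history[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# The bigram (',', '설경구') already occurred, so '설경구' may not follow ',' again.
print(banned_next_tokens(['한효주', ',', '설경구', ','], ngram_size=2))  # {'설경구'}
```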


In [23]:
# Top-k sampling: sample randomly from the k most probable tokens.
output = model.generate(input_ids, max_length=50, do_sample = True, top_k=100, temperature=0.8)

out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


이때까지 이 회장이 관여한 사안이 없다는 사실만 밝혀내면, 검찰이 사실상 수사를 포기하고 무혐의 처분할 공산이 크다. 하지만, 재판부는 “김씨의 범행은 김씨가 자살하려고 자신의 집에 간 사이 일어난데다, 김씨가 자살을 시도하려 한 정황이 없는
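
A minimal sketch of top-k sampling with temperature, using only the standard library: keep the k largest logits, divide them by the temperature, apply a softmax, and draw one token. `top_k_distribution`, `sample`, and `toy_logits` are hypothetical names and values for illustration. Dividing by a positive temperature does not change the ranking, so the same k tokens survive whichever step comes first.

```python
import math
import random

def top_k_distribution(logits, k, temperature=1.0):
    """Keep the k highest logits and apply a temperature softmax over them."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scaled = [(tok, value / temperature) for tok, value in top]
    peak = max(v for _, v in scaled)  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def sample(dist, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = {'a': 2.0, 'b': 1.0, 'c': 0.0, 'd': -1.0}
top_k_dist = top_k_distribution(toy_logits, k=2, temperature=0.8)
print(sorted(top_k_dist))                    # ['a', 'b']
print(sample(top_k_dist, random.Random(0)))  # one of 'a' or 'b'
```

A temperature below 1 sharpens the distribution toward the most probable token; a value above 1 flattens it.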


In [24]:
# Top-p (nucleus) sampling: sample randomly from the most probable tokens whose cumulative probability reaches top_p.
output = model.generate(input_ids, max_length=50, do_sample = True, top_p=0.9, temperature=0.8)

out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


이때 문득 이 두 사람의 마음이 맞아떨어져서인지 두 사람은 서로를 향해 미소 지었다. 이 날 경기에서 넥센과 롯데는 밴헤켄과 유먼을 선발로 내세웠다. 올 시즌 우천취소된 경기는 모두 13
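
Top-p filtering can be sketched in the same way: sort the tokens by probability, keep adding them until the cumulative probability reaches `top_p`, and renormalise what is left. `top_p_filter` and `toy_probs` are hypothetical and shown only to illustrate the selection rule. The library applies it to the full vocabulary distribution at every step.

```python
def top_p_filter(probs, top_p):
    """Smallest set of most probable tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise to sum to 1

toy_probs = {'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(sorted(nucleus))  # ['a', 'b', 'c']
```

Unlike top-k, the number of surviving tokens varies with the shape of the distribution: a peaked distribution keeps few candidates and a flat one keeps many.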


In [25]:
# top_k & top_p sampling
output = model.generate(input_ids, max_length=50, do_sample = True, top_k=100, top_p=0.9, temperature=0.8)

out_tok_idx = output.numpy().tolist()[0]   # output token indices
out_tok = vocab.to_tokens(out_tok_idx)     # map token indices to token strings
out_text = detokenizer(out_tok)            # decode into the output string
print(out_text)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


이때까지 ‘늑대소년’은 흥행성적으로만 따지면, 700만 관객을 돌파했다. 올해 초 열린 ‘제17회 부산국제영화제’에서는 ‘피에타’의 주연배우 조민수, 이정진, 류승룡, 권
