<a href="https://colab.research.google.com/github/HyeonhoonLee/KoNLP/blob/master/GPT/tf_KoGPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Korean Language Modeling using GPT2

- The contents of this notebook are adapted from [NLP-Kr](https://github.com/NLP-kr/tensorflow-ml-nlp-tf2).
- The source code is based on [HuggingFace Transformers](https://github.com/huggingface/transformers) and [SKT KoGPT2](https://github.com/SKT-AI/KoGPT2).
- To understand the principles of GPT2, see the helpful [Illustrated GPT-2](https://jalammar.github.io/illustrated-gpt2/) page.
- The code below uses TensorFlow 2.3.0.

## Libraries and modules

In [None]:
!pip install gluonnlp
!pip install transformers

In [None]:
!pip install mxnet

In [3]:
import os

import numpy as np
import tensorflow as tf

In [4]:
# To use the word vocab API (https://nlp.gluon.ai/api/vocab.html).
import gluonnlp as nlp  

# Use the SentencePiece tokenizer from GluonNLP instead of the Transformers tokenizer,
# because KoGPT2 was trained with this type of tokenizer.
# It supports subword tokenization such as BPE (Byte Pair Encoding).
# (https://nlp.gluon.ai/api/data.html)
from gluonnlp.data import SentencepieceTokenizer

# OpenAI GPT2 transformer with a language modeling head on top (a linear layer with weights tied to the input embeddings).
# (https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
from transformers import TFGPT2LMHeadModel

## Korean GPT2 Model

In [None]:
# # Download the pretrained parameters.
# !wget https://github.com/NLP-kr/tensorflow-ml-nlp-tf2/releases/download/v1.0/gpt_ckpt.zip -O gpt_ckpt.zip
# !unzip -o gpt_ckpt.zip

In [6]:
# To define the GPT2Model class.
class GPT2Model(tf.keras.Model):
  def __init__(self, dir_path):
    super(GPT2Model, self).__init__()
    self.gpt2 = TFGPT2LMHeadModel.from_pretrained(dir_path)
  
  # self.gpt2 returns several outputs; the first one holds the language-model logits
  # (the remaining ones are the cached past and, when enabled, hidden states and attentions).
  def call(self, inputs):
    return self.gpt2(inputs)[0]

In [7]:
# We used Colab.
BASE_MODEL_PATH = '/content/drive/My Drive/ModelCollection/gpt_ckpt/'
# Create pre-trained gpt2 model.
gpt_model = GPT2Model(BASE_MODEL_PATH)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at /content/drive/My Drive/ModelCollection/gpt_ckpt/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## Tokenizer

In [8]:
TOKENIZER_PATH = os.path.join(BASE_MODEL_PATH, 'gpt2_kor_tokenizer.spiece')

# To make tokenizer for KoGPT2.
tokenizer = SentencepieceTokenizer(TOKENIZER_PATH)

# To define the word dictionary.
vocab = nlp.vocab.BERTVocab.from_sentencepiece(TOKENIZER_PATH,
                                               mask_token=None,
                                               sep_token=None,
                                               cls_token=None,
                                               unknown_token='<unk>',
                                               padding_token='<pad>',
                                               bos_token='<s>',
                                               eos_token='</s>')

## Sentence Generator

In [9]:
def tf_top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-99999):
    _logits = logits.numpy()
    top_k = min(top_k, logits.shape[-1])  
    
    # top-k sampling method
    if top_k > 0:
        indices_to_remove = logits < tf.math.top_k(logits, top_k)[0][..., -1, None]
        _logits[indices_to_remove] = filter_value
    
    # Nucleus (top-p) sampling method
    if top_p > 0.0:
        sorted_logits = tf.sort(logits, direction='DESCENDING')
        sorted_indices = tf.argsort(logits, direction='DESCENDING')
        cumulative_probs = tf.math.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)

        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove = tf.concat([[False], sorted_indices_to_remove[..., :-1]], axis=0)
        indices_to_remove = sorted_indices[sorted_indices_to_remove].numpy().tolist()
        
        _logits[indices_to_remove] = filter_value
    return tf.constant([_logits])

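The filtering step in `tf_top_k_top_p_filtering` can be easier to follow on a tiny example without TensorFlow. The block below is a minimal pure-Python sketch, not the notebook's implementation: the helper names `softmax` and `top_k_top_p_filter` are hypothetical, and it assumes the nucleus step runs on the logits that survive the top-k step (the notebook function computes the nucleus from the unfiltered logits instead). As in the notebook, the token that pushes the cumulative probability past `top_p` is kept.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k=0, top_p=0.0, filter_value=-1e9):
    """Return a copy of `logits` with filtered entries set to `filter_value`."""
    filtered = list(logits)

    # Top-k: keep only the k largest logits (ties with the k-th value are kept).
    if top_k > 0:
        k = min(top_k, len(filtered))
        threshold = sorted(filtered, reverse=True)[k - 1]
        filtered = [x if x >= threshold else filter_value for x in filtered]

    # Nucleus (top-p): walk tokens from most to least likely and drop a token
    # once the probability mass of the tokens BEFORE it already exceeds top_p.
    if top_p > 0.0:
        order = sorted(range(len(filtered)), key=lambda i: filtered[i], reverse=True)
        probs = softmax([filtered[i] for i in order])
        cumulative = 0.0
        for rank, idx in enumerate(order):
            if cumulative > top_p:
                filtered[idx] = filter_value
            cumulative += probs[rank]

    return filtered


example_logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_top_p_filter(example_logits, top_k=2))
print(top_k_top_p_filter(example_logits, top_p=0.8))
```

Sampling from the softmax of the filtered logits then assigns practically zero probability to the removed tokens.
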
In [10]:
def generate_sent(seed_word, model, max_step=100, greedy=False, top_k=0, top_p=0.):
    sent = seed_word   # Input sentence or word
    toked = tokenizer(sent)  # Tokenizing
    
    for _ in range(max_step):  # max_step is the maximum number of subwords to generate
        input_ids = tf.constant([vocab[vocab.bos_token],]  + vocab[toked])[None, :] 
        outputs = model(input_ids)[:, -1, :] # Logits for the subword that follows the current sentence.
        if greedy:  # Greedy search: always pick the most likely next subword.
            gen = vocab.to_tokens(tf.argmax(outputs, axis=-1).numpy().tolist()[0])
        else:    # Using top_k & Nucleus sampling.
            output_logit = tf_top_k_top_p_filtering(outputs[0], top_k=top_k, top_p=top_p)
            gen = vocab.to_tokens(tf.random.categorical(output_logit, 1).numpy().tolist()[0])[0]
        if gen == '</s>': # Stop generating when meeting this special token.
            break
        sent += gen.replace('▁', ' ')
        toked = tokenizer(sent)

    return sent

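The control flow of `generate_sent` (predict the next token, stop on `</s>`, otherwise append and repeat) can be illustrated with a toy stand-in for the model. This is only a sketch under stated assumptions: `TOY_NEXT` and `generate` are hypothetical names, the "model" is a hand-written lookup table rather than GPT2, and tokens are appended directly instead of re-tokenizing the whole sentence at every step as the notebook does.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.7, "</s>": 0.3},
    "b": {"a": 0.2, "</s>": 0.8},
}


def generate(seed_tokens, max_step=10, greedy=True, rng=None):
    """Autoregressive decoding loop: greedy argmax or random sampling."""
    rng = rng or random.Random(0)
    tokens = ["<s>"] + list(seed_tokens)  # prepend the beginning-of-sentence token
    for _ in range(max_step):
        dist = TOY_NEXT[tokens[-1]]
        if greedy:
            # Greedy search: take the single most likely next token.
            nxt = max(dist, key=dist.get)
        else:
            # Sampling: draw the next token in proportion to its probability.
            nxt = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if nxt == "</s>":  # stop at the end-of-sentence token
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the leading "<s>"


print(generate([], greedy=True))
print(generate([], greedy=False, rng=random.Random(42)))
```

Greedy decoding is deterministic for a given seed, while sampling can return a different continuation on each call unless the random generator is seeded.
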
## Sentence generation practice

In [35]:
generate_sent('의학', gpt_model, greedy=True)

'의학전문대학원, 의과대학원과 같은 학원을 설립할 수 있는가에 대한 것이다.'

In [37]:
generate_sent('의학', gpt_model, top_k=0, top_p=0.95)

'의학, 흡연, 신장 연구의학을 전문적으로 하면서이지.'

## Fine tuning (Korean novel)

### Data preprocessing

In [48]:
# Hyperparameters
BATCH_SIZE = 16
NUM_EPOCHS = 10
MAX_LEN = 30

In [42]:
# Download text file (Korean novel)
!wget https://raw.githubusercontent.com/NLP-kr/tensorflow-ml-nlp-tf2/master/7.PRETRAIN_METHOD/data_in/KOR/finetune_data.txt

--2020-11-03 07:42:38--  https://raw.githubusercontent.com/NLP-kr/tensorflow-ml-nlp-tf2/master/7.PRETRAIN_METHOD/data_in/KOR/finetune_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24570 (24K) [text/plain]
Saving to: ‘finetune_data.txt’


2020-11-03 07:42:38 (1.80 MB/s) - ‘finetune_data.txt’ saved [24570/24570]



In [45]:
TEXT_DATA_PATH = '/content/drive/My Drive/DataCollection/NLP/finetune_data.txt'

# Read one sentence per line and strip the trailing newline.
with open(TEXT_DATA_PATH, encoding='utf-8') as f:
    sents = [line.rstrip('\n') for line in f]

print('total number of sents :',len(sents))
print('1st sents :',sents[0])

total number of sents : 284
1st sents : 그때에 김첨지는 대수롭지 않은듯이,


In [56]:
# Tokenizing the text data and making input_data and output_data with special tokens.
from tensorflow.keras.preprocessing.sequence import pad_sequences

input_data = []
output_data = []

for s in sents:
    tokens = [vocab[vocab.bos_token],]  + vocab[tokenizer(s)] + [vocab[vocab.eos_token],]
    input_data.append(tokens[:-1])
    output_data.append(tokens[1:])

input_data = pad_sequences(input_data, MAX_LEN, value=vocab[vocab.padding_token]) # default: padding='pre'
output_data = pad_sequences(output_data, MAX_LEN, value=vocab[vocab.padding_token])

input_data = np.array(input_data, dtype=np.int64)
output_data = np.array(output_data, dtype=np.int64)

print(sents[0])
print(input_data[0])
print(output_data[0])

그때에 김첨지는 대수롭지 않은듯이,
[    3     3     3     3     3     3     3     3     3     0 47437 47522
 47675 47442 47437 47633 48120 47445 47441 47437 47455 47467 48139 47445
 47437 47676 47459 48090 47438 47453]
[    3     3     3     3     3     3     3     3     3 47437 47522 47675
 47442 47437 47633 48120 47445 47441 47437 47455 47467 48139 47445 47437
 47676 47459 48090 47438 47453     1]

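The printed arrays show the pattern used for language-model training: the target is the input shifted left by one token, and both are padded on the left. A minimal pure-Python sketch of that step follows. The function name `make_example` is hypothetical; the IDs 0, 1, and 3 for `<s>`, `</s>`, and `<pad>` are taken from the printed output above; and the pre-padding and pre-truncation mimic the defaults of `pad_sequences`.

```python
def make_example(token_ids, max_len, bos_id=0, eos_id=1, pad_id=3):
    """Build a left-padded (input, target) pair shifted by one token."""
    tokens = [bos_id] + list(token_ids) + [eos_id]
    inputs = tokens[:-1]   # everything except the final </s>
    targets = tokens[1:]   # everything except the leading <s>

    def pre_pad(seq):
        seq = seq[-max_len:]  # keep the last max_len tokens (pre-truncation)
        return [pad_id] * (max_len - len(seq)) + seq  # pad on the left

    return pre_pad(inputs), pre_pad(targets)


inputs, targets = make_example([10, 11, 12], max_len=6)
print(inputs)
print(targets)
```

At every non-padding position, `targets[i]` is the token that follows `inputs[i]`, which is what the model is trained to predict.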

### Fine tuning with pre-trained model

In [57]:
# Loss function and metric(accuracy)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, vocab[vocab.padding_token]))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

def accuracy_function(real, pred):
    # Give padding positions zero weight so they do not count as wrong predictions.
    mask = tf.math.logical_not(tf.math.equal(real, vocab[vocab.padding_token]))
    mask = tf.cast(mask, dtype=pred.dtype)
    acc = train_accuracy(real, pred, sample_weight=mask)

    return tf.reduce_mean(acc)

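`loss_function` zeroes the loss at padding positions and then calls `tf.reduce_mean`, which still divides by the total number of positions, padding included. The pure-Python sketch below compares that value with an average over real tokens only. It is an illustration, not the notebook's code: `masked_cross_entropy` is a hypothetical helper, the pad ID 3 is taken from the printed arrays above, and the per-position inputs are probabilities rather than logits to keep the arithmetic short.

```python
import math


def masked_cross_entropy(targets, probs, pad_id=3):
    """Return (mean over all positions, mean over non-padding positions).

    targets: list of target token IDs.
    probs:   list of per-position probability lists, indexed by token ID.
    """
    losses, mask = [], []
    for target, dist in zip(targets, probs):
        if target == pad_id:
            losses.append(0.0)  # padding contributes no loss
            mask.append(0.0)
        else:
            losses.append(-math.log(dist[target]))
            mask.append(1.0)

    mean_all = sum(losses) / len(losses)              # divides by every position
    mean_valid = sum(losses) / max(sum(mask), 1.0)    # divides by real tokens only
    return mean_all, mean_valid


uniform = [0.25, 0.25, 0.25, 0.25]
mean_all, mean_valid = masked_cross_entropy([3, 3, 0, 1], [uniform] * 4)
print(mean_all, mean_valid)
```

With heavy padding the first value shrinks in proportion to the share of real tokens, so losses from batches with different amounts of padding are not directly comparable; dividing by the mask sum is one possible alternative.
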
In [59]:
# compile model
gpt_model.compile(loss=loss_function,
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=[accuracy_function])

In [72]:
# train model...
history = gpt_model.fit(input_data, output_data, 
                    batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_split=0.1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [73]:
# Save the model
DATA_OUT_PATH = '/content/drive/My Drive/DataCollection/NLP'
model_name = "tf2_gpt2_finetuned_model"

save_path = os.path.join(DATA_OUT_PATH, model_name)

if not os.path.exists(save_path):
    os.makedirs(save_path)

gpt_model.gpt2.save_pretrained(save_path)

In [74]:
# Load the model
loaded_gpt_model = GPT2Model(save_path)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at /content/drive/My Drive/DataCollection/NLP/tf2_gpt2_finetuned_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


### Practice with fine-tuned model

In [75]:
generate_sent('김첨지', gpt_model, greedy=True)

'김첨지는                                                                                                   '

In [78]:
generate_sent('김첨지', gpt_model, top_k=3, top_p=0.0)

'김첨지의                                                                                                   '