## Tutorial of DialoGPT

- This assignment follows the presentation materials; the explanation of `MMI` draws on the official documentation and a third-party example.

### Overview

- The dataset is extracted from comment chains scraped from Reddit between 2005 and 2017


- filtering: considerable effort went into ensuring the quality of the dataset
    - After filtering, the dataset comprises 147,116,725 dialogue instances, 1.8 billion words in total
![image.png](https://github.com/DeepHaeJoong/SGU_2022_NLP/blob/master/image/dialoGPT1.png?raw=true)
- The model architecture follows GPT-2 as is

- Existing generative models predict the next word in an auto-regressive manner.

![image.png](https://github.com/DeepHaeJoong/SGU_2022_NLP/blob/master/image/dialogGPT3.png?raw=true)
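
For reference, the auto-regressive factorization behind this kind of next-word prediction can be written as below. The notation is an assumption chosen for this sketch: $x_1, \dots, x_N$ are the tokens of the concatenated dialogue and the first $m$ tokens form the source (dialogue history).

```latex
p(x_{m+1}, \dots, x_N \mid x_1, \dots, x_m)
  = \prod_{n=m+1}^{N} p(x_n \mid x_1, \dots, x_{n-1})
```

Each factor is one next-token prediction, which is the quantity the sampling loop later in this notebook draws from.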

- However, training directly on the data in this way is notorious for generating bland, uninformative responses such as "I don't know" or "I'm OK". Such replies are common in human conversation, so a model trained to maximize their likelihood naturally learns to produce them.

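- To mitigate this problem, the paper applies "Mutual Information Maximization": a maximum mutual information ("MMI") scoring function that is used to rerank generated responses.

    - Mutual Information Maximization assumes that the source and the response are not independent and favors responses that maximize this dependence. In practice, a pre-trained backward model scores how well each response predicts the source, i.e. P(source | response).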

![image.png](https://github.com/DeepHaeJoong/SGU_2022_NLP/blob/master/image/dialoGPT2.png?raw=true)
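
The reranking idea can be illustrated with a toy example. This is only a sketch under stated assumptions: the candidate texts and all probabilities are made up, and `rerank_by_backward_score` is a hypothetical helper, not a function from the DialoGPT code base. It shows how a bland reply can win on forward likelihood yet lose once candidates are ranked by a backward score.

```python
import math

# Hypothetical candidate responses with made-up probabilities.
# p_forward  ~ P(response | source) from a forward model
# p_backward ~ P(source | response) from a backward (reverse) model
candidates = [
    {"text": "I don't know.", "p_forward": 0.30, "p_backward": 0.01},
    {"text": "Seoul is the capital of South Korea.", "p_forward": 0.05, "p_backward": 0.20},
    {"text": "I'm OK.", "p_forward": 0.25, "p_backward": 0.02},
]


def rerank_by_backward_score(cands):
    """Sort candidates by log P(source | response), highest first."""
    return sorted(cands, key=lambda c: math.log(c["p_backward"]), reverse=True)


best_by_likelihood = max(candidates, key=lambda c: c["p_forward"])
best_by_mmi = rerank_by_backward_score(candidates)[0]

print("likelihood pick:", best_by_likelihood["text"])
print("MMI pick:       ", best_by_mmi["text"])
```

A generic reply fits almost any source, so it says little about which source produced it and receives a low backward score.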

### Defining the Tokenizer
- Reuses the tokenizer from GPT-2 training as is
- Purpose: text preprocessing
- Set the paths to `medium/345M/merges.txt` and `medium/345M/vocab.json`, downloaded from [`DialoGPT/configs/345M/`](https://github.com/microsoft/DialoGPT/tree/master/configs/345M).

In [23]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config

# Mind the paths to vocab.json and merges.txt (both come from Microsoft's official DialoGPT repository)
tokenizer = GPT2Tokenizer('medium/345M/vocab.json', 'medium/345M/merges.txt')

In [24]:
tokenizer

PreTrainedTokenizer(name_or_path='', vocab_size=50257, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)})

### Defining DialoGPT

#### forward model (pre-trained model)

- From the "Fine-tuned from GPT-2" column, download [`DialoGPT 345M model : medium_ft.pkl`](https://github.com/microsoft/DialoGPT) and save it as `medium/medium_ft.pkl`.
- All files downloaded from [`DialoGPT/configs/345M/`](https://github.com/microsoft/DialoGPT/tree/master/configs/345M) are stored under `medium/345M/`, for example `medium/345M/config.json`.

In [25]:
# ! pip -q install transformers

In [26]:
import torch
import torch.nn.functional as F
from config import device_f, device_r, num_samples, MMI_temperature, top_k

In [12]:
torch.set_grad_enabled(False)

# Load all pre-trained weights (onto the CPU first; the model is moved to device_f below)
weights = torch.load('medium/medium_ft.pkl', map_location='cpu')
# fix misused key value
weights["lm_head.weight"] = weights["lm_head.decoder.weight"]
weights.pop("lm_head.decoder.weight", None)

# The hyper-parameters stored in "config.json" are as follows.
# {
#   "attn_pdrop": 0.1,
#   "embd_pdrop": 0.1,
#   "initializer_range": 0.02,
#   "layer_norm_epsilon": 1e-05,
#   "n_ctx": 1024,
#   "n_embd": 1024,
#   "n_head": 16,
#   "n_layer": 24,
#   "n_positions": 1024,
#   "n_special": 0,
#   "predict_special_tokens": true,
#   "resid_pdrop": 0.1,
#   "vocab_size": 50257
# }
# cfg
cfg = GPT2Config.from_json_file('medium/345M/config.json')
model: GPT2LMHeadModel = GPT2LMHeadModel(cfg)
model.load_state_dict(weights,strict=False)

if device_f == 'cuda':
    model.half()
# Move the model to the target device
model.to(device_f)
# Inference mode: disables dropout (gradients are already off via torch.set_grad_enabled(False))
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

In [38]:
cfg

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}

In [33]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 354,823,168 trainable parameters
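
The reported figure can be cross-checked by hand from the values in `config.json`. The helper below is a sketch with a hypothetical name (`gpt2_param_count`); it assumes the MLP hidden size is 4 × `n_embd` (the default when `n_inner` is null) and that the LM head shares its weight with the token embedding, so it is not counted twice.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count GPT-2 parameters, assuming the LM head shares the token embedding."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d      # wte + wpe
    layer_norm = 2 * d                                 # weight + bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)      # c_attn + c_proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)        # c_fc + c_proj
    block = 2 * layer_norm + attention + mlp           # ln_1, ln_2, attn, mlp
    return embeddings + n_layer * block + layer_norm   # plus the final ln_f


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=1024, n_layer=24)
print(f"{total:,}")  # 354,823,168
```

The result matches the count printed above, which suggests the checkpoint was loaded into the 24-layer, 1024-dimensional configuration as intended.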


#### backward model (pre-trained model)

- From the "Fine-tuned from GPT-2" column, download [`DialoGPT 345M model (reverse, for MMI) : small_reverse.pkl`](https://github.com/microsoft/DialoGPT) and save it as `medium/small_reverse.pkl`.

In [34]:
# Load all pre-trained weights (onto the CPU first; the model is moved to device_r below)
weights = torch.load('medium/small_reverse.pkl', map_location='cpu')
# fix misused key value
weights["lm_head.weight"] = weights["lm_head.decoder.weight"]
weights.pop("lm_head.decoder.weight", None)

reverse_model: GPT2LMHeadModel = GPT2LMHeadModel(cfg)
reverse_model.load_state_dict(weights,strict=False)
if device_r == 'cuda':
    reverse_model.half()
reverse_model.to(device_r)
# Inference mode: disables dropout (gradients are already off via torch.set_grad_enabled(False))
reverse_model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

In [35]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(reverse_model):,} trainable parameters')

The model has 354,823,168 trainable parameters


### Try to chat with DialoGPT without fine-tuning.

In [39]:
# Define end_token: generation stops when the predicted token equals end_token
end_token = torch.tensor([[50256]], dtype=torch.long)
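
The end token also serves as the delimiter between dialogue turns when the history is fed to the model. The sketch below illustrates this with plain Python lists instead of tensors; `flatten_history` and the token ids in `history` are hypothetical and exist only for this example. It also shows one way to keep only the newest turns that fit a token budget.

```python
EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def flatten_history(turns, max_tokens=64):
    """Join tokenized turns, each followed by EOS, keeping only the newest
    turns whose combined length fits within max_tokens."""
    kept = []
    total = 0
    for turn in reversed(turns):
        length = len(turn) + 1              # +1 for the EOS delimiter
        if kept and total + length > max_tokens:
            break                           # drop this turn and all older ones
        kept.append(turn + [EOS_ID])
        total += length
    kept.reverse()                          # restore chronological order
    return [tok for turn in kept for tok in turn]


# Hypothetical token ids standing in for three tokenized turns.
history = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
print(flatten_history(history, max_tokens=8))
# [21, 22, 50256, 31, 32, 33, 34, 50256]
```

The newest turn is always kept, even when it alone exceeds the budget, so the model never receives an empty context.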

- Functions required for inference with `maximum mutual information`

In [36]:
def _get_response(output_token, past):
    # Start from an empty (1, 0) tensor and append one sampled token per step
    out = torch.tensor([[]], dtype=torch.long, device=device_f)

    while True:
        util = model(output_token, past_key_values=past)
        output_token, past = util['logits'], util['past_key_values']
        # Keep only the logits for the last position
        output_token = output_token[:, -1, :].float()
        # Top-k sampling: find every logit below the k-th largest one
        indices_to_remove = output_token < torch.topk(output_token, top_k)[0][..., -1, None]
        # Top-k sampling: mask those logits with -inf
        output_token[indices_to_remove] = -float('Inf')

        # Sample one token from the softmax distribution;
        # the -inf logits receive probability 0 and are never drawn
        output_token = torch.multinomial(F.softmax(output_token, dim=-1), num_samples=1)

        out = torch.cat((out, output_token), dim=1)
        # Stop once the end token is sampled: the response is complete
        if output_token.item() == end_token.item():
            break

    return out, past


def _score_response(output_token, correct_token):
    inputs = torch.cat((output_token, correct_token), dim=1)
    mask = torch.full_like(output_token, -100, dtype=torch.long)
    labels = torch.cat((mask, correct_token), dim=1)
    # Negate the loss: the lower the reverse model's loss, the higher the score
    score = -reverse_model(inputs, labels=labels)['loss'].float()

    return score


def append_messages(old_list: list, new_list: list, truncate_length=64):
    # Tokenize every non-empty message, append the end token, and add it to old_list
    for message in new_list:
        if message != '':
            input_token = tokenizer.encode(message, return_tensors='pt')
            input_token = torch.cat((input_token, end_token), dim=1)
            old_list.append(input_token)

    if len(old_list) == 0:
        old_list.append(end_token)

    # truncate
    # Drop the oldest messages first (FIFO) once the history exceeds truncate_length tokens.
    # This keeps inference cheap and resembles how people forget older parts of a conversation.
    total_length = 0
    for i, message in enumerate(reversed(old_list)):
        total_length += message.shape[1]
        if total_length > truncate_length:
            # Keep the i newest messages, but never fewer than one
            old_list[:] = old_list[-max(i, 1):]
            break


def generate_message(message_list: list, focus_last_message=True):
    total_input = torch.cat(message_list, dim=1).to(device_f)
    if focus_last_message:
        total_input_reversed = message_list[-1]
    else:
        total_input_reversed = torch.cat(list(reversed(message_list)), dim=1)

    # Encode everything except the last token once and cache the key/value states
    past = None
    if total_input.shape[1] > 1:
        past = model(total_input[:, :-1])['past_key_values']

    results = []
    for i in range(num_samples):
        result = _get_response(total_input[:, -1:], past)
        score = _score_response(result[0].to(device_r), total_input_reversed.to(device_r))
        results.append(result + (score,))

    scores = torch.stack([x[2] for x in results], dim=0)
    winner = torch.multinomial(F.softmax(scores / MMI_temperature, dim=0), num_samples=1).item()
    # winner = torch.argmax(scores, dim=0)

    out = results[winner][0]

    return tokenizer.decode(out.tolist()[0], skip_special_tokens=True)
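
The top-k sampling step inside `_get_response` can be tried in isolation. The following is a minimal sketch in plain Python rather than PyTorch; `top_k_sample` and the example `logits` are hypothetical and exist only for illustration. It masks every logit below the k-th largest with negative infinity and then samples from the softmax of what remains.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest logits, weighted by softmax."""
    # Threshold is the k-th largest logit; smaller logits are masked to -inf.
    threshold = sorted(logits, reverse=True)[k - 1]
    masked = [x if x >= threshold else -math.inf for x in logits]
    # Softmax numerator: exp(-inf) is exactly 0, so masked entries get zero weight.
    peak = max(masked)
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.1]   # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(100)}
print(samples)  # only indices 0 and 2 can appear
```

With `k=1` the procedure reduces to greedy decoding, while larger values of `k` trade determinism for more varied responses.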

### Example: chatting with the DialoGPT chatbot

In [37]:
my_message_list = []
while True:
    print("usr >> ",end="")
    my_message = input()
    if my_message=="quit":
        print("bot >> Quit. Chatting End")
        break
    append_messages(my_message_list, [my_message])
    my_response = generate_message(my_message_list)
    print('bot >>', my_response)

    append_messages(my_message_list, [my_response])

usr >> I'm Tired
bot >> Hi tired. Where did you grow up?
usr >> I'm from South Korea
bot >> South Korea isn't an island though.
usr >> that's right. In a way, it's a peninsula
bot >> In a way, it's a peninsula
usr >>  How did you know it wasn't an island?
bot >> How did you know it wasn't an island?
usr >> can you answer my question?
bot >> I'm not allowed to answer questions on this subreddit.
usr >>  Let's stop, it was a pleasure talking to you
bot >> You know you want to talk to me.
usr >>  I'll go then bye
bot >> waves Bye!
usr >> 

KeyboardInterrupt: Interrupted by user

### Reference

- https://github.com/microsoft/DialoGPT (official)
- https://github.com/LHolten/DialoGPT-MMI-decoder (third-party / DialoGPT + MMI)