## Self-Attention with Relative Position Representations: Hands-On Implementation

- This walkthrough implements the paper on top of the Transformer architecture proposed in Attention Is All You Need (NIPS 2017).
- Only the self-attention module of that architecture is changed, and we check whether that change alone improves performance.
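
The paper in the title (Shaw et al., 2018) changes self-attention by adding learned embeddings of the clipped relative distance between a query position `i` and a key position `j`, that is `clip(j - i, -k, k)`. As a minimal, framework-free sketch of just that indexing step (the helper name `relative_position_index` and the values of `seq_len` and `k` are illustrative assumptions, not code from this notebook):

```python
from typing import List


def relative_position_index(seq_len: int, k: int) -> List[List[int]]:
    """Return a seq_len x seq_len table of embedding indices in [0, 2k].

    Entry [i][j] is clip(j - i, -k, k) + k, so it can index a table that
    holds 2k + 1 relative-position embeddings.
    """
    table = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            clipped = max(-k, min(k, j - i))  # distances beyond k share one embedding
            row.append(clipped + k)           # shift to a non-negative index
        table.append(row)
    return table


index = relative_position_index(seq_len=5, k=2)
for row in index:
    print(row)
# [2, 3, 4, 4, 4]
# [1, 2, 3, 4, 4]
# [0, 1, 2, 3, 4]
# [0, 0, 1, 2, 3]
# [0, 0, 0, 1, 2]
```

Every diagonal of the table holds a single value, which is why the number of embeddings depends on `k` rather than on the sequence length.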

#### Data Preprocessing
- Load **Multi30k**, a widely used English-German dataset, through the Hugging Face `datasets` API.

In [11]:
from datasets import load_dataset

dataset = load_dataset("bentrevett/multi30k")

train_dataset, validation_dataset, test_dataset = dataset['train'], dataset['validation'], dataset['test']

In [19]:
print(train_dataset[0])

{'en': 'Two young, White males are outside near many bushes.', 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}


- Create the **tokenizers** and **vocabularies**

In [22]:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

In [32]:
unknown_token = "<unk>"

def initialize_tokenizer() -> Tokenizer:
    tokenizer = Tokenizer(WordLevel(unk_token=unknown_token))
    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer

de_tokenizer, en_tokenizer = [initialize_tokenizer() for _ in range(2)]

In [38]:
# Create the trainer used to fit the tokenizers
pad_token, sos_token, eos_token = "<pad>", "<sos>", "<eos>"
special_tokens = [unknown_token, pad_token, sos_token, eos_token]

trainer = WordLevelTrainer(special_tokens=special_tokens, min_frequency=2)

In [40]:
# Train the tokenizers
train_de, train_en = train_dataset['de'], train_dataset['en']

de_tokenizer.train_from_iterator(train_de, trainer=trainer)
en_tokenizer.train_from_iterator(train_en, trainer=trainer)

In [56]:
# Check the tokenizer training results

print("[DE] vocab size: {}".format(de_tokenizer.get_vocab_size()))
print("[EN] vocab size: {}".format(en_tokenizer.get_vocab_size()))

print("[DE] Sample DE vocab tokens: {}".format(list(de_tokenizer.get_vocab().keys())[:10]))
print("[EN] Sample EN vocab tokens: {}".format(list(en_tokenizer.get_vocab().keys())[:10]))

[DE] vocab size: 8060
[EN] vocab size: 6203
[DE] Sample DE vocab tokens: ['braunhaariger', 'Hochstart', 'gescheckter', 'Picknick', 'Ellenbogen', 'kürzlich', 'zuhört', 'anschneiden', 'plantscht', 'sucht']
[EN] Sample EN vocab tokens: ['final', 'pastor', 'brown', 'crowded', 'classroom', 'practicing', 'peace', 'blazer', 'pancake', 'designing']


In [60]:
# Check the special-token ids
for special_token in special_tokens:
    print("[DE] special token: {}, index: {}".format(special_token, de_tokenizer.get_vocab()[special_token]))
    print("[EN] special token: {}, index: {}".format(special_token, en_tokenizer.get_vocab()[special_token]))

[DE] special token: <unk>, index: 0
[EN] special token: <unk>, index: 0
[DE] special token: <pad>, index: 1
[EN] special token: <pad>, index: 1
[DE] special token: <sos>, index: 2
[EN] special token: <sos>, index: 2
[DE] special token: <eos>, index: 3
[EN] special token: <eos>, index: 3
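
The output above shows the special tokens occupying ids 0 to 3, and `min_frequency=2` means words seen only once fall back to `<unk>`. A simplified pure-Python sketch of that behavior follows. It is not the `tokenizers` implementation. The helper names `build_word_level_vocab` and `encode` are hypothetical, it splits on whitespace only, and the ordering of non-special words is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, List


def build_word_level_vocab(sentences: List[str],
                           special_tokens: List[str],
                           min_frequency: int) -> Dict[str, int]:
    # Special tokens take the first ids, in the order given.
    vocab = {token: idx for idx, token in enumerate(special_tokens)}
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Assumption for this sketch: most frequent words first, ties broken alphabetically.
    for word, freq in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if freq >= min_frequency and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence: str, vocab: Dict[str, int], unk_token: str = "<unk>") -> List[int]:
    # Words missing from the vocabulary map to the unknown-token id.
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]


toy_sentences = ["a dog runs", "a dog sleeps", "a cat runs"]
toy_vocab = build_word_level_vocab(
    toy_sentences, ["<unk>", "<pad>", "<sos>", "<eos>"], min_frequency=2
)
print(toy_vocab)
# {'<unk>': 0, '<pad>': 1, '<sos>': 2, '<eos>': 3, 'a': 4, 'dog': 5, 'runs': 6}
print(encode("a cat runs fast", toy_vocab))
# [4, 0, 6, 0]
```

Here `cat` appears once in the toy corpus, so it is dropped by the frequency threshold and encoded as `<unk>`, just like the unseen word `fast`.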


- Define the hyperparameters

In [65]:
class ModelConfiguration:
    def __init__(self, 
                 max_len: int = 768, 
                 batch_size: int = 32, 
                 hidden_size: int = 512, 
                 ffn_size: int = 2048,
                 num_heads: int = 8, 
                 num_layers: int = 6, 
                 dropout_pb: float = 0.1, 
                 src_vocab_size: int = 0, 
                 trg_vocab_size: int = 0
                ):
        self.max_len = max_len
        self.batch_size = batch_size
        self.hidden_size = hidden_size
        self.ffn_size = ffn_size
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.dropout_pb = dropout_pb
        self.src_vocab_size = src_vocab_size
        self.trg_vocab_size = trg_vocab_size

model_config = ModelConfiguration(src_vocab_size=de_tokenizer.get_vocab_size(), trg_vocab_size=en_tokenizer.get_vocab_size())
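
The defaults above match the base Transformer configuration (hidden size 512, 8 heads, feed-forward size 2048, 6 layers, dropout 0.1). Multi-head attention splits the hidden vector evenly across heads, so `hidden_size` has to be divisible by `num_heads`. A small sanity check, where `head_dimension` is a hypothetical helper introduced only for this illustration:

```python
def head_dimension(hidden_size: int, num_heads: int) -> int:
    """Per-head width when the hidden vector is split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


# Default values of ModelConfiguration.
head_dim = head_dimension(hidden_size=512, num_heads=8)
ffn_ratio = 2048 // 512  # feed-forward expansion factor
print(head_dim, ffn_ratio)  # 64 4
```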

- Data preprocessing
    - Padding the data, etc.

In [71]:
de_pad_id, en_pad_id = de_tokenizer.token_to_id(pad_token), en_tokenizer.token_to_id(pad_token)
de_sos_id, en_sos_id = de_tokenizer.token_to_id(sos_token), en_tokenizer.token_to_id(sos_token)
de_eos_id, en_eos_id = de_tokenizer.token_to_id(eos_token), en_tokenizer.token_to_id(eos_token)

In [77]:
# input: {"en": "example_en", "de": "example_de"}
# output: {"encoder_input_ids": [], "encoder_attention_mask": [], "decoder_input_ids": [], "decoder_attention_mask": [], "labels": []}
def preprocess(example: dict) -> dict:
    max_len = model_config.max_len

    # Convert the text to token ids
    src_input_ids = de_tokenizer.encode(example['de']).ids
    trg_input_ids = en_tokenizer.encode(example['en']).ids

    # decoder input
    decoder_input = [en_sos_id] + trg_input_ids
    labels = trg_input_ids + [en_eos_id]

    # padding
    encoder_input = src_input_ids[:max_len] + [de_pad_id] * max(0, max_len - len(src_input_ids))
    decoder_input = decoder_input[:max_len] + [en_pad_id] * max(0, max_len - len(decoder_input))
    labels = labels[:max_len] + [en_pad_id] * max(0, max_len - len(labels))
    # Optional: replace pad ids with -100 (the default ignore_index of PyTorch's CrossEntropyLoss) so the loss skips padding
    labels = [token if token != en_pad_id else -100 for token in labels]

    # Attention mask
    encoder_attention_mask = [1 if token != de_pad_id else 0 for token in encoder_input]
    decoder_attention_mask = [1 if token != en_pad_id else 0 for token in decoder_input]

    return {
        "encoder_input_ids" : encoder_input,
        "encoder_attention_mask" : encoder_attention_mask,
        "decoder_input_ids" : decoder_input,
        "decoder_attention_mask" : decoder_attention_mask,
        "labels" : labels
    }
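
To make the target-side shift in `preprocess` concrete, here is a toy re-run of the same steps on made-up token ids. The helper name `shift_and_pad` and the ids `10, 11, 12` are illustrative assumptions. The special ids follow the special-token check earlier in the notebook.

```python
from typing import Dict, List

PAD, SOS, EOS, IGNORE = 1, 2, 3, -100  # <pad>, <sos>, <eos>, loss ignore index


def shift_and_pad(target_ids: List[int], max_len: int) -> Dict[str, List[int]]:
    # The decoder sees <sos> + target; it must predict target + <eos>.
    decoder_input = [SOS] + target_ids
    labels = target_ids + [EOS]

    # Truncate or pad both sequences to exactly max_len.
    decoder_input = decoder_input[:max_len] + [PAD] * max(0, max_len - len(decoder_input))
    labels = labels[:max_len] + [PAD] * max(0, max_len - len(labels))

    # Padding positions are excluded from the loss and from attention.
    labels = [IGNORE if token == PAD else token for token in labels]
    mask = [0 if token == PAD else 1 for token in decoder_input]
    return {"decoder_input_ids": decoder_input, "decoder_attention_mask": mask, "labels": labels}


toy = shift_and_pad([10, 11, 12], max_len=6)
print(toy["decoder_input_ids"])       # [2, 10, 11, 12, 1, 1]
print(toy["labels"])                  # [10, 11, 12, 3, -100, -100]
print(toy["decoder_attention_mask"])  # [1, 1, 1, 1, 0, 0]

truncated = shift_and_pad([10, 11, 12], max_len=3)
print(truncated["labels"])            # [10, 11, 12]
```

The last call shows a side effect of truncating after the shift: once a target has `max_len` or more tokens, the `<eos>` label is cut off. With `max_len = 768` this is unlikely to matter for short caption-style sentences, but it is worth keeping in mind if `max_len` is lowered.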

In [79]:
train_dataset = train_dataset.map(preprocess, remove_columns=['en', 'de'])
validation_dataset = validation_dataset.map(preprocess, remove_columns=['en', 'de'])
test_dataset = test_dataset.map(preprocess, remove_columns=['en', 'de'])

Map: 100%|███████████████████████| 29000/29000 [00:08<00:00, 3587.86 examples/s]
Map: 100%|█████████████████████████| 1014/1014 [00:00<00:00, 3741.02 examples/s]
Map: 100%|█████████████████████████| 1000/1000 [00:00<00:00, 3695.27 examples/s]
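
The `decoder_attention_mask` built above marks padding only. A Transformer decoder also has to stop each position from attending to later positions. How the model in this notebook combines the two is not shown in this section, so the following is only a framework-free sketch of one common approach, with `causal_padding_mask` as a hypothetical helper name:

```python
from typing import List


def causal_padding_mask(attention_mask: List[int]) -> List[List[int]]:
    """combined[i][j] == 1 when query position i may attend to key position j.

    A key is visible only if it is not padding and not in the future (j <= i).
    """
    n = len(attention_mask)
    return [
        [1 if j <= i and attention_mask[j] == 1 else 0 for j in range(n)]
        for i in range(n)
    ]


mask = causal_padding_mask([1, 1, 1, 0])
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

The last row belongs to a padded query position. Its output is discarded anyway because the matching label is `-100`.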


- Set up the DataLoaders

In [84]:
import torch

def collate_function(batch):
    return {
        key: torch.tensor([data[key] for data in batch], dtype=torch.long) for key in batch[0]
    }
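
Because `preprocess` pads every example to `max_len` (768 by default), each batch is 768 tokens wide even when the longest sentence in it is far shorter. One optional refinement, not part of the original notebook, is to trim each batch to its longest unpadded length before building tensors. A minimal sketch on plain lists, with `trim_to_longest` as a hypothetical helper:

```python
from typing import List


def trim_to_longest(rows: List[List[int]], masks: List[List[int]]) -> List[List[int]]:
    """Cut every row to the longest unpadded length found in `masks`.

    Assumes padding sits only at the end of each row, as in `preprocess`.
    """
    longest = max(sum(mask) for mask in masks)
    return [row[:longest] for row in rows]


encoder_input_ids = [[5, 6, 7, 1, 1, 1], [8, 9, 1, 1, 1, 1]]
encoder_attention_mask = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]]

print(trim_to_longest(encoder_input_ids, encoder_attention_mask))
# [[5, 6, 7], [8, 9, 1]]
print(trim_to_longest(encoder_attention_mask, encoder_attention_mask))
# [[1, 1, 1], [1, 1, 0]]
```

The decoder side would be trimmed with `decoder_attention_mask`. The same length applies to `labels`, since `preprocess` gives the decoder input and the labels the same number of real tokens.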

In [86]:
from torch.utils.data import DataLoader

batch_size = model_config.batch_size

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_function)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_function)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_function)

In [None]:
# Inspect one sample batch
batch = next(iter(train_loader))

for key, value in batch.items():
    print("{}: shape={}".format(key, tuple(value.shape)))