## Attention Is All You Need: hands-on implementation
- This code follows the Attention Is All You Need paper (NIPS 2017) as closely as possible.
- Some parts deliberately deviate from the paper and reflect more recent architectural choices (for example, the positional embedding).

**Transformer diagram**
<img src="./resources/attention-is-all-you-need/transformer-architecture.webp" alt="Transformer architecture" width="400" height="400">

### Data preprocessing

In [2]:
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("bentrevett/multi30k")


In [3]:
train_dataset, validation_dataset, test_dataset = dataset['train'], dataset['validation'], dataset['test']

print(train_dataset[0])

{'en': 'Two young, White males are outside near many bushes.', 'de': 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'}


- Create the **tokenizer** and **vocab**

In [5]:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

In [6]:
# Initialize the word-level tokenizers
unknown_token = "<unk>"
pad_token, sos_token, eos_token = "<pad>", "<sos>", "<eos>"
special_tokens = [unknown_token, pad_token, sos_token, eos_token]

def generate_tokenizer() -> Tokenizer:
    tokenizer = Tokenizer(WordLevel(unk_token=unknown_token))
    tokenizer.pre_tokenizer = Whitespace()

    return tokenizer

en_tokenizer, de_tokenizer = generate_tokenizer(), generate_tokenizer()

In [7]:
# Configure the trainer (it builds the vocab)
# The trainer works at the word level and only keeps words that appear at least twice
trainer = WordLevelTrainer(special_tokens = special_tokens, min_frequency = 2)

In [8]:
# Train the tokenizers
train_en, train_de = train_dataset['en'], train_dataset['de']

en_tokenizer.train_from_iterator(train_en, trainer=trainer)
de_tokenizer.train_from_iterator(train_de, trainer=trainer)

In [9]:
# Check the vocab sizes
print("[EN] vocab size: {}".format(en_tokenizer.get_vocab_size()))
print("[DE] vocab size: {}".format(de_tokenizer.get_vocab_size()))

# Print a few sample vocab tokens
print("[EN] Sample EN vocab tokens: {}".format(list(en_tokenizer.get_vocab().keys())[:10]))
print("[DE] Sample DE vocab tokens: {}".format(list(de_tokenizer.get_vocab().keys())[:10]))

[EN] vocab size: 6203
[DE] vocab size: 8060
[EN] Sample EN vocab tokens: ['sandwiches', 'freight', 'thing', 'bistro', 'hairnet', 'slip', 'convention', 'curb', 'bodies', 'tight']
[DE] Sample DE vocab tokens: ['Heimweg', 'lächelnde', 'liegt', 'daneben', 'Notebook', 'Elektronik', 'Einsatzkräfte', 'Fahrradständer', 'grün', 'Geburtstagsfeier']


In [10]:
# Check the indices of the special tokens
for special_token in special_tokens:
    print("[EN] special token: {}, index: {}".format(special_token, en_tokenizer.get_vocab()[special_token]))
    print("[DE] special token: {}, index: {}".format(special_token, de_tokenizer.get_vocab()[special_token]))
    print("---")

[EN] special token: <unk>, index: 0
[DE] special token: <unk>, index: 0
---
[EN] special token: <pad>, index: 1
[DE] special token: <pad>, index: 1
---
[EN] special token: <sos>, index: 2
[DE] special token: <sos>, index: 2
---
[EN] special token: <eos>, index: 3
[DE] special token: <eos>, index: 3
---
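
To make the effect of `min_frequency = 2` concrete, here is a minimal pure-Python sketch of word-level vocab building. It is a simplified stand-in and not the `tokenizers` implementation: `build_word_vocab` and `encode` are hypothetical helpers, plain `str.split` replaces the `Whitespace` pre-tokenizer, and the ordering of regular words is an assumption made for determinism.

```python
from collections import Counter

def build_word_vocab(sentences, special_tokens, min_frequency=2):
    """Toy word-level vocab: special tokens first, then sufficiently frequent words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    vocab = {token: index for index, token in enumerate(special_tokens)}
    # Sort by descending frequency, then alphabetically (assumed ordering).
    for word, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_frequency and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_token="<unk>"):
    """Map each word to its id; words missing from the vocab become <unk>."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]

toy_corpus = ["a man walks", "a dog runs", "a man runs"]
toy_vocab = build_word_vocab(toy_corpus, ["<unk>", "<pad>", "<sos>", "<eos>"])

print(toy_vocab)
print(encode("a dog walks", toy_vocab))  # "dog" and "walks" appear only once
```

Words that occur only once never enter the vocab, so they are encoded as `<unk>` (id 0), just like rare words in the real tokenizers above.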


- Define the **hyperparameters**

In [12]:
class ModelConfiguration:
    def __init__(self, 
                 max_len: int = 768, 
                 batch_size: int = 16,
                 hidden_size: int = 512, 
                 ffn_size: int = 2048,
                 num_heads: int = 8, 
                 num_layers: int = 6, 
                 dropout_pb: float = 0.1, 
                 src_vocab_size: int = 0, 
                 trg_vocab_size: int = 0
                ):
        self.max_len = max_len
        self.batch_size = batch_size
        self.hidden_size = hidden_size
        self.ffn_size = ffn_size
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.dropout_pb = dropout_pb
        self.src_vocab_size = src_vocab_size
        self.trg_vocab_size = trg_vocab_size

model_config = ModelConfiguration(src_vocab_size=de_tokenizer.get_vocab_size(), trg_vocab_size=en_tokenizer.get_vocab_size())

- **Preprocess the data**

In [14]:
_, en_pad_id, en_sos_id, en_eos_id = map(lambda special_token: en_tokenizer.token_to_id(special_token), special_tokens)
_, de_pad_id, de_sos_id, de_eos_id = map(lambda special_token: de_tokenizer.token_to_id(special_token), special_tokens)

In [15]:
# input: {"en" : "example_en", "de" : "example_de"}
def preprocess(dataset: dict) -> dict:
    max_len = model_config.max_len

    # Convert each sentence to token ids
    src_input_ids = de_tokenizer.encode(dataset['de']).ids
    trg_input_ids = en_tokenizer.encode(dataset['en']).ids

    # Add the special tokens on the decoder side.
    # For the sentence "I am a student" the decoder predicts <sos> -> I, I -> am, ...,
    # so the decoder input starts with <sos> and the labels end with <eos>.
    decoder_input = [en_sos_id] + trg_input_ids
    labels = trg_input_ids + [en_eos_id]

    # Truncate to max_len, then pad up to max_len
    encoder_input = src_input_ids[:max_len] + [de_pad_id] * max(0, max_len - len(src_input_ids))
    decoder_input = decoder_input[:max_len] + [en_pad_id] * max(0, max_len - len(decoder_input))
    labels = labels[:max_len] + [en_pad_id] * max(0, max_len - len(labels))

    # Attention mask (1 if real token else 0)
    encoder_attention_mask = [1 if token != de_pad_id else 0 for token in encoder_input]
    decoder_attention_mask = [1 if token != en_pad_id else 0 for token in decoder_input]

    return {
        "encoder_input_ids" : encoder_input,
        "encoder_attention_mask" : encoder_attention_mask,
        "decoder_input_ids" : decoder_input,
        "decoder_attention_mask" : decoder_attention_mask,
        "labels" : labels
    }
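
The shift between `decoder_input` and `labels` is easiest to see on a tiny example. The sketch below isolates that step in pure Python; `build_decoder_pair` is a hypothetical helper, and the ids (`<pad>` = 1, `<sos>` = 2, `<eos>` = 3) mirror the special-token indices printed earlier.

```python
def build_decoder_pair(target_ids, sos_id, eos_id, pad_id, max_len):
    """Return (decoder_input, labels), each truncated and padded to max_len."""
    decoder_input = [sos_id] + target_ids   # what the decoder sees
    labels = target_ids + [eos_id]          # what the decoder should predict

    def pad(ids):
        ids = ids[:max_len]
        return ids + [pad_id] * (max_len - len(ids))

    return pad(decoder_input), pad(labels)

decoder_input, labels = build_decoder_pair(
    [10, 11, 12], sos_id=2, eos_id=3, pad_id=1, max_len=6
)
print(decoder_input)  # [2, 10, 11, 12, 1, 1]
print(labels)         # [10, 11, 12, 3, 1, 1]
```

At every position `t`, `labels[t]` is the token that should follow `decoder_input[t]`, which is what lets the whole sequence be trained in a single forward pass.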

In [16]:
# Apply the preprocessing
train_dataset = train_dataset.map(preprocess, remove_columns=['en', 'de'])
validation_dataset = validation_dataset.map(preprocess, remove_columns=['en', 'de'])
test_dataset = test_dataset.map(preprocess, remove_columns=['en', 'de'])

In [17]:
print(train_dataset[10])

{'encoder_input_ids': [14, 5654, 10, 810, 28, 8, 19, 4270, 276, 4, 1, 1, 1, 1, 1, ... (output truncated)

In [18]:
import torch

# collate function: merges the per-sample values in a batch into one tensor per key
def collate_function(batch):
    return {
        key: torch.tensor([data[key] for data in batch], dtype=torch.long) for key in batch[0]
    }

In [19]:
from torch.utils.data import DataLoader

# Set up the DataLoaders
batch_size = model_config.batch_size

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_function)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_function)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_function)

In [20]:
# Inspect one sample batch
batch = next(iter(train_loader))

for key, value in batch.items():
    print("{}: shape={}".format(key, value.shape))

encoder_input_ids: shape=torch.Size([16, 768])
encoder_attention_mask: shape=torch.Size([16, 768])
decoder_input_ids: shape=torch.Size([16, 768])
decoder_attention_mask: shape=torch.Size([16, 768])
labels: shape=torch.Size([16, 768])


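## Token embedding
- The input embedding in Attention Is All You Need is the sum of two parts: the embedding of the token itself and a positional embedding
- The same embedding logic (the `Embeddings` class below) is used on both the encoder and the decoder side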

In [22]:
# Define the training device
import torch

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

In [23]:
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, max_len: int, dropout_prob: float):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.positional_embedding = nn.Embedding(max_len, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout_prob)

    # The input is a collated batch
    # input_ids = (batch_size, max_len)
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Create positional ids for the input sequence
        # positional_ids = (max_len) -> (1, max_len) -> (batch_size, max_len)
        # torch.arange yields a tensor holding 0 .. max_len - 1
        sequence_len = input_ids.size(1)
        # Use the device of the input so the module does not depend on a global variable
        positional_ids = torch.arange(sequence_len, device=input_ids.device)
        positional_ids = positional_ids.unsqueeze(0)
        positional_ids = positional_ids.expand_as(input_ids)

        # Embedding: (batch_size, max_len) -> (batch_size, max_len, hidden_size)
        token_embeddings = self.token_embedding(input_ids)
        positional_embeddings = self.positional_embedding(positional_ids)

        # Add + LayerNorm + Dropout
        embeddings = token_embeddings + positional_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)

        return embeddings

In [24]:
# Embedding check
embedding_layer = Embeddings(
    vocab_size=model_config.src_vocab_size,
    hidden_size=model_config.hidden_size,
    max_len=model_config.max_len,
    dropout_prob=model_config.dropout_pb
).to(device)

batch = next(iter(train_loader))
input_ids = batch['encoder_input_ids'].to(device)

embeddings = embedding_layer(input_ids)

# Check the result
print("Input shape: {}".format(input_ids.shape))
print("Embedding shape: {}".format(embeddings.shape))

Input shape: torch.Size([16, 768])
Embedding shape: torch.Size([16, 768, 512])
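
The `Embeddings` class above uses a learned positional embedding (`nn.Embedding`), which is one of the places where this notebook departs from the paper. For comparison, the fixed sinusoidal encoding described in the paper can be sketched in pure Python as follows; `sinusoidal_positional_encoding` is a hypothetical helper written for illustration and is not used elsewhere in this notebook.

```python
import math

def sinusoidal_positional_encoding(max_len, hidden_size):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(hidden_size):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * (dim // 2) / hidden_size))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(max_len=4, hidden_size=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(pe[1])
```

Because this table has no trainable parameters it can be computed once and added to the token embeddings, whereas the learned variant is limited to the `max_len` positions it was created with.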


## Implementing multi-head attention
- This section implements multi-head attention, the core of the Transformer architecture.

In [26]:
import torch.nn.functional as F
from typing import Optional, Tuple

# Scaled dot-product attention
# query, key, value: (batch_size, max_len, head_dim)
def scaled_dot_product_attention(query: torch.Tensor,
                                 key: torch.Tensor,
                                 value: torch.Tensor,
                                 mask: Optional[torch.Tensor] = None
                                ) -> Tuple[torch.Tensor, torch.Tensor]:
    # Size of the last dimension (head_dim when called from AttentionHead)
    dim_k = query.size(-1)

    # Attention scores: (batch_size, max_len, max_len)
    scores = torch.bmm(query, key.transpose(1, 2)) / (dim_k ** 0.5)

    # Where the mask is 0, fill the score with -inf so that softmax gives it zero weight
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # softmax over the key dimension
    attention_weights = F.softmax(scores, dim=-1)

    # attention weights * value
    # (max_len, max_len) * (max_len, head_dim) -> (max_len, head_dim)
    # output.shape=(batch_size, max_len, head_dim)
    output = torch.bmm(attention_weights, value)

    return output, attention_weights
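
To see what the masking and softmax steps do numerically, here is a minimal pure-Python sketch of the same computation for a single sequence of three 2-dimensional vectors. `toy_attention` and `softmax` are hypothetical helpers for illustration; the sketch assumes every row keeps at least one unmasked position.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(query, key, value, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence given as nested lists."""
    dim_k = len(query[0])
    outputs, weights = [], []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim_k) for k in key]
        if mask is not None:
            # Masked positions get -inf and therefore zero weight after softmax.
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[d] for w, v in zip(row, value))
                        for d in range(len(value[0]))])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_mask = [[j <= i for j in range(3)] for i in range(3)]
outputs, weights = toy_attention(x, x, x, causal_mask)

for row in weights:
    print([round(w, 3) for w in row])
```

With the causal mask, the first position can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]` and its output equals its own value vector.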

- Implement the **attention head**

In [28]:
class AttentionHead(nn.Module):
    def __init__(self, hidden_dim: int, head_dim: int):
        super().__init__()
        self.query_projection = nn.Linear(hidden_dim, head_dim)
        self.key_projection = nn.Linear(hidden_dim, head_dim)
        self.value_projection = nn.Linear(hidden_dim, head_dim)

    def forward(self,
                query: torch.Tensor,
                key: torch.Tensor,
                value: torch.Tensor,
                mask: torch.Tensor = None
               ) -> torch.Tensor:
        Q = self.query_projection(query)
        K = self.key_projection(key)
        V = self.value_projection(value)

        # Pass the mask through; the attention weights are not needed here
        attention_output, _ = scaled_dot_product_attention(Q, K, V, mask)

        # (batch_size, max_len, head_dim)
        return attention_output

- Implement **multi-head attention**

In [30]:
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        head_dim = hidden_dim // num_heads

        self.head_list = nn.ModuleList([AttentionHead(hidden_dim, head_dim) for _ in range(num_heads)])
        self.output_linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self,
               query: torch.Tensor,
               key: torch.Tensor,
               value: torch.Tensor,
               mask: torch.Tensor = None,
               ) -> torch.Tensor:
        # (batch_size, max_len, head_dim)
        head_output_list = [head(query, key, value, mask) for head in self.head_list]
        # (batch_size, max_len, hidden_dim)
        concat_attention = torch.concat(head_output_list, dim=-1)
        # (batch_size, max_len, hidden_dim)
        linear_output = self.output_linear(concat_attention)

        return linear_output

- Multi-head attention test

In [32]:
multi_head_attention = MultiHeadAttention(hidden_dim=model_config.hidden_size, num_heads=model_config.num_heads).to(device)

# Feed in the embeddings produced by the previous test
attn_output = multi_head_attention(embeddings, embeddings, embeddings)

print(attn_output.size())

torch.Size([16, 768, 512])
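
As a sanity check on the layout above (8 heads of size 512 // 8 = 64, followed by one 512 x 512 output projection), the number of trainable parameters can be worked out by hand. The arithmetic below assumes every `nn.Linear` keeps its bias, which is the PyTorch default; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features

hidden_dim, num_heads = 512, 8
head_dim = hidden_dim // num_heads                      # 64

per_head = 3 * linear_params(hidden_dim, head_dim)      # query, key, value projections
output_projection = linear_params(hidden_dim, hidden_dim)
total = num_heads * per_head + output_projection

print(head_dim, per_head, total)  # 64 98496 1050624
```

The total equals `4 * linear_params(512, 512)`, which shows that splitting the projections across heads does not change the parameter count compared with four full-width linear layers.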


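## Designing the position-wise feed-forward layer
- A fully connected network that sits inside every encoder and decoder layer, after the attention sub-layer, and is applied to each position independently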

In [34]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, dropout_pb: float):
        super().__init__()
        self.linear1 = nn.Linear(hidden_dim, ffn_dim)
        self.linear2 = nn.Linear(ffn_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout_pb)
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.dropout(x)

        return x

- Feed-forward test

In [36]:
feed_forward_layer = PositionwiseFeedForward(hidden_dim=model_config.hidden_size, ffn_dim=model_config.ffn_size, dropout_pb=model_config.dropout_pb).to(device)

ff_outputs = feed_forward_layer(attn_output)

print(ff_outputs.size())

torch.Size([16, 768, 512])
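
The feed-forward layer above uses GELU, while the original paper uses ReLU between the two linear layers. The sketch below compares the two activations on a few scalar inputs in pure Python, using the exact erf-based GELU formula, which is also the default form of `nn.GELU`; the function names are hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output, and it is smooth around zero.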


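## Implementing the mask functions
- The Transformer needs masking, and broadly the following two masks are required:
  - Mask used in **Masked Self Attention**: keeps the decoder from looking ahead at future tokens, which would let it cheat during prediction
  - Mask used in **Encoder Decoder Attention**: keeps the `<pad>` tokens of the encoder input from influencing the prediction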

In [38]:
# Mask that prevents look-ahead
def create_look_ahead_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Lower-triangular (seq_len, seq_len) matrix filled with ones
    return torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()

# Padding mask that makes attention ignore <pad>
# input_ids : (batch_size, seq_len)
# output : (batch_size, 1, seq_len)
def create_padding_mask(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    return (input_ids != pad_token_id).unsqueeze(1)

# output : (batch_size, 1, src_len)
def create_memory_mask(src_input_ids: torch.Tensor, src_pad_id: int, device: torch.device) -> torch.Tensor:
    return create_padding_mask(src_input_ids, src_pad_id)

# output : (batch_size, trg_len, trg_len)
def create_trg_mask(trg_input_ids: torch.Tensor, trg_pad_id: int, device: torch.device) -> torch.Tensor:
    # Decoder padding mask
    # (batch_size, 1, trg_len)
    trg_padding_mask = create_padding_mask(trg_input_ids, trg_pad_id)

    # Decoder Look-ahead mask
    # (1, trg_len, trg_len)
    look_ahead_mask = create_look_ahead_mask(trg_input_ids.size(-1), device).unsqueeze(0)

    # (1, trg_len, trg_len) & (batch_size, 1, trg_len) -> (batch_size, trg_len, trg_len)
    return look_ahead_mask & trg_padding_mask
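
The way the look-ahead mask and the padding mask combine is easier to follow on a small case. The pure-Python sketch below reproduces the `&` of the two masks for a single 4-token target sequence whose last token is `<pad>` (id 1); the helper names are hypothetical and nested lists stand in for tensors.

```python
def look_ahead_mask(seq_len):
    """Row i may attend to columns 0..i (lower-triangular)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def padding_mask(ids, pad_id):
    """True for real tokens, False for <pad>."""
    return [token != pad_id for token in ids]

def target_mask(ids, pad_id):
    causal = look_ahead_mask(len(ids))
    keep = padding_mask(ids, pad_id)
    # The padding mask is broadcast over rows: a <pad> column is hidden from every query.
    return [[causal[i][j] and keep[j] for j in range(len(ids))]
            for i in range(len(ids))]

ids = [2, 10, 11, 1]  # <sos>, two words, <pad>
mask = target_mask(ids, pad_id=1)
for row in mask:
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

Because padding is appended at the end and the first token is always `<sos>`, every row keeps at least one visible column, so no softmax row consists only of `-inf`.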

## Designing the Transformer encoder

**Transformer diagram**
<img src="./resources/attention-is-all-you-need/transformer-architecture.webp" alt="Transformer architecture" width="400" height="400">

In [40]:
class EncoderLayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_heads: int, dropout_pb: float):
        super().__init__()
        self.self_attention = MultiHeadAttention(hidden_dim, num_heads)
        self.feed_forward_layer = PositionwiseFeedForward(hidden_dim, ffn_dim, dropout_pb)

        self.norm1, self.norm2 = [nn.LayerNorm(hidden_dim) for _ in range(2)]
        self.dropout1, self.dropout2 = [nn.Dropout(dropout_pb) for _ in range(2)]

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        x_norm1 = self.norm1(x)
        attention_output = self.self_attention(x_norm1, x_norm1, x_norm1, mask)

        # Dropout -> Residual Addition
        x = x + self.dropout1(attention_output)

        x_norm2 = self.norm2(x)
        ffn_output = self.feed_forward_layer(x_norm2)

        # Dropout -> Residual Addition
        x = x + self.dropout2(ffn_output)

        return x

In [41]:
class TransformerEncoder(nn.Module):
    def __init__(self,
                 num_layers: int,
                 hidden_dim: int,
                 ffn_dim: int,
                 num_heads: int,
                 dropout_pb: float
                ):
        super().__init__()
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(hidden_dim, ffn_dim, num_heads, dropout_pb) for _ in range(num_layers)
        ])

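    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # The optional mask lets callers hide <pad> positions in the encoder self-attention
        for layer in self.encoder_layers:
            x = layer(x, mask)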

        return x

- Transformer encoder test

In [43]:
example_encoder = TransformerEncoder(model_config.num_layers, model_config.hidden_size, model_config.ffn_size, model_config.num_heads, model_config.dropout_pb).to(device)
encoder_output = example_encoder(embeddings)

print(encoder_output.size())

torch.Size([16, 768, 512])
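
`EncoderLayer` above normalizes the input before each sub-layer and then adds the residual (often called pre-LN), whereas the original paper normalizes after the residual addition (post-LN). The pure-Python sketch below contrasts the two orderings on one vector. It is a simplification under stated assumptions: `layer_norm` has no learnable scale or shift, dropout is omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention / feed-forward.
    return [2.0 * v for v in x]

def pre_ln_block(x, sublayer):
    # x + Sublayer(LayerNorm(x)): the ordering used in this notebook
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)): the ordering in the original paper
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print([round(v, 3) for v in pre])
print([round(v, 3) for v in post])
```

In the pre-LN ordering the residual path carries `x` through unnormalized (the output mean stays at 2.5 here), while the post-LN output is always re-normalized to zero mean.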


## Designing the Transformer decoder
- Define masked multi-head attention
- Define the decoder layer
- Stack several decoder layers to build the Transformer decoder
- Define the output layer

In [45]:
class DecoderLayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_heads: int, dropout_pb: float):
        super().__init__()
        self.self_attention = MultiHeadAttention(hidden_dim, num_heads)
        self.encoder_decoder_attention = MultiHeadAttention(hidden_dim, num_heads)
        self.feed_forward_layer = PositionwiseFeedForward(hidden_dim, ffn_dim, dropout_pb)

        self.norm1, self.norm2, self.norm3 = [nn.LayerNorm(hidden_dim) for _ in range(3)]
        self.dropout1, self.dropout2, self.dropout3 = [nn.Dropout(dropout_pb) for _ in range(3)]

    def forward(self, 
                x: torch.Tensor,
                encoder_output: torch.Tensor,
                trg_mask: torch.Tensor = None,
                memory_mask: torch.Tensor = None
               ) -> torch.Tensor:
        # Masked self-attention
        # The target mask must hide the future (answer) tokens and the <pad> tokens at the same time
        x_norm = self.norm1(x)
        self_attention_output = self.self_attention(x_norm, x_norm, x_norm, trg_mask)
        x = x + self.dropout1(self_attention_output)

        # Encoder-decoder attention
        # memory_mask keeps the <pad> positions of the encoder output out of the attention
        # The query comes from the decoder (target) side; keys and values come from the encoder output
        x_norm = self.norm2(x)
        encoder_decoder_attention_output = self.encoder_decoder_attention(x_norm, encoder_output, encoder_output, memory_mask)
        x = x + self.dropout2(encoder_decoder_attention_output)

        # Feed Forward
        x_norm = self.norm3(x)
        feed_forward_layer_output = self.feed_forward_layer(x_norm)
        x = x + self.dropout3(feed_forward_layer_output)

        return x

In [46]:
class TransformerDecoder(nn.Module):
    def __init__(self,
                 num_layers: int,
                 hidden_dim: int,
                 ffn_dim: int,
                 num_heads: int,
                 dropout_pb: float
                ):
        super().__init__()
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(hidden_dim, ffn_dim, num_heads, dropout_pb) for _ in range(num_layers)
        ])

    def forward(self,
                x: torch.Tensor,
                encoder_output: torch.Tensor,
                trg_mask: torch.Tensor = None,
                memory_mask: torch.Tensor = None
               ) -> torch.Tensor:
        for layer in self.decoder_layers:
            x = layer(x, encoder_output, trg_mask, memory_mask)

        return x

- Transformer decoder test

In [48]:
# Embedding check
embedding_layer = Embeddings(
    vocab_size=model_config.src_vocab_size,
    hidden_size=model_config.hidden_size,
    max_len=model_config.max_len,
    dropout_prob=model_config.dropout_pb
).to(device)

batch = next(iter(train_loader))

encoder_input_ids = batch['encoder_input_ids'].to(device)
decoder_input_ids = batch['decoder_input_ids'].to(device)

# Shape check only: the source-side embedding layer is reused for the decoder input ids
encoder_embeddings = embedding_layer(encoder_input_ids)
decoder_embeddings = embedding_layer(decoder_input_ids)

# Check the result
print("Encoder input shape: {}".format(encoder_input_ids.shape))
print("Encoder embedding shape: {}".format(encoder_embeddings.shape))

print("Decoder input shape: {}".format(decoder_input_ids.shape))
print("Decoder embedding shape: {}".format(decoder_embeddings.shape))

Encoder input shape: torch.Size([16, 768])
Encoder embedding shape: torch.Size([16, 768, 512])
Decoder input shape: torch.Size([16, 768])
Decoder embedding shape: torch.Size([16, 768, 512])


In [49]:
memory_mask = create_memory_mask(encoder_input_ids, de_pad_id, device)
trg_mask = create_trg_mask(decoder_input_ids, en_pad_id, device)

example_encoder = TransformerEncoder(model_config.num_layers, model_config.hidden_size, model_config.ffn_size, model_config.num_heads, model_config.dropout_pb).to(device)
encoder_output = example_encoder(encoder_embeddings)

example_decoder = TransformerDecoder(model_config.num_layers, model_config.hidden_size, model_config.ffn_size, model_config.num_heads, model_config.dropout_pb).to(device)
decoder_output = example_decoder(decoder_embeddings, encoder_output, trg_mask, memory_mask)

print(decoder_output.size())

torch.Size([16, 768, 512])


## Combining the Transformer encoder and decoder into the full model

In [51]:
class Transformer(nn.Module):
    def __init__(self):
        super().__init__()

        # Tokenizers
        self.src_tokenizer = de_tokenizer
        self.trg_tokenizer = en_tokenizer

        # Model configuration (created first so that every sub-module uses the same values)
        self.model_config = ModelConfiguration(src_vocab_size=self.src_tokenizer.get_vocab_size(),
                                               trg_vocab_size=self.trg_tokenizer.get_vocab_size()
                                              )
        config = self.model_config

        # Embedding layers
        self.src_embedding = Embeddings(config.src_vocab_size, config.hidden_size, config.max_len, config.dropout_pb)
        self.trg_embedding = Embeddings(config.trg_vocab_size, config.hidden_size, config.max_len, config.dropout_pb)

        # Encoder
        self.encoder = TransformerEncoder(config.num_layers,
                                          config.hidden_size,
                                          config.ffn_size,
                                          config.num_heads,
                                          config.dropout_pb)

        # Decoder
        self.decoder = TransformerDecoder(config.num_layers,
                                          config.hidden_size,
                                          config.ffn_size,
                                          config.num_heads,
                                          config.dropout_pb)

        # Final linear projection onto the target vocab
        self.output_linear = nn.Linear(config.hidden_size, config.trg_vocab_size)

    def forward(self, src_input_ids: torch.Tensor, trg_input_ids: torch.Tensor) -> torch.Tensor:
        # Embedding
        encoder_input = self.src_embedding(src_input_ids)
        decoder_input = self.trg_embedding(trg_input_ids)

        # Masks (created on the same device as the inputs)
        memory_mask = create_memory_mask(src_input_ids, de_pad_id, src_input_ids.device)
        trg_mask = create_trg_mask(trg_input_ids, en_pad_id, trg_input_ids.device)

        # Encoder
        encoder_output = self.encoder(encoder_input)

        # Decoder
        decoder_output = self.decoder(decoder_input, encoder_output, trg_mask, memory_mask)

        # Final Linear
        logits = self.output_linear(decoder_output)

        return logits

- Transformer model test

In [53]:
batch = next(iter(train_loader))

encoder_input_ids = batch['encoder_input_ids'].to(device)
decoder_input_ids = batch['decoder_input_ids'].to(device)

example_transformer = Transformer().to(device)
example_output = example_transformer(encoder_input_ids, decoder_input_ids)

In [97]:
# Convert the logits into a probability distribution
probs = F.softmax(example_output, dim=-1)

# Pick the token id with the highest probability at each position (greedy selection)
predicted_ids = torch.argmax(probs, dim=-1)  # shape: (batch_size, trg_seq_len)

# Decode every sample in the batch
for i in range(predicted_ids.size(0)):
    token_ids = predicted_ids[i].tolist()

    # Filter out the special tokens (ids as defined by en_tokenizer)
    filtered_ids = [tok_id for tok_id in token_ids if tok_id not in [en_pad_id, en_sos_id, en_eos_id]]

    # Decode ids to text
    decoded_text = en_tokenizer.decode(filtered_ids)
    print(f"Sample {i + 1} prediction: {decoded_text}")
    print()

Sample 1 prediction: than backs bushy peppers than town empanadas come marked grinding lemonade Station Station hid paintbrush licks Station shops Station Station hid Station seminar Station adventurous hid Someone lemonade tutu lemonade black seminar lemonade beagle hills expression hid expression hid worked hid seminar racket hid scales row backs lemonade Time Asia Station expression hid Station job london Station lemonade Station seminar Station plucking Station Station hid row infant two job ring Station likes Station expression Station Station Station cone City groups job Station two Station gives checkout tutu skyline Station hid backs lemonade dryer lemonade Station Station Time sizes Station Station expression directly Station present Station hills lemonade gather barriers Station Station trays dogs magic tutu seminar Station Station groups hills Station Station Station Station Station directly Station attention tutu Station possession Station seminar tutu Station Station beagl
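
The cell above takes the argmax over logits produced in one teacher-forced pass. At inference time there is no target sentence, so tokens are usually generated one at a time and fed back into the decoder. The pure-Python sketch below shows that loop under stated assumptions: `greedy_decode` is a hypothetical helper, and `toy_next_token_logits` is a made-up stand-in for a trained model that returns the logits of the next token given the prefix generated so far.

```python
def greedy_decode(next_token_logits, sos_id, eos_id, max_len):
    """Generate ids one at a time, always picking the highest-scoring next token."""
    generated = [sos_id]
    for _ in range(max_len - 1):
        logits = next_token_logits(generated)
        next_id = max(range(len(logits)), key=lambda index: logits[index])
        generated.append(next_id)
        if next_id == eos_id:  # stop as soon as <eos> is produced
            break
    return generated

def toy_next_token_logits(prefix, vocab_size=10, eos_id=3):
    # Hypothetical model: prefers "last id + 2", and <eos> once the last id reaches 8.
    logits = [0.0] * vocab_size
    target = eos_id if prefix[-1] >= 8 else prefix[-1] + 2
    logits[target] = 1.0
    return logits

result = greedy_decode(toy_next_token_logits, sos_id=2, eos_id=3, max_len=10)
print(result)  # [2, 4, 6, 8, 3]
```

With the real model, `next_token_logits` would run the encoder once, run the decoder on the current prefix, and return the logits of the last position.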