## Setting Up the Colab Environment


### Libraries Used (Pinned)

*   [torch==1.9.0](https://pytorch.org/)
*   [transformers](https://pypi.org/project/transformers/)

In [1]:
!pip3 install torch==1.9.0 torchvision torchaudio
!pip3 install transformers



In [2]:
import math
import torch
from torch import nn
from transformers import BertTokenizer, BertConfig, set_seed

1. Reproducibility: set the random seed to 42 so that results can be reproduced

```
set_seed(42)
```


2. Set up the BERT [tokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) and [configuration (config)](https://huggingface.co/transformers/model_doc/bert.html#bertconfig)

```
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_configuraiton = BertConfig.from_pretrained('bert-base-cased')
```


3. Tokenize the input sequences

```
input_texts = ['I love cats!', 'He hates pineapple pizza.']

input_sequences = tokenizer(text=input_texts, add_special_tokens=True, padding=True, truncation=True, return_tensors='pt')
```
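
Step 1 above matters because dropout and weight initialization both draw from PyTorch's random number generator. `set_seed` from transformers seeds Python's `random`, NumPy, and PyTorch in one call. A minimal sketch of the effect, using only `torch.manual_seed` (the helper name `sample` is hypothetical and used for illustration only):

```python
import torch

def sample(seed: int) -> torch.Tensor:
    """Draw three random numbers after seeding PyTorch's global generator."""
    torch.manual_seed(seed)
    return torch.rand(3)

first = sample(42)
second = sample(42)

# Re-seeding with the same value replays the same random draw.
print(torch.equal(first, second))  # True
```

Without a fixed seed, the randomly initialized layers created later in this notebook would get different weights on every run.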



In [3]:
# Set seed for reproducibility
set_seed(42)

# Create BertTokenizer, Configuration
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_configuraiton = BertConfig.from_pretrained('bert-base-cased')

# Create input sequence using tokenizer
input_texts = ['I love cats!', 'He hates pineapple pizza.']
labels = [1, 0] # labels are not used in this exercise (can be ignored)
input_sequences = tokenizer(text=input_texts, add_special_tokens=True, padding=True, truncation=True, return_tensors='pt')

# input_sequences behaves like a dictionary, so the labels can be added to it;
# convert them to a tensor so that every value is a tensor
input_sequences.update({'labels':torch.tensor(labels)})
print(input_sequences)

{'input_ids': tensor([[  101,   146,  1567, 11771,   106,   102,     0,     0,     0],
        [  101,  1124, 18457, 10194, 11478,  7136, 13473,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([1, 0])}
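
The padded `input_ids` and the `attention_mask` in the output above follow a simple rule: shorter sequences are right-padded to the longest length in the batch, and the mask marks real tokens with 1 and padding with 0. A plain-Python sketch of that rule, using the token IDs from the printed output (the helper `pad_batch` and the constant `PAD_ID` are hypothetical names, not part of the tokenizer API):

```python
# Token IDs copied from the printed output above, with padding removed.
sequences = [
    [101, 146, 1567, 11771, 106, 102],
    [101, 1124, 18457, 10194, 11478, 7136, 13473, 119, 102],
]
PAD_ID = 0  # assumption: matches the 0 padding seen in the printed input_ids

def pad_batch(seqs, pad_id):
    """Right-pad every sequence to the longest length and build the mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(sequences, PAD_ID)
print(input_ids[0])       # [101, 146, 1567, 11771, 106, 102, 0, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```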


### Embedding the Input Sequence (Huggingface code)
- No changes needed; use as is

1. Create the BertEmbeddings block from the configuration
2. Run a forward pass through the bert_embeddings block
3. Embedding output shape: [batch_size, seq_len, hidden_size(768)]

```
# Create Bert embedding layer
bert_embeddings_block = BertEmbeddings(bert_configuraiton)

# Perform a forward pass
embedding_output = bert_embeddings_block.forward(input_ids=input_sequences['input_ids'], token_type_ids=input_sequences['token_type_ids'])
```



In [4]:
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")

    def forward(
        self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        # print('============== BertEmbeddings ==============')
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]       
        #print('Created Tokens Positions IDs: ', position_ids) # ADDED
        

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        # ADDED
        print('Tokens IDs: ', input_ids.shape)
        print('Tokens Type IDs: ', token_type_ids.shape)
        print('Word Embeddings: ', inputs_embeds.shape)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            # print('Position Embeddings: ', position_embeddings.shape) # ADDED

            embeddings += position_embeddings

        # ADDED
        # print('Token Types Embeddings: ', token_type_embeddings.shape)
        # print('Sum Up All Embeddings: ', embeddings.shape)

        embeddings = self.LayerNorm(embeddings)
        # print('Embeddings Layer Normalization: ', embeddings.shape) # ADDED

        embeddings = self.dropout(embeddings)
        # print('Embeddings Dropout Layer: ', embeddings.shape) # ADDED
        
        return embeddings

# Create Bert embedding layer
bert_embeddings_block = BertEmbeddings(bert_configuraiton)

# Perform a forward pass
embedding_output = bert_embeddings_block.forward(input_ids=input_sequences['input_ids'], token_type_ids=input_sequences['token_type_ids'])
print('Embedding Output: ', embedding_output.shape) # ADDED

Tokens IDs:  torch.Size([2, 9])
Tokens Type IDs:  torch.Size([2, 9])
Word Embeddings:  torch.Size([2, 9, 768])
Embedding Output:  torch.Size([2, 9, 768])
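
The shapes printed above come from summing three lookups of the same width and normalizing the result. A toy-sized sketch of that sum (the sizes and variable names here are assumptions chosen for illustration; bert-base-cased uses a hidden size of 768, and dropout is left out):

```python
import torch
from torch import nn

# Toy sizes chosen for illustration only.
vocab_size, max_positions, type_vocab_size, hidden_size = 100, 16, 2, 8

word_emb = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
pos_emb = nn.Embedding(max_positions, hidden_size)
type_emb = nn.Embedding(type_vocab_size, hidden_size)
layer_norm = nn.LayerNorm(hidden_size)

toy_input_ids = torch.tensor([[5, 6, 7, 0], [8, 9, 10, 11]])      # [batch=2, seq=4]
toy_token_type_ids = torch.zeros_like(toy_input_ids)               # single-segment input
toy_position_ids = torch.arange(toy_input_ids.size(1)).unsqueeze(0)  # [1, seq]

# The position embeddings ([1, seq, hidden]) broadcast over the batch dimension.
summed = (
    word_emb(toy_input_ids)
    + type_emb(toy_token_type_ids)
    + pos_emb(toy_position_ids)
)
toy_embedding_output = layer_norm(summed)
print(toy_embedding_output.shape)  # torch.Size([2, 4, 8])
```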


### Self-Attention (Huggingface code)

- Output shape: [batch_size, seq_len, 768]
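
The output keeps the input shape because the 768 hidden units are only split into heads for the attention computation and then merged back. A short sketch of that round trip, using the sizes from this notebook (batch 2, sequence length 9, 12 heads of width 64); the variable names are illustrative:

```python
import torch

batch_size, seq_len, num_heads, head_dim = 2, 9, 12, 64
hidden = torch.randn(batch_size, seq_len, num_heads * head_dim)  # [2, 9, 768]

# Split: [batch, seq, hidden] -> [batch, heads, seq, head_dim]
per_head = hidden.view(batch_size, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)

# Merge: undo the permute, then flatten the last two dimensions.
merged = per_head.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, -1)

print(per_head.shape)               # torch.Size([2, 12, 9, 64])
print(torch.equal(hidden, merged))  # True: the round trip loses nothing
```

The `.contiguous()` call is needed before `.view()` because `permute` only changes strides and does not move data.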

In [5]:
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        # ADDED
        # print('============== BertSelfAttention ==============')
        # print('Attention Head Size: ', self.attention_head_size)
        # print('Combined Attentions Head Size: ', self.all_head_size)

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        
        print('Hidden States: ', hidden_states.shape) # ADDED

        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:
            # ADDED
            # print('Query Linear Layer: ', mixed_query_layer.shape)
            # print('Key Linear Layer: ', past_key_value[0].shape)
            # print('Value Linear Layer: ', past_key_value[1].shape)

            # reuse k,v, cross_attentions
            key_layer = past_key_value[0]
            value_layer = past_key_value[1]
            attention_mask = encoder_attention_mask
        
        elif is_cross_attention:
            # ADDED
            # print('Query Linear Layer: ', mixed_query_layer.shape)
            # print('Key Linear Layer: ', self.key(encoder_hidden_states).shape)
            # print('Value Linear Layer: ', self.value(encoder_hidden_states).shape)

            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
            attention_mask = encoder_attention_mask
        
        elif past_key_value is not None:
            # ADDED
            # print('Query Linear Layer: ', mixed_query_layer.shape)
            # print('Key Linear Layer: ', self.key(hidden_states).shape)
            # print('Value Linear Layer: ', self.value(hidden_states).shape)

            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))
            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        
        else:
            # ADDED
            # print('Query Linear Layer: ', mixed_query_layer.shape)
            # print('Key Linear Layer: ', self.key(hidden_states).shape)
            # print('Value Linear Layer: ', self.value(hidden_states).shape)

            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        # ADDED
        # print('Query: ', query_layer.shape)
        # print('Key: ', key_layer.shape)
        # print('Value: ', value_layer.shape)

        if self.is_decoder:
            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_layer, value_layer)

        # ADDED
        # print('Key Transposed: ', key_layer.transpose(-1, -2).shape)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        # ADDED
        # print('Attention Scores: ', attention_scores.shape)

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            seq_length = hidden_states.size()[1]
            position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r
            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size) # scale by sqrt(d_k)
        # print('Attention Scores Divided by Scalar: ', attention_scores.shape) # ADDED

        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        # print('Attention Probabilities Softmax Layer: ', attention_probs.shape) # ADDED

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)
        # print('Attention Probabilities Dropout Layer: ', attention_probs.shape) # ADDED

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)
        # print('Context: ', context_layer.shape) # ADDED

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # print('Context Permute: ', context_layer.shape) # ADDED

        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        # print('Context Reshaped: ', context_layer.shape) # ADDED
        
        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs

In [6]:
def get_extended_attention_mask(attention_mask, input_shape, device):
        """
        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
        Arguments:
            attention_mask (:obj:`torch.Tensor`):
                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
            input_shape (:obj:`Tuple[int]`):
                The shape of the input to the model.
            device: (:obj:`torch.device`):
                The device of the input to the model.
        Returns:
            :obj:`torch.Tensor` The extended attention mask, with the same dtype as :obj:`attention_mask.dtype`.
        """
        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        if attention_mask.dim() == 3:
            extended_attention_mask = attention_mask[:, None, :, :]
        elif attention_mask.dim() == 2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            
            extended_attention_mask = attention_mask[:, None, None, :]
        else:
            raise ValueError(
                f"Wrong shape for input_ids (shape {input_shape}) or attention_mask (shape {attention_mask.shape})"
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        return extended_attention_mask.to(device)
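
To see why adding -10000.0 works as masking, here is a small sketch with one sequence of length 4 whose last token is padding. The raw scores are set to zero so that the effect of the mask alone is visible (the tensor names are illustrative):

```python
import torch

# One sequence of length 4; the last position is padding.
padding_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])                 # [batch, seq]
additive_mask = (1.0 - padding_mask[:, None, None, :]) * -10000.0   # [batch, 1, 1, seq]

scores = torch.zeros(1, 1, 4, 4)  # toy raw scores: [batch, heads, seq, seq]
probs = torch.softmax(scores + additive_mask, dim=-1)

print(additive_mask.shape)  # torch.Size([1, 1, 1, 4])
print(probs[0, 0, 0])       # the padded position receives (effectively) zero probability
```

The two singleton dimensions let the same mask broadcast across every head and every query position.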


### Custom Multi-Head Attention

- This is the part you implement yourself
- Output shape: [batch_size, seq_len, 768]
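
The computation the custom layer has to reproduce is scaled dot-product attention applied per head. Written out, with $M$ standing for the additive mask returned by `get_extended_attention_mask` and $d_k = 768 / 12 = 64$ for this configuration (the symbols $Q_i$, $K_i$, $V_i$ for the per-head projections are notation introduced here):

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + M\right) V_i,
\qquad
\mathrm{output} = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_{12})
```

Dropout on the attention probabilities is left out of the formula because it is inactive in eval mode. There is no output projection at this stage, which matches `BertSelfAttention`.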

In [11]:
class MultiHeadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim 
        self.n_heads = n_heads 
        self.head_dim = hid_dim // n_heads 
        
        self.query = nn.Linear(hid_dim, hid_dim)
        self.key = nn.Linear(hid_dim, hid_dim)
        self.value = nn.Linear(hid_dim, hid_dim)

        self.dropout = nn.Dropout(0.1)
        
        
    def forward(self, hidden_states, attention_mask=None):
      
        print('Hidden States: ', hidden_states.shape) # ADDED

        Q = self.query(hidden_states)
        K = self.key(hidden_states)
        V = self.value(hidden_states)
        print('Q.size',Q.size())
        print('K.size',K.size())
        print('V.size',V.size())
       
        batch_size = hidden_states.shape[0]
               
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0,2,1,3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0,2,1,3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0,2,1,3)
        print('Q.size',Q.size())
        print('K.size',K.size())
        print('V.size',V.size())
        
        d_k = self.head_dim # d_k
        print('dk',d_k)
        print('transpose k', K.transpose(-2,-1).size())
        attention_score = torch.matmul(Q, K.transpose(-1,-2)) # Q x K^T
        attention_score = attention_score / math.sqrt(d_k) 
        print('attention score: ', attention_score.size())
        
        if attention_mask is not None:
            attention_score = attention_score + attention_mask
        
        attention = nn.functional.softmax(attention_score, dim=-1) 
        print('softmax attention score: ', attention.size())
        
        attention = self.dropout(attention)
        
        output = torch.matmul(attention,V) 
        print('score*v',output.size())

        output = output.permute(0, 2, 1, 3) 
        print('permute output',output.size())

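        # merge the heads back: [batch, seq_len, n_heads, head_dim] -> [batch, seq_len, hid_dim]
        output = output.reshape(batch_size, -1, self.hid_dim)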
        print('reshape output: ', output.size())

        return output
        

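- Declare the attention layers
- Copy the parameters of the Huggingface BertSelfAttention into the custom attention
- Minimize randomness
  - `eval()` mode, which disables dropout
  - `with torch.no_grad()`, which turns off gradient tracking

- Compare the outputs produced by the forward pass of the two attention layers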
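
Both attention layers contain a dropout layer, so their outputs can only be compared once dropout is switched off. A minimal sketch of the difference between train and eval mode on a standalone `nn.Dropout` (the tensor names are illustrative):

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.1)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # entries are either zeroed or scaled by 1 / (1 - p)

dropout.eval()
with torch.no_grad():
    eval_out = dropout(x)  # dropout is the identity in eval mode

print(torch.equal(eval_out, x))  # True
```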

In [12]:
# Create bert self attention layer
bert_selfattention_block_huggingface = BertSelfAttention(bert_configuraiton)
bert_selfattention_block = MultiHeadAttention(hid_dim=768, n_heads=12) # <-- custom attention

# copy the parameters from the Huggingface attention layer
bert_selfattention_block.query.load_state_dict( bert_selfattention_block_huggingface.query.state_dict())
bert_selfattention_block.key.load_state_dict( bert_selfattention_block_huggingface.key.state_dict())
bert_selfattention_block.value.load_state_dict( bert_selfattention_block_huggingface.value.state_dict())

# set eval mode
bert_selfattention_block_huggingface.eval()
bert_selfattention_block.eval() 

MultiHeadAttention(
  (query): Linear(in_features=768, out_features=768, bias=True)
  (key): Linear(in_features=768, out_features=768, bias=True)
  (value): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
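
Copying the `state_dict` is what makes the two layers comparable: each `nn.Linear` starts with its own random weights, and `load_state_dict` overwrites them with the source layer's weight and bias. A small sketch on two toy linear layers (the sizes and names are illustrative):

```python
import torch
from torch import nn

source = nn.Linear(4, 4)
target = nn.Linear(4, 4)  # initialized independently, so its weights differ from source

# Overwrite target's weight and bias with those of source.
target.load_state_dict(source.state_dict())

x = torch.randn(2, 4)
print(torch.equal(source(x), target(x)))  # True: same parameters, same output
```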

In [13]:
# Perform a forward pass
with torch.no_grad():
    input_shape= input_sequences.input_ids.shape
    attention_mask = get_extended_attention_mask(input_sequences.attention_mask, input_shape, input_sequences.input_ids.device)
    
    context_embedding_huggingface = bert_selfattention_block_huggingface.forward(hidden_states=embedding_output, attention_mask=attention_mask)
    context_embedding_custom = bert_selfattention_block.forward(hidden_states=embedding_output, attention_mask=attention_mask)

print('[Huggingface] Context Embedding : ', context_embedding_huggingface[0])
print('[Custom] Context Embedding : ', context_embedding_custom)

Hidden States:  torch.Size([2, 9, 768])
Hidden States:  torch.Size([2, 9, 768])
Q.size torch.Size([2, 9, 768])
K.size torch.Size([2, 9, 768])
V.size torch.Size([2, 9, 768])
Q.size torch.Size([2, 12, 9, 64])
K.size torch.Size([2, 12, 9, 64])
V.size torch.Size([2, 12, 9, 64])
dk 64
transpose k torch.Size([2, 12, 64, 9])
attention score:  torch.Size([2, 12, 9, 9])
softmax attention score:  torch.Size([2, 12, 9, 9])
score*v torch.Size([2, 12, 9, 64])
permute output torch.Size([2, 9, 12, 64])
reshape output:  torch.Size([2, 9, 768])
[Huggingface] Context Embedding :  tensor([[[-0.0682,  0.2734,  0.0319,  ..., -0.1064, -0.0046,  0.1235],
         [ 0.0223,  0.3108,  0.0203,  ..., -0.0182, -0.0749,  0.1181],
         [ 0.0030,  0.2641,  0.0293,  ..., -0.0758, -0.0325,  0.1213],
         ...,
         [ 0.0562,  0.3384,  0.0308,  ..., -0.1732, -0.0441,  0.1613],
         [ 0.0049,  0.3068,  0.0321,  ..., -0.0700, -0.0311,  0.1123],
         [ 0.0239,  0.3569,  0.0377,  ..., -0.0650, -0.0229,  

In [14]:
print(torch.eq(context_embedding_huggingface[0], context_embedding_custom))

tensor([[[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]],

        [[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]]])
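
`torch.eq` checks exact element-wise equality, which holds here because both layers perform the same operations in the same order. If an implementation reorders the arithmetic, tiny floating-point differences can appear, and a tolerance-based check such as `torch.allclose` is often more practical. A sketch with a synthetic perturbation (the tensors `a` and `b` are made up for illustration):

```python
import torch

a = torch.tensor([0.1, 0.2, 0.3])
b = a + 1e-7  # tiny perturbation standing in for floating-point rounding differences

print(torch.equal(a, b))                # False: not bit-for-bit identical
print(torch.allclose(a, b, atol=1e-6))  # True: equal within tolerance
print((a - b).abs().max())              # largest absolute difference
```

The same pattern could be applied to `context_embedding_huggingface[0]` and `context_embedding_custom` to get a single pass/fail value instead of a tensor of booleans.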
