# Transformers with Hugging Face
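- Until now the focus was on the model architecture itself; from here on it is on how to use weights that have already been trained.
- BERT: a Transformer encoder pre-trained with MLM (masked language modeling) and NSP (next sentence prediction), then fine-tuned on a downstream task.
- Two things matter: how the model is trained, and the pretrained weights (the latter matter more).
- KoBERT: BERT pre-trained on Korean data.
- Model size also varies with the use case: GPT small, large, ... differ in the number of decoder layers and the hidden dimension size, and larger models need more GPU resources.
- Each model/checkpoint uses its own tokenizer.
- So four pieces are needed: Architecture, Pretrained Weight, Config (scale, dim), Tokenizer.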

Hugging Face simplifies this process.
- The Models page lets you browse the available weight variations.

In [None]:
pip install transformers

In [14]:
from transformers import BertConfig, BertForMaskedLM
# To read the actual implementation, import from the module path instead
from transformers.models.bert.modeling_bert import BertForMaskedLM

In [None]:
# Training from scratch: define the architecture yourself
config = BertConfig(vocab_size=40000, hidden_size=256, num_hidden_layers=4, num_attention_heads=4, intermediate_size=1024, max_position_embeddings=1024)
model = BertForMaskedLM(config)
print(model)
# The output layer projects back onto the vocabulary: vocab_size=40000 -> out_features=40000
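
To make the link between config values and model size concrete, the sketch below estimates the parameter count of a `BertForMaskedLM` from its config. The helper name `bert_mlm_param_count` is hypothetical, and the formula assumes the standard BERT layout (two token types, no pooler, output projection weights tied to the word embeddings), so treat it as an estimate rather than the library's own accounting.

```python
def bert_mlm_param_count(vocab_size, hidden_size, num_hidden_layers,
                         intermediate_size, max_position_embeddings,
                         type_vocab_size=2):
    """Estimate trainable parameters of a BERT masked-LM (assumed layout)."""
    V, H, L = vocab_size, hidden_size, num_hidden_layers
    I, P, T = intermediate_size, max_position_embeddings, type_vocab_size

    # Word, position and token-type embeddings, plus one LayerNorm (weight + bias).
    embeddings = V * H + P * H + T * H + 2 * H

    # One encoder layer: Q, K, V and attention-output projections (with biases),
    # a LayerNorm, the two feed-forward projections, and a second LayerNorm.
    per_layer = 4 * (H * H + H) + 2 * H + (H * I + I) + (I * H + H) + 2 * H

    # MLM head: dense transform, LayerNorm, and an output bias of size V.
    # The output weight matrix is assumed tied to the word embeddings,
    # so it is not counted a second time.
    mlm_head = (H * H + H) + 2 * H + V

    return embeddings + L * per_layer + mlm_head


# The small config defined above.
small = bert_mlm_param_count(40000, 256, 4, 1024, 1024)
# The bert-base-uncased hyperparameters (30522 tokens, 768 hidden, 12 layers).
base = bert_mlm_param_count(30522, 768, 12, 3072, 512)
print(f"small: {small:,}  base: {base:,}")
```

Under these assumptions the small config comes to about 13.8M parameters and `bert-base-uncased` to about 109.5M. In the small config most of the parameters sit in the 40000 x 256 embedding table, not in the four encoder layers.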

In [None]:
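# Using a pretrained checkpoint

# BertConfig.from_pretrained fetches only the hyperparameters: number of layers, vocab_size, hidden size, and so on
# e.g. 12 layers, 12 attention heads, pad_token_id=0, vocab_size=30522, GELU activation, ...
# The weights are NOT loaded here, so the model below is randomly initialized
# uncased: text is lowercased, so upper and lower case are not distinguished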
config = BertConfig.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(config)
print(model)

In [20]:
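# Load the pretrained weights as well
# The warning printed below is expected: cls.seq_relationship.* belongs to the NSP head,
# which BertForMaskedLM does not have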
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(model)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr
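
The difference between building a model from a config alone and calling `from_pretrained` on the model class can be checked without downloading anything. The sketch below uses a tiny, hypothetical config and a temporary local directory in place of a hub checkpoint; the helper `word_embeddings` is introduced here only for the comparison.

```python
import tempfile

import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny, hypothetical config so the example runs quickly on CPU.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64,
                    max_position_embeddings=64)


def word_embeddings(model):
    """Return the input embedding matrix as a plain tensor."""
    return model.get_input_embeddings().weight.detach()


# Stands in for a "pretrained" model; its weights are just a random init.
trained = BertForMaskedLM(config)

with tempfile.TemporaryDirectory() as ckpt_dir:
    # Writes config.json plus the weight file into the directory.
    trained.save_pretrained(ckpt_dir)

    # Config only: same architecture, freshly initialized weights.
    config_only = BertForMaskedLM(BertConfig.from_pretrained(ckpt_dir))

    # Config and weights: the saved parameters are restored.
    restored = BertForMaskedLM.from_pretrained(ckpt_dir)

same_as_saved = torch.equal(word_embeddings(trained), word_embeddings(restored))
differs_from_saved = not torch.equal(word_embeddings(trained),
                                     word_embeddings(config_only))
print(same_as_saved, differs_from_saved)
```

Both models have the same printed structure, which is why `print(model)` alone cannot tell whether pretrained weights were actually loaded.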

In [21]:
from transformers import BertTokenizerFast

In [23]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [25]:
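text = "Hello, My name is Seokjin Oh"
# tokenize() returns the WordPiece strings only (lowercased by the uncased tokenizer)
print(tokenizer.tokenize(text))
# Calling the tokenizer also adds [CLS] (id 101) and [SEP] (id 102) and returns model inputs
print(tokenizer(text))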

['hello', ',', 'my', 'name', 'is', 'seo', '##k', '##jin', 'oh']
{'input_ids': [101, 7592, 1010, 2026, 2171, 2003, 27457, 2243, 14642, 2821, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
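
The split of `seokjin` into `seo`, `##k`, `##jin` follows from WordPiece's greedy longest-match-first lookup. The sketch below is a simplified pure-Python illustration of that lookup for a single lowercased word, not the library's implementation; the function `wordpiece` and the contents of `toy_vocab` are hypothetical and chosen only to reproduce the split seen above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocab has 30522 entries.
toy_vocab = {"hello", "my", "name", "is", "oh",
             "seo", "##k", "##jin", "##j", "##in"}

print(wordpiece("seokjin", toy_vocab))
print(wordpiece("hello", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the longest match wins, `##jin` is chosen over `##j` followed by `##in`, even though both are in the toy vocabulary.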
