# BERT
Understanding BERT by working through `kcbert`.
- `kcbert`
    - huggingface.co: https://huggingface.co/beomi/kcbert-base
    - github: https://github.com/Beomi/KcBERT

- `config.json`
```json
{
  "max_position_embeddings": 300,
  "hidden_dropout_prob": 0.1,
  "pooler_size_per_head": 128,
  "hidden_act": "gelu",
  "initializer_range": 0.02,
  "num_hidden_layers": 12,
  "pooler_num_attention_heads": 12,
  "type_vocab_size": 2,
  "vocab_size": 30000,
  "hidden_size": 768,
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "num_attention_heads": 12,
  "pooler_fc_size": 768,
  "pooler_type": "first_token_transform",
  "pooler_num_fc_layers": 3,
  "intermediate_size": 3072,
  "architectures": [
    "BertForMaskedLM"
  ],
  "model_type": "bert"
}
```
- `tokenizer_config.json`
```json
{
  "do_lower_case": false,
  "model_max_length": 300
}
```
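
The sizes of the embedding tables follow directly from the `config.json` values above. The snippet below is a minimal sketch in plain Python that derives them without downloading the model. The helper name `embedding_shapes` is hypothetical and is not part of `transformers`.

```python
# Values copied from the config.json shown above.
config = {
    "vocab_size": 30000,
    "max_position_embeddings": 300,
    "type_vocab_size": 2,
    "hidden_size": 768,
    "num_attention_heads": 12,
}


def embedding_shapes(cfg):
    """Return the (rows, columns) shape of each BERT embedding table.

    Hypothetical helper for illustration; not a transformers API.
    """
    hidden = cfg["hidden_size"]
    return {
        "word_embeddings": (cfg["vocab_size"], hidden),
        "position_embeddings": (cfg["max_position_embeddings"], hidden),
        "token_type_embeddings": (cfg["type_vocab_size"], hidden),
    }


shapes = embedding_shapes(config)
for name, shape in shapes.items():
    print(name, shape)

# Each attention head works on hidden_size / num_attention_heads dimensions.
head_size = config["hidden_size"] // config["num_attention_heads"]
print("head_size", head_size)
```

The position table has only 300 rows, so inputs longer than 300 tokens cannot be embedded, which is why `tokenizer_config.json` sets `model_max_length` to the same value.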

## loading BERT
The examples use [KcBERT](https://github.com/Beomi/KcBERT).

In [14]:
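from transformers import AutoTokenizer, AutoModelForMaskedLM

# AutoModelWithLMHead is deprecated; AutoModelForMaskedLM loads the same BertForMaskedLM class.
# use_fast=False selects the slow tokenizer, which exposes basic_tokenizer and wordpiece_tokenizer.
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")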

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
# Check each parameter's name and the shape of its weight tensor
[[name, param.size()] for name, param in model.named_parameters()]

[['bert.embeddings.word_embeddings.weight', torch.Size([30000, 768])],
 ['bert.embeddings.position_embeddings.weight', torch.Size([300, 768])],
 ['bert.embeddings.token_type_embeddings.weight', torch.Size([2, 768])],
 ['bert.embeddings.LayerNorm.weight', torch.Size([768])],
 ['bert.embeddings.LayerNorm.bias', torch.Size([768])],
 ['bert.encoder.layer.0.attention.self.query.weight', torch.Size([768, 768])],
 ['bert.encoder.layer.0.attention.self.query.bias', torch.Size([768])],
 ['bert.encoder.layer.0.attention.self.key.weight', torch.Size([768, 768])],
 ['bert.encoder.layer.0.attention.self.key.bias', torch.Size([768])],
 ['bert.encoder.layer.0.attention.self.value.weight', torch.Size([768, 768])],
 ['bert.encoder.layer.0.attention.self.value.bias', torch.Size([768])],
 ['bert.encoder.layer.0.attention.output.dense.weight',
  torch.Size([768, 768])],
 ['bert.encoder.layer.0.attention.output.dense.bias', torch.Size([768])],
 ['bert.encoder.layer.0.attention.output.LayerNorm.weight', tor
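
The listing above is cut off partway through the first encoder layer. As a rough sketch, the parameter count of one layer can be estimated from `hidden_size` and `intermediate_size` in `config.json`. This assumes the standard BERT layer layout: query, key, value and output projections, a two-layer feed-forward block, and two LayerNorms. The function names below are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one dense layer."""
    return in_features * out_features + out_features


def layer_norm_params(hidden_size):
    """LayerNorm has one weight vector and one bias vector."""
    return 2 * hidden_size


def encoder_layer_param_count(hidden_size=768, intermediate_size=3072):
    """Estimate the parameters in one encoder layer (standard BERT layout assumed)."""
    # query, key, value and attention output projections
    attention = 4 * linear_params(hidden_size, hidden_size)
    # feed-forward block: expand to intermediate_size, then project back
    feed_forward = (linear_params(hidden_size, intermediate_size)
                    + linear_params(intermediate_size, hidden_size))
    # one LayerNorm after attention, one after the feed-forward block
    norms = 2 * layer_norm_params(hidden_size)
    return attention + feed_forward + norms


per_layer = encoder_layer_param_count()
print(f"parameters per encoder layer: {per_layer:,}")
print(f"parameters in 12 layers: {12 * per_layer:,}")
```

The `768 x 768` weights and `768` biases in the listing correspond to the `linear_params(hidden_size, hidden_size)` terms.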

## tokenizer

In [73]:
# model_max_length replaces the deprecated max_len attribute
print(f"max_len={tokenizer.model_max_length}")
print(f"do_basic_tokenize={tokenizer.do_basic_tokenize}")

example_0 = "안녕하세요, 반갑습니다."
print(example_0)

max_len=300
do_basic_tokenize=True
안녕하세요, 반갑습니다.


In [91]:
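# Step 1: the basic tokenizer splits on whitespace and punctuation
tokenized_by_basic = tokenizer.basic_tokenizer.tokenize(example_0)
# Step 2: the WordPiece tokenizer splits each word into subword pieces
print([tokenizer.wordpiece_tokenizer.tokenize(token) for token in tokenized_by_basic])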
print(tokenizer.tokenize("안녕하세요. 반갑습니다."))

[['안녕', '##하세요'], [','], ['반', '##갑', '##습니다'], ['.']]
['안녕', '##하세요', '.', '반', '##갑', '##습니다', '.']
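
The two-step behaviour above (basic split, then WordPiece) can be sketched in plain Python. This is an illustration of greedy longest-match-first subword splitting, not the `transformers` implementation. `TOY_VOCAB` is a made-up vocabulary holding only the pieces seen in the output above, and the regex-based `basic_tokenize` is a simplification.

```python
import re

# Toy vocabulary (an assumption for illustration; the real KcBERT vocab has 30000 entries).
TOY_VOCAB = {"안녕", "##하세요", "반", "##갑", "##습니다", ",", ".", "[UNK]"}


def basic_tokenize(text):
    """Split on whitespace and isolate punctuation (simplified basic tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        pieces.append(current)
        start = end
    return pieces


words = basic_tokenize("안녕하세요, 반갑습니다.")
pieces = [wordpiece_tokenize(w, TOY_VOCAB) for w in words]
print(pieces)
```

With this toy vocabulary the sketch reproduces the nested list printed above. A word with no matching pieces, such as `모두` here, collapses to `[UNK]`.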


In [85]:
# all_special_tokens_extended is a list, not a dict, so calling .get() on it fails
tokenizer.all_special_tokens_extended.get()

AttributeError: 'list' object has no attribute 'get'

In [82]:
tokenizer.prepare_for_tokenization(example_0)

('안녕하세요, 반갑습니다.', {})

In [53]:
example_text = "모두연 자연어처리랩에 오신 것을 환영합니다."
tokenized = tokenizer.tokenize(example_text)
tokenized2indices = tokenizer.convert_tokens_to_ids(tokenized)
encoded = tokenizer.encode(example_text)
decoded = tokenizer.decode(encoded)
print(f"tokenized={tokenized}")
print(f"tokenized2indices={tokenized2indices}")
print(f"encoded={encoded}")
print(tokenizer.convert_ids_to_tokens(encoded))
print(f"decoded={decoded}")

tokenized=['모두', '##연', '자연', '##어', '##처리', '##랩', '##에', '오신', '것을', '환영합니다', '.']
tokenized2indices=[8248, 4132, 10459, 4071, 11385, 5116, 4113, 28914, 9153, 29502, 17]
encoded=[2, 8248, 4132, 10459, 4071, 11385, 5116, 4113, 28914, 9153, 29502, 17, 3]
['[CLS]', '모두', '##연', '자연', '##어', '##처리', '##랩', '##에', '오신', '것을', '환영합니다', '.', '[SEP]']
decoded=[CLS] 모두연 자연어처리랩에 오신 것을 환영합니다. [SEP]
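
Comparing `tokenized2indices` with `encoded` shows that `encode` does the same vocabulary lookup and then wraps the result in `[CLS]` (id 2) and `[SEP]` (id 3). A minimal sketch of that relationship is below. `token_to_id` holds only the pairs visible in the output above, and `encode_tokens` is a hypothetical helper, not a `transformers` function.

```python
# Token-to-id pairs copied from the output above; a tiny slice of the real vocabulary.
token_to_id = {
    "[CLS]": 2, "[SEP]": 3,
    "모두": 8248, "##연": 4132, "자연": 10459, "##어": 4071, "##처리": 11385,
    "##랩": 5116, "##에": 4113, "오신": 28914, "것을": 9153, "환영합니다": 29502,
    ".": 17,
}


def encode_tokens(tokens, vocab):
    """Look up each token id, then add [CLS] at the front and [SEP] at the end."""
    ids = [vocab[token] for token in tokens]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


tokens = ["모두", "##연", "자연", "##어", "##처리", "##랩", "##에",
          "오신", "것을", "환영합니다", "."]
encoded_sketch = encode_tokens(tokens, token_to_id)
print(encoded_sketch)
```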



In [36]:
tokenizer.encode_plus("안녕하세요")

{'input_ids': [2, 19017, 8482, 3], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
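
The three fields returned by `encode_plus` have a simple relationship for a single sentence: `token_type_ids` is all zeros and `attention_mask` is 1 for every real token. The sketch below rebuilds that dictionary from `input_ids` and also shows what padding would look like. `build_inputs` is a hypothetical helper, and `pad_id=0` is an assumption (check `tokenizer.pad_token_id` for the real value).

```python
def build_inputs(input_ids, max_length=None, pad_id=0):
    """Mimic the dictionary layout returned for a single sentence.

    pad_id=0 is an assumption for illustration; check tokenizer.pad_token_id.
    """
    n_real = len(input_ids)
    # Pad only when max_length is given and exceeds the real length.
    n_pad = max(0, (max_length or n_real) - n_real)
    return {
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        # A single sentence belongs entirely to segment 0.
        "token_type_ids": [0] * (n_real + n_pad),
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


unpadded = build_inputs([2, 19017, 8482, 3])
padded = build_inputs([2, 19017, 8482, 3], max_length=6)
print(unpadded)
print(padded)
```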

In [23]:
tokenizer.convert_ids_to_tokens(encoded)

['[CLS]',
 '모두',
 '##연',
 '자연',
 '##어',
 '##처리',
 '##랩',
 '##에',
 '오신',
 '것을',
 '환영합니다',
 '.',
 '[SEP]']
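
Going from this token list back to the `decoded` string is mostly a matter of gluing `##` pieces onto the preceding token. The following is a simplified sketch of that step, not the actual `decode` implementation. `join_wordpieces` is a hypothetical helper, and the punctuation clean-up covers only a few marks.

```python
def join_wordpieces(tokens):
    """Join tokens with spaces, merge ## continuations, and tidy punctuation."""
    text = " ".join(tokens)
    # A continuation piece attaches directly to the token before it.
    text = text.replace(" ##", "")
    # Remove the space that the join left in front of punctuation.
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text


tokens = ["[CLS]", "모두", "##연", "자연", "##어", "##처리", "##랩", "##에",
          "오신", "것을", "환영합니다", ".", "[SEP]"]
restored = join_wordpieces(tokens)
print(restored)
```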