Code from https://ratsgo.github.io/nlpbook/docs/language_model/tutorial/

# Package installation
Install the required packages with pip.

In [1]:
!pip install ratsnlp

Collecting ratsnlp
  Downloading ratsnlp-1.0.53-py3-none-any.whl.metadata (741 bytes)
Collecting pytorch-lightning==1.6.1 (from ratsnlp)
  Downloading pytorch_lightning-1.6.1-py3-none-any.whl.metadata (33 kB)
Collecting transformers==4.28.1 (from ratsnlp)
  Downloading transformers-4.28.1-py3-none-any.whl.metadata (109 kB)
Collecting Korpora>=0.2.0 (from ratsnlp)
  Downloading Korpora-0.2.0-py3-none-any.whl.metadata (26 kB)
Collecting flask>=1.1.4 (from ratsnlp)
  Using cached flask-3.0.3-py3-none-any.whl.metadata (3.2 kB)
Collecting flask-ngrok>=0.0.25 (from ratsnlp)
  Downloading flask_ngrok-0.0.25-py3-none-any.whl.metadata (1.8 kB)
Collecting flask-cors>=3.0.10 (from ratsnlp)
  Downloading Flask_Cors-4.0.1-py2.py3-none-any.whl.metadata (5.5 kB)
Collecting torchmetrics>=0.4.1 (from pytorch-lightning==1.6.1->ratsnlp)
  Downloading torchmetrics-1.4.1-py3-none-a

## Tokenizer initialization

Declare the tokenizer used by the BERT (`kcbert-base`) model.

In [2]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    "beomi/kcbert-base",
    do_lower_case=False,
)
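
`BertTokenizer` splits each word into subword pieces with WordPiece, a greedy longest-match-first lookup against the vocabulary. The sketch below illustrates that idea for a single word. The vocabulary `toy_vocab` is hypothetical and is not the real `kcbert-base` vocabulary, and the helper `wordpiece` is an illustration, not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"안녕", "##하", "##하세요", "##하십니까", "[UNK]"}

print(wordpiece("안녕하세요", toy_vocab))    # ['안녕', '##하세요']
print(wordpiece("안녕하십니까", toy_vocab))  # ['안녕', '##하십니까']
print(wordpiece("반갑습니다", toy_vocab))    # ['[UNK]']
```

The longer piece `##하세요` wins over `##하` because candidates are tried from longest to shortest.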




## Model initialization

Load the BERT (`kcbert-base`) model.

In [3]:
from transformers import BertConfig, BertModel
pretrained_model_config = BertConfig.from_pretrained(
    "beomi/kcbert-base"
)
model = BertModel.from_pretrained(
    "beomi/kcbert-base",
    config=pretrained_model_config,
)




Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Check the contents of `pretrained_model_config`.

In [4]:
pretrained_model_config

BertConfig {
  "_name_or_path": "beomi/kcbert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 300,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.28.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30000
}
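
The configuration values determine the size of the model. As a rough worked example, the arithmetic below counts the trainable parameters of a `BertModel` built from this config. It assumes the standard Hugging Face BERT layout (embeddings with a LayerNorm, 12 identical encoder layers, and a pooler) and 4-byte float32 weights. The variable names are made up for this sketch.

```python
# Values copied from the BertConfig printed above.
hidden, vocab, max_pos, type_vocab = 768, 30000, 300, 2
intermediate, layers = 3072, 12


def linear(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift vector

# Word, position, and token-type embedding tables, then a LayerNorm.
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, then a LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two-layer feed-forward block, then a LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)

total = embeddings + encoder + pooler
print(f"{total:,} parameters")
print(f"{total * 4 / 1e6:.1f} MB at 4 bytes per parameter")
```

Under these assumptions the count comes to roughly 109 million parameters, most of them in the encoder layers and the word embedding table.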

## Building the model inputs

Specify three sentences as the model input.

In [5]:
sentences = ["안녕하세요", "안녕하십니까", "안녕하십니까."]
features = tokenizer(
    sentences,
    max_length=10,
    padding="max_length",
    truncation=True,
)

Check the contents of `features`.

In [6]:
features.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [7]:
features['input_ids']

[[2, 19017, 8482, 3, 0, 0, 0, 0, 0, 0],
 [2, 19017, 22796, 3, 0, 0, 0, 0, 0, 0],
 [2, 19017, 22796, 17, 3, 0, 0, 0, 0, 0]]

In [8]:
features['attention_mask']

[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]

In [9]:
features['token_type_ids']

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
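
The three outputs above follow a simple pattern: each sequence is wrapped in special tokens, padded with zeros up to `max_length`, and the attention mask marks which positions are real tokens. A minimal sketch of that bookkeeping, assuming that IDs 2, 3, and 0 are `[CLS]`, `[SEP]`, and `[PAD]` as the printed `input_ids` suggest (the helper `encode` is hypothetical, not a tokenizer API):

```python
def encode(piece_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Add special tokens, truncate, pad, and build the attention mask."""
    body = piece_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad           # 0 = padding, ignored by attention
    return input_ids, attention_mask


# Subword IDs of the third sentence, taken from the output above.
ids, mask = encode([19017, 22796, 17], max_length=10)
print(ids)   # [2, 19017, 22796, 17, 3, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

`token_type_ids` stays all zeros here because each input is a single sentence rather than a sentence pair.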

## Extracting BERT embeddings

Convert the `features` built above to PyTorch tensors.

In [10]:
import torch
features = {k: torch.tensor(v) for k, v in features.items()}

Feed `features` to the BERT model and run the forward computation.

In [11]:
outputs = model(**features)
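
The `**features` syntax unpacks the dictionary into keyword arguments, so each key is matched to the parameter of the same name. The sketch below shows the equivalence with a stand-in function; `toy_model` and `toy_features` are hypothetical and only mimic the three argument names used here.

```python
def toy_model(input_ids=None, attention_mask=None, token_type_ids=None):
    """Stand-in for a model's forward call; only reports what it received."""
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


toy_features = {
    "input_ids": [[2, 19017, 8482, 3]],
    "token_type_ids": [[0, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 1]],
}

# Dictionary unpacking ...
unpacked = toy_model(**toy_features)
# ... is the same as spelling out every keyword argument.
explicit = toy_model(
    input_ids=toy_features["input_ids"],
    attention_mask=toy_features["attention_mask"],
    token_type_ids=toy_features["token_type_ids"],
)
print(unpacked == explicit)  # True
```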

Inspect the token-level output vectors from BERT's last layer.

In [12]:
outputs.last_hidden_state

tensor([[[-0.6969, -0.8248,  1.7512,  ..., -0.3732,  0.7399,  1.1907],
         [-1.4803, -0.4398,  0.9444,  ..., -0.7405, -0.0211,  1.3064],
         [-1.4299, -0.5033, -0.2069,  ...,  0.1285, -0.2611,  1.6057],
         ...,
         [-1.4406,  0.3431,  1.4043,  ..., -0.0565,  0.8450, -0.2170],
         [-1.3625, -0.2404,  1.1757,  ...,  0.8876, -0.1054,  0.0734],
         [-1.4244,  0.1518,  1.2920,  ...,  0.0245,  0.7572,  0.0080]],

        [[-0.4433, -0.9240,  1.7883,  ..., -0.8659,  0.9727,  0.9249],
         [-1.6224, -0.0682,  0.1769,  ..., -0.8680,  0.7786,  1.9937],
         [-1.1414, -0.1960, -0.7010,  ...,  0.6878,  0.3851,  0.4984],
         ...,
         [-1.0095,  0.0397,  1.5963,  ...,  0.8457,  0.3166, -0.0534],
         [-1.2240,  0.4880,  1.3206,  ...,  1.4305,  0.7170,  0.7261],
         [-1.2351, -0.0513,  1.4715,  ...,  0.6764,  0.1578, -0.1697]],

        [[-0.1236, -1.0882,  1.5019,  ..., -1.0845,  0.5248,  0.6436],
         [-1.4286, -0.1186,  0.1144,  ..., -0

In [13]:
print(outputs.last_hidden_state.shape)
print(outputs.last_hidden_state[0, 1, :30])
print(outputs.last_hidden_state[1, 1, :30])
print(outputs.last_hidden_state[2, 1, :30])

torch.Size([3, 10, 768])
tensor([-1.4803, -0.4398,  0.9444,  0.4441,  0.6319,  0.8464,  0.5650,  0.2841,
         0.3534,  2.0140, -1.3429, -0.4704, -0.6285, -0.6812,  0.5594,  0.0941,
         0.6915,  0.0828,  0.4902, -0.3403,  0.3471,  0.7107, -0.4473, -0.3360,
         0.8616,  0.0462,  0.2277,  0.8386,  0.3912, -0.4810],
       grad_fn=<SliceBackward0>)
tensor([-1.6224, -0.0682,  0.1769,  0.4605,  0.1461,  0.4836,  1.0702,  0.3585,
         0.5402,  1.9740, -1.0956, -0.7817, -0.8804, -0.5016, -0.1314,  0.3771,
         1.0445, -0.4988, -0.0756, -0.0102,  0.9213,  0.3548, -0.7024,  1.2910,
         1.3515, -0.2282,  0.4545,  0.5006, -0.3627, -0.8802],
       grad_fn=<SliceBackward0>)
tensor([-1.4286, -0.1186,  0.1144,  0.5986,  0.2569,  0.4758,  0.8051,  0.3198,
         0.7381,  2.0553, -1.1412, -0.7723, -1.1986, -0.5407, -0.1609,  0.3903,
         0.9213, -0.5395, -0.0479,  0.1581,  0.9003,  0.3982, -0.5360,  1.4631,
         1.3810, -0.3041,  0.6067,  0.4754, -0.0529, -0.6580],
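
The three printed slices are the vectors at position 1, the first subword of each sentence, which has the same ID (19017) in all three inputs. The values differ because BERT embeddings depend on context. One way to quantify the difference is cosine similarity. The sketch below uses only the first eight printed components of each vector, so the numbers are illustrative and are not the similarity of the full 768-dimensional vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# First eight components of last_hidden_state[i, 1], copied from the output above.
vec0 = [-1.4803, -0.4398, 0.9444, 0.4441, 0.6319, 0.8464, 0.5650, 0.2841]
vec1 = [-1.6224, -0.0682, 0.1769, 0.4605, 0.1461, 0.4836, 1.0702, 0.3585]
vec2 = [-1.4286, -0.1186, 0.1144, 0.5986, 0.2569, 0.4758, 0.8051, 0.3198]

print("sentence 0 vs 1:", round(cosine(vec0, vec1), 4))
print("sentence 1 vs 2:", round(cosine(vec1, vec2), 4))
```

With the full tensors, the same comparison could be done on `outputs.last_hidden_state[i, 1]` directly.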


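Check the pooled `[CLS]` output: the last layer's `[CLS]` vector after it passes through the pooler (a dense layer followed by tanh).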

In [14]:
outputs.pooler_output

tensor([[-0.1594,  0.0547,  0.1101,  ...,  0.2684,  0.1596, -0.9828],
        [ 0.1988,  0.0735,  0.0183,  ...,  0.2965, -0.1216, -0.9984],
        [ 0.1642,  0.0842,  0.0423,  ...,  0.3627,  0.0049, -0.9981]],
       grad_fn=<TanhBackward0>)
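
`pooler_output` is not the raw `[CLS]` hidden state. The `grad_fn=<TanhBackward0>` in the output reflects that the `[CLS]` vector goes through a dense layer and a tanh, which is why every value lies between -1 and 1. A minimal sketch of that computation in plain Python, using a 3-dimensional toy vector; the `weight` and `bias` values are hypothetical, not the model's trained parameters:

```python
import math


def pooler(cls_vector, weight, bias):
    """Compute tanh(W @ h_cls + b) with plain lists."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# First three printed components of last_hidden_state[0, 0] (the [CLS] position).
h_cls = [-0.6969, -0.8248, 1.7512]

# Hypothetical dense-layer parameters, for illustration only.
weight = [
    [0.5, -0.2, 0.1],
    [0.0, 0.3, -0.4],
    [1.0, 1.0, 1.0],
]
bias = [0.0, 0.1, -0.1]

pooled = pooler(h_cls, weight, bias)
print(pooled)
```

To work with the unpooled vector instead, the `[CLS]` position can be read from `outputs.last_hidden_state[:, 0]`.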