# Generating BERT Embeddings

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [2]:
from transformers import BertModel, BertTokenizer
import torch

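## Extracting embeddings from all of BERT's encoder layers

Download the pre-trained BERT model and tokenizer

- When loading the pre-trained BERT model, set `output_hidden_states = True`: this setting is required to obtain embeddings from every encoder layer, not just the top one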

In [3]:
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Preprocessing the input

The preprocessing steps are the same as when extracting embeddings only from BERT's top encoder layer.

In [4]:
sentence = 'I love Paris'
tokens = tokenizer.tokenize(sentence)
tokens = ['[CLS]'] + tokens + ['[SEP]']

In [5]:
tokens = tokens + ['[PAD]'] + ['[PAD]']  # pad the sequence to a fixed length of 7 tokens
attention_mask = [1 if tok != '[PAD]' else 0 for tok in tokens]  # 1 = real token, 0 = padding
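
The padding and masking steps can also be wrapped in a small helper. The following is a minimal sketch in plain Python; the function name `pad_and_mask` and its `max_len` parameter are assumptions made for illustration and are not part of the `transformers` API.

```python
def pad_and_mask(tokens, max_len, pad_token='[PAD]'):
    """Pad a token list to max_len and build the matching attention mask.

    Hypothetical helper for illustration; not a transformers function.
    """
    if len(tokens) > max_len:
        raise ValueError('token list is longer than max_len')
    padded = tokens + [pad_token] * (max_len - len(tokens))
    # 1 marks a real token, 0 marks padding that attention should ignore
    mask = [1 if tok != pad_token else 0 for tok in padded]
    return padded, mask


padded, mask = pad_and_mask(['[CLS]', 'i', 'love', 'paris', '[SEP]'], max_len=7)
print(padded)
print(mask)
```

Padding to a length passed in as an argument, rather than appending a fixed number of `[PAD]` tokens, lets the same code handle sentences of different lengths.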

In [6]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

In [7]:
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
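
The `unsqueeze(0)` call adds a leading batch dimension, since the model expects input of shape `(batch_size, sequence_length)`. The short sketch below shows the effect on the shape; the id values are illustrative placeholders assumed for this example rather than values read from the tokenizer.

```python
import torch

# Illustrative ids for a 7-token input; assumed values, not read from the tokenizer
ids = [101, 1045, 2293, 3000, 102, 0, 0]

flat = torch.tensor(ids)      # shape: (sequence_length,)
batched = flat.unsqueeze(0)   # shape: (1, sequence_length), a batch of one sentence

print(flat.shape)
print(batched.shape)
```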

### Extracting the embeddings

In [8]:
model(token_ids, attention_mask = attention_mask)

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.0719,  0.2163,  0.0047,  ..., -0.5865,  0.2262,  0.1981],
                                                        [ 0.2236,  0.6536, -0.2294,  ..., -0.3547,  0.5517, -0.2367],
                                                        [ 1.0410,  0.7755,  1.0335,  ..., -0.5621,  0.5218, -0.0852],
                                                        ...,
                                                        [ 0.6156,  0.1036, -0.1875,  ..., -0.3799, -0.7008, -0.3500],
                                                        [ 0.0791,  0.4287,  0.4147,  ..., -0.2417,  0.2403,  0.0378],
                                                        [-0.0165,  0.2459,  0.4566,  ..., -0.2179,  0.1876,  0.0228]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
     

In [9]:
output = model(token_ids, attention_mask = attention_mask)

The values of `last_hidden_state` and `pooler_output` are the same as when obtaining embeddings only from the top encoder layer; `hidden_states` is added as a third output.

In [10]:
last_hidden_state = output[0]
pooler_output = output[1]
hidden_states = output[2]
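
The same outputs can also be read by name instead of by position, which may be easier to follow. The sketch below uses a tiny, randomly initialized `BertModel` so that nothing needs to be downloaded; the configuration sizes and the names `tiny_model`, `ids`, and `mask` are assumptions chosen for illustration, and the values it produces are not meaningful embeddings.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized model (assumed sizes) so no checkpoint is downloaded
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout so the outputs are deterministic

ids = torch.tensor([[1, 2, 3, 4, 0]])
mask = torch.tensor([[1, 1, 1, 1, 0]])

with torch.no_grad():
    out = tiny_model(ids, attention_mask=mask, output_hidden_states=True)

# Positional access and attribute access return the same values
print(torch.equal(out[0], out.last_hidden_state))
print(torch.equal(out[1], out.pooler_output))
print(len(out[2]) == len(out.hidden_states))
```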

In [11]:
last_hidden_state.shape

torch.Size([1, 7, 768])

In [12]:
pooler_output.shape

torch.Size([1, 768])

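`hidden_states` contains the representations of all tokens from every encoder layer

- It is a tuple of 13 values holding the representations from the input embedding layer *h_0* through the final encoder layer *h_12*

  *hidden_states[i] holds the representation vectors of all tokens from the i-th layer h_i, so hidden_states[12] == last_hidden_state*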

In [13]:
len(hidden_states)

13

In [14]:
print(hidden_states[0].shape)
print(hidden_states[1].shape)

torch.Size([1, 7, 768])
torch.Size([1, 7, 768])


In [15]:
print(last_hidden_state == hidden_states[12])

tensor([[[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]]])
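
The count of 13 comes from the 12 encoder layers of `bert-base-uncased` plus the input embedding layer. As a hedged illustration of that relationship, the sketch below builds a small, randomly initialized `BertModel` with 3 encoder layers (an assumed configuration that requires no download) and checks that the tuple has `num_hidden_layers + 1` entries. It also uses `torch.equal`, which returns a single boolean instead of an element-wise tensor.

```python
import torch
from transformers import BertConfig, BertModel

# Assumed toy configuration: 3 encoder layers, hidden size 32
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=3,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config).eval()

ids = torch.tensor([[1, 2, 3, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0, 0]])

with torch.no_grad():
    out = tiny_model(ids, attention_mask=mask, output_hidden_states=True)

# One entry for the embedding layer plus one per encoder layer
num_states = len(out.hidden_states)
# The last entry is the output of the top encoder layer
same_as_last = torch.equal(out.hidden_states[-1], out.last_hidden_state)

print(num_states, config.num_hidden_layers + 1)
print(same_as_last)
```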
