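## Adding new tokens to a BERT model
Why
- BERT's pretraining corpus may differ from the domain of the downstream task. Adding domain-specific tokens gives those terms their own embeddings, which can then be tuned for the task and may improve model performance.
Steps
1. Add the new tokens to the tokenizer with add_tokens. At this point the model's embedding table has not been extended yet.
2. Call resize_token_embeddings on the model so that the embedding table is resized to match the tokenizer.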

In [3]:
from transformers import AutoModelForMaskedLM, AutoTokenizer


model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
# English example

print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))

['CO', '##VI', '##D']
['hospital', '##ization']


In [7]:
print(tokenizer)
print(len(tokenizer))  # vocab_size

PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
28996


> COVID is split into three subword pieces. If we want it to be treated as a single token, we can add it to the tokenizer.
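
To make the split above less mysterious, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. The vocabulary `toy_vocab`, the `added_tokens` set and the helper names are hypothetical and chosen only for illustration. The real tokenizer has a far larger vocabulary, and its handling of added tokens is more involved than the simple whole-word lookup assumed here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(word, vocab, added_tokens):
    """Assumption: added tokens are looked up as whole words before WordPiece runs."""
    if word in added_tokens:
        return [word]
    return wordpiece(word, vocab)


# Hypothetical toy vocabulary, just large enough to mirror the notebook output.
toy_vocab = {"CO", "##VI", "##D", "hospital", "##ization"}

print(tokenize("COVID", toy_vocab, added_tokens=set()))
print(tokenize("hospitalization", toy_vocab, added_tokens=set()))
print(tokenize("COVID", toy_vocab, added_tokens={"COVID"}))
```

Without the added token, the word falls back to the longest pieces found in the vocabulary. Once it is registered as an added token, it is returned whole.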

In [8]:
num_added_toks = tokenizer.add_tokens(['COVID', 'hospitalization'])
print('Added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))

Added 2 tokens


Embedding(28998, 768)

> The new tokens have been added: the embedding table grew from 28996 to 28998 rows (vocab_size + 2).
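
As a rough check of what resizing does to the embedding table, the sketch below builds a tiny, randomly initialized BERT so that nothing has to be downloaded. The sizes (vocabulary of 100, hidden size 32, and so on) are hypothetical and unrelated to bert-base-cased. It only checks that the table grows and that the existing rows are carried over unchanged.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny random model with hypothetical sizes (not the real bert-base-cased).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForMaskedLM(config)

# Keep a copy of the embedding table before resizing.
old_weights = model.get_input_embeddings().weight.detach().clone()

# Pretend the tokenizer gained 2 tokens: 100 -> 102 rows.
model.resize_token_embeddings(102)
new_weights = model.get_input_embeddings().weight.detach()

rows_preserved = torch.equal(new_weights[:100], old_weights)
print(tuple(old_weights.shape), "->", tuple(new_weights.shape))
print("existing rows preserved:", rows_preserved)
```

The two extra rows are freshly initialized rather than trained, so the new tokens usually only become useful after further fine-tuning on domain text.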

In [9]:
# COVID is now kept as a single token
tokenizer.tokenize('COVID')

['COVID']

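**Next, let's try Chinese**
- Suppose we want to add some domain-specific tokens
    - 政大經濟研究所 (the Graduate Institute of Economics at National Chengchi University)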

In [10]:
model_name = 'bert-base-chinese'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

print(tokenizer)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


PreTrainedTokenizerFast(name_or_path='bert-base-chinese', vocab_size=21128, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [11]:
tokenizer.tokenize('我是一個資料科學家，生活在台北，就讀政大經濟研究所')

['我',
 '是',
 '一',
 '個',
 '資',
 '料',
 '科',
 '學',
 '家',
 '，',
 '生',
 '活',
 '在',
 '台',
 '北',
 '，',
 '就',
 '讀',
 '政',
 '大',
 '經',
 '濟',
 '研',
 '究',
 '所']

In [12]:
num_added_toks = tokenizer.add_tokens(['政大經濟研究所', '資料科學家'])
print('Added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))

Added 2 tokens


Embedding(21130, 768)

In [17]:
tokenizer.tokenize('我是一個資料科學家，生活在台北，就讀政大經濟研究所')

['我',
 '是',
 '一',
 '個',
 '資',
 '料',
 '科',
 '學',
 '家',
 '，',
 '生',
 '活',
 '在',
 '台',
 '北',
 '，',
 '就',
 '讀',
 '政',
 '大',
 '經',
 '濟',
 '研',
 '究',
 '所']

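> Although the new tokens were added to the tokenizer, they had no effect in this run: the sentence is still split into single characters. Chinese BERT works at the character level, so it rarely meets a character it has not seen, and this technique is less needed for Chinese.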
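
The character-level behaviour can be sketched in a few lines. BERT's basic tokenization step puts whitespace around every CJK character before splitting, so each character ends up as its own token. The helpers `is_cjk` and `pad_cjk` below are hypothetical and simplified: they only check the main CJK Unified Ideographs block, while the real tokenizer covers more Unicode ranges. This illustrates the character-level splitting only. It is not a confirmed explanation of why the added tokens were ignored above.

```python
def is_cjk(ch):
    # Simplification: only the main CJK Unified Ideographs block (U+4E00..U+9FFF).
    return 0x4E00 <= ord(ch) <= 0x9FFF


def pad_cjk(text):
    """Surround every CJK character with spaces so whitespace splitting isolates it."""
    return "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)


chinese = pad_cjk("就讀政大經濟研究所").split()
mixed = pad_cjk("BERT模型").split()

print(chinese)  # every character becomes its own token
print(mixed)    # the Latin word stays whole, the CJK characters are separated
```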

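- That said, word-based Chinese BERT models do exist; see [this article](https://www.jiqizhixin.com/articles/2020-09-25-2) for details
    - It also compares the strengths and weaknesses of the two approaches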