### Understanding the tokenizer

In [None]:
!huggingface-cli download --resume-download google-bert/bert-base-uncased --local-dir ../model/bert-base-uncased --local-dir-use-symlinks False

In [1]:
from transformers import AutoModel, AutoTokenizer



In [20]:
model_name = "../model/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = "this is a test sentence"

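### model
1. Detailed model structure
- model

2. Overall model configuration
- model.config

3. Model parameters
- model.num_parameters()
- eps in each LayerNorm is the $\epsilon$ added to the variance for numerical stability
- elementwise_affine=True means the LayerNorm has learnable per-element weight and bias
$$\mathrm{output}=\mathrm{weight}\cdot\frac{\mathrm{input}-\mu}{\sqrt{\sigma^2+\epsilon}}+\mathrm{bias}$$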
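The LayerNorm formula can be checked by hand on a small vector. The following is a minimal pure-Python sketch, not the PyTorch implementation; the function name `layer_norm`, the toy input, and the identity `weight`/`bias` values are assumptions made for illustration. The default `eps=1e-12` is chosen to mirror the `layer_norm_eps` value that `model.config` is expected to report for BERT.

```python
import math

def layer_norm(x, weight, bias, eps=1e-12):
    """Normalize one vector x, then apply the per-element affine transform."""
    mean = sum(x) / len(x)
    # Biased variance (divide by N), as used by layer normalization.
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * inv_std + b for v, w, b in zip(x, weight, bias)]

x = [1.0, 2.0, 3.0, 4.0]   # toy "hidden state" with hidden_size = 4
weight = [1.0] * len(x)    # identity scale (hypothetical values)
bias = [0.0] * len(x)      # zero shift (hypothetical values)

y = layer_norm(x, weight, bias)
print([round(v, 4) for v in y])  # [-1.3416, -0.4472, 0.4472, 1.3416]
```

With identity `weight` and zero `bias`, the output has zero mean and unit variance; the learned parameters then rescale and shift each element independently.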

### model(**tokens)
1. What model(**tokens) returns
- The output contains last_hidden_state and pooler_output
- last_hidden_state has shape (batch_size, sequence_length, hidden_size)
- pooler_output has shape (batch_size, hidden_size)
- Either one can feed a downstream task such as classification
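To make the two shapes concrete, here is a rough pure-Python sketch of how a BERT-style pooler reduces `last_hidden_state` to `pooler_output`: it takes the hidden state at position 0 (the `[CLS]` token), applies a dense layer, and then `tanh`. The toy sizes, the random stand-in tensors, and the names `W`, `b`, and `pool` are assumptions for illustration only; in the real model the weights are learned and `hidden_size` is 768.

```python
import math
import random

random.seed(0)
batch_size, seq_len, hidden_size = 2, 3, 4  # toy sizes (assumption)

# Stand-in for last_hidden_state: shape (batch_size, seq_len, hidden_size).
last_hidden_state = [[[random.uniform(-1, 1) for _ in range(hidden_size)]
                      for _ in range(seq_len)]
                     for _ in range(batch_size)]

# Hypothetical pooler parameters: random here, learned in the real model.
W = [[random.uniform(-1, 1) for _ in range(hidden_size)]
     for _ in range(hidden_size)]
b = [0.0] * hidden_size

def pool(sequence):
    """Return tanh(W @ h_cls + b), where h_cls is the vector at position 0."""
    h_cls = sequence[0]
    return [math.tanh(sum(w * h for w, h in zip(row, h_cls)) + bias)
            for row, bias in zip(W, b)]

pooler_output = [pool(seq) for seq in last_hidden_state]
print(len(pooler_output), len(pooler_output[0]))  # 2 4
```

The sequence dimension disappears because only the first position is used, which is why `pooler_output` is (batch_size, hidden_size).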

In [31]:
tokens = tokenizer(sentences, truncation=True, padding=True, max_length=256, return_tensors="pt")
tokens

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 3231, 6251,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [14]:
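# Only the value of the last expression is displayed below.
tokenizer.tokenize(sentences)  # split the text into WordPiece tokens
tokenizer.encode(sentences)    # token IDs, with [CLS] and [SEP] added
# The hard-coded IDs differ from those of `sentences`: 8915 and 13462 are the pieces 'te' and '##set'.
tokenizer.decode([101, 2023, 2003, 1037, 8915, 13462, 6251, 102])
tokenizer.convert_ids_to_tokens([101, 2023, 2003, 1037, 8915, 13462, 6251, 102])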

['[CLS]', 'this', 'is', 'a', 'te', '##set', 'sentence', '[SEP]']
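The `'te'`, `'##set'` pair in the output above is what WordPiece produces for a word that is not in the vocabulary as a whole. A minimal sketch of the greedy longest-match-first idea follows. The toy `vocab`, the function name `wordpiece`, and the misspelled input word `teset` are assumptions chosen to reproduce that split; the real tokenizer also normalizes text and handles punctuation, which is omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption): a tiny subset, not the real BERT vocabulary.
vocab = {"this", "is", "a", "test", "te", "##set", "sentence", "[UNK]"}

tokens = [p for w in "this is a teset sentence".split() for p in wordpiece(w, vocab)]
print(tokens)  # ['this', 'is', 'a', 'te', '##set', 'sentence']
```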

In [40]:
tokenizer(sentences)

{'input_ids': [101, 2023, 2003, 1037, 3231, 6251, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [19]:
tokenizer.convert_tokens_to_ids([ 'this', 'is', 'a', 'test',  'sentence'])

[2023, 2003, 1037, 3231, 6251]

In [21]:
tokenizer.convert_tokens_to_ids(sentences.split())

[2023, 2003, 1037, 3231, 6251]

In [26]:
tokenizer.special_tokens_map.values()
tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map.values())

[100, 102, 0, 101, 103]

In [54]:
import torch

In [None]:
# Inference only: run the forward pass once, without tracking gradients.
with torch.no_grad():
    output = model(**tokens)
    print(output)

In [59]:
output.pooler_output.shape

torch.Size([1, 768])

In [65]:
from transformers import BertForSequenceClassification

sequence_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

with torch.no_grad():
    output = sequence_model(**tokens)
    softmax = output.logits.softmax(dim=-1)
    test = torch.softmax(output.logits, dim=-1)
    
    print(output)
    print(softmax)
    print(test)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ../model/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=None, logits=tensor([[0.0987, 0.1810]]), hidden_states=None, attentions=None)
tensor([[0.4794, 0.5206]])
tensor([[0.4794, 0.5206]])
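The probabilities printed above can be reproduced by hand from the logits. This is a small pure-Python sketch of the softmax computation, using the rounded logits from the output; the function name `softmax` here is a local helper, not the PyTorch one.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so that the values sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0987, 0.1810]  # rounded values copied from the output above
probs = softmax(logits)
print([round(p, 4) for p in probs])  # [0.4794, 0.5206]
```

Because the classifier head is newly initialized, these near-uniform probabilities carry no meaning until the model is fine-tuned, as the warning in the output says.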


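### tokenizer
1. Inspect how text is split:
- tokenizer.tokenize(sentences)

2. Encoding and decoding
- tokenizer.encode(sentences) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences)) with the special tokens [CLS] and [SEP] added
- tokenizer.decode(ids) turns IDs back into a single string, whereas tokenizer.convert_ids_to_tokens(ids) returns the list of tokens

3. Special tokens
- tokenizer.special_tokens_map

4. Vocabulary
- tokenizer.vocab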
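The relationship between these methods can be sketched with a toy vocabulary built from the IDs that appear in the outputs earlier in this notebook. This is a simplified stand-in rather than the library's implementation: the functions below are local helpers that only mimic the method names, tokenization is plain whitespace splitting, and the merging of `##` pieces during decoding is omitted.

```python
# Toy vocabulary (assumption): only the tokens and IDs seen in this notebook.
vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003,
         "a": 1037, "test": 3231, "sentence": 6251}
id_to_token = {i: t for t, i in vocab.items()}

def tokenize(text):
    return text.lower().split()  # simplification: whitespace splitting only

def convert_tokens_to_ids(tokens):
    return [vocab[t] for t in tokens]

def convert_ids_to_tokens(ids):
    return [id_to_token[i] for i in ids]

def encode(text):
    # encode = tokenize + convert_tokens_to_ids, wrapped in [CLS] ... [SEP]
    return [vocab["[CLS]"]] + convert_tokens_to_ids(tokenize(text)) + [vocab["[SEP]"]]

def decode(ids):
    # decode = convert_ids_to_tokens, then join the tokens into one string
    return " ".join(convert_ids_to_tokens(ids))

ids = encode("this is a test sentence")
print(ids)          # [101, 2023, 2003, 1037, 3231, 6251, 102]
print(decode(ids))  # [CLS] this is a test sentence [SEP]
```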