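# Sequence Labeling: Token Classification

Sequence labeling covers every problem that can be framed as "assign a label to each word in a sentence", for example:
- Named entity recognition (NER)
- Part-of-speech tagging (POS)
- Chunking: finding the words that belong to the same entity (or, more generally, the same phrase)

In the rest of this notebook, we fine-tune a BERT model for the NER task.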

In [40]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install --upgrade datasets



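## 1 Preparing the data

We will use the CoNLL-2003 dataset (`conll2003`), which comes from a shared task on language-independent named entity recognition. It covers four types of named entity: persons, locations, organizations, and miscellaneous names that do not belong to the previous three types.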

In [41]:
from pprint import pprint
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")
pprint(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [42]:
for k, v in raw_datasets["train"][0].items():
  print(f"{k}: {v}")

id: 0
tokens: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
pos_tags: [22, 42, 16, 21, 35, 37, 16, 21, 7]
chunk_tags: [11, 21, 11, 12, 21, 22, 11, 12, 0]
ner_tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]


In [43]:
ner_features = raw_datasets["train"].features["ner_tags"]
ner_label_names = ner_features.feature.names
print(ner_label_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


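The NER labels above have the following meanings:
- `O`: the word does not correspond to any entity;
- `B-PER` / `I-PER`: the word is at the beginning of / inside a person entity;
- `B-ORG` / `I-ORG`: the word is at the beginning of / inside an organization entity;
- `B-LOC` / `I-LOC`: the word is at the beginning of / inside a location entity;
- `B-MISC` / `I-MISC`: the word is at the beginning of / inside a miscellaneous entity.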
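
To make the B-/I-/O scheme concrete, here is a minimal sketch that groups tagged words back into entity spans. The helper name `extract_entities` is hypothetical (it is not part of `datasets` or `transformers`), and the words and tags are the first seven tokens of training example 4 from this dataset.

```python
def extract_entities(words, tags):
    """Group BIO-tagged words into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or an I- tag with no matching open span) opens a new entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O" closes whatever entity was open.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


words = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
tags = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
entities = extract_entities(words, tags)
print(entities)
```

The two-word organization "European Union" comes back as one span because its second word carries the `I-ORG` tag.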

### 1.1 Visualizing the NER labels

In [44]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["ner_tags"]
print(words)
print(labels)
words_line = ""
labels_line = ""
for word, label in zip(words, labels):
    full_label = ner_label_names[label]
    max_length = max(len(word), len(full_label))
    words_line += word + " " * (max_length - len(word) + 1)
    labels_line += full_label + " " * (max_length - len(full_label) + 1)
print(words_line)
print(labels_line)

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
[5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
B-LOC   O  O              O  O   B-ORG    I-ORG O  O          O         B-PER  I-PER     O    O  O         O         O      O   O         O    O         O     O    B-LOC   O     O   O          O      O   O       O 


### 1.2 Visualizing the POS tags

In [45]:
def visualize_token_with_labels(words, label_indice, label_names):
    words_line = ""
    labels_line = ""
    for word, label_index in zip(words, label_indice):
        label_name = label_names[label_index]
        max_length = max(len(word), len(label_name))
        words_line += word + " " * (max_length - len(word) + 1)
        labels_line += label_name + " " * (max_length - len(label_name) + 1)
    print(words_line)
    print(labels_line)

In [46]:
pos_features = raw_datasets["train"].features["pos_tags"]
pos_label_names = pos_features.feature.names
print(pos_label_names)

['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


In [47]:
words = raw_datasets["train"][4]["tokens"]
label_indice = raw_datasets["train"][4]["pos_tags"]
print(words)
print(label_indice)
visualize_token_with_labels(words, label_indice, pos_label_names)

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
[22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 22, 38, 15, 22, 24, 20, 37, 21, 15, 24, 16, 15, 22, 15, 12, 16, 21, 38, 17, 7]
Germany 's  representative to the European Union 's  veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
NNP     POS NN             TO DT  NNP      NNP   POS JJ         NN        NNP    NNP       VBD  IN NNP       NNS       MD     VB  NN        IN   NNS       JJ    IN   NNP     IN    DT  JJ         NN     VBD JJR     . 


The POS tags follow the **Penn Treebank** (University of Pennsylvania) tagset, one of the most widely used part-of-speech annotation schemes in natural language processing (NLP). Their meanings are listed below:

---

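**1. Punctuation and special symbols**

| Tag | Meaning | Example |
|-----|---------|---------|
| `"`  | Straight double quotation mark | `"Hello"` |
| `''` | Closing quotation mark (single or double) | end of a quoted span |
| `#`  | Pound / number sign | `#1` |
| `$`  | Currency sign | `$100` |
| `(`  | Opening bracket | `(example)` |
| `)`  | Closing bracket | `(example)` |
| `,`  | Comma | `A, B, C` |
| `.`  | Sentence-final punctuation | `The end.` |
| `:`  | Colon and other mid-sentence separators | `Reason: ...` |
| ``` `` ``` | Opening quotation mark (single or double) | start of a quoted span |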

---

**2. Main part-of-speech tags**

**Conjunctions and determiners**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `CC`  | Coordinating Conjunction | `and`, `or`, `but` |
| `DT`  | Determiner | `the`, `a`, `this` |
| `PDT` | Predeterminer | `all`, `both` (`all the students`) |
| `IN`  | Preposition / Subordinating Conjunction | `in`, `on`, `because` |

**Nouns**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `NN`   | Noun, Singular | `cat`, `book` |
| `NNS`  | Noun, Plural | `cats`, `books` |
| `NNP`  | Proper Noun, Singular | `John`, `London` |
| `NNPS` | Proper Noun, Plural | `Americans`, `Alps` |
| `NN\|SYM` | Ambiguous tag: noun or symbol | rare; used when a token could be read either way |

**Pronouns**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `PRP`  | Personal Pronoun | `I`, `you`, `he` |
| `PRP$` | Possessive Pronoun | `my`, `your`, `his` |
| `WP`   | Wh-Pronoun | `who`, `what` |
| `WP$`  | Possessive Wh-Pronoun | `whose` |

**Verbs**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `VB`  | Verb, Base Form | `be`, `run` |
| `VBD` | Verb, Past Tense | `was`, `ran` |
| `VBG` | Verb, Gerund / Present Participle | `being`, `running` |
| `VBN` | Verb, Past Participle | `been`, `eaten` |
| `VBP` | Verb, Non-3rd Person Singular Present | `am`, `run` |
| `VBZ` | Verb, 3rd Person Singular Present | `is`, `runs` |
| `MD`  | Modal Verb | `can`, `must` |

**Adjectives and adverbs**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `JJ`   | Adjective | `happy`, `big` |
| `JJR`  | Adjective, Comparative | `happier`, `bigger` |
| `JJS`  | Adjective, Superlative | `happiest`, `biggest` |
| `RB`   | Adverb | `quickly`, `very` |
| `RBR`  | Adverb, Comparative | `faster`, `more` |
| `RBS`  | Adverb, Superlative | `fastest`, `most` |
| `RP`   | Particle (follows a verb) | the `up` in `give up` |

**Other**

| Tag | Full name | Examples |
|-----|-----------|----------|
| `CD`  | Cardinal Number | `one`, `1` |
| `EX`  | Existential `there` | `There is...` |
| `FW`  | Foreign Word | `bonjour` |
| `LS`  | List Item Marker | `1.`, `A.` |
| `POS` | Possessive Ending | `'s` (`John's`) |
| `SYM` | Symbol | `%`, `&` |
| `TO`  | The word `to` (infinitive marker or preposition) | `to go`; `representative to the European Union` |
| `UH`  | Interjection | `Oh`, `Hello` |
| `WDT` | Wh-Determiner | `which`, `that` |
| `WRB` | Wh-Adverb | `when`, `where` |

---

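**References**
- **Penn Treebank tagset**:  
  The tag definitions come from the University of Pennsylvania Treebank project; see the [Penn Treebank Tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
- **NLTK and spaCy**:  
  Mainstream toolkits such as NLTK and spaCy use this tagset for English, and the tag descriptions can be looked up from code (for example `nltk.help.upenn_tagset()`).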

### 1.3 Visualizing the chunk tags

In [48]:
chunk_features = raw_datasets["train"].features["chunk_tags"]
chunk_label_names = chunk_features.feature.names
print(chunk_label_names)

['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']


In [49]:
words = raw_datasets["train"][4]["tokens"]
label_indice = raw_datasets["train"][4]["chunk_tags"]
print(words)
print(label_indice)
visualize_token_with_labels(words, label_indice, chunk_label_names)

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
[11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 12, 21, 13, 11, 12, 21, 22, 11, 13, 11, 1, 13, 11, 17, 11, 12, 12, 21, 1, 0]
Germany 's   representative to   the  European Union 's   veterinary committee Werner Zwingmann said on   Wednesday consumers should buy  sheepmeat from countries other  than Britain until  the  scientific advice was  clearer . 
B-NP    B-NP I-NP           B-PP B-NP I-NP     I-NP  B-NP I-NP       I-NP      I-NP   I-NP      B-VP B-PP B-NP      I-NP      B-VP   I-VP B-NP      B-PP B-NP      B-ADJP B-PP B-NP    B-SBAR B-NP I-NP       I-NP   B-VP B-ADJP  O 


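The chunk tags use the **IOB (Inside, Outside, Begin) format** and mark phrase-level units such as noun phrases and verb phrases. The same tag set is used in the **CoNLL-2000** chunking shared task, whose chunks are derived from **Penn Treebank** phrase structure, and it is the basis of shallow parsing.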

---

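**1. The basic format**
- **`O`**: the word is **outside any chunk** (Outside).
- **`B-XXX`**: the word is the **first word** of a chunk of type XXX (Begin).
- **`I-XXX`**: the word is **inside** a chunk of type XXX, after its first word (Inside).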
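
As a small illustration of the format, the sketch below turns a tagged sequence into bracket notation. The helper `bracket_chunks` is a hypothetical name introduced here, and the input is the first seven tokens of training example 4 with the chunk tags printed above.

```python
def bracket_chunks(words, chunk_tags):
    """Render IOB chunk tags in bracket notation, e.g. "[NP the cat] [VP sat]"."""
    pieces = []
    open_type, open_words = None, []

    def flush():
        # Emit the chunk that is currently open, if any.
        nonlocal open_type, open_words
        if open_words:
            pieces.append(f"[{open_type} {' '.join(open_words)}]")
        open_type, open_words = None, []

    for word, tag in zip(words, chunk_tags):
        if tag == "O":
            flush()
            pieces.append(word)  # words outside any chunk are printed bare
        elif tag.startswith("I-") and tag[2:] == open_type:
            open_words.append(word)  # continue the open chunk
        else:
            # B-XXX, or a stray I-XXX with no matching open chunk: start a new chunk.
            flush()
            open_type, open_words = tag[2:], [word]
    flush()
    return " ".join(pieces)


words = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
chunk_tags = ["B-NP", "B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP"]
bracketed = bracket_chunks(words, chunk_tags)
print(bracketed)
```

Note that the `PP` chunk contains only the preposition `to`; the noun phrase after it is a separate `NP` chunk.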

---

**2. The individual chunk tags**

**Phrase types**

| Tag           | Full name                | Meaning                                   | Example |
|---------------|--------------------------|-------------------------------------------|---------|
| **`B-ADJP`**  | Adjective Phrase         | First word of an adjective phrase         | `very/B-ADJP happy/I-ADJP` |
| **`I-ADJP`**  | Adjective Phrase         | Later word of an adjective phrase         | `very/B-ADJP happy/I-ADJP` |
| **`B-ADVP`**  | Adverb Phrase            | First word of an adverb phrase            | `quite/B-ADVP quickly/I-ADVP` |
| **`I-ADVP`**  | Adverb Phrase            | Later word of an adverb phrase            | `quite/B-ADVP quickly/I-ADVP` |
| **`B-NP`**    | Noun Phrase              | First word of a noun phrase               | `the/B-NP cat/I-NP` |
| **`I-NP`**    | Noun Phrase              | Later word of a noun phrase               | `the/B-NP cat/I-NP` |
| **`B-VP`**    | Verb Phrase              | First word of a verb phrase               | `is/B-VP running/I-VP` |
| **`I-VP`**    | Verb Phrase              | Later word of a verb phrase               | `is/B-VP running/I-VP` |
| **`B-PP`**    | Prepositional Phrase     | The preposition itself; the following noun phrase is a separate NP chunk | `in/B-PP the/B-NP room/I-NP` |
| **`I-PP`**    | Prepositional Phrase     | Later word of a multi-word preposition    | typically `because/B-PP of/I-PP` |

**Other, less common phrase types**

| Tag           | Full name                | Meaning                                   | Example |
|---------------|--------------------------|-------------------------------------------|---------|
| **`B-CONJP`** | Conjunction Phrase       | First word of a multi-word conjunction    | `as/B-CONJP well/I-CONJP as/I-CONJP` |
| **`I-CONJP`** | Conjunction Phrase       | Later word of a multi-word conjunction    | `as/B-CONJP well/I-CONJP as/I-CONJP` |
| **`B-INTJ`**  | Interjection             | First word of an interjection             | `oh/B-INTJ no/I-INTJ` |
| **`I-INTJ`**  | Interjection             | Later word of an interjection             | `oh/B-INTJ no/I-INTJ` |
| **`B-LST`**   | List Marker              | First token of a list marker (the marker only, not the item) | `1./B-LST` |
| **`I-LST`**   | List Marker              | Later token of a multi-token list marker  | rare |
| **`B-PRT`**   | Particle                 | Particle of a phrasal verb; the verb itself is in a VP chunk | `give/B-VP up/B-PRT` |
| **`I-PRT`**   | Particle                 | Later word of a multi-word particle       | rare |
| **`B-SBAR`**  | Subordinate Clause       | The subordinating word only; the clause body is chunked separately | `because/B-SBAR he/B-NP left/B-VP` |
| **`I-SBAR`**  | Subordinate Clause       | Later word of a multi-word subordinator   | typically `even/B-SBAR though/I-SBAR` |
| **`B-UCP`**   | Unlike Coordinated Phrase| First word of a coordination of phrases of different categories | rare |
| **`I-UCP`**   | Unlike Coordinated Phrase| Later word of such a coordination         | rare |

---

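**3. Example**

Sentence:  
`The cat is running very quickly in the garden.`

Tags (simplified):  
- `The` → `B-NP`  
- `cat` → `I-NP`  
- `is` → `B-VP`  
- `running` → `I-VP`  
- `very` → `B-ADVP`  
- `quickly` → `I-ADVP`  
- `in` → `B-PP`  
- `the` → `B-NP`  
- `garden` → `I-NP`  
- `.` → `O`  

In this scheme the PP chunk contains only the preposition, and the noun phrase after it starts a new NP chunk. The dataset output above shows the same pattern: `to` is `B-PP`, followed by `the` (`B-NP`), `European` (`I-NP`), `Union` (`I-NP`).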

---

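**4. References**
- **CoNLL-2000 shared task**:  
  The chunk annotation scheme comes from the [CoNLL-2000 corpus](https://www.clips.uantwerpen.be/conll2000/chunking/), which is widely used to train sequence labeling models.
- **Penn Treebank phrase structure**:  
  For more detailed phrase definitions, see the [Penn Treebank annotation reference](https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html).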

## 2 Processing the data

### 2.1 Loading the tokenizer

In [50]:
from transformers import AutoTokenizer

ckpt = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
assert tokenizer.is_fast

### 2.2 Tokenization

In [51]:
train_set = raw_datasets["train"]
inputs = tokenizer(train_set[0]["tokens"], is_split_into_words=True)
print(inputs.tokens())

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']


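As we can see, "lamb" was split into "la" and "##mb". The NER labels, however, correspond one-to-one to whole words such as "lamb", so how do we keep track of the fact that "la" and "##mb" belong together? The answer is the `word_ids()` method of the tokenizer output: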

In [52]:
print(inputs.word_ids())
pprint(inputs)

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               7270,
               22961,
               1528,
               1840,
               1106,
               21423,
               1418,
               2495,
               12913,
               119,
               102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


As we can see, "la" and "##mb" both have word id 7, which makes it easy to align the token sequence with the NER label sequence.
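
A minimal sketch of how the word ids can be used: the snippet below groups the sub-word tokens by word id and rebuilds the original words. It hard-codes the `tokens` and `word_ids` printed above so that it runs without downloading the tokenizer; stripping the `##` prefix is the WordPiece convention used by this BERT checkpoint.

```python
from collections import defaultdict

# Values copied from the tokenizer output shown above.
tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott',
          'British', 'la', '##mb', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

pieces_by_word = defaultdict(list)
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue  # special tokens such as [CLS] and [SEP] belong to no word
    pieces_by_word[word_id].append(token)

# Rebuild each word by dropping the "##" continuation prefix.
rebuilt = [
    "".join(p[2:] if p.startswith("##") else p for p in pieces_by_word[i])
    for i in sorted(pieces_by_word)
]
print(pieces_by_word[7])
print(rebuilt)
```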

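### 2.3 Aligning tokens and NER labels

With a little work, we can expand the label list so that it matches the tokens. The first rule is that special tokens get the label -100, because -100 is by default the index ignored by the loss function we will use (cross entropy). Next, the first token of each word gets the label of that word. In this label list every `B-` label has an odd index and the matching `I-` label is the next index, which is what the `label % 2 == 1` check below relies on. For tokens inside a word that are not its first token, we keep the word's label but replace `B-` with `I-` (since such a token does not begin the entity):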

In [53]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels
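
The -100 label works because cross entropy skips those positions and averages over the remaining ones. The pure-Python sketch below illustrates that behaviour under simplifying assumptions: `masked_cross_entropy` is a hypothetical helper (not the PyTorch implementation), the three-class logits are made up, and returning 0.0 when every position is ignored is a choice of this sketch.

```python
import math

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross-entropy loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and padding contribute nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))  # log-sum-exp
        total += log_norm - scores[label]  # -log softmax(scores)[label]
        count += 1
    return total / count if count else 0.0


# Four positions, three classes; the first and last play the role of [CLS] and [SEP].
logits = [[9.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [0.0, 0.0, 9.0]]
labels = [-100, 0, 1, -100]

loss = masked_cross_entropy(logits, labels)
loss_without_specials = masked_cross_entropy(logits[1:3], labels[1:3])
print(loss, loss_without_specials)
```

Both values are identical: whatever the model predicts at the -100 positions has no effect on the loss.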

In [54]:
words = train_set[0]["tokens"]
ner_label_indice = train_set[0]["ner_tags"]
visualize_token_with_labels(words, ner_label_indice, ner_label_names)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


In [55]:
word_ids = tokenizer(words, is_split_into_words=True).word_ids()
ner_label_indice_aligned = align_labels_with_tokens(ner_label_indice, word_ids)
print(ner_label_indice)
print(ner_label_indice_aligned)

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
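
In the example above the only split word, "lamb", is labelled `O`, so the `B-` to `I-` rule is never exercised. The sketch below repeats the function and feeds it hypothetical `word_ids` in which the first word (label 3, `B-ORG`) is assumed to be split into two tokens; these ids are made up for illustration and are not real tokenizer output.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First token of a new word (or a special token after a word).
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Later token of the same word: B-XXX (odd index) becomes I-XXX.
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


ner_label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                   'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

word_labels = [3, 0]                        # B-ORG, O
hypothetical_word_ids = [None, 0, 0, 1, None]  # word 0 split into two tokens

aligned = align_labels_with_tokens(word_labels, hypothetical_word_ids)
print(aligned)
print([ner_label_names[i] if i != -100 else "IGN" for i in aligned])
```

The second token of the split word receives label 4 (`I-ORG`) instead of 3 (`B-ORG`), so the entity still has exactly one beginning.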


### 2.4 Tokenizing and aligning

In [56]:
def tokenize_and_align_labels(batch_samples):
    batch_inputs = tokenizer(
        batch_samples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    batch_labels = batch_samples["ner_tags"]
    new_labels = []  # List[List[int]]
    for i, labels in enumerate(batch_labels):
        word_ids = batch_inputs.word_ids(i)  # word ids of the i-th sentence
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    batch_inputs["labels"] = new_labels
    return batch_inputs


In [57]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


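## 3 Fine-tuning the model with the Trainer API

The code that drives the `Trainer` is the same as before; the only changes are how the data is collated into batches and the metric computation function.

### 3.1 Data collation

We cannot just use `DataCollatorWithPadding` as in Chapter 3, because it only pads the inputs (input IDs, attention mask, and token type IDs). Here the labels must be padded in exactly the same way as the inputs so that they stay the same size, using -100 as the value so that the corresponding predictions are ignored in the loss computation.

All of this is done by `DataCollatorForTokenClassification`. Like `DataCollatorWithPadding`, it takes the tokenizer used to preprocess the inputs: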

In [58]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [59]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
pprint(batch["labels"])

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])


In [60]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]


As we can see, the second set of labels has been padded with -100 to match the length of the first one.
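
The label-padding step can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `pad_label_batch` is a hypothetical helper, it assumes padding on the right, and it returns lists rather than tensors.

```python
def pad_label_batch(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# The two unpadded label lists printed above.
batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]
padded = pad_label_batch(batch_labels)
for row in padded:
    print(row)
```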

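### 3.2 Computing metrics

#### 3.2.1 Installing seqeval

The traditional framework for evaluating token classification predictions is seqeval. To use this metric, we first need to install the seqeval library: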

In [61]:
!pip install seqeval



In [62]:
import evaluate

metric = evaluate.load("seqeval")

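#### 3.2.2 Trying out the metric

This metric does not behave like standard accuracy: it takes the lists of labels as strings, not integers, so we need to fully decode the predictions and labels before passing them to the metric.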

In [63]:
ner_label_names = raw_datasets["train"].features["ner_tags"].feature.names
label_indice = raw_datasets["train"][0]["ner_tags"]
labels = [ner_label_names[i] for i in label_indice]
print(ner_label_names)
print(labels)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


We can then create fake predictions by changing only the value at index 2:

In [64]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': np.float64(1.0),
  'recall': np.float64(0.5),
  'f1': np.float64(0.6666666666666666),
  'number': np.int64(2)},
 'ORG': {'precision': np.float64(1.0),
  'recall': np.float64(1.0),
  'f1': np.float64(1.0),
  'number': np.int64(1)},
 'overall_precision': np.float64(1.0),
 'overall_recall': np.float64(0.6666666666666666),
 'overall_f1': np.float64(0.8),
 'overall_accuracy': 0.8888888888888888}
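
The overall numbers above are entity-level scores: an entity counts as correct only if its type and both boundaries match. The sketch below reproduces them for this example with a simplified span matcher. It is not seqeval's code, the helpers `entity_spans` and `entity_prf` are hypothetical names, and the handling of malformed tag sequences may differ from seqeval's.

```python
def entity_spans(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == kind
        if kind is not None and not continues:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag != "O" and not continues:
            start, kind = i, tag[2:]
    return spans


def entity_prf(references, predictions):
    """Entity-level precision, recall and F1 for one sentence."""
    gold, pred = entity_spans(references), entity_spans(predictions)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predictions = labels.copy()
predictions[2] = "O"  # the same fake prediction as above

precision, recall, f1 = entity_prf(labels, predictions)
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(precision, recall, f1, accuracy)
```

Two of the three gold entities are found and nothing spurious is predicted, which gives precision 1.0 and recall 2/3, while the token-level accuracy is 8/9.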

#### 3.2.3 Defining the metric computation function

In [65]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[ner_label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [ner_label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
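
To see what the filtering step in `compute_metrics` does, here is a dependency-free sketch on a toy batch. The three-label list and the logits are made up for illustration, and `decode` is a hypothetical helper that mirrors the two list comprehensions above with plain loops.

```python
toy_label_names = ["O", "B-PER", "I-PER"]  # shortened, hypothetical label list


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode(logits, labels, names):
    """Drop positions labelled -100 and map the rest to label strings."""
    true_labels, true_predictions = [], []
    for sent_logits, sent_labels in zip(logits, labels):
        kept_labels, kept_preds = [], []
        for scores, label in zip(sent_logits, sent_labels):
            if label == -100:
                continue  # special token or padding: excluded from evaluation
            kept_labels.append(names[label])
            kept_preds.append(names[argmax(scores)])
        true_labels.append(kept_labels)
        true_predictions.append(kept_preds)
    return true_labels, true_predictions


# One sentence with five positions; the first and last are special tokens.
logits = [[[5.0, 0.1, 0.2], [0.3, 4.0, 0.1], [0.2, 0.1, 3.0],
           [0.0, 2.0, 0.0], [9.0, 0.0, 0.0]]]
labels = [[-100, 1, 2, 0, -100]]

true_labels, true_predictions = decode(logits, labels, toy_label_names)
print(true_labels)
print(true_predictions)
```

The two nested lists have the same shape, which is what the metric expects, and the predictions at the special-token positions are discarded.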

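### 3.3 Defining the model

Since this is a token classification problem, we use the `AutoModelForTokenClassification` class. The main thing to remember when defining the model is to tell it how many labels we have. The simplest way is the `num_labels` argument. Here we pass the `id2label` and `label2id` mappings instead, so that the model also knows the label names; the number of labels is then derived from the size of `id2label`.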

In [66]:
print(ner_label_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In [67]:
id2label = {i: label for i, label in enumerate(ner_label_names)}
label2id = {label: i for i, label in id2label.items()}

In [68]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    ckpt,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [69]:
print(model.config.num_labels)

9


### 3.4 Fine-tuning the model

In [70]:
from huggingface_hub import notebook_login

notebook_login()


In [71]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    # Named `evaluation_strategy` in older transformers releases.
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True
)

In [72]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)

In [73]:
trainer.train()

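| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.0773 | 0.062488 | 0.907712 | 0.93689 | 0.92207 | 0.982266 |
| 2 | 0.0351 | 0.065198 | 0.929044 | 0.945305 | 0.937104 | 0.985209 |
| 3 | 0.0211 | 0.062345 | 0.934394 | 0.949175 | 0.941726 | 0.985901 |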
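
As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall of each epoch; the reported values are rounded to six decimals, so small differences are expected.

```python
# (epoch, precision, recall, reported F1) copied from the results above.
rows = [
    (1, 0.907712, 0.93689, 0.92207),
    (2, 0.929044, 0.945305, 0.937104),
    (3, 0.934394, 0.949175, 0.941726),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


differences = []
for epoch, precision, recall, reported_f1 in rows:
    recomputed = f1_score(precision, recall)
    differences.append(abs(recomputed - reported_f1))
    print(f"epoch {epoch}: recomputed F1 = {recomputed:.6f}, reported = {reported_f1}")
```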


TrainOutput(global_step=5268, training_loss=0.06667736867384741, metrics={'train_runtime': 228.9601, 'train_samples_per_second': 183.975, 'train_steps_per_second': 23.008, 'total_flos': 920771584279074.0, 'train_loss': 0.06667736867384741, 'epoch': 3.0})
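
The `global_step` value can be checked with simple arithmetic. The notebook does not set a batch size, so the sketch below assumes the `TrainingArguments` default of 8 examples per device, a single device, and no gradient accumulation; under different settings the numbers would not line up.

```python
import math

num_train_examples = 14041   # size of the training split shown earlier
num_train_epochs = 3         # as set in TrainingArguments above
assumed_batch_size = 8       # assumption: default per-device batch size, one device

steps_per_epoch = math.ceil(num_train_examples / assumed_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)

# Cross-check with the reported throughput: samples per second times runtime
# should be close to the number of examples seen over all epochs.
samples_seen = 183.975 * 228.9601
print(round(samples_seen), num_train_examples * num_train_epochs)
```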

In [74]:
trainer.push_to_hub(commit_message="Training complete.")


CommitInfo(commit_url='https://huggingface.co/f499d5/bert-finetuned-ner/commit/3bf10f7383ef8fb1f97f9c3e89a264295c156a30', commit_message='Training complete.', commit_description='', oid='3bf10f7383ef8fb1f97f9c3e89a264295c156a30', pr_url=None, repo_url=RepoUrl('https://huggingface.co/f499d5/bert-finetuned-ner', endpoint='https://huggingface.co', repo_type='model', repo_id='f499d5/bert-finetuned-ner'), pr_revision=None, pr_num=None)