### Bert用于Token Classifcation的任务
Token Classifcation是对文本中的每个token（通常指的是经过分词后的词或者子词）分配一个标签。这类问题的典型应用包括：
1. 命名实体识别（NER, Named Entity Regcognition）：识别文本中的特定实体（人名、地名、机构等）
2. 词性标注（POS Tagging, Part-of-Speech Tagging）：为每个词分配一个语法类别，如名词、动词等
3. 分块标注（Chunking）：将句子中的词分组为具有特定语法功能的短语（名词短语、动词短语）<br>
### Token Classfication和Text Classification的区别
1. Token Classification是为输入文本中的每个词分配一个标签，通常用于序列标注任务；
2. Text Classification是为整个句子或者文本片段分配一个标签，通常用于情感分析、主题分类、垃圾邮件检测等任务。

### 1. 导入库

In [75]:
import torch
from datasets import load_dataset
import evaluate
import pandas as pd
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForTokenClassification
import evaluate
from transformers import BertTokenizer, BertTokenizerFast

### 2. 加载数据集


WNUT 17数据集是"Workshop on Noisy User-generated Text"的一部嗯，特别关注于从用户生成的文本中识别命名实体。该数据集专注于网络文本中的命名实体识别任务，包括多种实体类型，如地点、组织和人名。

数据集的标签:<br>
1. O：其他
2. B-LOC：地点的开始（beginning of location）
3. I-LOC：地点的内部（inside of location）
4. B-ORG：组织的开始（beginning of organization）
5. I-ORG：组织的内部 (inside of organization)
6. B-PER：人名的开始 (beginning of person)
7. I-PER：人名的内部 (inside of percon)
8. B-MISC：杂项的开始 (begginning of miscellaneous)
9. I-MISC：杂项的内部 (inside of miscellaneous)

In [76]:
# 加载 WNUT 17 数据集
datasets = load_dataset("wnut_17")

# 标签列表：实体类别，如 'B-location', 'I-location' 等
label_list = datasets["train"].features["ner_tags"].feature.names

In [77]:
print(datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})


In [78]:
# 将训练集转换为 DataFrame
train_df = pd.DataFrame(datasets["train"])

In [79]:
train_df.head()

Unnamed: 0,id,tokens,ner_tags
0,0,"[@paulwalk, It, 's, the, view, from, where, I,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, ..."
1,1,"[From, Green, Newsfeed, :, AHFA, extends, dead...","[0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
2,2,"[Pxleyes, Top, 50, Photography, Contest, Pictu...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
3,3,"[today, is, my, last, day, at, the, office, .]","[0, 0, 0, 0, 0, 0, 0, 0, 0]"
4,4,"[4Dbling, 's, place, til, monday, ,, party, pa...","[9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


### 3. 数据预处理

In [80]:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
# 初始化 BERT 分词器
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

model = AutoModelForTokenClassification.from_pretrained(model_name)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### 5. 数据预处理

In [81]:
def tokenize_and_align_labels(examples):
    # 使用 padding 和 truncation
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, padding='max_length', max_length=128, is_split_into_words=True)
    labels = []

    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # 获取每个 token 对应的原始词语
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # 忽略填充的 token
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # 新词取第一个子词的标签
            else:
                label_ids.append(-100)  # 子词标记为 -100
            previous_word_idx = word_idx
        
        # 确保 labels 的长度与 input_ids 一致
        while len(label_ids) < len(tokenized_inputs["input_ids"]):
            label_ids.append(-100)  # 填充为 -100 直至与 input_ids 的长度一致

        # 截断 labels 以确保与 input_ids 长度一致
        label_ids = label_ids[:len(tokenized_inputs["input_ids"])]
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


In [82]:
# 对训练集、验证集和测试集进行处理
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

# 检查 tokenized_datasets 的示例
print(tokenized_datasets["train"][0])

Map: 100%|██████████| 1009/1009 [00:00<00:00, 3287.19 examples/s]

{'id': '0', 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'], 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 1030, 2703, 17122, 2009, 1005, 1055, 1996, 3193, 2013, 2073, 1045, 1005, 1049, 2542, 2005, 2048, 3134, 1012, 3400, 2110, 2311, 1027, 9686, 2497, 1012, 3492, 2919, 4040, 2182, 2197, 3944, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 




### 6. 设置训练参数

In [83]:
# 训练参数
training_args = TrainingArguments(
    output_dir="./results",  # 输出目录
    evaluation_strategy="epoch",  # 每个 epoch 结束后进行评估
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# 使用 evaluate 库加载评估指标
metric = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = torch.argmax(predictions, dim=2)

    # 移除填充的部分
    true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
    true_predictions = [[label_list[p] for p, l in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }





### 7. 开始训练

In [84]:
# 初始化 Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [85]:
# 开始训练
trainer.train()




ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).