<a href="https://www.kaggle.com/code/aisuko/token-classification-nlp?scriptVersionId=164631763" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Token classification assigns a label to each token in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER), which attempts to label each entity in a sentence, such as a person, location, or organization. As mentioned in [Fine-tuning Fill Mask LLMs to Text Classification](https://www.kaggle.com/code/aisuko/fine-tuning-fill-mask-llm-to-text-classification), a fill-mask model can be fine-tuned into a text classification model. Here we apply the same idea to token classification, using DistilBERT as the base model. For the dataset, we use `wnut_17`, which is designed for token classification tasks.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install seqeval==1.2.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))


os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model distilbert base uncased"
os.environ["WANDB_NAME"] = "ft-distilbert-base-uncased-with-wnut17-v-0"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading WNUT 17 dataset

Start by loading the WNUT 17 dataset from the Datasets library.

In [3]:
from datasets import load_dataset

dataset=load_dataset("wnut_17")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

In [4]:
dataset["train"]

Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 3394
})

In [5]:
# Look up the label name for each integer tag id
label_list=dataset["train"].features["ner_tags"].feature.names
label_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

The prefix of each `ner_tag` indicates the token's position within an entity:
* `B-` indicates the beginning of an entity
* `I-` indicates a token inside the same entity
* `O` indicates the token doesn't correspond to any entity
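
To make the tagging scheme concrete, the following minimal sketch groups BIO-tagged words back into entity spans. The helper name `bio_to_spans` and the example tokens and tags are illustrative assumptions, not part of the dataset API.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative example (hypothetical tags, chosen to show a multi-word entity).
tokens = ["Empire", "State", "Building", "=", "ESB"]
tags = ["B-location", "I-location", "I-location", "O", "B-location"]
print(bio_to_spans(tokens, tags))
# [('location', 'Empire State Building'), ('location', 'ESB')]
```

The `B-` tag on `ESB` is what separates it from the preceding entity of the same type.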

# Pre-process Data

Load a DistilBERT tokenizer to preprocess the `tokens` field:

In [6]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


The `tokens` field may look as though the input has already been tokenized, but it has only been split into words. Set `is_split_into_words=True` so the tokenizer splits those words into subwords. For example:

In [7]:
example=dataset["train"][0]

tokenized_input=tokenizer(example["tokens"], is_split_into_words=True)
print(f'Input Ids: {tokenized_input["input_ids"]}')

print(f'==============================================================')

tokens=tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(f'Input Tokens: {tokens}')

Input Ids: [101, 1030, 2703, 17122, 2009, 1005, 1055, 1996, 3193, 2013, 2073, 1045, 1005, 1049, 2542, 2005, 2048, 3134, 1012, 3400, 2110, 2311, 1027, 9686, 2497, 1012, 3492, 2919, 4040, 2182, 2197, 3944, 1012, 102]
Input Tokens: ['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']


However, this adds the special tokens `[CLS]` and `[SEP]`, and subword tokenization creates a mismatch between the input and the labels: a single word corresponding to a single label may now be split into several subwords. We need to realign the tokens and labels by:

* Mapping all tokens to their corresponding word with the `word_ids` method
* Assigning the label -100 to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function
* Only labeling the first token of a given word, and assigning -100 to the other subtokens from the same word
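
The three steps above can be sketched on a toy input without loading a tokenizer. Here `word_ids` is written by hand to mimic what the tokenizer's `word_ids` method returns, and the function name `align_labels` and the word labels are assumptions made for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to token level, masking everything
    except the first subword of each word."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] map to no word.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are ignored by the loss.
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Words: ["esb", "storm"] -> tokens: [CLS] es ##b storm [SEP]
word_ids = [None, 0, 0, 1, None]   # hand-written for this sketch
word_labels = [7, 0]               # hypothetical: B-location, O
print(align_labels(word_ids, word_labels))
# [-100, 7, -100, 0, -100]
```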



In [8]:
def preprocess_func(examples):
    tokenized_inputs=tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
    labels=[]
    for i, label in enumerate(examples["ner_tags"]):
        word_ids=tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word
        previous_word_idx=None
        label_ids=[]
        for word_idx in word_ids:
            if word_idx is None: # Set the special tokens to -100.
                label_ids.append(-100)
            elif word_idx !=previous_word_idx: # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else: # Remaining subtokens of the same word are set to -100.
                label_ids.append(-100)
            previous_word_idx=word_idx
        labels.append(label_ids)
    
    tokenized_inputs["labels"]=labels
    return tokenized_inputs


tokenized_dataset=dataset.map(preprocess_func, batched=True)

# Padding Strategy

Here we dynamically pad the sentences to the longest length in each batch during collation, instead of padding the whole dataset to the maximum length.

In [9]:
from transformers import DataCollatorForTokenClassification

data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
print(data_collator)

DataCollatorForTokenClassification(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, label_pad_tok
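
As a rough illustration of what dynamic padding does, the sketch below pads a small batch by hand. It is a simplified stand-in, not the library's implementation: `pad_batch` is a hypothetical helper, the pad token id 0 comes from the tokenizer output above, and -100 is used as the label padding value so padded positions are ignored by the loss. The label values are made up for the example.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded label positions must not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2703, 102], "labels": [-100, 9, -100]},
    {"input_ids": [101, 3400, 2110, 2311, 102], "labels": [-100, 7, 8, 8, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2703, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 9, -100, -100, -100]
```

Because the target length is computed per batch, a batch of short sentences is never padded out to the length of the longest sentence in the dataset.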

# Define Evaluate Function

Let's create a metric for the training process. For this task, we load the seqeval framework, which produces several scores: precision, recall, F1, and accuracy. Let's get the NER labels first, and then create a function that passes the true predictions and true labels to `compute` to calculate the scores.

In [10]:
import evaluate
import numpy as np

# Load the seqeval metric before defining the function that uses it.
seqeval=evaluate.load("seqeval")


def compute_metrics(eval_pred):
    logits, labels=eval_pred
    predictions=np.argmax(logits, axis=2)
    
    # Keep only positions with a real label (-100 marks special tokens and non-first subtokens).
    true_predictions=[
        [label_list[p] for (p,l) in zip(prediction, label) if l !=-100]
        for prediction, label in zip(predictions, labels)
    ]
    
    true_labels=[
        [label_list[l] for (p,l) in zip(prediction, label) if l !=-100]
        for prediction, label in zip(predictions, labels)
    ]
    
    results=seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall":results["overall_recall"],
        "f1":results["overall_f1"],
        "accuracy":results["overall_accuracy"],
    }


# Label names of the first training example
labels=[label_list[i] for i in example["ner_tags"]]
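
The scores differ in what they count: precision, recall, and F1 are computed over whole entities, while accuracy is computed over individual tokens. The sketch below is a simplified, pure-Python approximation of that idea, not seqeval's actual implementation (for instance, it ignores stray `I-` tags). The helper names and the tag sequences are assumptions for illustration.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_scores(true_tags, pred_tags):
    """Entity-level precision/recall/F1 plus token-level accuracy."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)  # exact span and type match
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
    return precision, recall, f1, accuracy


true_tags = ["O", "O", "B-location", "I-location", "I-location", "O", "B-person", "O", "O", "O"]
pred_tags = ["O", "O", "B-location", "I-location", "O", "O", "O", "O", "O", "O"]
print(entity_scores(true_tags, pred_tags))
# (0.0, 0.0, 0.0, 0.8)
```

In this toy case 8 of 10 tokens are tagged correctly, yet no entity is recovered exactly, so accuracy is 0.8 while F1 is 0. Since most tokens are `O`, a gap of this kind between accuracy and F1 is to be expected for NER.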

# Training

Before training, create a map from the expected ids to their labels with `id2label` and `label2id`:

In [11]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer


id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}

label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}


model=AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
)

print(model.config)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "O",
    "1": "B-corporation",
    "2": "I-corporation",
    "3": "B-creative-work",
    "4": "I-creative-work",
    "5": "B-group",
    "6": "I-group",
    "7": "B-location",
    "8": "I-location",
    "9": "B-person",
    "10": "I-person",
    "11": "B-product",
    "12": "I-product"
  },
  "initializer_range": 0.02,
  "label2id": {
    "B-corporation": 1,
    "B-creative-work": 3,
    "B-group": 5,
    "B-location": 7,
    "B-person": 9,
    "B-product": 11,
    "I-corporation": 2,
    "I-creative-work": 4,
    "I-group": 6,
    "I-location": 8,
    "I-person": 10,
    "I-product": 12,
    "O": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1
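
As an alternative to typing the two mappings by hand, they could be derived from the label names, which keeps them in sync with the dataset. This is a small sketch; `label_list` is copied here from the earlier cell output so the snippet runs on its own.

```python
# Label names in id order, as printed by the dataset features earlier.
label_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# The position in the list is the integer id.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(len(id2label))          # 13
print(id2label[7])            # B-location
print(label2id["I-product"])  # 12
```

With this approach, `num_labels` can likewise be written as `len(label_list)` rather than a literal 13.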

In [12]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_checkpointing=True,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240228_041559-tou2q6au[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-distilbert-base-uncased-with-wnut17-v-0[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/tou2q6au[0m
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the tex

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.307912 | 0.388571 | 0.063021 | 0.108453 | 0.930144 |
| 2 | No log | 0.298827 | 0.497585 | 0.190918 | 0.275954 | 0.93583 |




TrainOutput(global_step=214, training_loss=0.2658348618266739, metrics={'train_runtime': 162.2325, 'train_samples_per_second': 41.841, 'train_steps_per_second': 1.319, 'total_flos': 98230856448120.0, 'train_loss': 0.2658348618266739, 'epoch': 2.0})
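
The reported `global_step=214` can be sanity-checked with a little arithmetic. The sketch below assumes no gradient accumulation and treats the number of devices as an unknown; the source does not state how many GPUs were used, so the value 2 is only an inference from the numbers. The helper `total_steps` is hypothetical.

```python
import math


def total_steps(num_examples, per_device_batch, epochs, n_devices):
    """Optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * n_devices))
    return steps_per_epoch * epochs


# 3394 training examples, batch size 16 per device, 2 epochs (from the cells above).
for n_devices in (1, 2):  # n_devices is an assumption, not stated in the notebook
    print(n_devices, total_steps(3394, 16, 2, n_devices))
# 1 426
# 2 214
```

Only the two-device case matches the reported 214 steps, which would be consistent with a dual-GPU session. The `No log` entries in the Training Loss column are also plausible here: the default `logging_steps` of `TrainingArguments` is 500, which is more than the total number of steps, so no training loss would be logged before the run ends.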

In [13]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-distilbert-base-uncased-with-wnut17-v-0/commit/563f7178531517e38e64db23235a43502536fea3', commit_message='ft-distilbert-base-uncased-with-wnut17-v-0', commit_description='', oid='563f7178531517e38e64db23235a43502536fea3', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [14]:
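from transformers import pipeline

text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

# MODEL_NAME is never set in this notebook, so os.getenv('MODEL_NAME') returned None and the
# pipeline fell back to its default NER checkpoint, which produced the output recorded below.
# Pass the repository pushed in the previous cell to use the fine-tuned model instead.
classifier=pipeline("ner", model="aisuko/ft-distilbert-base-uncased-with-wnut17-v-0")
classifier(text)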

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-ORG',
  'score': 0.9997073,
  'index': 2,
  'word': 'Golden',
  'start': 4,
  'end': 10},
 {'entity': 'I-ORG',
  'score': 0.9997104,
  'index': 3,
  'word': 'State',
  'start': 11,
  'end': 16},
 {'entity': 'I-ORG',
  'score': 0.9996499,
  'index': 4,
  'word': 'Warriors',
  'start': 17,
  'end': 25},
 {'entity': 'I-MISC',
  'score': 0.9981956,
  'index': 7,
  'word': 'American',
  'start': 33,
  'end': 41},
 {'entity': 'I-LOC',
  'score': 0.9990578,
  'index': 13,
  'word': 'San',
  'start': 80,
  'end': 83},
 {'entity': 'I-LOC',
  'score': 0.9966156,
  'index': 14,
  'word': 'Francisco',
  'start': 84,
  'end': 93}]
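
The pipeline returns one dictionary per token, so multi-word entities such as `Golden State Warriors` appear as separate entries. A minimal sketch of merging adjacent tokens of the same type is shown below; `group_entities` is a hypothetical helper and the input is a trimmed copy of the output above. It joins words with spaces and does not handle `##` subword pieces. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping for you.

```python
def group_entities(token_predictions):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    for pred in token_predictions:
        entity_type = pred["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if (last and last["entity_group"] == entity_type
                and pred["index"] == last["last_index"] + 1):
            # Adjacent token of the same type: extend the open span.
            last["word"] += " " + pred["word"]
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
        else:
            groups.append({
                "entity_group": entity_type,
                "word": pred["word"],
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
            })
    return [(g["entity_group"], g["word"], g["start"], g["end"]) for g in groups]


# Trimmed copy of the token-level output above (scores omitted).
predictions = [
    {"entity": "I-ORG", "index": 2, "word": "Golden", "start": 4, "end": 10},
    {"entity": "I-ORG", "index": 3, "word": "State", "start": 11, "end": 16},
    {"entity": "I-ORG", "index": 4, "word": "Warriors", "start": 17, "end": 25},
    {"entity": "I-MISC", "index": 7, "word": "American", "start": 33, "end": 41},
    {"entity": "I-LOC", "index": 13, "word": "San", "start": 80, "end": 83},
    {"entity": "I-LOC", "index": 14, "word": "Francisco", "start": 84, "end": 93},
]
for group in group_entities(predictions):
    print(group)
# ('ORG', 'Golden State Warriors', 4, 25)
# ('MISC', 'American', 33, 41)
# ('LOC', 'San Francisco', 80, 93)
```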