## Token classification


The first application we’ll explore is token classification. This generic task encompasses any problem that can be formulated as “attributing a label to each token in a sentence,” such as:

**Named entity recognition (NER)**: Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

**Part-of-speech tagging (POS)**: Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).

**Chunking**: Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.



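For NER with the B-/I-/O scheme, the labels read as follows (shown here with the person, organization, location, and miscellaneous entity types; the dataset used below has its own tag set):

- O means the word doesn’t correspond to any entity.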
- B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
- B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
- B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
- B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
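
The B-/I-/O labels above can be decoded back into entity spans. The following is a minimal sketch, assuming well-formed tags; the helper name `bio_to_spans` and the example tag sequence are hypothetical and not part of the notebook.

```python
def bio_to_spans(tags):
    """Group a BIO tag sequence into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type continues the open span
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # A B- tag opens a new span; a stray I- tag is ignored in this sketch
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
spans = bio_to_spans(tags)
print(spans)
# [('PER', 0, 2), ('LOC', 3, 4), ('ORG', 5, 8)]
```

Two consecutive B- tags of the same type produce two separate spans, which is the reason the scheme distinguishes B- from I-.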

In [85]:
# Install
!pip install transformers datasets tokenizers seqeval -q

In [86]:
from google.colab import drive
drive.mount('/content/drive/')

import pandas as pd
import numpy as np
import datasets
from datasets import Dataset, Features, Value, Sequence, ClassLabel
train = pd.read_csv("/content/drive/MyDrive/NLP4/train.csv")
valid = pd.read_csv("/content/drive/MyDrive/NLP4/dev.csv")
test = pd.read_csv("/content/drive/MyDrive/NLP4/test.csv")

print(train.head())
print(valid.head())
print(test.head())
print(test["Column2"].unique())
tagdict = {o:i for i, o in enumerate(['O', 'B-loc', 'I-loc', 'B-pers', 'B-org', 'I-pers', 'B-event', 'I-event', 'B-fac',
 'I-fac', 'I-org', 'B-pro', 'I-pro'])}
print(tagdict)
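# Tokens treated as sentence boundaries (Persian and Latin punctuation,
# including the Persian comma)
SENTENCE_END = {".", "!", "،", "؟", "?", "؛", ";", ":"}

def make_sentences(words, nertags):
    """Split flat word/tag columns into sentences at boundary punctuation."""
    assert len(words) == len(nertags)
    ids, word2d, tag2d = [], [], []
    word1d, tag1d = [], []
    for word, tag in zip(words, nertags):
        word = word.strip()
        word1d.append(word)
        tag1d.append(tagdict[tag.strip()])
        if word in SENTENCE_END:
            # Close the current sentence, punctuation included
            ids.append(len(ids))
            word2d.append(word1d)
            tag2d.append(tag1d)
            word1d, tag1d = [], []
    if word1d:
        # Keep trailing words that are not followed by boundary punctuation
        ids.append(len(ids))
        word2d.append(word1d)
        tag2d.append(tag1d)
    return ids, word2d, tag2d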


data = {"train": train, "valid": valid, "test": test}
for k, v in data.items():
    data[k] = data[k].reset_index().to_dict(orient='list')
    ids, word2d, tag2d = make_sentences(data[k]["Column1"], data[k]["Column2"])
    data[k] = {"id": ids, "tokens": word2d, "ner_tags":tag2d}
    print(pd.DataFrame(data[k].items()).head(100))
    data[k] = Dataset.from_dict({"id": ids, "tokens": word2d, "ner_tags":tag2d}, features=Features({"id": Value(dtype='string', id=None),
"tokens": Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
"ner_tags": Sequence(feature=ClassLabel(names=['O', 'B-loc', 'I-loc', 'B-pers', 'B-org', 'I-pers', 'B-event', 'I-event', 'B-fac',
 'I-fac', 'I-org', 'B-pro', 'I-pro'], id=None), length=-1, id=None)}))

data = datasets.DatasetDict({"train":data["train"], "validation":data["valid"], "test":data["test"]})



Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
  Column1 Column2
0      به       O
1   عنوان       O
2    مثال       O
3    وقتی       O
4  نشریات       O
  Column1 Column2
0    افقی       O
1       :       O
2       0       O
3       ـ       O
4      از       O
  Column1 Column2
0    افقی       O
1       :       O
2       0       O
3       ـ       O
4      از       O
['O' 'B-loc' 'I-loc' 'B-pers' 'B-org' 'I-pers' 'B-event' 'I-event' 'B-fac'
 'I-fac' 'I-org' 'B-pro' 'I-pro']
{'O': 0, 'B-loc': 1, 'I-loc': 2, 'B-pers': 3, 'B-org': 4, 'I-pers': 5, 'B-event': 6, 'I-event': 7, 'B-fac': 8, 'I-fac': 9, 'I-org': 10, 'B-pro': 11, 'I-pro': 12}
          0                                                  1
0        id  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1    tokens  [[به, عنوان, مثال, وقتی, نشریات, مدافع, اصول, ...
2  ner_tags  [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
          0           

In [87]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 27230
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 13372
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 20301
    })
})

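Data format reference: https://huggingface.co/datasets/conll2003 (the Persian data above is loaded from CSV files; it uses the `id`, `tokens`, and `ner_tags` columns of that layout, which is why it is stored in a variable named `conll2003` below).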

In [88]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

conll2003 = data

In [89]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 27230
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 13372
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 20301
    })
})

In [90]:
conll2003.shape

{'train': (27230, 3), 'validation': (13372, 3), 'test': (20301, 3)}

In [91]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['به',
  'عنوان',
  'مثال',
  'وقتی',
  'نشریات',
  'مدافع',
  'اصول',
  'و',
  'ارزشها',
  'و',
  'منادی',
  'انقلاب',
  'و',
  'اسلام',
  'در',
  'بالاترین',
  'درجه',
  '،'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [92]:
print(conll2003["train"].features["id"])
print(conll2003["train"].features["tokens"])
print(conll2003["train"].features["ner_tags"])

Value(dtype='string', id=None)
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
Sequence(feature=ClassLabel(names=['O', 'B-loc', 'I-loc', 'B-pers', 'B-org', 'I-pers', 'B-event', 'I-event', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'I-pro'], id=None), length=-1, id=None)


In [93]:
conll2003['train'].description

''

In [94]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-ner-uncased")
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

#### Note:
Transformers are often pretrained with subword tokenizers, meaning that even if your inputs have already been split into words, each of those words could be split again by the tokenizer.
This means that we need to do some processing on our labels, as the input ids returned by the tokenizer can be longer than the lists of labels our dataset contains.
This happens first because some special tokens might be added (we can see a [CLS] and a [SEP] in the output below) and then because words may be split into multiple tokens.

#### Just checking the output of some variables before applying tokenize_and_align_labels()

In [95]:
conll2003['train'][0]

{'id': '0',
 'tokens': ['به',
  'عنوان',
  'مثال',
  'وقتی',
  'نشریات',
  'مدافع',
  'اصول',
  'و',
  'ارزشها',
  'و',
  'منادی',
  'انقلاب',
  'و',
  'اسلام',
  'در',
  'بالاترین',
  'درجه',
  '،'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [96]:
example_text = conll2003['train'][0]

tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

word_ids = tokenized_input.word_ids()

print(word_ids)

''' As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to None and all other tokens to their respective word. This way, we can align the labels with the processed input ids. '''

# tokenized_input

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, None]


' As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to None and all other tokens to their respective word. This way, we can align the labels with the processed input ids. '

In [97]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 'به',
 'عنوان',
 'مثال',
 'وقتی',
 'نشریات',
 'مدافع',
 'اصول',
 'و',
 'ارزشها',
 'و',
 'منادی',
 'انقلاب',
 'و',
 'اسلام',
 'در',
 'بالاترین',
 'درجه',
 '،',
 '[SEP]']

Sub-token problem: the input ids returned by the tokenizer are longer than the lists of labels our dataset contains.

In [98]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])


(18, 20)

The function tokenize_and_align_labels below does two jobs:
- it sets -100 as the label for special tokens, so that they are ignored by the loss function
- for the subwords after the first subword of a word, it either repeats the word's label (label_all_tokens=True, the default here) or masks them with -100 (label_all_tokens=False)

This aligns the labels with the token ids using the strategy we picked:

In [99]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # word_ids() returns a list with one entry per token: the index of the
        # word that token came from, or None for special tokens.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] map to None; label them
                # -100 so they are automatically ignored in the loss function.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use that word's label.
                label_ids.append(label[word_idx])
            else:
                # Later subword of the same word: repeat the word's label if
                # label_all_tokens is True, otherwise mask it with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
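
The alignment rule can be checked without a tokenizer by feeding it a hand-written `word_ids` list. This is only a sketch: the helper `align_labels` and the `word_ids` values are hypothetical and assume that the second word was split into two subword tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level using the same rule as above."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])  # first token of a word
        else:
            # later subword of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned


# Hypothetical word_ids: word 1 is split into two subword tokens
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 5]  # O, B-pers, I-pers in this notebook's label list

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 0, 3, 3, 5, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 0, 3, -100, 5, -100]
```

With `label_all_tokens=True`, every subword contributes to the loss; with `False`, only the first subword of each word does.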

In [100]:
conll2003['train'][4:5]

{'id': ['4'],
 'tokens': [['خبر',
   'را',
   'عینا',
   'به',
   'همین',
   'درشتی',
   'و',
   'با',
   'همین',
   'ترکیب',
   'عبارتی',
   'در',
   'صدر',
   'صفحه',
   'نخست',
   'به',
   'چاپ',
   'می\u200cرساند',
   'و',
   'در',
   'آن',
   'مورد',
   'هم',
   'به',
   'جای',
   'ذکر',
   'نام',
   'يا',
   'عضویت',
   'آن',
   'شخص',
   'در',
   'گروه',
   'و',
   'کمیته\u200cی',
   'خاص',
   'صرفا',
   'بر',
   'روی',
   'عنوان',
   'مشاور',
   'فلان',
   'مسئول',
   'بلندمرتبه',
   'تأکید',
   'می\u200cکنند',
   '؟']],
 'ner_tags': [[0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0]]}

In [101]:
q = tokenize_and_align_labels(conll2003['train'][4:5])
print(q)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[2, 2165, 2049, 9425, 2031, 2531, 33809, 331, 2037, 2531, 4950, 9410, 2028, 5354, 4203, 2894, 2031, 3552, 8144, 331, 2028, 2050, 2334, 2063, 2031, 2585, 4546, 2410, 333, 1157, 4793, 2050, 2983, 2028, 2690, 331, 40334, 3154, 5263, 2043, 2421, 2339, 3753, 9984, 2900, 22034, 2617, 2484, 303, 4]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]]}


So before applying tokenize_and_align_labels(), the tokenized input has 3 keys:
- input_ids
- token_type_ids
- attention_mask

But after applying tokenize_and_align_labels() we have an extra key: 'labels'

In [102]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
خبر_____________________________________ 0
را______________________________________ 0
عینا____________________________________ 0
به______________________________________ 0
همین____________________________________ 0
درشتی___________________________________ 0
و_______________________________________ 0
با______________________________________ 0
همین____________________________________ 0
ترکیب___________________________________ 0
عبارتی__________________________________ 0
در______________________________________ 0
صدر_____________________________________ 0
صفحه____________________________________ 0
نخست____________________________________ 0
به______________________________________ 0
چاپ_____________________________________ 0
میرساند_________________________________ 0
و_______________________________________ 0
در______________________________________ 0
ان______________________________________ 0
مورد____________________________________ 0
هم______

In [103]:
## Applying on entire data
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)


In [104]:
tokenized_datasets['train'][0]

{'id': '0',
 'tokens': ['به',
  'عنوان',
  'مثال',
  'وقتی',
  'نشریات',
  'مدافع',
  'اصول',
  'و',
  'ارزشها',
  'و',
  'منادی',
  'انقلاب',
  'و',
  'اسلام',
  'در',
  'بالاترین',
  'درجه',
  '،'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'input_ids': [2,
  2031,
  2339,
  4281,
  3043,
  11084,
  5966,
  3655,
  331,
  11751,
  331,
  30672,
  2858,
  331,
  2393,
  2028,
  6204,
  4817,
  300,
  4],
 'token_type_ids': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]}

In [106]:
# Defining model

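# model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
# Loaded without num_labels, the model keeps the classification head and label
# mapping stored in the checkpoint; check that model.config.num_labels matches
# the 13 tags of this dataset before training.
model = AutoModelForTokenClassification.from_pretrained("HooshvareLab/bert-base-parsbert-ner-uncased")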


Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-ner-uncased were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [107]:
!pip install accelerate -U



In [108]:
#Define training args
from transformers import TrainingArguments, Trainer


args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
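
As a rough check on what these arguments imply, the number of optimization steps can be estimated from the train split size shown earlier. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the variable names are illustrative only.

```python
import math

train_rows = 27230   # num_rows of the train split shown earlier
batch_size = 16      # per_device_train_batch_size
epochs = 3           # num_train_epochs

# Assumption: one device and gradient accumulation left at 1
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
# 1702 5106
```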

In [109]:
data_collator = DataCollatorForTokenClassification(tokenizer)
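
The collator pads every example in a batch to the same length and, by default, pads the labels with -100 so that padded positions are ignored by the loss. A minimal pure-Python sketch of that idea follows; the helper `pad_batch`, the pad token id of 0, and the toy features are assumptions for illustration, not the library's implementation.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Toy features with made-up ids and labels
features = [
    {"input_ids": [2, 2031, 2339, 4], "labels": [-100, 0, 0, -100]},
    {"input_ids": [2, 2393, 4], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch["labels"])
# [[-100, 0, 0, -100], [-100, 1, -100, -100]]
```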

In [110]:
metric = datasets.load_metric("seqeval")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


### Let's test the metric on an example

In [111]:
example = conll2003['train'][0]

In [112]:
label_list = conll2003["train"].features["ner_tags"].feature.names

label_list

['O',
 'B-loc',
 'I-loc',
 'B-pers',
 'B-org',
 'I-pers',
 'B-event',
 'I-event',
 'B-fac',
 'I-fac',
 'I-org',
 'B-pro',
 'I-pro']

In [113]:
for i in example["ner_tags"]:
  print(i)

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


In [114]:
labels = [label_list[i] for i in example["ner_tags"]]
labels

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [115]:
metric.compute(predictions=[labels], references=[labels])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  avg = a.mean(axis, **keepdims_kw)
  ret = ret.dtype.type(ret / rcount)


{'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 1.0}
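
The zero precision, recall, and F1 next to an accuracy of 1.0 come from the example sentence containing only O tags: accuracy is counted per tag, while the other three scores are counted per entity, and there are no entities here. The sketch below illustrates entity-level scoring in a simplified form; it is not seqeval's implementation, and the fallback to 0.0 on division by zero is an assumption of this sketch.

```python
def entity_prf(true_spans, pred_spans):
    """Entity-level precision, recall and F1 over (type, start, end) spans."""
    true_spans, pred_spans = set(true_spans), set(pred_spans)
    correct = len(true_spans & pred_spans)
    # Fall back to 0.0 when there is nothing to divide by
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# An all-"O" sentence has no entity spans at all
print(entity_prf([], []))
# (0.0, 0.0, 0.0)

# One of two entities matched exactly (type and boundaries)
true_spans = [("pers", 0, 2), ("loc", 5, 6)]
pred_spans = [("pers", 0, 2), ("org", 5, 6)]
print(entity_prf(true_spans, pred_spans))
# (0.5, 0.5, 0.5)
```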

### Compute Metrics
This compute_metrics() function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the metric.compute() method:

In [116]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # The logits and the probabilities are in the same order,
    # so we don't need to apply the softmax before the argmax
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100, then map ids to tag strings
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
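
The two steps of this function, taking the argmax without a softmax and dropping the -100 positions, can be checked on toy values with the standard library alone. In this sketch the logits, labels, and the shortened label list are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


label_list = ["O", "B-loc", "I-loc"]  # shortened, hypothetical label list
logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.2, 3.1], [1.0, 0.9, 0.8]]
labels = [-100, 1, 2, -100]  # first and last positions are special tokens

# Softmax preserves the ordering of the logits, so the argmax is unchanged
same_argmax = all(argmax(r) == argmax(softmax(r)) for r in logits)
print(same_argmax)
# True

# Keep only positions whose label is not -100, then map ids to tag strings
predictions = [label_list[argmax(r)] for r, l in zip(logits, labels) if l != -100]
references = [label_list[l] for l in labels if l != -100]
print(predictions, references)
# ['B-loc', 'I-loc'] ['B-loc', 'I-loc']
```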

## Training

In [117]:
trainer = Trainer(
   model,
   args,
   train_dataset=tokenized_datasets["train"],
   eval_dataset=tokenized_datasets["validation"],
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


## Save

In [None]:
## Save model
model.save_pretrained("ner_model")

In [None]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

In [None]:
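# JSON object keys are always strings, so id2label uses string ids;
# label2id maps each tag back to its integer id.
id2label = {
    str(i): label for i, label in enumerate(label_list)
}
label2id = {
    label: i for i, label in enumerate(label_list)
}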

In [None]:
id2label

In [None]:
label2id

## Loading model & prediction

In [None]:
import json

In [None]:
# Read the saved config, add the label mappings, and write it back
with open("ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)
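
The read, update, and write cycle on the config file can be tried in isolation. The sketch below uses a temporary file as a stand-in for ner_model/config.json, with a shortened, hypothetical label list and placeholder file contents; it also shows that JSON stores the keys of id2label as strings.

```python
import json
import os
import tempfile

label_list = ["O", "B-loc", "I-loc"]  # shortened, hypothetical label list
id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    # Placeholder for the config file written when the model is saved
    with open(path, "w") as f:
        json.dump({"model_type": "bert"}, f)

    # Read, update, and write back, closing the file each time
    with open(path) as f:
        config = json.load(f)
    config["id2label"] = id2label
    config["label2id"] = label2id
    with open(path, "w") as f:
        json.dump(config, f)

    with open(path) as f:
        reloaded = json.load(f)

print(reloaded["id2label"])
# {'0': 'O', '1': 'B-loc', '2': 'I-loc'}
print(reloaded["label2id"])
# {'O': 0, 'B-loc': 1, 'I-loc': 2}
```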

In [None]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")

In [None]:
from transformers import pipeline

In [None]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)


example = "این سریال به صورت رسمی در تاریخ دهم می ۲۰۱۱ توسط شبکه فاکس برای پخش رزرو شد."

ner_results = nlp(example)

print(ner_results)

Reference: https://huggingface.co/course/chapter7/2