## Token classification (NER)

The first application we’ll explore is token classification. This generic task encompasses any problem that can be formulated as “attributing a label to each token in a sentence,” such as:

* Named entity recognition (NER): Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”
* Part-of-speech tagging (POS): Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).
* Chunking: Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.

The CoNLL-2003 dataset used below labels each word with one of the following NER tags:

* O means the word doesn’t correspond to any entity.
* B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
* B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
* B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
* B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
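
To make the tagging scheme concrete, here is a minimal sketch of how a sequence of B-/I-/O tags can be grouped back into entity spans. The helper name `bio_to_spans` and the example sentence are made up for illustration; they are not part of the dataset or of any library.

```python
def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # A stray "I-" with no open span is treated as the start of a span.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical sentence, tagged by hand
words = ["Angela", "Merkel", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
for entity_type, start, end in bio_to_spans(tags):
    print(entity_type, " ".join(words[start:end]))
```

The B- prefix is what lets two adjacent entities of the same type be told apart: a new B- tag always closes the previous span.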

In [None]:
!pip install transformers datasets tokenizers seqeval evaluate -q 

In [None]:
import datasets
import numpy as np 
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

In [None]:
conll2003 = datasets.load_dataset('conll2003')

In [None]:
conll2003

In [None]:
conll2003['train'][0]

In [None]:
conll2003['train'].features['ner_tags']

In [None]:
conll2003['train'].features['pos_tags']

In [None]:
conll2003['train'].dataset_size

In [None]:
conll2003['train'].description

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer

In [None]:
## Test tokenizer output
conll2003['train'][0]

In [None]:
example_text = conll2003['train'][0]
tokenized_input = tokenizer(example_text['tokens'], is_split_into_words=True)
tokenized_input

In [None]:
token = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
token

The sub-token problem: the list of input IDs returned by the tokenizer is longer than the list of labels in our dataset, because the tokenizer adds special tokens and may split a single word into several sub-words.

In [None]:
example_text['ner_tags'], tokenized_input["input_ids"]

In [None]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])
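
The length mismatch can be reproduced without loading a tokenizer. The sketch below uses a hand-written table of sub-word splits (`toy_splits` is hypothetical, and a real WordPiece tokenizer may split these words differently) and records, for every token, the index of the word it came from. This is the same kind of information that `word_ids()` provides.

```python
# Hypothetical sub-word splits, written by hand for illustration only.
toy_splits = {"eu": ["eu"], "boycott": ["boy", "##cott"], "lamb": ["lamb"]}

words = ["eu", "boycott", "lamb"]
word_labels = [3, 0, 0]  # one label per word (illustrative ids)

tokens, word_ids = ["[CLS]"], [None]
for word_index, word in enumerate(words):
    for piece in toy_splits[word]:
        tokens.append(piece)
        word_ids.append(word_index)  # remember which word each piece came from
tokens.append("[SEP]")
word_ids.append(None)  # special tokens belong to no word

print(len(word_labels), len(tokens))  # 3 labels versus 6 tokens
print(list(zip(tokens, word_ids)))
```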

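The function `tokenize_and_align_labels` below does two jobs:

* it sets -100 as the label for the special tokens, so that the loss function ignores them
* it handles the sub-words after the first one in each word: with `label_all_tokens=False` they are masked with -100 as well, while with the default `label_all_tokens=True` they receive the label of their word

The labels are thereby aligned with the token IDs using the strategy we picked: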

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True): 
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) 
    labels = [] 
    for i, label in enumerate(examples["ner_tags"]): 
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # word_ids() returns a list mapping each token to the index of the
        # word it came from in the original sentence.
        previous_word_idx = None
        label_ids = []
        # Special tokens such as [CLS] and [SEP] are mapped to None.
        # Their label is set to -100 so the loss function ignores them.
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word: use the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later sub-word of the same word: repeat the word's label,
                # or mask it with -100 when label_all_tokens is False.
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx 
        labels.append(label_ids) 
    tokenized_inputs["labels"] = labels 
    return tokenized_inputs 

In [None]:
q = tokenize_and_align_labels(conll2003['train'][4:5]) 
print(q) 

In [None]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]): 
    print(f"{token:_<40} {label}") 
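
The difference between the two labelling strategies is easiest to see on a tiny hand-written input. The sketch below repeats the alignment logic on a made-up `word_ids` list in which word 0 is split into two pieces; the helper name `align_labels` and the label ids are illustrative assumptions.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Map one label per word onto one label per token."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            # later piece of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned


word_ids = [None, 0, 0, 1, 2, None]  # hand-written: word 0 has two pieces
word_labels = [1, 0, 5]              # illustrative ids, e.g. 1 = B-PER

print(align_labels(word_ids, word_labels, label_all_tokens=True))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

With `label_all_tokens=True` the continuation piece of an entity's first word also carries the B- label. That is probably why, in the prediction output near the end of this notebook, both `hasn` and `##ain` are tagged `B-PER`.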

In [None]:
## Applying on entire data
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

In [None]:
tokenized_datasets['train'][0]

In [None]:
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', num_labels = 9)

In [None]:
#Define training args
from transformers import TrainingArguments, Trainer 

args = TrainingArguments( 
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy = "epoch",
    warmup_steps=500,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
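
As a rough illustration of what `warmup_steps=500` and `learning_rate=5e-5` imply, the sketch below computes the learning rate under a linear warmup followed by a linear decay, which is the shape of the `Trainer`'s default linear scheduler. The function name and the value of `total_steps` are assumptions chosen for the example; the real total depends on the dataset size, batch size, number of devices and epochs.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3512):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# total_steps=3512 is an illustrative value, not read from the Trainer
for step in (0, 250, 500, 2006, 3512):
    print(step, linear_schedule_lr(step))
```

The peak learning rate is reached exactly at the end of warmup and is only used for a single step; every later step trains with a smaller value.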

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)
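
The collator is needed because sentences in a batch have different lengths, and the labels have to be padded together with the inputs. The following is a simplified pure-Python sketch of that idea, not the library's implementation: the real collator returns tensors and handles additional fields. The helper name `pad_batch` and the token ids are illustrative.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad)
        # padded positions get -100 so they are ignored by the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch


# Illustrative token ids and labels, not taken from the dataset
features = [
    {"input_ids": [101, 7327, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2848, 13934, 2102, 102], "labels": [-100, 1, 2, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding labels with -100 rather than 0 matters here, because 0 is a valid class id (the `O` tag) and would otherwise be counted in the loss.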

In [None]:
import evaluate
metric = evaluate.load("seqeval") 

### Let's test the metric on an example

In [None]:
label_list = conll2003["train"].features["ner_tags"].feature.names 

label_list

In [None]:
conll2003['train'][0]["ner_tags"]

In [None]:
labels = [label_list[i] for i in conll2003['train'][0]["ner_tags"]] 
labels

In [None]:
metric.compute(predictions=[labels], references=[labels]) # checking on the training data for demo
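
Comparing the labels with themselves gives perfect scores, so it does not show how the metric behaves on errors. The sketch below is a simplified, hand-rolled version of entity-level scoring, in which a predicted entity counts as correct only if its type and both boundaries match. It is meant to convey the idea only; `seqeval` itself supports more tagging schemes and edge cases, and the helper names here are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return entities


def entity_prf(references, predictions):
    """Micro-averaged entity-level precision, recall and F1 over sentences."""
    n_correct = n_pred = n_true = 0
    for ref, pred in zip(references, predictions):
        true_set, pred_set = extract_entities(ref), extract_entities(pred)
        n_correct += len(true_set & pred_set)
        n_pred += len(pred_set)
        n_true += len(true_set)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = [["B-PER", "I-PER", "O", "B-LOC"]]
prediction = [["B-PER", "O", "O", "B-LOC"]]  # the person entity is cut short
print(entity_prf(reference, prediction))
```

In this example three of the four tokens are tagged correctly, yet only one of the two entities matches exactly, so the entity-level scores are lower than the token-level accuracy.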

## Compute Metrics

This compute_metrics() function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the metric.compute() method:

In [None]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # The logits and the probabilities are in the same order,
    # so argmax can be applied without a softmax.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100, then map ids to tag names.
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
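
The two claims behind this function (argmax needs no softmax, and positions labelled -100 are dropped) can be checked on a toy example using only the standard library. The three-label list and the logits below are made up for illustration and are much smaller than the nine-label setup used in this notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


label_names = ["O", "B-PER", "I-PER"]  # shortened, illustrative label list
logits = [[2.0, 0.1, -1.0], [0.3, 4.2, 0.5], [-0.2, 0.4, 3.1], [1.5, 0.2, 0.1]]
labels = [-100, 1, 2, -100]            # first and last positions are special tokens

# Softmax is monotonic, so it never changes which class is largest.
assert [argmax(r) for r in logits] == [argmax(softmax(r)) for r in logits]

pred_ids = [argmax(r) for r in logits]
kept = [(label_names[p], label_names[l]) for p, l in zip(pred_ids, labels) if l != -100]
print(kept)
```

The predictions at the two special-token positions are discarded even though the model produced logits for them, which is the intended effect of the -100 filter.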


## Training

In [None]:
trainer = Trainer( 
   model, 
   args, 
   train_dataset=tokenized_datasets["train"], 
   eval_dataset=tokenized_datasets["validation"], 
   data_collator=data_collator, 
   tokenizer=tokenizer, 
   compute_metrics=compute_metrics 
) 

In [None]:
trainer.train() 

## Save Artifacts

In [None]:
## Save model
model.save_pretrained("ner_model")

In [None]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

In [None]:
id2label = {
    str(i): label for i, label in enumerate(label_list)
}
# label2id maps each tag name to its integer class id
label2id = {
    label: i for i, label in enumerate(label_list)
}

In [None]:
label2id

In [None]:
import json

In [None]:
with open("ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)
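
One detail worth keeping in mind when writing label maps to `config.json` is that JSON object keys are always strings. The short sketch below, using a shortened illustrative label list, shows that integer keys come back as strings after a round trip and have to be converted if integer lookups are needed.

```python
import json

label_names = ["O", "B-PER", "I-PER"]  # shortened, illustrative label list
id2label = {i: name for i, name in enumerate(label_names)}

encoded = json.dumps({"id2label": id2label})
decoded = json.loads(encoded)["id2label"]
print(encoded)
print(list(decoded.keys()))  # keys are now strings

# Convert back to integer keys before indexing with a predicted class id.
restored = {int(k): v for k, v in decoded.items()}
assert restored == id2label
```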

## Loading model & prediction

In [1]:
from transformers import AutoModelForTokenClassification
from transformers import BertTokenizerFast
from transformers import pipeline

In [4]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

example = "My name is Hasnain and I live in Vietnam"
ner_results = nlp(example)
print(ner_results)
print(len(ner_results))

[{'entity': 'B-PER', 'score': 0.999496, 'index': 4, 'word': 'hasn', 'start': 11, 'end': 15}, {'entity': 'B-PER', 'score': 0.99936, 'index': 5, 'word': '##ain', 'start': 15, 'end': 18}, {'entity': 'B-LOC', 'score': 0.9995597, 'index': 10, 'word': 'vietnam', 'start': 33, 'end': 40}]
3
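
The pipeline returns one entry per token, so the name "Hasnain" shows up as two pieces. A minimal post-processing sketch is given below: it merges neighbouring entries of the same entity type whose character spans touch, and reads the text back from the original string. The helper name `merge_pieces` is hypothetical, and the input is copied from the output above with rounded scores. The pipeline also has an `aggregation_strategy` argument for grouping tokens, which this sketch does not use.

```python
# Copied from the pipeline output shown above (scores rounded).
ner_results = [
    {"entity": "B-PER", "score": 0.9995, "word": "hasn", "start": 11, "end": 15},
    {"entity": "B-PER", "score": 0.9994, "word": "##ain", "start": 15, "end": 18},
    {"entity": "B-LOC", "score": 0.9996, "word": "vietnam", "start": 33, "end": 40},
]
text = "My name is Hasnain and I live in Vietnam"


def merge_pieces(results, text):
    """Merge adjacent pieces of the same entity type whose character spans touch."""
    merged = []
    for item in results:
        entity_type = item["entity"].split("-", 1)[-1]  # drop the B-/I- prefix
        last = merged[-1] if merged else None
        if last and last["type"] == entity_type and last["end"] == item["start"]:
            last["end"] = item["end"]  # extend the previous span
        else:
            merged.append({"type": entity_type, "start": item["start"], "end": item["end"]})
    for span in merged:
        span["text"] = text[span["start"]:span["end"]]  # original casing is preserved
    return merged


print(merge_pieces(ner_results, text))
```

Because only touching spans are merged, this joins sub-word pieces of one word but not multi-word entities separated by a space; those would need a rule based on the B-/I- prefixes instead.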


In [None]:
ner_results

In [5]:
if isinstance(ner_results, list) and len(ner_results) == 3:
    print("Validated")

Validated


## Upload Artifacts to S3

In [1]:
from object_store import CloudSync

In [3]:
sync = CloudSync()
sync.upload_ner_config()
#sync.download_ner_pytorch_model()