# <center> Token Classification </center>

In this notebook, we will fine-tune an NLP model (in this case IndoBERT from IndoNLU) for a token classification task using the 🤗 transformers library. More specifically, we will train the model to perform named entity recognition (NER). We will use the NERGrit dataset from [IndoNLU](https://github.com/indobenchmark/indonlu). The token labels in NERGrit are PERSON (name of a person), PLACE (name of a location), and ORGANISATION (name of an organization), and they follow the IOB2 chunking format.
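To make the IOB2 format concrete, here is a minimal sketch (not part of the original notebook) that groups IOB2 tags into entity spans. The helper name `iob2_to_spans` is hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored, which is an assumption of this sketch rather than something the dataset guarantees.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    start, ent_type = None, None
    # A sentinel "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


tokens = ["di", "Bandara", "Bersujud", ","]
tags = ["O", "B-PLACE", "I-PLACE", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

`B-` marks the first token of an entity and `I-` marks its continuation, so "Bandara Bersujud" becomes a single PLACE span covering token positions 1 and 2.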

## Install the sentencepiece, transformers, and datasets libraries

These are the libraries we need to install, since they are not available by default in Google Colab.

In [1]:
!pip install sentencepiece==0.1.95
!pip install transformers==4.2.2
!pip install datasets==1.2.0
!pip install seqeval
!pip install tqdm==4.48



## Import dependencies

In [2]:
from pathlib import Path
import numpy as np
import re
from datasets import Dataset, load_metric
import pickle
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification, BertTokenizerFast, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import nltk
from tqdm.auto import tqdm
from seqeval.scheme import IOB2
from seqeval.metrics import classification_report as seqeval_classification_report
nltk.download('punkt')
# if error related to tqdm happens, just run the previous cell once again

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Data

### Download data

In [3]:
# download train dataset
!wget "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/nergrit_ner-grit/train_preprocess.txt"

# download test dataset, note that we actually use the validation dataset from the repository, since the real test dataset is masked
!wget "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/nergrit_ner-grit/valid_preprocess.txt"

# you can see the downloaded files by clicking the refresh icon on the left; the files are plain txt files.

--2021-09-05 08:35:18--  https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/nergrit_ner-grit/train_preprocess.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 522268 (510K) [text/plain]
Saving to: ‘train_preprocess.txt’


2021-09-05 08:35:18 (12.7 MB/s) - ‘train_preprocess.txt’ saved [522268/522268]

--2021-09-05 08:35:18--  https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/nergrit_ner-grit/valid_preprocess.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64884 (63K) [text/plain]
Savi

### Read data

In [4]:
# function to read data
def read_wnut(file_path):
    file_path = Path(file_path)

    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)

    return token_docs, tag_docs

In [5]:
# read data
train_texts, train_tags = read_wnut('train_preprocess.txt')
test_texts, test_tags = read_wnut('valid_preprocess.txt')

Here are samples of the loaded data:



In [7]:
train_index = 12
print(train_texts[train_index], train_tags[train_index], sep='\n')

['Telah', 'menjadi', 'universitas', 'sejak', '1966', ',', 'tetapi', 'merupakan', 'institusi', 'sejak', '1909', ',', 'ketika', 'Loughborough', 'Technical', 'Institute', 'didirikan', 'dengan', 'fokus', 'terhadap', 'kemampuan', 'dan', 'pengetahuan', 'yang', 'dapat', 'diaplikasikan', 'di', 'seluruh', 'dunia', ',', 'sebuah', 'tradisi', 'yang', 'masih', 'berlanjut', 'hingga', 'sekarang', ',', 'dengan', 'UNIEI', 'mendanai', '"', 'Survei', 'Tahunan', 'terhadap', 'Aktivitas', 'Perpindahan', 'Teknologi', 'Universitas', '"', 'yang', 'menjadikan', 'Loughborough', 'sebagai', 'operasi', 'perpindahan', 'teknologi', 'paling', 'efisien', 'di', 'Britania', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORGANISATION', 'I-ORGANISATION', 'I-ORGANISATION', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORGANISATION', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PLACE', 'O', 'O', 'O', 'O', 'O',

In [8]:
test_index = 12
print(test_texts[test_index], test_tags[test_index], sep='\n')

['Selain', 'itu', 'juga', 'ada', 'penerbangan', 'reguler', 'di', 'Bandara', 'Bersujud', ',', 'pada', 'tahun', '2007', 'dilayani', 'oleh', 'PT', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PLACE', 'I-PLACE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


As you can see, each sentence is already split into words, and the labels are aligned with the word tokens. If you use custom data in a different format, you need to convert it into the form shown above.
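As a rough illustration of that conversion, the sketch below writes word and tag lists in the tab-separated layout that `read_wnut` expects (one `token<TAB>tag` pair per line, a blank line between sentences) and parses it back. The helper names `to_tsv` and `parse_tsv` and the example sentences are made up for illustration; only the parsing logic mirrors `read_wnut`.

```python
import re


def to_tsv(token_docs, tag_docs):
    """Serialize sentences as 'token<TAB>tag' lines, blank line between sentences."""
    blocks = []
    for tokens, tags in zip(token_docs, tag_docs):
        if len(tokens) != len(tags):
            raise ValueError("each token needs exactly one tag")
        blocks.append("\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags)))
    return "\n\n".join(blocks) + "\n"


def parse_tsv(raw_text):
    """Same parsing logic as read_wnut, applied to a string instead of a file."""
    token_docs, tag_docs = [], []
    for doc in re.split(r"\n\t?\n", raw_text.strip()):
        pairs = [line.split("\t") for line in doc.split("\n")]
        token_docs.append([tok for tok, _ in pairs])
        tag_docs.append([tag for _, tag in pairs])
    return token_docs, tag_docs


# Invented example sentences, already split into words with aligned tags.
texts = [["Firli", "Bahuri", "memimpin", "KPK"], ["Bandara", "Bersujud"]]
tags = [["B-PERSON", "I-PERSON", "O", "B-ORGANISATION"], ["B-PLACE", "I-PLACE"]]

serialized = to_tsv(texts, tags)
round_trip = parse_tsv(serialized)
print(round_trip == (texts, tags))
```

Writing `serialized` to a `.txt` file would give a file that `read_wnut` can load directly.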

### Train val split

Next, we will split the loaded training data into training and validation sets.

In [9]:
train_texts, val_texts, train_tags, val_tags = train_test_split(train_texts, train_tags, test_size=0.2, random_state=42)

# the code below is for testing purposes, in case we want to try the pipeline on a small amount of data

# train_texts = train_texts[:100]
# val_texts = val_texts[:100]
# train_tags = train_tags[:100]
# val_tags = val_tags[:100]

In [10]:
print(f"Total train data: {len(train_texts)}, {len(train_tags)}")
print(f"Total validation data: {len(val_texts)}, {len(val_tags)}")
print(f"Total test data: {len(test_texts)}, {len(test_tags)}")

Total train data: 1337, 1337
Total validation data: 335, 335
Total test data: 209, 209


Create the label mappers

In [11]:
unique_tags = set(tag for doc in train_tags + val_tags for tag in doc)
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}

In [12]:
tag2id

{'B-ORGANISATION': 4,
 'B-PERSON': 5,
 'B-PLACE': 6,
 'I-ORGANISATION': 0,
 'I-PERSON': 2,
 'I-PLACE': 3,
 'O': 1}

In [13]:
id2tag 

{0: 'I-ORGANISATION',
 1: 'O',
 2: 'I-PERSON',
 3: 'I-PLACE',
 4: 'B-ORGANISATION',
 5: 'B-PERSON',
 6: 'B-PLACE'}
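The ids above come from iterating over a Python set. For strings, that order can differ between interpreter sessions, so rerunning the notebook may produce a different mapping. The sketch below is an optional alternative, not what this notebook does: sorting the tags first gives the same ids on every run. The toy tag lists are placeholders, and using this variant would change the ids shown above.

```python
# Toy tag lists standing in for train_tags + val_tags (placeholder data).
example_tags = [["O", "B-PERSON", "I-PERSON"], ["B-PLACE", "O"]]

# Sorting the set makes the tag -> id assignment reproducible across runs.
sorted_tags = sorted({tag for doc in example_tags for tag in doc})
stable_tag2id = {tag: i for i, tag in enumerate(sorted_tags)}
stable_id2tag = {i: tag for tag, i in stable_tag2id.items()}

print(stable_tag2id)
```

Whichever variant is used, the mapping a model was trained with has to be reused at inference time, because the classifier outputs are only meaningful under that mapping.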

In [14]:
# create the folder to put idtag files
!mkdir idtag

Save the mappers into the idtag folder

In [15]:
idtag_dir = "idtag"

# !!! don't forget to save the label maps so that we can use them to translate the model output; we put them into the "idtag" folder
with open(f"{idtag_dir}/tag2id_pkl", 'wb') as f1:
    pickle.dump(tag2id, f1)
    
with open(f'{idtag_dir}/id2tag_pkl', 'wb') as f2:
    pickle.dump(id2tag, f2)  
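For completeness, here is a small sketch of how the saved mappers could be read back in a later session. It uses a temporary directory and a toy mapping so that it runs on its own; in the notebook, the directory would be `idtag` and the dictionaries would be the real `tag2id` and `id2tag`.

```python
import pickle
import tempfile
from pathlib import Path

toy_tag2id = {"O": 0, "B-PERSON": 1, "I-PERSON": 2}  # placeholder mapping
toy_id2tag = {i: tag for tag, i in toy_tag2id.items()}

with tempfile.TemporaryDirectory() as tmp:
    idtag_path = Path(tmp)  # stands in for the "idtag" folder

    # Save both mappers, using the same file names as the notebook.
    with open(idtag_path / "tag2id_pkl", "wb") as f:
        pickle.dump(toy_tag2id, f)
    with open(idtag_path / "id2tag_pkl", "wb") as f:
        pickle.dump(toy_id2tag, f)

    # Load them back, as an inference script would do.
    with open(idtag_path / "tag2id_pkl", "rb") as f:
        loaded_tag2id = pickle.load(f)
    with open(idtag_path / "id2tag_pkl", "rb") as f:
        loaded_id2tag = pickle.load(f)

print(loaded_tag2id == toy_tag2id, loaded_id2tag == toy_id2tag)
```

Note that the integer keys of `id2tag` survive the pickle round trip unchanged, which would not be the case with JSON (where keys become strings).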

### Preprocess data

#### Encode input text data

Load the tokenizer to preprocess the data

In [16]:
tokenizer_checkpoint = "indobenchmark/indobert-lite-base-p1" 
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_checkpoint, do_lower_case=True, max_length = 512)

Encode text data

In [17]:
encoder_max_len = 300 # 512 is the maximum number of input tokens; you can modify this value, but we will use 300 for now
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, max_length=encoder_max_len, padding='max_length', truncation=True)
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, max_length=encoder_max_len, padding='max_length', truncation=True)

#### Encode label

In [18]:
# function to encode label data
def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels
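The offset-based mask in `encode_tags` can be hard to picture, so here is a plain-Python sketch of the same idea on hand-written offsets. With `is_split_into_words=True`, offsets are relative to each word, so the first sub-token of a word starts at 0, while special and padding tokens get `(0, 0)`. The sub-token split shown in the comments and the helper name `align_labels` are assumptions for illustration; the real tokenizer may split the words differently.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

# Hand-written offsets for the words ["Loughborough", "didirikan"].
# Assumed split:   [CLS]   lough   ##borough  didirikan [SEP]   [PAD]
example_offsets = [(0, 0), (0, 5), (5, 12), (0, 9), (0, 0), (0, 0)]
example_word_labels = [4, 1]  # one label id per word


def align_labels(word_labels, offsets):
    """Put each word label on the word's first sub-token, IGNORE_INDEX elsewhere."""
    labels_iter = iter(word_labels)
    aligned = []
    for start, end in offsets:
        is_first_subtoken = start == 0 and end != 0
        aligned.append(next(labels_iter) if is_first_subtoken else IGNORE_INDEX)
    return aligned


aligned_labels = align_labels(example_word_labels, example_offsets)
print(aligned_labels)
```

One caveat worth keeping in mind: if a sentence is longer than `encoder_max_len` and gets truncated, the number of first-sub-token positions no longer matches the number of word labels, and the NumPy assignment in `encode_tags` would likely fail with a shape mismatch.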

In [19]:
# encode labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)

#### Wrap the train and validation data in a custom dataset class

In [20]:
class WNUTDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [21]:
train_encodings.pop("offset_mapping") # we don't want to pass this to the model
val_encodings.pop("offset_mapping")
train_dataset = WNUTDataset(train_encodings, train_labels)
val_dataset = WNUTDataset(val_encodings, val_labels)

## Training

We will use the Trainer class from the transformers library for training.

In [22]:
train_batch = 4 # batch size per device during training
eval_batch = 8 # batch size for evaluation
logging_steps = int(np.floor(len(train_texts)/train_batch)) # train loss is logged every epoch

training_args = TrainingArguments(
    output_dir='./results',             # output directory
    evaluation_strategy = "epoch",
    num_train_epochs = 3,               # total number of training epochs
    learning_rate = 2e-5,
    per_device_train_batch_size = train_batch,  
    per_device_eval_batch_size = eval_batch,   
    weight_decay = 0.01,                # strength of weight decay
    logging_dir = './logs',             # directory for storing logs
    logging_steps= logging_steps,
    load_best_model_at_end = True,      # load the best model after the end of training
    metric_for_best_model = 'eval_f1',
    greater_is_better = True,
    save_total_limit = 1,               # save only one model
    dataloader_drop_last = True
)

# check the docs for an explanation of the other parameters, be diligent, lads! Remember that this notebook uses transformers version 4.2.2
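The step counts implied by these settings can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (the default), which is an assumption about the runtime rather than something stated in the notebook. Under that assumption the result agrees with the `global_step=1002` reported by `trainer.train()` and with the `checkpoint-1002` folder used later.

```python
num_train_examples = 1337  # size of the training split printed earlier
train_batch = 4
num_train_epochs = 3

# dataloader_drop_last=True discards the final incomplete batch of each epoch,
# so the number of optimizer steps per epoch is a floor division.
steps_per_epoch = num_train_examples // train_batch
total_steps = steps_per_epoch * num_train_epochs

# logging_steps uses the same floor division, so the loss is logged once per epoch.
logging_steps = steps_per_epoch

print(steps_per_epoch, total_steps)
```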

In [23]:
model_checkpoint = tokenizer_checkpoint
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(unique_tags))

Some weights of AlbertForTokenClassification were not initialized from the model checkpoint at indobenchmark/indobert-lite-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
# the token classification task needs custom metrics; here we use seqeval

# using seqeval metric from datasets library
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens) where the code is -100
    true_predictions = [
        [id2tag[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2tag[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
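The filtering step inside `compute_metrics` is the part that is easiest to get wrong, so here is a stand-alone sketch of it on toy data. Positions whose gold label is -100 (special tokens, padding, and non-first sub-tokens) are dropped from both the predictions and the references before scoring. The mapping and id lists are placeholders, and `strip_ignored` is a hypothetical helper name.

```python
IGNORE_INDEX = -100
toy_id2tag = {0: "O", 1: "B-PERSON", 2: "I-PERSON"}  # placeholder mapping


def strip_ignored(pred_ids, label_ids, id2tag):
    """Drop positions labeled IGNORE_INDEX and map the remaining ids to tag strings."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        true_predictions.append([id2tag[p] for p, _ in kept])
        true_labels.append([id2tag[l] for _, l in kept])
    return true_predictions, true_labels


# One sentence with five token positions; only positions 1 and 3 carry labels.
toy_pred_ids = [[0, 1, 2, 0, 0]]
toy_label_ids = [[-100, 1, -100, 0, -100]]

kept_preds, kept_labels = strip_ignored(toy_pred_ids, toy_label_ids, toy_id2tag)
print(kept_preds, kept_labels)
```

Note that the model's prediction at an ignored position (here `I-PERSON` at position 2) has no effect on the metrics.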


In [25]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics      # custom metrics
)

In [26]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.3039 | -1.896689 | 0.660317 | 0.789374 | 0.719101 | 0.934348 | 13.7975 | 24.28 |
| 2 | 0.1377 | -1.928719 | 0.750226 | 0.786528 | 0.767948 | 0.949094 | 13.7182 | 24.42 |
| 3 | 0.0943 | -1.928005 | 0.757307 | 0.811195 | 0.783326 | 0.950502 | 13.7144 | 24.427 |


Not all data has been set. Are you sure you passed all values?


TrainOutput(global_step=1002, training_loss=0.17866465288721872, metrics={'train_runtime': 486.2653, 'train_samples_per_second': 2.061, 'total_flos': 80068116600000, 'epoch': 3.0})

In [27]:
# the best model is loaded into the trainer object at the end of training, so this evaluates the best model
trainer.evaluate()

Not all data has been set. Are you sure you passed all values?


{'epoch': 3.0,
 'eval_accuracy': 0.9505024889640274,
 'eval_f1': 0.7833256985799358,
 'eval_loss': -1.9280047416687012,
 'eval_precision': 0.7573073516386183,
 'eval_recall': 0.8111954459203036,
 'eval_runtime': 13.7177,
 'eval_samples_per_second': 24.421}

## Using the model

In [28]:
# load the best model; make sure the checkpoint path is correct by checking the results folder
best_model_checkpoint = "results/checkpoint-1002"
best_model = AutoModelForTokenClassification.from_pretrained(best_model_checkpoint)

In [29]:
# define a function to get the device
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

In [30]:
# get device
device = get_default_device()

In [31]:
# put the model into device (GPU in case we use GPU)
best_model = best_model.to(device)

In [32]:
# define the predict function: it takes a sentence and returns its word tokens together with their predicted labels
def best_model_predict(text):

  # split the sentence into word tokens, then into sub-tokens
  word_token = nltk.word_tokenize(text)
  tokenized_input_text = tokenizer(word_token, is_split_into_words=True, return_offsets_mapping=True, max_length=encoder_max_len, padding='max_length', truncation=True, 
                                   return_tensors="pt")
  input_ids, attention_mask, offset_mapping = tokenized_input_text["input_ids"], tokenized_input_text["attention_mask"], tokenized_input_text["offset_mapping"]
  input_ids = input_ids.to(device)
  attention_mask = attention_mask.to(device)
  # inference only, so gradients are not needed
  with torch.no_grad():
    output = best_model(input_ids = input_ids, attention_mask = attention_mask)
  logits = output.logits
  logits_labels = torch.argmax(logits, dim=-1).tolist()
  tag = np.array([id2tag[i] for i in logits_labels[0]])
  # the tokenizer splits some words into several sub-tokens, so we use the offset_mapping to keep only the label of the first sub-token of each word
  arr_offset = offset_mapping.numpy()[0]
  mask = (arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)

  return word_token, tag[mask].tolist()

In [33]:
# Test the function
text = "Ketua KPK, Firli Bahuri, meminta dugaan pemberian uang Rp 650 juta ke orang mengaku sebagai pegawai KPK dari mantan Bupati Kuantan Singingi (Kuansing) Mursini dibuktikan. "

tokenized_text, predicted_label = best_model_predict(text)

In [34]:
# print the result
for word, label in zip(tokenized_text, predicted_label):
  print(word, " --> ", label)

Ketua  -->  O
KPK  -->  B-ORGANISATION
,  -->  O
Firli  -->  B-PERSON
Bahuri  -->  I-PERSON
,  -->  O
meminta  -->  O
dugaan  -->  O
pemberian  -->  O
uang  -->  O
Rp  -->  O
650  -->  O
juta  -->  O
ke  -->  O
orang  -->  O
mengaku  -->  O
sebagai  -->  O
pegawai  -->  O
KPK  -->  B-ORGANISATION
dari  -->  O
mantan  -->  O
Bupati  -->  O
Kuantan  -->  B-PLACE
Singingi  -->  I-PLACE
(  -->  O
Kuansing  -->  B-PERSON
)  -->  O
Mursini  -->  O
dibuktikan  -->  O
.  -->  O


The model labels most of the entities correctly, although it tags "Kuansing" as a person and misses "Mursini".

## Prediction and evaluation on test data!

Now, let's make predictions on the test data and evaluate the results, this time without the Trainer class. We will use the Dataset class from the datasets library and DataLoader from PyTorch, so that inference runs in batches and makes good use of the GPU.

In [35]:
batch_size = 4
test_encoding = tokenizer(test_texts, is_split_into_words=True, return_offsets_mapping=True, max_length=encoder_max_len, padding='max_length', truncation=True)
offset_mapping = test_encoding.pop('offset_mapping')
test_dataset = Dataset.from_dict({'input_ids': test_encoding['input_ids'], 'attention_mask': test_encoding['attention_mask']})
# we need to format the data into pytorch Tensor, since we use pytorch model for prediction
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], output_all_columns=True)
# put the dataset into data loader
test_dl = DataLoader(test_dataset, batch_size, shuffle=False)

In [36]:
# define the batch predict function: it takes a data loader and the offset mappings of the test data, and returns the predicted word-level labels of every sentence
def best_model_predict_test_data(data_loader, offset_mapping):

  pbar = tqdm(data_loader, leave=True, total=len(data_loader))

  # predict the label of every sub-token, batch by batch
  list_pred_label = []
  with torch.no_grad():
    for data in pbar:
      input_ids, attention_mask = data["input_ids"], data["attention_mask"]
      input_ids = input_ids.to(device)
      attention_mask = attention_mask.to(device)
      output = best_model(input_ids = input_ids, attention_mask = attention_mask)
      logits = output.logits
      logits_labels = torch.argmax(logits, dim=-1).tolist()

      for label in logits_labels:
        tag = np.array([id2tag[i] for i in label])
        list_pred_label.append(tag)

  # keep only the label of the first sub-token of each word
  final_list_pred_label = []
  for pred_label, offsets in zip(list_pred_label, offset_mapping):
      arr_map = np.array(offsets)
      mask = (arr_map[:,0] == 0) & (arr_map[:,1] != 0)
      final_list_pred_label.append(pred_label[mask].tolist())

  return final_list_pred_label

In [37]:
pred_test_labels = best_model_predict_test_data(test_dl, offset_mapping)


Let's evaluate the predictions.

In [38]:
print(seqeval_classification_report(test_tags, pred_test_labels, mode='strict', scheme=IOB2))

              precision    recall  f1-score   support

ORGANISATION       0.70      0.74      0.71       121
      PERSON       0.81      0.85      0.83       213
       PLACE       0.83      0.85      0.84       328

   micro avg       0.80      0.83      0.81       662
   macro avg       0.78      0.81      0.79       662
weighted avg       0.80      0.83      0.81       662



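The evaluation result is good: the micro-averaged F1 score is 0.81 over 662 entities, with ORGANISATION being the weakest class (F1 0.71).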
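To see how the summary rows of the report relate to the per-class rows, the sketch below recomputes the macro and weighted averages from the rounded values printed above. Because it starts from rounded numbers, it only approximates what seqeval computes internally. The micro average is left out, since it needs the raw entity counts, which the report does not show.

```python
# Per-class values copied from the classification report above.
report = {
    "ORGANISATION": {"precision": 0.70, "recall": 0.74, "f1": 0.71, "support": 121},
    "PERSON": {"precision": 0.81, "recall": 0.85, "f1": 0.83, "support": 213},
    "PLACE": {"precision": 0.83, "recall": 0.85, "f1": 0.84, "support": 328},
}


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by support: frequent classes count more."""
    total_support = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(metric), 2), round(weighted_avg(metric), 2))
```

The macro average is lower than the weighted average here because ORGANISATION, the weakest class, is also the least frequent one.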

## Download model

If you want to download the model, run the cells below.

In [None]:
!zip -r ./results.zip ./results

In [None]:
!zip -r ./idtag.zip ./idtag

In [None]:
from google.colab import files

In [None]:
files.download("./results.zip")
files.download("./idtag.zip")

***Author: Hadi Muhshi***