# Workshop: Supervised Fine-tuning Encoder-based Model
This notebook is a tutorial on supervised fine-tuning of an encoder-based model for two tasks: text classification and named entity recognition (NER).

In [1]:
# check for GPU
!nvidia-smi

Sat Nov 11 06:13:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install Dependencies
The default Google Colab environment doesn't provide the following packages:

In [2]:
# install dependencies
!pip install -q datasets evaluate seqeval
!pip install -q accelerate -U # for multigpu
!pip install -q bitsandbytes # quantization
!pip install -q transformers[sentencepiece] # transformers+sp
!pip install -q pythainlp emoji


# Text Classification: Supervised Finetuning

In this part, we fine-tune the WangchanBERTa model for text classification.

## Import libraries

In [3]:
# import libs
import evaluate
import pandas as pd
from datasets import load_dataset, load_metric, Features, Value
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    AutoModel,
)

## Define preprocessing function
For this workshop, we'll use the `airesearch/wangchanberta-base-att-spm-uncased` model. The model **requires an additional text preprocessing function** because we want the fine-tuning data distribution to match the pretraining text as closely as possible.

Let's download `preprocess.py` from the [thai2transformers GitHub repository](https://github.com/vistec-AI/thai2transformers/blob/master/thai2transformers/preprocess.py) and import the preprocessing function.

In [4]:
!wget https://raw.githubusercontent.com/vistec-AI/thai2transformers/master/thai2transformers/preprocess.py

--2023-11-11 06:15:39--  https://raw.githubusercontent.com/vistec-AI/thai2transformers/master/thai2transformers/preprocess.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7357 (7.2K) [text/plain]
Saving to: ‘preprocess.py’


2023-11-11 06:15:39 (70.1 MB/s) - ‘preprocess.py’ saved [7357/7357]



In [5]:
from preprocess import process_transformers

## Define evaluation function
Hugging Face's `Trainer` class also lets you pass an evaluation function that is run on the validation set during training. Let's define our evaluation function here.

In [6]:
def classification_metrics(pred, pred_labs=False):
    labels = pred.label_ids
    preds = pred.predictions if pred_labs else pred.predictions.argmax(-1)

    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds, average="macro")
    precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds, average="micro")
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1_micro': f1_micro,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        'f1_macro': f1_macro,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'nb_samples': len(labels)
    }
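To make the difference between the micro and macro averages concrete, here is a minimal pure-Python sketch on toy data. The helper `f1_scores` and the toy labels are illustrative assumptions, not part of the workshop code. It shows that for single-label classification the micro-averaged F1 equals accuracy, while the macro average penalizes classes that are never predicted correctly.

```python
from collections import Counter

def f1_scores(labels, preds):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    classes = sorted(set(labels) | set(preds))
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, preds):
        if y == p:
            tp[y] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[y] += 1  # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# toy data: a model that always predicts the majority class 0
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

micro, macro = f1_scores(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(micro, accuracy, macro)  # 0.6 0.6 0.25
```

A large gap between micro and macro F1 is therefore a hint that the model does well on frequent classes only.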

# Text Classification: Wisesight Sentiment
Now that we've prepared all the helper functions, let's fine-tune WangchanBERTa on the Wisesight Sentiment dataset!

In [7]:
#parameters
class Args:
    model_name = 'airesearch/wangchanberta-base-att-spm-uncased' # WangchanBERTa
    # model_name = 'xlm-roberta-base'
    dataset_name_or_path = 'wisesight_sentiment'
    feature_col = 'texts'
    label_col = 'category'
    output_dir = 'models/wisesight_sentiment'
    batch_size = 16
    warmup_percent = 0.1
    learning_rate = 3e-05
    num_train_epochs = 5
    weight_decay = 0.01
    metric_for_best_model = 'f1_micro'
    seed = 1412
    max_length=510 # model max length is 510

args = Args()

## Download Wisesight dataset
First, download the `wisesight_sentiment` dataset using Hugging Face's `load_dataset`.

In [8]:
dataset = load_dataset(args.dataset_name_or_path)
dataset = dataset.map(lambda examples: {'labels': examples[args.label_col]}, batched=True).remove_columns("category")
num_labels = len(set(dataset['train']['labels']))
dataset

DatasetDict({
    train: Dataset({
        features: ['texts', 'labels'],
        num_rows: 21628
    })
    validation: Dataset({
        features: ['texts', 'labels'],
        num_rows: 2404
    })
    test: Dataset({
        features: ['texts', 'labels'],
        num_rows: 2671
    })
})

## Preprocessing Text
Preprocess the text data using the imported `process_transformers` function.

In [9]:
def preprocess_example(examples):
    examples[args.feature_col] = process_transformers(examples[args.feature_col])
    return examples

processed_dataset = dataset.map(preprocess_example)

In [10]:
processed_dataset["train"][0]

{'texts': 'ไปจองมาแล้วนาจา<_>mitsubishi<_>attrage<_>ได้หลังสงกรานต์เลย<_>รอขับอยู่นาจา<_>กระทัดรัด<_>เหมาะกับสาวๆขับรถคนเดียวแบบเรา<_>ราคาสบายกระเป๋า<_>ประหยัดน้ำมัน<_>วิ่งไกลแค่ไหนหายห่วงค่ะ',
 'labels': 1}

Notice that each whitespace has become the special token `<_>`. Whitespace carries meaning in Thai (for example, it often marks phrase or sentence boundaries), and we don't want the tokenizer to discard it. For more info, read the model card [here](https://huggingface.co/airesearch/wangchanberta-base-att-spm-uncased).
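As a rough illustration of the whitespace handling only, the sketch below replaces each run of whitespace with the `<_>` token. The function `replace_spaces` is a hypothetical, simplified stand-in: the real `process_transformers` applies further cleaning rules that are not reproduced here, and whether it collapses repeated spaces exactly this way is an assumption.

```python
import re

SPACE_TOKEN = "<_>"  # special token seen in the processed sample above

def replace_spaces(text: str, space_token: str = SPACE_TOKEN) -> str:
    """Replace each run of whitespace with one special token (simplified)."""
    return re.sub(r"\s+", space_token, text.strip())

print(replace_spaces("mitsubishi attrage  ราคาสบายกระเป๋า"))
# mitsubishi<_>attrage<_>ราคาสบายกระเป๋า
```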

After preprocessing the text, let's tokenize the processed dataset:

In [11]:
#create tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name,
    model_max_length=args.max_length,
    truncation=True,
    use_fast=False,
    padding='max_length'
)

# define encode dataset function
def encode_function(examples):
    return tokenizer(examples[args.feature_col], max_length=args.max_length, truncation=True, padding='max_length')

encoded_dataset = processed_dataset.map(encode_function, batched=True)
encoded_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

In [12]:
len(tokenizer.convert_ids_to_tokens(encoded_dataset["train"][0]["input_ids"]))

510

## Instantiate Sequence Classification Model
Once the dataset is prepared, let's instantiate the text classification model. Luckily, Hugging Face provides an off-the-shelf model for sequence classification. To load it, run the following command:

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_labels=num_labels)

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


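This call automatically attaches a classification head with `num_labels` output nodes on top of the pretrained encoder. The head's weights are newly initialized, which is why the warning above asks you to train the model before using it for predictions.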

## Train the Model

For this workshop, we'll use Hugging Face's `Trainer` class, an easy-to-use wrapper around the training loop.

In [14]:
# define training argument

train_args = TrainingArguments(
    output_dir=args.output_dir,
    learning_rate=args.learning_rate,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    num_train_epochs=args.num_train_epochs,
    warmup_steps=int(len(encoded_dataset['train']) * args.num_train_epochs // args.batch_size * args.warmup_percent),
    weight_decay=args.weight_decay,
    save_total_limit=1,
    metric_for_best_model=args.metric_for_best_model,
    seed=args.seed,
    # num_train_epochs=5,  # too long for the workshop
    evaluation_strategy="steps", # or "epoch"
    eval_steps=25,
    max_steps=50,  # train for only 50 steps for a fast demo (overrides num_train_epochs)
    logging_strategy="steps",  # or "epoch"
    logging_steps=5
)
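It is worth working through the `warmup_steps` arithmetic once by hand. The sketch below repeats the expression from the cell above with the Wisesight numbers (21,628 training examples). The comparison with `max_steps` assumes the `Trainer`'s default linear warmup schedule; the variable names are only illustrative.

```python
num_train_examples = 21628  # size of the Wisesight train split
num_train_epochs = 5
batch_size = 16
warmup_percent = 0.1
learning_rate = 3e-05
max_steps = 50

# same expression as in the TrainingArguments cell
total_steps = num_train_examples * num_train_epochs // batch_size
warmup_steps = int(total_steps * warmup_percent)

print(total_steps, warmup_steps)  # 6758 675
print(warmup_steps > max_steps)   # True

# with linear warmup, the learning rate reached at the last step is only a fraction of the target
lr_at_last_step = learning_rate * max_steps / warmup_steps
```

In other words, the 50-step demo run ends while the learning rate is still warming up, so the short-run metrics say little about what a full 5-epoch run would reach. For a quick demo, a smaller warmup (for example 10% of `max_steps`) may be a more sensible choice.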

In [15]:
# instantiate trainer object

trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=classification_metrics # compute_metrics fn we defined earlier
)

In [None]:
# train model
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | F1 Micro | Precision Micro | Recall Micro | F1 Macro | Precision Macro | Recall Macro | Nb Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 1.3079 | 1.335415 | 0.405574 | 0.405574 | 0.405574 | 0.405574 | 0.252694 | 0.279725 | 0.277939 | 2404 |
| 50 | 1.3001 | 1.259765 | 0.484193 | 0.484193 | 0.484193 | 0.484193 | 0.240591 | 0.293734 | 0.268194 | 2404 |


TrainOutput(global_step=50, training_loss=1.340449676513672, metrics={'train_runtime': 213.5961, 'train_samples_per_second': 3.745, 'train_steps_per_second': 0.234, 'total_flos': 209670387264000.0, 'train_loss': 1.340449676513672, 'epoch': 0.04})

## Evaluate the model

Once training has finished, let's evaluate the model on the test set. We can obtain predictions with the `trainer.predict` function. What's nice is that if you passed a `compute_metrics` function to the `Trainer`, it also computes those metrics on the test set!

In [None]:
# test model
preds  = trainer.predict(encoded_dataset['test'])
pd.DataFrame.from_dict(preds[2],orient='index').transpose()

| test_loss | test_accuracy | test_f1_micro | test_precision_micro | test_recall_micro | test_f1_macro | test_precision_macro | test_recall_macro | test_nb_samples | test_runtime | test_samples_per_second | test_steps_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.258708 | 0.487458 | 0.487458 | 0.487458 | 0.487458 | 0.239111 | 0.283543 | 0.266367 | 2671.0 | 80.1855 | 33.31 | 2.083 |


In [None]:
preds[2]

{'test_loss': 1.2587077617645264,
 'test_accuracy': 0.48745788094346687,
 'test_f1_micro': 0.48745788094346687,
 'test_precision_micro': 0.48745788094346687,
 'test_recall_micro': 0.48745788094346687,
 'test_f1_macro': 0.2391108979767614,
 'test_precision_macro': 0.28354318493070035,
 'test_recall_macro': 0.26636658218824727,
 'test_nb_samples': 2671,
 'test_runtime': 80.1855,
 'test_samples_per_second': 33.31,
 'test_steps_per_second': 2.083}

# Text Classification: Wongnai Reviews
Let's repeat the previous steps on the Wongnai Reviews dataset.

In [None]:
#parameters
class Args:
    model_name = 'airesearch/wangchanberta-base-att-spm-uncased'
    # model_name = 'xlm-roberta-base'
    dataset_name_or_path = 'wongnai_reviews'
    feature_col = 'review_body'
    label_col = 'star_rating'
    output_dir = 'models/wongnai_reviews'
    batch_size = 16
    warmup_percent = 0.1
    learning_rate = 3e-05
    num_train_epochs = 5
    weight_decay = 0.01
    metric_for_best_model = 'f1_micro'
    seed = 1412
    max_length = 510

args = Args()

In [None]:
# load dataset
dataset = load_dataset(args.dataset_name_or_path)

In [None]:
dataset["train"][0]

{'review_body': 'ร้านอาหารใหญ่มากกกกกกก \nเลี้ยวเข้ามาเจอห้องน้ำก่อนเลย เออแปลกดี \nห้องทานหลักๆอยู่ชั้น 2 มีกาแฟ น้ำผึ้ง ซึ่งก็แค่เอาน้ำผึ้งมาราด แพงเวอร์ อย่าสั่งเลย \nลาบไข่ต้ม ไข่มันคาวอะ เลยไม่ประทับใจเท่าไหร่\nทอดมันหัวปลีกรอบอร่อยต้องเบิ้ล \nพะแนงห่อไข่อร่อยดี เห้ยแต่ราคา 150บาทมันเกินไปนะ รับไม่ไหวว\nเลิกกินแล้วมีขนมหวานให้กินฟรีเล็กน้อย )ขนมไทย) \n\nคงไม่ไปซ้ำ แพงเกิน ',
 'star_rating': 2}

In [None]:
# convert `star_rating` col to `labels`
dataset = dataset.map(lambda examples: {'labels': examples[args.label_col]}, batched=True)
num_labels = len(set(dataset['train']['labels'])) # get num labels

Since the Wongnai dataset doesn't provide a validation split like Wisesight does, we carve one out of the training set with `train_test_split`:

In [None]:
train_val_split = dataset["train"].train_test_split(test_size=0.1, shuffle=True, seed=2020)
dataset["train"] = train_val_split["train"]
dataset["validation"] = train_val_split["test"]
dataset

DatasetDict({
    train: Dataset({
        features: ['review_body', 'star_rating', 'labels'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['review_body', 'star_rating', 'labels'],
        num_rows: 6203
    })
    validation: Dataset({
        features: ['review_body', 'star_rating', 'labels'],
        num_rows: 4000
    })
})

In [None]:
# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name, model_max_length=args.max_length, truncation=True,use_fast=False, padding='max_length')

# encode dataset
def encode_function(examples):
    return tokenizer(examples[args.feature_col], max_length=args.max_length, truncation=True, padding='max_length')

# preprocess text
def preprocess_example(examples):
    examples[args.feature_col] = process_transformers(examples[args.feature_col])
    return examples

processed_dataset = dataset.map(preprocess_example)
encoded_dataset = processed_dataset.map(encode_function, batched=True)
encoded_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

In [None]:
len(encoded_dataset["train"][0]["input_ids"])

510

In [None]:
# inspect processed dataset
encoded_dataset["train"][0]

{'labels': tensor(3),
 'input_ids': tensor([    5,    10,  1440,  1062,   739, 12529,    33,   193,    15,     8,
          1609,   702,   269,   695,   676,   116,  1440,   193,  5979,    60,
           315,     8,    10, 18327,     8,    10,  5268,   878,    22,     8,
            10,  1742,  9356,   770,     8,    10,  6402,    55,  1730, 20211,
          8976,   476,  7316,    17,     8,    10, 18187,     8,    10,  9347,
             8,    10,  1306,     8,    10,   374,   241,  1687,   617,     6,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,  

In [None]:
# inspect input_ids
tokenizer.convert_ids_to_tokens(encoded_dataset["train"][0]["input_ids"])

['<s>',
 '▁',
 'รส',
 'ขา',
 'ติ',
 'ไปกัน',
 'กับ',
 'ราคา',
 'ได้',
 '<_>',
 '▁นอกจากนี้',
 'ยังมี',
 'ระบบ',
 'บัตร',
 'สมาชิก',
 'ทําให้',
 'รส',
 'ราคา',
 'จากเดิม',
 'ซึ่ง',
 'รวม',
 '<_>',
 '▁',
 'vat',
 '<_>',
 '▁',
 'ไว้ให้',
 'เรียบร้อย',
 'แล้ว',
 '<_>',
 '▁',
 'เมนู',
 'มีให้เลือก',
 'มากมาย',
 '<_>',
 '▁',
 'การบริการ',
 'อยู่',
 'ในระดับ',
 'ที่น่าพอใจ',
 'ถึงแม้จะ',
 'ประกาศ',
 'ตัวเองว่า',
 'เป็น',
 '<_>',
 '▁',
 'self',
 '<_>',
 '▁',
 'service',
 '<_>',
 '▁',
 'ก็ตาม',
 '<_>',
 '▁',
 'กลับมา',
 'กิน',
 'ซ้ํา',
 'แน่นอน',
 '</s>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>

In [16]:
# create model
model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_labels=num_labels)

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model

In [17]:
# define training argument

train_args = TrainingArguments(
    output_dir=args.output_dir,
    learning_rate=args.learning_rate,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    num_train_epochs=args.num_train_epochs,
    warmup_steps=int(len(encoded_dataset['train']) * args.num_train_epochs // args.batch_size * args.warmup_percent),
    weight_decay=args.weight_decay,
    save_total_limit=1,
    metric_for_best_model=args.metric_for_best_model,
    seed=args.seed,
    # num_train_epochs=5,  # too long for the workshop
    evaluation_strategy="steps", # or "epoch"
    eval_steps=25,
    max_steps=50,  # train for only 50 steps for a fast demo (overrides num_train_epochs)
    logging_strategy="steps",  # or "epoch"
    logging_steps=5
)

In [18]:
# instantiate trainer object

trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=classification_metrics # compute_metrics fn we defined earlier
)

In [19]:
# train model
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | F1 Micro | Precision Micro | Recall Micro | F1 Macro | Precision Macro | Recall Macro | Nb Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 1.5125 | 1.466823 | 0.16015 | 0.16015 | 0.16015 | 0.16015 | 0.14459 | 0.245564 | 0.245335 | 2404 |
| 50 | 1.3936 | 1.343478 | 0.332779 | 0.332779 | 0.332779 | 0.332779 | 0.245722 | 0.264421 | 0.267268 | 2404 |


TrainOutput(global_step=50, training_loss=1.4839708137512206, metrics={'train_runtime': 214.5877, 'train_samples_per_second': 3.728, 'train_steps_per_second': 0.233, 'total_flos': 209670387264000.0, 'train_loss': 1.4839708137512206, 'epoch': 0.04})

In [23]:
# get predictions
preds  = trainer.predict(encoded_dataset['test'])
pd.DataFrame.from_dict(preds[2],orient='index').transpose()

| test_loss | test_accuracy | test_f1_micro | test_precision_micro | test_recall_micro | test_f1_macro | test_precision_macro | test_recall_macro | test_nb_samples | test_runtime | test_samples_per_second | test_steps_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.341843 | 0.348559 | 0.348559 | 0.348559 | 0.348559 | 0.249591 | 0.264399 | 0.254754 | 2671.0 | 76.3067 | 35.003 | 2.189 |


In [24]:
preds[2]

{'test_loss': 1.341842532157898,
 'test_accuracy': 0.3485585922875328,
 'test_f1_micro': 0.3485585922875328,
 'test_precision_micro': 0.3485585922875328,
 'test_recall_micro': 0.3485585922875328,
 'test_f1_macro': 0.24959120123227627,
 'test_precision_macro': 0.2643990914935785,
 'test_recall_macro': 0.25475433528193925,
 'test_nb_samples': 2671,
 'test_runtime': 76.3067,
 'test_samples_per_second': 35.003,
 'test_steps_per_second': 2.189}

# Named-Entity Recognition (NER): Thai NER Corpus
In this part, we fine-tune WangchanBERTa on the NER task.

In [None]:
# import libraries
import os
import random
from dataclasses import dataclass, field
from functools import partial
from typing import Tuple, List, Dict, Any

import evaluate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
import torch
from datasets import load_dataset
from sklearn.metrics import f1_score
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForTokenClassification, AutoModelForCausalLM, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification

In [None]:
# define seed function
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode autotunes kernels and can vary between runs

In [None]:
seed_everything(42) # apply seed
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Download `thainer` dataset
To download the dataset, we'll use the off-the-shelf `load_dataset` function, which fetches the publicly available dataset from the Hugging Face Hub. For more info on the Thai NER dataset, read more [here](https://huggingface.co/datasets/pythainlp/thainer-corpus-v2).

In [None]:
# Prepare NER dataset
dataset = load_dataset("pythainlp/thainer-corpus-v2")
dataset

DatasetDict({
    train: Dataset({
        features: ['words', 'ner'],
        num_rows: 3938
    })
    validation: Dataset({
        features: ['words', 'ner'],
        num_rows: 1313
    })
    test: Dataset({
        features: ['words', 'ner'],
        num_rows: 1313
    })
})

After we downloaded the NER dataset, let's inspect what the data sample looks like:

In [None]:
dataset["train"][0]

{'words': ['ทักษิณ',
  ' ',
  'ชินวัตร',
  ' ',
  'ทวีต',
  'ไม่',
  'แปลกใจ',
  'ศาลปกครอง',
  'สูงสุด',
  ' ',
  'ยก',
  'ฟ้องคดี',
  'ถอน',
  'พาสปอร์ต',
  ' ',
  'ชี้',
  ' ',
  'กระบวนการยุติธรรม',
  'ไทย',
  'ถูก',
  'ใช้',
  'เป็น',
  'เครื่องมือ',
  'การเมือง'],
 'ner': [0,
  1,
  1,
  2,
  2,
  2,
  2,
  3,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  4,
  2,
  2,
  2,
  2,
  2]}

## NER as Token Classification

As shown above, each sample contains two keys:
- `words`: the sentence, already split into words
- `ner`: the label of each word, encoded as an integer

What do these numbers represent? And how does NER become a token classification problem?

Generally speaking, the goal of NER is for the model to predict which spans of text correspond to which entity label.

To represent a span, we need three values: (`start_idx`, `end_idx`, `label`). They specify where the span starts and ends, and which label it carries.

For example, given this tokenized text (taken from the sample above):
```python
[ 'ทักษิณ', # index=0
  ' ', # index=1
  'ชินวัตร', # index=2
  ' ', # index=3
  'ทวีต', # index=4
  'ไม่', # index=5
  'แปลกใจ', # index=6
  'ศาลปกครอง', # index=7
  'สูงสุด', # index=8
  ' ', # index=9
  'ยก', # index=10
  'ฟ้องคดี', # index=11
  'ถอน', # index=12
  'พาสปอร์ต', # index=13
  ' ', # index=14
  'ชี้', # index=15
  ' ', # index=16
  'กระบวนการยุติธรรม', # index=17
  'ไทย', # index=18
  'ถูก', # index=19
  'ใช้', # index=20
  'เป็น', # index=21
  'เครื่องมือ', # index=22
  'การเมือง'] # index=23
```

If we want to label `['ทักษิณ', ' ', 'ชินวัตร']` as `PERSON`, the span is represented as `(0, 3, "PERSON")`, where `end_idx` is exclusive.

Now you might start to see how NER relates to token classification. Given the tokenized text and its corresponding span tuple `(start_idx, end_idx, LABEL)`, we can assign a label to each token as follows:

```python
[ 'ทักษิณ', # label=PERSON
  ' ', # label=PERSON
  'ชินวัตร', # label=PERSON
  ' ', # label=None
  'ทวีต', # label=None
  ...
```
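The span-to-token mapping described above can be sketched in a few lines. The helper `spans_to_token_labels` is a hypothetical name introduced for illustration, and it assumes `end_idx` is exclusive, as in the `(0, 3, "PERSON")` example.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start_idx, end_idx, label), end_idx exclusive

def spans_to_token_labels(num_tokens: int, spans: List[Span]) -> List[Optional[str]]:
    """Assign each token the label of the span covering it, or None."""
    labels: List[Optional[str]] = [None] * num_tokens
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels

tokens = ['ทักษิณ', ' ', 'ชินวัตร', ' ', 'ทวีต']
print(spans_to_token_labels(len(tokens), [(0, 3, "PERSON")]))
# ['PERSON', 'PERSON', 'PERSON', None, None]
```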

## Preprocessing NER dataset

Now that we understand how NER works in the context of token classification, let's prepare the class list and its mappings between encoded labels and label strings.

In [None]:
# Prepare NER classes
classes_list: List[str] = dataset["train"].features["ner"].feature.names # get label list via the public ClassLabel attribute

# get mapper from (encoded label)<>(label string)
# label_string -> encoded_label
class_to_idx: Dict[str, int] = {class_name: idx for idx, class_name in enumerate(classes_list)}
# encoded_label -> label_string
idx_to_class: Dict[int, str] = {idx: class_name for class_name, idx in class_to_idx.items()}

# print number of classes
print(f"Number of classes: {len(classes_list)}")
class_to_idx # inspect the label string

Number of classes: 36


{'B-PERSON': 0,
 'I-PERSON': 1,
 'O': 2,
 'B-ORGANIZATION': 3,
 'B-LOCATION': 4,
 'I-ORGANIZATION': 5,
 'I-LOCATION': 6,
 'B-DATE': 7,
 'I-DATE': 8,
 'B-TIME': 9,
 'I-TIME': 10,
 'B-MONEY': 11,
 'I-MONEY': 12,
 'B-FACILITY': 13,
 'I-FACILITY': 14,
 'B-URL': 15,
 'I-URL': 16,
 'B-PERCENT': 17,
 'I-PERCENT': 18,
 'B-LEN': 19,
 'I-LEN': 20,
 'B-AGO': 21,
 'I-AGO': 22,
 'B-LAW': 23,
 'I-LAW': 24,
 'B-PHONE': 25,
 'I-PHONE': 26,
 'B-EMAIL': 27,
 'I-EMAIL': 28,
 'B-ZIP': 29,
 'B-TEMPERATURE': 30,
 'I-TEMPERATURE': 31,
 'B-DTAE': 32,
 'I-DTAE': 33,
 'B-DATA': 34,
 'I-DATA': 35}

This looks a bit different from what we explained above. Why?

We can see that the `None` label is mapped to `O`, but the other labels look a bit odd.

In the previous example, we annotated the token labels as follows:
```python
[ 'ทักษิณ', # label=PERSON
  ' ', # label=PERSON
  'ชินวัตร', # label=PERSON
  ' ', # label=None
  'ทวีต', # label=None
  ...
```

However, there is no `PERSON` class here, only `B-PERSON` and `I-PERSON`.

What do the prefixes `B-` and `I-` mean?

## NER Annotation Scheme

There are several annotation schemes for NER. The earlier example without the `B-` and `I-` prefixes corresponds to the IO scheme, where the `None` label is treated as the `O` class and every other label is written as `I-CLASS_NAME`. The IO scheme would annotate the previous example as follows:
```python
[ 'ทักษิณ', # label=I-PERSON
  ' ', # label=I-PERSON
  'ชินวัตร', # label=I-PERSON
  ' ', # label=O
  'ทวีต', # label=O
  ...
```

However, this scheme has a problem. What if we change the text as follows:
```python
[ 'ทักษิณ', # label=I-PERSON
  ' ', # label=I-PERSON
  'ชินวัตร', # label=I-PERSON
  'ยิ่งลักษณ์', # label=I-PERSON
  ' ', # label=I-PERSON
  'ชินวัตร', # label=I-PERSON
  'ทวีต', # label=O
  ...
```

Here we have added another name right after `ทักษิณ ชินวัตร`. The IO scheme doesn't let you annotate `ทักษิณ ชินวัตร` and `ยิ่งลักษณ์ ชินวัตร` separately, which causes confusion if your downstream application needs to tell these two entities apart.

To fix this, there is an alternative to the IO scheme called the BIO scheme, which marks the first token of each span with `B-` and the remaining tokens with `I-`, so that consecutive spans with the same label do not merge. With the BIO scheme, the annotation becomes:

```python
[ 'ทักษิณ', # label=B-PERSON
  ' ', # label=I-PERSON
  'ชินวัตร', # label=I-PERSON
  'ยิ่งลักษณ์', # label=B-PERSON
  ' ', # label=I-PERSON
  'ชินวัตร', # label=I-PERSON
  'ทวีต', # label=O
  ...
```
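The difference between the two schemes can be checked with a small round trip. The helpers `spans_to_bio` and `bio_to_spans` are hypothetical names written for this illustration (they are not part of `seqeval` or the workshop code), and spans again use an exclusive `end_idx`.

```python
def spans_to_bio(num_tokens, spans):
    """Encode (start, end, label) spans as BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def bio_to_spans(tags):
    """Decode tags back into spans; also accepts IO-style tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if start is not None and (tag == "O" or starts_new):
            spans.append((start, i, label))
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
    return spans

# two adjacent PERSON entities, as in the example above
spans = [(0, 3, "PERSON"), (3, 6, "PERSON")]
bio_tags = spans_to_bio(7, spans)
io_tags = [t if t == "O" else "I-" + t[2:] for t in bio_tags]

print(bio_to_spans(bio_tags))  # [(0, 3, 'PERSON'), (3, 6, 'PERSON')]
print(bio_to_spans(io_tags))   # [(0, 6, 'PERSON')]
```

With BIO tags the two entities survive the round trip; with IO tags they collapse into a single span.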

The `thainer` corpus uses the BIO scheme.

Read more about NER annotation schemes [here](https://medium.com/@rongqianhui/named-entity-recognition-annotation-schemes-e684f9cd5a56).

## Evaluation
For NER, we usually use `seqeval`, which scores token-level predictions at the entity (span) level. In this step, we'll create a small class that helps us evaluate the model's predictions.

In [None]:
class Benchmark(object):

    def __init__(self) -> None:
        # define seqeval object loaded from HF's `evaluate` lib
        self.seqeval = evaluate.load("seqeval")

    def eval(
        self,
        predictions: List[List[int]],
        labels: List[List[int]],
    ) -> dict:
        """Evaluate predictions against labels and return a
        dictionary of span-level precision, recall, and F1,
        plus token-level accuracy
        """
        # convert encoded predictions to string
        true_predictions = [
            [classes_list[p] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]
        # convert encoded labels to string
        true_labels = [
            [classes_list[l] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]

        # use seqeval to compute metrics
        results = self.seqeval.compute(predictions=true_predictions, references=true_labels)

        # return metrics dict
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
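The `if l != -100` filter in `eval` deserves a quick illustration. `-100` is the label value that PyTorch's cross-entropy loss ignores by default, and it is conventionally assigned to positions that should not be scored, such as special tokens. The toy class list and ids below are assumptions chosen only for this example.

```python
classes = ["B-PERSON", "I-PERSON", "O"]  # toy label list

# one sentence with six token positions; -100 marks positions to skip
predictions = [[2, 0, 1, 2, 2, 2]]
labels = [[-100, 0, -100, 1, 2, -100]]

# keep only positions whose gold label is not -100, then map ids to strings
true_predictions = [
    [classes[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predictions, labels)
]
true_labels = [
    [classes[l] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predictions, labels)
]

print(true_predictions)  # [['B-PERSON', 'O', 'O']]
print(true_labels)       # [['B-PERSON', 'I-PERSON', 'O']]
```

Both lists end up with the same length per sentence, which is what `seqeval` expects.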

In [None]:
# instantiate benchmark object
# we'll use this later
benchmark = Benchmark()

## Initialize model
We'll use HuggingFace's `AutoModelForTokenClassification` to load the model with token classification head.

In [None]:
model_name_or_path = "airesearch/wangchanberta-base-att-spm-uncased" # checkpoint name

# Prepare model and tokenizer
model = AutoModelForTokenClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(class_to_idx),  # specify num class
    id2label=idx_to_class, # pass the mappings so the model doesn't create its own
    label2id=class_to_idx
)
# move model to GPU
model.to(device)

# define tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Tokenize and Encode Dataset
Before feeding text to the model, we need to tokenize the words with the `tokenizer` loaded above. This raises a challenge: when a word is split into several subword tokens, the tokens no longer line up one-to-one with the word-level labels. We need a function that realigns the labels to the tokens, and we apply it to the whole dataset.

In [None]:
def tokenize_and_align_labels(tokenizer, examples):
    # tokenize words
    tokenized_inputs = tokenizer(
        examples["words"],  # pre-split words to tokenize
        truncation=True,  # truncate sequences that exceed the model's maximum length
        is_split_into_words=True  # the input is already split into a list of words
    )

    labels = []
    # iterate over the word-level labels of each example in the batch
    for i, label in enumerate(examples["ner"]):
        # word_ids has one entry per token (subword) of the tokenized
        # sentence; each entry is the index of the original word that
        # the token came from, or None for special tokens
        # e.g.
        # before:    [this is an example sentence]
        # index:     [  0,  1, 2,   3,       4   ]
        # tokenized: [<s>, th, is, _is, _an, _ex, ample, _sent, ence, </s>]
        # word_ids:  [None, 0,  0,   1,   2,   3,     3,     4,    4, None]
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective words

        # initialize
        previous_word_idx = None
        label_ids = []

        # -100 is a special label that is skipped when computing the loss
        # iterate over the word index of each token
        for word_idx in word_ids:
            # special tokens get the label -100
            if word_idx is None:
                label_ids.append(-100)
            # first token of a new word: assign that word's label
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # remaining tokens of the same word are skipped in the loss;
            # labelling only the first token of each word is enough
            else:
                label_ids.append(-100)

            # update previous word
            previous_word_idx = word_idx

        # append label_ids
        labels.append(label_ids)

    # assign new column
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
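
To see the alignment rule in isolation, here is a minimal sketch that runs without a tokenizer. It reuses the `word_ids` list from the comment in the function above; the word-level labels are invented for illustration, and `align_labels` is a hypothetical helper that reproduces only the inner loop.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each word's label to its first sub-word and mask everything else with -100."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # special token such as <s> or </s>
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # first sub-word of a new word
            aligned.append(word_labels[word_idx])
        else:
            # later sub-word of the same word
            aligned.append(-100)
        previous_word_idx = word_idx
    return aligned


# word_ids for: <s>, th, is, _is, _an, _ex, ample, _sent, ence, </s>
word_ids = [None, 0, 0, 1, 2, 3, 3, 4, 4, None]
# made-up labels for the five words "this is an example sentence"
word_labels = [2, 2, 2, 0, 1]

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 2, -100, 2, 2, 0, -100, 1, -100, -100]
```

The output has one entry per token, and exactly five of them (one per word) carry a real label.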

In [None]:
# apply tokenize + align labels
tokenized_dataset = dataset.map(partial(tokenize_and_align_labels, tokenizer), batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['words', 'ner', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3938
    })
    validation: Dataset({
        features: ['words', 'ner', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1313
    })
    test: Dataset({
        features: ['words', 'ner', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1313
    })
})

In [None]:
idx = 0
sample = tokenized_dataset["train"][idx]  # example to inspect

print("BEFORE ALIGNED/TOKENIZE LABELS")
print(list(zip(sample["words"], sample["ner"])))
print()

print("AFTER ALIGNED/TOKENIZE LABELS")
list(zip(
    tokenizer.convert_ids_to_tokens(sample["input_ids"]),
    sample["labels"]
))

BEFORE ALIGNED/TOKENIZE LABELS
[('ทักษิณ', 0), (' ', 1), ('ชินวัตร', 1), (' ', 2), ('ทวีต', 2), ('ไม่', 2), ('แปลกใจ', 2), ('ศาลปกครอง', 3), ('สูงสุด', 2), (' ', 2), ('ยก', 2), ('ฟ้องคดี', 2), ('ถอน', 2), ('พาสปอร์ต', 2), (' ', 2), ('ชี้', 2), (' ', 2), ('กระบวนการยุติธรรม', 2), ('ไทย', 4), ('ถูก', 2), ('ใช้', 2), ('เป็น', 2), ('เครื่องมือ', 2), ('การเมือง', 2)]

AFTER ALIGNED/TOKENIZE LABELS


[('<s>', -100),
 ('▁', 0),
 ('ทักษิณ', -100),
 ('▁', 1),
 ('▁', 1),
 ('ชินวัตร', -100),
 ('▁', 2),
 ('▁', 2),
 ('ทวีต', -100),
 ('▁ไม่', 2),
 ('▁', 2),
 ('แปลกใจ', -100),
 ('▁', 3),
 ('ศาลปกครอง', -100),
 ('▁', 2),
 ('สูงสุด', -100),
 ('▁', 2),
 ('▁', 2),
 ('ยก', -100),
 ('▁', 2),
 ('ฟ้องคดี', -100),
 ('▁', 2),
 ('ถอน', -100),
 ('▁', 2),
 ('พาสปอร์ต', -100),
 ('▁', 2),
 ('▁', 2),
 ('ชี้', -100),
 ('▁', 2),
 ('▁', 2),
 ('กระบวนการ', -100),
 ('ยุติธรรม', -100),
 ('▁', 4),
 ('ไทย', -100),
 ('▁', 2),
 ('ถูก', -100),
 ('▁ใช้', 2),
 ('▁', 2),
 ('เป็น', -100),
 ('▁', 2),
 ('เครื่องมือ', -100),
 ('▁', 2),
 ('การเมือง', -100),
 ('</s>', -100)]

## Train the model
Let's define a `compute_metrics` function and start training!

In [None]:
# this function is passed to HF's Trainer class
# as the `compute_metrics` argument
def compute_metrics(p):
    predictions, labels = p  # logits and label ids
    # pick the highest-scoring class for every token
    predictions = np.argmax(predictions, axis=2)
    return benchmark.eval(predictions=predictions, labels=labels)

In [None]:
training_args = TrainingArguments(
    output_dir="thainer_corpus_v2_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    # num_train_epochs=10,  # too long for the workshop
    eval_steps=25,
    max_steps=50,  # we'll train for only 50 steps for fast demo
    save_steps=25, # change this if necessary
    weight_decay=0.01,
    evaluation_strategy="steps",  # or "epoch"
    save_strategy="steps",  # or "epoch"
    load_best_model_at_end=True,
    logging_strategy="steps",  # or "epoch"
    logging_steps=5
)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_dataset["train"],
  eval_dataset=tokenized_dataset["validation"],
  tokenizer=tokenizer,
  data_collator=data_collator,
  compute_metrics=compute_metrics,
)

trainer.train()

You're using a CamembertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


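| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|------|---------------|-----------------|-----------|--------|----|----------|
| 25 | 1.3669 | 1.155898 | 0.0 | 0.0 | 0.0 | 0.784821 |
| 50 | 1.0647 | 1.106427 | 0.0 | 0.0 | 0.0 | 0.784762 |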



TrainOutput(global_step=50, training_loss=1.33467209815979, metrics={'train_runtime': 80.3882, 'train_samples_per_second': 9.952, 'train_steps_per_second': 0.622, 'total_flos': 89979058842624.0, 'train_loss': 1.33467209815979, 'epoch': 0.2})
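
The `'epoch': 0.2` in the output follows from the step budget. The arithmetic below is a rough sketch that assumes a single GPU and no gradient accumulation (neither is set explicitly in the cell above); the variable names are ours, not the Trainer's.

```python
import math

train_rows = 3938   # size of the train split shown earlier
batch_size = 16     # per_device_train_batch_size
max_steps = 50      # max_steps
num_devices = 1     # assumption: one GPU
grad_accum = 1      # assumption: no gradient accumulation

# number of training examples consumed in 50 optimizer steps
samples_seen = max_steps * batch_size * num_devices * grad_accum

# optimizer steps needed to pass over the train split once
steps_per_epoch = math.ceil(train_rows / (batch_size * num_devices * grad_accum))

# fraction of one epoch covered by the demo run
epoch_fraction = max_steps / steps_per_epoch

print(samples_seen)              # 800
print(steps_per_epoch)           # 247
print(round(epoch_fraction, 2))  # 0.2
```

In other words, the demo run sees only about a fifth of the training data once, which is worth keeping in mind when reading the metrics.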

In [None]:
# Load the fine-tuned model from the latest checkpoint
finetuned_model = AutoModelForTokenClassification.from_pretrained(
    "thainer_corpus_v2_model/checkpoint-50", # load latest checkpoint
    num_labels=len(class_to_idx), id2label=idx_to_class, label2id=class_to_idx
)
finetuned_model.to(device)

CamembertForTokenClassification(
  (roberta): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(25005, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0-11): 12 x CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)


## Predict Test Set and Evaluate the Results
To keep the code readable, we'll define `NERWangchanBERTaInferenceModel`, a class that wraps the `tokenizer`, the `model`, and the mappings between encoded labels and label strings, and use it to run predictions.

In [None]:
class NERWangchanBERTaInferenceModel(object):

    def __init__(
        self,
        model: AutoModelForTokenClassification,
        tokenizer: AutoTokenizer,
        idx_to_class: Dict[int, str],
        class_to_idx: Dict[str, int]
    ) -> None:
        self.model = model
        self.tokenizer = tokenizer
        self.idx_to_class = idx_to_class
        self.class_to_idx = class_to_idx

        self.model.eval()

    def get_prediction_in_word_level(self, input_ids, word_ids, logits):
        # initialize
        predictions = []
        prev_word_id = None

        # `possible_class_indices` holds the class indices that are
        # valid at the current decoding step and is updated after
        # every step. For example, if the current step is B-PERSON,
        # the only possible next steps are B-[any class], O, and I-PERSON.
        # The initial state allows O and B-[all classes];
        # a sequence cannot start with an I- tag.
        possible_class_indices = [self.class_to_idx["O"]]
        possible_class_indices += [
            self.class_to_idx[class_name]
            for class_name
            in self.class_to_idx.keys()
            if class_name.startswith("B-")]

        for token_id, word_id in enumerate(word_ids):
            # Skip special tokens (<s> and </s>) and non-initial sub-word tokens:
            # only the first token of each word is used, since we want word-level predictions
            if word_id is None or word_id == prev_word_id:
                continue
            prev_word_id = word_id

            # Get constrained prediction:
            # argmax over only the logits listed in possible_class_indices
            filtered_logit = logits[token_id, possible_class_indices]
            pred_class_id = possible_class_indices[filtered_logit.argmax(0)]
            pred_class_name = self.idx_to_class[pred_class_id]
            predictions.append(pred_class_id)

            # Update possible_class_indices:
            # O and every B-XXX may follow any tag
            possible_class_indices = [self.class_to_idx["O"]]
            possible_class_indices += [
                self.class_to_idx[class_name]
                for class_name in self.class_to_idx.keys()
                if class_name.startswith("B-")]
            if pred_class_name != "O":
                # after B-XXX or I-XXX, the matching I-XXX may follow as well
                possible_class_indices += [
                    self.class_to_idx[class_name]
                    for class_name in self.class_to_idx.keys()
                    if class_name == f"I-{pred_class_name[2:]}"]

        # return constraint predictions
        return predictions

    def predict(self, words: List[str]) -> List[int]:
        # tokenize input words and move the batch to the model's device
        batch = self.tokenizer(words, return_tensors="pt", truncation=True, is_split_into_words=True)
        batch = batch.to(self.model.device)

        # forward and get logits
        with torch.no_grad():
            logits = self.model(**batch).logits

        # get word-level predictions
        predictions = self.get_prediction_in_word_level(
            input_ids=batch["input_ids"][0],
            word_ids=batch.word_ids(batch_index=0),
            logits=logits[0]
        )

        return predictions
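
The constrained decoding idea can be tried out without a model. The following is a simplified sketch in plain Python, not the class above: the tag set and the per-word scores are invented, `allowed_next` and `constrained_greedy` are hypothetical helpers, and the transition rule is the one described in the comment (after an entity tag, O, any B- tag, or the matching I- tag may follow).

```python
from typing import Dict, List


def allowed_next(previous_tag: str, class_to_idx: Dict[str, int]) -> List[int]:
    """Return the label ids that may follow previous_tag."""
    # O and every B- tag are always allowed
    allowed = [
        idx for name, idx in class_to_idx.items()
        if name == "O" or name.startswith("B-")
    ]
    # inside an entity, the matching I- tag is allowed too
    if previous_tag != "O":
        inside = f"I-{previous_tag[2:]}"
        if inside in class_to_idx:
            allowed.append(class_to_idx[inside])
    return allowed


def constrained_greedy(scores: List[List[float]], class_to_idx: Dict[str, int]) -> List[str]:
    """Pick, word by word, the best-scoring tag among the allowed ones."""
    idx_to_class = {idx: name for name, idx in class_to_idx.items()}
    previous_tag = "O"  # start state: a sequence cannot begin with an I- tag
    decoded = []
    for word_scores in scores:
        candidates = allowed_next(previous_tag, class_to_idx)
        best = max(candidates, key=lambda idx: word_scores[idx])
        previous_tag = idx_to_class[best]
        decoded.append(previous_tag)
    return decoded


# Hypothetical tag set and scores (one row per word, one column per tag)
class_to_idx = {"O": 0, "B-PERSON": 1, "I-PERSON": 2, "B-LOCATION": 3, "I-LOCATION": 4}
scores = [
    [0.1, 0.2, 0.9, 0.0, 0.0],  # plain argmax would pick I-PERSON, which cannot start a sequence
    [0.1, 0.0, 0.8, 0.0, 0.3],  # I-PERSON is allowed after B-PERSON
    [0.2, 0.0, 0.1, 0.0, 0.9],  # I-LOCATION is not allowed after I-PERSON
]

decoded = constrained_greedy(scores, class_to_idx)
print(decoded)  # ['B-PERSON', 'I-PERSON', 'O']
```

An unconstrained argmax over the same scores would give `I-PERSON, I-PERSON, I-LOCATION`, which is not a valid tag sequence.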

In [None]:
# instantiate the inference wrapper
inference_model = NERWangchanBERTaInferenceModel(
    finetuned_model,
    tokenizer,
    idx_to_class=idx_to_class,
    class_to_idx=class_to_idx)

In [None]:
predictions = [
    inference_model.predict(
        words=sample["words"],
    ) for sample in tqdm(dataset["test"])
]

In [None]:
# evaluate using predefined benchmark class
metric = benchmark.eval(
    predictions=predictions,
    labels=[sample["ner"] for sample in dataset["test"]]
)

print("Test set results:")
metric

Test set results:


{'precision': 0.1,
 'recall': 0.0002620545073375262,
 'f1': 0.0005227391531625719,
 'accuracy': 0.7898823620763288}
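
Accuracy near 0.79 next to an F1 near zero can look contradictory. One plausible reading, which we have not verified on this run, is that after only 50 steps the model labels almost every word as non-entity: token-level accuracy then stays high, while span-level precision and recall collapse. The sketch below illustrates that effect on a made-up sentence. `extract_spans` is a simplified stand-in written for this example, not seqeval's implementation (it ignores I- tags that do not follow a matching B- or I- tag), and the tag names are hypothetical.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end) entity spans from an IOB2 tag sequence."""
    spans = set()
    start, span_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and span_type == tag[2:]
        if start is not None and not continues:
            # the open span ended at the previous position
            spans.add((span_type, start, i - 1))
            start, span_type = None, None
        if tag.startswith("B-"):
            start, span_type = i, tag[2:]
    return spans


gold = ["B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOCATION", "O", "O", "O"]
pred = ["O"] * 10  # a model that never predicts an entity

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = len(gold_spans & pred_spans)

# span-level scores: a span counts only if type and boundaries both match
precision = correct / len(pred_spans) if pred_spans else 0.0
recall = correct / len(gold_spans) if gold_spans else 0.0
# token-level accuracy: fraction of positions with the same tag
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(precision, recall, accuracy)  # 0.0 0.0 0.7
```

For NER, the span-level F1 is therefore the number to watch; accuracy mostly reflects how common the non-entity tag is.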