<a href="https://colab.research.google.com/github/Danzigerrr/MultiClass-Entity-Linking-System/blob/NER-datasets/NER_BERT_with_Conll2003.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT with Conll2003

Original code: https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models/blob/main/Other-Language_Models_BERT_related/YT_Fine_tuning_BERT_NER_v1.ipynb



**There are 9 types of labels in the dataset:**
- O means the word doesn’t correspond to any entity.
- B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
- B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
- B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
- B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
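
To make the B-/I- scheme concrete, here is a minimal sketch that groups tagged words into entities. The helper name `bio_to_entities` is hypothetical (it is not part of this notebook or of any library); the tokens and tag ids are the first training sample printed further down, and the label order is the one the dataset reports.

```python
# Label names in the order used by the dataset's ner_tags feature
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def bio_to_entities(tokens, tag_ids, labels=LABELS):
    """Group words into (entity_text, entity_type) pairs using B-/I- tags."""
    entities = []
    current_words, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = labels[tag_id]
        if tag.startswith("B-"):
            # A B- tag closes any open entity and starts a new one
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity
            current_words.append(token)
        else:
            # "O" (or an I- tag that does not match) closes the open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]
entities = bio_to_entities(tokens, tag_ids)
print(entities)  # [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```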



## Import libraries

In [1]:
!pip install datasets transformers tokenizers seqeval evaluate -q

In [2]:
import datasets
from datasets import load_dataset
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

## Import Conll2003 dataset

In [3]:
# The dataset is stored at https://huggingface.co/datasets/eriktks/conll2003
conll2003 = load_dataset("conll2003", trust_remote_code=True)

In [4]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [5]:
conll2003.shape

{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}

In [6]:
# first sample from train dataset:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [7]:
# feature names
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

## Create tokenizer

In [8]:
# define tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

### The Sub-Token Problem
The input IDs returned by the tokenizer are longer than the label lists in the dataset: the tokenizer adds the special tokens `[CLS]` and `[SEP]`, and it may split a single word into several sub-tokens.
We therefore need to align the labels with the tokens before training.
The alignment relies on the result of the *word_ids()* method.


In [9]:
example_text = conll2003['train'][0]

tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

word_ids = tokenized_input.word_ids()

print(f"word_ids: {word_ids}")

''' As we can see, it returns a list with the same number of
elements as our processed input ids, mapping special tokens to
None and all other tokens to their respective word.
This way, we can align the labels with the processed input ids. '''

print(f"tokenized_input: {tokenized_input}")

word_ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]
tokenized_input: {'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The lengths of `ner_tags` and `input_ids` differ:

In [10]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])
# (9, 11)

(9, 11)

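The function `tokenize_and_align_labels` below does two jobs:

- it sets -100 as the label for the special tokens, so the loss function ignores them
- it handles the sub-tokens that follow the first sub-token of a word: they inherit the word's label when `label_all_tokens=True` (the default here) and are masked with -100 otherwise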
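
The two jobs can be illustrated without a tokenizer. The sketch below uses a hand-written `word_ids` list (an assumption for illustration: word 1 is split into two sub-tokens) and a hypothetical helper `align_labels` that mirrors the per-sentence logic for a single sentence.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Map word-level labels onto token positions described by word_ids."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token: ignored by the loss
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            # First sub-token of a word: take the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned

# Hypothetical tokenization: [CLS] w0 w1a ##w1b w2 [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]

labelled_all = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(labelled_all)  # [-100, 3, 0, 0, 7, -100]
print(first_only)    # [-100, 3, 0, -100, 7, -100]
```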

In [11]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    """
    Function to tokenize and align labels with respect to the tokens. This function is specifically designed for
    Named Entity Recognition (NER) tasks where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A dictionary containing the tokens and the corresponding NER tags.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.

    label_all_tokens (bool): A flag to indicate whether all tokens should have labels.
                             If False, only the first token of a word will have a label,
                             the other tokens (subwords) corresponding to the same word will be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # word_ids() returns a list that maps each token to the index of
        # the word it came from in the original sentence.
        previous_word_idx = None
        label_ids = []
        # Special tokens such as [CLS] and [SEP] are mapped to None.
        # Their label is set to -100 so the loss function ignores them.
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later sub-tokens of the same word: keep the word's label
                # if label_all_tokens is True, otherwise mask them with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels  # a new key is added
    return tokenized_inputs

In [12]:
q = tokenize_and_align_labels(conll2003['train'][0:1])
print(q)

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]]}


In [13]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<20} {label}")

[CLS]_______________ -100
eu__________________ 3
rejects_____________ 0
german______________ 7
call________________ 0
to__________________ 0
boycott_____________ 0
british_____________ 7
lamb________________ 0
.___________________ 0
[SEP]_______________ -100


In [14]:
# apply this method to the whole dataset
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

## Create the model

In [15]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
  "test-ner",
  evaluation_strategy = "epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  num_train_epochs=3,
  weight_decay=0.01,
)



In [17]:
# the data collator pads the examples and assembles them into batches
data_collator = DataCollatorForTokenClassification(tokenizer)
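
For intuition, the sketch below imitates in plain Python what a token-classification collator does when it forms a batch: inputs are padded with the padding token id, and labels are padded with -100 so the padded positions are ignored by the loss. The function `pad_batch` is hypothetical and only approximates the library behaviour; the pad id 0 is an assumption that matches BERT's `[PAD]` token.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so they do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

features = [
    {"input_ids": [101, 7327, 19164, 102], "labels": [-100, 3, 0, -100]},
    {"input_ids": [101, 2446, 102], "labels": [-100, 7, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [101, 2446, 102, 0]
print(batch["labels"][1])     # [-100, 7, -100, -100]
```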

### Metrics for evaluation

In [18]:
# load the seqeval metric (datasets is already imported above)
import evaluate
metric = evaluate.load("seqeval")

In [19]:
sample_from_dataset = conll2003['train'][0]
sample_from_dataset

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [20]:
# 9 possible feature names (labels)
label_list = conll2003["train"].features["ner_tags"].feature.names

label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [21]:
# check if the metric method is working:
labels = [label_list[i] for i in sample_from_dataset["ner_tags"]]

metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
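
The precision, recall and F1 values reported by seqeval are computed over whole entities, while the accuracy is computed per token. The sketch below is a simplified re-implementation for illustration only (the helpers `extract_spans` and `entity_f1` are hypothetical, and it assumes well-formed BIO sequences); it is not seqeval's actual code. It shows how one wrong token can leave token accuracy high while halving the entity-level score.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    spans = set()
    start, span_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and span_type == tag[2:]
        if not continues:
            if span_type is not None:
                spans.add((span_type, start, i))
            if tag.startswith("B-"):
                start, span_type = i, tag[2:]
            else:
                start, span_type = None, None
    return spans

def entity_f1(reference, prediction):
    """Exact-match precision, recall and F1 over entity spans."""
    ref, pred = extract_spans(reference), extract_spans(prediction)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

reference = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
prediction = ["B-PER", "I-PER", "O", "B-ORG", "O"]  # last token of the ORG is missed

token_accuracy = sum(r == p for r, p in zip(reference, prediction)) / len(reference)
precision, recall, f1 = entity_f1(reference, prediction)
print(token_accuracy)         # 0.8
print(precision, recall, f1)  # 0.5 0.5 0.5
```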

In [22]:
def compute_metrics(eval_preds):
    """
    Function to compute the evaluation metrics for Named Entity Recognition (NER) tasks.
    The function computes precision, recall, F1 score and accuracy.

    Parameters:
    eval_preds (tuple): A tuple containing the predicted logits and the true labels.

    Returns:
    A dictionary containing the precision, recall, F1 score and accuracy.
    """
    pred_logits, labels = eval_preds

    # argmax over the label dimension; softmax is monotonic, so the
    # largest logit is also the most probable label
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
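
The two steps inside `compute_metrics` can be checked on a toy input. The sketch below uses plain Python instead of NumPy, a shortened label list and made-up logits (all assumptions for illustration); it confirms that taking the argmax of the logits gives the same ids as taking the argmax of the softmax probabilities, and shows the effect of filtering out positions labelled -100.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

label_list = ["O", "B-PER", "I-PER"]  # shortened label set for illustration
# One sentence with four token positions; made-up logits, one row per token
logits = [[2.0, 0.1, -1.0], [0.3, 3.2, 0.5], [-0.5, 0.2, 2.7], [1.5, 0.0, 0.1]]
labels = [-100, 1, 2, -100]  # first and last positions are special tokens

pred_ids = [argmax(row) for row in logits]
pred_ids_from_probs = [argmax(softmax(row)) for row in logits]

# Keep only the positions that carry a real label
predictions = [label_list[p] for p, l in zip(pred_ids, labels) if l != -100]
true_labels = [label_list[l] for l in labels if l != -100]
print(predictions)  # ['B-PER', 'I-PER']
print(true_labels)  # ['B-PER', 'I-PER']
```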

## Train model

In [23]:
trainer = Trainer(
   model,
   args,
   train_dataset=tokenized_datasets["train"],
   eval_dataset=tokenized_datasets["validation"],
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)

In [24]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2153 | 0.062212 | 0.92213 | 0.92997 | 0.926033 | 0.983113 |
| 2 | 0.0462 | 0.058031 | 0.932979 | 0.942164 | 0.937549 | 0.98513 |
| 3 | 0.0263 | 0.057524 | 0.937016 | 0.946974 | 0.941969 | 0.986211 |


TrainOutput(global_step=2634, training_loss=0.07542063498370287, metrics={'train_runtime': 554.9325, 'train_samples_per_second': 75.907, 'train_steps_per_second': 4.747, 'total_flos': 1020143109346326.0, 'train_loss': 0.07542063498370287, 'epoch': 3.0})
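
The step count and throughput in `TrainOutput` can be reproduced from the training settings. The arithmetic below is a sanity check under two assumptions that the notebook does not state explicitly: training ran on a single device, and there was no gradient accumulation.

```python
import math

train_examples = 14041    # rows in the train split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 554.9325  # seconds, from TrainOutput

# The last, partially filled batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps)  # 878 2634
print(round(samples_per_second, 3))
```

The total of 2634 steps matches both `global_step` and the name of the final checkpoint directory, `checkpoint-2634`.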

In [24]:
model.save_pretrained("ner_model")
tokenizer.save_pretrained("tokenizer")

In [25]:
# integer ids on both sides; json.dump turns dictionary keys into strings on its own
id2label = {
    i: label for i, label in enumerate(label_list)
}
label2id = {
    label: i for i, label in enumerate(label_list)
}
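
Because these mappings are written to `config.json` in the next section, it helps to know how JSON treats dictionary keys: they are always stored as strings, so an id-to-label mapping comes back with string keys after a save and load. A minimal sketch with a shortened label list (an assumption for illustration):

```python
import json

label_list = ["O", "B-PER", "I-PER"]  # shortened for illustration
id2label = {i: label for i, label in enumerate(label_list)}

# Round trip through JSON: integer keys are written out as strings
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
print(restored)  # {'0': 'O', '1': 'B-PER', '2': 'I-PER'}

# Convert the keys back to integers before indexing with predicted class ids
id2label_int = {int(k): v for k, v in restored.items()}
print(id2label_int[1])  # B-PER
```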

## Load trained model

In [26]:
import json

In [63]:
checkpoint_path = "/content/test-ner/checkpoint-2634"

In [30]:
# read the checkpoint's config; the with-block closes the file afterwards
with open(checkpoint_path + "/config.json") as f:
    config = json.load(f)

In [31]:
config["id2label"] = id2label
config["label2id"] = label2id

In [33]:
# write the updated config back; closing the file guarantees it is flushed to disk
with open(checkpoint_path + "/config.json", "w") as f:
    json.dump(config, f)

In [44]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained(checkpoint_path)
model_fine_tuned

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [37]:
from transformers import pipeline

In [60]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

example = "Michael Jordan is a player who plays for the Chicago Bulls."

ner_results = nlp(example)

print(ner_results)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity': 'B-PER', 'score': 0.9985115, 'index': 1, 'word': 'michael', 'start': 0, 'end': 7}, {'entity': 'I-PER', 'score': 0.9970872, 'index': 2, 'word': 'jordan', 'start': 8, 'end': 14}, {'entity': 'B-ORG', 'score': 0.99408454, 'index': 10, 'word': 'chicago', 'start': 45, 'end': 52}, {'entity': 'I-ORG', 'score': 0.99448806, 'index': 11, 'word': 'bulls', 'start': 53, 'end': 58}]


Visualize the output:

In [61]:
from IPython.display import display, HTML

def visualize_ner_results_merged(text, ner_results):
    # Mapping entity types to colors for visualization
    entity_colors = {
        "PER": "darkblue",
        "ORG": "darkgreen",
        "LOC": "coral",  # "darkcoral" is not a CSS color name
        "MISC": "darkgoldenrod"  # "darkgoldenrodyellow" is not a CSS color name
    }

    # Merge contiguous entities
    merged_entities = []
    current_entity = None

    for entity in ner_results:
        entity_type = entity['entity'].split('-')[-1]  # Extract the main type (e.g., PER)

        if entity['entity'].startswith("B-"):
            # Start of a new entity
            if current_entity:  # Append the previous entity
                merged_entities.append(current_entity)
            current_entity = {
                "type": entity_type,
                "start": entity['start'],
                "end": entity['end']
            }
        elif entity['entity'].startswith("I-") and current_entity and current_entity['type'] == entity_type:
            # Continuation of the current entity
            current_entity["end"] = entity['end']
        else:
            # Append the current entity if it exists
            if current_entity:
                merged_entities.append(current_entity)
            current_entity = None  # Reset for next

    # Append the last entity
    if current_entity:
        merged_entities.append(current_entity)

    # Build the visualization HTML
    highlighted_text = ""
    last_end = 0
    for entity in merged_entities:
        entity_type = entity['type']
        color = entity_colors.get(entity_type, "lightgray")  # Default color if not mapped
        start, end = entity['start'], entity['end']

        # Append text before the entity
        highlighted_text += text[last_end:start]

        # Append the highlighted entity
        highlighted_text += f"<span style='background-color: {color}; padding: 2px; border-radius: 3px;'>{text[start:end]} ({entity_type})</span>"

        # Update the last end position
        last_end = end

    # Append remaining text
    highlighted_text += text[last_end:]

    # Display the result
    display(HTML(f"<div style='font-family: Arial, sans-serif; line-height: 1.6;'>{highlighted_text}</div>"))

# Input text and results
visualize_ner_results_merged(example, ner_results)


### See logits (probabilities)

In [62]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load tokenizer and fine-tuned model
checkpoint_path = "/content/test-ner/checkpoint-2634"
model_fine_tuned = AutoModelForTokenClassification.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# Tokenize input
inputs = tokenizer(example, return_tensors="pt", truncation=True, is_split_into_words=False)
with torch.no_grad():
    outputs = model_fine_tuned(**inputs)

# Extract logits
logits = outputs.logits
# Apply softmax to get probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Get the tokenized input IDs
input_ids = inputs["input_ids"].squeeze()
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Get the labels (from the model's config)
label_map = model_fine_tuned.config.id2label

# Process each token's probabilities
results = []
for idx, token_probs in enumerate(probabilities.squeeze()):
    token = tokens[idx]
    token_probs_np = token_probs.cpu().numpy()
    token_probs_dict = {label_map[i]: token_probs_np[i] for i in range(len(token_probs_np))}

    results.append({
        "token": token,
        "probabilities": token_probs_dict
    })

# Print results
for result in results:
    print(f"\nToken: {result['token']}")
    for label, prob in result["probabilities"].items():
        print(f"  {label:_<8}: {prob:.4f}")



Token: [CLS]
  O_______: 0.9997
  B-PER___: 0.0001
  I-PER___: 0.0000
  B-ORG___: 0.0000
  I-ORG___: 0.0000
  B-LOC___: 0.0000
  I-LOC___: 0.0000
  B-MISC__: 0.0001
  I-MISC__: 0.0000

Token: michael
  O_______: 0.0003
  B-PER___: 0.9985
  I-PER___: 0.0002
  B-ORG___: 0.0002
  I-ORG___: 0.0001
  B-LOC___: 0.0002
  I-LOC___: 0.0001
  B-MISC__: 0.0003
  I-MISC__: 0.0001

Token: jordan
  O_______: 0.0004
  B-PER___: 0.0006
  I-PER___: 0.9971
  B-ORG___: 0.0002
  I-ORG___: 0.0005
  B-LOC___: 0.0003
  I-LOC___: 0.0003
  B-MISC__: 0.0003
  I-MISC__: 0.0003

Token: is
  O_______: 0.9997
  B-PER___: 0.0001
  I-PER___: 0.0000
  B-ORG___: 0.0000
  I-ORG___: 0.0000
  B-LOC___: 0.0000
  I-LOC___: 0.0000
  B-MISC__: 0.0000
  I-MISC__: 0.0000

Token: a
  O_______: 0.9998
  B-PER___: 0.0000
  I-PER___: 0.0000
  B-ORG___: 0.0000
  I-ORG___: 0.0000
  B-LOC___: 0.0000
  I-LOC___: 0.0000
  B-MISC__: 0.0000
  I-MISC__: 0.0000

Token: player
  O_______: 0.9997
  B-PER___: 0.0001
  I-PER___: 0.0000
  B-ORG

### Evaluate on the test dataset

In [59]:
from transformers import Trainer, TrainingArguments, AutoModelForTokenClassification
from sklearn.metrics import classification_report
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
import torch

# Load dataset
conll2003 = load_dataset("conll2003", trust_remote_code=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# Tokenize the test dataset
def tokenize_function(examples):
    return tokenizer(examples['tokens'], truncation=True, padding='max_length', is_split_into_words=True)

# Apply tokenization to the test split
tokenized_test = conll2003['test'].map(tokenize_function, batched=True)

# Format the tokenized dataset to have input IDs and labels
def format_examples(examples):
    labels = examples['ner_tags']
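    # Right-pad the word-level tags with 0 up to model_max_length.
    # Note: 0 is also the id of the 'O' tag, and the tags are not aligned
    # with the sub-tokens or with the [CLS]/[SEP] positions here.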
    labels = [label + [0] * (tokenizer.model_max_length - len(label)) for label in labels]
    return {"labels": labels}

tokenized_test = tokenized_test.map(format_examples, batched=True)

# Load the fine-tuned model
model_fine_tuned = AutoModelForTokenClassification.from_pretrained(checkpoint_path)

# Define the metric function to evaluate performance
def compute_metrics(p):
    predictions, labels = p
    # Convert logits to predicted labels
    predictions = np.argmax(predictions, axis=-1)

    # Flatten arrays
    true_labels = labels.flatten()
    pred_labels = predictions.flatten()

    # Drop positions labelled 0. This removes the padding, but also every
    # true 'O' token, because padding and 'O' share the id 0.
    mask = true_labels != 0
    true_labels = true_labels[mask]
    pred_labels = pred_labels[mask]

    # Return classification report metrics
    return classification_report(true_labels, pred_labels, output_dict=True)

# Define training arguments for evaluation
evaluation_args = TrainingArguments(
    per_device_eval_batch_size=8,
    output_dir="./results",
    do_train=False,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_dir="./logs",
)

# Initialize the Trainer
trainer = Trainer(
    model=model_fine_tuned,
    args=evaluation_args,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print("Evaluation Results:", eval_results)


Evaluation Results: {'eval_loss': 0.670944094657898, 'eval_model_preparation_time': 0.0056, 'eval_0': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 0.0}, 'eval_1': {'precision': 0.14305949008498584, 'recall': 0.06246134817563389, 'f1-score': 0.08695652173913043, 'support': 1617.0}, 'eval_2': {'precision': 0.23260437375745527, 'recall': 0.10121107266435986, 'f1-score': 0.1410488245931284, 'support': 1156.0}, 'eval_3': {'precision': 0.11187607573149742, 'recall': 0.03913305237808549, 'f1-score': 0.05798394290811775, 'support': 1661.0}, 'eval_4': {'precision': 0.635, 'recall': 0.15209580838323353, 'f1-score': 0.24541062801932367, 'support': 835.0}, 'eval_5': {'precision': 0.22950819672131148, 'recall': 0.04196642685851319, 'f1-score': 0.07095793208312215, 'support': 1668.0}, 'eval_6': {'precision': 0.42857142857142855, 'recall': 0.058365758754863814, 'f1-score': 0.10273972602739725, 'support': 257.0}, 'eval_7': {'precision': 0.07453416149068323, 'recall': 0.017094017094017
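
These per-class scores are far below the validation F1 of about 0.94 reported during training. One likely contributor, sketched below in plain Python, is that `format_examples` pads the word-level `ner_tags` with 0 instead of aligning them with the tokens, and `compute_metrics` then treats 0 (the id of `O`) as padding. The tokens, `word_ids` and tags are those of the first training sample shown earlier; for brevity the sketch pads to the sentence length rather than to `model_max_length`, and the variable names are illustrative.

```python
tokens = ["[CLS]", "eu", "rejects", "german", "call", "to",
          "boycott", "british", "lamb", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# What format_examples effectively does: right-pad the word-level tags with 0
naive = ner_tags + [0] * (len(tokens) - len(ner_tags))

# Token-aligned labels: -100 for special tokens, the word's tag otherwise
aligned = [-100 if w is None else ner_tags[w] for w in word_ids]

# Tokens that survive the `true_labels != 0` mask under the naive scheme
kept = [tokens[i] for i, label in enumerate(naive) if label != 0]
# Tokens that should be scored
scored = [tokens[i] for i, label in enumerate(aligned) if label != -100]

print(naive)    # [3, 0, 7, 0, 0, 0, 7, 0, 0, 0, 0]
print(aligned)  # [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]
print(kept)     # ['[CLS]', 'rejects', 'boycott']
```

In this sample the naive labels are shifted by one position because of `[CLS]`, so the entity tags land on `[CLS]`, `rejects` and `boycott` instead of `eu`, `german` and `british`. Reusing `tokenize_and_align_labels` and the seqeval-based `compute_metrics` from the training section on the test split would probably avoid both issues.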