# **Comparative Analysis of Bi-LSTM + CRF and BERT for Named Entity Recognition**


---



## Method two: BERT for Named Entity Recognition

Steps

1. Prepare training data and map labels  
2. Load the pretrained BERT model (with a token classification head for NER) and its tokenizer
3. Define training arguments and trainer
4. Fine-tune model on training data
5. Evaluate on test data

The trained model can extract named entities from text by encoding the text and applying the model's token classification head.




# Importing packages and libraries

In [1]:
import random
import time

import evaluate
import nltk
import numpy as np
import pandas as pd
from datasets import load_dataset
from memory_profiler import memory_usage
from nltk.corpus import wordnet
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Ensure nltk resources are downloaded
nltk.download("wordnet")
nltk.download("omw-1.4")


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Loading the data

In [2]:
# Loading the CoNLL-2003 dataset
dataset = load_dataset('conll2003')
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

# Understanding the structure of the dataset

In [3]:
# Printing the features of the training dataset.
print(dataset['train'].features)

{'id': Value(dtype='string', id=None), 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None), 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None), 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}


In [4]:
# Accessing the label names from the 'ner_tags' feature.
label_names = dataset['train'].features['ner_tags'].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [5]:
dataset['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

# Data Preprocessing

A pretrained BERT tokenizer is used to tokenize the text, ensuring that the input is compatible with the BERT model. Special care is taken to align the tokenized words with their respective labels, as BERT tokenizes text into subword units, which may cause a mismatch between the original words and their labels.

To address this issue:

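- Labels are realigned to the tokenized sequence: every subword token inherits the label of the word it came from, and special tokens such as [CLS] and [SEP] are labelled -100 so that the loss function ignores them.
- A data collator pads the tokenized examples to a common length within each batch, padding the labels with -100 as well.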

This preprocessing step prepares the data for training and evaluation, maintaining consistency between the tokenized inputs and their corresponding labels.
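
To make the role of the -100 label concrete, here is a minimal pure-Python sketch of a cross-entropy loss that skips ignored positions. It mirrors the default `ignore_index=-100` behaviour of PyTorch's `CrossEntropyLoss`; the function name `masked_cross_entropy` and the toy logits are illustrative assumptions, not part of this notebook.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and padding contribute nothing
        # Numerically stable log-sum-exp over the class scores
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy sequence: [CLS], one real token, [SEP], with three classes (hypothetical values)
logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
labels = [-100, 1, -100]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Only the middle token contributes to the loss, so changing the scores at the two ignored positions leaves the result unchanged.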

In [6]:
# Define the checkpoint to use for the tokenizer.
model_name = 'bert-base-cased'

# Creating a tokenizer instance by loading the pre-trained checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [7]:
token = tokenizer(dataset['train'][0]['tokens'], is_split_into_words = True)

In [8]:
def align_target(labels, word_ids):
    """
    Aligns target labels with tokenized word IDs, ensuring subword tokens
    and special tokens receive appropriate labels.

    Args:
        labels (list): Original list of labels corresponding to words.
        word_ids (list): List of word IDs from tokenization, where `None`
                         represents special tokens.

    Returns:
        list: Aligned list of labels for tokenized word IDs.
    """
    # List of aligned labels, one per token
    aligned_labels = []

    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
            aligned_labels.append(-100)
        else:
            # Every subword token, first or continuation, takes its word's label
            aligned_labels.append(labels[word_id])

    return aligned_labels

In [9]:
# Extracting labels and word_ids
labels = dataset['train'][0]['ner_tags']
word_ids = token.word_ids()

# Align the word-level labels with the tokenized word IDs
aligned_target = align_target(labels, word_ids)

# Print tokenized tokens, original labels, and aligned labels
print(token.tokens(), '\n--------------------------------------------------------------------------------------\n',
      labels, '\n--------------------------------------------------------------------------------------\n',
      aligned_target)

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]'] 
--------------------------------------------------------------------------------------
 [3, 0, 7, 0, 0, 0, 7, 0, 0] 
--------------------------------------------------------------------------------------
 [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


In [10]:
# Creating a list of aligned labels using label names
aligned_labels = [label_names[t] if t >= 0 else None for t in aligned_target]

# Loop through tokens and aligned labels and print them
for x, y in zip(token.tokens(), aligned_labels):
    print(f"{x}\t{y}")

[CLS]	None
EU	B-ORG
rejects	O
German	B-MISC
call	O
to	O
boycott	O
British	B-MISC
la	O
##mb	O
.	O
[SEP]	None


In [11]:
# Toy input to check how labels propagate to subword tokens

words = ['[CLS]', 'Ger', '##man', 'call', 'to', 'Micro', '##so', '##ft', '[SEP]']
word_ids = [None, 0, 0, 1, 2, 3, 3, 3, None]
labels = [7, 0, 0, 3, 4]

aligned_target = align_target(labels, word_ids)
aligned_labels = [label_names[t] if t >= 0 else None for t in aligned_target]

for x, y in zip(words, aligned_labels):
    print(f"{x}\t{y}")

[CLS]	None
Ger	B-MISC
##man	B-MISC
call	O
to	O
Micro	B-ORG
##so	B-ORG
##ft	B-ORG
[SEP]	None
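
In the output above, continuation subwords such as `##so` and `##ft` repeat the `B-ORG` tag of the first subword. A common alternative, shown here only as a sketch and not as what `align_target` does, is to switch continuation subwords from `B-X` to `I-X`. The function name `align_target_bio` is hypothetical, and the code assumes the CoNLL-2003 label order printed earlier, where each `B-X` id is odd and the matching `I-X` id is the next integer.

```python
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def align_target_bio(labels, word_ids):
    """Align labels to subword tokens, turning B-X into I-X on continuation subwords."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens are ignored by the loss
            aligned.append(-100)
        else:
            label = labels[word_id]
            # Odd ids are B-X tags in this label list; the matching I-X is label + 1
            if word_id == previous_word and label % 2 == 1:
                label += 1
            aligned.append(label)
        previous_word = word_id
    return aligned


# Same toy input as in the cell above
word_ids = [None, 0, 0, 1, 2, 3, 3, 3, None]
labels = [7, 0, 0, 3, 4]
aligned = align_target_bio(labels, word_ids)
tags = [LABEL_NAMES[t] if t >= 0 else None for t in aligned]
print(tags)
```

With this variant each entity has a single `B-` tag, which matches how entity-level metrics such as seqeval count entity starts. Whether it changes the final scores for this model is not something the notebook measures.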


In [12]:
def tokenize_fn(batch):
    """
    Tokenizes a batch of inputs and aligns the labels with the tokenized outputs.

    Args:
        batch (dict): A batch containing:
            - 'tokens' (list of str): The input text tokens.
            - 'ner_tags' (list of int): Corresponding labels for the tokens.

    Returns:
        dict: A dictionary with tokenized inputs and aligned labels under the "labels" key.
    """
    # Tokenize the input tokens with truncation and split into words
    tokenized_inputs = tokenizer(
        batch['tokens'], truncation=True, is_split_into_words=True
    )

    # Extract labels from the batch
    labels_batch = batch['ner_tags']

    # List to store aligned labels for each example
    aligned_targets_batch = []

    for i, labels in enumerate(labels_batch):
        # Get word IDs for the current example
        word_ids = tokenized_inputs.word_ids(i)

        # Align labels with tokenized word IDs
        aligned_targets_batch.append(align_target(labels, word_ids))

    # Add aligned labels to the tokenized inputs
    tokenized_inputs["labels"] = aligned_targets_batch

    # Return the tokenized inputs with labels
    return tokenized_inputs


In [13]:
tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset['train'].column_names)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [14]:
# Create a DataCollatorForTokenClassification object
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Collate the first two training examples and display the padded batch
batch = data_collator([tokenized_dataset['train'][i] for i in range(2)])
batch

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[  101,  7270, 22961,  1528,  1840,  1106, 21423,  1418,  2495, 12913,
           119,   102],
        [  101,  1943, 14428,   102,     0,     0,     0,     0,     0,     0,
             0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])}
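
The batch above shows what the collator does: the shorter example is padded with token id 0, attention mask 0, and label -100 up to the length of the longest example. The following pure-Python sketch reproduces that padding for the same two examples. The function `pad_batch` is a hypothetical stand-in written for illustration, not the library's implementation, and it omits `token_type_ids` and tensor conversion.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens are attended to (1); padding is masked out (0)
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded positions get -100 so they do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# The two training examples shown in the collator output above
features = [
    {"input_ids": [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102],
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]},
    {"input_ids": [101, 1943, 14428, 102],
     "labels": [-100, 1, 2, -100]},
]
batch = pad_batch(features)
print(batch["labels"][1])
```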

In [15]:
# Loading the seqeval metric which can evaluate NER and other sequence tasks
metric = evaluate.load("seqeval")

# seqeval expects a list of tag sequences, one inner list per sentence
metric.compute(predictions = [['O' , 'B-ORG' , 'I-ORG']],
               references = [['O' , 'B-MISC' , 'I-ORG']])



{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.6666666666666666}
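
The result above can look surprising: token accuracy is 0.67, yet precision, recall, and F1 are all 0. The reason is that seqeval scores whole entity spans, and a span only counts when both its type and its boundaries match. The sketch below is a simplified re-implementation of that idea for illustration, not seqeval's own code; `extract_spans` and `entity_scores` are hypothetical names.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any span still open at the end
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # A new span starts on B-X, or on I-X whose type differs from the open span
        starts_new = prefix == "B" or (prefix == "I" and label != ent_type)
        if ent_type is not None and (prefix == "O" or starts_new):
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if starts_new:
            start, ent_type = i, label
    return spans


def entity_scores(predictions, references):
    """Entity-level precision, recall, and F1 based on exact span matches."""
    pred_spans = extract_spans(predictions)
    ref_spans = extract_spans(references)
    correct = len(pred_spans & ref_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


predictions = ['O', 'B-ORG', 'I-ORG']
references = ['O', 'B-MISC', 'I-ORG']
token_accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(extract_spans(predictions), extract_spans(references))
print(entity_scores(predictions, references), token_accuracy)
```

The prediction contains one two-token ORG span, while the reference contains a one-token MISC span followed by a one-token ORG span. No span matches exactly, so the entity-level scores are 0 even though two of the three tokens carry the right tag.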

In [16]:
def compute_metrics(logits_and_labels):
    """
    Computes evaluation metrics (precision, recall, F1-score, accuracy) for model predictions.

    Args:
        logits_and_labels (tuple): A tuple containing:
            - logits (ndarray): Model output logits for each token.
            - labels (list of lists): True labels for the tokens.

    Returns:
        dict: A dictionary containing precision, recall, F1-score, and accuracy.
    """
    # Unpack logits and labels
    logits, labels = logits_and_labels

    # Convert logits to predicted labels by taking the argmax
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (-100) from labels and map to label names
    str_labels = [
        [label_names[t] for t in label if t != -100]
        for label in labels
    ]

    # Map predictions to label names while ignoring special token indices
    str_preds = [
        [label_names[p] for (p, t) in zip(prediction, label) if t != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Compute metrics using the evaluation metric object
    results = metric.compute(predictions=str_preds, references=str_labels)

    # Extract and return key metrics
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"]
    }


In [17]:
# Create mapping from label ID to label string name
id2label = {k: v for k, v in enumerate(label_names)}

# Create reverse mapping from label name to label ID
label2id = {v: k for k, v in enumerate(label_names)}

print(id2label , '\n--------------------\n' , label2id)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'} 
--------------------
 {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


In [18]:
# Initialize model object with pretrained weights
model = AutoModelForTokenClassification.from_pretrained(
  model_name,

  # Pass in label mappings
  id2label=id2label,
  label2id=label2id
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were n

### Training model
The fine-tuning process utilized the pre-trained bert-base-cased BERT model with the following setup:

**Batch Size:** Set to 16 to strike a balance between memory usage and training efficiency.

**Optimizer:** Used the AdamW optimizer with:
- **Learning Rate:** 2e-5.
- **Weight Decay:** 0.01 for regularization.

**Epochs:** Trained for 10 epochs, with evaluation conducted at the end of each epoch to monitor performance on the test set.
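
As a quick sanity check on this setup, the number of optimization steps and the learning rate at a given step can be computed by hand. The sketch assumes the `Trainer` defaults of a linear decay schedule with no warmup, which this notebook does not set explicitly; the helper name `linear_lr` is hypothetical. The results agree with the training log further down, which reports 8780 optimization steps and a learning rate of about 1.886e-05 at step 500.

```python
import math

num_examples = 14041   # size of the CoNLL-2003 training split
batch_size = 16
num_epochs = 10
base_lr = 2e-5

# The last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch, total_steps)
print(linear_lr(500), linear_lr(1000))
```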



In [19]:
# Configure training arguments using the TrainingArguments class
training_args = TrainingArguments(
  output_dir = "fine_tuned_model",

  # Evaluate each epoch
  evaluation_strategy = "epoch",

  # Learning rate for the AdamW optimizer
  learning_rate = 2e-5,

  # Batch sizes for training and evaluation
  per_device_train_batch_size = 16,
  per_device_eval_batch_size = 16,

  # Number of training epochs
  num_train_epochs = 10,

  # L2 weight decay regularization
  weight_decay = 0.01
)

In [20]:
# Initialize Trainer object for model training
trainer = Trainer(
  model=model,

  # Training arguments
  args=training_args,

  # Training and evaluation datasets (the test split is evaluated after each epoch)
  train_dataset=tokenized_dataset["train"],
  eval_dataset=tokenized_dataset["test"],

  # Tokenizer
  tokenizer=tokenizer,

  # Custom metric function
  compute_metrics=compute_metrics,

  # Data collator
  data_collator=data_collator
)

In [21]:
start_time = time.time()
trainer.train()
end_time = time.time()

print(f"Training time: {end_time - start_time} seconds")

***** Running training *****
  Num examples = 14041
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 8780
  Number of trainable parameters = 107726601


  0%|          | 0/8780 [00:00<?, ?it/s]

Saving model checkpoint to fine_tuned_model\checkpoint-500
Configuration saved in fine_tuned_model\checkpoint-500\config.json


{'loss': 0.235, 'learning_rate': 1.886104783599089e-05, 'epoch': 0.57}


Model weights saved in fine_tuned_model\checkpoint-500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.14176404476165771, 'eval_precision': 0.885834109972041, 'eval_recall': 0.8818072177381946, 'eval_f1': 0.883816076991027, 'eval_accuracy': 0.9682324577299444, 'eval_runtime': 10.7236, 'eval_samples_per_second': 322.001, 'eval_steps_per_second': 20.143, 'epoch': 1.0}


Saving model checkpoint to fine_tuned_model\checkpoint-1000
Configuration saved in fine_tuned_model\checkpoint-1000\config.json


{'loss': 0.0811, 'learning_rate': 1.7722095671981778e-05, 'epoch': 1.14}


Model weights saved in fine_tuned_model\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-1000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-1000\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-1500
Configuration saved in fine_tuned_model\checkpoint-1500\config.json


{'loss': 0.0531, 'learning_rate': 1.6583143507972667e-05, 'epoch': 1.71}


Model weights saved in fine_tuned_model\checkpoint-1500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-1500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-1500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.17270991206169128, 'eval_precision': 0.8797263233705438, 'eval_recall': 0.90657760460154, 'eval_f1': 0.8929501530588934, 'eval_accuracy': 0.9690991317502088, 'eval_runtime': 9.1111, 'eval_samples_per_second': 378.988, 'eval_steps_per_second': 23.707, 'epoch': 2.0}


Saving model checkpoint to fine_tuned_model\checkpoint-2000
Configuration saved in fine_tuned_model\checkpoint-2000\config.json


{'loss': 0.035, 'learning_rate': 1.5444191343963555e-05, 'epoch': 2.28}


Model weights saved in fine_tuned_model\checkpoint-2000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-2000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-2000\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-2500
Configuration saved in fine_tuned_model\checkpoint-2500\config.json


{'loss': 0.0261, 'learning_rate': 1.4305239179954442e-05, 'epoch': 2.85}


Model weights saved in fine_tuned_model\checkpoint-2500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-2500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-2500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.19291624426841736, 'eval_precision': 0.8862885462555066, 'eval_recall': 0.8959087113832452, 'eval_f1': 0.8910726643598617, 'eval_accuracy': 0.9684215502434567, 'eval_runtime': 8.8889, 'eval_samples_per_second': 388.463, 'eval_steps_per_second': 24.3, 'epoch': 3.0}


Saving model checkpoint to fine_tuned_model\checkpoint-3000
Configuration saved in fine_tuned_model\checkpoint-3000\config.json


{'loss': 0.0209, 'learning_rate': 1.3166287015945332e-05, 'epoch': 3.42}


Model weights saved in fine_tuned_model\checkpoint-3000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-3000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-3000\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-3500
Configuration saved in fine_tuned_model\checkpoint-3500\config.json


{'loss': 0.0176, 'learning_rate': 1.2027334851936218e-05, 'epoch': 3.99}


Model weights saved in fine_tuned_model\checkpoint-3500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-3500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-3500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.19878941774368286, 'eval_precision': 0.8831521739130435, 'eval_recall': 0.904536598942388, 'eval_f1': 0.8937164856317888, 'eval_accuracy': 0.9689573123650747, 'eval_runtime': 8.992, 'eval_samples_per_second': 384.008, 'eval_steps_per_second': 24.021, 'epoch': 4.0}


Saving model checkpoint to fine_tuned_model\checkpoint-4000
Configuration saved in fine_tuned_model\checkpoint-4000\config.json


{'loss': 0.0098, 'learning_rate': 1.0888382687927108e-05, 'epoch': 4.56}


Model weights saved in fine_tuned_model\checkpoint-4000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-4000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-4000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.20481108129024506, 'eval_precision': 0.888242247885787, 'eval_recall': 0.9062065126635124, 'eval_f1': 0.8971344599559148, 'eval_accuracy': 0.9706433872772254, 'eval_runtime': 8.9958, 'eval_samples_per_second': 383.845, 'eval_steps_per_second': 24.011, 'epoch': 5.0}


Saving model checkpoint to fine_tuned_model\checkpoint-4500
Configuration saved in fine_tuned_model\checkpoint-4500\config.json


{'loss': 0.0102, 'learning_rate': 9.749430523917997e-06, 'epoch': 5.13}


Model weights saved in fine_tuned_model\checkpoint-4500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-4500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-4500\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-5000
Configuration saved in fine_tuned_model\checkpoint-5000\config.json


{'loss': 0.0062, 'learning_rate': 8.610478359908885e-06, 'epoch': 5.69}


Model weights saved in fine_tuned_model\checkpoint-5000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-5000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-5000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.22591976821422577, 'eval_precision': 0.8874411871154542, 'eval_recall': 0.9099174320437888, 'eval_f1': 0.8985387751362741, 'eval_accuracy': 0.9709427837569531, 'eval_runtime': 9.0417, 'eval_samples_per_second': 381.896, 'eval_steps_per_second': 23.889, 'epoch': 6.0}


Saving model checkpoint to fine_tuned_model\checkpoint-5500
Configuration saved in fine_tuned_model\checkpoint-5500\config.json


{'loss': 0.0053, 'learning_rate': 7.471526195899773e-06, 'epoch': 6.26}


Model weights saved in fine_tuned_model\checkpoint-5500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-5500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-5500\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-6000
Configuration saved in fine_tuned_model\checkpoint-6000\config.json


{'loss': 0.0041, 'learning_rate': 6.3325740318906616e-06, 'epoch': 6.83}


Model weights saved in fine_tuned_model\checkpoint-6000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-6000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-6000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.23265650868415833, 'eval_precision': 0.8903261067589725, 'eval_recall': 0.9067631505705539, 'eval_f1': 0.8984694581054374, 'eval_accuracy': 0.9703282330880383, 'eval_runtime': 9.1629, 'eval_samples_per_second': 376.847, 'eval_steps_per_second': 23.573, 'epoch': 7.0}


Saving model checkpoint to fine_tuned_model\checkpoint-6500
Configuration saved in fine_tuned_model\checkpoint-6500\config.json


{'loss': 0.0028, 'learning_rate': 5.19362186788155e-06, 'epoch': 7.4}


Model weights saved in fine_tuned_model\checkpoint-6500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-6500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-6500\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-7000
Configuration saved in fine_tuned_model\checkpoint-7000\config.json


{'loss': 0.0022, 'learning_rate': 4.054669703872437e-06, 'epoch': 7.97}


Model weights saved in fine_tuned_model\checkpoint-7000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-7000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-7000\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.24634404480457306, 'eval_precision': 0.8907868435398874, 'eval_recall': 0.9095463401057612, 'eval_f1': 0.9000688547165481, 'eval_accuracy': 0.9707694489529002, 'eval_runtime': 9.1047, 'eval_samples_per_second': 379.256, 'eval_steps_per_second': 23.724, 'epoch': 8.0}


Saving model checkpoint to fine_tuned_model\checkpoint-7500
Configuration saved in fine_tuned_model\checkpoint-7500\config.json


{'loss': 0.0013, 'learning_rate': 2.9157175398633257e-06, 'epoch': 8.54}


Model weights saved in fine_tuned_model\checkpoint-7500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-7500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-7500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.2611589729785919, 'eval_precision': 0.8873379860418744, 'eval_recall': 0.9082475183226645, 'eval_f1': 0.8976710067852558, 'eval_accuracy': 0.9699815634799326, 'eval_runtime': 9.4246, 'eval_samples_per_second': 366.383, 'eval_steps_per_second': 22.919, 'epoch': 9.0}


Saving model checkpoint to fine_tuned_model\checkpoint-8000
Configuration saved in fine_tuned_model\checkpoint-8000\config.json


{'loss': 0.0013, 'learning_rate': 1.7767653758542143e-06, 'epoch': 9.11}


Model weights saved in fine_tuned_model\checkpoint-8000\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-8000\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-8000\special_tokens_map.json
Saving model checkpoint to fine_tuned_model\checkpoint-8500
Configuration saved in fine_tuned_model\checkpoint-8500\config.json


{'loss': 0.0012, 'learning_rate': 6.378132118451026e-07, 'epoch': 9.68}


Model weights saved in fine_tuned_model\checkpoint-8500\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\checkpoint-8500\tokenizer_config.json
Special tokens file saved in fine_tuned_model\checkpoint-8500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3453
  Batch size = 16


  0%|          | 0/216 [00:00<?, ?it/s]



Training completed. Do not forget to share your model on huggingface.co/models =)




{'eval_loss': 0.26006409525871277, 'eval_precision': 0.887920863962247, 'eval_recall': 0.907690880415623, 'eval_f1': 0.89769703642536, 'eval_accuracy': 0.9700130788988512, 'eval_runtime': 9.5399, 'eval_samples_per_second': 361.952, 'eval_steps_per_second': 22.642, 'epoch': 10.0}
{'train_runtime': 1399.6152, 'train_samples_per_second': 100.32, 'train_steps_per_second': 6.273, 'train_loss': 0.029273570748152115, 'epoch': 10.0}
Training time: 1400.7874233722687 seconds
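
The per-epoch evaluation lines in the log above can be summarized with a few lines of Python. The values below are copied from that log and rounded to four decimals; the variable names are hypothetical. The sketch only reports which epoch had the best F1 and which had the lowest evaluation loss.

```python
# (epoch, eval_loss, eval_f1), rounded from the training log above
eval_history = [
    (1, 0.1418, 0.8838),
    (2, 0.1727, 0.8930),
    (3, 0.1929, 0.8911),
    (4, 0.1988, 0.8937),
    (5, 0.2048, 0.8971),
    (6, 0.2259, 0.8985),
    (7, 0.2327, 0.8985),
    (8, 0.2463, 0.9001),
    (9, 0.2612, 0.8977),
    (10, 0.2601, 0.8977),
]

best_f1_epoch = max(eval_history, key=lambda row: row[2])[0]
lowest_loss_epoch = min(eval_history, key=lambda row: row[1])[0]
f1_gain = eval_history[-1][2] - eval_history[0][2]

print(f"best F1 at epoch {best_f1_epoch}, lowest eval loss at epoch {lowest_loss_epoch}")
print(f"F1 change from epoch 1 to epoch 10: {f1_gain:+.4f}")
```

Evaluation loss is lowest after the first epoch and rises almost steadily afterwards, while F1 improves by roughly 0.014 and peaks at epoch 8. That pattern is consistent with the model becoming overconfident on the training data, so a shorter run or early stopping may give comparable F1.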


In [22]:
trainer.save_model('fine_tuned_model')

Saving model checkpoint to fine_tuned_model
Configuration saved in fine_tuned_model\config.json
Model weights saved in fine_tuned_model\pytorch_model.bin
tokenizer config file saved in fine_tuned_model\tokenizer_config.json
Special tokens file saved in fine_tuned_model\special_tokens_map.json


### Loading the fine-tuned model

In [23]:
# Load the configuration and model
config = AutoConfig.from_pretrained('fine_tuned_model', local_files_only=True)
model = AutoModelForTokenClassification.from_pretrained('fine_tuned_model', config=config, local_files_only=True)

# Print model parameters and their shapes
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Print total number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params}")

loading configuration file fine_tuned_model\config.json
Model config BertConfig {
  "_name_or_path": "fine_tuned_model",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transf

bert.embeddings.word_embeddings.weight: torch.Size([28996, 768])
bert.embeddings.position_embeddings.weight: torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight: torch.Size([2, 768])
bert.embeddings.LayerNorm.weight: torch.Size([768])
bert.embeddings.LayerNorm.bias: torch.Size([768])
bert.encoder.layer.0.attention.self.query.weight: torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias: torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight: torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias: torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight: torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias: torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight: torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias: torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight: torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.bias: torch.Size([768

In [24]:
ner = pipeline(
    'token-classification',
    model = 'fine_tuned_model',
    aggregation_strategy = 'simple' ,
    device = 0
)

loading configuration file fine_tuned_model\config.json
Model config BertConfig {
  "_name_or_path": "fine_tuned_model",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transf

In [25]:
ner('EU rejects German call to boycott British lamb .')

[{'entity_group': 'ORG',
  'score': 0.99988127,
  'word': 'EU',
  'start': 0,
  'end': 2},
 {'entity_group': 'MISC',
  'score': 0.9998983,
  'word': 'German',
  'start': 11,
  'end': 17},
 {'entity_group': 'MISC',
  'score': 0.99990785,
  'word': 'British',
  'start': 34,
  'end': 41}]
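
The `start` and `end` fields in the output above are character offsets into the input string, so the surface text of each entity can be recovered by slicing. This is useful because `word` is decoded from tokens and may not match the raw text exactly. A minimal sketch, with the entities copied from the output above and `spans_from_offsets` as a hypothetical helper name:

```python
text = "EU rejects German call to boycott British lamb ."

# Entities copied from the pipeline output above (scores omitted for brevity)
entities = [
    {"entity_group": "ORG", "word": "EU", "start": 0, "end": 2},
    {"entity_group": "MISC", "word": "German", "start": 11, "end": 17},
    {"entity_group": "MISC", "word": "British", "start": 34, "end": 41},
]

def spans_from_offsets(text, entities):
    """Return (surface string, label) pairs recovered from character offsets."""
    return [(text[e["start"]:e["end"]], e["entity_group"]) for e in entities]

print(spans_from_offsets(text, entities))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```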

In [26]:
# Clean demonstration sentences (CoNLL-style) used as inputs for the adversarial tests
original_sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Apple Inc. is based in Cupertino, California.",
    "The Eiffel Tower is located in Paris.",
    "Cristiano Ronaldo scored a goal for Portugal.",
    "Tesla manufactures electric cars in California.",
    "Google was founded by Larry Page and Sergey Brin.",
    "Microsoft announced a new product yesterday.",
    "Amazon's headquarters are in Seattle, Washington.",
    "The Prime Minister met with the President of France.",
    "The Grand Canyon is a major tourist attraction in Arizona."
]

In [27]:
# Define a function for synonym replacement as an adversarial transformation
def synonym_replacement(sentence):
    """
    Replaces a random word in the sentence with a WordNet synonym, if one exists.

    Args:
        sentence (str): The input sentence to be transformed.

    Returns:
        str: The sentence with at most one word replaced; unchanged if the
            chosen word has no usable synonym.
    """
    # Split the sentence into whitespace-separated words
    words = sentence.split()

    # Randomly select the position of the word to replace
    # (an index avoids hitting the wrong occurrence of a repeated word)
    idx = random.randrange(len(words))
    word_to_replace = words[idx]

    # Look up the WordNet synsets for the selected word
    synsets = wordnet.synsets(word_to_replace)

    # If any synset exists, use the first lemma of the first synset
    if synsets:
        # Multi-word lemmas use underscores, so convert them to spaces
        synonym = synsets[0].lemmas()[0].name().replace("_", " ")
        # Case-sensitive check: "Apple" can still be replaced by "apple"
        if synonym != word_to_replace:
            words[idx] = synonym

    # Return the (possibly unchanged) sentence
    return " ".join(words)

In [28]:
# Define a function to introduce a random typo in a sentence
def introduce_typo(sentence):
    """
    Introduces a random typo in a randomly selected word in the sentence.

    Args:
        sentence (str): The input sentence to be transformed.

    Returns:
        str: The sentence with one character of one word replaced; unchanged
            if the chosen word has a single character.
    """
    # Split the sentence into whitespace-separated words
    words = sentence.split()

    # Randomly select the position of the word to modify
    # (an index avoids hitting the wrong occurrence of a repeated word)
    idx = random.randrange(len(words))
    word_to_typo = words[idx]

    # If the word is longer than one character, introduce a typo
    if len(word_to_typo) > 1:
        typo = list(word_to_typo)
        # Replace a random character with a random lowercase letter
        # (the draw may repeat the original character, leaving the word intact)
        typo[random.randint(0, len(typo) - 1)] = random.choice("abcdefghijklmnopqrstuvwxyz")
        # Update the word in the sentence
        words[idx] = "".join(typo)

    # Return the modified sentence
    return " ".join(words)
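
The transformations above draw from the global `random` state, so each run yields different adversarial sentences. A sketch of a reproducible variant follows, assuming the caller passes an explicit `random.Random` instance; `typo_with_rng` is a hypothetical name, and unlike `introduce_typo` it only alters letters and always picks a different replacement character.

```python
import random
import string

def typo_with_rng(sentence, rng):
    """Replace one letter of one randomly chosen word, using the supplied RNG."""
    words = sentence.split()
    # Only consider words that contain at least one letter
    candidates = [i for i, w in enumerate(words) if any(c.isalpha() for c in w)]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    chars = list(words[i])
    # Pick a letter position, skipping punctuation such as a trailing period
    pos = rng.choice([j for j, c in enumerate(chars) if c.isalpha()])
    # Draw a replacement that differs from the original letter
    alternatives = [c for c in string.ascii_lowercase if c != chars[pos].lower()]
    chars[pos] = rng.choice(alternatives)
    words[i] = "".join(chars)
    return " ".join(words)

original = "Tesla manufactures electric cars in California."
first = typo_with_rng(original, random.Random(42))
second = typo_with_rng(original, random.Random(42))
print(first == second)  # True: the same seed gives the same typo
```

Passing the generator explicitly keeps the seed local to the experiment instead of reseeding the global state.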

In [29]:
# Define a function to swap named entities in a sentence
def entity_swapping(sentence):
    """
    Replaces the first predefined entity found in the sentence with its paired substitute.

    Args:
        sentence (str): The input sentence containing named entities.

    Returns:
        str: The sentence with one entity swapped, or unchanged if none matches.
    """
    # Original entities and, at the same index, their replacements
    entities = ["Barack Obama", "Honolulu", "Apple Inc.", "Cupertino", "Paris",
                "Cristiano Ronaldo", "Tesla", "Google", "Microsoft", "Seattle"]
    replacements = ["Joe Biden", "Kailua", "Google", "Mountain View", "London",
                    "Lionel Messi", "Lucid", "Facebook", "Amazon", "Portland"]

    # Check the entities in list order
    for original, replacement in zip(entities, replacements):
        # Swap only the first entity that matches, then stop
        if original in sentence:
            return sentence.replace(original, replacement)

    # Return the sentence unchanged if no entity is found
    return sentence
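
Because `entity_swapping` returns after the first match, a sentence such as "Barack Obama was born in Honolulu, Hawaii." keeps "Honolulu" unchanged. A variant that swaps every listed entity in one pass could look like the sketch below. The name `swap_all_entities` and the shortened `SWAPS` mapping are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical mapping; the pairs mirror some of those used in entity_swapping
SWAPS = {
    "Barack Obama": "Joe Biden",
    "Honolulu": "Kailua",
    "Apple Inc.": "Google",
    "Google": "Facebook",
}

# Longest names first so a longer entity wins over a shorter one at the same position
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(SWAPS, key=len, reverse=True))
)

def swap_all_entities(sentence):
    """Replace every listed entity in a single left-to-right pass."""
    return _PATTERN.sub(lambda match: SWAPS[match.group(0)], sentence)

print(swap_all_entities("Barack Obama was born in Honolulu, Hawaii."))
# Joe Biden was born in Kailua, Hawaii.
print(swap_all_entities("Apple Inc. competes with Google."))
# Google competes with Facebook.
```

A single regular-expression pass avoids chained replacements: the "Google" that replaces "Apple Inc." is not scanned again and so is not turned into "Facebook".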

In [30]:
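# Generate adversarial examples: apply one randomly chosen transformation per sentence.
# No random seed is set, so results vary between runs, and a transformation
# may leave a sentence unchanged.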
adversarial_pairs = []
for sentence in original_sentences:
    transformation = random.choice([synonym_replacement, introduce_typo, entity_swapping])
    adversarial_sentence = transformation(sentence)
    adversarial_pairs.append((sentence, adversarial_sentence))

# Display results
for original, adversarial in adversarial_pairs:
    print(f"Original: {original}")
    print(f"Adversarial: {adversarial}")
    print()


Original: Barack Obama was born in Honolulu, Hawaii.
Adversarial: Joe Biden was born in Honolulu, Hawaii.

Original: Apple Inc. is based in Cupertino, California.
Adversarial: apple Inc. is based in Cupertino, California.

Original: The Eiffel Tower is located in Paris.
Adversarial: The Eiffel Tower is located in London.

Original: Cristiano Ronaldo scored a goal for Portugal.
Adversarial: Cristiano Rhnaldo scored a goal for Portugal.

Original: Tesla manufactures electric cars in California.
Adversarial: Lucid manufactures electric cars in California.

Original: Google was founded by Larry Page and Sergey Brin.
Adversarial: Google was founded by Larry Page and Sergey Brin.

Original: Microsoft announced a new product yesterday.
Adversarial: Microsoft announced a new merchandise yesterday.

Original: Amazon's headquarters are in Seattle, Washington.
Adversarial: Aaazon's headquarters are in Seattle, Washington.

Original: The Prime Minister met with the President of France.
Adversarial

Runs Named Entity Recognition (NER) on each adversarial sentence so that the predictions can be checked against the original sentence.

For each pair of sentences:
- Prints the original sentence.
- Prints the adversarially transformed sentence.
- Performs NER on the adversarial sentence and displays the detected entities.

In [31]:
# Iterate over original and adversarial sentence pairs
for original, adversarial in adversarial_pairs:
    # Print the original sentence
    print(original)

    # Print the adversarial sentence
    print(adversarial)

    # Perform NER and display results
    print("NER Results:")
    result = ner(adversarial)
    for entity in result:
        # Print each detected entity and its predicted label group
        print(entity['word'], entity['entity_group'])
    print("\n")


Barack Obama was born in Honolulu, Hawaii.
Joe Biden was born in Honolulu, Hawaii.
NER Results:
Joe Biden PER
Honolulu LOC
Hawaii LOC


Apple Inc. is based in Cupertino, California.
apple Inc. is based in Cupertino, California.
NER Results:
apple Inc. ORG
Cup LOC
##ert LOC
##ino LOC
California LOC


The Eiffel Tower is located in Paris.
The Eiffel Tower is located in London.
NER Results:
E LOC
##iff LOC
##el Tower LOC
London LOC


Cristiano Ronaldo scored a goal for Portugal.
Cristiano Rhnaldo scored a goal for Portugal.
NER Results:
C PER
##rist PER
##iano Rhnaldo PER
Portugal LOC


Tesla manufactures electric cars in California.
Lucid manufactures electric cars in California.
NER Results:
Luc ORG
##id ORG
California LOC


Google was founded by Larry Page and Sergey Brin.
Google was founded by Larry Page and Sergey Brin.
NER Results:
Google ORG
Larry Page PER
Sergey Brin PER


Microsoft announced a new product yesterday.
Microsoft announced a new merchandise yesterday.
NER Results:
Mi
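
Several results above are split into WordPiece fragments (for example `Cup`, `##ert`, `##ino`), which makes them hard to compare with the original entities. A minimal post-processing sketch follows, assuming entries shaped like the pipeline output; `merge_wordpieces` is a hypothetical helper, not part of `transformers`. The pipeline also accepts word-level aggregation strategies (`first`, `average`, `max`) that may avoid the split at the source, but that has not been tested here.

```python
def merge_wordpieces(entities):
    """Merge '##' continuation pieces into the preceding entity of the same group."""
    merged = []
    for ent in entities:
        word, group = ent["word"], ent["entity_group"]
        if word.startswith("##") and merged and merged[-1]["entity_group"] == group:
            # Strip the '##' marker and append to the previous fragment
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"word": word, "entity_group": group})
    return merged

# Fragments copied from the outputs above
cupertino = [
    {"word": "Cup", "entity_group": "LOC"},
    {"word": "##ert", "entity_group": "LOC"},
    {"word": "##ino", "entity_group": "LOC"},
    {"word": "California", "entity_group": "LOC"},
]
eiffel = [
    {"word": "E", "entity_group": "LOC"},
    {"word": "##iff", "entity_group": "LOC"},
    {"word": "##el Tower", "entity_group": "LOC"},
    {"word": "London", "entity_group": "LOC"},
]

print([e["word"] for e in merge_wordpieces(cupertino)])  # ['Cupertino', 'California']
print([e["word"] for e in merge_wordpieces(eiffel)])     # ['Eiffel Tower', 'London']
```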

