# Bangla Name Extraction with the WikiANN Dataset

We're building a Named Entity Recognition (NER) model to extract names from Bangla text. We use the WikiANN dataset, which provides labeled NER data for many languages, including Bangla.

Here's a brief overview of the steps involved:
- **Dataset**: We use the Bangla (`bn`) subset of WikiANN, which provides training (10,000 examples), validation (1,000), and test (1,000) splits of tokenized text with named entity tags.
- **Model**: We fine-tune the pre-trained BanglaBERT model (`csebuetnlp/banglabert`) for token classification.
- **Training and Evaluation**: We train the model on the WikiANN training set and evaluate it with precision, recall, and F1-score on the validation set.
- **Deployment**: After training, we load the saved checkpoint and run it on a new Bangla sentence to check that it extracts person names.

Let's get started!


In [1]:
!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install seqeval
!pip install evaluate

Collecting transformers
  Downloading transformers-4.40.0-py3-none-any.whl (9.0 MB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.2
    Uninstalling tokenizers-0.15.2:
      Successfully uninstalled tokenizers-0.15.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed tokenizers-0.19.1 transformers-4.40.0
Collecting accelerate
  Downloading accelerate-0.29.3-py3

In [2]:
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForPreTraining
from transformers import DataCollatorForTokenClassification
import evaluate
import numpy as np
from transformers import TrainingArguments
from transformers import pipeline
from transformers import Trainer
from transformers import AutoModelForTokenClassification

# Load Dataset and Preprocessing

In [3]:
data = load_dataset("wikiann", "bn")
data

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 1000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
})

In [4]:
data['train'].features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None),
 'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'spans': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

In [5]:
pd.DataFrame(data['train'][:])[['tokens', 'ner_tags']].iloc[0]

tokens      [ড্যানভিল, ,, ইলিনয়]
ner_tags                [5, 6, 6]
Name: 0, dtype: object

In [6]:
tags = data['train'].features['ner_tags'].feature

index2tag = {idx:tag for idx, tag in enumerate(tags.names)}
tag2index = {tag:idx for idx, tag in enumerate(tags.names)}

In [7]:
def create_tag_names(batch):
  tag_name = {'ner_tags_str': [tags.int2str(idx) for idx in batch['ner_tags']]}
  return tag_name
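
To make the tag mapping concrete, here is a minimal sketch that decodes integer tag IDs into tag names with a plain dictionary. The label list is copied from the dataset features shown earlier; `decode_tags` is a hypothetical helper, not part of the notebook, and it stands in for `ClassLabel.int2str` so the example runs without downloading the dataset.

```python
# Label list copied from the dataset features printed earlier.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# Same two lookups as in the notebook, built from the plain list.
index2tag = dict(enumerate(label_names))
tag2index = {tag: idx for idx, tag in index2tag.items()}


def decode_tags(tag_ids):
    """Hypothetical helper: map integer tag IDs to their string names."""
    return [index2tag[i] for i in tag_ids]


# Tags of the first training example: tokens [ড্যানভিল, ,, ইলিনয়]
decoded = decode_tags([5, 6, 6])
print(decoded)

# In this label list every B- tag has an odd ID and the matching I- tag
# is the next ID, which the label alignment step relies on later.
for tag, idx in tag2index.items():
    if tag.startswith("B-"):
        assert idx % 2 == 1 and index2tag[idx + 1] == "I-" + tag[2:]
```

Note that the whole location name, including the comma token, is tagged as one `LOC` entity.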

# Model Building

## Tokenization

In [8]:
from transformers import AutoTokenizer
from transformers import AutoModelForPreTraining

model_checkpoint = "csebuetnlp/banglabert"
# This pre-training model is not used for NER; it is replaced by the
# token-classification model loaded in cell 22.
model = AutoModelForPreTraining.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

config.json:   0%|          | 0.00/586 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/443M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/528k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [9]:
tokenizer.is_fast

True

# Managing Consecutive Subwords in Transformer Models

When working with transformer models, you often encounter subword tokenization, where a single word can be divided into multiple subword units. This can cause a misalignment between the tokenized input and the original labels in the dataset. Here's how you can handle this problem.

## The Challenge with Consecutive Subwords
Subword tokenization can lead to discrepancies in the length of tokenized inputs and the original list of labels. This happens because some words might be split into smaller subwords. For instance, "playing" could be tokenized into "play" and "##ing".
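
The length mismatch can be reproduced with a toy WordPiece-style tokenizer. This is only a sketch of greedy longest-match splitting: `wordpiece_tokenize` and `toy_vocab` are invented for illustration and are not the BanglaBERT tokenizer or its vocabulary.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to illustrate the split.
toy_vocab = {"play", "##ing", "##ed", "the"}

words = ["the", "playing"]
word_labels = ["O", "O"]  # one label per word
tokens = [piece for w in words for piece in wordpiece_tokenize(w, toy_vocab)]

print(tokens)
print(len(word_labels), "labels for", len(tokens), "tokens")
```

Two words carry two labels, but the tokenizer emits three tokens, so the labels have to be expanded before training.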

## Addressing Special Tokens
The tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which have no corresponding word and therefore no label. The usual solution is to give them the label `-100`: this is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions contribute nothing to the training loss.
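
The effect of the `-100` label can be checked with a small pure-Python version of a masked cross-entropy. This is a sketch of the idea only, not the PyTorch implementation; `masked_cross_entropy` and the toy logits are invented for illustration.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special or padded position: skipped entirely
        # log-sum-exp, shifted by the row maximum for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Four token positions, two classes; the first and last are special tokens.
logits = [[9.0, -3.0], [2.0, 0.0], [0.0, 2.0], [-5.0, 7.0]]
labels = [-100, 0, 1, -100]

loss = masked_cross_entropy(logits, labels)
# Same loss when the ignored positions are dropped beforehand.
loss_without_special = masked_cross_entropy(logits[1:3], labels[1:3])
print(loss, loss_without_special)
```

Whatever the model predicts at the ignored positions, the loss stays the same.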

## Strategies for Label Alignment
To align the labels with the tokenized inputs, you can use one of the following approaches:

- **Label All Subwords**: Assign the word's label to every subword derived from it, so a split word is still fully labeled. Special tokens are given a label of `-100`. The function below uses this strategy, with one refinement: continuation subwords of a `B-` word receive the matching `I-` label.

- **Label Only the First Subword**: Assign a label only to the first subword of a word and use `-100` for the remaining subwords, so each word contributes exactly once to the loss and the metrics.
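
For comparison with the notebook's alignment function, which labels all subwords, here is a minimal sketch of the second strategy. `align_first_subword_only` is a hypothetical helper, and the `word_ids` list is an assumed split of the first training example, chosen to match its eight `input_ids`; the real tokenizer output may differ.

```python
def align_first_subword_only(labels, word_ids):
    """Keep the label on the first subword of each word, -100 elsewhere."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            new_labels.append(-100)
        elif word_id != previous_word:
            # First subword of a new word keeps the word's label
            new_labels.append(labels[word_id])
        else:
            # Continuation subword is ignored by the loss
            new_labels.append(-100)
        previous_word = word_id
    return new_labels


# Word-level tags of the first training example: B-LOC, I-LOC, I-LOC
word_labels = [5, 6, 6]
# Assumed word IDs for its eight tokens (None marks special tokens)
word_ids = [None, 0, 0, 1, 2, 2, 2, None]

aligned = align_first_subword_only(word_labels, word_ids)
print(aligned)
```

Under the same assumed split, labeling all subwords gives `[-100, 5, 6, 6, 6, 6, 6, -100]`, which is what the notebook stores for this example, so the two strategies differ only at continuation subwords.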


In [10]:
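def align_labels_with_tokens(labels, word_ids):
  """Expand word-level labels to token level, labelling every subword."""
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word:
      # First token of a new word, or a special token (word_id is None)
      current_word = word_id
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)

    elif word_id is None:
      # Consecutive special token
      new_labels.append(-100)

    else:
      # Continuation subword of the current word
      label = labels[word_id]

      # B-XXX tags have odd IDs; the matching I-XXX tag is the next ID
      if label % 2 == 1:
        label = label + 1
      new_labels.append(label)

  return new_labels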


In [11]:
def tokenize_and_align_labels(examples):
  tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)

  all_labels = examples['ner_tags']

  new_labels = []
  for i, labels in enumerate(all_labels):
    word_ids = tokenized_inputs.word_ids(i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))

  tokenized_inputs['labels'] = new_labels

  return tokenized_inputs

In [12]:
tokenized_datasets = data.map(tokenize_and_align_labels, batched=True, remove_columns=data['train'].column_names)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

# Data Collation and Metrics

In [13]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
batch = data_collator([tokenized_datasets['train'][i] for i in range(2)])
batch

{'input_ids': tensor([[    2, 15111,  6950,    16,  6989,   775,   762,     3,     0,     0,
             0],
        [    2,  9074,  2303,  9074,  2303,    11,    11,    12,  5213,    13,
             3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100,    5,    6,    6,    6,    6,    6, -100, -100, -100, -100],
        [-100,    3,    4,    3,    4,    0,    0,    0,    0,    0, -100]])}
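
The batch above shows what the collator does: shorter sequences are padded to the longest one in the batch, and the padded label positions are filled with `-100`. A minimal pure-Python sketch of that padding logic follows. `pad_batch` is a hypothetical stand-in for `DataCollatorForTokenClassification`; it returns lists rather than tensors and assumes right-side padding with pad token ID 0, as seen in the output.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every field to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions are masked out of attention ...
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # ... and ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# The first two training examples, copied from the collator output above.
features = [
    {"input_ids": [2, 15111, 6950, 16, 6989, 775, 762, 3],
     "attention_mask": [1] * 8,
     "labels": [-100, 5, 6, 6, 6, 6, 6, -100]},
    {"input_ids": [2, 9074, 2303, 9074, 2303, 11, 11, 12, 5213, 13, 3],
     "attention_mask": [1] * 11,
     "labels": [-100, 3, 4, 3, 4, 0, 0, 0, 0, 0, -100]},
]

padded = pad_batch(features)
print(padded["labels"][0])
```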

In [14]:
metric = evaluate.load('seqeval')

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

In [15]:
ner_feature = data['train'].features['ner_tags']
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)

In [16]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

In [17]:
labels = data['train'][0]['ner_tags']
labels = [label_names[i] for i in labels]
labels

['B-LOC', 'I-LOC', 'I-LOC']

In [18]:
predictions = labels.copy()
predictions[2] = "O"

metric.compute(predictions=[predictions], references=[labels])

{'LOC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.6666666666666666}
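
The result above may look surprising: two of three tags are correct, yet precision, recall, and F1 are all zero. The reason is that these scores are computed per entity, and an entity counts only if its type and both boundaries match exactly. The following simplified sketch reproduces the numbers; `extract_entities` and `entity_scores` are hypothetical helpers, not the seqeval implementation.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != ent_type
        )
        if ent_type is not None and (tag == "O" or starts_new):
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if starts_new:
            start, ent_type = i, tag[2:]
    return entities


def entity_scores(predictions, references):
    """Entity-level precision/recall/F1 plus token-level accuracy."""
    pred, ref = extract_entities(predictions), extract_entities(references)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": matches / len(references)}


references = ["B-LOC", "I-LOC", "I-LOC"]
predictions = ["B-LOC", "I-LOC", "O"]

print(extract_entities(references))   # one LOC span covering tokens 0..2
print(extract_entities(predictions))  # one LOC span covering tokens 0..1
scores = entity_scores(predictions, references)
print(scores)
```

The predicted span ends one token early, so it matches no reference entity even though most individual tags agree.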

In [19]:
def compute_metrics(eval_preds):
  logits, labels = eval_preds

  predictions = np.argmax(logits, axis=-1)

  true_labels = [[label_names[l] for l in label if l!=-100] for label in labels]

  true_predictions = [[label_names[p] for p,l in zip(prediction, label) if l!=-100]
                      for prediction, label in zip(predictions, labels)]

  all_metrics = metric.compute(predictions=true_predictions, references=true_labels)

  return {"precision": all_metrics['overall_precision'],
          "recall": all_metrics['overall_recall'],
          "f1": all_metrics['overall_f1'],
          "accuracy": all_metrics['overall_accuracy']}
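
To see what the filtering step in `compute_metrics` does, here is a toy run in pure Python. `decode_for_seqeval`, `argmax`, and `one_hot` are hypothetical helpers that mirror the list comprehensions above without NumPy, and the logits are invented for illustration.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def one_hot(index, size=len(label_names)):
    """Toy logits that put all the weight on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def decode_for_seqeval(logits, labels):
    """Drop -100 positions and convert IDs to tag names."""
    true_predictions, true_labels = [], []
    for seq_logits, seq_labels in zip(logits, labels):
        pred_ids = [argmax(row) for row in seq_logits]
        true_labels.append([label_names[l] for l in seq_labels if l != -100])
        true_predictions.append(
            [label_names[p] for p, l in zip(pred_ids, seq_labels) if l != -100]
        )
    return true_predictions, true_labels


# One sequence: [CLS], two real tokens, [SEP]
toy_logits = [[one_hot(0), one_hot(1), one_hot(0), one_hot(3)]]
toy_labels = [[-100, 1, 2, -100]]

preds, refs = decode_for_seqeval(toy_logits, toy_labels)
print(preds, refs)
```

The prediction at the `[SEP]` position is `B-ORG`, but it never reaches the metric because its label is `-100`.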

# Training the Model

In [20]:
id2label = {i:label for i, label in enumerate(label_names)}
label2id = {label:i for i, label in enumerate(label_names)}

In [21]:
print(id2label)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC'}


In [22]:
model = AutoModelForTokenClassification.from_pretrained(
                                                    model_checkpoint,
                                                    id2label=id2label,
                                                    label2id=label2id)

Some weights of ElectraForTokenClassification were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
model.config.num_labels

7

In [24]:
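# The output directory name is only a label; the model being fine-tuned is
# BanglaBERT (an ELECTRA model), not DistilBERT.
args = TrainingArguments("distilbert-finetuned-ner",
                         evaluation_strategy = "epoch",  # evaluate after every epoch
                         save_strategy="epoch",          # save a checkpoint after every epoch
                         learning_rate = 2e-5,
                         num_train_epochs=3,
                         weight_decay=0.01)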

In [25]:
tokenized_datasets["train"][0]

{'input_ids': [2, 15111, 6950, 16, 6989, 775, 762, 3],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 5, 6, 6, 6, 6, 6, -100]}

In [26]:
trainer = Trainer(model=model,
                  args=args,
                  train_dataset = tokenized_datasets['train'],
                  eval_dataset = tokenized_datasets['validation'],
                  data_collator=data_collator,
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer)


trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2322 | 0.147713 | 0.925333 | 0.940379 | 0.932796 | 0.966266 |
| 2 | 0.0958 | 0.124639 | 0.94375 | 0.954833 | 0.949259 | 0.975508 |
| 3 | 0.0523 | 0.110202 | 0.959641 | 0.966576 | 0.963096 | 0.980283 |


TrainOutput(global_step=3750, training_loss=0.17732857208251954, metrics={'train_runtime': 319.9623, 'train_samples_per_second': 93.761, 'train_steps_per_second': 11.72, 'total_flos': 229579992225936.0, 'train_loss': 0.17732857208251954, 'epoch': 3.0})
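
The step count and throughput in `TrainOutput` can be checked with a little arithmetic. This sketch assumes the `Trainer` default batch size of 8 per device (the notebook does not set it), a single device, and no gradient accumulation.

```python
import math

num_train_examples = 10_000   # size of the training split
per_device_batch_size = 8     # assumed: Trainer default, not set in the notebook
num_epochs = 3
train_runtime = 319.9623      # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 2))
```

Under these assumptions the total matches `global_step=3750`, which is also why the final checkpoint directory is named `checkpoint-3750`.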

In [27]:
!zip -r distilbert_ner.zip "/content/distilbert-finetuned-ner/checkpoint-3750"

  adding: content/distilbert-finetuned-ner/checkpoint-3750/ (stored 0%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/rng_state.pth (deflated 25%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/tokenizer_config.json (deflated 74%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/model.safetensors (deflated 7%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/training_args.bin (deflated 51%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/special_tokens_map.json (deflated 42%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/vocab.txt (deflated 71%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/optimizer.pt (deflated 25%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/config.json (deflated 54%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/tokenizer.json (deflated 76%)
  adding: content/distilbert-finetuned-ner/checkpoint-3750/trainer_state.json (deflated 71%)
  adding: content/distil

In [28]:
checkpoint = "/content/distilbert-finetuned-ner/checkpoint-3750"
token_classifier = pipeline(
    "token-classification", model=checkpoint, aggregation_strategy="simple"
)

In [29]:
sentence = "আফজালুর রহমান বলেন, সবার হাতে হাতে প্রশ্ন দেখে তিনি ভেবেছিলেন এটি ভুয়া প্রশ্ন। উত্তম কুমার ভট্টাচার্য্য এ কথার সাথে দ্বিমত পোশষণ করেন।"
tokens_ner = token_classifier(sentence)
# print(tokens_ner)

for ner in tokens_ner:
  if ner["entity_group"] == "PER":
    print(ner["word"])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


আফজালুর রহমান
উত্তম কুমার ভট্টাচার্য্য
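
The filtering loop above can be wrapped in a small reusable function. The sketch below runs on a hand-written list shaped like the pipeline's aggregated output. `extract_person_names` is a hypothetical helper, and every score, the `ORG` entry, and the low-confidence `PER` entry are invented for illustration. Real pipeline results also carry `start` and `end` character offsets, omitted here.

```python
def extract_person_names(entities, min_score=0.0):
    """Collect unique PER entities in order of appearance."""
    names = []
    for entity in entities:
        if entity["entity_group"] != "PER" or entity["score"] < min_score:
            continue
        if entity["word"] not in names:
            names.append(entity["word"])
    return names


# Hand-written sample; the scores and the last two entries are made up.
sample_output = [
    {"entity_group": "PER", "score": 0.99, "word": "আফজালুর রহমান"},
    {"entity_group": "PER", "score": 0.98, "word": "উত্তম কুমার ভট্টাচার্য্য"},
    {"entity_group": "ORG", "score": 0.95, "word": "ঢাকা বিশ্ববিদ্যালয়"},
    {"entity_group": "PER", "score": 0.40, "word": "করিম"},
]

names = extract_person_names(sample_output, min_score=0.5)
for name in names:
    print(name)
```

A score threshold such as `min_score` is one possible way to trade recall for precision; a suitable value would have to be chosen on validation data.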
