<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/BERTs/BERT_based_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets

## **Named Entity Recognition (NER) in 🤗 using the pretrained bert-base**
---

So, before diving into the models, we need to know what NER is.

##### _Defining NER_

NER is an NLP technique that identifies key entities in unstructured text and classifies them into categories such as person, organization, and location.
The diagram below illustrates this far better than words alone.
<br><br>
<figure align="center">
  <img src= "https://editor.analyticsvidhya.com/uploads/19617Intro%20image.jpg" width= "550" />
  <figcaption><i>Visual Representation of NER in action</i></figcaption>
</figure>

##### _The B - I - O Scheme_

Okay, so in the image above we can see that a single entity can span multiple tokens. Such multi-token entities are marked using the BIO (Begin, Inside, Outside) scheme.

> **The Breakdown:**
> So, let's take a phrase, e.g., `the United States`.
>
> Here, the **"B"** prefix marks the beginning of a named entity, so "United" is tagged `B-LOC`. Likewise, the **"I"** prefix marks a token inside the same entity, so "States" is tagged `I-LOC`. **"O"** marks tokens outside any entity, such as the article "the"; in ordinary text it is by far the most common tag.
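
To make the scheme concrete, here is a minimal sketch that turns parallel lists of tokens and BIO tags back into entities. The helper name `bio_to_spans` and the sample sentence are illustrative assumptions, not part of any library.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) pairs from parallel token and BIO tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" closes any open entity; in this sketch a stray I- tag
            # that continues nothing is treated the same way.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical hand-tagged sentence, for illustration only.
tokens = ["She", "moved", "to", "the", "United", "States", "with", "John", "Smith", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-PER", "I-PER", "O"]
print(bio_to_spans(tokens, tags))
```

Most tokens carry the `O` tag, and only the `B-`/`I-` runs become entities.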

Now that we know this much, let's start the task:


In [2]:
''' Initial step:
    Log in to the Hugging Face Hub using an API token
'''
from google.colab import userdata
from huggingface_hub import login

my_token = userdata.get('HF_collab')
login(my_token)

In [3]:
#@ 1. First, load the bert-base-NER model

from transformers import AutoModelForTokenClassification as AmFTC

model_name = "dslim/bert-base-NER"

model= AmFTC.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
#@ 2. Load the tokenizer for the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [5]:
#@ 3. Create the pipeline

from transformers import pipeline

nlp = pipeline('ner', model= model, tokenizer= tokenizer)
nlp

Device set to use cuda:0


<transformers.pipelines.token_classification.TokenClassificationPipeline at 0x7c226051ae30>

In [6]:
#@ 4. Extract the four entity types: LOC, ORG, PER, MISC

text= "John Smith, the CEO of TechSolutions Inc., attended the annual conference in San Francisco on September 15, 2023."

ner_results = nlp(text)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99968624, 'index': 1, 'word': 'John', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.9997087, 'index': 2, 'word': 'Smith', 'start': 5, 'end': 10}, {'entity': 'B-ORG', 'score': 0.9995708, 'index': 7, 'word': 'Tech', 'start': 23, 'end': 27}, {'entity': 'I-ORG', 'score': 0.9993705, 'index': 8, 'word': '##S', 'start': 27, 'end': 28}, {'entity': 'I-ORG', 'score': 0.99904627, 'index': 9, 'word': '##ol', 'start': 28, 'end': 30}, {'entity': 'I-ORG', 'score': 0.9993812, 'index': 10, 'word': '##ution', 'start': 30, 'end': 35}, {'entity': 'I-ORG', 'score': 0.999374, 'index': 11, 'word': '##s', 'start': 35, 'end': 36}, {'entity': 'I-ORG', 'score': 0.99942553, 'index': 12, 'word': 'Inc', 'start': 37, 'end': 40}, {'entity': 'B-LOC', 'score': 0.9991295, 'index': 20, 'word': 'San', 'start': 77, 'end': 80}, {'entity': 'I-LOC', 'score': 0.9993037, 'index': 21, 'word': 'Francisco', 'start': 81, 'end': 90}]
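
The raw output above has one entry per WordPiece, so `TechSolutions` shows up as `Tech`, `##S`, `##ol`, `##ution`, `##s`. The pipeline can merge these itself when it is created with `aggregation_strategy="simple"`. The sketch below shows the same idea in plain Python, using the `index`, `start`, and `end` fields copied from the output above (scores omitted). The helper name `group_entities` is a hypothetical one, and this is a simplification rather than the pipeline's actual implementation.

```python
def group_entities(text, token_results):
    """Merge consecutive B-/I- predictions of one type into character spans."""
    groups = []
    for res in token_results:
        prefix, _, label = res["entity"].partition("-")
        continues = (
            prefix == "I"
            and groups
            and groups[-1]["label"] == label
            # Only merge tokens that are direct neighbours in the input.
            and res["index"] == groups[-1]["last_index"] + 1
        )
        if continues:
            groups[-1]["end"] = res["end"]
            groups[-1]["last_index"] = res["index"]
        else:
            groups.append({"label": label, "start": res["start"],
                           "end": res["end"], "last_index": res["index"]})
    # Slice the original text so that "##" pieces never appear in the result.
    return [(g["label"], text[g["start"]:g["end"]]) for g in groups]


text = ("John Smith, the CEO of TechSolutions Inc., attended the annual "
        "conference in San Francisco on September 15, 2023.")

# Fields copied from the pipeline output shown above.
token_results = [
    {"entity": "B-PER", "index": 1, "start": 0, "end": 4},
    {"entity": "I-PER", "index": 2, "start": 5, "end": 10},
    {"entity": "B-ORG", "index": 7, "start": 23, "end": 27},
    {"entity": "I-ORG", "index": 8, "start": 27, "end": 28},
    {"entity": "I-ORG", "index": 9, "start": 28, "end": 30},
    {"entity": "I-ORG", "index": 10, "start": 30, "end": 35},
    {"entity": "I-ORG", "index": 11, "start": 35, "end": 36},
    {"entity": "I-ORG", "index": 12, "start": 37, "end": 40},
    {"entity": "B-LOC", "index": 20, "start": 77, "end": 80},
    {"entity": "I-LOC", "index": 21, "start": 81, "end": 90},
]

entities = group_entities(text, token_results)
print(entities)
```

Slicing `text` by character offsets, rather than concatenating the `word` fields, sidesteps the `##` prefixes entirely.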


Okay, so the model we just used was fine-tuned on the CoNLL-2003 dataset. Now, what if we evaluate it on the original ("og") dataset itself?



In [7]:
##@ Loading the CoNLL2003 Dataset

from datasets import load_dataset

conll= load_dataset("conll2003")
conll

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

Each sentence is stored as a list of words in `tokens`, with the matching integer-encoded NER tags in `ner_tags`.

In [8]:
example = conll['test'][12]
print(example)

{'id': '12', 'tokens': ['Defender', 'Hassan', 'Abbas', 'rose', 'to', 'intercept', 'a', 'long', 'ball', 'into', 'the', 'area', 'in', 'the', '84th', 'minute', 'but', 'only', 'managed', 'to', 'divert', 'it', 'into', 'the', 'top', 'corner', 'of', 'Bitar', "'s", 'goal', '.'], 'pos_tags': [22, 22, 22, 38, 35, 37, 12, 16, 21, 15, 12, 21, 15, 12, 16, 21, 10, 30, 38, 35, 37, 28, 15, 12, 16, 21, 15, 21, 27, 21, 7], 'chunk_tags': [11, 12, 12, 21, 22, 22, 11, 12, 12, 13, 11, 12, 13, 11, 12, 12, 0, 3, 21, 22, 22, 11, 13, 11, 12, 12, 13, 11, 11, 12, 0], 'ner_tags': [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}


In [9]:
## Retrieve the tag names from the dataset
tag_names = conll['test'].features['ner_tags'].feature.names
print(tag_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In [10]:
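# A list input is processed item by item: one (possibly empty) prediction list per token, without sentence context
ner_res = nlp(example['tokens'])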


In [11]:
predictions= []

for result in ner_res:
  if len(result) == 0:
    predictions.append('O')
  else:
    predictions.append(result[0]['entity'])

print(predictions)

['O', 'B-PER', 'B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O', 'O']


In [12]:
#@ Also we can extract the true tags
true_tags = [tag_names[i] for i in example['ner_tags']]
print(true_tags)

['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O']


In [13]:
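## Calculating the token-level accuracy
def cal_accuracy(true_tags, predictions):
  """Return the fraction of positions where the predicted tag equals the true tag."""
  if len(true_tags) != len(predictions):
    raise ValueError("true_tags and predictions must have the same length")

  correct_preds = 0
  total_preds = len(predictions)

  for true, pred in zip(true_tags, predictions):
    if true == pred:
      correct_preds += 1

  accuracy = correct_preds / total_preds
  return accuracy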


In [14]:
accuracy = cal_accuracy(true_tags, predictions)
print(f"Accuracy: {accuracy * 100: .3f}%")

Accuracy:  93.548%
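
Token-level accuracy can look high simply because most tokens are `O`. As a rough illustration, the sketch below compares it with an entity-level view on the same example, using the true and predicted tag lists printed above. The helper `extract_entities` is a hypothetical simplification (stray `I-` tags are ignored) and not a reimplementation of any evaluation library.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        # Close the open entity unless this tag continues it.
        if ent_type is not None and tag != f"I-{ent_type}":
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


# Tag lists copied from the outputs above (31 tokens each).
true_tags = ["O", "B-PER", "I-PER"] + ["O"] * 24 + ["B-PER"] + ["O"] * 3
predictions = ["O", "B-PER", "B-PER"] + ["O"] * 24 + ["B-ORG"] + ["O"] * 3

true_entities = extract_entities(true_tags)
pred_entities = extract_entities(predictions)
correct = true_entities & pred_entities  # an entity counts only if type and span both match

token_accuracy = sum(t == p for t, p in zip(true_tags, predictions)) / len(true_tags)
precision = len(correct) / len(pred_entities) if pred_entities else 0.0
recall = len(correct) / len(true_entities) if true_entities else 0.0

print(f"token accuracy: {token_accuracy:.3f}")
print(f"entity precision: {precision:.3f}, entity recall: {recall:.3f}")
```

On this example only two tokens are mislabeled, yet no predicted entity matches a true entity exactly: "Hassan Abbas" is split into two separate `PER` entities and "Bitar" is typed as `ORG`.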


Well, that was for a single example from the dataset. What if we want to evaluate on the entire test set?


In [15]:
from tqdm import tqdm  #Instantly makes loops show a smart progress meter

true_tags_list= []
preds_tags_list= []

test= conll['test']

for trial in tqdm(test, desc=str(len(test))):

  true_tags_list.append([tag_names[tag_id] for tag_id in trial['ner_tags']])

  test_ner_results = nlp(trial['tokens'])

  predicted_tags = []

  for res in test_ner_results:
    if len(res) == 0:
      predicted_tags.append('O')
    else:
      predicted_tags.append(res[0]['entity'])
  preds_tags_list.append(predicted_tags)

3453:   0%|          | 8/3453 [00:01<15:55,  3.60it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
3453: 100%|██████████| 3453/3453 [08:09<00:00,  7.05it/s]


In [None]:
!pip install evaluate seqeval

> `seqeval` is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on. \
_**src**: PyPI's seqeval description_

In [24]:
#@ We could also calculate the metrics by importing evaluate

import evaluate

seqeval = evaluate.load("seqeval")

In [27]:
overall_results = seqeval.compute(predictions= preds_tags_list, references= true_tags_list)

print("precision:", overall_results["overall_precision"])
print("recall:", overall_results["overall_recall"])
print("f1:", overall_results["overall_f1"])
print("accuracy:", overall_results["overall_accuracy"])

precision: 0.3569140074330386
recall: 0.49309490084985835
f1: 0.4140956062746264
accuracy: 0.9066867664477226
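
The gap between roughly 0.91 accuracy and roughly 0.41 F1 is at least partly explained by how the loop calls the pipeline: passing `trial['tokens']` (a list) makes it classify every token as a separate input, so the model never sees the sentence and cannot emit a `B-`/`I-` sequence across words. One possible alternative is to run the pipeline on the joined sentence and map its character offsets back to words. The sketch below shows only that mapping step. The helper names are hypothetical, the sample predictions are hand-written stand-ins rather than real model output, and it assumes the tokens can be rejoined with single spaces.

```python
def word_offsets(tokens):
    """Character span of each token once the tokens are joined with single spaces."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the joining space
    return offsets


def word_level_tags(tokens, char_predictions):
    """Give each word the tag of the first prediction whose start offset falls inside it."""
    tags = ["O"] * len(tokens)
    offsets = word_offsets(tokens)
    for pred in char_predictions:
        for i, (start, end) in enumerate(offsets):
            if start <= pred["start"] < end:
                if tags[i] == "O":  # keep the tag of the word's first sub-token
                    tags[i] = pred["entity"]
                break
    return tags


tokens = ["Defender", "Hassan", "Abbas", "rose"]
sentence = " ".join(tokens)

# Hand-written stand-in for pipeline output on `sentence` (NOT real model output).
fake_predictions = [
    {"entity": "B-PER", "start": 9, "end": 15},
    {"entity": "I-PER", "start": 16, "end": 19},
    {"entity": "I-PER", "start": 19, "end": 21},
]

print(word_level_tags(tokens, fake_predictions))
```

In the notebook, `fake_predictions` would be replaced by `nlp(" ".join(trial['tokens']))`. How much this changes the scores above has not been measured here.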
