### Sentence Token Analysis (NER: names, locations and organisations) with a pretrained BERT model and Transformers



##### Dataset Summary
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. Here we are only interested in the named entity tag.
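The file layout described above can be read with a few lines of plain Python. This is only a minimal sketch: the `SAMPLE` text and the `read_conll` helper are hypothetical and written for illustration, and the real files contain extra markers (such as document separators) that this sketch does not handle.

```python
from typing import List, Tuple

# Hypothetical sample in the four-column layout described above
# (word, POS tag, chunk tag, NER tag), with a blank line between sentences.
SAMPLE = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (word, ner_tag) pairs per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, _pos, _chunk, ner = line.split(" ")
        current.append((word, ner))
    if current:
        # The last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences

sentences = read_conll(SAMPLE)
print(len(sentences))
print(sentences[1])
```

The Hugging Face `conll2003` dataset used below already ships the sentences in this parsed form, so this step is not needed in the rest of the notebook.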


### About the ner_tag

O: Word doesn’t correspond to any entity.
B-PER/I-PER: Beginning of/inside a person entity.
B-ORG/I-ORG: Beginning of/inside an organization entity.
B-LOC/I-LOC: Beginning of/inside a location entity.
B-MISC/I-MISC: Beginning of/inside a miscellaneous entity.
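To make the B-/I- scheme concrete, the sketch below groups tagged words back into whole entities. The `bio_to_spans` helper and the example sentence are hypothetical and only illustrate how the tags are meant to be read.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or an I- tag of a different type) opens a new entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-"):
            # Same entity continues
            current_words.append(token)
        else:
            # "O" closes any open entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

tokens = ["Peter", "Blackburn", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```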

## -------------------------------------------------------------------------------




#### Loading and inspecting the dataset

In [1]:
!pip install transformers
!pip install datasets
!pip install evaluate


In [2]:
import warnings
warnings.filterwarnings("ignore")


In [3]:
# Load the dataset from the Hugging Face Hub
from datasets import load_dataset

Data = load_dataset('conll2003')

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [4]:
# Inspect the data: a Hugging Face DatasetDict behaves like a dictionary of splits
# The features we are interested in are tokens (the sentence already split into words) and ner_tags (one label per token)
Data

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [5]:
print(Data['train']['tokens'][0])  # the first sentence
print(Data['train']['ner_tags'][0])  # the corresponding NER labels

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]


In [6]:
# Before continuing, log in to the Hugging Face account so the model can be pushed to the Hub during training
from huggingface_hub import notebook_login

notebook_login()


In [7]:
# printing the corresponding name of each ner_tag label from 0-8
ner_names = Data['train'].features['ner_tags']

ner_ls = ner_names.feature.names

ner_name_dig = {}

for i, name in enumerate(ner_ls):
    ner_name_dig[i] = name

print(ner_name_dig)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


In [8]:
# Map the integer ner_tags of the first training sentence to their tag names
l = Data['train'][0]['ner_tags']

ner = [ner_name_dig[i] for i in l]

print(Data['train'][0]['ner_tags'])
print(Data['train'][0]['tokens'])
print(ner)



[3, 0, 7, 0, 0, 0, 7, 0, 0]
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


#### Re-tokenising the pre-split sentence produces more tokens than words, so the ner_tags labels must be realigned

In [9]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [10]:
# Tokenize the first sentence: note that its length grows from 9 words to 12 tokens, which no longer matches the 9 ner_tags labels
token_token = tokenizer(Data['train'][0]['tokens'], is_split_into_words = True)
token_token

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
print(token_token.tokens())  # [CLS] and [SEP] are added, and 'lamb' is split into the subtokens 'la' and '##mb'
print(len(token_token.tokens()))  # from 9 to 12

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
12


In [12]:
# Expand the word-level ner_tags to one label per token.
# Special tokens such as [CLS] and [SEP] get -100, which is skipped by the loss during training.
def labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First token of a new word, or a special token following a word
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Consecutive special tokens
            new_labels.append(-100)
        else:
            # Continuation subtoken of the same word:
            # turn B-XXX (odd id) into the matching I-XXX (next even id)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels





labels_with_tokens(Data["train"][0]["ner_tags"], token_token.word_ids())

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

In [13]:
labels = Data["train"]["ner_tags"][0]
word_ids = token_token.word_ids()
print(labels)  # original ner_tags
print(labels_with_tokens(labels, word_ids))  # ner_tags expanded to the token length

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
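In the example above no entity word is split, so the B- to I- conversion never triggers. The sketch below restates the same alignment rule without the tokenizer, using hand-written `word_ids` for a hypothetical two-word input whose first word (tagged B-LOC, id 5) is split into two subtokens. The function name `align_labels` and the inputs are made up for illustration.

```python
def align_labels(labels, word_ids):
    """Give every token a label; special tokens (word id None) get -100."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)
        elif word_id != previous:
            # First subtoken of a word keeps the word's label
            aligned.append(labels[word_id])
        else:
            # Later subtokens: B-XXX (odd id) becomes I-XXX (next even id)
            label = labels[word_id]
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned

# Hypothetical input: word 0 is B-LOC (5) and split in two pieces, word 1 is O (0)
word_ids = [None, 0, 0, 1, None]
labels = [5, 0]
aligned = align_labels(labels, word_ids)
print(aligned)  # [-100, 5, 6, 0, -100]
```

The second subtoken of the location receives 6 (I-LOC) rather than a second B-LOC, so the entity is still counted once.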


In [14]:
# Apply the ner_tags alignment to the full dataset (the function receives a batch of sentences)

def full_match_token(data):
    token_input = tokenizer(
        data['tokens'], is_split_into_words=True, truncation=True
    )

    combine_label = data['ner_tags']
    new_label = []
    for i, label in enumerate(combine_label):
        word_id = token_input.word_ids(i)  # word ids of the i-th sentence in the batch
        new_label.append(labels_with_tokens(label, word_id))

    token_input['labels'] = new_label
    return token_input


In [15]:
Token_Data = Data.map(full_match_token, batched = True,
                      remove_columns=Data["train"].column_names)


In [16]:
print(Token_Data['train']['input_ids'][0])  # token ids of the first tokenized sentence
Token_Data['train']['labels'][0]  # the corresponding aligned ner_tags





[101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102]


[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

In [17]:
# Use a data collator to pad every batch to a common length; note return_tensors='tf', since the model is trained with TensorFlow
from transformers import DataCollatorForTokenClassification
collator = DataCollatorForTokenClassification(tokenizer = tokenizer, return_tensors = 'tf')
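Roughly speaking, the collator pads every sequence in a batch to the length of the longest one, and pads the labels with -100 so that padding positions are ignored by the loss. The plain-Python sketch below only illustrates that idea; `pad_batch` is a hypothetical helper, not the library's implementation, and it assumes right-side padding with pad token id 0 (the `[PAD]` id for `bert-base-cased`).

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        n_pad = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        padded["attention_mask"].append([1] * length + [0] * n_pad)
        # Labels are padded with -100 so the loss skips these positions
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded

# Two hypothetical examples of different lengths
batch = [
    {"input_ids": [101, 7270, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 7270, 22961, 1528, 102], "labels": [-100, 3, 0, 7, -100]},
]
padded = pad_batch(batch)
print(padded["input_ids"][0])
print(padded["labels"][0])
```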



In [18]:
# conversion to tensorflow dataset
tf_train_dataset = Token_Data["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = Token_Data["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=collator,
    shuffle=False,
    batch_size=32,
)

In [19]:
# Reverse dictionary of the ner_tags labels (name -> id); together with ner_name_dig it lets the model decode predictions back to tag names
dig_name_ner = {v: k for k, v in ner_name_dig.items()}
dig_name_ner

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

In [20]:
# Initialising and configuring the model
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=ner_name_dig,
    label2id=dig_name_ner,
)

model.config.num_labels  # 9 labels in total


All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


9

In [21]:

from transformers import create_optimizer
import tensorflow as tf

num_epochs = 4
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)
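With `num_warmup_steps=0`, and assuming the default settings of `create_optimizer` (polynomial decay with power 1.0 and a final rate of 0), the learning rate should fall linearly from `2e-5` to zero over training. The sketch below reproduces that schedule in plain Python. The step count of 1756 is an assumption: 439 batches per epoch (14041 sentences at batch size 32, keeping the last partial batch) times 4 epochs; `linear_decay_lr` is a hypothetical helper.

```python
def linear_decay_lr(step, init_lr=2e-5, num_train_steps=1756, end_lr=0.0):
    """Learning rate after `step` optimizer steps under linear decay."""
    step = min(step, num_train_steps)  # the rate stays at end_lr afterwards
    fraction_left = 1.0 - step / num_train_steps
    return (init_lr - end_lr) * fraction_left + end_lr

for step in (0, 439, 878, 1756):
    print(step, linear_decay_lr(step))
```

Halfway through training (step 878) the rate is half of the initial value, and it reaches zero at the final step.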

In [22]:
# Fit the model and push it to the Hugging Face Hub

from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(output_dir="bert-ner-finetuned-ner", tokenizer=tokenizer)

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=num_epochs,
)



Cloning https://huggingface.co/jeje01/bert-ner-finetuned-ner into local empty directory.



Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tf_keras.src.callbacks.History at 0x7a86288fb4f0>

In [24]:
! pip install evaluate
! pip install seqeval
# NER is evaluated with seqeval, which reports entity-level precision, recall and F1 for each entity type plus overall scores
import evaluate

metric = evaluate.load("seqeval")



In [25]:
# Predict on each batch and combine the results into a single list before computing the metric
import numpy as np

all_predictions = []
all_labels = []
for batch in tf_eval_dataset:
    logits = model.predict_on_batch(batch)["logits"]
    labels = batch["labels"]
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        for predicted_idx, label_idx in zip(prediction, label):
            if label_idx == -100:
                continue
            all_predictions.append(ner_ls[predicted_idx])
            all_labels.append(ner_ls[label_idx])
metric.compute(predictions=[all_predictions], references=[all_labels])

{'LOC': {'precision': 0.94994617868676,
  'recall': 0.9608056614044638,
  'f1': 0.9553450608930989,
  'number': 1837},
 'MISC': {'precision': 0.8256048387096774,
  'recall': 0.8882863340563991,
  'f1': 0.8557993730407524,
  'number': 922},
 'ORG': {'precision': 0.8923636363636364,
  'recall': 0.9149888143176734,
  'f1': 0.9035346097201766,
  'number': 1341},
 'PER': {'precision': 0.9506107275624004,
  'recall': 0.9717698154180239,
  'f1': 0.9610738255033557,
  'number': 1842},
 'overall_precision': 0.9169941060903732,
 'overall_recall': 0.942611915180074,
 'overall_f1': 0.9296265560165975,
 'overall_accuracy': 0.9847383293106493}
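The `f1` values reported above are the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes two of them from the precision and recall figures copied by hand from the output; `f1_score` is a hypothetical helper, not part of seqeval.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the metric output above
overall_f1 = f1_score(0.9169941060903732, 0.942611915180074)
loc_f1 = f1_score(0.94994617868676, 0.9608056614044638)

print(round(overall_f1, 6))
print(round(loc_f1, 6))
```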

In [26]:
metrics = metric.compute(predictions=[all_predictions], references=[all_labels])

In [27]:
import pandas as pd

pd.DataFrame(metrics)

| | LOC | MISC | ORG | PER | overall_precision | overall_recall | overall_f1 | overall_accuracy |
|---|---|---|---|---|---|---|---|---|
| precision | 0.949946 | 0.825605 | 0.892364 | 0.950611 | 0.916994 | 0.942612 | 0.929627 | 0.984738 |
| recall | 0.960806 | 0.888286 | 0.914989 | 0.97177 | 0.916994 | 0.942612 | 0.929627 | 0.984738 |
| f1 | 0.955345 | 0.855799 | 0.903535 | 0.961074 | 0.916994 | 0.942612 | 0.929627 | 0.984738 |
| number | 1837.0 | 922.0 | 1341.0 | 1842.0 | 0.916994 | 0.942612 | 0.929627 | 0.984738 |


In [28]:
# Loading the model from the Hugging Face Hub
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "jeje01/bert-ner-finetuned-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)



Some layers from the model checkpoint at jeje01/bert-ner-finetuned-ner were not used when initializing TFBertForTokenClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at jeje01/bert-ner-finetuned-ner.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [29]:
# Test sentence containing a person, an organisation and a location
import pandas as pd

test1 = token_classifier('Dr. Emily Smith, a renowned psychologist and EU researcher, published a groundbreaking study on anxiety disorders in New York City helping thousands of patients cope with their challenges')

test_data = pd.DataFrame(test1)
test_data

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.99737 | Emily Smith | 4 | 15 |
| 1 | ORG | 0.953174 | EU | 45 | 47 |
| 2 | LOC | 0.996969 | New York City | 117 | 130 |
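The `start` and `end` columns are character offsets into the input string, so each entity can be recovered by slicing the original text. The check below uses the rows of the result table copied by hand, so it runs without loading the model.

```python
text = (
    "Dr. Emily Smith, a renowned psychologist and EU researcher, published a "
    "groundbreaking study on anxiety disorders in New York City helping "
    "thousands of patients cope with their challenges"
)

# Rows copied from the pipeline output above (scores omitted)
rows = [
    {"entity_group": "PER", "word": "Emily Smith", "start": 4, "end": 15},
    {"entity_group": "ORG", "word": "EU", "start": 45, "end": 47},
    {"entity_group": "LOC", "word": "New York City", "start": 117, "end": 130},
]

# Slice the input with the reported offsets
recovered = [text[row["start"]:row["end"]] for row in rows]
for row, span in zip(rows, recovered):
    print(row["entity_group"], repr(span))
```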
