<a href="https://colab.research.google.com/github/07Sada/bert-ner/blob/main/notebook/Fine_tuning_BERT_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## First, what is BERT?
![BERT Architecture](https://www.researchgate.net/publication/340295341/figure/fig1/AS:874992090771456@1585625779336/BERT-architecture-1.jpg)

BERT stands for <font color="#d966ff">Bidirectional Encoder Representations from Transformers.</font> The name itself gives us several clues to what BERT is all about.

BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder encapsulates two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- <font color="#80d4ff"> BERT base </font>, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- <font color="#80d4ff"> BERT large </font>, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.
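As a rough cross-check of the two parameter counts, the sketch below adds up the weights of a BERT encoder from its layer sizes. It assumes a vocabulary of 30522 tokens and 512 positions (the values of `bert-base-uncased`) and a feed-forward width of four times the hidden size; the function name `bert_param_count` is made up for this illustration. The number of attention heads does not appear because the heads only split the hidden size between them.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512):
    """Approximate parameter count of a BERT encoder (with pooler, no task head)."""
    intermediate = 4 * hidden  # assumption: feed-forward width is 4x the hidden size
    layer_norm = 2 * hidden    # scale and bias of one LayerNorm

    # Token, position and segment (2 types) embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + 2) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT base : {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out close to the commonly quoted 110M and 340M figures.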

### BERT Input and Output

The BERT model expects a sequence of tokens as input. In each sequence, there are two special tokens that BERT expects:

- <font color="#66ffe0"> [CLS] </font>: This is the first token of every sequence, which stands for classification token.
- <font color="#66ffe0"> [SEP] </font>: This token tells BERT which token belongs to which sequence. It matters most for next sentence prediction and question-answering tasks, where two sequences are fed in together. If we only have one sequence, this token is appended to the end of the sequence.

It is also important to note that <font color="#80ffbf"> the maximum number of tokens that can be fed into the BERT model is 512.</font> If a sequence has fewer than 512 tokens, we can pad the unused slots with the <font color="#99ff99"> [PAD]</font> token. If a sequence has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
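The input rules above can be sketched in plain Python. This is only an illustration of the idea, not what the tokenizer does internally; `prepare_sequence` is a hypothetical helper, and a small `max_len` is used so the result stays readable.

```python
def prepare_sequence(tokens, max_len=512):
    """Add [CLS]/[SEP], then truncate or pad the sequence to exactly max_len tokens."""
    # Reserve two slots for the special tokens before truncating.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding that the model should not attend to.
    attention_mask = [1] * len(sequence)

    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


# A short sequence is padded, a long one is truncated.
short, short_mask = prepare_sequence(["eu", "rejects", "german", "call"], max_len=8)
long, long_mask = prepare_sequence([f"w{i}" for i in range(20)], max_len=8)
print(short, short_mask)
print(long, long_mask)
```

The attention mask is the same kind of list that appears as `attention_mask` in the tokenizer output later in this notebook.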

<font color="#ffffb3"> The BERT model then outputs an embedding vector of size 768 (for BERT base) for each of the tokens.</font> We can use these vectors as an input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named-Entity-Recognition (NER), or question-answering.

------------

**For a text classification task**, we focus on the embedding vector output for the special [CLS] token. We use this vector of size 768 as the input to our classifier, which then outputs a vector whose size equals the number of classes in our classification task.
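To make the difference between sequence classification and token classification concrete, here is a small NumPy sketch. The random array stands in for BERT's output and the classifier weights are random too, so only the shapes are meaningful; `num_classes = 3` is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size, num_classes = 11, 768, 3  # num_classes is a hypothetical value

# Stand-in for BERT's output: one 768-dimensional vector per token.
token_embeddings = rng.normal(size=(seq_len, hidden_size))

# Randomly initialised classifier weights (they would be learned during fine-tuning).
weights = rng.normal(scale=0.02, size=(hidden_size, num_classes))
bias = np.zeros(num_classes)

# Sequence classification: only the [CLS] vector (position 0) is used.
cls_logits = token_embeddings[0] @ weights + bias

# Token classification (used for NER): every token vector gets its own logits.
token_logits = token_embeddings @ weights + bias

print(cls_logits.shape)    # (3,)
print(token_logits.shape)  # (11, 3)
```

For NER we need the second form: one prediction per token rather than one per sentence.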

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

In [None]:
%%capture
# installing the dependencies (%%capture must be the first line of the cell)

!pip install transformers datasets tokenizers seqeval -q

<font color="#66ffe0">transformers </font>--> 🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models.
[official huggingface link](https://huggingface.co/docs/transformers/index)

<font color="#66ffe0">datasets </font>--> 🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
[official huggingface link](https://huggingface.co/docs/datasets/index)

<font color="#66ffe0">tokenizers </font> --> The library contains tokenizers for all the models.
[official huggingface link](https://huggingface.co/docs/transformers/main_classes/tokenizer)

<font color="#66ffe0">seqeval </font> --> seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
[pypi.org link](https://pypi.org/project/seqeval/)




# Token classification

The first application we’ll explore is token classification. This generic task encompasses any problem that can be formulated as “attributing a label to each token in a sentence,” such as:

<font color="#ffb3b3">**Named entity recognition (NER):**</font> Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

<font color="#ff99cc">**Part-of-speech tagging (POS):**</font> Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).

<font color="#d580ff">**Chunking:**</font> Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.

* <font color="#c44dff">O</font> means the word doesn’t correspond to <font color="#c44dff">any entity</font>.

* <font color="#80ffd4">B-PER/I-PER</font> means the word corresponds to the beginning of/is inside a <font color="#80ffd4">person entity.</font>

* <font color="#ffcc99">B-ORG/I-ORG</font> means the word corresponds to the beginning of/is inside an <font color="#ffcc99">organization entity.</font>

* <font color="#ff9933">B-LOC/I-LOC</font> means the word corresponds to the beginning of/is inside a <font color="#ff9933">location entity.</font>

* <font color="#33d6ff">B-MISC/I-MISC</font> means the word corresponds to the beginning of/is inside a <font color="#33d6ff">miscellaneous entity.</font>
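The B-/I-/O labels listed above can be turned back into whole entities with a few lines of Python. This is a minimal sketch; `extract_entities` and the example sentence are made up for illustration.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with nothing to continue) closes the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Bill", "Gates", "founded", "Microsoft", "in", "New", "Mexico"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
entities = extract_entities(tokens, tags)
print(entities)
# [('PER', 'Bill Gates'), ('ORG', 'Microsoft'), ('LOC', 'New Mexico')]
```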

----


[Tutorial/course about Token classification on huggingface](https://huggingface.co/learn/nlp-course/chapter7/2)

----

In [None]:
# importing the libraries
import datasets
import numpy as np
from transformers import BertTokenizerFast
## Construct a “fast” BERT tokenizer (backed by HuggingFace’s tokenizers library). Based on WordPiece.
# This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods.
# link --> https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/bert#transformers.BertTokenizerFast

from transformers import DataCollatorForTokenClassification
# DataCollatorForTokenClassification is a utility class that helps in preparing data for token classification tasks,
# such as Named Entity Recognition (NER) and Part-of-Speech Tagging (POS). It streamlines the process of batching and padding tokenized data,
# making it easier to train token classification models.
## link--> https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/data_collator#transformers.DataCollatorForTokenClassification

from transformers import AutoModelForTokenClassification
# This is a generic model class that will be instantiated as one of the model classes of the library
# (with a token classification head) when created with the from_pretrained() class method or the from_config() class method.
# link--> https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/auto#transformers.AutoModelForTokenClassification

In [None]:
# downloading the dataset
data = datasets.load_dataset("conll2003")




<font color="#ff3385">**Dataset Summary**</font>
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note that the Hugging Face version of the dataset uses the IOB2 tagging scheme, in which every phrase starts with a B-TYPE tag, whereas the original dataset uses the IOB1 scheme described above.
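The difference between the two tagging schemes is easiest to see in code. The sketch below converts IOB1 tags to IOB2; `iob1_to_iob2` is a hypothetical helper written for this illustration and the tag sequence is made up.

```python
def iob1_to_iob2(tags):
    """Convert IOB1 tags to IOB2, where every phrase starts with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # In IOB1 an I- tag starts a phrase when the previous tag is O
            # or belongs to a different entity type.
            starts_phrase = previous == "O" or previous[2:] != entity_type
            converted.append(f"B-{entity_type}" if starts_phrase else tag)
        else:
            # B- and O tags mean the same thing in both schemes.
            converted.append(tag)
        previous = tag
    return converted


iob1 = ["I-ORG", "O", "I-MISC", "O", "I-PER", "I-PER", "B-PER", "I-LOC"]
iob2 = iob1_to_iob2(iob1)
print(iob2)
# ['B-ORG', 'O', 'B-MISC', 'O', 'B-PER', 'I-PER', 'B-PER', 'B-LOC']
```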

[dataset_link](https://huggingface.co/datasets/conll2003)

In [None]:
print(f"Shape of the dataset\n{data.shape}")

Shape of the dataset
{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}


In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [None]:
# checking one sample from the training set
data["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
# break down a single datapoint
sample = data['train'][0]
for key, value in sample.items():
  print(key, "==>", value, end='\n\n')

id ==> 0

tokens ==> ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

pos_tags ==> [22, 42, 16, 21, 35, 37, 16, 21, 7]

chunk_tags ==> [11, 21, 11, 12, 21, 22, 11, 12, 0]

ner_tags ==> [3, 0, 7, 0, 0, 0, 7, 0, 0]



In [None]:
# ner_tags available in the dataset
data['train'].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [None]:
# instantiates a fast version of the BERT tokenizer from the Hugging Face Transformers library, using the pre-trained "bert-base-uncased" model.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Problem of consecutive subwords.



>  <font color="#ff9966">Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer.</font>



> <font color="#009900">This means that we need to do some processing on our labels, as the input ids returned by the tokenizer are longer than the lists of labels our dataset contains.</font>

This happens for two reasons: first, some special tokens are added (a [CLS] at the start and a [SEP] at the end), and second, a single word can be split into multiple tokens.


----

 Strategy to handle the above - <font color="#ff66ff">Here we set the labels of all special tokens to **-100**</font> <font color="#00ace6">(the index that is ignored by PyTorch)</font> and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word. Both strategies are supported here; switch between them with the `label_all_tokens` flag of `tokenize_and_align_labels()`.

## The cell below only inspects some intermediate variables before we apply `tokenize_and_align_labels()`

In [None]:
example_text = data['train'][0]

tokenized_input = tokenizer(example_text['tokens'], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])

word_ids = tokenized_input.word_ids()

print(word_ids)

'''
As we can see, it returns a list with the same number of elements as our processed input ids,
mapping special tokens to None and all the other tokens to their respective word.
This way, we can align the labels with the processed input ids.
'''
tokenized_input

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]


{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Problem of sub-tokens - the input ids returned by the tokenizer are longer than the lists of labels our dataset contains.

In [None]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])

(9, 11)

## The below function `tokenize_and_align_labels` does 2 jobs

1. Set -100 as the label for the special tokens and for the subwords we wish to mask during training
2. Mask the subword representations after the first subword (only when `label_all_tokens=False`)


### Then we align the labels with the token ids using the strategy we picked:

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    """
    Function to tokenize and align labels with respect to the tokens. This function is specifically designed for
    Named Entity Recognition (NER) tasks where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A dictionary containing the tokens and the corresponding NER tags.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.

    label_all_tokens (bool): A flag to indicate whether all tokens should have labels.
                             If False, only the first token of a word will have a label,
                             the other tokens (subwords) corresponding to the same word will be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """

    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    labels = []

    for i, label in enumerate(examples['ner_tags']):
        # word_ids() returns a list that maps each token to the index of the word
        # it came from in the original sentence (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] map to None. Label them -100
                # so that they are automatically ignored by the loss function.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use the label of that word.
                label_ids.append(label[word_idx])
            else:
                # Later sub-token of the same word: keep the word's label if
                # label_all_tokens is True, otherwise mask it with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs
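
The alignment logic can also be tried without a tokenizer. The sketch below applies the same rules to a hand-written `word_ids` list; `align_labels` is a hypothetical stand-alone helper, and the `word_ids` values assume that "Zwingmann" is split into three sub-tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level, using -100 for ignored positions."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token such as [CLS] or [SEP]
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first sub-token of a word
        else:
            # later sub-token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Assumed tokenization of ["Werner", "Zwingmann", "said"]:
# [CLS] werner z ##wing ##mann said [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 1, 2, 2, 2, 0, -100]
print(first_only)  # [-100, 1, 2, -100, -100, 0, -100]
```

With `label_all_tokens=True` every sub-token contributes to the loss; with `False` only the first sub-token of each word does.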


In [None]:
q = tokenize_and_align_labels(data['train'][4:5])
print(q)

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}


### So before applying the `tokenize_and_align_labels()` the `tokenized_input` has 3 keys
- input_ids
- token_type_ids
- attention_mask

But after applying `tokenize_and_align_labels()` we have an extra key - `'labels'`

----

In [None]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q['input_ids'][0]), q['labels'][0]):
  print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___

In [None]:
tokenized_dataset = data.map(tokenize_and_align_labels, batched=True)



In [None]:
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
%%capture

!pip install transformers[torch] -U

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
data_collator

DataCollatorForTokenClassification(tokenizer=BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), padding=True, max_length=None, pad_to_multiple_of=None, label_pad_token_id=-100, return_tensors='pt')
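
The settings printed above (`padding=True`, `padding_side='right'`, `label_pad_token_id=-100`) describe how a batch is padded. The plain-Python sketch below mimics that behaviour for lists; it is an illustration, not the collator's actual code. `pad_batch` is a hypothetical helper and a `[PAD]` id of 0 is assumed.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded positions get -100 so that the loss ignores them.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


# Two made-up examples of different lengths.
features = [
    {"input_ids": [101, 7327, 19164, 102], "labels": [-100, 3, 0, -100]},
    {"input_ids": [101, 2446, 102], "labels": [-100, 7, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Padding per batch rather than to a fixed 512 tokens keeps the tensors as small as the longest sentence in each batch allows.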

In [None]:
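# Note: load_metric is deprecated in recent versions of datasets;
# evaluate.load("seqeval") from the evaluate library is the replacement.
metric = datasets.load_metric("seqeval")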


In [None]:
example=data['train'][0]

In [None]:
label_list = data['train'].features['ner_tags'].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
labels = [label_list[i] for i in example['ner_tags']]

metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
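
The scores above are all 1.0 because the labels are compared with themselves. To show how entity-level scoring reacts to a mistake, here is a simplified sketch that counts an entity as correct only when its type and boundaries both match. It is a rough approximation written for this notebook, not seqeval's implementation; `get_spans` and `entity_scores` are hypothetical helpers.

```python
def get_spans(tags):
    """Return the set of (entity_type, start, end) spans in an IOB2 tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_scores(references, predictions):
    """Entity-level precision, recall and F1 over a list of sentences."""
    n_correct = n_pred = n_true = 0
    for ref, pred in zip(references, predictions):
        true_spans, pred_spans = get_spans(ref), get_spans(pred)
        n_correct += len(true_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Reference tags of the first training sentence; the prediction misses one MISC entity.
reference = [["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]]
prediction = [["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]]
scores = entity_scores(reference, prediction)
print(scores)
```

Here 8 of 9 tokens are tagged correctly, yet recall drops to 2/3 because one of the three entities is missed. This is why entity-level F1 is usually reported alongside token accuracy.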

## seqeval expects lists of lists

The seqeval package expects the predictions and labels as lists of lists, with
each list corresponding to a single example in our validation or test sets. To
integrate these metrics during training, we need a function that can take the
outputs of the model and convert them into the lists that seqeval expects.

The following does the trick by ensuring we ignore the label IDs associated with
subsequent subwords:

## Compute Metrics

This compute_metrics() function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the metric.compute() method:

In [None]:
def compute_metrics(eval_preds):
  """
    Function to compute the evaluation metrics for Named Entity Recognition (NER) tasks.
    The function computes precision, recall, F1 score and accuracy.

    Parameters:
    eval_preds (tuple): A tuple containing the predicted logits and the true labels.

    Returns:
    A dictionary containing the precision, recall, F1 score and accuracy.

  """

  pred_logits, labels = eval_preds

  # the logits and the probabilities are in the same order,
  # so we don't need to apply the softmax before taking the argmax
  pred_ids = np.argmax(pred_logits, axis=2)

  # we remove all the values where the label is -100
  predictions = [
      [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
      for prediction, label in zip(pred_ids, labels)
  ]

  true_labels = [
      [label_list[l] for l in label if l != -100]
      for label in labels
  ]

  results = metric.compute(predictions=predictions, references=true_labels)
  return {
      "precision":results["overall_precision"],
      "recall":results["overall_recall"],
      "f1":results["overall_f1"],
      "accuracy":results["overall_accuracy"]
  }
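
The two key steps of `compute_metrics()` can be checked on a tiny made-up batch. The logits and labels below are invented for illustration; the sketch only shows that softmax does not change the argmax and how the -100 positions are dropped.

```python
import numpy as np

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Made-up logits for one sentence of 4 tokens: shape (batch, seq_len, num_labels).
logits = np.zeros((1, 4, len(label_list)))
logits[0, 1, 3] = 5.0  # token 1 looks like B-ORG
logits[0, 2, 0] = 4.0  # token 2 looks like O
labels = np.array([[-100, 3, 0, -100]])  # [CLS] and [SEP] carry -100

# Softmax is monotonic, so it never changes which label scores highest.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
assert (logits.argmax(axis=-1) == probs.argmax(axis=-1)).all()

pred_ids = logits.argmax(axis=-1)
predictions = [
    [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(pred_ids, labels)
]
references = [
    [label_list[l] for l in label_row if l != -100]
    for label_row in labels
]
print(predictions)  # [['B-ORG', 'O']]
print(references)   # [['B-ORG', 'O']]
```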

### `predictions` is a long nested list of label strings, like the one below

```
[['O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['B-LOC', 'O', 'O', 'O', 'O', 'O'], ['B-MISC', 'I-MISC', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O', ['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'B-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'B-ORG', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],

---

---

, ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

```

In [None]:
# to save the model on huggingface repo
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
  "BERT-ner",
  evaluation_strategy = "epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  num_train_epochs=3,
  weight_decay=0.01,
  push_to_hub=True
  )

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/Sadashiv/BERT-ner into local empty directory.


In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0252 | 0.06516 | 0.941413 | 0.94194 | 0.941676 | 0.9854 |
| 2 | 0.0121 | 0.061535 | 0.94072 | 0.949771 | 0.945224 | 0.986719 |
| 3 | 0.0079 | 0.066387 | 0.944901 | 0.951561 | 0.948219 | 0.987243 |


TrainOutput(global_step=2634, training_loss=0.014319271052888423, metrics={'train_runtime': 659.5426, 'train_samples_per_second': 63.867, 'train_steps_per_second': 3.994, 'total_flos': 1024113336121080.0, 'train_loss': 0.014319271052888423, 'epoch': 3.0})
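
The numbers in the `TrainOutput` above can be cross-checked with a little arithmetic. The sketch assumes training on a single device without gradient accumulation, so one step processes exactly one batch of 16 examples.

```python
import math

train_examples = 14041    # size of the training split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 659.5426  # seconds, from the TrainOutput above

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 878 2634

steps_per_second = total_steps / train_runtime
samples_per_second = train_examples * epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The total of 2634 steps matches `global_step`, and the two rates agree with `train_steps_per_second` and `train_samples_per_second`.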

In [None]:
trainer.push_to_hub("Training Completed")

To https://huggingface.co/Sadashiv/BERT-ner
   a41b61d..1c00547  main -> main

To https://huggingface.co/Sadashiv/BERT-ner
   1c00547..4fb06ea  main -> main



'https://huggingface.co/Sadashiv/BERT-ner/commit/1c005473d2acf11bfa086dfe956bd4c518602f01'

In [None]:
# class ids are integers in the model config; they only become strings when written to JSON
model.config.id2label = {i: label for i, label in enumerate(label_list)}
model.config.label2id = {label: i for i, label in enumerate(label_list)}

In [None]:
repo_name = "Sadashiv/BERT-ner"
model.config.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/Sadashiv/BERT-ner/commit/3b2db08d1bbd80d2c8e2c0c8b3b5b77a04dc06a8', commit_message='Upload config', commit_description='', oid='3b2db08d1bbd80d2c8e2c0c8b3b5b77a04dc06a8', pr_url=None, pr_revision=None, pr_num=None)

## Saving the model and tokenizer Locally

In [None]:
model.save_pretrained("ner_model")

In [None]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [None]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}
label2id = {
    label: str(i) for i,label in enumerate(label_list)
}

In [None]:
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [None]:
label2id

{'O': '0',
 'B-PER': '1',
 'I-PER': '2',
 'B-ORG': '3',
 'I-ORG': '4',
 'B-LOC': '5',
 'I-LOC': '6',
 'B-MISC': '7',
 'I-MISC': '8'}

In [None]:
import json

In [None]:
# use a context manager so the file is closed after reading
with open("ner_model/config.json") as f:
    config = json.load(f)

In [None]:
config["id2label"] = id2label
config["label2id"] = label2id

In [None]:
# use a context manager so the file is flushed and closed after writing
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)
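
A side note on why the ids end up as strings in `config.json`: JSON object keys are always strings, so integer keys are converted when the file is written. The short sketch below demonstrates this with a shortened, made-up label list.

```python
import json

label_list = ["O", "B-PER", "I-PER"]  # shortened label list, for illustration only
id2label = {i: label for i, label in enumerate(label_list)}

# JSON object keys are always strings, so integer keys are converted on the way out.
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
print(restored)  # {'0': 'O', '1': 'B-PER', '2': 'I-PER'}

# Converting the keys back to integers recovers the original mapping.
id2label_int = {int(k): v for k, v in restored.items()}
print(id2label_int == id2label)  # True
```

When a mapping read from JSON is indexed with a predicted class id, the keys need this conversion back to integers first.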

In [None]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")

In [None]:
from transformers import pipeline

In [None]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)


example = "Bill Gates is the Founder of Microsoft"

ner_results = nlp(example)

print(ner_results)

[{'entity': 'B-PER', 'score': 0.9955214, 'index': 1, 'word': 'bill', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.9952893, 'index': 2, 'word': 'gates', 'start': 5, 'end': 10}, {'entity': 'B-ORG', 'score': 0.993664, 'index': 7, 'word': 'microsoft', 'start': 29, 'end': 38}]


In [None]:
ner_results

[{'entity': 'B-PER',
  'score': 0.9955214,
  'index': 1,
  'word': 'bill',
  'start': 0,
  'end': 4},
 {'entity': 'I-PER',
  'score': 0.9952893,
  'index': 2,
  'word': 'gates',
  'start': 5,
  'end': 10},
 {'entity': 'B-ORG',
  'score': 0.993664,
  'index': 7,
  'word': 'microsoft',
  'start': 29,
  'end': 38}]
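
The pipeline returns one entry per token, so "bill" and "gates" appear separately. The sketch below merges consecutive B-/I- predictions into whole entities, using the output shown above as input. `group_entities` is a hypothetical helper that assumes the results contain only B- and I- tags; the pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping.

```python
def group_entities(results, text):
    """Merge consecutive B-/I- predictions of the same type into whole entities."""
    grouped = []
    for item in results:
        prefix, entity_type = item["entity"].split("-", 1)
        previous = grouped[-1] if grouped else None
        continues = (
            prefix == "I"
            and previous is not None
            and previous["entity_group"] == entity_type
            and item["index"] == previous["last_index"] + 1
        )
        if continues:
            # Extend the previous entity to cover this token.
            previous["end"] = item["end"]
            previous["last_index"] = item["index"]
        else:
            grouped.append({
                "entity_group": entity_type,
                "start": item["start"],
                "end": item["end"],
                "last_index": item["index"],
            })
    # Recover the original casing from the input text and drop the helper field.
    return [
        {"entity_group": g["entity_group"], "word": text[g["start"]:g["end"]],
         "start": g["start"], "end": g["end"]}
        for g in grouped
    ]


example = "Bill Gates is the Founder of Microsoft"
ner_results = [
    {"entity": "B-PER", "score": 0.9955214, "index": 1, "word": "bill", "start": 0, "end": 4},
    {"entity": "I-PER", "score": 0.9952893, "index": 2, "word": "gates", "start": 5, "end": 10},
    {"entity": "B-ORG", "score": 0.993664, "index": 7, "word": "microsoft", "start": 29, "end": 38},
]
entities = group_entities(ner_results, example)
print(entities)
```

Slicing the original text with the `start` and `end` offsets restores the casing that the uncased tokenizer removed.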