# **Hugging Face Transformer Model**
- Download a **BERT** **transformer**-based model from Hugging Face.
- Predict **NER (Named Entity Recognition)** tags in the data using the pretrained model.
- Fine-tune the pretrained model on the **conll2003** dataset.
- Run validation (evaluation) during fine-tuning.
- Use the fine-tuned model for inference on new data to identify named entities in it.
- **CoNLL-2003** is a **named entity recognition** dataset released as part of the CoNLL-2003 shared task: language-independent named entity recognition.
- The model used here is **BERT**.

In [None]:
# Install the dependencies (uncomment and run once per environment).
# Note: the package is named "tokenizers", not "tokenizer".
# !pip install transformers datasets tokenizers seqeval -q
# !pip install transformers[torch]

In [None]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

## Downloading the Hugging Face **conll2003** NER data

- **CoNLL-2003** is a **named entity recognition** dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition.



In [None]:
conll2003 = datasets.load_dataset("conll2003")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

## Dataset features: id, tokens, pos_tags, chunk_tags, ner_tags
- The dataset has three splits: train, validation, and test.
- **id** is a unique identifier for each row, like a row number.
- **tokens** holds the actual text as a list of words.
- **pos_tags** holds the part-of-speech (POS) tag of each token.
- **chunk_tags** holds the syntactic chunk tag of each token.
- **ner_tags** holds the named entity tag of each token.

In [None]:
conll2003["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [None]:
conll2003["train"][0]
# First row of the training data

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
conll2003["train"][1]

{'id': '1',
 'tokens': ['Peter', 'Blackburn'],
 'pos_tags': [22, 22],
 'chunk_tags': [11, 12],
 'ner_tags': [1, 2]}

In [None]:
conll2003["train"][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [None]:
conll2003["train"][0]['ner_tags']

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [None]:
conll2003["train"][14040]

{'id': '14040',
 'tokens': ['Swansea', '1', 'Lincoln', '2'],
 'pos_tags': [21, 11, 22, 11],
 'chunk_tags': [11, 12, 12, 12],
 'ner_tags': [3, 0, 3, 0]}

In [None]:
conll2003["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [None]:
# These are the distinct NER tags, which will be our output labels
conll2003["train"].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
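The integer `ner_tags` are indices into the `names` list shown above. As a minimal sketch (plain Python, with the names and the first training row copied from the outputs in this notebook), the ids can be decoded like this; with `datasets` installed, `ClassLabel.int2str` should give the same mapping.

```python
# Label names copied from the ClassLabel output above.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training example, copied from the earlier cell output.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair each token with the name of its tag.
decoded = [(token, label_names[tag]) for token, tag in zip(tokens, ner_tags)]
for token, name in decoded:
    print(f"{token:<10} {name}")
```

Here "EU" decodes to B-ORG and "German" and "British" decode to B-MISC; every other token is O.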

In [None]:
# The id feature is a plain string value
conll2003["train"].features['id']

Value(dtype='string', id=None)

In [None]:
# This gives the description of the dataset, which is NER-specific
conll2003["train"].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

## Use the tokenizer of the pretrained model **bert-base-uncased**
- "Uncased" means the model was trained on lowercased text, so "English" and "english" are treated as the same token.
- BERT base model (uncased) is pretrained on English text using a **masked language modeling (MLM)** objective.
- **BertTokenizerFast** works in both directions: it converts text to tokens and ids, and ids back to tokens and text.
- **BertTokenizerFast.convert_ids_to_tokens** converts ids to tokens, while **BertTokenizerFast.convert_tokens_to_string** joins a list of tokens into a single string.

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [None]:
conll2003['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
#Taking 1 row from dataset
example_text = conll2003['train'][0]
example_text

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
example_text["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

### Apply the BERT tokenizer to one row for testing
- It converts each token to an integer id, based on the vocabulary of BERT's tokenizer.
- The tokenizer also adds two special tokens to every sequence: [CLS] (id 101) at the start and [SEP] (id 102) at the end. They carry no entity information.

In [None]:
tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)

In [None]:
tokenized_input["input_ids"]

[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]

In [None]:
tokenizer(['Prabha'], is_split_into_words=True)

{'input_ids': [101, 10975, 7875, 3270, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [None]:
tokenized_input

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Apply the BERT tokenizer's convert_ids_to_tokens to one row for testing
- It converts integer ids back to token strings, based on the vocabulary of BERT's tokenizer.

In [None]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

In [None]:
tokens

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']

In [None]:
word_ids = tokenized_input.word_ids()

print(word_ids)

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]


In [None]:
example_text["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [None]:
# Print each label (ner_tag) together with its position in the sentence
for i, label in enumerate(example_text["ner_tags"]):
  print(i,label)

0 3
1 0
2 7
3 0
4 0
5 0
6 7
7 0
8 0


## Define Function **tokenize_and_align_labels**
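- This function applies BERT's tokenizer to a batch of examples and converts every word into one or more token ids.
- The tokenizer adds [CLS] at the start and [SEP] at the end of every sequence, so the labels need two extra entries. The function assigns the dummy label -100 to both, which makes the loss function ignore them.
- When a word is split into several subword tokens, every subword receives the word's label if label_all_tokens is True. Otherwise only the first subword keeps the label and the rest get -100.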

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    # Tokenize a batch of pre-split sentences (lists of words).
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []

    for i, label in enumerate(examples["ner_tags"]):
        # word_ids() maps each token to the index of the word it came from
        # in the original sentence; special tokens map to None.
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP]) get -100 so that they are
                # automatically ignored by the loss function.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use that word's label.
                label_ids.append(label[word_idx])
            else:
                # Remaining subword tokens of the same word: repeat the word's
                # label if label_all_tokens is True, otherwise mask with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
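To see what the `label_all_tokens` flag changes, the alignment loop can be run on its own without a tokenizer. This is a minimal sketch: `align_labels` is a hypothetical helper that mirrors the inner loop above, and the `word_ids` list is written by hand for an assumed token sequence "[CLS] werner z ##wing ##mann said [SEP]".

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token-level labels (same logic as the loop above)."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)                      # special token
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])     # first token of a word
        else:
            # later subword of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned

# Hand-written mapping: word 0 = "werner", word 1 = "zwingmann" (3 subwords), word 2 = "said".
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 1, 2, 2, 2, 0, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 1, 2, -100, -100, 0, -100]
```

With `label_all_tokens=True` every subword contributes to the loss and to the evaluation metrics; with `False` only the first subword of each word does.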

In [None]:
conll2003["train"][4:5]

{'id': ['4'],
 'tokens': [['Germany',
   "'s",
   'representative',
   'to',
   'the',
   'European',
   'Union',
   "'s",
   'veterinary',
   'committee',
   'Werner',
   'Zwingmann',
   'said',
   'on',
   'Wednesday',
   'consumers',
   'should',
   'buy',
   'sheepmeat',
   'from',
   'countries',
   'other',
   'than',
   'Britain',
   'until',
   'the',
   'scientific',
   'advice',
   'was',
   'clearer',
   '.']],
 'pos_tags': [[22,
   27,
   21,
   35,
   12,
   22,
   22,
   27,
   16,
   21,
   22,
   22,
   38,
   15,
   22,
   24,
   20,
   37,
   21,
   15,
   24,
   16,
   15,
   22,
   15,
   12,
   16,
   21,
   38,
   17,
   7]],
 'chunk_tags': [[11,
   11,
   12,
   13,
   11,
   12,
   12,
   11,
   12,
   12,
   12,
   12,
   21,
   13,
   11,
   12,
   21,
   22,
   11,
   13,
   11,
   1,
   13,
   11,
   17,
   11,
   12,
   12,
   21,
   1,
   0]],
 'ner_tags': [[5,
   0,
   0,
   0,
   0,
   3,
   4,
   0,
   0,
   0,
   1,
   2,
   0,
   0,
   0,
   0,
   0,


In [None]:
q=tokenize_and_align_labels(conll2003["train"][4:5])
q
# The function converts each word into one or more integer token ids.
# "labels" holds the NER label of each token, with -100 at the [CLS] and [SEP] positions.
# These two entries are needed because the tokenizer adds the two special tokens.

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}

In [None]:
# Use convert_ids_to_tokens to turn each id back into its token and print it next to its NER label
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___
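The output above shows "Zwingmann" split into `z`, `##wing`, `##mann`. BERT's WordPiece tokenizer splits an unknown word greedily, always taking the longest vocabulary entry that matches at the current position and marking non-initial pieces with `##`. The following is a simplified sketch of that idea; `wordpiece` is a hypothetical helper and `toy_vocab` is a made-up vocabulary, not the real bert-base-uncased one.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary for illustration only.
toy_vocab = {"werner", "z", "##w", "##wing", "##mann", "sheep"}

print(wordpiece("werner", toy_vocab))     # ['werner']
print(wordpiece("zwingmann", toy_vocab))  # ['z', '##wing', '##mann']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

Note that `##wing` wins over the shorter `##w` because the longest match is tried first.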

### Applying the function above to the entire dataset
- Here we preprocess the dataset into the input format that the pretrained model expects.

In [None]:
## Applying on entire data
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

In [None]:
tokenized_datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'input_ids': [101,
  7327,
  19164,
  2446,
  2655,
  2000,
  17757,
  2329,
  12559,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]}

In [None]:
# Original data, before tokenization
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

#### **Conclusion**
- The function **tokenize_and_align_labels** added **'input_ids'** (the token ids) and **'labels'** (the NER labels, with -100 at the beginning and at the end) to the original data, and the tokenizer also contributed **'token_type_ids'** and **'attention_mask'**. This is the format needed to fine-tune the existing pretrained model.
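The -100 entries matter because PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100. The sketch below reproduces that behaviour in plain Python to show that the scores at ignored positions have no effect; `masked_cross_entropy` is a hypothetical helper and the logits are made-up numbers.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special token or masked subword: contributes nothing
        # Numerically stable log-sum-exp of the scores.
        top = max(token_logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in token_logits))
        total += log_sum_exp - token_logits[label]
        count += 1
    return total / count if count else 0.0

# Three positions ([CLS], one real token, [SEP]) and three classes, all values invented.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, 9.0, 9.0]]
labels = [-100, 1, -100]

loss = masked_cross_entropy(logits, labels)
loss_real_token_only = masked_cross_entropy([logits[1]], [labels[1]])
print(loss, loss_real_token_only)  # the two values are identical
```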

## Initialize the pretrained **bert-base-uncased** model for token classification using the class **AutoModelForTokenClassification**

In [None]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased",num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Use **TrainingArguments** to fine-tune the pretrained model

- Here we define all the hyperparameters, including the number of epochs (1).

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
# Initialize the data collator with BERT's tokenizer; it pads each batch to a common length
data_collator=DataCollatorForTokenClassification(tokenizer)
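Sentences in a batch have different lengths, so the collator pads them to the longest one. The sketch below shows the idea in plain Python, assuming a pad token id of 0 (the `pad_token_id` in this model's config) and a label padding value of -100 so that padded positions are ignored by the loss. `pad_batch` is a hypothetical helper, and the second feature is an invented short example.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)  # 0 = padding
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch

features = [
    # First training row, copied from the tokenized output above.
    {"input_ids": [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102],
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]},
    # Invented one-word example ("germany" labeled B-LOC) to force padding.
    {"input_ids": [101, 2762, 102], "labels": [-100, 5, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
print(batch["labels"][1])
```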

In [None]:
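args = TrainingArguments(
    "test-ner",  # output directory where checkpoints of the fine-tuned model are written
    evaluation_strategy="epoch",  # evaluate after every epoch (newer transformers releases name this eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
)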

### Define the evaluation metric logic - **compute_metrics**

In [None]:
def compute_metrics(eval_preds):
    # label_list and metric are defined in the cells below and must exist
    # before training starts.
    pred_logits, labels = eval_preds

    # The largest logit is also the largest probability,
    # so we don't need to apply the softmax before taking the argmax.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100 (special tokens and padding)
    # and convert the remaining ids to label names.
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
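The two preprocessing steps inside `compute_metrics` (argmax over the class scores, then dropping the -100 positions) can be tried on a tiny hand-made batch. This is a plain-Python sketch without NumPy; `decode_batch` and `peaked_scores` are hypothetical helpers and the scores are invented.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def peaked_scores(best_index, n_classes=9):
    """Invented logits: 5.0 for one class, 0.0 for all others."""
    return [5.0 if i == best_index else 0.0 for i in range(n_classes)]

def decode_batch(logits, labels):
    """Return (predictions, references) as label names, skipping -100 positions."""
    predictions, references = [], []
    for sent_logits, sent_labels in zip(logits, labels):
        sent_preds, sent_refs = [], []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == -100:
                continue
            sent_preds.append(label_list[argmax(token_logits)])
            sent_refs.append(label_list[gold])
        predictions.append(sent_preds)
        references.append(sent_refs)
    return predictions, references

# One sentence: [CLS], two real tokens, [SEP].
logits = [[peaked_scores(0), peaked_scores(3), peaked_scores(7), peaked_scores(0)]]
labels = [[-100, 3, 0, -100]]

predictions, references = decode_batch(logits, labels)
print(predictions)  # [['B-ORG', 'B-MISC']]
print(references)   # [['B-ORG', 'O']]
```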

### Get the Label List

- Which is used inside **compute_metrics** function

In [None]:
label_list = conll2003["train"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
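# load_metric is deprecated in recent datasets releases;
# evaluate.load("seqeval") from the evaluate package is its replacement.
metric=datasets.load_metric("seqeval")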

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
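seqeval scores whole entities rather than individual tokens: a predicted entity counts as correct only if its type and its span both match. The sketch below is a simplified re-implementation of that idea for well-formed IOB2 tags, not seqeval's actual code; `extract_entities` and `entity_prf` are hypothetical helpers and the tag sequences are invented.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one IOB2 tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes the last span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if continues:
            continue
        if ent_type is not None:
            entities.add((ent_type, start, i))
        if tag == "O":
            start, ent_type = None, None
        else:
            start, ent_type = i, tag[2:]  # B-X, or an I-X that starts a new span
    return entities

def entity_prf(true_seqs, pred_seqs):
    """Entity-level precision, recall and F1 over a list of sentences."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_seqs = [["B-PER", "I-PER", "O", "B-ORG"]]
pred_seqs = [["B-PER", "I-PER", "O", "B-LOC"]]
print(entity_prf(true_seqs, pred_seqs))  # (0.5, 0.5, 0.5)
```

In this example three of four tokens are tagged correctly, yet only one of the two entities matches, which is why entity-level F1 is usually lower than token accuracy.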


## Fine tune the model

### Trainer: fine-tunes the pretrained model with our preprocessed data and hyperparameters

In [None]:
# Define the Trainer with the model, hyperparameters, datasets, data collator and tokenizer

trainer=Trainer(
   model,  # Pretrained model initialized with AutoModelForTokenClassification
   args,  # All the hyperparameters
   train_dataset=tokenized_datasets["train"], # Our preprocessed/tokenized training data
   eval_dataset=tokenized_datasets["validation"], # Our preprocessed validation data
   data_collator=data_collator, # Pads each batch of inputs and labels
   tokenizer=tokenizer, # The tokenizer used for preprocessing
   compute_metrics=compute_metrics # Evaluation metrics computed at each evaluation
)

In [None]:
#Train the model / Fine tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0525 | 0.061754 | 0.921638 | 0.936794 | 0.929154 | 0.983526 |


TrainOutput(global_step=878, training_loss=0.051343681057384724, metrics={'train_runtime': 172.9966, 'train_samples_per_second': 81.163, 'train_steps_per_second': 5.075, 'total_flos': 342221911376202.0, 'train_loss': 0.051343681057384724, 'epoch': 1.0})
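The `global_step=878` in the output can be checked by hand. As a small worked example, assuming a single device and no gradient accumulation, the number of optimizer steps is the number of batches per epoch times the number of epochs:

```python
import math

num_train_rows = 14041   # size of the train split shown earlier
batch_size = 16          # per_device_train_batch_size
num_epochs = 1           # num_train_epochs

# The last batch is smaller than 16 but still counts as one step.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 878
```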

## Save the fine-tuned model as **ner_model** and the tokenizer as **tokenizer** locally

In [None]:
#Save the model
model.save_pretrained("ner_model")

In [None]:
#Save the Tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

## Update **id2label** and **label2id** in config.json
- Saving the fine-tuned model creates a config.json, but it does not contain the actual label names; it only has the placeholders LABEL_0, LABEL_1, and so on. Here we add the label mapping in both directions (number --> string and string --> number).

In [None]:
import json

In [None]:
with open("/content/ner_model/config.json") as f:
    config = json.load(f)
config

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'LABEL_0',
  '1': 'LABEL_1',
  '2': 'LABEL_2',
  '3': 'LABEL_3',
  '4': 'LABEL_4',
  '5': 'LABEL_5',
  '6': 'LABEL_6',
  '7': 'LABEL_7',
  '8': 'LABEL_8'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'LABEL_0': 0,
  'LABEL_1': 1,
  'LABEL_2': 2,
  'LABEL_3': 3,
  'LABEL_4': 4,
  'LABEL_5': 5,
  'LABEL_6': 6,
  'LABEL_7': 7,
  'LABEL_8': 8},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.35.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

In [None]:
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
## Map the model output (Number) --> Text
id2label = {
    str(i): label for i,label in enumerate(label_list)
}
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [None]:
## Map the text label --> model output (number)
# The ids are kept as integers, matching the label2id values in the saved config.
label2id = {
    label: i for i,label in enumerate(label_list)
}
label2id

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

In [None]:
config["id2label"] = id2label

In [None]:
config["label2id"] = label2id

In [None]:
config

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'O',
  '1': 'B-PER',
  '2': 'I-PER',
  '3': 'B-ORG',
  '4': 'I-ORG',
  '5': 'B-LOC',
  '6': 'I-LOC',
  '7': 'B-MISC',
  '8': 'I-MISC'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6,
  'B-MISC': 7,
  'I-MISC': 8},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.35.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

In [None]:
# Write the updated id2label and label2id back to config.json
with open("/content/ner_model/config.json", "w") as f:
    json.dump(config, f)
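One detail worth knowing when editing the config by hand: JSON object keys are always strings, while values keep their type. The sketch below round-trips the two mappings through JSON in memory (no file is touched) to show the effect; the variable names are illustrative only.

```python
import json

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = {i: label for i, label in enumerate(label_list)}   # int keys
label2id = {label: i for i, label in enumerate(label_list)}   # int values

# Serialize and parse again, as happens when the config is saved and reloaded.
restored = json.loads(json.dumps({"id2label": id2label, "label2id": label2id}))

print(list(restored["id2label"])[:3])   # ['0', '1', '2']  (keys became strings)
print(restored["label2id"]["B-PER"])    # 1                (values stayed integers)

# Convert the keys back to integers when they are needed as class indices.
id2label_int = {int(key): value for key, value in restored["id2label"].items()}
print(id2label_int[3])                  # B-ORG
```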

## Load the fine-tuned model **ner_model** from local disk

In [None]:
model_fine_tuned=AutoModelForTokenClassification.from_pretrained("ner_model")

In [None]:
model_fine_tuned

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

## Use the **transformers pipeline** to run inference with the fine-tuned model

In [None]:
from transformers import pipeline

In [None]:
#Initialize the pipeline
nlp_pipeline=pipeline("ner", # Task we need to define for pipeline, this is ner task now
                      model=model_fine_tuned, #Fine tuned model
                      tokenizer=tokenizer #Tokenizer of that fine tuned model
                      )

In [None]:
nlp_pipeline

<transformers.pipelines.token_classification.TokenClassificationPipeline at 0x7dc4f79ff130>

In [None]:
example="sudhanshu kumar is a foundar of iNeuron"
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.99745697,
  'index': 1,
  'word': 'sud',
  'start': 0,
  'end': 3},
 {'entity': 'B-PER',
  'score': 0.9974349,
  'index': 2,
  'word': '##han',
  'start': 3,
  'end': 6},
 {'entity': 'B-PER',
  'score': 0.9977197,
  'index': 3,
  'word': '##shu',
  'start': 6,
  'end': 9},
 {'entity': 'I-PER',
  'score': 0.99450976,
  'index': 4,
  'word': 'kumar',
  'start': 10,
  'end': 15},
 {'entity': 'B-ORG',
  'score': 0.9819257,
  'index': 10,
  'word': 'in',
  'start': 32,
  'end': 34},
 {'entity': 'B-ORG',
  'score': 0.9871458,
  'index': 11,
  'word': '##eur',
  'start': 34,
  'end': 37},
 {'entity': 'B-ORG',
  'score': 0.98911977,
  'index': 12,
  'word': '##on',
  'start': 37,
  'end': 39}]
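The pipeline returns one prediction per subword token, so "sudhanshu" appears as three pieces. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens for you; the sketch below only illustrates the idea in plain Python on the output above. `merge_word_pieces` is a hypothetical helper that glues `##` continuation pieces back onto the preceding piece when their character offsets touch, keeping the first piece's tag and the lowest score.

```python
# Token-level predictions copied from the pipeline output above ('index' omitted).
predictions = [
    {"entity": "B-PER", "score": 0.99745697, "word": "sud", "start": 0, "end": 3},
    {"entity": "B-PER", "score": 0.9974349, "word": "##han", "start": 3, "end": 6},
    {"entity": "B-PER", "score": 0.9977197, "word": "##shu", "start": 6, "end": 9},
    {"entity": "I-PER", "score": 0.99450976, "word": "kumar", "start": 10, "end": 15},
    {"entity": "B-ORG", "score": 0.9819257, "word": "in", "start": 32, "end": 34},
    {"entity": "B-ORG", "score": 0.9871458, "word": "##eur", "start": 34, "end": 37},
    {"entity": "B-ORG", "score": 0.98911977, "word": "##on", "start": 37, "end": 39},
]

def merge_word_pieces(preds):
    """Merge '##' continuation pieces into the word they belong to."""
    merged = []
    for pred in preds:
        continues_word = (
            bool(merged)
            and pred["word"].startswith("##")
            and merged[-1]["end"] == pred["start"]
        )
        if continues_word:
            last = merged[-1]
            last["word"] += pred["word"][2:]                   # drop the '##' marker
            last["end"] = pred["end"]
            last["score"] = min(last["score"], pred["score"])  # most cautious score
        else:
            merged.append(dict(pred))  # copy so the input is not modified
    return merged

words = merge_word_pieces(predictions)
for item in words:
    print(item["word"], item["entity"], item["start"], item["end"])
# sudhanshu B-PER 0 9
# kumar I-PER 10 15
# ineuron B-ORG 32 39
```

The repeated B-PER tag on every piece of "sudhanshu" is likely a side effect of training with `label_all_tokens=True`, which copies a word's label onto all of its subwords.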

In [None]:
example="sunny is a founder of microsoft"
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.99303186,
  'index': 1,
  'word': 'sunny',
  'start': 0,
  'end': 5},
 {'entity': 'B-ORG',
  'score': 0.94221765,
  'index': 6,
  'word': 'microsoft',
  'start': 22,
  'end': 31}]

In [None]:
example="apple launch mobile while eating apple which taste like orange"
nlp_pipeline(example)

[{'entity': 'B-ORG',
  'score': 0.9346753,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5},
 {'entity': 'B-MISC',
  'score': 0.8264191,
  'index': 6,
  'word': 'apple',
  'start': 33,
  'end': 38}]

In [None]:
example="vikas is working ai engineer in google"
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.9960549,
  'index': 1,
  'word': 'vi',
  'start': 0,
  'end': 2},
 {'entity': 'B-PER',
  'score': 0.9959552,
  'index': 2,
  'word': '##kas',
  'start': 2,
  'end': 5},
 {'entity': 'B-ORG',
  'score': 0.9780191,
  'index': 8,
  'word': 'google',
  'start': 32,
  'end': 38}]

In [None]:
example="apple founder loves eating apple"
nlp_pipeline(example)

[{'entity': 'B-ORG',
  'score': 0.9772743,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5}]

In [None]:
example="Microsoft Windows created their software by idea that came from the window of the house"
nlp_pipeline(example)

[{'entity': 'B-ORG',
  'score': 0.9837806,
  'index': 1,
  'word': 'microsoft',
  'start': 0,
  'end': 9},
 {'entity': 'I-ORG',
  'score': 0.9599731,
  'index': 2,
  'word': 'windows',
  'start': 10,
  'end': 17}]

# **END**