# Named Entity Recognition Project

## Reference: https://huggingface.co/course/chapter7/2

### Named Entity Recognition (NER) is a token-classification task; this project is implemented with the PyTorch framework

### Dataset Link : https://huggingface.co/datasets/conll2003

### Model Link : https://huggingface.co/bert-base-uncased

### Installing requirements

### seqeval: an evaluation library for sequence labeling, used here to score NER predictions

In [2]:
# The -q option means to give less output during the installation process. This option is useful to reduce the clutter on the screen.

!pip install transformers datasets tokenizers seqeval -q

### The command above installs four Python packages with the pip package manager:

- **transformers**: A library that provides state-of-the-art natural language processing models and tools.
- **datasets**: A library that provides easy access to a wide range of datasets and metrics for natural language processing.
- **tokenizers**: A library that provides fast and customizable tokenization methods for natural language processing.
- **seqeval**: A library that provides sequence labeling evaluation metrics such as precision, recall, and F1 score.

In [3]:
import datasets
import numpy as np

from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification  # load my model

#### Data collators:
Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of `train_dataset` or `eval_dataset`. To build batches, data collators may apply some processing (such as padding). Some of them also apply random data augmentation (such as random masking) to the formed batch.
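
The padding step described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `pad_batch` is a hypothetical helper, the pad token id 0 is assumed to be the `[PAD]` id of `bert-base-uncased`, and the label padding value -100 matches the value the loss ignores. The token ids are taken from the first training sentence shown later in this notebook.

```python
# Hypothetical helper for illustration; DataCollatorForTokenClassification
# does this work (and returns tensors) in the real pipeline.
PAD_TOKEN_ID = 0     # assumption: id of [PAD] in bert-base-uncased
LABEL_PAD_ID = -100  # label value skipped by the loss

def pad_batch(examples):
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(ex["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(ex["labels"] + [LABEL_PAD_ID] * n_pad)
    return batch

examples = [
    {"input_ids": [101, 7327, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 7327, 19164, 2446, 102], "labels": [-100, 3, 0, 7, -100]},
]
batch = pad_batch(examples)
print(batch["input_ids"])
print(batch["labels"])
```

Padding the labels with -100 rather than 0 matters here, because 0 is the id of the real label `O`.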

### Retrieving the dataset

In [4]:
conll = datasets.load_dataset("conll2003")


In [5]:
conll

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [6]:
conll['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [7]:
conll['train'][2]

{'id': '2',
 'tokens': ['BRUSSELS', '1996-08-22'],
 'pos_tags': [22, 11],
 'chunk_tags': [11, 12],
 'ner_tags': [5, 0]}

In [8]:
conll['train'][1]

{'id': '1',
 'tokens': ['Peter', 'Blackburn'],
 'pos_tags': [22, 22],
 'chunk_tags': [11, 12],
 'ner_tags': [1, 2]}

In [9]:
conll['train'].features["pos_tags"]

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)

In [10]:
conll['train'].features["chunk_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)

In [11]:
conll['train'].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
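
The `B-`/`I-` prefixes in these label names mark the beginning and the inside of an entity, and `O` marks tokens outside any entity. As a rough sketch of how the integer `ner_tags` turn into entities, the hypothetical helper `extract_entities` below groups tagged tokens into spans; the sample inputs are the training rows printed above.

```python
NER_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def extract_entities(tokens, tag_ids, names=NER_NAMES):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_tokens = [], None, []

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        if tag == "O":
            flush()
            current_type, current_tokens = None, []
        elif tag.startswith("I-") and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # A B- tag (or a stray I- tag of another type) opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
    flush()
    return entities

print(extract_entities(['Peter', 'Blackburn'], [1, 2]))
print(extract_entities(['BRUSSELS', '1996-08-22'], [5, 0]))
```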

In [12]:
conll['train'].description

''

### Retrieving our BERT tokenizer from Hugging Face

In [13]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

###### Checking how the tokenizer works

In [14]:
conll["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [15]:
checking_tokenizer = conll["train"][0]
tokenized_input = tokenizer(checking_tokenizer['tokens'], is_split_into_words=True)
tokenized_input   # [CLS] has id 101 and [SEP] has id 102

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print("tokens : ", tokens)
print()

word_ids = tokenized_input.word_ids()  # index of the source word for each token (None for special tokens)
print("unique_num_word : ", word_ids)

tokens :  ['[CLS]', 'eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.', '[SEP]']

unique_num_word :  [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]
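
In this sentence every word maps to exactly one token, so the word ids simply count upward. Words missing from the vocabulary are split into several pieces that share one word id. The sketch below imitates that behaviour with a greedy longest-match split over a toy vocabulary. `TOY_VOCAB`, `wordpiece` and `tokenize_words` are invented for illustration (the real `bert-base-uncased` vocabulary has about 30,000 entries), and the real tokenizer does more normalization than lowercasing.

```python
# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"eu", "rejects", "na", "##ren", "##der", "."}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize_words(words, vocab):
    """Return tokens plus the index of the source word for each token."""
    tokens, word_ids = ["[CLS]"], [None]
    for idx, word in enumerate(words):
        for piece in wordpiece(word.lower(), vocab):
            tokens.append(piece)
            word_ids.append(idx)
    tokens.append("[SEP]")
    word_ids.append(None)
    return tokens, word_ids

tokens, word_ids = tokenize_words(["Narender", "rejects", "EU"], TOY_VOCAB)
print(tokens)
print(word_ids)
```

The three pieces of the first word all carry word id 0, which is the situation the label alignment step has to handle.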


In [17]:
len(checking_tokenizer['ner_tags']), len(tokenized_input["input_ids"])

(9, 11)

In [18]:
len(tokens)

11

In [19]:
# "ner_tags" has no entries for the special tokens [CLS] and [SEP]

len(conll["train"][0]["ner_tags"])

9

In [20]:
# The tokens and "ner_tags" have different lengths, so the labels must be extended to match the tokens.

# "ner_tags" has no entries for [CLS] and [SEP]; the fix is to add the label -100 at the start and at the end.

# PyTorch's cross-entropy loss ignores positions labeled -100 during training.

conll["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]
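
To see why -100 is a safe filler, here is a minimal pure-Python sketch of a cross-entropy loss that skips ignored positions. It assumes the default `ignore_index=-100` of PyTorch's `CrossEntropyLoss`; `masked_cross_entropy` is a hypothetical stand-in for illustration, not the code the trainer runs.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and masked sub-words add nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(token_logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in token_logits))
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0

labels = [-100, 1, -100]
loss = masked_cross_entropy([[5.0, 0.0], [0.0, 0.0], [9.0, -9.0]], labels)
# Changing the logits at the ignored positions leaves the loss unchanged.
loss_changed = masked_cross_entropy([[-3.0, 7.0], [0.0, 0.0], [0.0, 2.0]], labels)
print(loss, loss_changed)
```

Only the middle position counts, so both calls return the loss of a uniform two-class prediction, `log(2)`.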

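### The function tokenize_and_align_labels below does two jobs

**1. Sets -100 as the label for the special tokens, so they are ignored during training.**

**2. Labels the sub-words after the first sub-word of each word: with `label_all_tokens=True` (the default here) they inherit the word's label; with `label_all_tokens=False` they are masked with -100.**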

## Then we align the labels with the token ids using the strategy we picked:

In [21]:
# Extend the labels to the same length as the tokens:

def tokenize_and_align_labels(examples, label_all_tokens=True):
    """Tokenize pre-split words and align the NER labels with the resulting tokens."""
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids() returns, for each token, the index of the word it came from
        # in the original sentence (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] map to None; label them -100
                # so they are ignored in the loss during training and validation.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word: use the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later sub-words of the same word: repeat the word's label if
                # label_all_tokens is True, otherwise mask them with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
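
The effect of the `label_all_tokens` flag is easier to see on a word that is split into several pieces. The sketch below repeats the inner loop on a hand-written `word_ids` list, so it runs without a tokenizer. `align_labels` and the sample `word_ids` are illustrative assumptions: they stand for a two-word input whose first word is split into three pieces.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Standalone copy of the alignment loop, for one sentence."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            # later piece of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned

word_ids = [None, 0, 0, 0, 1, None]  # [CLS], three pieces of word 0, word 1, [SEP]
word_labels = [1, 0]                 # B-PER, O

labelled = align_labels(word_ids, word_labels, label_all_tokens=True)
masked = align_labels(word_ids, word_labels, label_all_tokens=False)
print(labelled)
print(masked)
```

With `label_all_tokens=True` every piece of the first word receives `B-PER` (id 1), so the model is trained to predict a `B-` tag on continuation pieces as well.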

In [22]:
ts = tokenize_and_align_labels(conll["train"][0:1])
print(ts)

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]]}


In [23]:
for token, label in zip(tokenizer.convert_ids_to_tokens(ts["input_ids"][0]), ts["labels"][0]):
  print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
eu______________________________________ 3
rejects_________________________________ 0
german__________________________________ 7
call____________________________________ 0
to______________________________________ 0
boycott_________________________________ 0
british_________________________________ 7
lamb____________________________________ 0
._______________________________________ 0
[SEP]___________________________________ -100


In [24]:
# Apply the function to every split of the dataset

tokenized_datasets = conll.map(tokenize_and_align_labels, batched=True)

In [25]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

In [26]:
tokenized_datasets['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'input_ids': [101,
  7327,
  19164,
  2446,
  2655,
  2000,
  17757,
  2329,
  12559,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]}

### Retrieving our BERT model from Hugging Face

In [27]:
# num_labels = 9 because conll['train'].features["ner_tags"] lists nine labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### Fine-tuning the BERT model requires setting up TrainingArguments

In [28]:
!pip install accelerate -U



In [29]:
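# Define the training arguments
from transformers import TrainingArguments, Trainer

args = TrainingArguments("test-ner",                   # output directory for checkpoints
                         evaluation_strategy="epoch",  # evaluate after every epoch (recent transformers releases name this argument eval_strategy)
                         learning_rate=2e-5,
                         per_device_train_batch_size=16,
                         per_device_eval_batch_size=16,
                         num_train_epochs=3,
                         weight_decay=0.01)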

In [30]:
data_collator = DataCollatorForTokenClassification(tokenizer)

#### For NER, the evaluation metric is seqeval

In [31]:
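# load_metric is deprecated in recent datasets releases; evaluate.load("seqeval") from the evaluate package replaces it
metric = datasets.load_metric("seqeval")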

###### Checking how the metric works



In [33]:
ex = conll['train'][0]

label_list = conll['train'].features['ner_tags'].feature.names

label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [34]:
for i in ex['ner_tags']:
  print(i)

3
0
7
0
0
0
7
0
0


In [35]:
labels = [label_list[i] for i in ex["ner_tags"]]

labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [36]:
metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
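
The scores above are computed per entity, not per token: a predicted entity counts as correct only when its type and its boundaries both match the reference. The following pure-Python sketch shows that idea in simplified form. `spans` and `entity_scores` are hypothetical helpers, and seqeval itself may treat edge cases differently.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB2 tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the extra "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == kind
        if not inside:
            if kind is not None:
                found.add((kind, start, i))
            if tag == "O":
                start, kind = None, None
            else:
                start, kind = i, tag[2:]  # a B- tag (or stray I- tag) opens a span
    return found

def entity_scores(pred_tags, true_tags):
    """Entity-level precision, recall and F1 for one sentence."""
    pred, true = spans(pred_tags), spans(true_tags)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
# Hypothetical prediction that misses the first MISC entity
prediction = ['B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

precision, recall, f1 = entity_scores(prediction, reference)
print(precision, recall, f1)
```

Here both predicted entities are right (precision 1.0) but one of the three reference entities is missed (recall 2/3), giving an F1 of 0.8.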

### Compute Metrics

The compute_metrics() function first takes the argmax of the logits to convert them to predictions (the logits and the probabilities are in the same order, so we don't need to apply the softmax). Then it converts both labels and predictions from integers to strings, removes all the values where the label is -100, and passes the results to the metric.compute() method:

In [39]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # The logits and the probabilities are in the same order,
    # so the argmax can be taken without applying the softmax.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100, then map ids to label strings.
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
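
The two preprocessing steps in this function can be checked on a tiny made-up example without NumPy or a model. The logits, the shortened label list `LABELS`, and the helper names below are all invented for illustration.

```python
import math

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def softmax(values):
    """Convert logits to probabilities."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["O", "B-PER", "I-PER"]  # shortened label list, for illustration only

# One sentence with four tokens; the first and last stand for [CLS] and [SEP].
logits = [[0.1, 0.2, 0.3], [2.0, -1.0, 0.5], [-0.3, 4.0, 1.0], [0.0, 0.0, 9.0]]
label_ids = [-100, 0, 1, -100]

pred_ids = [argmax(row) for row in logits]
# Softmax preserves the ordering of the logits, so the argmax is unchanged.
pred_ids_from_probs = [argmax(softmax(row)) for row in logits]

# Keep only the positions that carry a real label.
predictions = [LABELS[p] for p, l in zip(pred_ids, label_ids) if l != -100]
references = [LABELS[l] for l in label_ids if l != -100]
print(predictions, references)
```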

#### Training

In [41]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [42]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2203 | 0.064435 | 0.910681 | 0.931872 | 0.921154 | 0.981921 |
| 2 | 0.0445 | 0.055693 | 0.929518 | 0.941269 | 0.935357 | 0.985067 |
| 3 | 0.0265 | 0.056309 | 0.935252 | 0.945296 | 0.940247 | 0.985718 |


TrainOutput(global_step=2634, training_loss=0.07670928333053646, metrics={'train_runtime': 512.9304, 'train_samples_per_second': 82.122, 'train_steps_per_second': 5.135, 'total_flos': 1024113336121080.0, 'train_loss': 0.07670928333053646, 'epoch': 3.0})
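
The `global_step` and throughput figures in this output can be cross-checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these is stated in the output itself.

```python
import math

train_rows = 14041        # num_rows of the train split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 512.9304  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)  # last batch is smaller than 16
total_steps = steps_per_epoch * epochs

samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the result is 878 steps per epoch and 2634 steps overall, which agrees with the reported `global_step=2634`.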

## Save model

In [43]:
## Save model

model.save_pretrained("ner_model")

In [44]:
## Save tokenizer

tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [45]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}

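# Note: the ids are stored as strings here; Hugging Face configs conventionally use integer ids in label2id.
label2id = {
    label: str(i) for i,label in enumerate(label_list)
}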


In [46]:
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [47]:
label2id

{'O': '0',
 'B-PER': '1',
 'I-PER': '2',
 'B-ORG': '3',
 'I-ORG': '4',
 'B-LOC': '5',
 'I-LOC': '6',
 'B-MISC': '7',
 'I-MISC': '8'}

## Loading model & prediction

In [48]:
import json

In [50]:
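# Add the label maps to the saved config so predictions show tag names instead of LABEL_0, LABEL_1, ...
with open("/content/ner_model/config.json") as f:
    config = json.load(f)
config["id2label"] = id2label
config["label2id"] = label2id
with open("/content/ner_model/config.json", "w") as f:
    json.dump(config, f)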
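
The same config patch can be tried safely on a throwaway file. The sketch below uses a temporary directory and a stub config, so it touches no real model. `add_label_maps` is a hypothetical helper, and writing integer ids into `label2id` is a choice made for this example, not what the cell above stores.

```python
import json
import tempfile
from pathlib import Path

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def add_label_maps(config_path, labels):
    """Write id2label and label2id into an existing config.json."""
    config_path = Path(config_path)
    with config_path.open() as f:
        config = json.load(f)
    # JSON object keys are always strings, so the ids are written as strings here.
    config["id2label"] = {str(i): label for i, label in enumerate(labels)}
    config["label2id"] = {label: i for i, label in enumerate(labels)}
    with config_path.open("w") as f:
        json.dump(config, f, indent=2)
    return config

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    # Stub config standing in for the file written by save_pretrained
    path.write_text(json.dumps({"model_type": "bert"}))
    add_label_maps(path, label_list)
    reloaded = json.loads(path.read_text())

print(reloaded["id2label"]["3"], reloaded["label2id"]["B-ORG"])
```

An alternative that avoids editing the file afterwards is to pass `id2label` and `label2id` to `from_pretrained` when the model is first created, so they are saved with the config.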


In [51]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("/content/ner_model")

In [52]:
from transformers import pipeline

In [53]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)


example = "Bill Gates is the Founder of Microsoft"

ner_results = nlp(example)

print(ner_results)

[{'entity': 'B-PER', 'score': 0.99485284, 'index': 1, 'word': 'bill', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.9961118, 'index': 2, 'word': 'gates', 'start': 5, 'end': 10}, {'entity': 'B-ORG', 'score': 0.97106916, 'index': 7, 'word': 'microsoft', 'start': 29, 'end': 38}]


In [54]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

# This name is not in the tokenizer's vocabulary,
# so it is split into sub-word pieces, each tagged separately

example = "Narender is the Founder of Microsoft"

ner_results = nlp(example)

print(ner_results)

[{'entity': 'B-PER', 'score': 0.99737203, 'index': 1, 'word': 'na', 'start': 0, 'end': 2}, {'entity': 'B-PER', 'score': 0.99773717, 'index': 2, 'word': '##ren', 'start': 2, 'end': 5}, {'entity': 'B-PER', 'score': 0.9980496, 'index': 3, 'word': '##der', 'start': 5, 'end': 8}, {'entity': 'B-ORG', 'score': 0.9781677, 'index': 8, 'word': 'microsoft', 'start': 27, 'end': 36}]
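
The name comes back as three separate `B-PER` pieces. One way to present it as a single word is to glue the `##` pieces back together after prediction. The sketch below does this on the output printed above. `merge_subwords` is a hypothetical helper, and keeping the lowest score of the merged pieces is an arbitrary choice. It joins sub-word pieces only, not multi-word entities such as "bill gates".

```python
ner_results = [
    {'entity': 'B-PER', 'score': 0.99737203, 'index': 1, 'word': 'na', 'start': 0, 'end': 2},
    {'entity': 'B-PER', 'score': 0.99773717, 'index': 2, 'word': '##ren', 'start': 2, 'end': 5},
    {'entity': 'B-PER', 'score': 0.9980496, 'index': 3, 'word': '##der', 'start': 5, 'end': 8},
    {'entity': 'B-ORG', 'score': 0.9781677, 'index': 8, 'word': 'microsoft', 'start': 27, 'end': 36},
]

def merge_subwords(results):
    """Join '##' continuation pieces onto the word they belong to."""
    merged = []
    for item in results:
        if item["word"].startswith("##") and merged:
            last = merged[-1]
            last["word"] += item["word"][2:]  # drop the ## prefix
            last["end"] = item["end"]
            last["score"] = min(last["score"], item["score"])
        else:
            merged.append({
                "entity_group": item["entity"].split("-", 1)[-1],  # B-PER -> PER
                "word": item["word"],
                "start": item["start"],
                "end": item["end"],
                "score": item["score"],
            })
    return merged

merged = merge_subwords(ner_results)
for entity in merged:
    print(entity)
```

The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"` or `"first"`) that groups tokens inside the library; how well it groups these all-`B-` pieces would need to be checked on this model.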
