#### What is Named-Entity-Recognition?
Named Entity Recognition (NER) is a natural language processing technique that identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and more.
The key steps in NER are:
* Tokenization: The input text is split into tokens like words, phrases or sentences.
* Entity identification: Potential named entities are detected using linguistic rules or statistical methods by recognizing patterns like capitalization in names.
* Entity classification: The identified entities are categorized into predefined classes like "Person", "Organization" or "Location" using machine learning models trained on labeled datasets.
* Contextual analysis: The surrounding context is considered to improve accuracy. For example, in "Apple released a new iPhone", the context helps recognize "Apple" as an organization.
* Post-processing: Results may be refined by resolving ambiguities, merging multi-token entities, or using knowledge bases to enhance entity data.


We are fine tuning BERT model by google on CoNLL2003 dataset. Which contains labelled examples  of named entities, we can train our model to recognize and classify differnt types of  named entities, such as person, locations, organizations, and more.

#### A bit more about the dataset
The CoNLL2003 dataset consists of a wide collection of English news articles from the Reuters Corpus, annotated with named entity labels. It is commonly used as a benchmark dataset for named entity recognition (NER) tasks in natural language processing.

In [1]:
## import dependencies
# !pip install seqeval
# !pip install transformers
# !pip install datasets

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, pipeline
from datasets import load_dataset, load_metric, Dataset, DatasetDict
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score, classification_report



  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Check for CUDA availibility
print("CUDA available: ", torch.cuda.is_available())

CUDA available:  False


In [3]:
# Read the data
def read_conll_file(file_path):
    with open(file_path, "r") as f:
        content = f.read().strip()
        sentences = content.split("\n\n")
        data = []
        for sentence in sentences:
            tokens = sentence.split("\n")
            token_data = []
            for token in tokens:
                token_data.append(token.split())
            data.append(token_data)
            
    return data

train_data = read_conll_file("dataset/conll2003/eng.train")
test_data = read_conll_file("dataset/conll2003/eng.testa")
validation_data = read_conll_file("dataset/conll2003/eng.testb")

## 

In [4]:
## Data Preprocessing
def prepare_dataset(data, label_map):
    formatted_data = {"tokens": [], "ner_tags": []}

    for i, sentence in enumerate(data):
        tokens = [token_data[0] for token_data in sentence]
        ner_tags = [label_map[token_data[3]] for token_data in sentence]
        formatted_data["tokens"].append(tokens)
        formatted_data["ner_tags"].append(ner_tags)
    return Dataset.from_dict(formatted_data)



    

In [5]:
label_list = sorted(list(set([token_data[3] for sentence in train_data for token_data in sentence])))
label_map = {label: i for i, label in enumerate(label_list)}


train_dataset = prepare_dataset(train_data, label_map)
validation_dataset = prepare_dataset(validation_data, label_map)
test_dataset = prepare_dataset(test_data, label_map)


datasets = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset,
})


In [6]:
# Intiate tokenizer and the model
model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = len(label_list))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
## Tokenization function
def tokenize_and_align_labels(dataset):
    """
    This function will tokenize the input words and align the labels to it for training
    Returns: a dictionary having tokens (tokenized words) and labels (label tokens)
    """
    # return an embedding for each token
    tokenized_inputs = tokenizer(
        dataset["tokens"], truncation=True, is_split_into_words=True, padding=True
    )
    labels = [] # list to store the target values for each token
    
    for i, label in enumerate(dataset["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        # Loop ot create a list of labels  for each entry
        for word_idx in word_ids:
            if word_idx == None: # if the word is not part of the bocan vocab
                label_ids.append(-100)
            elif word_idx != previous_word_idx: # if the word is a new unique word
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids) # appending the labels for the current entry
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
        

tokenized_dataset = datasets.map(tokenize_and_align_labels, batched=True)


Map: 100%|██████████| 14987/14987 [00:00<00:00, 19394.29 examples/s]
Map: 100%|██████████| 3684/3684 [00:00<00:00, 19561.81 examples/s]
Map: 100%|██████████| 3466/3466 [00:00<00:00, 18801.19 examples/s]


In [55]:
# tokenized_dataset is a dataset dictionary with each train, test and eval dataset having these keys:
# tokens: actual tokens
# net_tags: actual ner tag ids 
# input_ids: token ids
# token_type_ids: indicating another sequence 
# attention mask: 0/1 to indicate where model has to attend or not
# labels: actual final labels for training

# dict_keys(['tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [56]:
## Define metrics
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=2)


    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]


    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "classification_report": classification_report(true_labels, true_predictions),
    }


In [139]:
# Data Collator to dynamically pad the sentences to the longest length in a batch during collation, 
# instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    metric_for_best_model="f1",
)



trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



In [140]:
## Model training
trainer.train()


  0%|          | 0/1874 [00:50<?, ?it/s]


[A[A

RuntimeError: MPS backend out of memory (MPS allocated: 16.89 GB, other allocations: 3.17 GB, max allowed: 20.40 GB). Tried to allocate 975.47 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

In [88]:
len(tokenized_dataset["train"][0]["input_ids"]), len(tokenized_dataset["train"][0]["input_ids"])

83

In [66]:
tokenized_dataset["train"][0].keys()

dict_keys(['tokens', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [69]:
data_collator

DataCollatorForTokenClassification(tokenizer=BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, label_pad_token_id=-100, re