# **Chapter 1: Introduction to BERT**

## **Overview**

BERT, or Bidirectional Encoder Representations from Transformers, represents a revolutionary approach in NLP. Developed by Google, BERT's key innovation is its deep bidirectionality, allowing the model to understand the context of a word based on all of its surroundings (left and right of the word).

* Transformers: The backbone of BERT, transformers use an architecture that weights the influence of different words on each other's context. Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), transformers read the entire sequence of words at once. This allows for more contextually informed representations of each word.

* Pre-training and Fine-tuning: BERT is pre-trained on a large corpus of text and then fine-tuned for specific tasks. Pre-training involves learning general language representations from a large text dataset (like Wikipedia). Fine-tuning adapts these representations to specific NLP tasks using smaller task-specific datasets.

* Bidirectionality: Traditional language models were trained to read text either left to right or right to left. BERT, however, is trained to use context from both directions simultaneously. This is achieved through a pre-training objective called masked language modeling (MLM), in which a fraction of the input tokens (15% in the original BERT setup) is masked at random and the model learns to predict them from the surrounding unmasked tokens.
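
As a rough illustration of the masking step, the sketch below hides a random subset of tokens and records the originals as prediction targets. It is a simplification under stated assumptions: the helper name `mask_tokens`, the fixed seed, and the whitespace-split sentence are hypothetical, and the full BERT recipe also sometimes substitutes a random token or leaves the selected token unchanged instead of always inserting [MASK].

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] (hypothetical helper).

    Returns the masked sequence and a parallel list of targets, where a
    target is the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)      # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)     # not part of the training objective
    return masked, targets


tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

Because the model sees the unmasked tokens on both sides of every [MASK], it can use left and right context together when predicting the hidden token.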

# **Chapter 2: Environment Setup**
## **Overview**

Setting up the environment involves installing the necessary Python libraries and ensuring that your system or development environment is ready to handle the tasks of loading data, training models, and evaluating their performance.

**Installation of Libraries**

To work with the BERT model, specific libraries need to be installed that facilitate model loading, data manipulation, and computation. The primary library used is transformers, which provides access to BERT and other pre-trained models. Additionally, libraries like datasets help in loading and handling popular NLP datasets.

In [None]:
!pip install transformers datasets tokenizers seqeval -q

This command installs the transformers library for accessing pre-trained models, datasets for dataset management, tokenizers for efficient text tokenization, and seqeval for evaluation metrics specific to sequence labeling tasks.

In [None]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

* datasets: Handles loading and preparing datasets.
* numpy: Used for numerical operations.
* BertTokenizerFast: Provides a fast BERT tokenizer that also tracks which word each token came from.
* DataCollatorForTokenClassification: Pads examples and assembles them into batches.
* AutoModelForTokenClassification: Loads a pre-trained model with a token classification head.

# **Chapter 3: Data Loading and Preprocessing**
## **Overview**

Loading and preprocessing data are critical steps in any machine learning workflow, especially in NLP. These steps ensure that the dataset is in the right format for the model to process effectively.

**Loading the CoNLL2003 Dataset**

The CoNLL2003 dataset is widely used for named entity recognition (NER), a common task in NLP where the goal is to identify and classify named entities in text into predefined categories such as the names of persons, organizations, locations, etc.

In [None]:
conll2003 = datasets.load_dataset("conll2003")


In [None]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [None]:
conll2003.shape

{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}

In [None]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [None]:
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

* O means the word doesn’t correspond to any entity.
* B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
* B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
* B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
* B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
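
To make the tagging scheme concrete, the following sketch decodes the integer tags of the first training sentence shown above into entity spans. The helper `extract_entities` is a hypothetical illustration, not part of the `datasets` library, and it simply drops an I- tag that does not continue an open entity of the same type.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def extract_entities(tokens, tag_ids):
    """Group words into (entity_type, text) spans from IOB2 tag ids."""
    entities, current = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current[1].append(token)
        else:
            # 'O' (or a stray I- tag) closes any open entity.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(extract_entities(tokens, ner_tags))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```

The sentence therefore contains one ORG entity and two MISC entities.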

In [None]:
conll2003['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

**Preprocessing for BERT**

Preprocessing involves adapting the dataset to the format required by BERT for effective learning and prediction. This includes tokenization and aligning labels with BERT's token outputs.



BERT's tokenizer splits text into tokens and may break a single word into several subword tokens. Because the dataset provides one label per word, the number of tokens can exceed the number of labels. Here's how to align them:

1. **Special Tokens**: Tokens such as [CLS] and [SEP] are assigned a label of -100, indicating to PyTorch's CrossEntropyLoss function to exclude them from loss calculations. This ensures they do not affect the model training, as they don’t represent actual words from the input.

2. **Subword Tokens**: For words split into subwords, there are two labeling approaches:
   - **First Token Labeling**: Assign the label of the entire word to the first token and label the rest as -100. This focuses training on the first part of the word.
   - **Uniform Labeling**: Assign the same label to all subword tokens, treating each part equally in training.

Using the label -100 helps to focus learning on meaningful tokens and prevents non-representative tokens from influencing the model’s performance.
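
The effect of the -100 label on the loss can be sketched in plain Python. This is a minimal illustration, assuming a mean-reduced cross-entropy over the labeled positions; `masked_cross_entropy` is a hypothetical helper written for this example, not the PyTorch implementation, and it assumes at least one position carries a real label.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens / masked subwords contribute nothing
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[label])  # -log softmax(row)[label]
    return sum(losses) / len(losses)


# Three token positions, three classes; the first position is ignored.
logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
labels = [-100, 2, 0]
loss = masked_cross_entropy(logits, labels)

# Changing the logits at the ignored position leaves the loss unchanged.
changed = [[-9.0, 9.0, 0.0]] + logits[1:]
print(loss, masked_cross_entropy(changed, labels))
```

Whatever the model predicts at an ignored position, it produces no loss and therefore no gradient.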

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("nlpaueb/legal-bert-base-uncased")



In [None]:
example_text = conll2003['train'][0]

tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

word_ids = tokenized_input.word_ids()

print(word_ids)


# word_ids() returns a list with one entry per input id: special tokens map to None,
# and every other token maps to the index of its original word. This lets us align
# the word-level labels with the processed input ids.

tokenized_input

[None, 0, 1, 2, 3, 4, 5, 6, 6, 7, 8, None]


{'input_ids': [101, 501, 5714, 1600, 1842, 211, 21215, 3585, 178, 7846, 117, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The **word_ids** method maps tokens to their original words, which is necessary because the BERT tokenizer can split words into subwords. In the output above, word index 6 appears twice because that word was split into two subword tokens. This mapping is used to ensure that labels correspond correctly to their respective tokens, an essential step for training the model on tasks like NER.

In [None]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])
# 9 word-level labels vs. 12 tokens ([CLS], [SEP], and one extra subword)

(9, 12)

The function below prepares text for training by aligning labels with tokens in two ways:

* Ignoring special tokens: it assigns the label -100 to special tokens such as [CLS] and [SEP], so the loss function skips them. When `label_all_tokens=False`, it also assigns -100 to every subword after the first one in a word.
* Aligning labels: every remaining token receives the label of the word it came from. With the default `label_all_tokens=True`, all subwords of a word share that word's label.

In [None]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    """
    Function to tokenize and align labels with respect to the tokens. This function is specifically designed for
    Named Entity Recognition (NER) tasks where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A dictionary containing the tokens and the corresponding NER tags.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.

    label_all_tokens (bool): A flag to indicate whether all tokens should have labels.
                             If False, only the first token of a word will have a label,
                             the other tokens (subwords) corresponding to the same word will be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # word_ids() returns a list mapping each token to the index of its
        # word in the original sentence (None for special tokens).
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] have no word index;
                # label them -100 so the loss function ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use the word's label.
                label_ids.append(label[word_idx])
            else:
                # Later subword of the same word: repeat the word's label if
                # label_all_tokens is True, otherwise mask it with -100.
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
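
To see what the inner loop produces for a single sentence, it can be traced without the tokenizer. The sketch below reuses the `word_ids` list and the `ner_tags` printed earlier for the first training sentence; `align_labels` is a hypothetical stand-alone helper that mirrors the loop above and is not part of the notebook.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token-level labels for one sentence."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                 # special token
        elif wid != previous:
            aligned.append(word_labels[wid])     # first token of a word
        else:
            # later subword of the same word
            aligned.append(word_labels[wid] if label_all_tokens else -100)
        previous = wid
    return aligned


word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 6, 7, 8, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

print(align_labels(word_ids, ner_tags, label_all_tokens=True))
# [-100, 3, 0, 7, 0, 0, 0, 7, 7, 0, 0, -100]
print(align_labels(word_ids, ner_tags, label_all_tokens=False))
# [-100, 3, 0, 7, 0, 0, 0, 7, -100, 0, 0, -100]
```

Both variants return 12 labels, one per token. They differ only at the second subword of word 6, which either repeats the word's label or is masked with -100.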

In [None]:
q = tokenize_and_align_labels(conll2003['train'][4:5])
print(q)

{'input_ids': [[101, 1221, 110, 163, 900, 211, 207, 274, 403, 110, 163, 2824, 195, 526, 532, 188, 3298, 13898, 235, 4149, 786, 222, 15034, 2765, 305, 2778, 4899, 3681, 238, 779, 231, 268, 6852, 890, 598, 207, 2137, 2580, 246, 969, 577, 117, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, -100]]}


In [None]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinar_______________________________ 0
##y_____________________________________ 0
committee_______________________________ 0
we______________________________________ 1
##r_____________________________________ 1
##ner___________________________________ 1
zw______________________________________ 2
##ing___________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumer

In [None]:
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)

# **Chapter 4: Model Configuration and Fine-tuning**
## **Overview**

Model configuration involves setting up the BERT model for the specific task of token classification. Fine-tuning is the process of adapting a pre-trained model to a specific dataset or task by continuing the training process with task-specific data.


**Setting up the Model**

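The model used is LEGAL-BERT (nlpaueb/legal-bert-base-uncased), a variant of BERT pre-trained on legal text. CoNLL2003 itself consists of news text, so the choice of a legal-domain checkpoint is most relevant to the later step in which the fine-tuned model is run on text extracted from a PDF.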

In [None]:
model = AutoModelForTokenClassification.from_pretrained("nlpaueb/legal-bert-base-uncased", num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This code initializes a BERT model for token classification with num_labels=9, one output per tag in the CoNLL2003 label set (O plus the B- and I- tags for the four entity types). The encoder weights come from pre-training on legal text, while the classification head (classifier.weight and classifier.bias) is newly initialized and must be learned during fine-tuning, as the warning above indicates.

In [None]:
!pip install accelerate -U

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "test-ner",                     # output directory for checkpoints
    evaluation_strategy="epoch",    # evaluate once per epoch (recent transformers releases rename this to `eval_strategy`)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

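This block sets the training hyperparameters: a learning rate of 2e-5, a batch size of 16 per device for both training and evaluation, 3 epochs, and a weight decay of 0.01 as regularization. Setting the evaluation strategy to "epoch" runs evaluation on the validation set after every epoch, which makes overfitting visible as a rising validation loss.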
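
These settings also determine how many optimizer steps the run takes. The short calculation below is a sketch that assumes a single device and no gradient accumulation; the variable names are chosen for this example.

```python
import math

train_examples = 14041   # rows in the CoNLL2003 train split
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 878 2634
```

The total of 2634 steps agrees with the `global_step` value reported in the training output of this notebook.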

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

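A data collator assembles individual examples into batches. DataCollatorForTokenClassification pads the inputs of each batch to a common length and pads the label sequences as well, using -100 so that the padded positions are ignored by the loss.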
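
The padding behaviour can be sketched in plain Python. This is an illustration only: `pad_batch` is a hypothetical helper, the pad token id of 0 is an assumption about the vocabulary, and the real collator returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of tokenized examples to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # padded label positions get -100 so they are excluded from the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 501, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 1221, 110, 163, 102], "labels": [-100, 5, 0, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 501, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 3, -100, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.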

In [None]:
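# Note: datasets.load_metric is deprecated in newer `datasets` releases;
# the separate `evaluate` library (evaluate.load("seqeval")) replaces it.
metric = datasets.load_metric("seqeval")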


In [None]:
example = conll2003['train'][0]

In [None]:
label_list = conll2003["train"].features["ner_tags"].feature.names

label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
labels = [label_list[i] for i in example["ner_tags"]]

metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

In [None]:
def compute_metrics(eval_preds):
    """
    Function to compute the evaluation metrics for Named Entity Recognition (NER) tasks.
    The function computes precision, recall, F1 score and accuracy.

    Parameters:
    eval_preds (tuple): A tuple containing the predicted logits and the true labels.

    Returns:
    A dictionary containing the precision, recall, F1 score and accuracy.
    """
    pred_logits, labels = eval_preds

    # Softmax is monotonic, so the argmax of the logits is also the argmax of
    # the probabilities; no softmax is needed.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100, then map ids to tag strings.
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

This compute_metrics() function first takes the argmax of the logits to convert them to predicted class ids. Softmax preserves the ordering of the logits, so applying it first would not change the argmax. The function then drops every position whose label is -100, converts the remaining predictions and labels from integers to tag strings, and passes the results to the metric.compute() method.
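
The decoding steps can be followed on a toy batch without NumPy or seqeval. In this sketch the logits are invented for illustration, and `argmax`, `softmax`, `peak`, and `decode` are hypothetical helpers, not functions from the notebook.

```python
import math

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def peak(i, n=9):
    """Made-up logits whose largest value sits at class i."""
    return [2.0 if j == i else 0.0 for j in range(n)]


def decode(pred_logits, labels):
    """Turn logits and label ids into lists of tag strings, skipping -100."""
    predictions, references = [], []
    for sent_logits, sent_labels in zip(pred_logits, labels):
        pred_ids = [argmax(row) for row in sent_logits]
        predictions.append(
            [label_list[p] for p, l in zip(pred_ids, sent_labels) if l != -100])
        references.append([label_list[l] for l in sent_labels if l != -100])
    return predictions, references


# One sentence: [CLS], two real tokens, [SEP].
logits = [[peak(5), peak(3), peak(7), peak(1)]]
labels = [[-100, 3, 0, -100]]

predictions, references = decode(logits, labels)
print(predictions)  # [['B-ORG', 'B-MISC']]
print(references)   # [['B-ORG', 'O']]
```

The predictions at the two -100 positions are discarded, so only the two real tokens reach the metric. Here the second token is a deliberate misprediction (B-MISC instead of O).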

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

The **Trainer** from Hugging Face’s Transformers library handles the training process. It leverages the training arguments, model, datasets, and tokenizer to manage the training loop, including backpropagation and evaluation.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.3145,0.105518,0.871666,0.884512,0.878042,0.968774
2,0.0872,0.089533,0.905817,0.902123,0.903966,0.975013
3,0.0538,0.088669,0.906511,0.914913,0.910693,0.97619


TrainOutput(global_step=2634, training_loss=0.12611545372298927, metrics={'train_runtime': 548.0854, 'train_samples_per_second': 76.855, 'train_steps_per_second': 4.806, 'total_flos': 1101309468061158.0, 'train_loss': 0.12611545372298927, 'epoch': 3.0})

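The model was evaluated on the validation set after each epoch using precision, recall, and F1, the standard metrics for NER, alongside token-level accuracy. After the third epoch it reached a precision of 0.907, a recall of 0.915, an F1 of 0.911, and an accuracy of 0.976. F1, the harmonic mean of precision and recall, is more informative than accuracy here because most tokens carry the O tag.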
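
As a quick consistency check, the F1 score in the table can be recomputed from the reported precision and recall of the third epoch. The snippet is a sketch that uses only the rounded values printed above.

```python
precision = 0.906511   # epoch 3, from the training table
recall = 0.914913      # epoch 3, from the training table

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

The result agrees with the reported F1 of 0.910693 up to rounding of the inputs.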



# **Chapter 5: Conclusion and Further Steps**

## **Overview**

This final chapter summarizes the project, reflects on the lessons learned, and outlines potential areas for future research or improvement based on the results obtained from fine-tuning the BERT model for named entity recognition (NER).

**Project Summary**

Throughout this project, we fine-tuned a BERT model for named entity recognition. Starting with an introduction to BERT, we prepared the environment and dataset, aligned word-level labels with subword tokens, fine-tuned the model, and evaluated it with precision, recall, F1, and accuracy. The project demonstrated both the effectiveness of transformer models on NER and the importance of careful data preprocessing.

**Achievements**

- **Model Performance**: After three epochs the model reached a precision of 0.907, a recall of 0.915, an F1 of 0.911, and an accuracy of 0.976 on the CoNLL2003 validation set.
- **Understanding of BERT**: Gained a deep understanding of how BERT processes language data and the implications of its bidirectional nature and subword tokenization strategy.
- **Technical Skills**: Enhanced skills in Python, PyTorch, and the Hugging Face Transformers library, which are crucial for NLP tasks.

**Challenges**

- **Data Alignment**: Addressing issues related to tokenization and alignment of labels with BERT’s subword tokens was challenging and required careful handling to ensure data integrity.
- **Model Tuning**: Choosing a model configuration and stopping point that avoid overfitting while maintaining performance on unseen data required judgment. Validation loss was still decreasing slightly after the third epoch (0.0895 to 0.0887).


---




In [None]:
model.save_pretrained("ner_model")

In [None]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [None]:
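# JSON object keys must be strings, so the ids are stringified as keys;
# label2id maps each tag back to its integer id.
id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}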

In [None]:
import json

In [None]:
with open("ner_model/config.json") as f:
    config = json.load(f)

In [None]:
config["id2label"] = id2label
config["label2id"] = label2id

In [None]:
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)
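
The reason the ids are written as strings can be seen from a JSON round trip. The snippet below is a small stand-alone sketch using only the standard library. It does not touch the saved model, and in it `label2id` maps tags to integer ids.

```python
import json

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Integer keys are converted to strings when serialized to JSON.
int_keyed = {i: label for i, label in enumerate(label_list)}
round_trip = json.loads(json.dumps(int_keyed))
print(list(round_trip)[:3])  # ['0', '1', '2']

# Building the mappings explicitly keeps both directions consistent.
id2label = {str(i): label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}
print(id2label["3"], label2id["B-ORG"])  # B-ORG 3
```

With these mappings stored in the config, the inference pipeline can report tag names such as B-PER rather than generic names like LABEL_1.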

In [None]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")

In [None]:
from transformers import pipeline
from tabulate import tabulate

nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

example = "Bill Gates is the Founder of Microsoft"

ner_results = nlp(example)

table = [["Index", "Word", "Entity", "Score"]]
for item in ner_results:
    table.append([item["index"], item["word"], item["entity"], item["score"]])

print(tabulate(table, headers="firstrow"))


  Index  Word    Entity       Score
-------  ------  --------  --------
      1  bill    B-PER     0.997214
      2  gate    I-PER     0.997775
      3  ##s     I-PER     0.997131
      8  micro   B-ORG     0.988478
      9  ##soft  B-ORG     0.978292
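
The pipeline reports one row per subword token, so "gate" and "##s" or "micro" and "##soft" appear separately. The sketch below merges such rows into whole entities using the results printed above. `group_entities` is a hypothetical helper written for this illustration; the pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping.

```python
ner_results = [
    {"index": 1, "word": "bill", "entity": "B-PER", "score": 0.997214},
    {"index": 2, "word": "gate", "entity": "I-PER", "score": 0.997775},
    {"index": 3, "word": "##s", "entity": "I-PER", "score": 0.997131},
    {"index": 8, "word": "micro", "entity": "B-ORG", "score": 0.988478},
    {"index": 9, "word": "##soft", "entity": "B-ORG", "score": 0.978292},
]


def group_entities(items):
    """Merge adjacent token-level predictions into (text, type) entities."""
    groups = []
    for item in items:
        etype = item["entity"].split("-", 1)[-1]
        is_subword = item["word"].startswith("##")
        adjacent = bool(groups) and item["index"] == groups[-1]["last_index"] + 1
        same_type = bool(groups) and groups[-1]["type"] == etype
        # A B- tag on a full word starts a new entity; on a subword it does not.
        starts_new = item["entity"].startswith("B-") and not is_subword
        if adjacent and same_type and not starts_new:
            piece = item["word"][2:] if is_subword else " " + item["word"]
            groups[-1]["text"] += piece
            groups[-1]["last_index"] = item["index"]
        else:
            groups.append({"type": etype, "text": item["word"],
                           "last_index": item["index"]})
    return [(g["text"], g["type"]) for g in groups]


print(group_entities(ner_results))
# [('bill gates', 'PER'), ('microsoft', 'ORG')]
```

The second subword of "microsoft" is tagged B-ORG rather than I-ORG because training used `label_all_tokens=True`, which copies a word's B- tag onto all of its subwords. The helper therefore treats a B- tag on a "##" piece as a continuation.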


In [None]:
import fitz  # PyMuPDF (installed with `pip install pymupdf`), used to read PDF files

def convert_pdf_to_text(pdf_path):
    """Converts a PDF file to plain text.

    Args:
        pdf_path (str): The file path of the PDF document.

    Returns:
        str: The extracted text from the PDF.
    """
    doc = fitz.open(pdf_path)  # Open the PDF file
    text = ''  # Initialize an empty string to store the extracted text
    for page in doc:  # Loop through each page in the PDF
        text += page.get_text()  # Extract text from the current page and add it to the text variable
    return text  # Return the collected text


In [None]:
convert_pdf_to_text("/content/crl.a._3_q_2021.pdf")


In [None]:
ner_results = nlp(convert_pdf_to_text("/content/crl.a._3_q_2021.pdf"))

table = [["Index", "Word", "Entity", "Score"]]
for item in ner_results:
    table.append([item["index"], item["word"], item["entity"], item["score"]])

print(tabulate(table, headers="firstrow"))
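
BERT-style models accept a limited number of tokens per input (512 for BERT base), and the text of a whole PDF will usually exceed that. One possible workaround, sketched below, is to split the text into smaller pieces and run the pipeline on each piece. `chunk_words` is a hypothetical helper, and the limit of 200 words is an assumed conservative value, since word count is only a rough proxy for the number of subword tokens.

```python
def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


sample = "one two three four five"
print(chunk_words(sample, max_words=2))
# ['one two', 'three four', 'five']

# Hypothetical usage with the objects defined in this notebook:
# all_results = []
# for chunk in chunk_words(convert_pdf_to_text(pdf_path)):
#     all_results.extend(nlp(chunk))
```

Splitting at fixed positions can cut an entity in half at a chunk boundary, and the `index` values returned by the pipeline restart in every chunk, so this approach is a starting point rather than a complete solution.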
