# Data Preprocessing and Feature Engineering

## Introduction
In this notebook, we will preprocess the **CoNLL-2003** dataset and perform feature engineering to prepare it for training a Named Entity Recognition (NER) model. The main steps include:

- Tokenizing the sentences using a pre-trained BERT tokenizer.
- Aligning the NER tags with the tokenized inputs.
- Addressing class imbalance using class weights.
- Saving the processed dataset for efficient model training.

These steps ensure that the labels line up with the tokens the model actually sees, and that rare entity tags are not drowned out by the dominant "O" class.


In [1]:
!pip install datasets


In [7]:
# Import necessary libraries
from datasets import load_dataset
from transformers import BertTokenizerFast
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight
import torch

# Load the CoNLL-2003 dataset
dataset = load_dataset("conll2003")

# Initialize the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

print("Dataset and tokenizer loaded.")

Dataset and tokenizer loaded.


## Tokenization and Tag Alignment
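We will tokenize the sentences with the BERT tokenizer. Because WordPiece tokenization can split a single word into several sub-tokens, the word-level NER tags must be re-aligned to the sub-token sequence. We use `BertTokenizerFast` because its output exposes `word_ids()`, which maps every sub-token back to the index of the word it came from. Only the first sub-token of each word keeps the word's tag; special tokens and the remaining sub-tokens are labeled `-100` so that the loss function ignores them.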


In [8]:
# Define a function for tokenization and tag alignment
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        previous_word_id = None
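        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP]) have no source word; -100 is ignored by the loss
                aligned_labels.append(-100)
            elif word_id != previous_word_id:
                # First sub-token of a word: keep the word's tag
                aligned_labels.append(label[word_id])
            else:
                # Remaining sub-tokens of the same word: mask them out
                aligned_labels.append(-100)
            previous_word_id = word_id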
        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply the tokenization and alignment function to the dataset
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


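
The labeling rule used in `tokenize_and_align_labels` can be checked on a small hand-written example. The sketch below reimplements only the inner loop. The `word_ids` list is made up for illustration (a real one comes from `tokenized_inputs.word_ids(...)`), and `align_labels` is a hypothetical helper name, not part of the notebook.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: no source word
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First sub-token of a new word
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token of the same word
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Hypothetical input: three words, the second one split into two sub-tokens
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, -100, 0, -100]
```

The aligned list has one entry per sub-token, and exactly one unmasked label per original word.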

## Handling Class Imbalance
As observed in the data exploration step, the dataset is imbalanced, with the majority of tokens labeled as "O". We will calculate class weights to account for this imbalance during training. These weights will be used to adjust the loss function, giving more importance to underrepresented classes.


In [9]:
# Collect all NER tags from the training set, skipping masked (-100) positions
all_labels = [label for labels in tokenized_datasets['train']['labels'] for label in labels if label != -100]

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(all_labels), y=all_labels)
class_weights = torch.tensor(class_weights, dtype=torch.float32)

# Display the calculated class weights
print("Class Weights:", class_weights)


Class Weights: tensor([ 0.1334,  3.4280,  4.9966,  3.5793,  6.1081,  3.1687, 19.5545,  6.5807,
        19.5884])
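
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up toy labels. The helper name `balanced_class_weights` and the toy data are assumptions for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count)} for each observed class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Hypothetical toy data: class 0 is four times as frequent as class 1
toy_labels = [0] * 8 + [1] * 2

toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)  # {0: 0.625, 1: 2.5}
```

A class receives a weight below 1 when it accounts for more than `1 / n_classes` of the labels, which is why "O" ends up at 0.1334 with nine classes.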


## Padding and Data Collation
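To reduce memory use during training, we will use dynamic padding: each batch is padded only to the length of its own longest sequence rather than to a fixed global maximum. `DataCollatorForTokenClassification` also pads the `labels` column, by default with `-100`, so padded positions are ignored by the loss. Fewer padding tokens means less wasted computation per batch.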


In [10]:
# Define a data collator for dynamic padding
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
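
To show what dynamic padding does to a batch, here is a minimal pure-Python sketch. It is not the collator's actual implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and a pad token ID of 0 is assumed. The real collator additionally returns tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        seq_len = len(f["input_ids"])
        n_pad = max_len - seq_len
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * seq_len + [0] * n_pad)
        # Padded label positions are masked so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Hypothetical features of different lengths (token IDs are made up)
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["labels"])     # [[-100, 1, 2, -100], [-100, 0, -100, -100]]
```

The padded length here is 4, the longest sequence in this batch, regardless of how long sequences in other batches are.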


## Saving the Preprocessed Data
We will save the preprocessed training, validation, and test sets to disk. This will make it easier to load the data in the next notebook for model training.


In [11]:
# Save the tokenized datasets
tokenized_datasets["train"].save_to_disk("tokenized_train_dataset")
tokenized_datasets["validation"].save_to_disk("tokenized_validation_dataset")
tokenized_datasets["test"].save_to_disk("tokenized_test_dataset")

print("Preprocessed datasets saved successfully.")



Preprocessed datasets saved successfully.


## Preliminary Observations

- Tokenization and label alignment ran on all three splits. Each word's tag is attached to its first sub-token, while special tokens and continuation sub-tokens are masked with `-100`.
- We calculated class weights to handle the class imbalance. These weights will be used to modify the loss function during training, giving more importance to underrepresented classes.

### Analysis of Class Weights
The calculated class weights are as follows:

**Class Weights:**
- **O (Outside):** 0.1334
- **B-PER (Beginning of Person):** 3.4280
- **I-PER (Inside of Person):** 4.9966
- **B-ORG (Beginning of Organization):** 3.5793
- **I-ORG (Inside of Organization):** 6.1081
- **B-LOC (Beginning of Location):** 3.1687
- **I-LOC (Inside of Location):** 19.5545
- **B-MISC (Beginning of Miscellaneous):** 6.5807
- **I-MISC (Inside of Miscellaneous):** 19.5884

### Observations:
1. The **O** class has the lowest weight (0.1334), as expected for the most common class. Class weights scale the loss by each token's true label, so errors on tokens whose true label is "O" contribute comparatively little to the loss. Misclassifying an entity token as "O" is still penalized at that entity's (higher) weight.
2. The **I-LOC** and **I-MISC** tags have the highest weights (19.5545 and 19.5884, respectively). Because "balanced" weights are inversely proportional to class frequency, these are the two rarest tags in the training set. During training, errors on these tags will count the most toward the loss.
3. The weights for **B-PER**, **B-ORG**, and **B-LOC** are moderate, indicating that while these classes are less common than the "O" class, they are not as rare as the **I-LOC** or **I-MISC** tags.
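4. The gap between **beginning (B-)** and **inside (I-)** weights varies by entity type. It is large for locations (**B-LOC** 3.1687 vs. **I-LOC** 19.5545) and miscellaneous entities, and much smaller for persons (**B-PER** 3.4280 vs. **I-PER** 4.9966). Since I- tags occur only in multi-word entities, this suggests that multi-word locations are comparatively rare in the training data, while multi-word person names are common. Misclassifying an I- tag truncates the entity, so the higher weights on these tags push the model toward recovering complete multi-word spans.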
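
How class weights enter a cross-entropy loss can be sketched in a few lines. The function below is a hypothetical pure-Python illustration, not the training code. It is meant to mirror the weighted-mean reduction of `torch.nn.CrossEntropyLoss(weight=..., ignore_index=-100)`, in which each token's loss is scaled by the weight of its true class and the sum is divided by the total weight of the non-ignored tokens. The logits are made up, and the two classes are reindexed as 0 ("O") and 1 ("I-LOC").

```python
import math


def weighted_cross_entropy(logits, targets, class_weights, ignore_index=-100):
    """Weighted mean of per-token cross-entropy, skipping ignored positions."""
    weighted_sum = 0.0
    weight_total = 0.0
    for token_logits, target in zip(logits, targets):
        if target == ignore_index:
            continue
        # Numerically stable log-sum-exp over the class logits
        max_logit = max(token_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in token_logits)
        )
        nll = log_norm - token_logits[target]
        # The weight is chosen by the TRUE class of the token
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total


weights = [0.1334, 19.5545]  # "O" and "I-LOC" weights from the list above
logits = [[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]  # made-up model outputs
targets = [0, 1, -100]  # true "O", true "I-LOC", masked sub-token

loss = weighted_cross_entropy(logits, targets, weights)
print(round(loss, 3))
```

In this example the second token, a true "I-LOC" that the model scores as "O", dominates the result, while the masked third token does not contribute at all.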

### Next Steps:
- We will pass these class weights to the loss function during training to counteract the dataset's imbalance. This should improve recall on rare classes, possibly at some cost in precision.
- It is important to monitor the model's precision and recall for these underrepresented classes (e.g., **I-LOC** and **I-MISC**) to ensure it is not biased towards predicting the majority "O" class.
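
A minimal sketch of such monitoring is shown below, assuming flat lists of true and predicted label IDs. The helper `per_class_precision_recall` and the example labels are hypothetical. The metric is computed per token, whereas NER results are usually reported at the entity level, so this is a diagnostic and not a replacement for entity-level evaluation.

```python
def per_class_precision_recall(true_labels, predicted_labels, class_id, ignore_index=-100):
    """Token-level precision and recall for one class, skipping masked positions."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == ignore_index:
            continue
        if p == class_id and t == class_id:
            tp += 1
        elif p == class_id:
            fp += 1
        elif t == class_id:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 0 = O, 6 = I-LOC, -100 = masked sub-token
true_labels = [0, 6, 6, 0, -100, 6]
predicted_labels = [0, 6, 0, 6, 0, 0]

precision, recall = per_class_precision_recall(true_labels, predicted_labels, class_id=6)
print(precision, round(recall, 3))  # 0.5 0.333
```

A low recall paired with a high precision on a rare tag would indicate that the model still defaults to "O" for that tag.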