<a href="https://colab.research.google.com/github/Beabir/TaskModule4_MNER/blob/main/02_Preprocessing_data_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Preprocessing the English (en) dataset**

This notebook loads the English WikiANN dataset, preprocesses it for NER, runs the pre-trained Babelscape/wikineural-multilingual-ner model on it, and tokenizes the data so that the model can be fine-tuned on it.

The dataset uses IOB2 format with tags for LOC (location), PER (person), and ORG (organization).
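
To make the IOB2 convention concrete, here is a minimal sketch that groups tagged tokens into entity spans. The helper name `iob2_to_spans` and the example sentence are hypothetical and not part of the dataset or the notebook.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) ends the span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, not taken from the dataset.
tokens = ["Angela", "Merkel", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(iob2_to_spans(tokens, tags))
```

In this sketch an `I-` tag that does not follow a matching `B-` tag is simply dropped; a stricter variant could raise an error instead.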

For multilingual NER, we can repeat this process for other languages by changing the language code when loading the dataset.

The 'unimelb-nlp/wikiann' dataset covers 282 languages, which makes it a useful resource for multilingual NER; here we use only German (de), French (fr), Italian (it), and English (en).

To evaluate our model on other languages or perform zero-shot cross-lingual transfer, we can load the dataset for a different language and use our trained model to make predictions on that data.
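
Switching languages could look roughly like the sketch below. The helper names `load_wikiann` and `load_all` are hypothetical; only `load_dataset("unimelb-nlp/wikiann", <code>)` is taken from this notebook. The `loader` argument is an assumption added so that the loop can be tried out without downloading anything.

```python
LANGUAGES = ("en", "de", "fr", "it")  # the four languages used in this project


def load_wikiann(lang):
    """Load the WikiANN splits for one language code."""
    # Imported lazily so the helper can be defined before the library is loaded.
    from datasets import load_dataset
    return load_dataset("unimelb-nlp/wikiann", lang)


def load_all(languages=LANGUAGES, loader=load_wikiann):
    """Return a dict mapping each language code to its loaded dataset."""
    return {lang: loader(lang) for lang in languages}
```

Calling `load_all()` downloads all four datasets; calling `load_all(["de"])` restricts the download to German, for example to evaluate a model trained on English.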

In [1]:
# Set up the environment
!pip install datasets
import torch
from datasets import load_dataset
import re
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from transformers import pipeline

Defaulting to user installation because normal site-packages is not writeable


In [2]:
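# Function to remove punctuation
def remove_punctuation(example):
    # Strip punctuation from every token and drop tokens that end up empty.
    # Note: only "tokens" is filtered; "ner_tags" and "langs" keep their original length.
    cleaned = [re.sub(r'[^\w\s]', '', token) for token in example["tokens"]]
    example["tokens"] = [token for token in cleaned if token]
    return example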

# load dataset
dataset = load_dataset("unimelb-nlp/wikiann", "en")

# remove punctuation in all splits (train, validation, test)
dataset = dataset.map(remove_punctuation)

print(dataset["train"][5])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

{'tokens': ['St', 'Mary', 's', 'Catholic', 'Church', 'Sandusky', 'Ohio'], 'ner_tags': [3, 4, 4, 4, 4, 4, 4, 4, 4, 4], 'langs': ['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'], 'spans': ["ORG: St. Mary 's Catholic Church ( Sandusky , Ohio )"]}
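
In the record printed above, `tokens` has 7 entries while `ner_tags` and `langs` still have 10, because only the token list is filtered. If the tags should stay aligned with the tokens, one option is to apply the same selection to every per-token column. The sketch below is a hypothetical variant (`remove_punctuation_aligned` is not part of the notebook), and the input tokens are an assumption reconstructed from the `spans` string of the record.

```python
import re


def remove_punctuation_aligned(example):
    """Strip punctuation and drop emptied tokens, keeping tags in step."""
    cleaned = [re.sub(r"[^\w\s]", "", token) for token in example["tokens"]]
    keep = [i for i, token in enumerate(cleaned) if token]
    result = dict(example)
    result["tokens"] = [cleaned[i] for i in keep]
    # Apply the same selection to every per-token column.
    for column in ("ner_tags", "langs"):
        result[column] = [example[column][i] for i in keep]
    return result


# Assumed original tokens, reconstructed from the 'spans' string of the record.
example = {
    "tokens": ["St.", "Mary", "'s", "Catholic", "Church", "(", "Sandusky", ",", "Ohio", ")"],
    "ner_tags": [3, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "langs": ["en"] * 10,
}
cleaned = remove_punctuation_aligned(example)
print(cleaned["tokens"])
print(cleaned["ner_tags"])
```

One caveat of this approach: if a dropped punctuation token carried a `B-` tag, the entity would then start with an `I-` tag, so a production version might need to repair such cases.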


In [3]:
# Load the pre-trained model and tokenizer
model_name = "Babelscape/wikineural-multilingual-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

In [4]:
# Create NER pipeline (aggregation_strategy="simple" replaces the deprecated grouped_entities=True)
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")



Device set to use cpu


In [5]:
# Function to process the dataset
def process_dataset(examples):
    texts = [" ".join(tokens) for tokens in examples["tokens"]]
    ner_results = ner_pipeline(texts)

    processed_results = []
    for result in ner_results:
        entities = []
        for entity in result:
            entities.append({
                "entity": entity["entity_group"],
                "start": entity["start"],
                "end": entity["end"]
            })
        processed_results.append({"entities": entities})

    return {"processed_ner": processed_results}





In [6]:
# Process the dataset
processed_dataset = dataset.map(process_dataset, batched=True, batch_size=8)

# Now we can use the processed_dataset for further tasks or evaluation

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

**NB:** The process_dataset function returns a dictionary with a single key, "processed_ner", which holds the NER results for each example in the batch.

The map function then adds this key to our dataset as a new column.
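
Because the pipeline input is built with `" ".join(tokens)`, the character offsets in "processed_ner" can be mapped back to token positions. The following is a minimal sketch of that idea; the helper names and the entity offsets in the example are hypothetical and were not produced by the model.

```python
def token_offsets(tokens):
    """Character (start, end) of each token in ' '.join(tokens)."""
    offsets, position = [], 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # +1 for the joining space
    return offsets


def entities_to_tags(tokens, entities):
    """Turn character-level entity spans into one IOB2 tag per token."""
    tags = ["O"] * len(tokens)
    offsets = token_offsets(tokens)
    for entity in entities:
        inside = False
        for i, (start, end) in enumerate(offsets):
            # A token belongs to the entity if the two character ranges overlap.
            if start < entity["end"] and end > entity["start"]:
                tags[i] = ("I-" if inside else "B-") + entity["entity"]
                inside = True
    return tags


# Hypothetical entities for illustration; real offsets come from ner_pipeline.
tokens = ["St", "Mary", "s", "Catholic", "Church", "Sandusky", "Ohio"]
entities = [
    {"entity": "ORG", "start": 0, "end": 25},
    {"entity": "LOC", "start": 26, "end": 34},
    {"entity": "LOC", "start": 35, "end": 39},
]
print(entities_to_tags(tokens, entities))
```

Token-level tags in this form can be compared position by position with the decoded `ner_tags`, provided both lists have the same length.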

In [None]:
# Dataset.map keeps the original columns by default, so returning them is optional.
# To make this explicit, we can return them together with the new "processed_ner" column:
def process_dataset(examples):
    texts = [" ".join(tokens) for tokens in examples["tokens"]]
    processed_results = [
        {"entities": [{"entity": e["entity_group"], "start": e["start"], "end": e["end"]} for e in result]}
        for result in ner_pipeline(texts)
    ]
    return {
        **examples,  # Include all original columns
        "processed_ner": processed_results
    }
# Either way, the dataset keeps all existing columns and gains the new "processed_ner" column.


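To fine-tune the model rather than only run inference, we also need a training loop. The cells below prepare what that loop requires: a label mapping and tokenized inputs with aligned labels.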
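
One possible way to get such a loop without writing it by hand is the `Trainer` class imported in the first cell. The sketch below is an assumption about how it could be wired up, not the setup used in this project: `build_trainer`, the output directory name, and all hyperparameter values are hypothetical placeholders. It expects a tokenized dataset with "train" and "validation" splits and a "labels" column.

```python
def build_trainer(model, tokenizer, tokenized_dataset, output_dir="ner-finetune-en"):
    """Assemble a Trainer for token classification (sketch with placeholder values)."""
    # Imported lazily; the first cell of this notebook imports the same names.
    from transformers import (
        DataCollatorForTokenClassification,
        Trainer,
        TrainingArguments,
    )

    args = TrainingArguments(
        output_dir=output_dir,  # hypothetical directory name
        learning_rate=2e-5,  # assumed, untuned hyperparameters
        per_device_train_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        # Pads input_ids and labels to a common length within each batch.
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
```

With the objects from this notebook, usage would be `trainer = build_trainer(model, tokenizer, tokenized_dataset)` followed by `trainer.train()`.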

In [7]:
# Define label list (make sure this matches the labels in your dataset)
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

In [8]:
# Check whether the label_list matches the labels in the dataset.
# 1. Extract the unique labels from the dataset:
unique_dataset_labels = set()
for example in dataset["train"]["ner_tags"]:
    unique_dataset_labels.update(example)

In [9]:
# 2. Compare the extracted labels with the defined label_list:
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
dataset_labels = sorted(unique_dataset_labels)

print("Labels in dataset:", dataset_labels)
print("Labels in label_list:", label_list)

if set(dataset_labels) == set(label_list):
    print("Label lists match!")
else:
    print("Label lists do not match. Differences:")
    print("In dataset but not in label_list:", set(dataset_labels) - set(label_list))
    print("In label_list but not in dataset:", set(label_list) - set(dataset_labels))


Labels in dataset: [0, 1, 2, 3, 4, 5, 6]
Labels in label_list: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
Label lists do not match. Differences:
In dataset but not in label_list: {0, 1, 2, 3, 4, 5, 6}
In label_list but not in dataset: {'B-ORG', 'I-PER', 'B-PER', 'I-ORG', 'I-LOC', 'B-LOC', 'O'}


In [10]:
# The dataset uses numeric labels (0-6) rather than string labels, so the
# comparison above cannot match. This is a common format in which each number
# corresponds to a specific entity tag.
# To work with these numeric labels, we first define a mapping between them
# and their string representations:
id2label = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC"
}
label2id = {v: k for k, v in id2label.items()}


In [11]:
# We can verify the labels in the dataset:
unique_labels = set()
for example in dataset["train"]["ner_tags"]:
    unique_labels.update(example)

print("Unique labels in the dataset:", sorted(unique_labels))
print("Label mapping:")
for label_id, label in id2label.items():  # avoid shadowing the built-in id()
    print(f"{label_id}: {label}")


Unique labels in the dataset: [0, 1, 2, 3, 4, 5, 6]
Label mapping:
0: O
1: B-PER
2: I-PER
3: B-ORG
4: I-ORG
5: B-LOC
6: I-LOC
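
With the mapping in place, integer tags can be translated into readable labels and checked for coverage. A minimal sketch, in which `decode_tags` and `unmapped_ids` are hypothetical helper names and the mapping is the one defined above:

```python
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}


def decode_tags(tag_ids, mapping=id2label):
    """Translate a sequence of integer tag ids into IOB2 label strings."""
    return [mapping[tag_id] for tag_id in tag_ids]


def unmapped_ids(tag_ids, mapping=id2label):
    """Return the ids that occur in the data but have no label in the mapping."""
    return sorted(set(tag_ids) - set(mapping))


print(decode_tags([3, 4, 4]))
print(unmapped_ids([0, 1, 2, 3, 4, 5, 6]))
```

An empty result from `unmapped_ids` means every id found in the data has a name. It does not prove that the order of names is right, which still has to be checked against the dataset's own label definition.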


In [12]:
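# Now we can define a tokenize_and_align_labels function that works with these numeric labels:
def tokenize_and_align_labels(examples):
    # The tokens are already split into words; the tokenizer splits them further into sub-tokens.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each sub-token to the index of its source word (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] are ignored by the loss.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-token of a word carries the word's label.
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are ignored by the loss.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs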


In [13]:

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

In [14]:
print(tokenized_dataset['train'][5])

{'tokens': ['St', 'Mary', 's', 'Catholic', 'Church', 'Sandusky', 'Ohio'], 'ner_tags': [3, 4, 4, 4, 4, 4, 4, 4, 4, 4], 'langs': ['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'], 'spans': ["ORG: St. Mary 's Catholic Church ( Sandusky , Ohio )"], 'input_ids': [101, 10838, 12176, 187, 15473, 12690, 35071, 107320, 10157, 13608, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 3, 4, 4, 4, 4, 4, -100, -100, 4, -100]}
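
The `labels` field printed above follows from the alignment rule in tokenize_and_align_labels: the first sub-token of each word receives the word's label, and everything else receives -100. The sketch below reproduces that rule without a tokenizer. The `word_ids` list is an assumption inferred from the positions of -100 in the printed labels, not actual tokenizer output, and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token and mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special tokens and continuation sub-tokens are ignored by the loss.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# word_ids inferred from the printed labels (an assumption, not tokenizer output):
# [CLS], five single-piece words, "Sandusky" in three pieces, "Ohio", [SEP].
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, None]
ner_tags = [3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
print(align_labels(word_ids, ner_tags))
```

Only the first 7 of the 10 tags are ever read here, because `word_ids` refers to the 7 remaining tokens. In this record all trailing tags are identical, so the mismatch in length does not change the result.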


In [15]:
tokenized_dataset.save_to_disk("Data_en")


Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/20000 [00:00<?, ? examples/s]
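
A later notebook can reload the saved data instead of repeating the preprocessing. A minimal sketch, assuming the same working directory; `load_preprocessed` is a hypothetical helper name wrapping `load_from_disk` from the `datasets` library.

```python
DATA_DIR = "Data_en"  # directory written by save_to_disk above


def load_preprocessed(path=DATA_DIR):
    """Reload the tokenized dataset that was saved to disk."""
    # Imported lazily so this helper can be defined before the library is loaded.
    from datasets import load_from_disk
    return load_from_disk(path)
```

For example, `tokenized_dataset = load_preprocessed()` would restore all three splits, including the `input_ids`, `attention_mask`, and `labels` columns.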

In [None]:
# Link to the preprocessed, tokenized data (may not be reachable): https://ondemand.hpc.unibe.ch/node/gnode29.ubelix.unibe.ch/11422/lab/tree/Data_en