<a href="https://colab.research.google.com/github/AndriiShvahuliak/Data-Science-Internship-Test/blob/main/Task_1/Task_1_full_code_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Named Entity Recognition for Mountain Names

This project fine-tunes a BERT model for Named Entity Recognition (NER), specifically to identify mountain names in text. It uses a labeled dataset in CoNLL format, where each token of a sentence is paired with an entity tag. The model is trained to recognize these entities and can then be applied to new sentences.

The workflow includes parsing the dataset, tokenizing the input text, splitting the data into training and testing sets, training the model, and finally, using the model to make predictions on new sentences.


## Dataset Parsing and Preparation

In this cell, we install the necessary libraries and define a function to parse the CoNLL dataset. The function reads the dataset file, extracts sentences and their corresponding entity labels, and prepares the data for tokenization. Finally, we create a Hugging Face dataset object from the parsed data.


In [5]:
!pip install transformers datasets
from datasets import Dataset
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments, BertTokenizerFast
import torch

# Function to parse the CoNLL dataset
def parse_conll(data_file):
    sentences = []
    labels = []
    current_sentence = []
    current_labels = []

    with open(data_file, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == "":
                if current_sentence:
                    sentences.append(current_sentence)
                    labels.append(current_labels)
                    current_sentence = []
                    current_labels = []
                continue

            word, tag = line.split()
            current_sentence.append(word)
            current_labels.append(tag)

    if current_sentence:
        sentences.append(current_sentence)
        labels.append(current_labels)

    return sentences, labels

# Parse the dataset
sentences, labels = parse_conll('labeled_mountains_dataset.conll')

# Prepare the dataset for BERT tokenization
data = {"tokens": sentences, "ner_tags": labels}

# Create a Hugging Face dataset object
dataset = Dataset.from_dict(data)
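
To make the expected file layout concrete, here is a minimal sketch of the same parsing logic run on an in-memory sample. The sample sentences and the name `parse_conll_lines` are illustrative assumptions, not taken from the project's dataset; the sketch only assumes one `token tag` pair per line and a blank line between sentences.

```python
import io

def parse_conll_lines(lines):
    """Group 'token tag' lines into sentences; a blank line ends a sentence."""
    sentences, labels = [], []
    tokens, tags = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # Blank line: close the current sentence, if any
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the entity tag
    if tokens:
        # The file may not end with a blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels

# Hypothetical two-sentence sample in the assumed layout
sample = io.StringIO(
    "Mount B-LOC\n"
    "Everest I-LOC\n"
    "is O\n"
    "high O\n"
    "\n"
    "We O\n"
    "climbed O\n"
    "Elbrus B-LOC\n"
)
sample_sentences, sample_labels = parse_conll_lines(sample)
print(sample_sentences)
print(sample_labels)
```

Each inner list of `sample_sentences` lines up position by position with the matching inner list of `sample_labels`, which is the shape `Dataset.from_dict` receives in the cell above.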


## Tokenizer and Label Alignment

In this cell, we load a pre-trained BERT tokenizer and a token-classification model, and define a label list for the entity types. We also create a function that tokenizes the sentences and aligns the labels with the resulting subword tokens: the first subword of each word keeps the word's label, while special tokens and the remaining subwords are set to -100 so that the loss ignores them. Inputs are padded within each batch and truncated to at most 128 tokens.


In [6]:
# Define the label list (assume "O", "B-LOC", and "I-LOC" are the only labels)
label_list = ["O", "B-LOC", "I-LOC"]

# Load the pre-trained fast BERT tokenizer and the token-classification model
# (label_list must be defined first, because num_labels depends on it)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(label_list))

# Function to tokenize and align labels with padding and truncation
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        padding=True,
        max_length=128,  # Adjust this based on your dataset's typical sentence length
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special token (like [CLS], [SEP])
            elif word_idx != previous_word_idx:
                label_ids.append(label_list.index(label[word_idx]))  # Real label
            else:
                label_ids.append(-100)  # Subword token
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply tokenization and label alignment to the dataset
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/230 [00:00<?, ? examples/s]
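
The alignment rule can be checked without downloading a model. The following is a minimal sketch that applies the same logic to a hand-written `word_ids` list; the subword split shown is a made-up example (the real tokenizer may split the word differently), and `align_labels` is a hypothetical helper name. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
LABELS = ["O", "B-LOC", "I-LOC"]
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def align_labels(word_ids, word_labels):
    """Map word-level tags onto subword positions.

    word_ids has one entry per subword token: the index of the source word,
    or None for special tokens such as [CLS] and [SEP].
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token, or a continuation subword of the same word
            label_ids.append(IGNORE_INDEX)
        else:
            # First subword of a new word carries the word's label
            label_ids.append(LABELS.index(word_labels[word_idx]))
        previous_word_idx = word_idx
    return label_ids

# Hypothetical split: [CLS] Mount Ko ##sci ##uszko is high [SEP]
word_ids = [None, 0, 1, 1, 1, 2, 3, None]
word_labels = ["B-LOC", "I-LOC", "O", "O"]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, -100, -100, 0, 0, -100]
```

Labeling only the first subword keeps one prediction target per original word, so the number of scored positions matches the number of tags in the CoNLL file.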

## Dataset Splitting and Model Training

Here we split the dataset into training and evaluation sets, ensuring that 90% of the data is used for training and 10% for evaluation. We apply the tokenization process to both the training and evaluation datasets. Training arguments are then defined, and the Trainer instance is created to manage the training process. Finally, we call the training method to fine-tune the model.


In [7]:
# Split the dataset into train and test sets (e.g., 90% train, 10% test)
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

# Apply tokenization to the train and evaluation datasets
tokenized_train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_and_align_labels, batched=True)

# Define training arguments with evaluation enabled
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create the Trainer instance with both train and eval datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,  # Provide evaluation dataset
    tokenizer=tokenizer,
)

# Fine-tune the model
trainer.train()


Map:   0%|          | 0/207 [00:00<?, ? examples/s]

Map:   0%|          | 0/23 [00:00<?, ? examples/s]

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.229976        |
| 2     | No log        | 0.079729        |
| 3     | No log        | 0.048559        |


TrainOutput(global_step=39, training_loss=0.2855092806693835, metrics={'train_runtime': 162.7289, 'train_samples_per_second': 3.816, 'train_steps_per_second': 0.24, 'total_flos': 5070836030112.0, 'train_loss': 0.2855092806693835, 'epoch': 3.0})
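
The numbers in this output can be reproduced with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (the notebook sets neither), and takes the 207 training examples from the `Map` output above. It also suggests why the Training Loss column reads "No log": `TrainingArguments` logs every 500 steps by default, and this run has only 39.

```python
import math

train_examples = 207      # size of the 90% training split
batch_size = 16           # per_device_train_batch_size
num_epochs = 3            # num_train_epochs
train_runtime = 162.7289  # seconds, from TrainOutput

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)               # 13
print(total_steps)                   # 39
print(round(samples_per_second, 3))  # 3.816
print(round(steps_per_second, 2))    # 0.24
```

To see a training loss at each epoch in a run this short, one option is to pass a smaller `logging_steps` value to `TrainingArguments`, for example one equal to the number of steps per epoch.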

## NER Prediction Function

Finally, we define a function that runs the fine-tuned model on new sentences. The function tokenizes the input sentence, obtains predictions from the model, and maps the predicted class indices back to their labels. Predictions are printed per token, including the special [CLS] and [SEP] tokens. We also include a test example to demonstrate the model's predictions.


In [11]:
# Function to get NER predictions on a sentence
def predict_ner(sentence):
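    # Tokenize the sentence and move the tensors to the model's device
    inputs = tokenizer(sentence, return_tensors="pt").to(model.device)

    # Get predictions in evaluation mode, without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs).logits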

    # Get predicted labels
    predictions = outputs.argmax(dim=2)

    # Convert predictions to labels
    predicted_labels = [label_list[pred] for pred in predictions[0].tolist()]

    # Tokenized words
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Print each token (including [CLS] and [SEP]) with its predicted label
    for token, label in zip(tokens, predicted_labels):
        print(f"{token}: {label}")

# Test the model on a new sentence
sentence = "Mount Everest is the highest peak in the world"
predict_ner(sentence)


[CLS]: O
Mount: B-LOC
Everest: I-LOC
is: O
the: O
highest: O
peak: O
in: O
the: O
world: O
[SEP]: O
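
The per-token output above can be collapsed into mountain names. The following is a minimal sketch, not part of the original notebook: `extract_entities` is a hypothetical helper that assumes BERT-style `##` continuation pieces and the `B-LOC`/`I-LOC` scheme used here. A continuation piece is glued to the preceding word and its own predicted label is ignored, mirroring the training setup where only the first subword carries a label.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def extract_entities(tokens, labels):
    """Merge B-LOC/I-LOC token predictions into whole entity strings."""
    entities = []
    current = []  # words of the entity being built
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: extend the last word if inside an entity
            if current:
                current[-1] += token[2:]
            continue
        if label == "B-LOC":
            # A new entity starts; flush the previous one first
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == "I-LOC" and current:
            current.append(token)
        else:
            # Outside any entity: flush whatever was being built
            if current:
                entities.append(" ".join(current))
                current = []
    if current:
        entities.append(" ".join(current))
    return entities

# Tokens and labels copied from the prediction output above
tokens = ["[CLS]", "Mount", "Everest", "is", "the", "highest",
          "peak", "in", "the", "world", "[SEP]"]
labels = ["O", "B-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "O", "O"]
print(extract_entities(tokens, labels))  # ['Mount Everest']

# Made-up example with a split word, to show the '##' handling
print(extract_entities(["K", "##2", "is", "near", "Broad", "Peak"],
                       ["B-LOC", "O", "O", "O", "B-LOC", "I-LOC"]))
# ['K2', 'Broad Peak']
```

An `I-LOC` prediction with no preceding `B-LOC` is dropped in this sketch; treating it as the start of an entity instead would be an equally reasonable choice.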
