##Task 2.2

*  Train an NER model for extracting animal titles from the text. Please use a
transformer-based model (not an LLM).

####Making data

To train a transformer-based model for this purpose, we need a dataset of tagged words that includes tags for animals. Unfortunately, I could not find a good dataset online that both has word-level tags and contains animal names that are not too specific or scientific. This is why I decided to generate my own dataset.

To generate the dataset I combined an LLM approach with a sample-sentence approach. The overall process is as follows:
1.  The goal is to generate a sentence.
2.  A sentence consists of 3 parts: a beginning, a middle section and an ending.
3.  There are 10 sample phrases for each part (beginning, middle and ending), generated by ChatGPT-4o.
4.  Every beginning phrase contains a placeholder for an animal name, whereas middle sections and endings may or may not contain one.
5.  One phrase of each part is selected at random, the three are merged into one sentence, and a random animal name is then put into every placeholder.

Because there are 10 different animals and 10 phrases for each part of the sentence, there are 10 × 10 × 10 × 10 = 10000 possible unique sentences. I generate 5000 of them.
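The count can be checked on a smaller scale. The sketch below uses hypothetical two-item lists (not the ones from this notebook) and enumerates every combination with `itertools.product`. Because every beginning contains the placeholder, each combination of animal and phrases gives a distinct sentence.

```python
from itertools import product

# Hypothetical miniature lists, two entries each instead of ten
animals = ["dog", "cat"]
beginnings = ["A {animal} woke up.", "Once, a {animal} ran."]
middles = ["It rained.", "The {animal} hid."]
endings = ["The end.", "The {animal} slept."]

# Build every (animal, beginning, middle, ending) combination
sentences = {
    " ".join(parts).format(animal=animal)
    for animal in animals
    for parts in product(beginnings, middles, endings)
}

print(len(sentences))  # 2 * 2 * 2 * 2 combinations
```

With ten entries per list the same enumeration would give the 10000 sentences mentioned above.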

It is important to highlight that this approach is not ideal and the sentences sometimes make little sense, for example: *Once upon a time, a dog wandered into a mysterious forest. It met a wise old owl who shared a mysterious riddle. At last, the dog found what it had been searching for all along.*

The dataset is then saved into the $ner\_animal\_generated\_dataset.csv$ file.

In [1]:
import random
import numpy as np
import pandas as pd
import string

animal_names = ["dog", "horse", "elephant", "butterfly", "chicken", "cat", "cow", "sheep", "spider", "squirrel"]

beginnings = [
    "Once upon a time, a {animal} wandered into a mysterious forest.",
    "In a quiet village, a {animal} discovered an ancient secret.",
    "A curious {animal} stumbled upon a hidden cave.",
    "Long ago, a {animal} set off on a grand adventure.",
    "A lonely {animal} roamed the vast plains in search of something special.",
    "One evening, a {animal} found itself in an enchanted garden.",
    "Deep in the jungle, a {animal} heard a strange sound.",
    "A {animal} in the meadow noticed something glowing in the distance.",
    "Under the bright moon, a {animal} felt a strange pull towards the river.",
    "A {animal} in the desert uncovered a long-lost relic."
]

middles = [
    "It met a wise old {animal} who shared a mysterious riddle.",
    "A sudden storm forced it to seek shelter in a hidden cavern.",
    "The {animal} found a map leading to a legendary treasure.",
    "An unexpected friend, a talking bird, guided the {animal} along the way.",
    "It had to solve a puzzle to continue its journey.",
    "A mischievous creature tried to trick the {animal} out of its findings.",
    "The path was blocked by a giant boulder, but a kind {animal} helped move it.",
    "A magical pond reflected the {animal}'s deepest dreams.",
    "The {animal} discovered an ancient book filled with forgotten wisdom.",
    "A hidden passage led the {animal} into a secret underground world."
]

endings = [
    "At last, the {animal} found what it had been searching for all along.",
    "It returned home, wiser and braver than before.",
    "The journey changed the {animal} forever, filling its heart with joy.",
    "A newfound friendship made the adventure truly special.",
    "The {animal} realized that the real treasure was the memories made.",
    "With the mystery solved, the {animal} could finally rest.",
    "The enchanted land bid the {animal} farewell as it continued its journey.",
    "Having learned an important lesson, the {animal} shared its story with others.",
    "The {animal} knew it would return one day for another grand adventure.",
    "As the sun set, the {animal} smiled, knowing its adventure was only the beginning."
]

# Generate unique sentences
unique_sentences = set()
while len(unique_sentences) < 5000:
    animal = random.choice(animal_names)
    sentence = f"{random.choice(beginnings)} {random.choice(middles)} {random.choice(endings)}".format(animal=animal)

    if sentence not in unique_sentences:
        unique_sentences.add(sentence)

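# Restructure the dataset into one (sentence id, word, label) row per word
restructured_data = []
sentence_id = 1

for sentence in unique_sentences:
    for word in sentence.split():
        # Strip punctuation before labelling and ignore a possessive "'s",
        # so that words such as "dog's" are still tagged as animals
        clean = word.strip(string.punctuation)
        stem = clean[:-2] if clean.lower().endswith("'s") else clean
        label = "B-ANIMAL" if stem.lower() in animal_names else "O"
        restructured_data.append((sentence_id, clean, label))

    sentence_id += 1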

# Create a DataFrame
df_unique_sentences = pd.DataFrame(restructured_data, columns=["Sentence Number", "Word", "Label"])

# Save the dataset to CSV
file_path_unique_sentences = "ner_animal_generated_dataset.csv"
df_unique_sentences.to_csv(file_path_unique_sentences, index=False)

# Display the path of the saved file
file_path_unique_sentences

'ner_animal_generated_dataset.csv'

####Data preprocessing

To solve the problem I chose the BERT model to classify the words in a sentence, which is why I use the BERT tokenizer.

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizerFast, BertForTokenClassification, logging

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tqdm import tqdm


First we split the dataset into two parts: the sentences and the tags for the words in these sentences. We also need a label encoder fitted to the tags from the dataset; for this I use sklearn's $preprocessing.LabelEncoder$.

In [3]:
def process_data(data_path):
    df = pd.read_csv(data_path, encoding="latin-1")
    df["Sentence Number"] = df["Sentence Number"].ffill()

    enc_label = preprocessing.LabelEncoder()

    df["Label"] = enc_label.fit_transform(df["Label"])

    sentences = df.groupby("Sentence Number")["Word"].apply(list).values
    tag = df.groupby("Sentence Number")["Label"].apply(list).values
    return sentences, tag, enc_label

sentence,tag,enc_label = process_data("ner_animal_generated_dataset.csv")
animal_id = enc_label.transform(["B-ANIMAL"])[0]
o_id      = enc_label.transform(["O"])[0]
pad_id    = max(animal_id, o_id) + 1
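
`LabelEncoder` assigns ids in sorted order of the class names, so it is worth checking which id each tag ends up with. Below is a minimal pure-Python sketch of that mapping; it is an illustration rather than sklearn's implementation, and the `labels` list is a made-up sample.

```python
# Made-up sample of word tags
labels = ["O", "O", "B-ANIMAL", "O"]

# Sorted unique classes, numbered from 0 (mirrors how LabelEncoder orders classes)
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}

animal_id = label_to_id["B-ANIMAL"]
o_id = label_to_id["O"]
pad_id = max(animal_id, o_id) + 1   # extra id reserved for special/padding tokens

encoded = [label_to_id[label] for label in labels]
print(label_to_id, pad_id, encoded)
```

Under this ordering "B-ANIMAL" gets id 0, "O" gets id 1, and the padding id is 2, which matters whenever per-class values (such as loss weights) are listed by position.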




Because models do not work with raw text, we need to convert the words in the sentences into tokens. For this I use $BertTokenizerFast$.

In [4]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
MAX_LEN = 64

def tokenize(sentences, tags, max_len=MAX_LEN, batch_size=32):
    all_input_ids = []
    all_attention_masks = []
    all_labels = []

    n = len(sentences)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch_sents = list(sentences[start:end])
        batch_tags  = tags[start:end]

        encoded = tokenizer(
            batch_sents,
            is_split_into_words=True,
            add_special_tokens=True,
            max_length=max_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
        )

        batch_input_ids = encoded["input_ids"]
        batch_attention = encoded["attention_mask"]

        batch_labels = []
        for i, word_labels in enumerate(batch_tags):
            word_ids = encoded.word_ids(batch_index=i)

            token_labels = []
            for wid in word_ids:
                if wid is None:
                    token_labels.append(pad_id)
                else:
                    token_labels.append(word_labels[wid])

            batch_labels.append(token_labels)

        all_input_ids.append(np.array(batch_input_ids))
        all_attention_masks.append(np.array(batch_attention))
        all_labels.append(np.array(batch_labels))

    input_ids = np.concatenate(all_input_ids, axis=0)
    attention_masks = np.concatenate(all_attention_masks, axis=0)
    labels = np.concatenate(all_labels, axis=0)

    return input_ids, attention_masks, labels
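
The inner loop of `tokenize` relies on `word_ids()`, which gives, for every token position, the index of the word it came from, or `None` for special and padding tokens. Here is a stripped-down sketch of that alignment step on hand-written values; the `word_ids` list is hypothetical, not real tokenizer output, and `align_labels` is an illustrative helper name.

```python
def align_labels(word_ids, word_labels, pad_id):
    """Give every token the label of its source word; special/pad tokens get pad_id."""
    return [pad_id if wid is None else word_labels[wid] for wid in word_ids]

# Hypothetical example: 3 words, the second one split into two word pieces,
# wrapped in [CLS]/[SEP] and followed by one padding position
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [1, 0, 1]   # O, B-ANIMAL, O (ids as produced by the label encoder)

token_labels = align_labels(word_ids, word_labels, pad_id=2)
print(token_labels)  # [2, 1, 0, 0, 1, 2, 2]
```

Note that every word piece of a split word inherits the word's label, so a split animal name produces several animal-tagged tokens.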




We then split the data into training and testing datasets and tokenize them.

In [5]:
X_train,X_test,y_train,y_test = train_test_split(sentence,tag,random_state=42,test_size=0.1)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((4500,), (500,), (4500,), (500,))

In [6]:
input_ids,attention_mask, labels = tokenize(X_train, y_train)

In [7]:
val_input_ids,val_attention_mask, val_labels = tokenize(X_test, y_test)

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_ids_t     = torch.tensor(input_ids, dtype=torch.long)
attention_mask_t = torch.tensor(attention_mask, dtype=torch.long)
labels_t        = torch.tensor(labels, dtype=torch.long)

val_input_ids_t       = torch.tensor(val_input_ids, dtype=torch.long)
val_attention_mask_t  = torch.tensor(val_attention_mask, dtype=torch.long)
val_labels_t          = torch.tensor(val_labels, dtype=torch.long)


In [9]:
train_dataset = TensorDataset(input_ids_t, attention_mask_t, labels_t)
val_dataset   = TensorDataset(val_input_ids_t, val_attention_mask_t, val_labels_t)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=16, shuffle=False)


Because sentences differ in length, the inputs are not uniform in size, so for the model to work we need to pad them. I chose to pad every sentence to 64 tokens.
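
In the notebook the tokenizer handles this through `padding="max_length"` and `truncation=True`. The idea itself can be sketched in a few lines; `pad_or_truncate` is a hypothetical helper, the middle token ids are made up, and the truncation here is naive (unlike the tokenizer, it does not keep the final `[SEP]`).

```python
def pad_or_truncate(token_ids, max_len, pad_token_id=0):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)              # 1 marks a real token
    missing = max_len - len(ids)
    return ids + [pad_token_id] * missing, mask + [0] * missing

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the ids in between are made up
ids, mask = pad_or_truncate([101, 2146, 3283, 102], max_len=6)
print(ids)   # [101, 2146, 3283, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions are padding, so the padded positions do not influence the real tokens.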

In [10]:
# TEST: check the padded/truncated lengths
was = list()
for i in range(len(input_ids)):
    was.append(len(input_ids[i]))
set(was)

{64}

In [11]:
# Test padding
test_tag = list()
for i in range(len(y_test)):
    test_tag.append(np.array(y_test[i] + [1] * (MAX_LEN-len(y_test[i]))))

# TEST:  Checking Padding Length
was = list()
for i in range(len(test_tag)):
    was.append(len(test_tag[i]))
set(was)

{64}

In [12]:
# Train Padding
train_tag = list()
for i in range(len(y_train)):
    train_tag.append(np.array(y_train[i] + [1] * (MAX_LEN-len(y_train[i]))))

# TEST:  Checking Padding Length
was = list()
for i in range(len(train_tag)):
    was.append(len(train_tag[i]))
set(was)

{64}

####Model compilation and training

Now it comes to designing the model. As mentioned above, I use BERT to classify tokens; $BertForTokenClassification$ adds a dropout layer and a linear classification layer on top of it.

What is important, however, is that the classes in our dataset are highly imbalanced, i.e. there are many more "other" words than words tagged as animals. This is why we need a custom loss function that heavily rewards correct classification of the animal tag to account for the class imbalance.
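
To make the idea concrete, here is a small pure-Python sketch of a class-weighted cross-entropy that skips padding positions. It is an illustration under stated assumptions, not the code used for training: the probabilities are invented, the ids (animal 0, other 1, padding 2) follow the label encoding above, and the result is normalized by the sum of the weights of the counted targets, which is my understanding of how PyTorch's `CrossEntropyLoss` averages when `weight` is given.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights, ignore_index):
    """Weighted mean of -log p[target], skipping positions labelled ignore_index."""
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, targets):
        if y == ignore_index:
            continue
        w = class_weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0

ANIMAL, OTHER, PAD = 0, 1, 2
probs = [
    [0.5, 0.5, 0.0],   # true class ANIMAL, model is unsure
    [0.1, 0.9, 0.0],   # true class OTHER, model is confident
    [0.3, 0.3, 0.4],   # padding position, ignored
]
targets = [ANIMAL, OTHER, PAD]

plain = weighted_cross_entropy(probs, targets, [1.0, 1.0, 0.0], ignore_index=PAD)
weighted = weighted_cross_entropy(probs, targets, [100.0, 1.0, 0.0], ignore_index=PAD)
print(f"unweighted: {plain:.4f}, animal x100: {weighted:.4f}")
```

With the animal class weighted 100 times, the uncertain animal prediction dominates the loss, which is the intended effect. The weight vector is indexed by class id, so the large weight has to sit at the animal id.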

In [17]:



# def custom_loss(y_true, y_pred):
#   loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=False)
#   mask = tf.cast(tf.not_equal(y_true, pad_id), tf.float32)
#   animal_mask = tf.cast(tf.equal(y_true, animal_id), tf.float32)
#   weight = 1.0 + 99.0 * animal_mask
#   loss = loss * mask * weight
#   return tf.reduce_sum(loss) / tf.reduce_sum(mask)

# def custom_accuracy(y_true, y_pred):
#   y_pred_labels = tf.argmax(y_pred, axis=-1, output_type=y_true.dtype)
#   mask = tf.cast(tf.not_equal(y_true, pad_id), tf.float32)
#   matches = tf.cast(tf.equal(y_true, y_pred_labels), tf.float32)
#   matches = matches * mask
#   return tf.reduce_sum(matches) / tf.reduce_sum(mask)

# def create_model(bert_model,max_len = MAX_LEN):
#   input_ids = tf.keras.Input(shape = (max_len,),dtype = 'int32')
#   attention_masks = tf.keras.Input(shape = (max_len,),dtype = 'int32')
#   bert_output = bert_model(input_ids,attention_mask = attention_masks,return_dict =True)
#   embedding = tf.keras.layers.Dropout(0.3)(bert_output[0])
#   output = tf.keras.layers.Dense(3,activation = 'softmax')(embedding)
#   model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = [output])
#   model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-5), loss=custom_loss, metrics=[custom_accuracy])
#   return model

def custom_accuracy_pytorch(logits, labels, pad_id):
    preds = torch.argmax(logits, dim=-1)
    mask = labels != pad_id
    correct = ((preds == labels) & mask).sum().item()
    total = mask.sum().item()
    if total == 0:
        return 0.0
    return correct / total

model = BertForTokenClassification.from_pretrained('bert-base-uncased',num_labels=3)
model.to(device)
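# Weight classes by their encoded ids: animal x100, "O" x1, padding x0
class_weights = torch.zeros(3, device=device)
class_weights[int(o_id)] = 1.0
class_weights[int(animal_id)] = 100.0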
criterion = nn.CrossEntropyLoss(
    weight=class_weights,
    ignore_index=pad_id,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# model.summary()

In [15]:
# history_bert = model.fit([input_ids,attention_mask],np.array(train_tag),validation_data = ([val_input_ids,val_attention_mask],np.array(test_tag)),epochs = 30,batch_size = 32)

In [19]:
for epoch in range(3):
    model.train()
    total_loss = 0.0

    for batch in train_loader:
        input_ids_b, attention_mask_b, labels_b = [t.to(device) for t in batch]
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids_b, attention_mask=attention_mask_b,)
        logits = outputs.logits
        loss = criterion(logits.view(-1, 3), labels_b.view(-1),)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_train_loss = total_loss / len(train_loader)

    model.eval()
    val_acc = 0.0
    n_batches = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids_b, attention_mask_b, labels_b = [t.to(device) for t in batch]
            outputs = model(input_ids=input_ids_b,attention_mask=attention_mask_b,)
            logits = outputs.logits
            val_acc += custom_accuracy_pytorch(logits, labels_b, pad_id)
            n_batches += 1

    val_acc /= max(n_batches, 1)
    print(f"Epoch {epoch+1}: train loss={avg_train_loss:.4f}, val masked acc={val_acc:.4f}")

Epoch 1: train loss=0.0001, val masked acc=1.0000
Epoch 2: train loss=0.0001, val masked acc=1.0000
Epoch 3: train loss=0.0000, val masked acc=1.0000


####Results

Here is some code with which we can test the classifier. As the output shows, the predicted tags do not match the original tags. After a bit of research I realized that, despite manipulating the dataset and adding a custom loss function, the classifier still exploits the class imbalance and assigns the "other" class to all tokens; because there are far fewer animal tokens than other tokens, it still gets a high score.
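
A toy calculation shows why accuracy alone can hide this behaviour. The sketch below scores a hypothetical classifier that always predicts "other" on a 33-token sentence with 3 animal tokens; the helper names and the label layout are illustrative assumptions, with ids as in the label encoding above.

```python
def masked_accuracy(true_ids, pred_ids, pad_id):
    """Accuracy over positions whose true label is not pad_id."""
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != pad_id]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0

def recall(true_ids, pred_ids, class_id):
    """Share of positions with true label class_id that were predicted as class_id."""
    preds = [p for t, p in zip(true_ids, pred_ids) if t == class_id]
    return sum(p == class_id for p in preds) / len(preds) if preds else 0.0

ANIMAL, OTHER, PAD = 0, 1, 2
# [CLS] + 33 real tokens (3 of them animals) + [SEP]
true_ids = ([PAD] + [OTHER] * 3 + [ANIMAL] + [OTHER] * 13 + [ANIMAL]
            + [OTHER] * 8 + [ANIMAL] + [OTHER] * 6 + [PAD])
all_other = [OTHER] * len(true_ids)   # classifier that never predicts an animal

acc = masked_accuracy(true_ids, all_other, PAD)
animal_recall = recall(true_ids, all_other, ANIMAL)
print(f"masked accuracy: {acc:.3f}, animal recall: {animal_recall:.3f}")
```

Reporting recall (or precision and F1) for the animal class next to the masked accuracy would make such a degenerate solution visible immediately.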

In [None]:
# def pred(val_input_ids,val_attention_mask):
#     return model.predict([val_input_ids,val_attention_mask])

# def testing(val_input_ids,val_attention_mask,enc_tag,y_test):
#     val_input = val_input_ids.reshape(1,MAX_LEN)
#     val_attention = val_attention_mask.reshape(1,MAX_LEN)

#     # Print Original Sentence
#     sentence = tokenizer.decode(val_input_ids[val_input_ids > 0])
#     print("Original Text : ",str(sentence))
#     print("\n")
#     print(y_test)
#     true_enc_tag = enc_tag.inverse_transform(y_test)

#     print("Original Tags : " ,str(true_enc_tag))
#     print("\n")

#     predictions = pred(val_input,val_attention)
#     pred_with_pad = np.argmax(predictions,axis = -1)
#     pred_without_pad = pred_with_pad[pred_with_pad>0]
#     pred_enc_tag = enc_tag.inverse_transform(pred_without_pad)
#     print("Predicted Tags : ",pred_enc_tag)

In [20]:
def pred(val_input_ids, val_attention_mask):
    model.eval()
    with torch.no_grad():
        input_ids_t = torch.tensor(val_input_ids, dtype=torch.long).unsqueeze(0).to(device)
        attention_t = torch.tensor(val_attention_mask, dtype=torch.long).unsqueeze(0).to(device)
        outputs = model(input_ids=input_ids_t, attention_mask=attention_t)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
    return probs.squeeze(0).cpu().numpy()

In [None]:
def testing(val_input_ids, val_attention_mask, enc_tag, y_true_ids):
    sentence_ids = val_input_ids[val_input_ids > 0]
    sentence = tokenizer.decode(sentence_ids)
    print("Original Text:", sentence)
    print()

    print("True label ids:", y_true_ids)
    true_no_pad = y_true_ids[y_true_ids != pad_id]
    true_tags = enc_tag.inverse_transform(true_no_pad)
    print("Original Tags:", true_tags)
    print()

    probs = pred(val_input_ids, val_attention_mask)
    pred_ids = np.argmax(probs, axis=-1)
    pred_no_pad = pred_ids[pred_ids != pad_id]
    pred_tags = enc_tag.inverse_transform(pred_no_pad)
    print("Predicted Tags:", pred_tags)

    animals = []
    usable_ids = val_input_ids[val_input_ids > 0]
    tokens = tokenizer.convert_ids_to_tokens(usable_ids)
    print("Tokens:", tokens)
    print()
    for token, tag in zip(tokens, pred_tags):
        if tag == "B-ANIMAL":
            clean = token.replace("##", "")
            animals.append(clean)
    print("Predicted animal words:", animals)

In [30]:
idx = 25
testing(val_input_ids[idx], val_attention_mask[idx], enc_label, val_labels[idx])


Original Text: [CLS] long ago a chicken set off on a grand adventure a mischievous fox tried to trick the chicken out of its findings the journey changed the chicken forever filling its heart with joy [SEP]

True label ids: [2 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Original Tags: ['O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O'
 'O' 'O']

Predicted Tags: ['O' 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O'
 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']
Tokens: ['[CLS]', 'long', 'ago', 'a', 'chicken', 'set', 'off', 'on', 'a', 'grand', 'adventure', 'a', 'mischievous', 'fox', 'tried', 'to', 'trick', 'the', 'chicken', 'o
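
When reading this output, note that `Original Tags` has the special-token and padding positions removed, while `Predicted Tags` is filtered by predicted id and therefore still contains them, so the two lists are shifted by one position relative to each other. One possible way to compare them position by position is to apply the same true-label mask to both. The sketch below uses short invented sequences and a hypothetical helper name, with ids as in the label encoding above.

```python
def aligned_tags(true_ids, pred_ids, pad_id, id_to_tag):
    """Keep only positions whose true label is not pad_id, for both sequences."""
    keep = [i for i, t in enumerate(true_ids) if t != pad_id]
    true_tags = [id_to_tag[true_ids[i]] for i in keep]
    pred_tags = [id_to_tag[pred_ids[i]] for i in keep]
    return true_tags, pred_tags

id_to_tag = {0: "B-ANIMAL", 1: "O"}
# Invented example: [CLS] long ago a chicken set [SEP] [PAD]
true_ids = [2, 1, 1, 1, 0, 1, 2, 2]
pred_ids = [1, 1, 1, 1, 0, 1, 0, 1]   # predictions at ignored positions are arbitrary

true_tags, pred_tags = aligned_tags(true_ids, pred_ids, pad_id=2, id_to_tag=id_to_tag)
print(true_tags)
print(pred_tags)
```

After masking, both lists have one entry per real token, so any remaining difference reflects an actual misclassification rather than an offset.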

So that the model does not have to be retrained every time the program is launched, I save it to a file.

In [None]:
torch.save(model.state_dict(), "ner.pt")

####Conclusions

In this task I learned to work with NLP, word tokenization and the BERT model with TensorFlow and PyTorch. I tried to build a token classifier to recognize the names of animals, but could not achieve a result good enough to produce correct output on validation sentences.