##Task 2.2

*  train NER model for extracting animal titles from the text. Please use some
transformer-based model (not LLM).

####Making data

To train a transformer-based model for this purpose we need a dataset of word-level tags that includes a tag for animals. Unfortunately, I could not find a suitable dataset online, one that has tagged words and uses common rather than highly specific or scientific animal names. This is why I decided to generate my own dataset.

To generate the dataset I combined an LLM approach with a template-based approach. The overall process is as follows:
1.  The goal is to generate a sentence.
2.  A sentence consists of 3 parts: a beginning, a middle section and an ending.
3.  There are 10 sample phrases for each part (beginning, middle and ending), generated by ChatGPT-4o.
4.  Every beginning contains a placeholder for the animal name, whereas middle sections and endings may or may not contain one.
5.  One phrase of each part is selected at random, the three are merged into one sentence, and a random animal name is put into every placeholder.

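Because there are 10 different animals and 10 options for each of the three parts, there are 10 × 10 × 10 × 10 = 10,000 possible unique sentences (every beginning contains the animal name, so no two combinations coincide). I generate 5,000 of them.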

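It is important to note that this approach is not ideal and the sentences sometimes make little sense, for example: *Once upon a time, a dog wandered into a mysterious forest. It met a wise old owl who shared a mysterious riddle. At last, the dog found what it had been searching for all along.* Note also that the templates themselves mention animals (owl, parrot, fox, bear) that are not in the animal list and are therefore always labeled "O".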

The dataset is then saved into the $ner\_animal\_generated\_dataset.csv$ file.
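
To make the file layout concrete, the following minimal sketch tags a single sentence in the same spirit as the generation code and prints the rows that would end up in the CSV. The two-animal set and the `tag_sentence` helper are illustrative assumptions, not part of the original notebook.

```python
import string

# Illustrative subset of the animal list (assumption: the real list has 10 names).
ANIMALS = {"dog", "horse"}

def tag_sentence(sentence_id, sentence):
    """Return (sentence id, word, label) rows for one sentence."""
    rows = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)  # drop commas, full stops, etc.
        label = "B-ANIMAL" if word.lower() in ANIMALS else "O"
        rows.append((sentence_id, word, label))
    return rows

rows = tag_sentence(1, "Long ago, a dog set off on a grand adventure.")
for row in rows:
    print(row)
```

Each row corresponds to one line of the CSV: the sentence number, the cleaned word and its tag.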

In [1]:
import random
import numpy as np
import pandas as pd
import string

animal_names = ["dog", "horse", "elephant", "butterfly", "chicken", "cat", "cow", "sheep", "spider", "squirrel"]

beginnings = [
    "Once upon a time, a {animal} wandered into a mysterious forest.",
    "In a quiet village, a {animal} discovered an ancient secret.",
    "A curious {animal} stumbled upon a hidden cave.",
    "Long ago, a {animal} set off on a grand adventure.",
    "A lonely {animal} roamed the vast plains in search of something special.",
    "One evening, a {animal} found itself in an enchanted garden.",
    "Deep in the jungle, a {animal} heard a strange sound.",
    "A {animal} in the meadow noticed something glowing in the distance.",
    "Under the bright moon, a {animal} felt a strange pull towards the river.",
    "A {animal} in the desert uncovered a long-lost relic."
]

middles = [
    "It met a wise old owl who shared a mysterious riddle.",
    "A sudden storm forced it to seek shelter in a hidden cavern.",
    "The {animal} found a map leading to a legendary treasure.",
    "An unexpected friend, a talking parrot, guided the {animal} along the way.",
    "It had to solve a puzzle to continue its journey.",
    "A mischievous fox tried to trick the {animal} out of its findings.",
    "The path was blocked by a giant boulder, but a kind bear helped move it.",
    "A magical pond reflected the {animal}'s deepest dreams.",
    "The {animal} discovered an ancient book filled with forgotten wisdom.",
    "A hidden passage led the {animal} into a secret underground world."
]

endings = [
    "At last, the {animal} found what it had been searching for all along.",
    "It returned home, wiser and braver than before.",
    "The journey changed the {animal} forever, filling its heart with joy.",
    "A newfound friendship made the adventure truly special.",
    "The {animal} realized that the real treasure was the memories made.",
    "With the mystery solved, the {animal} could finally rest.",
    "The enchanted land bid the {animal} farewell as it continued its journey.",
    "Having learned an important lesson, the {animal} shared its story with others.",
    "The {animal} knew it would return one day for another grand adventure.",
    "As the sun set, the {animal} smiled, knowing its adventure was only the beginning."
]

# Generate unique sentences
unique_sentences = set()
while len(unique_sentences) < 5000:
    animal = random.choice(animal_names)
    sentence = f"{random.choice(beginnings)} {random.choice(middles)} {random.choice(endings)}".format(animal=animal)

    # A set silently ignores duplicates, so no membership check is needed
    unique_sentences.add(sentence)

# Restructure the dataset
restructured_data = []
sentence_id = 1

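for sentence in unique_sentences:
    # Strip surrounding punctuation first so that labels are computed on the
    # same token text that is written to the CSV (e.g. "dog." becomes "dog").
    # Possessive forms such as "dog's" are not matched and stay labeled "O".
    words = [word.strip(string.punctuation) for word in sentence.split()]
    labels = ["B-ANIMAL" if word.lower() in animal_names else "O" for word in words]

    for word, label in zip(words, labels):
        restructured_data.append((sentence_id, word, label))

    sentence_id += 1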

# Create a DataFrame
df_unique_sentences = pd.DataFrame(restructured_data, columns=["Sentence Number", "Word", "Label"])

# Save the dataset to CSV
file_path_unique_sentences = "ner_animal_generated_dataset.csv"
df_unique_sentences.to_csv(file_path_unique_sentences, index=False)

# Display the path of the saved file
file_path_unique_sentences

'ner_animal_generated_dataset.csv'

####Data preprocessing

To solve the problem I chose BERT to classify the words in a sentence, which is why I use the matching BERT tokenizer.

In [2]:
import tensorflow as tf
from transformers import TFBertModel, logging
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from transformers import BertTokenizerFast

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tqdm import tqdm


First we split the dataset into two parts: the sentences and the tags for the words in these sentences. We also need a label encoder fitted to the tags from the dataset. For this I use the sklearn $preprocessing.LabelEncoder$, which assigns ids in alphabetical order, so "B-ANIMAL" becomes 0 and "O" becomes 1.

In [3]:
def process_data(data_path):
    df = pd.read_csv(data_path, encoding="latin-1")
    # ffill() replaces the deprecated fillna(method="ffill")
    df.loc[:, "Sentence Number"] = df["Sentence Number"].ffill()

    enc_label = preprocessing.LabelEncoder()

    df.loc[:, "Label"] = enc_label.fit_transform(df["Label"])

    sentences = df.groupby("Sentence Number")["Word"].apply(list).values
    tag = df.groupby("Sentence Number")["Label"].apply(list).values
    return sentences, tag, enc_label

sentence,tag,enc_label = process_data("ner_animal_generated_dataset.csv")




Because models operate on numbers rather than raw text, the words in each sentence must be converted to token ids. For this I use $BertTokenizerFast$.

In [4]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
MAX_LEN = 64
def tokenize(data,max_len = MAX_LEN):
    input_ids = list()
    attention_mask = list()
    for i in tqdm(range(len(data))):
        # data[i] is a list of words, hence is_split_into_words=True
        encoded = tokenizer.encode_plus(data[i],
                                        add_special_tokens = True,
                                        max_length = max_len,  # use the argument, not the global
                                        is_split_into_words=True,
                                        return_attention_mask=True,
                                        padding = 'max_length',
                                        truncation=True,return_tensors = 'np')

        input_ids.append(encoded['input_ids'])
        attention_mask.append(encoded['attention_mask'])
    # Stack the per-sentence (1, max_len) arrays into (n_sentences, max_len)
    return np.vstack(input_ids),np.vstack(attention_mask)
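
One detail the tokenization above does not cover is that the tags are per word, while BERT sees `[CLS]`, `[SEP]`, padding and possibly several sub-word pieces per word, so position i in the token sequence is generally not word i. A minimal sketch of one common way to align the two is shown below. It assumes the word index of every token is available, as returned by the `word_ids()` method of the encoding produced by a fast tokenizer; the `align_labels` helper, the hand-written `word_ids` list and the `-100` ignore value are illustrative assumptions.

```python
IGNORE = -100  # assumed marker for positions that should not contribute to the loss

def align_labels(word_ids, word_labels, ignore=IGNORE):
    """Expand per-word labels to per-token labels.

    word_ids holds, for every token, the index of the word it came from,
    or None for special tokens ([CLS], [SEP]) and padding.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore)                # special token or padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(ignore)                # later pieces of the same word
        previous = word_id
    return aligned

# Hand-written example: [CLS], word 0, word 1 split into two pieces, word 2,
# [SEP], then two padding positions.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [1, 0, 1]  # 1 = "O", 0 = "B-ANIMAL", as in the label encoding used here
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

With labels aligned this way, the loss has to skip the ignored positions, for example by multiplying the per-token loss by a 0/1 mask before averaging.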



We then split the data into training and testing datasets and tokenize them.

In [5]:
X_train,X_test,y_train,y_test = train_test_split(sentence,tag,random_state=42,test_size=0.1)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((4500,), (500,), (4500,), (500,))

In [6]:
input_ids,attention_mask = tokenize(X_train,max_len = MAX_LEN)

100%|██████████| 4500/4500 [00:01<00:00, 3035.86it/s]


In [7]:
val_input_ids,val_attention_mask = tokenize(X_test,max_len = MAX_LEN)

100%|██████████| 500/500 [00:00<00:00, 3066.08it/s]


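Sentences differ in length, so both the token ids and the label sequences must be brought to a common length for the model to work. The tokenizer has already padded the inputs to 64 tokens; the cells below verify this and pad the label sequences to the same length, using 1 (the id of "O") as the padding label.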

In [8]:
# TEST: check padding and truncation lengths
was = list()
for i in range(len(input_ids)):
    was.append(len(input_ids[i]))
set(was)

{64}

In [9]:
# Test padding: extend each label list to MAX_LEN with 1, the id of "O"
test_tag = list()
for i in range(len(y_test)):
    test_tag.append(np.array(y_test[i] + [1] * (MAX_LEN-len(y_test[i]))))

# TEST:  Checking Padding Length
was = list()
for i in range(len(test_tag)):
    was.append(len(test_tag[i]))
set(was)

{64}

In [10]:
# Train padding: extend each label list to MAX_LEN with 1, the id of "O"
train_tag = list()
for i in range(len(y_train)):
    train_tag.append(np.array(y_train[i] + [1] * (MAX_LEN-len(y_train[i]))))

# TEST:  Checking Padding Length
was = list()
for i in range(len(train_tag)):
    was.append(len(train_tag[i]))
set(was)

{64}
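
The label padding above uses 1, which is also the id of "O", so padded positions are indistinguishable from real "O" tokens and add to the class imbalance. A minimal sketch of an alternative, assuming a separate 0/1 mask is kept alongside the labels, is shown below; `pad_labels` is a hypothetical helper, and unlike the cells above it also truncates sequences that are longer than the target length.

```python
def pad_labels(labels, max_len, pad_value=1):
    """Pad or truncate a label list to max_len and return (labels, mask).

    mask is 1 for real positions and 0 for padding, so padded positions
    can later be excluded from the loss and from the metrics.
    """
    kept = list(labels)[:max_len]
    n_pad = max_len - len(kept)
    padded = kept + [pad_value] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask

padded, mask = pad_labels([1, 1, 0, 1], max_len=6)
print(padded)  # [1, 1, 0, 1, 1, 1]
print(mask)    # [1, 1, 1, 1, 0, 0]
```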

####Model compilation and training

Now we design the model. As mentioned above, I use $TFBertModel$ to produce a representation for every token and add a dropout layer and a dense softmax layer on top to classify each token.

What is important, however, is that the classes in our dataset are highly imbalanced, i.e. there are many more "other" words than words tagged as animals. This is why we need a custom loss function that weights errors on the animal tag much more heavily to account for the class imbalance.
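
To see what such a weighting does, the sketch below computes the weighted cross-entropy by hand for a toy sequence with one animal token among ten, comparing a predictor that always says "O" with one that always says "B-ANIMAL". The 0.99 confidence, the ten-token sequence and the helper name `weighted_ce` are illustrative assumptions; only the weight of 100 on class 0 mirrors the loss used in this notebook.

```python
import math

def weighted_ce(y_true, p_animal, animal_weight):
    """Mean weighted cross-entropy for a two-class tagger.

    y_true: list of class ids (0 = "B-ANIMAL", 1 = "O").
    p_animal: probability the model assigns to class 0 at every position.
    """
    total = 0.0
    for label in y_true:
        p_correct = p_animal if label == 0 else 1.0 - p_animal
        weight = animal_weight if label == 0 else 1.0
        total += -math.log(p_correct) * weight
    return total / len(y_true)

y_true = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # one animal token out of ten

for w in (1.0, 100.0):
    always_other = weighted_ce(y_true, p_animal=0.01, animal_weight=w)
    always_animal = weighted_ce(y_true, p_animal=0.99, animal_weight=w)
    print(f"weight={w:>5}: always-O={always_other:.3f}, always-animal={always_animal:.3f}")
```

In this toy setting the always-"O" predictor has the lower loss with a weight of 1, while with a weight of 100 the ordering flips and labeling everything as an animal becomes the cheaper shortcut. The weight therefore needs tuning against the actual class ratio rather than simply being made large.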

In [11]:



def custom_loss(y_true, y_pred):
  # Compute the sparse categorical crossentropy loss per token.
  loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
  # For tokens where y_true is 0 ("B-ANIMAL"), multiply the loss by a factor (e.g. 100)
  animal_weight = tf.where(tf.equal(y_true, 0), 100.0, 1.0)
  weighted_loss = loss * animal_weight
  return tf.reduce_mean(weighted_loss)

def create_model(bert_model,max_len = MAX_LEN):
  input_ids = tf.keras.Input(shape = (max_len,),dtype = 'int32')
  attention_masks = tf.keras.Input(shape = (max_len,),dtype = 'int32')
  bert_output = bert_model(input_ids,attention_mask = attention_masks,return_dict =True)
  embedding = tf.keras.layers.Dropout(0.3)(bert_output[0])
  output = tf.keras.layers.Dense(2,activation = 'softmax')(embedding)
  model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = [output])
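  # `learning_rate` is the supported keyword; the deprecated `lr` alias may be
  # ignored or rejected by recent Keras versions
  model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=0.00001), loss=custom_loss, metrics=["accuracy"])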
  return model

bert_model = TFBertModel.from_pretrained(pretrained_model_name_or_path='bert-base-uncased', num_labels=2)
model = create_model(bert_model, MAX_LEN)


In [12]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 64)]                 0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 64)]                 0         []                            
                                                                                                  
 tf_bert_model (TFBertModel  TFBaseModelOutputWithPooli   1094822   ['input_1[0][0]',             
 )                           ngAndCrossAttentions(last_   40         'input_2[0][0]']             
                             hidden_state=(None, 64, 76                                           
                             8),                                                              

In [13]:
history_bert = model.fit([input_ids,attention_mask],np.array(train_tag),validation_data = ([val_input_ids,val_attention_mask],np.array(test_tag)),epochs = 30,batch_size = 32)

Epoch 1/30
...
Epoch 30/30


####Results

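The code below tests the classifier on a single validation sentence. As the output shows, it does not produce correct tags. My conclusion after some research was that, despite the generated dataset and the custom loss function, the classifier still takes a shortcut and assigns a single class to every token; because there are far fewer animal tokens than other tokens, predicting "other" everywhere is already enough to reach a high accuracy. One caveat about reading the output: the helper keeps only predictions greater than 0, and under the label encoding used here 0 is "B-ANIMAL" and 1 is "O", so the empty list printed below means that no position was predicted as "O", i.e. in this run the collapse is towards class 0 rather than towards "other".

Unfortunately, I was not able to resolve this issue.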
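
Accuracy is not a reliable signal here, because a tagger that outputs a single class for every token can still score highly when that class dominates. A minimal sketch of per-class precision, recall and F1 for the animal tag is given below; the helper name `animal_prf` and the toy label lists are illustrative assumptions, with 0 standing for "B-ANIMAL" and 1 for "O" as in the encoded labels.

```python
def animal_prf(y_true, y_pred, positive=0):
    """Precision, recall and F1 for the positive class (0 = "B-ANIMAL")."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
all_other = [1] * len(y_true)  # tagger that never predicts an animal
accuracy = sum(t == p for t, p in zip(y_true, all_other)) / len(y_true)

print("accuracy:", accuracy)
print("precision, recall, F1:", animal_prf(y_true, all_other))
```

The degenerate tagger reaches high accuracy while its recall on the animal class is zero, which is why the animal-class metrics are the ones worth tracking during training.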

In [14]:
def pred(val_input_ids,val_attention_mask):
    return model.predict([val_input_ids,val_attention_mask])

def testing(val_input_ids,val_attention_mask,enc_tag,y_test):
    val_input = val_input_ids.reshape(1,MAX_LEN)
    val_attention = val_attention_mask.reshape(1,MAX_LEN)

    # Print Original Sentence
    sentence = tokenizer.decode(val_input_ids[val_input_ids > 0])
    print("Original Text : ",str(sentence))
    print("\n")
    print(y_test)
    true_enc_tag = enc_tag.inverse_transform(y_test)

    print("Original Tags : " ,str(true_enc_tag))
    print("\n")

    predictions = pred(val_input,val_attention)
    pred_with_pad = np.argmax(predictions,axis = -1)
    pred_without_pad = pred_with_pad[pred_with_pad>0]
    pred_enc_tag = enc_tag.inverse_transform(pred_without_pad)
    print("Predicted Tags : ",pred_enc_tag)

In [15]:
testing(val_input_ids[0],val_attention_mask[0],enc_label,y_test[0])

Original Text :  [CLS] one evening a horse found itself in an enchanted garden the horse discovered an ancient book filled with forgotten wisdom the horse knew it would return one day for another grand adventure [SEP]


[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Original Tags :  ['O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O'
 'O' 'O' 'O' 'O' 'O' 'O' 'B-ANIMAL' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'O']


Predicted Tags :  []
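
The empty list above comes from the filter `pred_with_pad > 0`, which keeps only class 1 and therefore cannot show any "B-ANIMAL" prediction. A minimal sketch of decoding that uses the attention mask to drop padding instead is shown below; the `decode_tags` helper, the hard-coded id-to-tag mapping and the toy inputs are illustrative assumptions, and the first and last attended positions are skipped on the assumption that the sequence is laid out as `[CLS] ... [SEP]` followed by padding.

```python
ID_TO_TAG = {0: "B-ANIMAL", 1: "O"}  # assumed to match the fitted label encoder

def decode_tags(pred_ids, attention_mask, id_to_tag=ID_TO_TAG):
    """Map predicted class ids to tag names for real tokens only.

    Positions where attention_mask is 0 are padding and are dropped;
    the first and last attended positions ([CLS] and [SEP]) are skipped too.
    """
    attended = [p for p, m in zip(pred_ids, attention_mask) if m == 1]
    return [id_to_tag[p] for p in attended[1:-1]]

# Toy example: [CLS] a horse slept [SEP] [PAD] [PAD] [PAD]
pred_ids       = [1, 1, 0, 1, 1, 1, 1, 1]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
tags = decode_tags(pred_ids, attention_mask)
print(tags)
```

Decoding this way keeps both classes visible, so a model that predicts "B-ANIMAL" everywhere and one that predicts "O" everywhere can be told apart.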


To avoid retraining the model every time the program is launched, I save it to a file.

In [19]:
path='ner.h5'
model.save(path)



####Conclusions

In this task I learned to work with NLP, word tokenization and the BERT model in TensorFlow. I tried to build a token classifier to recognize animal names, but could not achieve a training result good enough to produce correct tags on validation sentences.