# NER Training Notebook
This notebook covers data preprocessing, model training, saving and loading weights, and prediction.

In [1]:
import json
import os
import sys
from sklearn.model_selection import train_test_split

In [2]:
project_root = os.path.abspath("..")  
sys.path.append(os.path.join(project_root, "models"))

from ner_model import NerModel



##### The dataset is generated by the script <code>datasets/text/generate_dataset.py</code>

In [3]:
with open("../datasets/text/ner_dataset.json", "r") as f:
    dataset = json.load(f)

print(dataset[:5])

[{'tokens': ['The', 'squirrel', 'is', 'looking', 'for', 'its', 'family.'], 'labels': ['O', 'B-SQUIRREL', 'O', 'O', 'O', 'O', 'O']}, {'tokens': ['There', 'was', 'a', 'chicken', 'near', 'the', 'river.'], 'labels': ['O', 'O', 'O', 'B-CHICKEN', 'O', 'O', 'O']}, {'tokens': ['I', 'saw', 'a', 'horse', 'at', 'the', 'animal', 'shelter.'], 'labels': ['O', 'O', 'O', 'B-HORSE', 'O', 'O', 'O', 'O']}, {'tokens': ['I', 'saw', 'a', 'cat', 'at', 'the', 'animal', 'shelter.'], 'labels': ['O', 'O', 'O', 'B-CAT', 'O', 'O', 'O', 'O']}, {'tokens': ['Did', 'you', 'hear', 'that?', 'It', 'was', 'a', 'wolf.'], 'labels': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}]
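
Each record pairs a token list with a parallel label list, where `O` marks a non-entity token and `B-<ANIMAL>` marks an animal mention. As a minimal sketch of reading the entities back out of one record (the helper name `extract_entities` is hypothetical and not part of this project):

```python
def extract_entities(sample):
    """Return (token, label) pairs for every token whose label is not 'O'."""
    return [
        (token, label)
        for token, label in zip(sample["tokens"], sample["labels"])
        if label != "O"
    ]


# First record from the dataset preview above.
sample = {
    "tokens": ["The", "squirrel", "is", "looking", "for", "its", "family."],
    "labels": ["O", "B-SQUIRREL", "O", "O", "O", "O", "O"],
}
print(extract_entities(sample))  # [('squirrel', 'B-SQUIRREL')]
```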


## Dataset splitting

In [4]:
train_data, temp_data = train_test_split(dataset, test_size=0.3, random_state=42)
valid_data, test_data = train_test_split(temp_data, test_size=0.33, random_state=42)

print(f"Train: {len(train_data)}, Validation: {len(valid_data)}, Test: {len(test_data)}")

Train: 2100, Validation: 603, Test: 297
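
The two chained splits give roughly a 70/20/10 division: 30% is held out first, and 33% of that held-out part becomes the test set. The arithmetic can be sketched as follows, assuming the held-out size of each split is rounded up, which is consistent with the counts printed above. The function `split_sizes` is a hypothetical helper, and the total of 3000 is simply the sum of the printed sizes.

```python
import math


def split_sizes(n_samples, first_test_size=0.3, second_test_size=0.33):
    """Sizes of train/validation/test after two chained splits.

    Assumption: the held-out part of each split is rounded up.
    """
    n_temp = math.ceil(n_samples * first_test_size)   # held out by the first split
    n_train = n_samples - n_temp
    n_test = math.ceil(n_temp * second_test_size)     # held out by the second split
    n_valid = n_temp - n_test
    return n_train, n_valid, n_test


print(split_sizes(3000))  # (2100, 603, 297)
```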


In [5]:
test_data[0]

{'tokens': ['Look', 'at', 'how', 'fast', 'the', 'butterfly', 'can', 'run!'],
 'labels': ['O', 'O', 'O', 'O', 'O', 'B-BUTTERFLY', 'O', 'O']}

## Preparing data for training

In [6]:
x_train = [sample["tokens"] for sample in train_data]
y_train = [sample["labels"] for sample in train_data]

x_val = [sample["tokens"] for sample in valid_data]
y_val = [sample["labels"] for sample in valid_data]
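
Token and label lists must stay aligned one-to-one, and a token classifier ultimately works with integer class ids rather than label strings. The sketch below shows one way to check alignment and build a label-to-id mapping. Both helpers are hypothetical; `NerModel` may well handle label encoding internally, so treat this as an illustration rather than a required step.

```python
def check_alignment(samples):
    """Raise ValueError if any sample has mismatched token/label lengths."""
    for i, sample in enumerate(samples):
        if len(sample["tokens"]) != len(sample["labels"]):
            raise ValueError(f"sample {i}: token/label length mismatch")


def build_label_map(samples):
    """Map each distinct label to an integer id, with 'O' fixed at 0."""
    entity_labels = sorted(
        {label for s in samples for label in s["labels"]} - {"O"}
    )
    label_map = {"O": 0}
    for i, label in enumerate(entity_labels, start=1):
        label_map[label] = i
    return label_map


# Two records in the same shape as the dataset entries.
demo = [
    {"tokens": ["I", "saw", "a", "horse"], "labels": ["O", "O", "O", "B-HORSE"]},
    {"tokens": ["This", "is", "my", "cat"], "labels": ["O", "O", "O", "B-CAT"]},
]
check_alignment(demo)
print(build_label_map(demo))  # {'O': 0, 'B-CAT': 1, 'B-HORSE': 2}
```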


## NER Model

In [7]:
ner_model = NerModel()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able t

### TRAINING and SAVING WEIGHTS
To train the model yourself, run the cells in this section.

In [None]:
ner_model.train(x_train, y_train, x_val, y_val, epochs=10)

In [None]:
ner_model.save_weights("../src/weights/new_ner_model.h5")

### LOADING WEIGHTS

These weights were saved from a model trained with the parameters <code>epochs=10, batch_size=32</code>.
The final training metrics were:<br><i>loss: 0.9633 - accuracy: 0.9383<br>val_loss: 1.0828 - val_accuracy: 0.9028</i>

In [8]:
ner_model.load_weights("../src/weights/ner_model.h5")

With the loaded weights, the model identifies animal names in the example sentences below. For a sentence with no animals it returns an empty set, and for a sentence that mentions several animals it returns a label for each of them, even when nearby words are misspelled.

In [9]:
ner_model.predict("This is my cat")

{'B-CAT'}

In [12]:
ner_model.predict("What will you say if do not want talk about that?")

set()

In [13]:
ner_model.predict("Can you imageine I was an elephant and a balck cat today!")

{'B-CAT', 'B-ELEPHANT'}
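
The outputs above show `predict` returning a set of label strings. If downstream code needs plain animal names, the labels can be post-processed as in this sketch. The helper `labels_to_animals` is hypothetical, and it assumes every entity label has the form `B-<ANIMAL>` as in the outputs shown.

```python
def labels_to_animals(predicted_labels):
    """Convert labels such as 'B-CAT' to a sorted list of lowercase names."""
    return sorted(
        label.split("-", 1)[1].lower()
        for label in predicted_labels
        if label != "O" and "-" in label
    )


print(labels_to_animals({"B-CAT", "B-ELEPHANT"}))  # ['cat', 'elephant']
print(labels_to_animals(set()))                    # []
```

Sorting the result makes the output deterministic, since sets have no guaranteed iteration order.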