# Model Validation

In this notebook, we will:

1. Load the trained NER model and tokenizer from the model directory (`/content/drive/MyDrive/arner` on the mounted Drive).
2. Load and preprocess the validation data from `validation.conll` in the same directory.
3. Run predictions on the validation set.
4. Evaluate the model's performance using standard NER metrics.

Let's get started!

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install transformers seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=3df1324aff632f02c5fbc62398a623017fa957e46d200e781d071cad1aad592f
  Stored in directory: /root/.cache/pip/wheels/5f/b8/73/0b2c1a76b701a677653dd79ece07cfabd7457989dbfbdcd8d7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [4]:
# Import required libraries
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics import classification_report, f1_score, accuracy_score
from typing import List

In [5]:
# Load the trained model and tokenizer
model_dir = '/content/drive/MyDrive/arner'  # path to the model directory
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(model_dir)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")
model.eval()  # set model to evaluation mode

Using device: cuda


XLMRobertaForTokenClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768

In [7]:
# Get label mappings from the model config
label2id = model.config.label2id  # label name -> id
id2label = model.config.id2label  # id (as int or str) -> label name
num_labels = len(label2id)
print(f"Number of labels: {num_labels}")

Number of labels: 35


In [31]:
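def ReadConll(filename):
    import pandas as pd
    # One token per line ("<word> <label>"); blank lines separate sentences
    df = pd.read_csv(filename,
                     sep=' ', header=None, keep_default_na=False,
                     names=['words', 'labels', 'blank'],
                     quoting=3,  # csv.QUOTE_NONE: treat quote characters as plain text
                     skip_blank_lines=False,
                     encoding='utf8')
    # Drop header/comment lines that start with '#'
    # (note: this also drops any real token that starts with '#')
    df = df[~df['words'].astype(str).str.startswith('#')].copy()
    # Every blank line starts a new sentence, so a running count of blank lines is the sentence id
    df['sentence_id'] = (df.words == '').cumsum()
    print(df[df.words != ''])
    return df[df.words != '']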

def ClsReportNerModel(test_conll_path, tokenizer, model, device):
    import warnings
    # Alias the two reports so the second import does not shadow the first
    from seqeval.metrics import classification_report as seqeval_report
    from sklearn.metrics import classification_report as sklearn_report
    warnings.filterwarnings('ignore')

    test = ReadConll(test_conll_path)
    sents_tokens_list, truth_list = [], []
    model = model.to(device)
    # Collect the words and gold labels of each sentence
    for _, sentence in test.groupby('sentence_id'):
        sents_tokens_list.append(list(sentence.words))
        truth_list.append(list(sentence.labels))
    tokens, preds, truths = [], [], []
    for sentence_idx, sent_token_list in enumerate(sents_tokens_list):
        print(f"Processing sentence {sentence_idx+1}/{len(sents_tokens_list)}")

        model_inputs = tokenizer(sent_token_list, is_split_into_words=True, truncation=True,
                                 padding=False, max_length=256, return_tensors="pt").to(device)
        # word_ids maps each sub-token to the index of the word in sent_token_list it belongs to
        word_ids = model_inputs.word_ids()
        # example: word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14, 15, None]
        with torch.no_grad():  # inference only, so no gradients are needed
            outputs = model(**model_inputs)
        predictions = outputs.logits.argmax(dim=-1).tolist()[0]
        idx = 1  # start after the leading special token (word id None)
        while idx < len(word_ids) - 1:  # -1 so the trailing None is not processed
            word_id1 = word_ids[idx]
            word_id2 = word_ids[idx + 1]
            # The label of a word is the prediction for its first sub-token
            label = model.config.id2label[predictions[idx]]
            if word_id1 == word_id2:
                # Move idx to the last sub-token of the current word
                while word_id1 == word_ids[idx]:
                    idx += 1
                idx -= 1

            token = sent_token_list[word_ids[idx]]
            truth = truth_list[sentence_idx][word_ids[idx]]
            tokens.append(token)
            preds.append(label)
            truths.append(truth)
            idx += 1
    # Entity-level report (seqeval, strict mode)
    print(seqeval_report([truths], [preds], digits=4, mode='strict'))
    # Token-level report (scikit-learn)
    print(sklearn_report(truths, preds, digits=4))

In [33]:
ClsReportNerModel("/content/drive/MyDrive/arner/validation.conll", tokenizer, model, device)


                      words       labels blank  sentence_id
1                        Ma            O                  0
2                      mère            O                  0
3                    Astrit  B-GIVENNAME                  0
4                      Nani            O                  0
5                      Kofi    B-SURNAME                  0
...                     ...          ...   ...          ...
1644052                 876            O              82930
1644053                  19            O              82930
1644054                  et            O              82930
1644055                 les            O              82930
1644056  31-80-113-166-197.     B-TAXNUM              82930

[1478169 rows x 4 columns]
                  precision    recall  f1-score   support

             AGE     0.9670    0.9870    0.9769      4217
     BUILDINGNUM     0.9791    0.9738    0.9764      7830
            CITY     0.9618    0.9715    0.9666     13260
CREDITCARDNUMBER   

## Validation Output Assessment

Based on the validation output, here's an assessment of the model's performance:

The output provides two classification reports: one using `seqeval` metrics (strict mode) and another using `sklearn` metrics.

**Seqeval Classification Report (Strict Mode):**

This report scores whole entity spans. It gives precision, recall, and F1-score for each entity type (for example `AGE`, with the `B-` and `I-` tags merged into a single row), along with micro, macro, and weighted averages. In strict mode, a predicted entity counts as correct only when both its boundaries and its type match the ground truth. The "O" tag does not get a row of its own.

*   **Overall Performance:** The micro, macro, and weighted averages are all strong, with F1-scores around 0.95. The model identifies and classifies most named entities correctly.
*   **Entity-Specific Performance:** Most entity types have precision, recall, and F1-scores above 0.90. Some entities, such as `AGE`, `DATE`, `EMAIL`, `TELEPHONENUM`, and `TIME`, have near-perfect scores.
*   **Lower Performing Entities:** `DRIVERLICENSENUM`, `SEX`, and `SOCIALNUM` have lower F1-scores (still mostly above 0.80). These entity types may be harder for the model to identify, possibly because of less training data or more varied patterns.

**Sklearn Classification Report:**

This report scores each token independently. It gives precision, recall, and F1-score for every unique tag in the dataset, so `B-AGE` and `I-AGE` appear as separate rows, and it also includes the "O" tag.

*   **Overall Performance:** As in the seqeval report, the overall averages are very high (around 0.99 for accuracy, and 0.95 for the macro and weighted averages of precision, recall, and F1-score).
*   **Entity-Specific Performance:** This report confirms the trends seen in the seqeval report, with most entity types scoring well. The entities with lower scores in the seqeval report (`DRIVERLICENSENUM`, `SEX`, and `SOCIALNUM`) also score lower here than the top-performing entities.
*   **"O" (Outside) Tag:** The "O" tag, which marks tokens that are not part of any named entity, has very high precision, recall, and F1-score (around 0.998). This is expected, since "O" is by far the most frequent tag.

**Overall Assessment:**

The model performs well on the validation set, with high F1-scores across most entity types and somewhat lower scores for a few. The slight differences between the seqeval and sklearn reports come from what each one scores: seqeval's strict mode counts an entity as correct only when its full span and type match, so a single boundary error costs the whole entity, while sklearn scores every token on its own.
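
To make the difference between span-level and token-level scoring concrete, here is a small pure-Python sketch on a made-up four-token sentence. The `extract_spans` helper is hypothetical and only approximates strict IOB2 matching; it is not seqeval's implementation. One boundary error leaves three of four tokens correct, yet the whole `CITY` entity is counted as missed.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (entity_type, start, end) spans from an IOB2 tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                spans.add((etype, start, i - 1))
            # Only a B- tag opens a new span; a stray I- tag is ignored
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans


# Toy data (illustrative only): the prediction cuts the CITY entity short
truth = ["B-CITY", "I-CITY", "O", "B-AGE"]
pred = ["B-CITY", "O", "O", "B-AGE"]

token_acc = sum(t == p for t, p in zip(truth, pred)) / len(truth)

true_spans, pred_spans = extract_spans(truth), extract_spans(pred)
correct = len(true_spans & pred_spans)  # spans matching in type and both boundaries
precision = correct / len(pred_spans)
recall = correct / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)

print(token_acc, f1)  # 0.75 0.5
```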

To further improve the model, you could consider:

*   Investigating the entity types with lower scores to understand why they are more challenging.
*   Potentially augmenting the training data for these lower-performing entity types.
*   Experimenting with different model architectures or hyperparameters.
*   Analyzing specific examples where the model made incorrect predictions to identify patterns in errors.
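
As a starting point for the error analysis suggested above, the sketch below counts the most frequent (true label, predicted label) mismatches. The `top_confusions` helper and the toy label lists are hypothetical. In this notebook the `truths` and `preds` lists are local to `ClsReportNerModel`, so that function would need to return them before they could be used this way.

```python
from collections import Counter
from typing import List, Tuple


def top_confusions(truths: List[str], preds: List[str],
                   n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most common (true label, predicted label) pairs that disagree."""
    errors = Counter((t, p) for t, p in zip(truths, preds) if t != p)
    return errors.most_common(n)


# Toy data (illustrative only, not taken from the validation run)
truths = ["B-SEX", "O", "B-SOCIALNUM", "B-SEX", "O"]
preds = ["O", "O", "B-TAXNUM", "O", "O"]

for (true_label, pred_label), count in top_confusions(truths, preds):
    print(f"{true_label} -> {pred_label}: {count}")
```

Pairs that show up often, such as one numeric identifier type being predicted as another, point to where extra training data or label review is likely to help most.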