# Off-the-Shelf NER Evaluation on Fontys Terminology
I have completed the quick evaluation that teacher Iman suggested, using 10 sentences containing Fontys ICT-specific terms.

**Results**
- spaCy transformer (en_core_web_trf) achieved the highest score: F1 ≈ 0.58
- spaCy small (en_core_web_sm): F1 ≈ 0.53
- BERT NER (dslim/bert-base-NER): F1 ≈ 0.19
- Flair (OntoNotes Large): F1 ≈ 0.19

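The low performance is primarily because pre-trained models only recognize standard categories (PERSON, ORG, LOC, etc.), while Fontys terminology requires 7 custom entity types (EVENT, PROGRAM, MINOR, COURSE, TOOL, SYSTEM, BUILDING) that do not exist in their training data.

The 0.19 scores for BERT and Flair are especially telling: their predefined categories barely map onto the Fontys entity types. These two scores are also not directly comparable with the spaCy ones, because the spaCy predictions are scored against collapsed `FONTYS` tags, whereas the BERT and Flair predictions are scored against the typed gold labels, where a prediction counts only if its type also matches the gold type.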


**Observations**
- Off-the-shelf models can detect some terms (InnovationLab, FHICT, Canvas, SpeedGrader...), but still miss many internal terms (student+, OIL, course codes, minor names...).
- An F1 of 0.53–0.58 remains well below the practical requirement (>0.85–0.90).

**Improvement**

I intend to build a larger dataset of more than 200 sentences and fine-tune a specialized NER model for Fontys ICT to achieve higher performance.
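
For the planned dataset, the whitespace-token BIO tags used in this notebook would probably need to be converted into character-offset spans, which is the annotation shape that spaCy (among others) accepts for training examples. The following is a minimal sketch under that assumption: `bio_to_char_spans` is a hypothetical helper that is not part of the notebook, and the rule for trimming trailing sentence punctuation is a simplification.

```python
def bio_to_char_spans(sentence, tags):
    """Convert whitespace-token BIO tags into (start_char, end_char, label) spans."""
    spans, cursor, current = [], 0, None
    for token, tag in zip(sentence.split(), tags):
        start = sentence.index(token, cursor)
        cursor = start + len(token)
        # Trim sentence punctuation ("Engineering.") but keep characters such as "+".
        end = start + len(token.rstrip(".,;:!?"))
        if tag.startswith("B-"):
            current = [start, end, tag[2:]]
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1] = end  # extend the span that is currently open
        else:
            current = None
    return [tuple(span) for span in spans]


sentence = "FHICT students can choose the minor AI Engineering."
tags = ["B-ORG", "O", "O", "O", "O", "O", "B-MINOR", "I-MINOR"]
for start, end, label in bio_to_char_spans(sentence, tags):
    print(label, repr(sentence[start:end]))
```

Working with character offsets keeps the annotations independent of any one tokenizer, which matters once several models with different tokenization are compared.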

In [None]:
!pip install spacy transformers flair seqeval datasets
!python -m spacy download en_core_web_trf
!pip install https://github.com/flairNLP/flair/releases/download/v0.13.1/flair-0.13.1-py3-none-any.whl

In [None]:
import spacy
from transformers import pipeline
import flair.models
from flair.data import Sentence
from flair.models import SequenceTagger
from datasets import load_dataset
import torch
from seqeval.metrics import classification_report, f1_score
import pandas as pd


In [None]:
sentences = [
    "Last week we had the OIL introduction for all new student+ members.",
    "The InnovationLab will be open during the entire OIL week.",
    "FHICT students can choose the minor AI Engineering.",
    "Please upload your project to Canvas before the SEM1A deadline.",
    "Have you registered for the Business IT & Management minor?",
    "The SpeedGrader tool in Canvas is really helpful.",
    "Student+ is the new onboarding platform at Fontys.",
    "Check the Rubrics and Outcomes for ININ1A on Canvas.",
    "The InnovationLab is located in the TQ building.",
    "OIL stands for Orientation Introduction Learning."
]

gold_tags = [
    ["O","O","O","O","B-EVENT","O","O","O","O","O","B-PROGRAM","O"],           # 12 (B-EVENT sits on "the"; "OIL" is the next token)
    ["O","B-ORG","O", "O","O","O","O","O","B-EVENT","O"],                       # 10
    ["B-ORG","O","O","O","O","O","B-MINOR","I-MINOR"],                     # 8
    ["O","O","O","O","O","B-SYSTEM","O","O","B-COURSE","O"],               # 10
    ["O","O","O","O","O","B-MINOR","I-MINOR","I-MINOR","I-MINOR","O"],     # 10
    ["O","B-TOOL","O","O","B-SYSTEM","O","O","O"],                         # 8
    ["B-PROGRAM","O","O","O","O","O","O","B-ORG"],                         # 8
    ["O","O","B-TOOL","O","B-TOOL","O","B-COURSE","O","B-SYSTEM"],         # 9
    ["O","B-ORG","O","O","O","O","B-BUILDING","O"],                        # 8
    ["B-EVENT","O","O","B-EVENT","I-EVENT","I-EVENT"]                      # 6
]
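
Because the gold tags are written by hand against whitespace tokens, a quick alignment check can catch off-by-one labels before any scoring. The sketch below is only an illustration: `check_alignment` is a hypothetical helper, and the data is a two-sentence subset of the lists above.

```python
# Two sentences copied from the lists above, used here as an illustrative subset.
sentences = [
    "Last week we had the OIL introduction for all new student+ members.",
    "The InnovationLab is located in the TQ building.",
]
gold_tags = [
    ["O", "O", "O", "O", "B-EVENT", "O", "O", "O", "O", "O", "B-PROGRAM", "O"],
    ["O", "B-ORG", "O", "O", "O", "O", "B-BUILDING", "O"],
]


def check_alignment(sentences, tags):
    """Return (sentence_index, token, tag) for every non-O tag; raise on a length mismatch."""
    labelled = []
    for idx, (sent, sent_tags) in enumerate(zip(sentences, tags)):
        tokens = sent.split()
        if len(tokens) != len(sent_tags):
            raise ValueError(f"sentence {idx}: {len(tokens)} tokens but {len(sent_tags)} tags")
        labelled.extend((idx, tok, tag) for tok, tag in zip(tokens, sent_tags) if tag != "O")
    return labelled


for idx, token, tag in check_alignment(sentences, gold_tags):
    print(idx, token, tag)
```

On the first sentence as listed, this pairs `B-EVENT` with "the" rather than "OIL", which suggests that tag sits one position too early.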

# Try to run off-the-shelf NER models
1. spaCy small
2. spaCy transformer
3. dslim/bert-base-NER
4. Flair NER


In [None]:
def fontys(y_true, y_pred):
    # Thin wrapper around seqeval's entity-level F1
    return f1_score(y_true, y_pred)

results = []

# The pre-trained models do not know the Fontys entity types, so simplify_tags
# collapses every B-*/I-* tag into a single FONTYS type (B-FONTYS / I-FONTYS)
def simplify_tags(tag_list):
    new = []
    for tags in tag_list:
        new_tags = ["B-FONTYS" if t.startswith("B-") else "I-FONTYS" if t.startswith("I-") else "O" for t in tags]
        new.append(new_tags)
    return new

gold_simple = simplify_tags(gold_tags)
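
seqeval's `f1_score` is entity-level: a predicted span counts only when both its boundaries and its type match the gold span. The sketch below is a simplified pure-Python stand-in for that idea, not seqeval's actual implementation, and shows how collapsing types changes the score. `extract_spans`, `entity_f1`, and `collapse` are hypothetical helpers, and the prediction is invented for illustration.

```python
def extract_spans(tags):
    """Collect (start, end, type) spans from one BIO sequence (end is exclusive)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    # Simplification: an I- tag with no preceding B- of the same type is ignored.
    return set(spans)


def entity_f1(gold, pred):
    """Micro-averaged F1 over exact span-and-type matches."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gs, ps = extract_spans(g), extract_spans(p)
        tp += len(gs & ps)
        n_gold += len(gs)
        n_pred += len(ps)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def collapse(tags, label="FONTYS"):
    """Replace every entity type with one shared label, keeping the B-/I- prefix."""
    return [t if t == "O" else t[:2] + label for t in tags]


# "The SpeedGrader tool in Canvas is really helpful."
gold = [["O", "B-TOOL", "O", "O", "B-SYSTEM", "O", "O", "O"]]
pred = [["O", "B-ORG", "O", "O", "B-ORG", "O", "O", "O"]]  # hypothetical model output

typed = entity_f1(gold, pred)
collapsed = entity_f1([collapse(g) for g in gold], [collapse(p) for p in pred])
print(typed, collapsed)
```

With typed labels, both predictions count as misses even though the boundaries are right; after collapsing, both count as hits. Scores computed under the two schemes therefore measure different things.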


In [None]:
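def spacy_predict(model_name, sentences):
    """Tag each sentence with B-/I-FONTYS wherever spaCy finds a candidate entity."""
    nlp = spacy.load(model_name)
    preds = []
    for sent in sentences:
        doc = nlp(sent)
        # One tag per whitespace token, to match the gold annotations
        tags = ["O"] * len(sent.split())
        for ent in doc.ents:
            if ent.label_ in ["ORG","PRODUCT","EVENT","FAC","WORK_OF_ART","NORP","GPE"]:
                try:
                    # Caveat: start_token indexes spaCy tokens, which can outnumber whitespace
                    # tokens (punctuation is split off), so positions may drift or overflow
                    start_token = next(i for i, token in enumerate(doc) if token.idx >= ent.start_char)
                    tags[start_token] = "B-FONTYS"
                    for i in range(start_token+1, len(tags)):
                        if doc[i].idx < ent.end_char:
                            tags[i] = "I-FONTYS"
                        else:
                            break
                except (StopIteration, IndexError):
                    # No matching token, or an index beyond the whitespace token list
                    continue
        preds.append(tags)
    return preds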
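
The function above indexes the whitespace-token tag list with spaCy token positions, and the two tokenizations can disagree. One possible alternative is to align on character offsets instead. The sketch below assumes the entity spans are available as `(start_char, end_char)` pairs, such as `ent.start_char` and `ent.end_char` from a spaCy `Doc`. `spans_to_bio` is a hypothetical helper, not part of the notebook.

```python
def spans_to_bio(sentence, char_spans, label="FONTYS"):
    """Map (start_char, end_char) entity spans onto whitespace tokens as BIO tags."""
    # Record each whitespace token's character range in the original string.
    offsets, cursor = [], 0
    for token in sentence.split():
        start = sentence.index(token, cursor)
        cursor = start + len(token)
        offsets.append((start, cursor))

    tags = ["O"] * len(offsets)
    for span_start, span_end in char_spans:
        # A token belongs to the span if the two character ranges overlap.
        covered = [i for i, (s, e) in enumerate(offsets) if s < span_end and e > span_start]
        for position, i in enumerate(covered):
            tags[i] = ("B-" if position == 0 else "I-") + label
    return tags


sentence = "The InnovationLab is located in the TQ building."
# Hypothetical entity spans, built here from substring positions for illustration.
spans = [(sentence.index(t), sentence.index(t) + len(t)) for t in ("InnovationLab", "TQ")]
print(spans_to_bio(sentence, spans))
```

Because a token is tagged whenever its character range overlaps the entity, trailing punctuation glued to a whitespace token (as in "building.") no longer shifts the tag positions.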

## spaCy Small (en_core_web_sm)
A lightweight, efficient NER model that uses statistical methods (CNN) rather than transformers. It's faster and requires less memory than the transformer version, making it suitable for real-time applications. 

**Strengths**: Fast processing, lower resource requirements

**Limitations**: Lower accuracy than transformer models, less context understanding

In [None]:
results = []

# 1. spaCy small
print("Running spaCy small...")
pred = spacy_predict("en_core_web_sm", sentences)
f1 = f1_score(gold_simple, pred)
results.append({"Model": "spaCy en_core_web_sm", "F1 (Fontys terms)": round(f1, 3)})

Running spaCy small...


## spaCy Transformer (en_core_web_trf)
An NER model built on a transformer architecture (RoBERTa-base, a BERT variant). It uses deep learning with attention mechanisms to understand context and relationships between words. The "trf" (transformer) variant is the most accurate English spaCy pipeline but requires more computational resources. It is trained on general-domain written text and can recognize standard entity types like PERSON, ORG (organization), GPE (geopolitical entity), DATE, etc.

**Strengths**: High accuracy on general English text, understands context well

**Limitations**: Trained on general domain, not specialized for educational/institutional terminology

In [None]:
# 2. spaCy transformer
print("Running spaCy transformer...")
pred = spacy_predict("en_core_web_trf", sentences)
f1 = f1_score(gold_simple, pred)
results.append({"Model": "spaCy en_core_web_trf", "F1 (Fontys terms)": round(f1, 3)})

Running spaCy transformer...


## BERT NER (dslim/bert-base-NER)
A BERT-base model fine-tuned specifically for Named Entity Recognition by dslim (a Hugging Face community contributor). This model is trained on the CoNLL-2003 dataset and recognizes four standard entity types: PER (person), LOC (location), ORG (organization), and MISC (miscellaneous). It's a compact, pre-trained model readily available for immediate use without additional training.

**Strengths**: Easy to deploy, good baseline performance on standard entities, optimized for inference speed

**Limitations**: Limited to four entity types from CoNLL-2003, trained on news domain, doesn't recognize domain-specific or organizational terminology

In [None]:
# 3. BERT NER (dslim)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
pred3 = []
for sent in sentences:
    tokens = sent.split()
    tags = ["O"] * len(tokens)
    for item in ner(sent):
        word = item['word'].replace('##', '')
        if item['entity_group'] in ["ORG", "MISC"]:
            # Tag the first whitespace token that contains the predicted word
            idx = next((i for i, w in enumerate(tokens) if word in w), None)
            if idx is not None:
                tags[idx] = 'B-' + item['entity_group']
    pred3.append(tags)
# Scored against the typed gold tags (not gold_simple), so only B-ORG predictions can match
f1 = fontys(gold_tags, pred3)
results.append({"Model": "dslim/bert-base-NER", "F1 (Fontys terms)": round(f1, 3)})


## Flair OntoNotes
Flair is an NLP framework best known for its contextualized string embeddings. The OntoNotes model used here is trained on the OntoNotes 5.0 corpus, which includes 18 entity types (PERSON, ORG, GPE, DATE, TIME, MONEY, PERCENT, etc.). The large variant loaded below (`flair/ner-english-ontonotes-large`) builds on transformer embeddings (XLM-R) rather than the character-level string embeddings of the classic Flair taggers.

**Strengths**: Handles diverse entity types, good with morphological variations

**Limitations**: Trained on general text (news, web, conversations), not domain-specific terminology

In [None]:
# 4. Flair NER
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")
pred4 = []
for sent in sentences:
    sentence = Sentence(sent)
    tagger.predict(sentence)
    tags = ['O'] * len(sent.split())
    for entity in sentence.get_spans('ner'):
        if entity.tag in ["ORG", "PRODUCT", "EVENT"]:
            words_from_entity = entity.text.split()

            sentence_tokens_for_matching = sent.split()
            start_idx = -1
            for i, s_token in enumerate(sentence_tokens_for_matching):
                # Robustly match entity's first word with sentence token (handling punctuation difference)
                if words_from_entity[0] == s_token or words_from_entity[0] == s_token.strip('.,;!?"\''):
                    start_idx = i
                    break

            if start_idx != -1:
                tags[start_idx] = 'B-' + entity.tag
                for i in range(1, len(words_from_entity)):
                    if start_idx + i < len(tags):
                        tags[start_idx + i] = 'I-' + entity.tag
    pred4.append(tags)
# Scored against the typed gold tags (not gold_simple), so only ORG and EVENT spans can match
f1 = fontys(gold_tags, pred4)
results.append({"Model": "Flair OntoNotes Large", "F1 (Fontys terms)": round(f1, 3)})

2025-12-03 08:30:49,697 SequenceTagger predicts: Dictionary with 76 tags: <unk>, O, B-CARDINAL, E-CARDINAL, S-PERSON, S-CARDINAL, S-PRODUCT, B-PRODUCT, I-PRODUCT, E-PRODUCT, B-WORK_OF_ART, I-WORK_OF_ART, E-WORK_OF_ART, B-PERSON, E-PERSON, S-GPE, B-DATE, I-DATE, E-DATE, S-ORDINAL, S-LANGUAGE, I-PERSON, S-EVENT, S-DATE, B-QUANTITY, E-QUANTITY, S-TIME, B-TIME, I-TIME, E-TIME, B-GPE, E-GPE, S-ORG, I-GPE, S-NORP, B-FAC, I-FAC, E-FAC, B-NORP, E-NORP, S-PERCENT, B-ORG, E-ORG, B-LANGUAGE, E-LANGUAGE, I-CARDINAL, I-ORG, S-WORK_OF_ART, I-QUANTITY, B-MONEY


In [None]:
print(pd.DataFrame(results))

                   Model  F1 (Fontys terms)
0   spaCy en_core_web_sm              0.533
1  spaCy en_core_web_trf              0.581
2    dslim/bert-base-NER              0.187
3  Flair OntoNotes Large              0.187
