# Nya-Hoba NER — Colab Notebook (Lite Version)
---

**Named Entity Recognition for Low-Resource African Languages: A Transformer-Based Case Study on Nya-Hoba**

**Owner:** Chahyaandida Ishaya

---
Note: This notebook is optimized for Google Colab free tier. It uses a smaller model (`Davlan/afro-xlmr-mini`) and lighter settings for faster training.


## PROJECT OBJECTIVES
---

- **General Objective:** Design and evaluate a transformer-based NER system for Nya-Hoba.
- **Specific Objectives:**
  1. Collect, clean, and annotate a Nya-Hoba text corpus for NER tasks.
  2. Develop baseline NER models using traditional machine learning approaches for benchmarking.
  3. Fine-tune transformer-based models for NER on Nya-Hoba.
  4. Evaluate model performance using Precision, Recall, and F1-score.
  5. Release an open-source dataset and pre-trained models.
---

## 1. Setup
Run the code cell to install required packages. On Colab this may take a few minutes.

You may want to manually install a CUDA-compatible `torch` build if you intend to use GPU acceleration.


In [2]:
!pip install -q scikit-learn sklearn-crfsuite pandas joblib transformers datasets seqeval torch
import sklearn, pandas, joblib, transformers, torch
print('sklearn', sklearn.__version__)
print('pandas', pandas.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__, 'CUDA:', torch.cuda.is_available())

sklearn 1.6.1
pandas 2.2.2
transformers 4.56.1
torch 2.8.0+cu126 CUDA: False


## 2. Data Generation & Cleaning
This stage expands small seed lists of entity words (names, times, animals, and locations) into larger synthetic lists of 2,000 items per category, using controlled randomization and pattern-based augmentation.

In [3]:
import json
import random
import os

#Making directory for data
os.makedirs('/content/data', exist_ok=True)
print('Your Data is located in /content/data/dataset.conll')

# ---------------------------
# STEP 1. Fixed lists
# ---------------------------
time_words = [
    "Zǝkǝu", "Sakana", "ǝna", "ǝhna", "Pǝshinda", "Pishinda",
    "Fer pǝchingǝ zekeu", "Pǝr pǝchingǝ zekeu", "Pǝchi", "Hya"
]

animal_words = [
    "Kwa", "Mabǝlang", "Tǝga", "Gwanba", "Ha'l", "Dlǝgwam", "Thla",
    "Kǝtǝn", "Chiwar", "Lǝvari", "Mapǝla'u", "Litsa"
]

person_words_fixed = [
    "Chahyaandida", "Chabiya", "Hyellama", "Hyelnaya", "Wandiya", "Hyel", "Yesu",
    "Chataimada", "Chatramada", "Nanunaya", "Mapida", "Shimbal", "Chai",
    "Hyellachardati", "Hyellachardati", "Wamanyi", "Miyaninyi", "Miyakindahyelni", "Miyaninyi"
]

# ---------------------------
# STEP 2. Synthetic PERSON names
# ---------------------------
base_names = [
    "Abubakar", "Ibrahim", "Musa", "Usman", "Kabiru", "Bello", "Suleiman",
    "Ahmad", "Aliyu", "Shehu", "Aminu", "Habiba", "Fatima", "Aisha", "Zainab", "Hauwa",
    "Ruqayya", "Maryam", "Khadija", "Sa'adatu", "Yakubu", "Ismaila", "Nasiru", "Idris",
    "John", "Paul", "Peter", "James", "Joseph", "Stephen", "Samuel",
    "David", "Daniel", "Thomas", "Andrew", "Philip", "Simon", "Nathaniel",
    "Grace", "Joyce", "Ruth", "Esther", "Naomi", "Sarah", "Deborah",
    "Ndyako", "Pwakina", "Gargam", "Kwada", "Tizhe", "Lazarus", "Kwapre",
    "Nzoka", "Jauro", "Birma", "Fwa", "Tumba", "Dlama", "Nuhu", "Zira", "Bitrus",
    "Vandi", "Nggada", "Gimba", "Danjuma"
]

prefixes = ["Alhaji", "Malam", "Doctor", "Pastor", "Chief", "Prince", "Princess", "Rev"]
suffixes = ["Abubakar", "Musa", "Ibrahim", "Aliyu", "Yakubu", "Bitrus", "Danjuma", "Zira", "Vandi", "Nuhu"]
syllables = ["Nga", "Fwa", "Tiz", "Lam", "Bok", "Ngu", "Pwa", "Kiri", "Shaf", "Loru", "Baga", "Dla", "Hoba", "Zar", "Yam", "Kwada"]

def make_variants(base_list, prefixes, suffixes, syllables, target=2000, max_attempts=20000):
    items = set(base_list)
    attempts = 0
    while len(items) < target and attempts < max_attempts:
        r = random.random()
        if r < 0.3 and prefixes:
            new = random.choice(prefixes) + " " + random.choice(base_list)
        elif r < 0.6 and suffixes:
            new = random.choice(base_list) + " " + random.choice(suffixes)
        elif r < 0.8 and syllables:
            new = random.choice(syllables) + random.choice(syllables)
        else:
            new = random.choice(base_list) + " " + random.choice(base_list)
        items.add(new)
        attempts += 1

    # Fill with duplicates if still short
    items = list(items)
    while len(items) < target:
        items.append(random.choice(items))
    return items[:target]

random.seed(2025)
all_person_names = make_variants(base_names + person_words_fixed, prefixes, suffixes, syllables, 2000)

# ---------------------------
# STEP 2b. Expand TIME and ANIMAL with variants to 2000
# ---------------------------
time_prefixes = ["Early", "Late", "Mid", "Pre", "Post"]
time_suffixes = ["time", "hour", "day", "night", "season"]
time_syllables = ["Zi", "Sa", "Na", "Ku", "Lo", "Mi", "Ta"]

animal_prefixes = ["Wild", "Big", "Little", "Young", "Old"]
animal_suffixes = ["beast", "cub", "ling", "hunter", "creature"]
animal_syllables = ["Ka", "Mo", "La", "Ti", "Ro", "Zu", "Ba"]

all_time_words = make_variants(time_words, time_prefixes, time_suffixes, time_syllables, 2000)
all_animal_words = make_variants(animal_words, animal_prefixes, animal_suffixes, animal_syllables, 2000)

# ---------------------------
# STEP 3. Location generator
# ---------------------------
base_places = [
    "Yola", "Jimeta", "Numan", "Ganye", "Gombi", "Hong", "Mubi", "Michika", "Madagali",
    "Maiha", "Fufore", "Song", "Demsa", "Guyuk", "Jada", "Lamurde", "Mayo-Belwa",
    "Shelleng", "Toungo", "Pella", "Uba", "Dirma", "Holma", "Kala'a", "Garkida",
    "Borrong", "Mayo-Lope", "Shuwa", "Mayo-Balewa", "River Benue", "Mayo Ine",
    "Mayo Nguli", "Mayo Sanzu", "Kiri Dam", "Mandara Mountains", "Zumo Hill", "Fali Hills"
]

prefixes_loc = ["New", "Old", "Upper", "Lower", "North", "South", "East", "West", "Mayo", "Wuro", "Gidan", "Bari"]
suffixes_loc = ["Gari", "Ward", "Hill", "Village", "Settlement", "Bridge", "Camp", "Market", "River", "Valley", "Peak", "Forest", "Reserve", "Dam"]
syllables_loc = ["Kwa", "Ngu", "Mayo", "Zar", "Kiri", "Wuro", "Tula", "Nguwa", "Ganye", "Song", "Lam", "Mubi", "Pella", "Hoba", "Beli", "Tambo", "Shaf", "Loru", "Baga", "Zumo"]

all_places = make_variants(base_places, prefixes_loc, suffixes_loc, syllables_loc, 2000)

# ---------------------------
# STEP 4. Annotation helper
# ---------------------------
def make_annotation(word, label):
    return {
        "data": {"text": word},
        "annotations": [{
            "result": [{
                "value": {
                    "start": 0,
                    "end": len(word),
                    "text": word,
                    "labels": [label]
                },
                "from_name": "label",
                "to_name": "text",
                "type": "labels"
            }]
        }]
    }

# Build datasets
time_tasks = [make_annotation(w, "TIME") for w in all_time_words]          # expanded 2000
animal_tasks = [make_annotation(w, "ANIMAL") for w in all_animal_words]    # expanded 2000
person_tasks = [make_annotation(w, "PERSON") for w in all_person_names]    # expanded 2000
location_tasks = [make_annotation(loc, "LOCATION") for loc in all_places]  # expanded 2000

# ---------------------------
# STEP 5. Merge datasets
# ---------------------------
merged = time_tasks + animal_tasks + person_tasks + location_tasks

with open("/content/data/merged_dataset.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, indent=2, ensure_ascii=False)

print(f"✅ Saved {len(merged)} tasks -> merged_dataset.json")
print(f"  TIME: {len(time_tasks)}")
print(f"  ANIMAL: {len(animal_tasks)}")
print(f"  PERSON: {len(person_tasks)}")
print(f"  LOCATION: {len(location_tasks)}")


Your Data is located in /content/data/dataset.conll
✅ Saved 8000 tasks -> merged_dataset.json
  TIME: 2000
  ANIMAL: 2000
  PERSON: 2000
  LOCATION: 2000
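
Because `make_variants` pads its result with duplicates once it runs out of unique combinations, it can help to check how many distinct strings a seed list can yield at most. The sketch below computes an upper bound from the four generation patterns used in `make_variants`. The helper name `max_unique_variants` is an assumption introduced here, and the placeholder lists only mirror the sizes of the TIME lists above (10 seed words, 5 prefixes, 5 suffixes, 7 syllables).

```python
def max_unique_variants(base, prefixes, suffixes, syllables):
    """Upper bound on the distinct strings make_variants can produce.

    Each generation pattern is counted separately; any overlap between
    patterns would only lower the true number.
    """
    n_base = len(set(base))
    return (
        n_base                           # the seed words themselves
        + len(set(prefixes)) * n_base    # "prefix base"
        + n_base * len(set(suffixes))    # "base suffix"
        + len(set(syllables)) ** 2       # "syllable+syllable"
        + n_base ** 2                    # "base base"
    )

# Placeholder lists with the same sizes as the TIME lists
time_bound = max_unique_variants(
    base=[f"t{i}" for i in range(10)],
    prefixes=list("abcde"),
    suffixes=list("vwxyz"),
    syllables=list("1234567"),
)
print(time_bound)  # 259
```

When the bound is below `target`, the rest of the list consists of repeated entries, which is worth keeping in mind when the data is later split into training and test sets.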


## 3. Data Annotation
The generated items were annotated with entity labels and merged into a single dataset of 8,000 labeled items in the previous step. This cell converts those character-span annotations into token-level BIO tags and writes them to `/content/data/dataset.conll` in CoNLL format.

In [4]:
# Load your merged dataset
with open("/content/data/merged_dataset.json", "r", encoding="utf-8") as f:
    data = json.load(f)

conll_lines = []

for task in data:
    text = task["data"]["text"]
    anns = task["annotations"][0]["result"] if task["annotations"] else []

    # Start with "O" for each token
    tokens = text.split()
    labels = ["O"] * len(tokens)

    for ann in anns:
        value = ann["value"]
        start = value["start"]
        end = value["end"]
        label = value["labels"][0]

        # Find which tokens overlap this annotation's character span
        covered = []
        search_from = 0
        for i, tok in enumerate(tokens):
            # Locate the token in the text so offsets stay correct
            # even when tokens are separated by more than one space
            token_start = text.index(tok, search_from)
            token_end = token_start + len(tok)
            if token_end > start and token_start < end:
                covered.append(i)
            search_from = token_end

        # Assign BIO tags
        for j, idx in enumerate(covered):
            if j == 0:
                labels[idx] = "B-" + label
            else:
                labels[idx] = "I-" + label

    # Append tokens with tags
    for tok, lab in zip(tokens, labels):
        conll_lines.append(f"{tok} {lab}")
    conll_lines.append("")  # Sentence boundary

# Save to file
with open("/content/data/dataset.conll", "w", encoding="utf-8") as f:
    f.write("\n".join(conll_lines))

print("✅ Exported to dataset.conll in CoNLL format")


✅ Exported to dataset.conll in CoNLL format
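
To make the span-to-tag conversion concrete, here is a minimal, self-contained sketch of the same idea applied to a single string. The function name `span_to_bio` and the example text are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def span_to_bio(text, start, end, label):
    """Tag whitespace tokens that overlap the character span [start, end)."""
    tokens, tags = [], []
    inside = False  # becomes True once the first covered token is seen
    for m in re.finditer(r"\S+", text):
        tokens.append(m.group())
        if m.end() > start and m.start() < end:
            tags.append(("I-" if inside else "B-") + label)
            inside = True
        else:
            tags.append("O")
    return tokens, tags

# The span 0..13 covers "Pastor Bitrus" but not "Yola"
tokens, tags = span_to_bio("Pastor Bitrus Yola", 0, 13, "PERSON")
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

The first covered token receives the `B-` prefix and every later covered token receives `I-`, which matches the tagging rule in the cell above.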


## 4. Parse CoNLL & Prepare JSONL
This cell parses the CoNLL file (token per line, tag in last column) and saves a JSONL to `/content/prepared/data.jsonl`.

In [5]:
from pathlib import Path
import json
import os  # used below for os.makedirs

conll_path = Path('/content/data/dataset.conll')
if not conll_path.exists():
    raise FileNotFoundError('Upload dataset.conll to /content/data/ first.')

def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
        # Flush the last sentence if the file has no trailing blank line
        if tokens:
            sentences.append((tokens, tags))
    return sentences

sentences = read_conll(conll_path)
print('Sentences:', len(sentences), 'Tokens:', sum(len(s[0]) for s in sentences))

os.makedirs('/content/prepared', exist_ok=True)
with open('/content/prepared/data.jsonl', 'w', encoding='utf-8') as f:
    for tks, tgs in sentences:
        f.write(json.dumps({'tokens': tks, 'tags': tgs}) + '\n')
print('Saved prepared data to /content/prepared/data.jsonl')

Sentences: 8000 Tokens: 15735
Saved prepared data to /content/prepared/data.jsonl
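
Before training, a quick sanity check on the parsed tags can catch malformed sequences, such as an `I-` tag that does not continue an entity of the same type. The sketch below shows one way to do this; `invalid_bio_positions` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def invalid_bio_positions(tags):
    """Return indices where an I- tag does not continue the previous entity."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            same_type = prev[2:] == tag[2:]
            if prev == "O" or not same_type:
                bad.append(i)
        prev = tag
    return bad

print(invalid_bio_positions(["B-TIME", "I-TIME", "O"]))    # []
print(invalid_bio_positions(["O", "I-TIME", "I-ANIMAL"]))  # [1, 2]
```

In the notebook it could be applied to each `tags` list returned by `read_conll`.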


## 5. Sample annotated sentences
Print the first 10 annotated sentences as `token/tag` pairs.

In [6]:
# print first 10 samples
import json, itertools
with open('/content/prepared/data.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(itertools.islice(f, 10), 1):
        d = json.loads(line)
        print(i, '->', ' '.join([f"{t}/{tg}" for t,tg in zip(d['tokens'], d['tags'])]))

1 -> ǝhna/B-TIME Hya/I-TIME
2 -> Pǝshinda/B-TIME Pǝr/I-TIME pǝchingǝ/I-TIME zekeu/I-TIME
3 -> KuMi/B-TIME
4 -> Pǝchi/B-TIME Pishinda/I-TIME
5 -> Mid/B-TIME Zǝkǝu/I-TIME
6 -> Pǝr/B-TIME pǝchingǝ/I-TIME zekeu/I-TIME Sakana/I-TIME
7 -> Early/B-TIME Pǝchi/I-TIME
8 -> Sakana/B-TIME Zǝkǝu/I-TIME
9 -> Zǝkǝu/B-TIME night/I-TIME
10 -> Hya/B-TIME Pǝchi/I-TIME


## 6. Baseline: CRF Model
Train a CRF baseline using `sklearn-crfsuite`. This step is fast and useful for benchmarking.
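
The generator in Section 2 pads each list with duplicates when unique variants run out, so a purely random split can place the same item in both the training and test sets. As an optional alternative, the sketch below splits on unique token sequences instead. The helper `split_unique` and the toy data are assumptions for illustration and are not used by the cell that follows.

```python
import random

def split_unique(sentences, test_size=0.2, seed=42):
    """Split (tokens, tags) pairs so identical token sequences stay on one side."""
    groups = {}
    for tokens, tags in sentences:
        groups.setdefault(tuple(tokens), []).append((tokens, tags))
    keys = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_size))
    test_keys = set(keys[:n_test])
    train = [ex for k in keys if k not in test_keys for ex in groups[k]]
    test = [ex for k in keys if k in test_keys for ex in groups[k]]
    return train, test

# Toy data: "Yola" appears three times, the other items once
toy = [(["Yola"], ["B-LOCATION"])] * 3 + [(["Musa"], ["B-PERSON"]), (["Kwa"], ["B-ANIMAL"])]
train, test = split_unique(toy)
overlap = {tuple(t) for t, _ in train} & {tuple(t) for t, _ in test}
print(len(train), len(test), overlap)
```

With this kind of split, every repeated item lands entirely in either the training or the test portion, so `overlap` is always empty.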


In [7]:
from sklearn_crfsuite import CRF, metrics
from sklearn.model_selection import train_test_split
import joblib

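# Read the prepared JSONL (json was imported in an earlier cell)
with open('/content/prepared/data.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]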
tokens = [d['tokens'] for d in data]
tags = [d['tags'] for d in data]

def word2features(sent, i):
    word = sent[i]
    feats = {'bias':1.0,'word.lower()':word.lower(),'word.isupper()':word.isupper(),
             'word.istitle()':word.istitle(),'word.isdigit()':word.isdigit()}
    if i>0: feats.update({'-1:word.lower()':sent[i-1].lower()})
    else: feats['BOS']=True
    if i<len(sent)-1: feats.update({'+1:word.lower()':sent[i+1].lower()})
    else: feats['EOS']=True
    return feats

X = [[word2features(s,i) for i in range(len(s))] for s in tokens]
y = tags
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

crf = CRF(max_iterations=100)
crf.fit(X_train,y_train)
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test,y_pred))
joblib.dump(crf,'/content/prepared/crf_model.joblib')

              precision    recall  f1-score   support

    B-ANIMAL       1.00      0.99      0.99       379
  B-LOCATION       0.88      1.00      0.94       399
    B-PERSON       1.00      0.89      0.94       397
      B-TIME       1.00      0.99      0.99       425
    I-ANIMAL       1.00      1.00      1.00       311
  I-LOCATION       1.00      1.00      1.00       427
    I-PERSON       1.00      1.00      1.00       339
      I-TIME       1.00      1.00      1.00       486

    accuracy                           0.98      3163
   macro avg       0.99      0.98      0.98      3163
weighted avg       0.98      0.98      0.98      3163



['/content/prepared/crf_model.joblib']
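
`flat_classification_report` scores each token tag independently. Since the project objectives call for Precision, Recall, and F1-score, an entity-level view, in which a prediction counts only when the whole span and its type match, can complement the report above. The following is a minimal standard-library sketch of that computation (the `seqeval` package installed in Setup offers a fuller implementation). The function names and the toy tag sequences are assumptions.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return entities

def entity_prf(y_true, y_pred):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the PERSON span is only partly predicted, so it does not count
y_true = [["B-PERSON", "I-PERSON"], ["B-LOCATION"]]
y_pred = [["B-PERSON", "O"], ["B-LOCATION"]]
print(entity_prf(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

In the notebook, `entity_prf(y_test, y_pred)` would give the entity-level counterpart of the token-level report.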

## 7. Transformer Fine-Tuning (Lite)

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification

MODEL_NAME = "Davlan/afro-xlmr-mini"

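# Read the prepared JSONL (json was imported in an earlier cell)
with open('/content/prepared/data.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]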
dataset = Dataset.from_list(data)

labels = sorted({lab for d in data for lab in d['tags']})
label2id = {l:i for i,l in enumerate(labels)}
id2label = {i:l for l,i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_and_align(examples):
    # Padding is left to DataCollatorForTokenClassification, which pads each batch dynamically
    tok = tokenizer(examples['tokens'], is_split_into_words=True, truncation=True, max_length=128)
    new_labels = []
    for i, labs in enumerate(examples['tags']):
        word_ids = tok.word_ids(batch_index=i)
        prev, label_ids = None, []
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)  # special tokens are ignored by the loss
            elif wid != prev:
                label_ids.append(label2id[labs[wid]])  # first subword keeps the word's tag
            else:
                # later subwords of the same word continue the entity, so B- becomes I-
                label_ids.append(label2id[labs[wid].replace('B-', 'I-')])
            prev = wid
        new_labels.append(label_ids)
    tok['labels'] = new_labels
    return tok

tokenized = dataset.map(tokenize_and_align, batched=True)
tokenized = tokenized.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible split

model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(labels), id2label=id2label, label2id=label2id)

args = TrainingArguments(
    output_dir='/content/ner_out',
    eval_strategy='epoch',  # evaluate once per epoch; eval_steps only applies to the 'steps' strategy
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    save_total_limit=1,
    logging_steps=20,
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent transformers releases
)

trainer.train()
trainer.save_model('/content/prepared/ner_model')

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at Davlan/afro-xlmr-mini and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss
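
The alignment step in `tokenize_and_align` is easier to follow on a toy input. The sketch below reproduces its labeling rule without loading a tokenizer. Here `word_ids` is written by hand to mimic what a fast tokenizer returns (`None` for special tokens, a repeated index for subwords of the same word), so the subword split shown is an assumption and not actual `afro-xlmr-mini` output.

```python
def align_labels(word_ids, word_tags, label2id):
    """Map word-level BIO tags onto subword positions given by word_ids."""
    label_ids, prev = [], None
    for wid in word_ids:
        if wid is None:
            label_ids.append(-100)                      # ignored by the loss
        elif wid != prev:
            label_ids.append(label2id[word_tags[wid]])  # first subword of a word
        else:
            # later subwords continue the entity, so B- becomes I-
            label_ids.append(label2id[word_tags[wid].replace("B-", "I-")])
        prev = wid
    return label_ids

label2id = {"B-LOCATION": 0, "I-LOCATION": 1}
# Hypothetical split of "Mayo-Belwa Ward": <s> | Mayo | -Bel | wa | Ward | </s>
word_ids = [None, 0, 0, 0, 1, None]
aligned = align_labels(word_ids, ["B-LOCATION", "I-LOCATION"], label2id)
print(aligned)  # [-100, 0, 1, 1, 1, -100]
```

Positions labeled `-100` are skipped when the loss is computed, so only real subwords contribute to training.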


## 8. Inference Examples
Load saved models and run inference on sample texts. Edit the sample sentences as needed.


In [None]:
import os  # needed for os.path.exists below
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model_dir = '/content/prepared/ner_model'
if os.path.exists(model_dir):
    tok = AutoTokenizer.from_pretrained(model_dir)
    mod = AutoModelForTokenClassification.from_pretrained(model_dir)
    nlp = pipeline('token-classification', model=mod, tokenizer=tok, aggregation_strategy='simple')
    print(nlp("Ngala ta yana kasuwa."))
else:
    print("Train the model first.")
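
With `aggregation_strategy='simple'`, the pipeline returns a list of dictionaries with keys such as `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to turn that list into compact rows. The helper `format_entities`, the `min_score` threshold, and the predictions themselves are made up for illustration and are not real model output.

```python
def format_entities(text, predictions, min_score=0.5):
    """Return (surface text, label, score) for predictions at or above min_score."""
    rows = []
    for p in predictions:
        if p["score"] >= min_score:
            surface = text[p["start"]:p["end"]]  # slice the original text by offsets
            rows.append((surface, p["entity_group"], round(float(p["score"]), 2)))
    return rows

text = "Bitrus Yola"
# Made-up predictions in the pipeline's output shape, for illustration only
fake_predictions = [
    {"entity_group": "PERSON", "score": 0.91, "word": "Bitrus", "start": 0, "end": 6},
    {"entity_group": "LOCATION", "score": 0.42, "word": "Yola", "start": 7, "end": 11},
]
rows = format_entities(text, fake_predictions)
print(rows)  # [('Bitrus', 'PERSON', 0.91)]
```

Slicing the input text by `start` and `end` recovers the original surface form, which can differ from the tokenizer-normalized `word` field.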