## Notebook contents
This notebook covers:
- loading CoNLL data and preparing it for a Hugging Face model
- tokenizing the data and aligning the labels with the tokens produced by the tokenizer
- defining the training arguments
- using the Hugging Face Trainer API to fine-tune the model (XLM-RoBERTa)
- evaluating the fine-tuned model on the validation set
- saving the model for future use


In [None]:
import os
import sys

import transformers
from transformers import AutoTokenizer
from datasets import Dataset, DatasetDict

# Make the project's src/ directory importable
sys.path.append(os.path.abspath('../src'))
from utils.ner_utils import load_conll, tokenize_and_align_labels, compute_metrics
from models.ner_model import load_ner_model, save_ner_model, get_ner_pipeline

print("Transformers version:", transformers.__version__)

# Model checkpoint name for XLM-RoBERTa (the weights are loaded later)
model_checkpoint = "xlm-roberta-base"
print("Model checkpoint loaded ✅")

Transformers version: 4.52.4
Model checkpoint loaded ✅


### Next Task:
Load the `.conll` file with `load_conll`, which ignores the POS/chunk columns and extracts only the tokens and their BIO tags.
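The real `load_conll` lives in `src/utils/ner_utils.py` and is not shown in this notebook. As a rough idea of what such a parser could look like, here is a minimal sketch. The name `load_conll_sketch` is hypothetical, and it assumes the token is in the first whitespace-separated column, the BIO tag is in the last one, and blank lines separate sentences.

```python
from typing import List, Tuple


def load_conll_sketch(path: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Read a CoNLL file into parallel lists of tokens and BIO tags (illustrative only)."""
    sentences, labels = [], []
    tokens, tags = [], []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("-DOCSTART-"):
                # A blank line (or document marker) closes the current sentence.
                if tokens:
                    sentences.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            columns = line.split()
            tokens.append(columns[0])   # first column: the word
            tags.append(columns[-1])    # last column: the BIO tag (POS/chunk columns are skipped)
    if tokens:  # the file may not end with a blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels
```

Taking the first and last columns means the same code handles both two-column files (token, tag) and four-column files (token, POS, chunk, tag).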

In [None]:
# Use the load_conll function to load the CoNLL file
conll_path = "../data/processed/telegram_labeled_data.conll"
word_tokens, word_labels = load_conll(conll_path)

# Print the tokens and labels of the first sentence as an example
print(word_tokens[0])
print(word_labels[0])

['SUN', '5', 'Nail', 'Dryer', ':', 'Infrared', 'intelligent', 'induction', '(', '30', 'S', '60', 'S', '90', 'S', 'timing', ')', 'LCD', 'display', 'Bottom', 'cooling', 'hole', 'ዋጋ፦', '2600', 'ብር', 'ውስን', 'ፍሬ', 'ነው', 'ያለው', 'አድራሻ', 'ቁ.1', 'መገናኛ', 'ታሜ', 'ጋስ', 'ህንፃ', 'ጎን', 'ስሪ', 'ኤም', 'ሲቲ', 'ሞል', 'ሁለተኛ', 'ፎቅ', 'ቢሮ', 'ቁ.', 'SL-05A', '(', 'ከ', 'ሊፍቱ', 'ፊት', 'ለ', 'ፊት', ')', 'ቁ.2', 'ለቡ', 'መዳህኒዓለም', 'ቤተ', '/', 'ክርስቲያን', '100ሜ', 'ወደ', 'ሙዚቃ', 'ቤት', 'ከፍ', 'ብሎ', '2ኛ', 'ፎቅ', 'ቢሮ.ቁ', '214', '0909522840', '0923350054', 'ለቡ', 'ቅርንጫፍ0973611819', 'በTelegram', 'ለማዘዝ', 'ይጠቀሙ', '@', 'shager_onlinestore', 'ለተጨማሪ', 'ማብራሪያ', 'የቴሌግራም', 'ገፃችን', 'https', ':', '/', '/', 't.me', '/', 'Shageronlinestore']
['B-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'O', 'B-PROD_COMPONENT', 'I-PROD_COMPONENT', 'I-PROD_COMPONENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PROD_COMPONENT', 'I-PROD_COMPONENT', 'I-PROD_COMPONENT', 'I-PROD_COMPONENT', 'I-PROD_COMPONENT', 'B-PRICE', 'I-PRICE', 'I-PRICE', 'O', 'O', 'O', 'O', '

## Next, convert the `word_tokens` and `word_labels` lists into a Hugging Face dataset

In [5]:
# Create Dataset
raw_dataset = Dataset.from_dict({
    "tokens": word_tokens,
    "ner_tags": word_labels
})
# 80/20 split: 80% for training, 20% held out
raw_dataset = raw_dataset.train_test_split(test_size=0.2, seed=42)
# Note: "validation" and "test" both point to the same held-out 20% split
raw_dataset = DatasetDict({
    "train": raw_dataset["train"],
    "validation": raw_dataset["test"],
    "test": raw_dataset["test"]
})

print("raw_dataset loaded ✅")
print(f'the first train element: \n {raw_dataset["train"][0]}')
print(f'the first validation element: \n {raw_dataset["validation"][0]}')

raw_dataset loaded ✅
the first train element: 
 {'tokens': ['ብዙ', 'ተወዳጀነትን', 'የተረፈ', 'የቃልኪዳን', 'ጉዞ', 'ለሁሉም', 'እድሜ', 'የሚሆን', 'የአማረኛ', 'ስዕላዊ', 'እና', 'መሳጭ', 'ታሪክ', 'በቀለም', 'እትመት', 'ለማዘዝ', '0974312223', 'ይደውሉ', 'ወይም', 'https', ':', '/', '/', 't.me', '/', 'helloo_market_bot', '?', 'start=121910003', 'ይጠቀሙ', '!'], 'ner_tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'B-PROD_COMPONENT', 'I-PROD_COMPONENT', 'O', 'B-CONTACT', 'O', 'O', 'B-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'O', 'O']}
the first validation element: 
 {'tokens': ['Nike', 'sb', 'Made', 'in', 'Vietnam', 'Size', '40,43', 'Price', '2850', '(', 'Free', 'Delivery', ')', 'Inbox', '@', 'Hiwe5266', 'ስልክ', '+251945355266', 'ፋሽን', 'ተራ', '/', 'Fashion', 'Tera', 'አድራሻ', ':', 'አዲስ', 'አበባ', ',', 'ጦር', 'ሀይሎች', 'ድሪም', 'ታወር', '2ተኛ', 'ፎቅ', 'ቢሮ', 'ቁጥር', '205'], 'ner_tags': ['B-PRODUCT', 'I-PRODUCT', 'B-PROD_COMPON

In [6]:
# Instantiate XLM-RoBERTa tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Convert string labels to integer IDs for loss computation
unique_labels = sorted({ tag for seq in word_labels for tag in seq })
label2id = { lab: i for i, lab in enumerate(unique_labels) }
id2label = { i: lab for lab, i in label2id.items() }

# Display the labels and their IDs
print(f"unique labels:{unique_labels}")
print(f"label2id:{label2id}")
print(f"id2label: {id2label}")

# Display the first element of the raw_dataset for training
print("\n\n")
print("Example tokens:", raw_dataset["train"][0]["tokens"])
print("Example tags:  ", raw_dataset["train"][0]["ner_tags"])
print("Label2ID map:", label2id)

unique labels:['B-CONTACT', 'B-DELIVERY_FEE', 'B-LOC', 'B-PRICE', 'B-PRODUCT', 'B-PROD_COMPONENT', 'I-CONTACT', 'I-DELIVERY_FEE', 'I-LOC', 'I-PRICE', 'I-PRODUCT', 'I-PROD_COMPONENT', 'O']
label2id:{'B-CONTACT': 0, 'B-DELIVERY_FEE': 1, 'B-LOC': 2, 'B-PRICE': 3, 'B-PRODUCT': 4, 'B-PROD_COMPONENT': 5, 'I-CONTACT': 6, 'I-DELIVERY_FEE': 7, 'I-LOC': 8, 'I-PRICE': 9, 'I-PRODUCT': 10, 'I-PROD_COMPONENT': 11, 'O': 12}
id2label: {0: 'B-CONTACT', 1: 'B-DELIVERY_FEE', 2: 'B-LOC', 3: 'B-PRICE', 4: 'B-PRODUCT', 5: 'B-PROD_COMPONENT', 6: 'I-CONTACT', 7: 'I-DELIVERY_FEE', 8: 'I-LOC', 9: 'I-PRICE', 10: 'I-PRODUCT', 11: 'I-PROD_COMPONENT', 12: 'O'}



Example tokens: ['ብዙ', 'ተወዳጀነትን', 'የተረፈ', 'የቃልኪዳን', 'ጉዞ', 'ለሁሉም', 'እድሜ', 'የሚሆን', 'የአማረኛ', 'ስዕላዊ', 'እና', 'መሳጭ', 'ታሪክ', 'በቀለም', 'እትመት', 'ለማዘዝ', '0974312223', 'ይደውሉ', 'ወይም', 'https', ':', '/', '/', 't.me', '/', 'helloo_market_bot', '?', 'start=121910003', 'ይጠቀሙ', '!']
Example tags:   ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'I-PRODUC

## Next step: **Tokenize and align labels**
 * ### Converts word-level tokens and `ner_tags` into subword-level inputs that XLM-RoBERTa can consume, propagating BIO tags with **word_ids()**

In [None]:
from transformers import DataCollatorForTokenClassification
# use the tokenize_and_align_labels function from the ner_utils.py module
#    - Tokenize the batch of word-lists with is_split_into_words=True
#    - Build a list of subword-label sequences in `all_labels`
#    - Use word_ids() to tell which word each subword belongs to


tokenized_dataset = raw_dataset.map(
    lambda examples: tokenize_and_align_labels(examples, tokenizer, label2id),
    batched=True
)
print("Tokenized & aligned example:", tokenized_dataset["train"][0])


Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Tokenized & aligned example: {'tokens': ['ብዙ', 'ተወዳጀነትን', 'የተረፈ', 'የቃልኪዳን', 'ጉዞ', 'ለሁሉም', 'እድሜ', 'የሚሆን', 'የአማረኛ', 'ስዕላዊ', 'እና', 'መሳጭ', 'ታሪክ', 'በቀለም', 'እትመት', 'ለማዘዝ', '0974312223', 'ይደውሉ', 'ወይም', 'https', ':', '/', '/', 't.me', '/', 'helloo_market_bot', '?', 'start=121910003', 'ይጠቀሙ', '!'], 'ner_tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'B-PROD_COMPONENT', 'I-PROD_COMPONENT', 'O', 'B-CONTACT', 'O', 'O', 'B-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'I-CONTACT', 'O', 'O'], 'input_ids': [0, 21886, 2981, 5698, 5040, 14090, 110514, 3446, 3376, 6980, 623, 115929, 29654, 39543, 150124, 2237, 101748, 225089, 213301, 11718, 4236, 151666, 17930, 16160, 119008, 2302, 6, 155327, 13799, 44181, 189821, 816, 4708, 1437, 71429, 13253, 9039, 7872, 3894, 16360, 5016, 4015, 3742, 2934, 70092, 5617, 16903, 3975, 152, 248, 248, 808, 5, 282, 248, 7943, 47673, 454, 55637, 454, 9190

### Explanation of key steps in the **tokenize_and_align_labels()** function
* is_split_into_words=True tells the tokenizer to remember word boundaries so we can map back.

* word_ids() returns a list where each element is the index of the original word that generated that subword (or None for special tokens).

* We assign -100 to special tokens so they don’t contribute to the loss.

* On seeing a new word index, we assign the word’s tag; otherwise we convert B- to I- for subsequent subwords.
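The alignment rules above can be sketched for a single sentence as follows. This is only an illustration under stated assumptions: the project's actual `tokenize_and_align_labels` is defined in `ner_utils.py` and is not shown here, and `align_labels_sketch` is a hypothetical name. The `word_ids` argument is assumed to be the list returned by `word_ids()` for one tokenized example.

```python
from typing import Dict, List, Optional


def align_labels_sketch(
    word_ids: List[Optional[int]],
    word_tags: List[str],
    label2id: Dict[str, int],
) -> List[int]:
    """Map word-level BIO tags onto subword positions (illustrative only)."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as <s> and </s> are ignored by the loss.
            aligned.append(-100)
        else:
            tag = word_tags[word_idx]
            if word_idx == previous and tag.startswith("B-"):
                # Later subwords of the same word continue the entity.
                tag = "I-" + tag[2:]
            aligned.append(label2id[tag])
        previous = word_idx
    return aligned


label_map = {"B-PRICE": 0, "I-PRICE": 1, "O": 2}
# The first word is split into two subwords, so word index 0 appears twice.
print(align_labels_sketch([None, 0, 0, 1, 2, None], ["B-PRICE", "I-PRICE", "O"], label_map))
# [-100, 0, 1, 1, 2, -100]
```

The sketch relies on `label2id` containing an `I-` tag for every `B-` tag, which holds for the label set used in this notebook.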

# Data collation
Prepare dynamic batches of tokenized inputs and labels, padding sequences to the maximum length in each batch and masking labels appropriately.
- Ensures batch elements are padded to the same length for parallel computation, while ignoring padded tokens during loss calculation

In [8]:
# Instantiate the data collator
data_collator = DataCollatorForTokenClassification(
    tokenizer,
    padding='longest',           # pad to longest in batch
    label_pad_token_id=-100      # labels set to -100 are ignored by the loss, so the model does not learn from padding
)
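To make the padding behaviour concrete, the following pure-Python sketch pads a small batch by hand. It is not the collator's implementation: `DataCollatorForTokenClassification` returns tensors and follows the tokenizer's padding side, whereas this hypothetical `pad_batch_sketch` helper uses plain lists and always pads on the right. The pad id should come from `tokenizer.pad_token_id`; the value 1 used below is the `<pad>` id in the XLM-RoBERTa vocabulary.

```python
from typing import Dict, List


def pad_batch_sketch(
    features: List[Dict[str, List[int]]],
    pad_token_id: int,
    label_pad_token_id: int = -100,
) -> Dict[str, List[List[int]]]:
    """Right-pad every example to the longest sequence in the batch (illustrative only)."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = longest - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


example_batch = pad_batch_sketch(
    [
        {"input_ids": [0, 5, 2], "labels": [-100, 3, -100]},
        {"input_ids": [0, 5, 6, 7, 2], "labels": [-100, 4, 10, 10, -100]},
    ],
    pad_token_id=1,
)
print(example_batch["labels"][0])  # [-100, 3, -100, -100, -100]
```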

# Model Fine‑Tuning with LoRA
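Efficiently adapt XLM-RoBERTa to our NER data by training small LoRA adapter modules instead of all model parameters. Note that the cells below pass `base_model` straight to `Trainer`, so any LoRA adapters would have to be attached inside `load_ner_model`; otherwise all model parameters are updated.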

In [None]:
# Load AutoModelForTokenClassification with the correct number of output labels.


base_model = load_ner_model(
    model_checkpoint,
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id
)

print("Model loaded Successfully✅")

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded Successfully✅


### Define Training Arguments

In [None]:
from transformers import TrainingArguments

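# Initialize TrainingArguments
# Note: report_to is not set here, so installed logging integrations (such as W&B) stay enabled
training_args = TrainingArguments(
    output_dir="../models/Model_checkpoints",
    eval_strategy="epoch",             # Evaluate at the end of every epoch
    save_strategy="epoch",             # Save a checkpoint at the end of every epoch
    learning_rate=2e-5,
    num_train_epochs=20,
    weight_decay=0.01,                 # Weight decay regularization
    label_names=["labels"],
    logging_strategy="steps",          # Log metrics at regular step intervals
    logging_steps=10,
    save_total_limit=2                 # Keep only the two most recent checkpoints
)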
print("Initialized Training arguments")

Initialized Training arguments


In [None]:
# compute_metrics (from ner_utils) reports precision, recall, F1 and accuracy
# for the model's predictions; it is passed to the Trainer below
import evaluate
from transformers import Trainer

metric = evaluate.load("seqeval")

# Initialize Trainer
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,
    compute_metrics=lambda eval_preds: compute_metrics(eval_preds, unique_labels, metric)
)
print("Trainer initialized successfully✅")


Trainer initialized successfully✅
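The `compute_metrics` helper passed to the `Trainer` is defined in `ner_utils.py` and is not shown in this notebook. A metrics function of this kind typically takes the argmax over the logits, drops positions labelled -100, converts IDs back to tag strings, and hands the result to seqeval. The sketch below follows that pattern; `decode_for_seqeval` and `compute_metrics_sketch` are hypothetical names, and the `overall_*` keys are the ones returned by the `seqeval` metric from `evaluate`.

```python
from typing import List, Sequence, Tuple


def decode_for_seqeval(
    logits, label_ids, label_list: Sequence[str]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Turn logits and gold IDs into tag strings, skipping -100 positions (illustrative only)."""
    predictions, references = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        pred_tags, true_tags = [], []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == -100:
                continue  # special tokens and padding are not scored
            scores = list(token_logits)
            best = max(range(len(scores)), key=scores.__getitem__)  # argmax
            pred_tags.append(label_list[best])
            true_tags.append(label_list[gold])
        predictions.append(pred_tags)
        references.append(true_tags)
    return predictions, references


def compute_metrics_sketch(eval_preds, label_list, metric):
    """Score predictions with a seqeval-style metric object."""
    logits, label_ids = eval_preds
    predictions, references = decode_for_seqeval(logits, label_ids, label_list)
    results = metric.compute(predictions=predictions, references=references)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```

Because seqeval scores whole entity spans rather than individual tokens, precision, recall, and F1 can stay low while token-level accuracy is already high.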


**Train model**

In [13]:
# Train model
trainer.train()




| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 2.2087 | 1.57546 | 0.0 | 0.0 | 0.0 | 0.513679 |
| 2 | 1.4558 | 1.032278 | 0.366667 | 0.184874 | 0.24581 | 0.64434 |
| 3 | 1.0101 | 0.67512 | 0.300885 | 0.285714 | 0.293103 | 0.809434 |
| 4 | 0.7255 | 0.444753 | 0.443396 | 0.394958 | 0.417778 | 0.874057 |
| 5 | 0.5601 | 0.363 | 0.452174 | 0.436975 | 0.444444 | 0.888208 |
| 6 | 0.4369 | 0.320637 | 0.586207 | 0.571429 | 0.578723 | 0.907075 |
| 7 | 0.397 | 0.28176 | 0.552846 | 0.571429 | 0.561983 | 0.92783 |
| 8 | 0.3405 | 0.23234 | 0.678571 | 0.638655 | 0.658009 | 0.941038 |
| 9 | 0.3168 | 0.221605 | 0.605263 | 0.579832 | 0.592275 | 0.941038 |
| 10 | 0.2901 | 0.215167 | 0.631148 | 0.647059 | 0.639004 | 0.942925 |



TrainOutput(global_step=200, training_loss=0.4768771260976791, metrics={'train_runtime': 700.2496, 'train_samples_per_second': 2.285, 'train_steps_per_second': 0.286, 'total_flos': 172485259741008.0, 'train_loss': 0.4768771260976791, 'epoch': 20.0})
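The `global_step=200` reported above can be sanity-checked with a little arithmetic. The sketch assumes the default per-device batch size of 8, a single device, and no gradient accumulation, none of which are set explicitly in this notebook; the 80 training examples and 20 epochs come from the cells above.

```python
import math

num_examples = 80            # training examples, from the Map progress output
per_device_batch_size = 8    # assumption: TrainingArguments default, single device
num_epochs = 20              # num_train_epochs in TrainingArguments
train_runtime = 700.2496     # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)   # 10 200
print(round(samples_per_second, 3))   # 2.285
```

Both values agree with the `global_step` and `train_samples_per_second` fields in `TrainOutput`, which suggests the assumed batch size matches the run.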

# Let's test our model on some raw texts

In [None]:
# Build a token-classification pipeline from the fine-tuned model and tokenizer
nlp = get_ner_pipeline(trainer.model, tokenizer)
# Define some raw texts to tag (Amharic/English Telegram listings)
examples = [
    "Head protector helmet for kids ዋጋ:-550ብር አድራሻ ቁ.1 መገናኛ ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ. SL-05A(ከ ሊፍቱ ፊት ለ ፊት) ቁ.2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት 2ኛ ፎቅ ቢሮ ቁጥር.214 ለቡ ቅርንጫፍ0971611819 0909522840 0923350054 በTelegram ለማዘዝ ይጠቀሙ @shager_onlinestore ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን https://t.me/Shageronlinestore"
    ]


# Run the pipeline and inspect the predicted entities
for text in examples:
    ents = nlp(text)
    print(f"\nText: {text}")
    for e in ents:
        print(f"   {e['entity_group']} [{e['start']}:{e['end']}] -> {e['word']}")

Device set to use cuda:0



Text: Head protector helmet for kids ዋጋ:-550ብር አድራሻ ቁ.1 መገናኛ ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ. SL-05A(ከ ሊፍቱ ፊት ለ ፊት) ቁ.2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት 2ኛ ፎቅ ቢሮ ቁጥር.214 ለቡ ቅርንጫፍ0971611819 0909522840 0923350054 በTelegram ለማዘዝ ይጠቀሙ @shager_onlinestore ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን https://t.me/Shageronlinestore
   PRODUCT [0:30] -> Head protector helmet for kids
   PRICE [31:40] -> ዋጋ:-550ብር
   LOC [50:87] -> መገናኛ ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ. SL-05A
   LOC [107:151] -> ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት 2ኛ ፎቅ ቢሮ ቁጥር.214
   CONTACT [155:192] -> ቅርንጫፍ0971611819 0909522840 0923350054
   CONTACT [213:232] -> @shager_onlinestore
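The `entity_group` values printed above are the result of merging consecutive `B-`/`I-` predictions into spans. The sketch below shows that grouping idea at word level. It is a simplification with a hypothetical `group_bio_sketch` helper: the real pipeline aggregates subword predictions and tracks character offsets, and how `get_ner_pipeline` configures it is not shown in this notebook.

```python
from typing import List, Tuple


def group_bio_sketch(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) entity spans (illustrative only)."""
    entities = []
    current_label, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A B- tag, or an I- tag with a different label, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if tag == "O" or starts_new:
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
        if tag != "O":
            current_label = label
            current_words.append(token)
    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities


print(group_bio_sketch(
    ["Nail", "Dryer", ":", "ዋጋ፦", "2600", "ብር"],
    ["B-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "I-PRICE", "I-PRICE"],
))
# [('PRODUCT', 'Nail Dryer'), ('PRICE', 'ዋጋ፦ 2600 ብር')]
```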


In [None]:
# Same pipeline as above (trainer model + tokenizer), applied to a second example
nlp = get_ner_pipeline(trainer.model, tokenizer)

text = "PRO STANDARD BRAND : DELL INSPIRON 2 in 1 DISPLAY: 13.3” touch screen CPU: CORE I5 11th generation RAM:8GB DDR4 STORAG: 512GB SSD GRAPHICS: intel Iris xe graphics card OS:window 10 pro BATTERY: 10hr STATUS: brand new Price 65000birr @rasneva ለአጭር መልእክት ይደዉሉ +251912759900 +251920153333 አድራሻ: - መገናኛ ማራቶን የ ገበያ ማእከል በ ዋናው መግቢያ መሬት ላይ ወይንም ግራውንድ ፍሎር ብቅ ይበሉ ነቫ ኮምፒውተር መሆኑን ያረጋግጡ ድህረ ገጻችንን ይጎብኙ www.nevacomputer.com ቴሌግራም ቻናላችንን ይቀላቀሉ https://t.me/nevacomputer"
ents = nlp(text)
print(f"\nText: {text}")
for e in ents:
    print(f"   {e['entity_group']} [{e['start']}:{e['end']}] -> {e['word']}")



Device set to use cuda:0



Text: PRO STANDARD BRAND : DELL INSPIRON 2 in 1 DISPLAY: 13.3” touch screen CPU: CORE I5 11th generation RAM:8GB DDR4 STORAG: 512GB SSD GRAPHICS: intel Iris xe graphics card OS:window 10 pro BATTERY: 10hr STATUS: brand new Price 65000birr @rasneva ለአጭር መልእክት ይደዉሉ +251912759900 +251920153333 አድራሻ: - መገናኛ ማራቶን የ ገበያ ማእከል በ ዋናው መግቢያ መሬት ላይ ወይንም ግራውንድ ፍሎር ብቅ ይበሉ ነቫ ኮምፒውተር መሆኑን ያረጋግጡ ድህረ ገጻችንን ይጎብኙ www.nevacomputer.com ቴሌግራም ቻናላችንን ይቀላቀሉ https://t.me/nevacomputer
   PRODUCT [0:34] -> PRO STANDARD BRAND : DELL INSPIRON
   PROD_COMPONENT [35:41] -> 2 in 1
   PROD_COMPONENT [42:69] -> DISPLAY: 13.3” touch screen
   PROD_COMPONENT [70:98] -> CPU: CORE I5 11th generation
   PROD_COMPONENT [99:111] -> RAM:8GB DDR4
   PROD_COMPONENT [112:129] -> STORAG: 512GB SSD
   PROD_COMPONENT [130:167] -> GRAPHICS: intel Iris xe graphics card
   PROD_COMPONENT [168:184] -> OS:window 10 pro
   PROD_COMPONENT [185:198] -> BATTERY: 10hr
   PROD_COMPONENT [199:216] -> STATUS: brand new
   PRICE [217:232] -> Pric

In [None]:
save_ner_model(trainer.model, tokenizer, "../models/my_xml_ner_model")
print("saved✅✅")

saved✅✅


# Fine-Tune NER Model (XLM-RoBERTa) – Notebook Summary

This notebook demonstrates a complete workflow for fine-tuning an XLM-RoBERTa model for Named Entity Recognition (NER) on Amharic e-commerce Telegram data. The process is modularized for clarity and reusability.

---

## Workflow Overview

1. **Data Loading**
    - Loads annotated data from a CoNLL file using a utility function.
    - Extracts tokens and BIO NER tags for each sentence.

2. **Dataset Preparation**
    - Converts the token and label lists into a Hugging Face `Dataset`.
    - Splits the data 80/20 into a training set and a held-out set; the held-out split is registered as both `validation` and `test`.

3. **Label Encoding**
    - Maps string NER tags to integer IDs and vice versa for model compatibility.

4. **Tokenization & Label Alignment**
    - Tokenizes the data using the XLM-RoBERTa tokenizer.
    - Aligns word-level NER tags with subword tokens, handling special tokens and subword splits.

5. **Data Collation**
    - Uses a data collator to dynamically pad batches and mask out padding tokens during training.

6. **Model Loading**
    - Loads a pre-trained XLM-RoBERTa model for token classification, adapting it for the specific NER label set.

7. **Training Setup**
    - Defines training arguments (epochs, learning rate, logging, etc.).
    - Initializes the Hugging Face `Trainer` with the model, datasets, tokenizer, data collator, and a custom metrics function.

8. **Model Training**
    - Fine-tunes the model on the training data and evaluates on the validation set.

9. **Inference**
    - Builds an NER pipeline using the fine-tuned model and tokenizer.
    - Runs inference on sample Amharic texts and prints recognized entities.

10. **Model Saving**
    - Saves the trained model and tokenizer for future use.

---

## Modularization

- **Utility functions** for data loading, tokenization, and metrics are placed in `src/utils/ner_utils.py`.
- **Model loading, saving, and inference pipeline** functions are in `src/models/ner_model.py`.
- The notebook imports and uses these modules, keeping the workflow clean and maintainable.

---

**This notebook provides a reproducible and modular template for NER model fine-tuning on custom Amharic datasets using Hugging Face Transformers.**