# Fine-Tuning for Amharic Named Entity Recognition (NER) from Telegram Data
* This notebook walks through fine-tuning an NER model for Amharic on our Telegram channel data, step by step:

# 1. Data Preparation
* First, we'll need to extract and structure the relevant information from our Excel data:

In [1]:
import pandas as pd

# Load the Excel file
df = pd.read_excel('telegram_data.xlsx')

# Keep every non-null message (no language filtering is applied at this point)
amharic_messages = df[df['Message'].notna()]['Message'].tolist()
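
The cell above keeps every non-null message, whatever its script. If only messages that actually contain Amharic text are wanted, one option is to test for characters in the Ethiopic Unicode block. The sketch below is a minimal illustration, not part of the original notebook: `contains_amharic` and `min_chars` are hypothetical names, and the sample messages are made up. The Ethiopic block is shared with other languages written in the same script, so this checks the script rather than the language.

```python
import re

# Ethiopic block (U+1200 to U+137F): syllables, punctuation, and numerals
ETHIOPIC_CHAR = re.compile(r"[\u1200-\u137F]")

def contains_amharic(text, min_chars=1):
    """Return True if `text` is a string with at least `min_chars` Ethiopic characters."""
    if not isinstance(text, str):
        return False  # NaN / None cells coming from the spreadsheet
    return len(ETHIOPIC_CHAR.findall(text)) >= min_chars

# Made-up sample messages for illustration
messages = ["ዋጋ 1200 ብር", "Price 1200 birr", None]
amharic_only = [m for m in messages if contains_amharic(m)]
print(amharic_only)
```

On the real data, the same predicate could be applied with `df['Message'].apply(contains_amharic)` to build a boolean mask.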

In [12]:
df.head()

| | Channel Title | Channel Username | ID | Message | Date | Media Path |
|---|---|---|---|---|---|---|
| 0 | Sheger online-store | @Shageronlinestore | 5333 | NaN | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5333.jpg |
| 1 | Sheger online-store | @Shageronlinestore | 5332 | NaN | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5332.jpg |
| 2 | Sheger online-store | @Shageronlinestore | 5331 | NaN | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5331.jpg |
| 3 | Sheger online-store | @Shageronlinestore | 5330 | NaN | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5330.jpg |
| 4 | Sheger online-store | @Shageronlinestore | 5329 | NaN | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5329.jpg |


### Data Summary

In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5015 entries, 0 to 5014
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   Channel Title     5015 non-null   object             
 1   Channel Username  5015 non-null   object             
 2   ID                5015 non-null   int64              
 3   Message           3166 non-null   object             
 4   Date              5015 non-null   datetime64[ns, UTC]
 5   Media Path        3794 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(4)
memory usage: 235.2+ KB


### Checking for Missing Values

In [6]:
df.isnull().sum()


Channel Title          0
Channel Username       0
ID                     0
Message             1849
Date                   0
Media Path          1221
dtype: int64

In [9]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')



### Dropping Missing Values

In [17]:
# Drop rows where either Message or Media Path is missing.
df = df.dropna(subset=['Message', 'Media Path'])


In [18]:
df.isnull().sum()

Channel Title       0
Channel Username    0
ID                  0
Message             0
Date                0
Media Path          0
dtype: int64

# 2. Annotation Guidelines
## Define our entity types based on what appears in our data:

* Products (e.g., "ሴራሚክ የላዛኛ መስሪያ", "የፀጉር ፔስትራ")
* Prices (e.g., "1200 ብር", "500 ብር")
* Locations/Addresses (e.g., "ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ")
* Phone numbers
* Measurements (e.g., "3ሜትር", "45ሳ.ሜ")
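
Several of these entity types follow regular surface patterns, so rule-based matching can propose candidates before manual annotation. The sketch below is only an illustration under stated assumptions: the patterns, the `PATTERNS` table, and `find_candidate_entities` are hypothetical, the phone pattern assumes numbers written as `09...`/`07...` or `+2519...`/`+2517...`, and the sample sentence is made up. Matches are suggestions for an annotator to confirm, not final labels.

```python
import re

# Hypothetical patterns; adjust them to the formats that actually occur in the channels
PATTERNS = {
    "PRICE": re.compile(r"\d[\d,]*\s*ብር"),
    "PHONE": re.compile(r"(?:\+251|0)[79]\d{8}"),
    "MEASUREMENT": re.compile(r"\d+(?:\.\d+)?\s*(?:ሜትር|ሳ\.ሜ)"),
}

def find_candidate_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), entity_type, match.group()))
    hits.sort()  # order by position in the text
    return [(entity_type, matched) for _, entity_type, matched in hits]

# Made-up sample sentence for illustration
sample = "ዋጋ 1200 ብር ርዝመት 3ሜትር ስልክ 0911223344"
print(find_candidate_entities(sample))
```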

# 3. Data Annotation
* We'll need to annotate our data in BIO format.
### For annotation, we can use tools like:
* Doccano (open source)
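
In BIO format, the first token of an entity is tagged `B-<type>`, the following tokens of the same entity `I-<type>`, and everything else `O`. The sketch below shows one way to turn token-level spans into BIO tags and CoNLL-style lines. It is an illustration only: `spans_to_bio` and `to_conll` are hypothetical helpers, the sentence is made up, and the label names follow the list used in the training cell of this notebook.

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to BIO tags.

    `spans` holds (start, end, entity_type) tuples in token indices, `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
    return tags

def to_conll(tokens, tags):
    """Render one sentence as CoNLL lines: one 'token tag' pair per line."""
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))

# Made-up example sentence: a product name followed by a price
tokens = ["ሴራሚክ", "የላዛኛ", "መስሪያ", "ዋጋ", "1200", "ብር"]
spans = [(0, 3, "Product"), (4, 6, "PRICE")]
tags = spans_to_bio(tokens, spans)
print(to_conll(tokens, tags))
```

When several sentences are written to one file, a blank line between them marks the sentence boundary.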

### Preprocessing Recommendation

In [None]:
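import re

# Ethiopic digits ፩-፱ map one-to-one onto 1-9. The signs for tens (፲-፺),
# hundred (፻) and ten thousand (፼) are not positional, so they are left as-is.
ETHIOPIC_DIGITS = str.maketrans('፩፪፫፬፭፮፯፰፱', '123456789')

def preprocess_amharic_text(text):
    # Normalize single Ethiopic digits if present
    text = text.translate(ETHIOPIC_DIGITS)

    # Standardize price formats: exactly one space between the amount and ብር
    text = re.sub(r'(\d+)\s*ብር', r'\1 ብር', text)

    # Collapse excessive whitespace and line breaks into single spaces
    text = ' '.join(text.split())

    return text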

# 4. Model Selection
### For Amharic NER, consider these options:

## Option A: Use AfroXLMR (an XLM-RoBERTa variant adapted to African languages)

In [19]:
model_name = "Davlan/afro-xlmr-base"

# 5. Training Code Example

In [3]:
label_list = ['O', 'B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC']
num_labels = len(label_list)


In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# STEP 1: Define your label list
label_list = ['O', 'B-Product', 'I-Product', 'B-PRICE', 'I-PRICE', 'B-LOC', 'I-LOC']
label_to_id = {label: i for i, label in enumerate(label_list)}
id_to_label = {i: label for label, i in label_to_id.items()}
num_labels = len(label_list)

# STEP 2: Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Davlan/afro-xlmr-base")
model = AutoModelForTokenClassification.from_pretrained(
    "Davlan/afro-xlmr-base",
    num_labels=num_labels,
    id2label=id_to_label,  # store readable label names in the model config
    label2id=label_to_id,
)

# STEP 3: Load and prepare your CoNLL data
# Example: read your conll file (replace with your file path)
from datasets import DatasetDict

def read_conll_data(conll_file):
    # Expects one "token label" pair per line, with a blank line between sentences
    tokens, labels, data = [], [], []
    with open(conll_file, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                # Blank line: close the current sentence
                if tokens:
                    data.append({"tokens": tokens, "ner_tags": [label_to_id[l] for l in labels]})
                    tokens, labels = [], []
            else:
                # First column is the token, last column is the label
                tokens.append(parts[0])
                labels.append(parts[-1])
    # Flush the last sentence if the file does not end with a blank line
    if tokens:
        data.append({"tokens": tokens, "ner_tags": [label_to_id[l] for l in labels]})
    return Dataset.from_list(data)

dataset = read_conll_data("ner_data.conll")
train_test = dataset.train_test_split(test_size=0.2, seed=42)  # fixed seed for a reproducible split
train_dataset = train_test["train"]
eval_dataset = train_test["test"]

# STEP 4: Tokenize and align labels
def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    word_ids = tokenized_inputs.word_ids()
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens (<s>, </s>) are ignored by the loss
            labels.append(-100)
        elif word_idx != previous_word_idx:
            # First sub-token of a word carries the word's label
            labels.append(example["ner_tags"][word_idx])
        else:
            # Remaining sub-tokens of the same word are ignored by the loss
            labels.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

train_dataset = train_dataset.map(tokenize_and_align_labels, batched=False)
eval_dataset = eval_dataset.map(tokenize_and_align_labels, batched=False)

# STEP 5: Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_steps=10_000,
    save_total_limit=2,
    learning_rate=2e-5,
    logging_dir='./logs',
)

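# STEP 6: Create trainer
from transformers import DataCollatorForTokenClassification

# Pads input_ids and labels to a common length within each batch
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)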

# STEP 7: Train the model
trainer.train()
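
Step 4 maps word-level tags onto sub-word tokens. The sketch below illustrates one common convention for that mapping: only the first sub-token of each word keeps the word's label, while special tokens and continuation pieces receive `-100` so the loss ignores them. It uses a hand-written `word_ids` list, so it runs without downloading a tokenizer; `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask special tokens and continuations."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned

# Hypothetical tokenization: <s>, word 0 in two pieces, word 1, word 2 in two pieces, </s>
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 3]  # label ids, e.g. B-Product, O, B-PRICE
print(align_labels(word_ids, word_labels))
```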


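
The `Trainer` above is created without a `compute_metrics` function, so evaluation reports the loss but no NER scores. NER is usually scored at the entity level: a prediction counts only if both the span and the type match exactly. The sketch below shows that idea in plain Python for a single tag sequence. It is an illustration, not the notebook's evaluation code: `extract_entities` and `entity_f1` are hypothetical helpers and the tag sequences are made up. Libraries such as `seqeval` provide this kind of scoring for full datasets.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence (end exclusive)."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((start, i, current))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """Precision, recall, and F1 over exact span-and-type matches for one sequence."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Made-up example: the product span is correct, the price span is cut short
gold = ["B-Product", "I-Product", "O", "B-PRICE", "I-PRICE"]
pred = ["B-Product", "I-Product", "O", "B-PRICE", "O"]
print(entity_f1(gold, pred))
```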