# Task 2: Amharic NER Data Labeling in CoNLL Format

In this section, 40 messages (within the 30–50 target range) were randomly sampled from the processed Amharic Telegram data. Each token in these messages was assigned an NER label: Product, Price, or Location, or `O` for tokens outside any entity. The labeled data was then exported as a plain-text file in CoNLL format for use in model training.

In [1]:
import os
import json
import random

# processed data directory
processed_dir = "../data/processed/text"
channels = [d for d in os.listdir(processed_dir) if os.path.isdir(os.path.join(processed_dir, d))]

# Collect all message file paths
all_files = []
for channel in channels:
    channel_dir = os.path.join(processed_dir, channel)
    files = [os.path.join(channel_dir, f) for f in os.listdir(channel_dir) if f.endswith('.json')]
    all_files.extend(files)

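# Randomly sample up to 40 messages (fewer if the corpus is smaller,
# since random.sample raises ValueError when k exceeds the population)
sample_files = random.sample(all_files, min(40, len(all_files)))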

# Load the sampled messages
sampled_msgs = []
for file_path in sample_files:
    with open(file_path, 'r', encoding='utf-8') as f:
        msg = json.load(f)
    sampled_msgs.append({'file': file_path, 'tokens': msg['tokens'], 'text': msg['text']})

# Show the first message as an example
print("Sample message tokens:", sampled_msgs[0]['tokens'])
print("Original text:", sampled_msgs[0]['text'])

Sample message tokens: ['700', '2', '36373839', 'አድራሻ', 'ድሬዳዋ', 'አሸዋ', 'ሚና', 'ህንፃ', '1ኛ', 'ፎቅ', 'ላይ', 'እንገኛለን', 'የቴሌግራም', 'ቻናላችንን', 'ይቀላቀሉ', 'የቤት', 'ቁጥር', '109', 'እና', '110', 'በ', '2', 'አዋሩን', '0987336458', '0948595409', 'ይደውሉልን']
Original text: adidas yeezy boost 700 v2
Size 36#37#38#39
MADE IN VIETNAM
SHEWA BRAND
አድራሻ ድሬዳዋ አሸዋ ሚና ህንፃ 1ኛ ፎቅ ላይ እንገኛለን 
የቴሌግራም ቻናላችንን ይቀላቀሉ
👇👇👇
https://t.me//shewabrand
https://t.me//shewabrand
https://t.me//shewabrand
https://t.me//shewabrand
የቤት ቁጥር 109 እና 110
📩በ inbox  @shewat2 አዋሩን

📞 0987336458
📞0948595409 ይደውሉልን
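
Because `random.sample` draws a different subset on every run, the sampled messages above cannot be regenerated exactly. A minimal sketch of a reproducible variant is shown below. The helper name `sample_reproducibly`, the seed value, and the demo file names are illustrative assumptions and are not part of the original notebook.

```python
import random

def sample_reproducibly(paths, k, seed=42):
    """Return up to k paths, chosen deterministically for a given seed."""
    # Sort first: os.listdir returns entries in arbitrary order, so the
    # same seed could otherwise select different files on another machine.
    ordered = sorted(paths)
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(ordered, min(k, len(ordered)))

# Hypothetical file names, for illustration only
demo_paths = [f"msg_{i}.json" for i in range(100)]
first = sample_reproducibly(demo_paths, 40)
second = sample_reproducibly(list(reversed(demo_paths)), 40)
print(first == second)
```

Using a local `random.Random` instance rather than `random.seed` keeps the fixed seed from affecting any other code that relies on the global generator.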


## Step 2: Token Labeling

For each sampled message, entity labels were assigned to each token. The following entity types were used:
- `B-Product`, `I-Product`
- `B-LOC`, `I-LOC`
- `B-PRICE`, `I-PRICE`
- `O` (for tokens outside any entity)

The code cell below was used to label each message interactively.

In [None]:
ENTITY_LABELS = [
    "O", "B-Product", "I-Product", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"
]

def label_message(tokens):
    print("Tokens:", tokens)
    labels = []
    for token in tokens:
        # Strip stray whitespace so that e.g. "B-LOC " is not rejected as invalid
        label = input(f"Label for '{token}': (O/B-Product/I-Product/B-LOC/I-LOC/B-PRICE/I-PRICE) ").strip()
        if label not in ENTITY_LABELS:
            print("Invalid label, using 'O'.")
            label = "O"
        labels.append(label)
    return labels

# Example: label the first sampled message
labels = label_message(sampled_msgs[0]['tokens'])
print(list(zip(sampled_msgs[0]['tokens'], labels)))
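
Manual labeling makes it easy to enter an `I-` tag that does not continue an entity of the same type, which violates the B/I/O scheme listed above. The following is a minimal sketch of a consistency check. The function name `find_bio_errors` is a hypothetical helper and was not part of the original labeling workflow.

```python
def find_bio_errors(labels):
    """Return (index, label) pairs where an I- tag does not continue an entity."""
    errors = []
    previous = "O"
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            entity_type = label[2:]
            # An I- tag is valid only directly after B- or I- of the same type
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                errors.append((i, label))
        previous = label
    return errors

# Illustrative label sequence: the final I-PRICE has no preceding B-PRICE
print(find_bio_errors(["B-LOC", "I-LOC", "O", "I-PRICE"]))  # [(3, 'I-PRICE')]
```

Running such a check on each label list before export would flag sequences that need to be relabeled.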

## Step 3: Labeling All Sampled Messages

The labeling process was repeated for all sampled messages. Tokens and their corresponding labels were stored in a list for subsequent export.

In [None]:
labeled_data = []
for i, msg in enumerate(sampled_msgs):
    print(f"\nMessage {i+1}/{len(sampled_msgs)}")
    print("Original text:", msg['text'])
    print("Tokens:", msg['tokens'])
    labels = label_message(msg['tokens'])
    labeled_data.append({'tokens': msg['tokens'], 'labels': labels})

## Step 4: Exporting Labeled Data to CoNLL Format

Once labeling was complete, the data was exported as a plain-text file in CoNLL format: one `token label` pair per line, with a blank line separating messages.

In [None]:
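def save_to_conll(labeled_data, output_path):
    # Create the output directory if it does not exist yet (os is imported above)
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        for item in labeled_data:
            for token, label in zip(item['tokens'], item['labels']):
                f.write(f"{token} {label}\n")
            f.write("\n")  # blank line separates messages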

output_path = "../data/labeled/ner_labeled_sample.conll"
save_to_conll(labeled_data, output_path)
print(f"Saved labeled data to {output_path}")
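
To confirm that the exported file can be read back for training, a round-trip check can be useful. The sketch below assumes the space-separated layout written by `save_to_conll`. The function `load_conll` and the two-token demo (including its labels) are illustrative assumptions, not part of the original notebook or its labeled data.

```python
import os
import tempfile

def load_conll(path):
    """Parse a CoNLL-style file into a list of {'tokens', 'labels'} dicts."""
    sentences, tokens, labels = [], [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line marks the end of a message
                if tokens:
                    sentences.append({"tokens": tokens, "labels": labels})
                    tokens, labels = [], []
                continue
            # Split on the last space so the label is always the final field
            token, label = line.rsplit(" ", 1)
            tokens.append(token)
            labels.append(label)
    if tokens:  # the file did not end with a blank line
        sentences.append({"tokens": tokens, "labels": labels})
    return sentences

# Hypothetical labeled message, written in the same layout as save_to_conll
demo = [{"tokens": ["አድራሻ", "ድሬዳዋ"], "labels": ["O", "B-LOC"]}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.conll")
    with open(path, "w", encoding="utf-8") as f:
        for item in demo:
            for token, label in zip(item["tokens"], item["labels"]):
                f.write(f"{token} {label}\n")
            f.write("\n")
    reloaded = load_conll(path)

print(reloaded == demo)
```

Splitting on the last space rather than the first keeps the label intact even if a token were to contain a space, although the tokens shown in this notebook do not.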