# Amharic NER Labeling in CoNLL Format (Modular)

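This notebook labels Amharic Telegram e-commerce messages for named entity recognition (NER) and saves the result in CoNLL format. Loading, tokenizing, labeling, and saving are delegated to helper functions in `src/utils/labeling_utils.py`, so the notebook itself only holds the manual labeling step.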

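**Entity Types:**
- B-Product, I-Product: first and following tokens of a product name
- B-LOC, I-LOC: first and following tokens of a location
- B-PRICE, I-PRICE: first and following tokens of a price
- O: token outside any entity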
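To make the B-/I- convention concrete, here is a minimal sketch of a consistency check for a label sequence. The function name `is_valid_bio` and the example labeling are illustrative assumptions, not part of `labeling_utils.py`; the tokens are the first four printed for Message 1.

```python
# Entity types used in this notebook (spelling as in the list above).
ALLOWED_TYPES = {"Product", "LOC", "PRICE"}


def is_valid_bio(labels):
    """Return True if every label is known and every I-X continues a span of type X.

    Hypothetical helper for illustration only.
    """
    prev_type = None  # entity type of the previous token, None after an "O"
    for label in labels:
        if label == "O":
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix not in ("B", "I") or ent_type not in ALLOWED_TYPES:
            return False  # unknown prefix or entity type
        if prefix == "I" and prev_type != ent_type:
            return False  # I-X must follow B-X or I-X of the same type
        prev_type = ent_type
    return True


# Assumed labeling of the first four tokens of Message 1.
example_tokens = ["💥Anti", "slip", "tape", "👉ደረጃ"]
example_labels = ["B-Product", "I-Product", "I-Product", "O"]
print(is_valid_bio(example_labels))
```

A sequence such as `['O', 'I-LOC']` fails the check because the `I-LOC` tag has no preceding `B-LOC`.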

**Steps:**
1. Load the raw dataset from `data/raw/telegram_data.xlsx` with `load_telegram_messages_xlsx`.
2. Tokenize each message with `tokenize_message`.
3. Assign a label to each token by hand and pair tokens with labels using `label_tokens`.
4. Save the labeled messages in CoNLL format with `save_conll_format`.
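The source of `tokenize_message` is not shown in this notebook. The token lists it prints (for example `'✨size:-50mm*5meter'` and `'SL-05A(ከ'` kept as single tokens) look like a plain whitespace split, so the tokenizing step can be approximated as follows. `sketch_tokenize` is a hypothetical stand-in, and the sample string is an abridged version of Message 1.

```python
def sketch_tokenize(message: str) -> list[str]:
    """Hypothetical stand-in for tokenize_message: split on runs of whitespace."""
    # str.split() with no arguments treats spaces, tabs, and newlines alike and
    # drops empty strings, so emoji and punctuation stay attached to their word.
    return message.split()


# Abridged from Message 1; the blank lines and indentation mimic the raw text.
sample = "💥Anti slip tape\n\n👉ደረጃ ወይም\n\n          ዋጋ:-500ብር"
tokens = sketch_tokenize(sample)
print(tokens)
```

One consequence of this kind of tokenizer is that a price such as `ዋጋ:-500ብር` stays a single token, so it receives a single label rather than separate labels for the keyword, the amount, and the currency.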

In [4]:
import sys

sys.path.append('../')  # Add the project root so `src.utils` can be imported

from src.utils.labeling_utils import (
    load_telegram_messages_xlsx,
    tokenize_message,
    label_tokens,
    save_conll_format,
)

DATA_PATH = '../data/raw/telegram_data.xlsx'
OUTPUT_PATH = '../data/raw/labeled_conll.txt'


In [5]:
# Load 40 messages
messages_df = load_telegram_messages_xlsx(DATA_PATH, n=40)
messages_df.head()

|   | Message |
|---|---------|
| 0 | 💥Anti slip tape\n\n👉ደረጃ ወይም የተለያዩ ቦታዎች እንዳያንሸራ... |
| 1 | 💥Garlic press Chopper\n\n🔰Kitchen ginger garli... |
| 2 | 💥 Color changing Set of 3 Luma Candles\n\n\nBr... |
| 3 | 👉Anti-theft Lightweight Backpack 15.6" |
| 4 | 💥የእንቁላልና የክሬም መምቻ \n\n👉 ባለ ሶስት የኬክ ፓትራና\n\n👉የእ... |


## Labeling Helper
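The loop below tokenizes each message and prints the message with its tokens. Every token is labeled `O` by default; to label a message for real, replace the `labels` list by hand, keeping it the same length as `tokens`.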
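Typing out a full label list for a message with dozens of tokens is error-prone. One possible shortcut, sketched here under the assumption that labels are edited as a plain Python list, is to start from all `O` and tag entity spans by token index. `set_span` is a hypothetical helper, not a function from `labeling_utils.py`.

```python
def set_span(labels, start, end, ent_type):
    """Tag tokens start..end-1 as one entity: B- on the first token, I- on the rest.

    Hypothetical helper; modifies `labels` in place and also returns it.
    """
    if not 0 <= start < end <= len(labels):
        raise ValueError(f"invalid span {start}:{end} for {len(labels)} tokens")
    labels[start] = f"B-{ent_type}"
    for i in range(start + 1, end):
        labels[i] = f"I-{ent_type}"
    return labels


# First five tokens printed for Message 1; the product span is an assumed labeling.
tokens = ["💥Anti", "slip", "tape", "👉ደረጃ", "ወይም"]
labels = ["O"] * len(tokens)
set_span(labels, 0, 3, "Product")

# The list passed on for pairing must have exactly one label per token.
assert len(labels) == len(tokens)
print(list(zip(tokens, labels)))
```

Because the list starts as all `O`, only the entity spans need to be written down, and the length always matches the token list.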

In [6]:
labeled_messages = []
for idx, row in messages_df.iterrows():
    message = row['Message']
    tokens = tokenize_message(message)
    print(f'\nMessage {idx+1}: {message}')
    print('Tokens:', tokens)
    # ---- LABEL HERE ----
    # Example: all O (change manually for real labeling)
    labels = ['O'] * len(tokens)
    # Uncomment and edit the line below to label entities
    # labels = ['B-Product', 'I-Product', 'O', ...]  # length must match tokens
    # ----------------------
    labeled = label_tokens(tokens, labels)
    labeled_messages.append(labeled)



Message 1: 💥Anti slip tape

👉ደረጃ ወይም የተለያዩ ቦታዎች እንዳያንሸራትት የሚለጠፍ ፕላስተር

✨size:-50mm*5meter


          ዋጋ:-500ብር


🏢 አድራሻ  👉 መገናኛ ስሪ ኤም ሲቲ ሞል  ሁለተኛ ፎቅ ቢሮ ቁ. SL-05A(ከ ሊፍቱ ፊት ለ ፊት)

     💧💧💧💧


    📲 0909522840
    📲 0923350054

🔖
💬  በTelegram ለማዘዝ ⤵️ ይጠቀሙ
@shager_onlinestore
  
ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን⤵️
https://t.me/Shageronlinestore
Tokens: ['💥Anti', 'slip', 'tape', '👉ደረጃ', 'ወይም', 'የተለያዩ', 'ቦታዎች', 'እንዳያንሸራትት', 'የሚለጠፍ', 'ፕላስተር', '✨size:-50mm*5meter', 'ዋጋ:-500ብር', '🏢', 'አድራሻ', '👉', 'መገናኛ', 'ስሪ', 'ኤም', 'ሲቲ', 'ሞል', 'ሁለተኛ', 'ፎቅ', 'ቢሮ', 'ቁ.', 'SL-05A(ከ', 'ሊፍቱ', 'ፊት', 'ለ', 'ፊት)', '💧💧💧💧', '📲', '0909522840', '📲', '0923350054', '🔖', '💬', 'በTelegram', 'ለማዘዝ', '⤵️', 'ይጠቀሙ', '@shager_onlinestore', 'ለተጨማሪ', 'ማብራሪያ', 'የቴሌግራም', 'ገፃችን⤵️', 'https://t.me/Shageronlinestore']

Message 2: 💥Garlic press Chopper

🔰Kitchen ginger garlic press.Quick grinding.Saving you time.
🔰Manual extrusion and grinding.
🔰Stainless steel material.Food grade safety material.Durable and easy to clean.
🔰Curved plastic handle.Com

## Save Labeled Data in CoNLL Format

In [7]:
save_conll_format(labeled_messages, OUTPUT_PATH)
print(f'CoNLL labeled data saved to {OUTPUT_PATH}')

CoNLL labeled data saved to ../data/raw/labeled_conll.txt
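The contents of the saved file are not shown above. As a rough sketch of the CoNLL layout, the code below writes one token and its label per line, with a blank line between messages. It assumes each labeled message is a list of `(token, label)` pairs and that a single space separates the columns; the actual return type of `label_tokens` and the separator used by `save_conll_format` may differ. The name `to_conll` and the example labels are illustrative.

```python
def to_conll(labeled_messages, sep=" "):
    """Render labeled messages as CoNLL-style text (hypothetical sketch).

    labeled_messages: list of messages, each a list of (token, label) pairs.
    """
    blocks = []
    for message in labeled_messages:
        lines = [f"{token}{sep}{label}" for token, label in message]
        blocks.append("\n".join(lines))
    # A blank line separates messages; the text ends with a single newline.
    return "\n\n".join(blocks) + "\n"


# Assumed labeling of a few tokens from Message 1.
demo = [
    [("💥Anti", "B-Product"), ("slip", "I-Product"), ("tape", "I-Product")],
    [("አድራሻ", "O"), ("መገናኛ", "B-LOC")],
]
conll_text = to_conll(demo)
print(conll_text)
```

The blank line between blocks is what lets a downstream reader recover message boundaries when loading the file for model training.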
