# Task 2 – Generate CoNLL Annotation Template

This notebook samples messages from the **pre-processed** Telegram dataset and writes a template file (`data/ner/ner_template.conll`) for manual Named-Entity-Recognition labelling.

**Instructions for annotators**:
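1. Each token appears on its own line, followed by a tab and the placeholder tag `O`.
2. Replace `O` with one of `B-PRODUCT`, `I-PRODUCT`, `B-LOC`, `I-LOC`, `B-PRICE`, `I-PRICE` where a token belongs to an entity (`B-` marks the first token of an entity, `I-` any following tokens of the same entity); leave `O` on all other tokens.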
3. Sentences (messages) are separated by a blank line.
4. Save the edited file as `ner_labeled.conll` in the same folder.
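
Before a labelled file is used for training, it can help to check that every non-blank line still has the `token<TAB>tag` shape and uses one of the tags listed above. The following is a minimal sketch of such a check, not part of the original notebook; `ALLOWED_TAGS` and `find_invalid_lines` are hypothetical names, and the labels in `example` are purely illustrative.

```python
# Tag set from the annotator instructions, plus the placeholder "O".
ALLOWED_TAGS = {"O", "B-PRODUCT", "I-PRODUCT", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"}


def find_invalid_lines(conll_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that are not 'token<TAB>tag' with an allowed tag."""
    problems = []
    for line_no, line in enumerate(conll_text.splitlines(), start=1):
        if not line.strip():
            continue  # a blank line separates two messages
        parts = line.split("\t")
        if len(parts) != 2 or parts[1] not in ALLOWED_TAGS:
            problems.append((line_no, line))
    return problems


# Illustrative input: the last line uses a tag that is not in the allowed set.
example = "jordan\tB-PRODUCT\n1\tI-PRODUCT\noriginal\tO\n\nSize\tO\n40414243\tB-SIZE\n"
for line_no, line in find_invalid_lines(example):
    print(f"line {line_no}: {line!r}")
```

To check a real file, the text could be read with `Path('../data/ner/ner_labeled.conll').read_text(encoding='utf-8')` and passed to the same function.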


In [1]:
import random
import re
import sys
from pathlib import Path

import pandas as pd

#─ Paths
PRE_DIR = Path('../data/preprocessed')
NER_DIR = Path('../data/ner'); NER_DIR.mkdir(parents=True, exist_ok=True)

#─ Helper – Amharic clean (same as preprocessing)
def clean_amharic_text(text: str) -> str:
    text = re.sub(r'[\r\n]+', ' ', str(text))
    text = re.sub(r'[^\w\s።፥፣፤፦፧፡፠]', '', text, flags=re.UNICODE)
    return re.sub(r'\s+', ' ', text).strip()
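
The effect of the helper is easiest to see on a concrete string. The sketch below repeats the function so that it runs on its own, and applies it to a shortened, illustrative version of one of the raw messages. Note that `#` is stripped without inserting a space, so size lists such as `40#41#42` collapse into a single token.

```python
import re


def clean_amharic_text(text: str) -> str:
    # Same three steps as the helper above: flatten line breaks, drop
    # everything except word characters, whitespace and Ethiopic punctuation,
    # then collapse runs of whitespace.
    text = re.sub(r'[\r\n]+', ' ', str(text))
    text = re.sub(r'[^\w\s።፥፣፤፦፧፡፠]', '', text, flags=re.UNICODE)
    return re.sub(r'\s+', ' ', text).strip()


# "\U0001F4AF" is the hundred-points emoji that appears in the raw messages.
raw = "NIKE SB FC original \U0001F4AF \r\nSize 40#41#42"
cleaned = clean_amharic_text(raw)
print(cleaned)  # NIKE SB FC original Size 404142
```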


In [2]:
#─ Locate latest pre-processed CSV (fall back to any telegram_data_*.csv)
csv_files = (list(PRE_DIR.glob('telegram_data_preprocessed_*.csv'))
             or list(PRE_DIR.glob('telegram_data_*.csv')))
if not csv_files:
    raise FileNotFoundError(f'No pre-processed CSV found in {PRE_DIR}; run the preprocessing notebook first.')
latest = max(csv_files, key=lambda p: p.stat().st_mtime)
print('Using file →', latest)
df = pd.read_csv(latest, encoding='utf-8-sig')
df.head()


Using file → ..\data\preprocessed\telegram_data_preprocessed_20250625_142643.csv


Unnamed: 0,Channel Title,Channel Username,Message ID,Message,Date,Media Path,Clean Text,entities,Views
0,Shewa Brand,https://t.me/@Shewabrand,3714,የተለያዩ ጫማዎች በፍሬ መምረጥ ማስመረጥ ለምትፈልጉ ደንበኞቻችን አዲስ ነ...,2025-06-22 07:20:07+00:00,photos\@Shewabrand_3714.jpg,የተለያዩ ጫማዎች በፍሬ መምረጥ ማስመረጥ ለምትፈልጉ ደንበኞቻችን አዲስ ነ...,"[{'word': '▁የተለያዩ', 'entity': 'B-Product', 'sc...",1093
1,Shewa Brand,https://t.me/@Shewabrand,3713,NIKE SB FC original 💯 \r\nSize 40#41#42#43#44#...,2025-06-21 09:28:21+00:00,photos\@Shewabrand_3713.jpg,NIKE SB FC original Size 404142434445 MADE IN ...,"[{'word': '▁N', 'entity': 'B-Product', 'score'...",2259
2,Shewa Brand,https://t.me/@Shewabrand,3712,ORIGINAL COTTON TUTA💯 original \r\nSize L#XL#2...,2025-06-21 05:05:45+00:00,photos\@Shewabrand_3712.jpg,ORIGINAL COTTON TUTA original Size LXL2XL3XL4X...,"[{'word': '▁', 'entity': 'B-Product', 'score':...",947
3,Shewa Brand,https://t.me/@Shewabrand,3711,ZARA CLUB COTTON TISHERTS 💯 original \r\nSize ...,2025-06-20 07:57:43+00:00,photos\@Shewabrand_3711.jpg,ZARA CLUB COTTON TISHERTS original Size MLXLXX...,"[{'word': '▁Z', 'entity': 'B-Product', 'score'...",15551
4,Shewa Brand,https://t.me/@Shewabrand,3710,jordan 1 original 💯 \r\nSize 40#41#42#43\r\nMA...,2025-06-20 06:15:40+00:00,photos\@Shewabrand_3710.jpg,jordan 1 original Size 40414243 MADE IN VIETNA...,"[{'word': '▁jord', 'entity': 'B-Product', 'sco...",7584


In [3]:
#─ Sampling messages
SAMPLE_SIZE = 50   # adjust if needed
messages = df['Clean Text'] if 'Clean Text' in df.columns else df['Message']
messages = messages.dropna().tolist()
sample = random.sample(messages, k=min(SAMPLE_SIZE, len(messages)))
print(f'Sampled {len(sample)} messages.')


Sampled 50 messages.
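
Because the sampling cell does not fix a seed, each run draws a different set of messages, and re-running the notebook after annotation has started would produce a template that no longer matches the labelled file. One way to make the draw repeatable is a dedicated seeded generator. This is a sketch under that assumption; `SEED` and `sample_messages` are hypothetical names that do not appear in the notebook.

```python
import random

SEED = 42  # hypothetical value; any fixed integer gives a repeatable draw


def sample_messages(messages, sample_size, seed=SEED):
    """Draw up to `sample_size` messages, identically on every call with the same seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(messages, k=min(sample_size, len(messages)))


# Toy data standing in for the list built from df['Clean Text'].
demo_messages = [f"message {i}" for i in range(10)]
first = sample_messages(demo_messages, 3)
second = sample_messages(demo_messages, 3)
print(first == second)  # True: same seed, same sample
```

A local `random.Random` instance is used instead of `random.seed(...)` so that other code relying on the global generator is not affected.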


In [5]:
#─ Write CoNLL template
out_path = NER_DIR / 'ner_template.conll'
with out_path.open('w', encoding='utf-8') as f:
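    for msg in sample:
        tokens = str(msg).split()  # split() already drops empty tokens
        if not tokens:
            continue  # skip whitespace-only messages
        for tok in tokens:
            f.write(f'{tok}\tO\n')  # explicit tab between token and tag
        f.write('\n')  # blank line separates messages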

print('Template written →', out_path)


Template written → ..\data\ner\ner_template.conll
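
Once annotators have saved `ner_labeled.conll`, the file has to be read back into per-message token/tag sequences. The following is a minimal sketch of such a reader, assuming the tab-separated layout written above; `read_conll` is a hypothetical helper, not something the notebook defines, and the labels in `demo_lines` are illustrative.

```python
def read_conll(lines):
    """Group 'token<TAB>tag' lines into sentences; blank lines end a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:  # ignore repeated blank lines
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


demo_lines = ["jordan\tB-PRODUCT\n", "1\tI-PRODUCT\n", "\n", "original\tO\n"]
sentences = read_conll(demo_lines)
print(len(sentences))  # 2
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open('../data/ner/ner_labeled.conll', encoding='utf-8') as f: sentences = read_conll(f)`.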
