# 🏷️ Task 2 – NER Labeling & CoNLL Preparation  
📘 Version: 2025-06-24

Manual Named Entity Recognition (NER) labeling of Amharic e-commerce posts. This notebook supports interactive review and annotation of vendor Telegram messages using the CoNLL tagging format. Entities include products, prices, locations, and optional attributes such as quantity or delivery terms.

---

**Challenge:** B5W4 – Amharic E-Commerce Data Extractor  
**Company:** EthioMart (Telegram E-Commerce Aggregator)  
**Author:** Nabil Mohamed  
**Branch:** `task-2-ner-labeling-conll-format`  
**Date:** June 2025  

---

### 📌 This notebook covers:
- Loading cleaned Amharic Telegram messages for annotation
- Guidelines for labeling entities with CoNLL-style BIO tags
- Tokenization and manual tagging interface
- Exporting labeled data to `data/labeled/telegram_messages_labeled.conll`
- Diagnostic preview and tagging consistency checks
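
As a point of reference for the BIO tagging and export steps listed above, the sketch below shows one way to serialize tagged tokens into CoNLL-style text: one `token tag` pair per line, with a blank line between messages. The `to_conll` helper, the label names (`B-PRICE`, `I-PRICE`, `B-LOC`, `I-LOC`), the space separator, and the sample tokens are illustrative assumptions, not the project's confirmed label set or export code.

```python
from typing import List, Tuple

TaggedMessage = List[Tuple[str, str]]  # (token, BIO tag) pairs for one message


def to_conll(messages: List[TaggedMessage]) -> str:
    """Serialize tagged messages: one 'token tag' per line, blank line between messages."""
    blocks = []
    for message in messages:
        lines = [f"{token} {tag}" for token, tag in message]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical example: label names are assumed, not taken from the project
sample = [
    [("ዋጋ", "O"), ("1000", "B-PRICE"), ("ብር", "I-PRICE")],
    [("አዲስ", "B-LOC"), ("አበባ", "I-LOC")],
]
conll_text = to_conll(sample)
print(conll_text)
```

In BIO tagging, `B-` marks the first token of an entity, `I-` marks a continuation of the same entity, and `O` marks tokens outside any entity.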


In [1]:
# ------------------------------------------------------------------------------
# 🛠 Ensure Notebook Runs from Project Root (for src/ imports to work)
# ------------------------------------------------------------------------------

import os
import sys

# If running from /notebooks/, move up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("📂 Changed working directory to project root")

# Add project root to sys.path so `src/` modules can be imported
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✅ Added to sys.path: {project_root}")

# Optional: verify file presence to confirm we're in the right place
expected_path = "data/raw"
print(
    "📁 Output path ready"
    if os.path.exists(expected_path)
    else f"⚠️ Output path not found: {expected_path}"
)

📂 Changed working directory to project root
✅ Added to sys.path: c:\Users\admin\Documents\GIT Repositories\b5w4-amharic-ecommerce-data-extractor-challenge
📁 Output path ready


## 📦 Imports & Environment Setup

This cell loads the core libraries required for token-level NER labeling and CoNLL-format preparation. The imports are grouped by function:

- **Data handling**: `pandas` for managing raw and labeled message tables  
- **Text processing**: `re` for pattern matching and basic token splitting ahead of CoNLL-style tagging  
- **Labeling utilities**: Optional helpers for token navigation and validation  
- **System I/O**: `os` and `pathlib` for safe directory and file operations  

Together these libraries support the annotation workflow, the tagging consistency checks, and the export of properly formatted CoNLL output.
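
The token splitting and tagging consistency mentioned above could look roughly like the sketch below. Both `tokenize` and `find_bio_errors` are hypothetical helpers written for illustration, and the regex is one simple choice among many: it keeps runs of word characters together and emits each punctuation mark as its own token, so a price written as `1,000` would be split into three tokens.

```python
import re
from typing import List

# \w matches Unicode word characters in Python 3, so Ethiopic letters and
# ASCII digits stay together; any other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split a message into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def find_bio_errors(tags: List[str]) -> List[int]:
    """Return indices of I- tags that do not continue an entity of the same type."""
    errors = []
    previous = "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                errors.append(index)
        previous = tag
    return errors


tokens = tokenize("ዋጋ 1000 ብር።")
print(tokens)

# The last tag is flagged: I-LOC cannot follow I-PRICE
print(find_bio_errors(["O", "B-PRICE", "I-PRICE", "I-LOC"]))
```

Running a check like this before export catches the most common manual tagging slip, an `I-` tag with no matching `B-` tag in front of it.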


In [2]:
# ------------------------------------------------------------------------------
# 📦 Core Imports – Labeling, Tokenization, and CoNLL Export
# ------------------------------------------------------------------------------

# Standard Library
import os  # File and path handling
import re  # Regex for tokenization and entity detection
from pathlib import Path  # Cross-platform path safety
import warnings  # Suppress benign warnings

# Core Analysis
import pandas as pd  # Structured data handling

# Optional: tidy up notebook output
warnings.filterwarnings("ignore")

## 📥 Load & Preview Cleaned Telegram Messages (Task 2 Input)

This step loads the cleaned Amharic Telegram e-commerce posts selected for labeling from `data/labeled/candidate_messages_for_labeling.txt` into memory for manual Named Entity Recognition (NER) tagging.

- Reads the candidate file into a DataFrame through the custom `TelegramMessageLoader` class  
- Validates structure: non-empty, expected column layout  
- Outputs summary diagnostics: row and column counts, number of messages  
- Raises explicit errors for missing or malformed files  
- Ensures messages are ready for token-level annotation and CoNLL tagging
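
The checks described above can be sketched with the standard library alone. This is not the implementation of `TelegramMessageLoader`, whose source is not shown in this notebook; the function name `load_candidate_messages` and the one-message-per-line file layout are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def load_candidate_messages(filepath: str) -> List[str]:
    """Read one message per line, skipping blank lines, with explicit errors."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"Message file not found: {path}")

    with path.open(encoding="utf-8") as handle:
        messages = [line.strip() for line in handle if line.strip()]

    if not messages:
        raise ValueError(f"Message file is empty: {path}")
    return messages
```

Reading with an explicit `encoding="utf-8"` matters for Amharic text on Windows, where the default encoding may differ. The returned list could then be wrapped in a single-column DataFrame with `pd.DataFrame({"message": messages})`.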


In [3]:
# ------------------------------------------------------------------------------
# 📦 Load Cleaned Telegram Messages for NER Labeling
# ------------------------------------------------------------------------------

from src.data_loader import TelegramMessageLoader  # Custom loader class

# Define path to pre-cleaned and sorted messages
data_path = "data/labeled/candidate_messages_for_labeling.txt"

# Initialize loader class
loader = TelegramMessageLoader(filepath=data_path)

# Load DataFrame with validation and fallback checks
try:
    df = loader.load()
    print(f"✅ Loaded {len(df):,} messages for labeling.")
except Exception as e:
    print(f"❌ Failed to load candidate Telegram messages: {e}")
    raise  # Stop here: later cells need `df`, so do not continue without it

✅ Telegram messages loaded: 50 rows × 1 columns
✅ Loaded 50 messages for labeling.
