
# Named Entity Recognition (NER) Labeling
## Task 2: CoNLL Format Annotation
 
This notebook helps you label Amharic text data in CoNLL format for NER tasks.
 
### Steps:
1. Run the preprocessing pipeline to generate a labeling sample
2. Execute the CLI annotator tool
3. Label each token following the instructions
4. Save the labeled CoNLL file
 
**Entity Types:**
- `B-PRODUCT`, `I-PRODUCT`: Product names
- `B-LOC`, `I-LOC`: Location names
- `B-PRICE`, `I-PRICE`: Prices and currencies
- `O`: Non-entity tokens
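
The tags above follow the BIO scheme: `B-` marks the first token of an entity, `I-` marks each following token of the same entity, and `O` marks everything else. As a minimal sketch of what a labeled message might look like on disk, the snippet below serializes one message to CoNLL text. The helper `to_conll` is hypothetical, the example message is illustrative rather than taken from the dataset, and the space separator and blank line between messages are assumptions about the file layout, not confirmed behavior of the annotator tool.

```python
def to_conll(sentences):
    """Serialize [(tokens, labels), ...] to CoNLL text.

    Assumed layout: one 'token label' pair per line, blank line between messages.
    """
    blocks = []
    for tokens, labels in sentences:
        if len(tokens) != len(labels):
            raise ValueError("each token needs exactly one label")
        blocks.append("\n".join(f"{tok} {lab}" for tok, lab in zip(tokens, labels)))
    return "\n\n".join(blocks) + "\n"


# Illustrative message (not from the dataset): "price 1000 birr Addis Ababa"
tokens = ["ዋጋ", "1000", "ብር", "አዲስ", "አበባ"]
labels = ["O", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC"]

conll_text = to_conll([(tokens, labels)])
print(conll_text)
```

Multi-token entities such as a two-word place name get one `B-` tag followed by `I-` tags, which is what lets a model recover entity boundaries later.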

In [None]:
# Install required packages
!pip install pandas tqdm

# %%
import pandas as pd
import os
from scripts.conll_annotator import CoNLLAnnotator


# ## Step 1: Generate Labeling Sample

SAMPLE_PATH = 'ner_labeling_sample.csv'

if not os.path.exists(SAMPLE_PATH):
    print("Sample file not found! Generating sample...")
    
    # Check if processed data exists
    if not os.path.exists('structured_data/content.csv'):
        print("Processed data not found! Please run preprocessing first.")
        print("Refer to the preprocessing notebook to generate this data.")
    else:
        # Load processed content
        content = pd.read_csv('structured_data/content.csv')
        
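        # Convert tokens from their string representation back to lists.
        # literal_eval only parses Python literals; eval would execute arbitrary code from the CSV.
        from ast import literal_eval
        content['tokens'] = content['tokens'].apply(literal_eval)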
        
        # Select up to 50 messages; the fixed seed keeps the sample reproducible
        sample_size = min(50, len(content))
        labeling_sample = content.sample(sample_size, random_state=42)[['message_id', 'cleaned_text', 'tokens']]
        labeling_sample.to_csv(SAMPLE_PATH, index=False)
        print(f"Generated labeling sample with {sample_size} messages: {SAMPLE_PATH}")
else:
    print(f"Labeling sample found: {SAMPLE_PATH}")
    print(f"Messages available: {len(pd.read_csv(SAMPLE_PATH))}")
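
The `tokens` column holds Python lists, but a CSV file can only store them as text, so they come back from `pd.read_csv` as strings and have to be parsed on load. The sketch below shows that round trip on a hypothetical two-row frame standing in for `structured_data/content.csv`; the rows are made up for illustration, and `ast.literal_eval` is used because it parses literals only.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame standing in for structured_data/content.csv.
df = pd.DataFrame({
    "message_id": [1, 2],
    "tokens": [["ዋጋ", "1000", "ብር"], ["አዲስ", "አበባ"]],
})

# Write to an in-memory buffer instead of disk; lists are stored as their string form.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded["tokens"][0]))  # the column now holds strings, not lists

# Parse each string back into a list.
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
print(type(loaded["tokens"][0]))
```

The same parsing step applies whenever `ner_labeling_sample.csv` is reloaded, since its `tokens` column is written the same way.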


In [None]:
# ## Step 2: Run Annotation Tool

# Initialize and run annotator
if os.path.exists(SAMPLE_PATH):
    annotator = CoNLLAnnotator(SAMPLE_PATH)
    annotator.start_cli_labeling()
else:
    print("Cannot start annotator: Sample file missing")

In [None]:
# ## Step 3: Verify Labeled Data
!head -n 10 labeled_data.conll
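
Looking at the first ten lines catches obvious problems, but a programmatic check can cover the whole file. The following is a minimal sketch of such a check, assuming each non-blank line is `token label` separated by whitespace and that blank lines separate messages; `validate_conll` is a hypothetical helper, not part of `scripts.conll_annotator`. It flags labels outside the entity types listed at the top of this notebook and `I-` tags that do not continue an entity of the same type.

```python
ALLOWED = {"O", "B-PRODUCT", "I-PRODUCT", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"}


def validate_conll(text):
    """Return a list of (line_number, message) problems found in CoNLL text."""
    problems = []
    prev = "O"  # label of the previous token; reset at message boundaries
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            prev = "O"
            continue
        parts = line.split()
        if len(parts) != 2:
            problems.append((lineno, "expected 'token label'"))
            prev = "O"
            continue
        label = parts[1]
        if label not in ALLOWED:
            problems.append((lineno, f"unknown label {label!r}"))
            prev = "O"
            continue
        # An I- tag is valid only right after a B- or I- tag of the same entity type.
        if label.startswith("I-") and prev[2:] != label[2:]:
            problems.append((lineno, f"{label} does not continue an entity of the same type"))
        prev = label
    return problems


# Illustrative input: the second message starts with I-LOC, which has no preceding B-LOC.
sample = "ዋጋ O\n1000 B-PRICE\nብር I-PRICE\n\nአበባ I-LOC\n"
problems = validate_conll(sample)
print(problems)
```

To check the real file, read it with `open('labeled_data.conll', encoding='utf-8').read()` and pass the text to `validate_conll`; an empty list means no problems were found under these assumptions.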


In [None]:
# ## Step 4: Proceed to Model Training
# 
# Your labeled data is now ready! You can use this CoNLL file for:
# - Fine-tuning NER models (Task 3)
# - Model comparison (Task 4)
# 
# File saved as: `labeled_data.conll`