# Task 2: Label a Subset of Dataset in CoNLL Format

In this task, we label a subset of Amharic e-commerce messages for Named Entity Recognition (NER) in CoNLL format. The entity types are products, prices, and locations. The labeled data is saved to a plain-text file for use in NER model training.

## CoNLL Format and Entity Types

- Each token appears on its own line, followed by a space and its entity label.
- Blank lines separate individual messages.

**Entity Types:**
- B-Product, I-Product: Product entities
- B-LOC, I-LOC: Location entities
- B-PRICE, I-PRICE: Price entities
- O: Outside any entity
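
To make the layout concrete, the following is a minimal sketch of a reader for this format. The function name `parse_conll` and the sample text are hypothetical and invented for illustration; the sample is not taken from the dataset, and labeling the currency word as part of the price is an assumed annotation convention. The sketch assumes every non-blank line holds a token and a label separated by whitespace.

```python
def parse_conll(text):
    """Split CoNLL-style text into messages, each a list of (token, label) pairs."""
    messages, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current message.
            if current:
                messages.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line.
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        messages.append(current)
    return messages


# Invented sample: two messages separated by a blank line.
sample = "\n".join([
    "ዋጋ O",
    "1500 B-PRICE",
    "ብር I-PRICE",
    "",
    "አዲስ B-LOC",
    "አበባ I-LOC",
])

parsed = parse_conll(sample)
print(len(parsed))   # number of messages
print(parsed[0][1])  # second (token, label) pair of the first message
```

A reader like this can also be used to load the exported file back and spot-check it before training.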

In [None]:
import pandas as pd

# Load the cleaned messages
df = pd.read_csv('../data/processed/cleaned_messages.csv', encoding='utf-8-sig')
df = df.dropna(subset=['message'])
df = df.head(50)  # Select 50 messages for labeling
df['message'].head()

## Manual Annotation Process

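For each message, split the text into tokens on whitespace and assign each token an entity label. Entity spans are given as token indices: the first token of a span receives the `B-` label for its type (B-Product, B-LOC, or B-PRICE), the remaining tokens of the span receive the matching `I-` label, and every token outside a span is labeled O.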

In [None]:
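def label_message_conll(message, entities):
    """
    Label the whitespace-separated tokens of a message with BIO tags.

    message: str, the message text
    entities: list of tuples (start_idx, end_idx, label), where start_idx and
        end_idx are token indices (end_idx inclusive) and label is one of
        'Product', 'LOC', 'PRICE'
    Returns: list of (token, label) tuples
    """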
    tokens = message.split()
    labels = ['O'] * len(tokens)
    for start, end, ent_type in entities:
        if start < len(tokens):
            labels[start] = f'B-{ent_type}'
            for i in range(start+1, min(end+1, len(tokens))):
                labels[i] = f'I-{ent_type}'
    return list(zip(tokens, labels))

# Example usage for one message. The spans below are placeholders that only
# illustrate the call; replace them with the real entity positions.
example_message = df['message'].iloc[0]
# Token spans (end inclusive): tokens 0-1 as Product, 2-3 as LOC, 4-5 as PRICE
entities = [(0, 1, 'Product'), (2, 3, 'LOC'), (4, 5, 'PRICE')]
labeled = label_message_conll(example_message, entities)
for token, label in labeled:
    print(f"{token} {label}")
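
For semi-automatic labeling, candidate spans can be suggested by a simple rule and then corrected by hand. The sketch below is one possible heuristic, not part of the original workflow: it assumes a price is written as a number immediately followed by the currency word "ብር" (birr). The names `suggest_price_spans`, `NUMBER_RE`, and `CURRENCY_TOKENS` are hypothetical, and the example message is invented. The returned tuples use the same (start_idx, end_idx, label) convention as `label_message_conll`.

```python
import re

# Assumption: a price is a numeric token followed by the currency word.
NUMBER_RE = re.compile(r"^\d[\d,\.]*$")
CURRENCY_TOKENS = {"ብር"}


def suggest_price_spans(message):
    """Return candidate (start_idx, end_idx, 'PRICE') token spans, end inclusive."""
    tokens = message.split()
    spans = []
    # Stop one token early so tokens[i + 1] always exists.
    for i, token in enumerate(tokens[:-1]):
        if NUMBER_RE.match(token) and tokens[i + 1] in CURRENCY_TOKENS:
            spans.append((i, i + 1, "PRICE"))
    return spans


# Invented message for illustration only.
print(suggest_price_spans("ዋጋ 1500 ብር አዲስ አበባ"))  # [(1, 2, 'PRICE')]
```

Suggestions from a rule like this still need manual review, since prices written in other ways (for example, with the currency word before the number) are missed.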

In [None]:
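import os

def export_conll(labeled_messages, filepath):
    """Write labeled messages to filepath, one 'token label' pair per line."""
    # Create the output directory if it does not exist yet
    os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        for message in labeled_messages:
            for token, label in message:
                f.write(f"{token} {label}\n")
            # A blank line separates messages
            f.write("\n")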

# Example: Save a few manually labeled messages
labeled_messages = [
    label_message_conll(df['message'].iloc[0], [(0, 1, 'Product'), (2, 3, 'LOC'), (4, 5, 'PRICE')]),
    # Add more labeled messages here
]
export_conll(labeled_messages, '../data/conll/labeled_subset.conll')
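
Hand-edited labels can end up with an `I-` tag that does not follow a `B-` or `I-` tag of the same type, which breaks the tagging scheme described above. As a minimal sketch of a sanity check before training, the hypothetical helper `find_bio_errors` below reports such positions. It takes one labeled message as a list of (token, label) pairs; the sample data is invented.

```python
def find_bio_errors(labeled_message):
    """Return (index, label) pairs where an I- tag does not continue an entity of the same type."""
    errors = []
    previous = "O"
    for i, (_, label) in enumerate(labeled_message):
        if label.startswith("I-"):
            ent_type = label[2:]
            # An I- tag is valid only directly after B- or I- of the same type.
            if previous not in (f"B-{ent_type}", f"I-{ent_type}"):
                errors.append((i, label))
        previous = label
    return errors


# Invented examples: the second message starts an entity with I- instead of B-.
valid = [("a", "B-Product"), ("b", "I-Product"), ("c", "O")]
invalid = [("a", "O"), ("b", "I-LOC"), ("c", "B-PRICE"), ("d", "I-LOC")]

print(find_bio_errors(valid))    # []
print(find_bio_errors(invalid))  # [(1, 'I-LOC'), (3, 'I-LOC')]
```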

## Summary

- Loaded a subset of messages for annotation.
- Provided a template for manual or semi-automatic labeling in CoNLL format.
- Once the remaining messages are annotated, the exported file can be used for NER model training and evaluation.