## Data Labeling Notebook
This notebook covers the following steps:
- Sample 100 random, non-duplicate messages from the `Message` column of the processed data, convert them to JSON format, and save them in the `data/` directory.
- Load the saved JSON file into Label Studio.
- After labeling in Label Studio, export the data in CoNLL format.
- Align the BIO labels with transformer tokens.

In [1]:
import pandas as pd
import numpy as np
import json

In [2]:
# load preprocessed data
print("loading data...")
data = pd.read_csv('../data/processed/cleaned_normalized_data.csv')
print("loaded data successfully✅")

loading data...
loaded data successfully✅


In [3]:
# Inspect first few rows of the data
data.head()


| Unnamed: 0 | Channel Title | Channel Username | ID | Message | Date | Media Path | hashtags | tokenized_msg |
|---|---|---|---|---|---|---|---|---|
| 0 | Sheger online-store | @Shageronlinestore | 7399 | GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ... | 2025-06-20 15:25:41+00:00 | `../data/raw/photo\@Shageronlinestore_7399.jpg` | ['ዛም_ሞል'] | ['GROOMING', 'SET', 'ሶስት', 'በአንድ', 'የያዘ', 'የፀጉ... |
| 1 | Sheger online-store | @Shageronlinestore | 7395 | GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ... | 2025-06-20 15:25:40+00:00 | `../data/raw/photo\@Shageronlinestore_7395.jpg` | ['ዛም_ሞል'] | ['GROOMING', 'SET', 'ሶስት', 'በአንድ', 'የያዘ', 'የፀጉ... |
| 2 | Sheger online-store | @Shageronlinestore | 7393 | 1L Water Bottle High Quality 1L water time sca... | 2025-06-20 11:47:53+00:00 | `../data/raw/photo\@Shageronlinestore_7393.jpg` | ['ዛም_ሞል'] | ['1L', 'Water', 'Bottle', 'High', 'Quality', '... |
| 3 | Sheger online-store | @Shageronlinestore | 7391 | Sonifer Steam Iron የልብስ መቶከሻ High Quality Cera... | 2025-06-20 09:03:23+00:00 | `../data/raw/photo\@Shageronlinestore_7391.jpg` | ['ዛም_ሞል'] | ['Sonifer', 'Steam', 'Iron', 'የልብስ', 'መቶከሻ', '... |
| 4 | Sheger online-store | @Shageronlinestore | 7390 | Sayona multifunctional juicer and extractor Be... | 2025-06-20 06:48:11+00:00 | `../data/raw/photo\@Shageronlinestore_7390.jpg` | ['ዛም_ሞል'] | ['Sayona', 'multifunctional', 'juicer', 'and',... |


## Save a random sample of 100 messages to a JSONL file
- The sample is written as JSONL (one JSON object per line) as the first step toward annotation in Label Studio.

In [None]:
(data[["Message"]] # Select the 'Message' column not as a series but as a dataframe
 .sample(n=100, random_state=42) # Sample 100 messages randomly
 .reset_index(drop=True) # Reset index for the sampled data
 .rename(columns={"Message": "text"})
 .to_json("../data/processed/telegram_ecommerce_data_sample.jsonl", orient="records", lines=True, force_ascii=False)) # Save the sampled messages to a JSONL file
print("Sampled data saved to 'telegram_ecommerce_data_sample.jsonl' successfully✅")

Sampled data saved to 'telegram_ecommerce_data_sample.jsonl' successfully✅
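
Before importing the sample into Label Studio, it can be worth confirming that every line parses, carries a `text` field, and that no message text is repeated, since `sample` draws distinct rows but does not by itself remove rows whose `Message` text is identical. The following is a minimal sketch: `summarize_jsonl` is a hypothetical helper that is not part of this notebook, and the example lines are made up for illustration.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count records, records without a usable 'text' field, and repeated texts."""
    texts = []
    missing = 0
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        record = json.loads(line)
        text = record.get("text")
        if isinstance(text, str) and text.strip():
            texts.append(text)
        else:
            missing += 1
    counts = Counter(texts)
    # Every occurrence beyond the first counts as one duplicate.
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "records": len(texts) + missing,
        "missing_text": missing,
        "duplicates": duplicates,
    }


# Illustrative in-memory lines (hypothetical data, not taken from the real sample).
example_lines = [
    '{"text": "ዋጋ፦ 2600 ብር"}',
    '{"text": "ዋጋ፦ 2600 ብር"}',
    '{"text": "1L Water Bottle"}',
    '{"text": null}',
]
print(summarize_jsonl(example_lines))
```

Because the helper accepts any iterable of lines, it can also be given an open file handle for `telegram_ecommerce_data_sample.jsonl`. If it reports duplicates, one option is to call `drop_duplicates(subset="Message")` on the DataFrame before sampling.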


We will convert the JSONL file into a regular JSON array file with a `.json` extension so that it can be loaded through Label Studio's GUI.

In [5]:
# Use a separate name so the `data` DataFrame loaded earlier is not overwritten
records = []
with open("../data/processed/telegram_ecommerce_data_sample.jsonl", "r", encoding="utf-8") as infile:
    for line in infile:
        if line.strip():  # make sure not to process empty lines
            records.append(json.loads(line))

# Write all records as a single JSON array, keeping Amharic characters unescaped
with open("../data/processed/telegram_ecommerce_data_sample.json", "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, ensure_ascii=False, indent=2)

print("Converted JSONL to JSON array and saved successfully.")

Converted JSONL to JSON array and saved successfully.


In [6]:
print("CONLL data successfully exported from label-studios in CONLL 2003 format")
with open("../data/processed/telegram_labeled_data.conll", "r", encoding="utf-8") as file:
    lines = file.readlines()
print("".join(lines[:30]))  # Show first few lines

CONLL data successfully exported from label-studios in CONLL 2003 format
-DOCSTART- -X- O
SUN -X- _ B-PRODUCT
5 -X- _ I-PRODUCT
Nail -X- _ I-PRODUCT
Dryer -X- _ I-PRODUCT
: -X- _ O
Infrared -X- _ B-PROD_COMPONENT
intelligent -X- _ I-PROD_COMPONENT
induction -X- _ I-PROD_COMPONENT
( -X- _ O
30 -X- _ O
S -X- _ O
60 -X- _ O
S -X- _ O
90 -X- _ O
S -X- _ O
timing -X- _ O
) -X- _ O
LCD -X- _ B-PROD_COMPONENT
display -X- _ I-PROD_COMPONENT
Bottom -X- _ I-PROD_COMPONENT
cooling -X- _ I-PROD_COMPONENT
hole -X- _ I-PROD_COMPONENT
ዋጋ፦ -X- _ B-PRICE
2600 -X- _ I-PRICE
ብር -X- _ I-PRICE
ውስን -X- _ O
ፍሬ -X- _ O
ነው -X- _ O
ያለው -X- _ O
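
The preview above shows one token per line, with the token in the first field and the BIO tag in the last. A minimal sketch of reading such lines into sentences of `(token, tag)` pairs follows. `parse_conll` is a hypothetical helper, the sketch assumes tokens contain no internal whitespace, and the blank line in the excerpt is inserted only to illustrate sentence splitting.

```python
def parse_conll(lines):
    """Group CoNLL lines into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or document marker closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # The token is the first field and the tag is the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Excerpt in the same layout as the export; the blank line is added for illustration.
sample = """-DOCSTART- -X- O
SUN -X- _ B-PRODUCT
5 -X- _ I-PRODUCT
Nail -X- _ I-PRODUCT
Dryer -X- _ I-PRODUCT
: -X- _ O

ዋጋ፦ -X- _ B-PRICE
2600 -X- _ I-PRICE
ብር -X- _ I-PRICE
""".splitlines()

sentences = parse_conll(sample)
print(len(sentences))
print(sentences[0][:2])
```

The same function can be applied to the `lines` list read from `telegram_labeled_data.conll` in the cell above.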



## CoNLL Format Labeled Data Structure
Each message was labeled using **6 tags**:
- PRODUCT -> the product being sold
- LOC -> a location entity
- PRICE -> the selling price of the product (when available)
- CONTACT -> the seller's contact information
- PROD_COMPONENT -> product specifications, or ingredients for products that have them
- DELIVERY_FEE -> the delivery fee charged by sellers that offer delivery

NOTE: Labeling was done manually in Label Studio for about 100 randomly selected messages from `cleaned_normalized_data`.
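
To review what was annotated, the `B-`/`I-` tags can be merged back into whole entity spans. The sketch below is one simple way to do that, not the notebook's own code: `bio_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a new span is an assumption made here for robustness.

```python
def bio_to_spans(pairs):
    """Merge BIO-tagged (token, tag) pairs into (entity_type, text) spans."""
    spans = []     # each item is [entity_type, [tokens]]
    inside = None  # entity type of the span currently being extended, if any
    for token, tag in pairs:
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and inside == entity:
            spans[-1][1].append(token)  # continue the open span
        elif prefix in ("B", "I"):
            # B- opens a span; a stray I- without a matching B- is treated the same way.
            spans.append([entity, [token]])
            inside = entity
        else:
            inside = None  # an O tag closes any open span
    return [(entity, " ".join(tokens)) for entity, tokens in spans]


# Tokens and tags copied from the exported preview shown earlier.
pairs = [
    ("SUN", "B-PRODUCT"), ("5", "I-PRODUCT"), ("Nail", "I-PRODUCT"), ("Dryer", "I-PRODUCT"),
    (":", "O"),
    ("ዋጋ፦", "B-PRICE"), ("2600", "I-PRICE"), ("ብር", "I-PRICE"),
    ("ውስን", "O"),
]
print(bio_to_spans(pairs))
```

Counting the first element of each span gives a quick per-tag tally, which helps spot tags that were rarely or never used during annotation.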

# 📒 Data Labeling Notebook Summary

This notebook documents the workflow for preparing, sampling, and labeling Amharic e-commerce Telegram messages for Named Entity Recognition (NER) tasks.

---

## 1. **Loading Preprocessed Data**
- Loads cleaned and normalized Telegram message data from `../data/processed/cleaned_normalized_data.csv`.
- Data is inspected to verify structure and content.

---

## 2. **Sampling for Annotation**
- Randomly selects 100 unique messages from the dataset.
- Renames the `Message` column to `text` for annotation compatibility.
- Saves the sample in JSONL format (`telegram_ecommerce_data_sample.jsonl`) for use with Label Studio.

---

## 3. **Format Conversion for Label Studio**
- Converts the JSONL sample to a standard JSON array (`telegram_ecommerce_data_sample.json`) for easier loading into Label Studio's GUI.

---

## 4. **Labeling Workflow**
- The sampled data is imported into Label Studio for manual annotation.
- **NOTE:** Labeling was done manually for about 100 randomly selected messages from the cleaned and normalized data using Label Studio.
- After annotation, the labeled data is exported in CoNLL 2003 format (`telegram_labeled_data.conll`).

---

## 5. **NER Tagging Schema**
The following six entity tags are used for labeling:
- **PRODUCT**: The product being sold.
- **LOC**: Location entities.
- **PRICE**: Selling price (when available).
- **CONTACT**: Seller's contact information.
- **PROD_COMPONENT**: Product specifications or ingredients.
- **DELIVERY_FEE**: Delivery fee information for sellers offering delivery.
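
For model training, the six entity types are typically expanded into BIO labels and mapped to integer ids. The sketch below shows one way to build those mappings. The names `ENTITY_TYPES`, `label2id`, and `id2label` and the ordering of the labels are assumptions for illustration, not something fixed by the notebook.

```python
# The six entity types used during annotation.
ENTITY_TYPES = ["PRODUCT", "LOC", "PRICE", "CONTACT", "PROD_COMPONENT", "DELIVERY_FEE"]

# "O" first, then a B-/I- pair for each entity type (ordering is an arbitrary choice).
labels = ["O"] + [f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))
print(labels[:3])
```

Whatever ordering is chosen, it needs to stay fixed between training and inference so that predicted ids map back to the same tags.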

---

## 6. **Post-Labeling**
- The notebook demonstrates how to read and preview the exported CoNLL data.
- Next steps: Aligning BIO labels with transformer tokens for model training.
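
The alignment step is not implemented in this notebook. As a rough sketch of what it could look like: Hugging Face fast tokenizers expose `word_ids()`, which maps each sub-token back to the index of the word it came from (`None` for special tokens). Given that mapping, one common approach keeps the label on the first sub-token of each word and masks the rest. `align_labels` is a hypothetical helper, the `word_ids` list and label ids below are hand-written rather than produced by a real tokenizer, and `-100` is assumed as the ignore value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def align_labels(word_ids, word_label_ids):
    """Expand word-level label ids to sub-token level using a word_ids mapping."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS] / [SEP]
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)  # later sub-tokens of the same word
        previous = word_id
    return aligned


# Hypothetical example: three words with label ids 5, 6, 6,
# where the first and third words are each split into two sub-tokens.
word_label_ids = [5, 6, 6]
word_ids = [None, 0, 0, 1, 2, 2, None]
print(align_labels(word_ids, word_label_ids))
```

Masking continuation sub-tokens is only one option. Another is to repeat the word's label on every sub-token, which changes how the loss weights long words.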

---

**Result:**  
A reproducible workflow for sampling, annotating, and exporting labeled Amharic e-commerce Telegram messages, ready for downstream NER modeling.