## Introduction
This notebook focuses on preprocessing the raw Telegram e-commerce data collected for the **Ethio Ecom NER Analytics** project. The goal is to transform the scraped messages into a clean, tokenized format suitable for training an Amharic Named Entity Recognition (NER) model. The process addresses Amharic-specific linguistic features, handles missing data, and prepares the dataset for downstream tasks like entity extraction and vendor analytics.

---

### Scripts and Imports

In [1]:
import pandas as pd
import sys
import os
sys.path.append(os.path.abspath(".."))
from scripts.preprocess_messages import clean_amharic_message, tokenize_amharic_text

### Reading Data and Initial Inspection
The raw data, scraped from Telegram channels, is loaded from `../data/raw/telegram_data.csv` with UTF-8 encoding to support Amharic characters.

In [None]:
# Read the CSV as UTF-8 so that Amharic text is decoded correctly
df = pd.read_csv("../data/raw/telegram_data.csv", encoding='utf-8')

# Initial inspection of the DataFrame
print("Initial DataFrame shape:")
print(df.shape)
print(df.head())

Initial DataFrame shape:
(39261, 6)
  Channel Title         Channel Username    ID  \
0   EthioBrand®  @ethio_brand_collection  6117   
1   EthioBrand®  @ethio_brand_collection  6116   
2   EthioBrand®  @ethio_brand_collection  6115   
3   EthioBrand®  @ethio_brand_collection  6114   
4   EthioBrand®  @ethio_brand_collection  6113   

                                             Message  \
0  ‼️ እሁድ ሁሌም ክፍት ነን ‼️\n\nReebok Club Vintage   ...   
1  Skechers archfit \nsize 40,41,42,43\nPrice 340...   
2  ‼️ እሁድ ሁሌም ክፍት ነን ‼️\n\nNB 04 leather  \nSize ...   
3                                                NaN   
4  Nike Air Force Paisley \nSize 40,41,42,43,44\n...   

                        Date  Media Path  
0  2025-06-22 06:27:39+00:00         NaN  
1  2025-06-16 09:01:34+00:00         NaN  
2  2025-06-15 09:20:06+00:00         NaN  
3  2025-06-15 09:16:19+00:00         NaN  
4  2025-06-14 09:04:17+00:00         NaN  


**Insights**
- **Shape (39261, 6):** Indicates 39,261 messages with 6 columns (`Channel Title`, `Channel Username`, `ID`, `Message`, `Date`, `Media Path`).
- **Missing Data:** Row 3 has `NaN` in `Message`, so some posts carry no text and must be filtered out before tokenization.
- **Amharic Content:** Messages like "እሁድ ሁሌም ክፍት ነን" (meaning "We are always open on Sunday") confirm Amharic usage.
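
`head()` only shows the first five rows, so it helps to count how many messages are unusable before filtering. The snippet below is a minimal sketch on a toy DataFrame (the values are illustrative, not taken from the dataset); the same two expressions could be run on `df`.

```python
import pandas as pd

# Toy frame mirroring the 'Message' column; the values are illustrative only.
toy = pd.DataFrame(
    {"Message": ["Nike Air Force", None, "   ", "እሁድ ሁሌም ክፍት ነን"]}
)

# Messages that are missing entirely (NaN/None).
n_missing = int(toy["Message"].isna().sum())

# Whitespace-only strings are not NaN, so they are counted separately.
n_blank = int(toy["Message"].dropna().str.strip().eq("").sum())

print(f"missing: {n_missing}, blank: {n_blank}")
```

Counting the two cases separately shows whether the scraper mostly produced text-free posts or posts containing only whitespace.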

### Preprocessing and Tokenization
#### Step-by-Step Processing

1. **Drop Rows with Missing or Empty Messages**
   - Removes `NaN` values and empty strings to ensure data quality.
2. **Flatten Internal Newlines**
   - Replaces newlines (`\n`) with spaces and removes carriage returns (`\r`), so each message becomes a single line.
3. **Strip Leading/Trailing Spaces**
   - Cleans up extra whitespace.
4. **Drop 'Media Path' Column**
   - Removes the irrelevant `Media Path` column since images were not downloaded.
5. **Reset Index**
   - Reindexes the DataFrame after filtering.
6. **Clean Messages**
   - Applies `clean_amharic_message` to remove unwanted characters (e.g., emojis) while preserving Amharic script, English text, numbers, and basic punctuation.
7. **Tokenize Messages**
   - Uses `tokenize_amharic_text` to split cleaned text into word tokens.
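
The implementations of `clean_amharic_message` and `tokenize_amharic_text` live in `scripts/preprocess_messages.py` and are not shown in this notebook. The sketch below is a hypothetical approximation of steps 6 and 7, assuming a character whitelist (Ethiopic block U+1200 to U+137F, ASCII letters and digits, whitespace, a few punctuation marks) and whitespace tokenization. The names `clean_message_sketch` and `tokenize_sketch` and the exact character set are assumptions, and the project's real helpers may behave differently.

```python
import re
from typing import List

# Hypothetical stand-ins for the project's helpers; the real functions in
# scripts/preprocess_messages.py may use different rules.
# Anything outside this whitelist (for example emojis) is treated as noise.
_DISALLOWED = re.compile(r"[^\u1200-\u137FA-Za-z0-9\s.,!?:/@\-]")


def clean_message_sketch(text: str) -> str:
    """Replace disallowed characters with spaces, then collapse whitespace."""
    cleaned = _DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def tokenize_sketch(text: str) -> List[str]:
    """Split cleaned text on whitespace."""
    return text.split()


# "\u203c\ufe0f" is the double exclamation mark emoji seen in the raw data.
sample = "\u203c\ufe0f እሁድ ሁሌም ክፍት ነን \u203c\ufe0f Skechers archfit size 40,41,42,43"
tokens = tokenize_sketch(clean_message_sketch(sample))
print(tokens)
```

Whitespace splitting keeps a size list such as `40,41,42,43` as a single token, which is consistent with the tokenized output shown later in this notebook.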

In [3]:
# 1. Drop rows with missing or empty messages
df = df[df['Message'].notna()]  # Remove NaN
# Remove empty or whitespace-only strings; .copy() avoids SettingWithCopyWarning on later assignments
df = df[df['Message'].str.strip() != ''].copy()

# 2. Flatten internal newlines
df['Message'] = df['Message'].str.replace('\n', ' ', regex=False).str.replace('\r', '', regex=False)

# 3. Strip leading/trailing spaces
df['Message'] = df['Message'].str.strip()

# 4. Drop 'Media Path' column
if 'Media Path' in df.columns:
    df = df.drop(columns=['Media Path'])

# 5. Reset index
df = df.reset_index(drop=True)

# 6. Clean messages using the provided function
df['Message'] = df['Message'].apply(clean_amharic_message)

# 7. Tokenize messages
df['Message'] = df['Message'].apply(tokenize_amharic_text)

### Output Inspection 

In [4]:
# Inspection after preprocessing
print("After preprocessing:")
print(df.shape)
print(df.head())

After preprocessing:
(21689, 5)
  Channel Title         Channel Username    ID  \
0   EthioBrand®  @ethio_brand_collection  6117   
1   EthioBrand®  @ethio_brand_collection  6116   
2   EthioBrand®  @ethio_brand_collection  6115   
3   EthioBrand®  @ethio_brand_collection  6113   
4   EthioBrand®  @ethio_brand_collection  6112   

                                             Message  \
0  [እሁድ, ሁሌም, ክፍት, ነን, Reebok, Club, Vintage, siz...   
1  [Skechers, archfit, size, 40,41,42,43, Price, ...   
2  [እሁድ, ሁሌም, ክፍት, ነን, NB, 04, leather, Size, 39,...   
3  [Nike, Air, Force, Paisley, Size, 40,41,42,43,...   
4  [Skechers, GY, ULTRA, Size, 40,41,42,43,44, Pr...   

                        Date  
0  2025-06-22 06:27:39+00:00  
1  2025-06-16 09:01:34+00:00  
2  2025-06-15 09:20:06+00:00  
3  2025-06-14 09:04:17+00:00  
4  2025-06-14 06:40:06+00:00  


**Insights**
- **Shape (21689, 5):** 17,572 rows (39,261 - 21,689) with missing or empty messages were removed, and dropping `Media Path` leaves 5 columns.
- **Tokenized Output:** Messages are now lists of tokens (e.g., `[እሁድ, ሁሌም, ክፍት, ነን]`), ready for NER labeling in CoNLL format.
- **Data Integrity:** Amharic tokens are preserved, ensuring compatibility with the NER model.

### Saving Preprocessed Data
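The processed dataset is saved to `../data/processed/preprocessed_messages.csv` with UTF-8-SIG encoding. Plain UTF-8 already represents Amharic characters; the `-sig` variant additionally writes a byte order mark (BOM) so that spreadsheet tools such as Excel detect the encoding and display the text correctly.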

In [6]:
# Save to CSV in ../data/processed/
output_path = "../data/processed/preprocessed_messages.csv"
df.to_csv(output_path, index=False, encoding='utf-8-sig')  # utf-8-sig adds a BOM so spreadsheet tools detect UTF-8 (e.g., for እሁድ)

print(f"Preprocessed data saved to: {output_path}")

Preprocessed data saved to: ../data/processed/preprocessed_messages.csv
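
One caveat when reloading this file: `Message` holds Python lists, and `to_csv` writes each list as its string representation. The sketch below uses a toy DataFrame and an in-memory buffer (both illustrative, not part of the project) to show the round trip and one way to restore the lists with `ast.literal_eval`. When reading the actual file, `encoding='utf-8-sig'` would also be passed to `pd.read_csv`.

```python
import ast
import io

import pandas as pd

# Toy token lists standing in for the preprocessed 'Message' column.
toy = pd.DataFrame({"Message": [["እሁድ", "ሁሌም"], ["Nike", "Air", "Force"]]})

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Without a converter, each list comes back as a plain string.
raw = pd.read_csv(io.StringIO(csv_text))

# ast.literal_eval parses the string representation back into a list.
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={"Message": ast.literal_eval},
)

print(type(raw["Message"][0]).__name__)
print(restored["Message"][1])
```

If the tokens need to be consumed by non-Python tools, joining them with spaces before saving or writing JSON Lines would be alternatives worth considering.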


---

### Next Steps
1. **Labeling:** Convert tokenized data into CoNLL format for NER training.
2. **Analysis:** Explore token frequency for initial entity insights.
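
Both next steps can be prototyped on the token lists directly. The following is a minimal sketch, assuming one common CoNLL-style layout (one `token label` pair per line, a blank line between messages) and a placeholder label `O` for every token until real annotations exist. The helper name `to_conll` and the sample messages are hypothetical.

```python
from collections import Counter
from typing import List


def to_conll(token_lists: List[List[str]], default_label: str = "O") -> str:
    """Emit one 'token label' line per token, with a blank line between messages."""
    blocks = []
    for tokens in token_lists:
        blocks.append("\n".join(f"{tok} {default_label}" for tok in tokens))
    return "\n\n".join(blocks) + "\n"


# Illustrative token lists in the same shape as the preprocessed 'Message' column.
messages = [
    ["Skechers", "archfit", "size", "40,41,42,43"],
    ["Nike", "Air", "Force", "size", "40"],
]

conll_text = to_conll(messages)

# Token frequencies across all messages, a first hint at recurring entity cues.
freq = Counter(tok for tokens in messages for tok in tokens)

print(conll_text)
print(freq["size"])
```

Frequent tokens such as `size` or `Price` tend to sit next to product and price entities, which can guide the first round of manual labeling.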

This notebook lays the groundwork for creating a high-quality dataset, critical for the success of the NER model and vendor analytics pipeline.