# Data Preprocessing

This notebook preprocesses raw Amharic e-commerce messages collected from Telegram channels. The steps are: load the data, clean and normalize the text, tokenize it, and save the result for entity extraction.

## 1. Load Raw Data

We begin by loading the raw messages collected from Telegram channels. The data is stored in JSON and CSV formats.

In [None]:
import pandas as pd
import json

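# Load from JSON (raw_data is kept for reference; the later steps do not use it)
with open('../data/raw/telegram_data.json', encoding='utf-8') as f:
    raw_data = json.load(f)

# Load from CSV; the remaining steps work with this DataFrame
raw_df = pd.read_csv('../data/raw/telegram_data.csv', encoding='utf-8-sig')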

# Display a sample
raw_df.head()
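
The JSON file is loaded into `raw_data`, but the later steps work only with `raw_df`. If the JSON export is the preferred source, it can be turned into a DataFrame as well. The sketch below assumes the file holds a list of objects, each with a `message` field; that layout and the helper name `records_to_frame` are assumptions, not something the raw files are confirmed to follow.

```python
import pandas as pd

def records_to_frame(records, text_column="message"):
    """Build a DataFrame from a list of dicts and check the text column exists.

    Hypothetical helper: assumes each record is a flat dict.
    """
    df = pd.DataFrame.from_records(records)
    if text_column not in df.columns:
        raise KeyError(f"expected a '{text_column}' column, found {list(df.columns)}")
    return df

# Toy records standing in for the contents of telegram_data.json
sample_records = [
    {"id": 1, "message": "ዋጋ 500 ብር"},
    {"id": 2, "message": None},
]
sample_df = records_to_frame(sample_records)
```

Failing early on a missing text column avoids a less obvious `KeyError` later, when `raw_df['message']` is first accessed.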

## 2. Text Normalization and Cleaning

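We normalize the Amharic text by flattening line breaks, replacing the Ethiopic wordspace (፡), the Ethiopic full stop (።), and the ASCII colon with spaces, and collapsing repeated whitespace. Non-string values, such as missing messages, become empty strings.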

In [None]:
import re

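def normalize_amharic_text(text):
    """Flatten line breaks, drop Ethiopic and ASCII colon punctuation, and collapse whitespace."""
    # Non-string values (e.g. NaN from empty cells) become an empty string
    if not isinstance(text, str):
        return ""
    # Put multi-line messages on a single line
    text = text.replace('\n', ' ').replace('\r', ' ')
    # Replace the Ethiopic wordspace, Ethiopic full stop, and ASCII colon with a space
    text = re.sub(r'[፡።:]', ' ', text)
    # Collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()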

raw_df['cleaned_message'] = raw_df['message'].apply(normalize_amharic_text)
raw_df[['message', 'cleaned_message']].head()
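
To make the effect of the cleaning rules concrete, the sketch below repeats the function from the cell above and applies it to a few made-up strings (not taken from the dataset). Because the ASCII colon is replaced along with the Ethiopic marks, colons inside URLs and times are replaced too, which may matter if links or opening hours are needed later.

```python
import re

def normalize_amharic_text(text):
    # Same logic as the notebook cell above, repeated so this example runs on its own
    if not isinstance(text, str):
        return ""
    text = text.replace('\n', ' ').replace('\r', ' ')
    text = re.sub(r'[፡።:]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Made-up samples for illustration only
price_line = "ዋጋ፡ 500 ብር።\nአድራሻ: ቦሌ"
link_line = "https://example.com/item 10:30"

cleaned_price = normalize_amharic_text(price_line)  # 'ዋጋ 500 ብር አድራሻ ቦሌ'
cleaned_link = normalize_amharic_text(link_line)    # 'https //example.com/item 10 30'
missing = normalize_amharic_text(None)              # ''
```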

## 3. Tokenization and Amharic-Specific Processing

We tokenize the cleaned text. For demonstration, we use simple whitespace tokenization, which splits on spaces and applies no Amharic-specific morphological processing.

In [None]:
# Simple whitespace tokenization
def tokenize_amharic(text):
    return text.split()

raw_df['tokens'] = raw_df['cleaned_message'].apply(tokenize_amharic)
raw_df[['cleaned_message', 'tokens']].head()
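
Whitespace splitting does not address any Amharic-specific behaviour on its own. One step often applied to Amharic text is collapsing homophone letter series (for example ሠ and ሰ), so that spelling variants of a word yield the same token. The sketch below is an optional addition, not part of the pipeline in this notebook; the function name `normalize_homophones` and the choice of which series to keep are assumptions.

```python
# Base code points of homophone series in the Unicode Ethiopic block,
# mapped to the series chosen here as canonical (an assumption of this sketch).
HOMOPHONE_SERIES = {
    0x1210: 0x1200,  # ሐ series -> ሀ series
    0x1280: 0x1200,  # ኀ series -> ሀ series
    0x1220: 0x1230,  # ሠ series -> ሰ series
    0x12D0: 0x12A0,  # ዐ series -> አ series
    0x1340: 0x1338,  # ፀ series -> ጸ series
}

# Each series has seven basic orders at consecutive code points.
HOMOPHONE_TABLE = {
    source + order: target + order
    for source, target in HOMOPHONE_SERIES.items()
    for order in range(7)
}

def normalize_homophones(text):
    """Replace homophone letters with their canonical counterparts."""
    return text.translate(HOMOPHONE_TABLE)

def tokenize_amharic(text):
    # Same whitespace tokenizer as in the cell above
    return text.split()

tokens = tokenize_amharic(normalize_homophones("ሠላም ፀሐይ"))
```

Whether to apply this step depends on the downstream task: it changes the surface form of the text, so entity labels would need to be produced on the normalized strings rather than on the original messages.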

## 4. Save Preprocessed Data

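We save the cleaned and tokenized data for use in entity extraction and further analysis. Note that the CSV file stores each `tokens` list as a string, while the JSON file keeps it as an array.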

In [None]:
import os
os.makedirs('../data/processed', exist_ok=True)
raw_df.to_csv('../data/processed/cleaned_messages.csv', index=False, encoding='utf-8-sig')
raw_df.to_json('../data/processed/cleaned_messages.json', orient='records', force_ascii=False)
print('Preprocessed data saved.')
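
When reading the outputs back, the `tokens` column needs some care: CSV has no list type, so each list is written in its string form, whereas the JSON file keeps real arrays. The sketch below reproduces this on a tiny in-memory frame and restores the lists with `ast.literal_eval`; the sample row is made up, and an in-memory buffer stands in for the file on disk.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the processed DataFrame
df = pd.DataFrame({
    "cleaned_message": ["ዋጋ 500 ብር"],
    "tokens": [["ዋጋ", "500", "ብር"]],
})

# Write to an in-memory buffer instead of ../data/processed/
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After the round trip, each tokens cell is a string, not a list
tokens_are_strings = isinstance(loaded.loc[0, "tokens"], str)

# Parse the string form back into Python lists
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
```

Loading the JSON output instead avoids the parsing step, since `tokens` is stored there as an array.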

## 5. Summary

- Loaded raw Telegram e-commerce messages.
- Cleaned and normalized Amharic text.
- Tokenized messages for downstream NLP tasks.
- Saved preprocessed data for further analysis and entity extraction.