# Data Preprocessing for EthioMart Amharic NER System

This notebook preprocesses the raw Telegram messages collected for the EthioMart Amharic Named Entity Recognition (NER) project. It loads the raw data, cleans and normalizes the message text, and saves the result for the labeling step.


In [1]:
# Import necessary libraries
import pandas as pd
import re

# Load the raw data
df = pd.read_csv('../data/raw/telegram_messages.csv')

# Display the first few rows of the raw data
df.head()

|  | sender | timestamp | text |
|---|---|---|---|
| 0 | -1001200000000.0 | 2024-09-28 07:32:14+00:00 | 💥 Smart Mini Massager Patch \r\n 💯High ... |
| 1 | -1001200000000.0 | 2024-09-28 07:23:17+00:00 | 💥 Smart Mini Massager Patch \r\n 💯High ... |
| 2 | -1001200000000.0 | 2024-09-27 07:12:34+00:00 | 💥ለመላዉ የክርስትና እምንነት ተከታይ ደንበኞቻችን በሙሉ እንኳን ለብርሃነ... |
| 3 | -1001200000000.0 | 2024-09-26 09:20:40+00:00 | 💥SOKANY 3 in1 Blender /Grinder\r\n\r\nየጁስ የቡና ... |
| 4 | -1001200000000.0 | 2024-09-25 16:09:48+00:00 | #አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል \r\n📣 IMPULSE SEALER... |


## Data Cleaning Function

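The following function cleans and normalizes the message text: it removes emoji and punctuation, collapses runs of whitespace (including the `\r\n` line breaks visible in the raw messages) into single spaces, and trims the ends.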


In [2]:
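# Define a function to clean and normalize text
def preprocess_text(text):
    # Delete every character that is neither a word character nor whitespace.
    # In Python 3, \w is Unicode-aware, so Amharic letters, digits, and
    # underscores are kept, while emoji and punctuation are removed.
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of whitespace (including \r\n) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing whitespace
    return text.strip()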
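
Because `preprocess_text` deletes punctuation instead of replacing it, two words separated only by a punctuation mark end up joined into one token, which may matter when the text is later labeled token by token. The sketch below shows one possible variant that substitutes a space instead. The name `preprocess_text_spaced` and the sample strings are illustrative assumptions, not part of the project code.

```python
import re

def preprocess_text_spaced(text):
    """Hypothetical variant of preprocess_text: punctuation becomes a space."""
    # Replace each run of non-word, non-whitespace characters with one space
    # so that neighbouring words are not merged
    text = re.sub(r'[^\w\s]+', ' ', text)
    # Collapse runs of whitespace (including \r\n) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing whitespace
    return text.strip()

# Made-up sample in the style of the raw messages
sample = '💥SOKANY 3 in1 Blender /Grinder\r\n\r\nዋጋ፦ 1500 ብር'
print(preprocess_text_spaced(sample))
```

With this variant, a string such as `'አንገት፣ጀርባ'` yields two tokens rather than one. Whether that is preferable depends on how the labeling step tokenizes the text.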


## Applying Preprocessing

Now we will apply the `preprocess_text` function to the 'text' column of our DataFrame and create a new column for the cleaned text.


In [3]:
# Apply preprocessing to the 'text' column
df['cleaned_text'] = df['text'].apply(preprocess_text)

# Display the first few rows of the cleaned data
df[['text', 'cleaned_text']].head()


|  | text | cleaned_text |
|---|---|---|
| 0 | 💥 Smart Mini Massager Patch \r\n 💯High ... | Smart Mini Massager Patch High Quality አንገትጀርባ... |
| 1 | 💥 Smart Mini Massager Patch \r\n 💯High ... | Smart Mini Massager Patch High Quality አንገትጀርባ... |
| 2 | 💥ለመላዉ የክርስትና እምንነት ተከታይ ደንበኞቻችን በሙሉ እንኳን ለብርሃነ... | ለመላዉ የክርስትና እምንነት ተከታይ ደንበኞቻችን በሙሉ እንኳን ለብርሃነ ... |
| 3 | 💥SOKANY 3 in1 Blender /Grinder\r\n\r\nየጁስ የቡና ... | SOKANY 3 in1 Blender Grinder የጁስ የቡና የቅመም መፍጫ ... |
| 4 | #አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል \r\n📣 IMPULSE SEALER... | አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል IMPULSE SEALER የላስቲክ ... |
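
`preprocess_text` passes its argument straight to `re.sub`, so a missing value in the `text` column (for example a media-only message stored as an empty cell) would raise a `TypeError`. The sketch below shows one way to guard the `apply` step, assuming missing messages should simply be dropped. The toy DataFrame is made up for illustration and stands in for the real data.

```python
import re
import pandas as pd

def preprocess_text(text):
    # Copied from the cleaning cell above so this sketch runs on its own
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Toy frame standing in for the Telegram data; the rows are invented
toy = pd.DataFrame({'text': ['💥 Smart  Mini\r\nPatch', None, '📣 IMPULSE SEALER!']})

# Turn missing messages into empty strings so re.sub always receives a str
toy['cleaned_text'] = toy['text'].fillna('').apply(preprocess_text)

# Keep only the rows that still contain text after cleaning
toy = toy[toy['cleaned_text'] != ''].reset_index(drop=True)
print(toy['cleaned_text'].tolist())
```

Dropping empty rows is only one option. Keeping them with an empty `cleaned_text` preserves the row alignment with the raw file if that is needed later.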


## Saving Preprocessed Data

The cleaned data will be saved for further processing in the labeling step.


In [4]:
# Save the cleaned data to a new CSV file
df.to_csv('../data/processed/preprocessed_telegram_messages.csv', index=False)

print("Preprocessed data saved to data/processed/preprocessed_telegram_messages.csv")


Preprocessed data saved to data/processed/preprocessed_telegram_messages.csv
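
`DataFrame.to_csv` does not create missing directories, so the save step fails if the output folder does not exist yet. The following sketch shows one way to make the step more defensive and to confirm that the Amharic text survives a round trip. It writes to a temporary directory as a stand-in for `../data/processed`, and the one-row DataFrame is an invented example.

```python
import os
import tempfile
import pandas as pd

# Temporary folder used here in place of ../data/processed (assumption for the demo)
out_dir = os.path.join(tempfile.mkdtemp(), 'processed')
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
out_path = os.path.join(out_dir, 'preprocessed_telegram_messages.csv')

# Invented one-row frame with the same two text columns as the notebook
toy = pd.DataFrame({'text': ['💥 የቡና መፍጫ'], 'cleaned_text': ['የቡና መፍጫ']})

# UTF-8 is already the pandas default; it is spelled out here for clarity
toy.to_csv(out_path, index=False, encoding='utf-8')

# Read the file back and check that the cleaned text is unchanged
reloaded = pd.read_csv(out_path, encoding='utf-8')
round_trip_ok = reloaded['cleaned_text'].tolist() == toy['cleaned_text'].tolist()
print(round_trip_ok)
```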
