### Text Processing: Handling Amharic Text, Tokenization, and Preprocessing Techniques

Before the scraped Amharic text can be used for downstream tasks, it has to be tokenized, normalized, and adjusted for Amharic-specific linguistic features. The preprocessing steps below are tailored to the language.

**Steps to Preprocess Amharic Text**

- **Tokenization**: Splitting text into individual units such as words or subwords. Amharic is written in the Ge'ez (Ethiopic) script and has its own punctuation marks, so a general-purpose tokenizer may need adjusting.
    - Use a library that supports Amharic text, or write a custom rule-based tokenizer.

- **Normalization**: Cleaning the text and converting it to a standard format:
    - Remove special characters and punctuation; remove numbers only when the task does not need them (the preprocessed output in this notebook keeps prices and phone numbers).
    - Normalize similar-looking or interchangeable characters (for example, ሠ and ሰ, which are pronounced the same in Amharic).
    - Convert text to a standard form (for example, removing diacritics if necessary).

- **Handling Amharic-Specific Features:**
    - Amharic, like other Semitic languages, has root-and-pattern morphology.
    - Words have orthographic variants and carry prefixes, suffixes, and infixes that need to be accounted for.
    - Identifying verb conjugations, plural forms, and possessives can improve tokenization.
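The normalization and tokenization steps above can be sketched as follows. This is a minimal rule-based illustration, not the implementation behind `AmharicTextPreprocessor`; the names `normalize` and `tokenize`, the small homophone table, and the decision to keep digits are all assumptions made for the example.

```python
import re
import unicodedata


def _series_map(variant: str, canonical: str, orders: int = 7) -> dict:
    """Map the first `orders` forms of a variant letter series to the canonical series."""
    return {ord(variant) + i: chr(ord(canonical) + i) for i in range(orders)}


# Illustrative subset of homophone series (an assumption, not a complete scheme).
HOMOPHONES: dict = {}
for variant, canonical in [("ሐ", "ሀ"), ("ኀ", "ሀ"), ("ሠ", "ሰ"), ("ዐ", "አ"), ("ፀ", "ጸ")]:
    HOMOPHONES.update(_series_map(variant, canonical))

# Anything that is not an Ethiopic letter (U+1200-U+135A), an ASCII digit,
# or whitespace is treated as noise: emoji, Latin text, and punctuation such as "።".
NOISE = re.compile(r"[^\u1200-\u135A0-9\s]+")


def normalize(text: str) -> str:
    """Unify homophone letters, strip noise, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(HOMOPHONES)
    text = NOISE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Whitespace tokenizer applied to the normalized text."""
    return normalize(text).split()


print(tokenize("ሰላም እንዴት ነህ? እንኳን ደህና መጣህ።"))
```

Digits are kept here because prices and phone numbers carry information in e-commerce messages; removing `0-9` from the `NOISE` pattern would drop them as well. The sketch does not attempt any morphological analysis.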

In [1]:

# Import standard-library and third-party libraries
import logging
import os
import sys
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import font_manager

# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

# Import the project's preprocessor and NER labeler classes
from amharic_text_processor import AmharicTextPreprocessor  # type: ignore
from amharic_labeler import AmharicNERLabeler  # type: ignore

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2025-01-17 18:45:41,244 - INFO - Imported libraries and configured logging.


**Load the scraped Telegram data**

In [2]:
# Read the data
data = pd.read_csv('../data/telegram_data.csv')
# Explore the first five rows
data.head()

| | Channel Title | Channel Username | ID | Message | Date | Media Path |
|---|---|---|---|---|---|---|
| 0 | Sheger online-store | @Shageronlinestore | 5333 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5333.jpg |
| 1 | Sheger online-store | @Shageronlinestore | 5332 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5332.jpg |
| 2 | Sheger online-store | @Shageronlinestore | 5331 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5331.jpg |
| 3 | Sheger online-store | @Shageronlinestore | 5330 | | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5330.jpg |
| 4 | Sheger online-store | @Shageronlinestore | 5329 | | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5329.jpg |


In [3]:
# Check the last five rows
data.tail()

| | Channel Title | Channel Username | ID | Message | Date | Media Path |
|---|---|---|---|---|---|---|
| 5010 | Sheger online-store | @Shageronlinestore | 10 | 🎯 3in1 One Step Hair Dryer & Styler \n\n👉 ከርል ... | 2021-04-27 05:57:12+00:00 | photos/@Shageronlinestore_10.jpg |
| 5011 | Sheger online-store | @Shageronlinestore | 9 | ✅ Home GYM - X5 slimming vibrator \n\n📢📢📢 ታላቅ ... | 2021-04-27 05:45:57+00:00 | photos/@Shageronlinestore_9.jpg |
| 5012 | Sheger online-store | @Shageronlinestore | 4 | ለጤናችን-Health & Personal Care\n\n📍FingerTip Pul... | 2021-04-12 08:36:40+00:00 | photos/@Shageronlinestore_4.jpg |
| 5013 | Sheger online-store | @Shageronlinestore | 3 | #Finger_tip_pulse_oximeter\n #በተመጣጣኝ_ዋጋ\... | 2021-04-12 08:35:47+00:00 | photos/@Shageronlinestore_3.jpg |
| 5014 | Sheger online-store | @Shageronlinestore | 1 | | 2021-04-11 10:31:03+00:00 | |


In [4]:
data.shape

(5015, 6)

In [5]:
# Let's check the missing values
data.isnull().sum()

Channel Title          0
Channel Username       0
ID                     0
Message             1849
Date                   0
Media Path          1221
dtype: int64
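The counts above mean that roughly 37% of the rows (1849 of 5015) have no message text, typically posts that contain only media. As a small sketch of how the share of missing values can be inspected and the text-bearing rows selected, the example below uses a toy DataFrame; the `toy` data and its values are hypothetical stand-ins, not rows from the scraped file.

```python
import pandas as pd

# Toy stand-in for the scraped data (hypothetical values).
toy = pd.DataFrame({
    "Message": ["ዋጋ 900 ብር", None, None, "ሰላም"],
    "Media Path": ["photos/a.jpg", "photos/b.jpg", None, "photos/c.jpg"],
})

# Share of missing values per column, as a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)

# Rows that carry text and are therefore worth preprocessing.
has_text = toy[toy["Message"].notna()]
print(len(has_text))
```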

In [6]:
# Preprocess and tokenize the Amharic messages
preprocessor = AmharicTextPreprocessor()

# Add a 'preprocessed_message' column built from the 'Message' column
tokens = preprocessor.preprocess_dataframe(data, 'Message')
display(tokens)


| | Channel Title | Channel Username | ID | Message | Date | Media Path | preprocessed_message |
|---|---|---|---|---|---|---|---|
| 0 | Sheger online-store | @Shageronlinestore | 5333 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5333.jpg | |
| 1 | Sheger online-store | @Shageronlinestore | 5332 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5332.jpg | |
| 2 | Sheger online-store | @Shageronlinestore | 5331 | | 2024-09-20 11:50:03+00:00 | photos/@Shageronlinestore_5331.jpg | |
| 3 | Sheger online-store | @Shageronlinestore | 5330 | | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5330.jpg | |
| 4 | Sheger online-store | @Shageronlinestore | 5329 | | 2024-09-20 11:50:02+00:00 | photos/@Shageronlinestore_5329.jpg | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5010 | Sheger online-store | @Shageronlinestore | 10 | 🎯 3in1 One Step Hair Dryer & Styler \n\n👉 ከርል ... | 2021-04-27 05:57:12+00:00 | photos/@Shageronlinestore_10.jpg | 31 ከርል ለመስራት ለማለስለስ እንዲሁም ለማድረቅ የሚያገለግል ለኢትዮጵያ... |
| 5011 | Sheger online-store | @Shageronlinestore | 9 | ✅ Home GYM - X5 slimming vibrator \n\n📢📢📢 ታላቅ ... | 2021-04-27 05:45:57+00:00 | photos/@Shageronlinestore_9.jpg | 5 ታላቅ ቅናሽ የሰዉነትዎ ውፍረት አሳስቧታል ሙሉ በሙሉ ቦርጭን በአጭር ... |
| 5012 | Sheger online-store | @Shageronlinestore | 4 | ለጤናችን-Health & Personal Care\n\n📍FingerTip Pul... | 2021-04-12 08:36:40+00:00 | photos/@Shageronlinestore_4.jpg | ለጤናችን 2 ዋጋ 900 ብር ያሉበት ድረስ በነፃ እናደርሳለን 0909522840 |
| 5013 | Sheger online-store | @Shageronlinestore | 3 | #Finger_tip_pulse_oximeter\n #በተመጣጣኝ_ዋጋ\... | 2021-04-12 08:35:47+00:00 | photos/@Shageronlinestore_3.jpg | በተመጣጣኝዋጋ 0909522840 ለአጠቃቀም ምቹ በሰዉነታችን ያለውን የኦክ... |
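`AmharicTextPreprocessor` is imported from the project's `scripts` directory and its source is not shown in this notebook. To give a rough idea of what a `preprocess_dataframe` method of this kind might do, here is a hypothetical stand-in. The class name `SketchPreprocessor`, the `clean` method, the `output_column` parameter, and the exact cleaning rule are assumptions; the real class may behave differently (for instance, the output above retains some `/` characters that this sketch would remove).

```python
import re

import pandas as pd


class SketchPreprocessor:
    """Hypothetical stand-in for AmharicTextPreprocessor (not the real class)."""

    # Keep Ethiopic letters, ASCII digits, and whitespace; drop everything else.
    _noise = re.compile(r"[^\u1200-\u135A0-9\s]+")

    def clean(self, text: str) -> str:
        """Strip emoji, Latin text, and punctuation, then collapse whitespace."""
        return re.sub(r"\s+", " ", self._noise.sub(" ", text)).strip()

    def preprocess_dataframe(self, df, column, output_column="preprocessed_message"):
        """Add `output_column` to `df` in place and return the same DataFrame."""
        df[output_column] = df[column].apply(
            # Missing messages are passed through unchanged.
            lambda value: self.clean(value) if isinstance(value, str) else value
        )
        return df


toy = pd.DataFrame({"Message": ["📍 ዋጋ 900 ብር!! Call 0909522840", None]})
out = SketchPreprocessor().preprocess_dataframe(toy, "Message")
print(out)
```

The sketch modifies the DataFrame in place and also returns it, which would be consistent with the notebook reading `data['preprocessed_message']` after assigning the result to `tokens`.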


In [7]:
# Drop rows whose 'Message' is missing
data.dropna(subset=['Message'], inplace=True)

In [8]:
list(data['preprocessed_message'])

['3 እስከ 260 ሙቀት መቆቆም የሚችል ዋጋ550ብር አድራሻ ቁ1 ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05ከ ሊፍቱ ፊት ለ ፊት ቁ2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት ዛምሞል 2ኛ ፎቅ ቢሮ ቁጥር214 ለቡ ቅርንጫፍ0973611819 0909522840 0923350054 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 'ጊዜ ቆጣቢ ስላይስ ማድረጊያ ለእጅ ሴፍቲ ተመራጭ ለድንች ለካሮትና ሌሎች አታክልቶች ተመራጭ ጥራት ያለው ዕቃ ዋጋ 1200 ብር አድራሻ ቁ1 ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05ከ ሊፍቱ ፊት ለ ፊት ቁ2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት ዛምሞል 2ኛ ፎቅ ቢሮ ቁጥር214 ለቡ ቅርንጫፍ0973611819 0909522840 0923350054 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '2 1 1 ዋጋ 2 500 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1 ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05ከ ሊፍቱ ፊት ለ ፊት ቁ2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት ዛምሞል 2ኛ ፎቅ ቢሮ ቁጥር214 ለቡ ቅርንጫፍ0973611819 0909522840 0923350054 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '2 1 1 ዋጋ 2 500 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1 ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05ከ ሊፍቱ ፊት ለ ፊት ቁ2 ለቡ መዳህኒዓለም ቤተ/ክርስቲያን ፊት ለፊት ዛምሞል 2ኛ ፎቅ ቢሮ ቁጥር214 ለቡ ቅርንጫፍ0973611819 0909522840 0923350054 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '31 ዋጋ3000ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1 ስሪ ኤም ሲቲ ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05ከ ሊፍቱ ፊት ለ ፊት ቁ2 

In [9]:
# Ensure there are no NaN values in the preprocessed column
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
# Build a DataFrame with an 'index' column and a 'message' column
df = pd.Series(preprocessed_texts).reset_index(name='message')
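One detail worth keeping in mind at this step: `dropna()` removes only missing values, so a message that was reduced to an empty string during cleaning would still be kept. The sketch below, on hypothetical toy values, shows the difference and the shape of the frame that `reset_index(name=...)` produces.

```python
import pandas as pd

# Toy values (hypothetical): one empty string and one missing entry.
raw = pd.Series(["ዋጋ 900 ብር", "", None, "ሰላም"])

non_null = raw.dropna()                            # drops None/NaN only; "" survives
non_empty = non_null[non_null.str.strip() != ""]   # also drop blank strings

# reset_index(name=...) turns the Series into a two-column DataFrame.
df_toy = pd.Series(non_empty.tolist()).reset_index(name="message")
print(df_toy.columns.tolist())
print(df_toy)
```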


In [10]:
# Save the preprocessed messages (written in CSV format despite the .txt extension)
df.to_csv('../data/preprocessed.txt')
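As a quick sanity check when saving Amharic text, it can help to confirm that the file round-trips without corrupting the Ethiopic characters. The sketch below writes a toy frame to a temporary file with an explicit UTF-8 encoding and reads it back; the file name, the toy messages, and the use of `index=False` are assumptions for illustration and differ from the call above.

```python
import os
import tempfile

import pandas as pd

# Toy messages (hypothetical), standing in for the preprocessed column.
df_out = pd.DataFrame({"message": ["ዋጋ 900 ብር", "ሰላም እንዴት ነህ"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed.csv")  # hypothetical file name
    # index=False avoids writing the row index as an extra column.
    df_out.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

# The Amharic text should come back unchanged.
print(restored["message"].tolist() == df_out["message"].tolist())
```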