In [1]:
%load_ext autoreload
%autoreload 2

import os 
import sys
import pandas as pd
from transformers import AutoTokenizer
from telethon import TelegramClient

sys.path.append(os.path.abspath('../src'))  # make the packages under ../src (services, utils) importable
from services.telegram_scrapper import run_scrapper
from utils.utils import clean_data, normalize_data, hf_tokenize, regex_tokenize

sys.path.append(os.path.abspath('../config'))
from config import load_credentials




# Loading data from Telegram
We will use the **telegram_scrapper** module to load Telegram data, calling its entry point, the **run_scrapper** function.
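
The internals of `run_scrapper` are not shown in this notebook. As a rough sketch only, a per-channel routine built on Telethon's `get_entity`, `iter_messages`, and `download_media` calls might look like the following. The helper name `scrape_channel`, the `limit` default, the `photo_dir` argument, and the row layout are assumptions chosen to mirror the CSV columns that appear later, not the module's actual code.

```python
import os


async def scrape_channel(client, username, writer, photo_dir="photo", limit=600):
    """Write one CSV row per message of a channel (hypothetical helper)."""
    entity = await client.get_entity(username)  # resolve @username to a channel
    async for message in client.iter_messages(entity, limit=limit):
        media_path = None
        if message.photo:  # only photos are downloaded in this sketch
            target = os.path.join(photo_dir, f"{username}_{message.id}.jpg")
            media_path = await client.download_media(message, file=target)
        writer.writerow([
            entity.title,     # Channel Title
            username,         # Channel Username
            message.id,       # ID
            message.message,  # Message text, None for media-only posts
            message.date,     # Date
            media_path,       # Media Path, None when there is no photo
        ])
```

Here `writer` is expected to be a `csv.writer`-like object, and the coroutine would be awaited once per channel inside an active client session.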

In [None]:
creds = load_credentials()  # load credentials from the environment

# initialize the Telegram client
client = TelegramClient("scraping_session;.session", creds["api_id"], creds["api_hash"])

# await run_scrapper(client)  # entry point of telegram_scrapper; uncomment to re-run the scrape
print("Scrapped Data successfully ✅")

Sheger online-store


Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host
Attempt 1 at connecting failed: TimeoutError: 


Scraped data from @Shageronlinestore
Shewa Brand
Scraped data from @Shewabrand
HellooMarket
Scraped data from @helloomarketethiopia
Fashion tera
Scraped data from @Fashiontera
NEVA COMPUTER®
Scraped data from @nevacomputer
Scrapped Data successfully ✅


Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host
Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host
Server resent the older message 7518247682440480769, ignoring


The data was scraped successfully and stored in the `data/raw/` directory. Next, we load and explore it.

In [2]:
data = pd.read_csv("../data/raw/telegram_data.csv")  # read the scraped data into a DataFrame

In [4]:
print(f"Total number of messages scrapped: {data.shape[0]}") # print the total number of messages scrapped
print(f"with columns:")
for col in data.columns:
    print(col + "\n")
print(f"We can see that we have {len(data.columns)} columns✅")

Total number of messages scrapped: 3000
with columns:
Channel Title

Channel Username

ID

Message

Date

Media Path

We can see that we have 6 columns✅


In [5]:
data.isna().sum() # check for missing values in the dataframe 

Channel Title          0
Channel Username       0
ID                     0
Message             1223
Date                   0
Media Path           134
dtype: int64

The `Message` column has 1223 null values, which likely correspond to **polls** or other non-text messages.

In [6]:
# count messages that have text but no media path (no picture)
msg_with_no_photo = data[data["Message"].notna() & data["Media Path"].isna()].shape[0]
print(f"There are {msg_with_no_photo} messages in the dataset without photo")

There are 122 messages in the dataset without photo


# The next task is to clean the Dataset

In [3]:
cleaned_data = clean_data(data) # clean the data by removing duplicates and messages with no text and media


In [4]:
print(f"Total number of messages after cleaning: {cleaned_data.shape[0]}") # print the total number of messages after cleaning
cleaned_data.isna().sum() # check for missing values in the cleaned dataframe

Total number of messages after cleaning: 1777


Channel Title       0
Channel Username    0
ID                  0
Message             0
Date                0
Media Path          0
dtype: int64

### Our dataset is cleaned and doesn't contain any **NaN** values

## **Data Normalization**:
- We will remove all emojis from the messages.
- We will also strip hashtags from the messages and store them in a separate `hashtags` column, in case they are useful for classification downstream.
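
The implementation of `normalize_data` lives in `utils.utils` and is not shown here. The two steps above can be approximated with plain regular expressions, as in the sketch below. The emoji ranges are a rough, non-exhaustive approximation, the function names are hypothetical, and the `"no tag"` placeholder is an assumption based on the values visible in the `hashtags` column.

```python
import re

# Rough emoji coverage: pictographs, dingbats and misc symbols, regional flags,
# plus the variation selector and zero-width joiner used in emoji sequences.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# \w is Unicode-aware for str patterns, so Amharic hashtags match as well.
HASHTAG_RE = re.compile(r"#(\w+)")


def strip_emojis(text):
    """Remove characters that fall in the emoji ranges above."""
    return EMOJI_RE.sub("", text)


def split_hashtags(text):
    """Return the text without hashtags, plus the list of extracted tags."""
    tags = HASHTAG_RE.findall(text)
    cleaned = HASHTAG_RE.sub("", text)
    return cleaned, tags or ["no tag"]
```

Applied row by row, the second return value would populate the `hashtags` column while the first replaces the message text.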

In [4]:
cleaned_normalized_data = normalize_data(cleaned_data)
cleaned_normalized_data.reset_index(drop=True, inplace=True)
cleaned_normalized_data


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,hashtags
0,Sheger online-store,@Shageronlinestore,7399,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,2025-06-20 15:25:41+00:00,../data/raw/photo\@Shageronlinestore_7399.jpg,[ዛም_ሞል]
1,Sheger online-store,@Shageronlinestore,7395,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,2025-06-20 15:25:40+00:00,../data/raw/photo\@Shageronlinestore_7395.jpg,[ዛም_ሞል]
2,Sheger online-store,@Shageronlinestore,7393,1L Water Bottle High Quality 1L water time sca...,2025-06-20 11:47:53+00:00,../data/raw/photo\@Shageronlinestore_7393.jpg,[ዛም_ሞል]
3,Sheger online-store,@Shageronlinestore,7391,Sonifer Steam Iron የልብስ መቶከሻ High Quality Cera...,2025-06-20 09:03:23+00:00,../data/raw/photo\@Shageronlinestore_7391.jpg,[ዛም_ሞል]
4,Sheger online-store,@Shageronlinestore,7390,Sayona multifunctional juicer and extractor Be...,2025-06-20 06:48:11+00:00,../data/raw/photo\@Shageronlinestore_7390.jpg,[ዛም_ሞል]
...,...,...,...,...,...,...,...
1772,NEVA COMPUTER®,@nevacomputer,8105,This Alienware m15 R5 Ryzen Edition Gaming Lap...,2023-11-28 12:27:58+00:00,../data/raw/photo\@nevacomputer_8105.jpg,[no tag]
1773,NEVA COMPUTER®,@nevacomputer,8103,The Alienware m15 Ryzen Edition R5 is engineer...,2023-11-28 12:15:47+00:00,../data/raw/photo\@nevacomputer_8103.jpg,[no tag]
1774,NEVA COMPUTER®,@nevacomputer,8102,Dell Alienware M15 R5 15.6'' QHD Gaming Laptop...,2023-11-28 12:13:18+00:00,../data/raw/photo\@nevacomputer_8102.jpg,[no tag]
1775,NEVA COMPUTER®,@nevacomputer,8099,NEW ARRIVAL from BRAND : Dell inspiron DISPLAY...,2023-11-28 05:55:47+00:00,../data/raw/photo\@nevacomputer_8099.jpg,[no tag]


In [7]:
print(cleaned_normalized_data.columns)
print(f"we have added a new '{cleaned_normalized_data.columns[-1]}' column to the dataframe✅")

Index(['Channel Title', 'Channel Username', 'ID', 'Message', 'Date',
       'Media Path', 'hashtags'],
      dtype='object')
we have added a new 'hashtags' column to the dataframe✅


# Data Tokenization 
- For initial tokenization we'll use a regex-based tokenizer, since it makes it easier to add custom tags.
- We will store the tokenized text in a `tokenized_msg` column.
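
The actual `regex_tokenize` function is defined in `utils.utils` and is not reproduced here. A minimal sketch of the idea, assuming a pattern of "word characters or a single punctuation mark", is shown below. The extra decimal-number alternative illustrates how a custom pattern can be placed ahead of the generic rule; it is an assumption for illustration, not part of the original tokenizer.

```python
import re

TOKEN_RE = re.compile(
    r"\d+(?:[.,]\d+)+"  # custom rule: keep numbers such as 15.6 or 2,000 together
    r"|\w+"             # runs of letters, digits, underscore (Unicode-aware)
    r"|[^\w\s]"         # any single punctuation mark or symbol
)


def simple_regex_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(simple_regex_tokenize("Universal water-saving dishwasher head"))
# ['Universal', 'water', '-', 'saving', 'dishwasher', 'head']
```

Because alternatives are tried left to right, more specific patterns should come before `\w+`.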

In [5]:
cleaned_normalized_data["tokenized_msg"] = cleaned_normalized_data["Message"].apply(regex_tokenize) # tokenize the messages using regex tokenizer
print("Tokenized messages using regex tokenizer ✅")

Tokenized messages using regex tokenizer ✅


In [None]:
cleaned_normalized_data[["tokenized_msg", "Message"]].head(15)  # display the first 15 rows of the tokenized messages

Unnamed: 0,tokenized_msg,Message
0,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...
1,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...
2,"[1L, Water, Bottle, High, Quality, 1L, water, ...",1L Water Bottle High Quality 1L water time sca...
3,"[Sonifer, Steam, Iron, የልብስ, መቶከሻ, High, Quali...",Sonifer Steam Iron የልብስ መቶከሻ High Quality Cera...
4,"[Sayona, multifunctional, juicer, and, extract...",Sayona multifunctional juicer and extractor Be...
5,"[2in1, long, handled, bath, brush, ለአያያዝ, ምቹ, ...",2in1 long handled bath brush ለአያያዝ ምቹ በቀላሉ የማን...
6,"[Miralux, Hot, plate, ባለሁለት, ምድጃ, ስቶቭ, orginal...",Miralux Hot plate ባለሁለት ምድጃ ስቶቭ orginal 2000 ዋ...
7,"[7pcs, glass, water, set, አንድ, ማራኪ, ጆግና, 6, መጠ...",7pcs glass water set አንድ ማራኪ ጆግና 6 መጠጫ ብርጭቆዎች ...
8,"[Universal, water, -, saving, dishwasher, head...",Universal water-saving dishwasher head Increas...
9,"[special, base, for, refrigerator, and, washin...",special base for refrigerator and washing mach...


In [10]:
# save the cleaned and normalized data to a csv file for downstream tasks
cleaned_normalized_data.to_csv("../data/processed/cleaned_normalized_data.csv", index=False)
print("Cleaned and normalized data saved to ../data/processed/cleaned_normalized_data.csv ✅")

Cleaned and normalized data saved to ../data/processed/cleaned_normalized_data.csv ✅


# 📒 Data Ingestion and Preprocessing Summary

This notebook outlines the pipeline for collecting and preparing Telegram data for analysis, focusing on Amharic-language e-commerce channels.

---

## 1. Loading Data from Telegram

- Uses a custom `telegram_scrapper` module and the `Telethon` library.
- Connects to Telegram using API credentials.
- Scrapes messages, images, and documents from relevant channels.
- Stores raw data in `data/raw/telegram_data.csv`.
- Stores photos in a folder called `photo`, in the same directory as the raw `telegram_data.csv`.
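
The credential step could be sketched as follows, assuming the API credentials are exposed through environment variables. The function name and the variable names `TG_API_ID` and `TG_API_HASH` are hypothetical; the notebook's own `load_credentials` in the `config` module may read different names or a `.env` file. Only the returned keys, `api_id` and `api_hash`, are taken from how the notebook uses the result.

```python
import os


def load_credentials_from_env(env=os.environ):
    """Return Telegram API credentials read from environment variables."""
    try:
        return {
            "api_id": int(env["TG_API_ID"]),  # hypothetical variable name
            "api_hash": env["TG_API_HASH"],   # hypothetical variable name
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

Keeping the credentials out of the notebook itself means the file can be shared without leaking the API hash.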

---

## 2. Data Exploration

- Loads the raw data into a Pandas DataFrame.
- Explores the dataset structure and checks for missing values.
- Identifies entries without messages, which might be polls or other non-text messages.
- Identifies messages with and without media attachments.
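
The checks in this step can be reproduced on a small toy frame, as sketched below. The four rows are made up for illustration; only the column names `Message` and `Media Path` come from the scraped dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Message": ["Steam iron", None, "Water bottle", None],
    "Media Path": ["a.jpg", "b.jpg", None, None],
})

missing_per_column = toy.isna().sum()  # null count per column
text_without_media = toy["Message"].notna() & toy["Media Path"].isna()
media_without_text = toy["Message"].isna() & toy["Media Path"].notna()

print(int(text_without_media.sum()))  # 1
print(int(media_without_text.sum()))  # 1
```

Boolean masks built with `notna()` and `isna()` can be combined with `&` and reused for filtering as well as counting.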

---

## 3. Data Cleaning

- Removes duplicates and irrelevant entries (e.g., empty messages).
- Ensures the cleaned dataset contains no missing values.
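
A minimal sketch of this step, assuming pandas' `drop_duplicates` and `dropna` are sufficient, is shown below. The function name `clean_messages` is hypothetical, and how the real `clean_data` treats a missing `Media Path` is not shown in this notebook, so this sketch simply leaves that column untouched.

```python
import pandas as pd


def clean_messages(df):
    """Drop duplicate rows and rows with no message text (illustrative only)."""
    deduped = df.drop_duplicates()
    with_text = deduped.dropna(subset=["Message"])  # keep rows that have text
    return with_text.reset_index(drop=True)


toy = pd.DataFrame({
    "Message": ["a", "a", None, "b"],
    "Media Path": ["x.jpg", "x.jpg", "y.jpg", None],
})
cleaned_toy = clean_messages(toy)
print(len(cleaned_toy))  # 2
```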

**Note**: Entries that have a message but no media are not removed, as they are needed for the NER pipeline.

---

## 4. Data Normalization

- Removes emojis from messages.
- Extracts hashtags and stores them in a separate column for downstream tasks.
- Normalizes negative circled characters, which could add noise downstream, to their base textual form.
- Normalizes whitespace for smoother tokenization.
- Replaces non-breaking spaces, which appear as `\xa0` in the text, with regular spaces.
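
The last three items can be illustrated with a small standalone function. This is a sketch under assumptions: the translation table below covers only the negative circled Latin capitals (U+1F150 to U+1F169) and the dingbat negative circled digits 1 to 10 (U+2776 to U+277F), and the name `normalize_text` is hypothetical rather than the helper used inside `normalize_data`.

```python
import re

# Explicit mapping from negative circled characters to plain ASCII.
CIRCLED = {0x1F150 + i: chr(ord("A") + i) for i in range(26)}
CIRCLED.update({0x2776 + i: str(i + 1) for i in range(10)})


def normalize_text(text):
    """Map circled characters to ASCII and tidy up whitespace."""
    text = text.translate(CIRCLED)            # circled letters and digits to ASCII
    text = text.replace("\xa0", " ")          # non-breaking space to regular space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


sample = "\U0001F162\U0001F150\U0001F15B\U0001F154\xa0 now \n \u2776"
print(normalize_text(sample))  # SALE now 1
```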

---

## 5. Data Tokenization

- Applies a regex-based tokenizer to the message text.
- Stores tokenized output in a new `tokenized_msg` column.

---

## 6. Save preprocessed data
- Stores the preprocessed data in `../data/processed/cleaned_normalized_data.csv`.
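
One detail worth keeping in mind when this file is read back: CSV has no list type, so list-valued columns such as `tokenized_msg` are written as their string representation. The sketch below shows one way to restore them on load, using `ast.literal_eval` as a `read_csv` converter. The in-memory buffer stands in for the CSV file, and the single-row frame is made up for illustration.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "Message": ["1L Water Bottle"],
    "tokenized_msg": [["1L", "Water", "Bottle"]],
})

buffer = io.StringIO()  # stands in for the CSV file on disk
df.to_csv(buffer, index=False)
buffer.seek(0)

# Without the converter, the column would come back as a plain string.
reloaded = pd.read_csv(buffer, converters={"tokenized_msg": ast.literal_eval})
print(reloaded.loc[0, "tokenized_msg"])  # ['1L', 'Water', 'Bottle']
```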

**Result:**  
A clean, normalized, and tokenized dataset ready for labeling and model fine-tuning, with metadata and message