# Task 1: Data Ingestion and Preprocessing

This notebook implements Task 1 for the Amharic E-commerce Data Extractor project. It fetches messages from Ethiopian Telegram e-commerce channels, preprocesses the data, and stores it in a structured format.

## Objectives
- Scrape messages from 5 Telegram channels (`@ZemenExpress`, `@nevacomputer`, `@aradabrand2`, `@ethio_brand_collection`, `@modernshoppingcenter`).
- Collect text, images, and metadata (message_id, timestamp, views, sender).
- Preprocess Amharic text (remove emojis, normalize currency, e.g. `1500ብር` becomes `1500 ETB`).
- Save the raw data to `data/raw/telegram_data.csv` and the preprocessed data to `data/processed/telegram_data_final.csv`.

## Setup
- Requires `telethon`, `pandas`, `pyyaml`.
- Uses `config.yaml` (one directory above this notebook) for the Telegram API credentials (`api_id`, `api_hash`, `phone`) and the list of channels.
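
Because the first cell indexes straight into the loaded config, a missing key only surfaces as a `KeyError`. The following is a minimal sketch of a pre-flight check, assuming the layout implied by that cell (`telegram.api_id`, `telegram.api_hash`, `telegram.phone`, and a top-level `channels` list). The helper name `check_config` and the rule that channel names start with `@` are illustrative assumptions, not part of the project.

```python
# Hypothetical helper: not part of the project's src/ directory.
REQUIRED_TELEGRAM_KEYS = ("api_id", "api_hash", "phone")


def check_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []

    telegram = config.get("telegram")
    if not isinstance(telegram, dict):
        problems.append("missing 'telegram' section")
    else:
        for key in REQUIRED_TELEGRAM_KEYS:
            if not telegram.get(key):
                problems.append(f"missing telegram.{key}")

    channels = config.get("channels")
    if not isinstance(channels, list) or not channels:
        problems.append("'channels' must be a non-empty list")
    else:
        for name in channels:
            # Assumption: channels are given as '@username', as in this notebook.
            if not (isinstance(name, str) and name.startswith("@")):
                problems.append(f"channel {name!r} should look like '@name'")

    return problems
```

Calling `check_config(config)` right after `yaml.safe_load` and stopping when the list is non-empty would give a readable error before any network connection is attempted.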

In [1]:
# Import libraries
import yaml
import os
import pandas as pd
import re
from telethon.sync import TelegramClient

# Load configuration
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

api_id = config['telegram']['api_id']
api_hash = config['telegram']['api_hash']
phone = config['telegram']['phone']
channels = config['channels']

print(f"Channels: {channels}")

Channels: ['@ZemenExpress', '@nevacomputer', '@aradabrand2', '@ethio_brand_collection', '@modernshoppingcenter']


In [2]:
# Run data ingestion script
%run ../src/data_ingestion.py

# Load and inspect raw data
df = pd.read_csv('../data/raw/telegram_data.csv')
print(df.info())  # info() prints the summary itself and returns None, hence the "None" in the output
print(df[['channel', 'message', 'views', 'image_path']].head(5))

2025-06-21 17:22:56,349 - INFO - Connecting to 149.154.167.92:443/TcpFull...
2025-06-21 17:22:56,448 - INFO - Connection to 149.154.167.92:443/TcpFull complete!
2025-06-21 17:22:57,010 - INFO - Scraping @ZemenExpress
2025-06-21 17:22:57,439 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:57,732 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:57,995 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:58,303 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:58,690 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:59,279 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:59,614 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-21 17:22:59,896 - INFO - Starting direct file download

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   message_id  500 non-null    int64 
 1   channel     500 non-null    object
 2   message     238 non-null    object
 3   timestamp   500 non-null    object
 4   views       500 non-null    int64 
 5   sender      500 non-null    int64 
 6   image_path  465 non-null    object
dtypes: int64(3), object(4)
memory usage: 27.5+ KB
None
         channel                                            message  views  \
0  @ZemenExpress                                                NaN   1336   
1  @ZemenExpress                                                NaN   1337   
2  @ZemenExpress                                                NaN   1326   
3  @ZemenExpress  💥💥...................................💥💥\n\n3pc...   1301   
4  @ZemenExpress  💥💥...................................💥💥\n\n3pc...   1129   

           

## Data Ingestion

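The cell above runs `src/data_ingestion.py`, which scrapes the channels and saves the messages to `data/raw/telegram_data.csv`. In this run the file holds 500 rows, of which 238 have message text and 465 have an image path.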
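
The script itself is not shown in this notebook. As a rough sketch of how one channel could be turned into rows with the seven columns listed above, the code below assumes an already connected `telethon.sync.TelegramClient` is passed in. The function names, the `photos` directory, and the choice to store `0` when a message has no view count are assumptions for illustration, not the project's actual implementation.

```python
import os

# Column order taken from the df.info() output above.
COLUMNS = ["message_id", "channel", "message", "timestamp",
           "views", "sender", "image_path"]


def message_to_row(channel, msg, image_path=None):
    """Map a Telethon-style message object to one row (a dict)."""
    return {
        "message_id": msg.id,
        "channel": channel,
        "message": msg.message or None,   # media-only posts have no text
        "timestamp": msg.date.isoformat() if msg.date else None,
        "views": msg.views or 0,          # assumption: missing views stored as 0
        "sender": msg.sender_id,
        "image_path": image_path,
    }


def scrape_channel(client, channel, limit=100, media_dir="photos"):
    """Yield one row per message for the most recent `limit` messages."""
    os.makedirs(media_dir, exist_ok=True)
    for msg in client.iter_messages(channel, limit=limit):
        # download_media returns the saved file path (or None on failure)
        path = msg.download_media(file=media_dir) if msg.photo else None
        yield message_to_row(channel, msg, path)


# Possible usage, assuming api_id, api_hash, and channels from cell 1:
# from telethon.sync import TelegramClient
# with TelegramClient("session", api_id, api_hash) as client:
#     rows = [r for ch in channels for r in scrape_channel(client, ch)]
#     pd.DataFrame(rows, columns=COLUMNS).to_csv("telegram_data.csv", index=False)
```

Keeping `message_to_row` separate from the network loop makes the row mapping easy to test with stand-in objects, without a Telegram session.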

## Data Preprocessing

Run `src/preprocess.py` to preprocess the message text and save the result, with a new `preprocessed_text` column, to `data/processed/telegram_data_final.csv`.
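
The two preprocessing steps named in the objectives (emoji removal and currency normalization) can be approximated with a few regular expressions. This is a minimal sketch, not the project's `preprocess_amharic`: the name `preprocess_sketch` is hypothetical, and the rule of keeping only Latin letters, digits, and the Ethiopic Unicode block (U+1200 to U+137F) is an assumption chosen to match the sample outputs in this notebook.

```python
import re

# A number followed by "ብር" (birr), with optional whitespace in between.
_BIRR = re.compile(r"(\d+)\s*ብር")
# Anything that is not a Latin letter, digit, Ethiopic character, or whitespace.
_NOT_KEPT = re.compile(r"[^0-9A-Za-z\u1200-\u137F\s]")
_SPACES = re.compile(r"\s+")


def preprocess_sketch(text: str) -> str:
    """Strip emojis and punctuation, and rewrite '<number>ብር' as '<number> ETB'."""
    text = _BIRR.sub(r"\1 ETB", text)   # normalize currency first
    text = _NOT_KEPT.sub(" ", text)     # emojis, dots, colons become spaces
    return _SPACES.sub(" ", text).strip()


print(preprocess_sketch("📌 New Product 📌 Price:1000ብር"))
# New Product Price 1000 ETB
```

One limitation of this sketch is that prices written with separators, such as `1,500ብር`, are not handled, since the pattern only matches an unbroken run of digits.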

In [7]:
# Run preprocessing script
%run ../src/preprocess.py

# Load and inspect preprocessed data
df = pd.read_csv('../data/processed/telegram_data_final.csv')
print(df[['message', 'preprocessed_text']].head(5))

# Validate data
!python ../src/preprocess.py --validate --output ../data/processed/telegram_data_final.csv

2025-06-21 17:31:00,820 - INFO - ✅ Saved 500 preprocessed messages to ../data/processed/telegram_data_final.csv


                                             message  \
0                                                NaN   
1                                                NaN   
2                                                NaN   
3  💥💥...................................💥💥\n\n3pc...   
4  💥💥...................................💥💥\n\n3pc...   

                                   preprocessed_text  
0                                                NaN  
1                                                NaN  
2                                                NaN  
3  3pcs Bottle Stopper በማንኛውም ጠርሙስ ጫፍ የሚገጠም ለዘይት ...  
4  3pcs Bottle Stopper በማንኛውም ጠርሙስ ጫፍ የሚገጠም ለዘይት ...  


2025-06-21 17:31:01,231 - INFO - ✅ CSV validation passed
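
The notebook does not show what the `--validate` flag checks. As an illustration of the kind of check such a step might perform, the sketch below verifies that the expected columns exist and that every `message_id` is an integer. The function name `validate_csv` and both rules are assumptions, not the behavior of `src/preprocess.py`.

```python
import csv
import io

# Columns of the raw file, as reported by df.info() earlier in this notebook.
EXPECTED_COLUMNS = ["message_id", "channel", "message", "timestamp",
                    "views", "sender", "image_path"]


def validate_csv(handle, required=EXPECTED_COLUMNS):
    """Return a list of error strings; an empty list means the file passed."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []

    errors = [f"missing column: {col}" for col in required if col not in header]
    if errors:
        return errors  # row checks are pointless without the right columns

    for row_no, row in enumerate(reader, start=1):
        if not (row["message_id"] or "").strip().isdigit():
            errors.append(f"row {row_no}: message_id is not an integer")
    return errors


sample = io.StringIO(
    "message_id,channel,message,timestamp,views,sender,image_path\n"
    "1,@ZemenExpress,hello,2025-06-21,10,5,\n"
)
print(validate_csv(sample))
# []
```

For a file on disk, the handle would be opened with `open(path, newline="", encoding="utf-8")` so that multi-line messages and Amharic text are read correctly.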


## Summary

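- Scraped 500 messages from the 5 configured channels into `data/raw/telegram_data.csv`.
- Preprocessed the Amharic text for NER by removing emojis and normalizing currency.
- Saved the preprocessed data to `data/processed/telegram_data_final.csv`; CSV validation passed.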
- Next: Task 2 (labeling in CoNLL format).

In [6]:
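# Spot-check preprocess_amharic on sample messages; the function must already be
# in the notebook namespace (for example via `%run ../src/preprocess.py`).
test_cases = [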
    "💥💥........💥💥\n\n3pc Bottle Stopper 1500ብር 🔥",
    "📌 New Product 📌 Price:1000ብር",
    "⚠️ Limited Offer ⚠️"
]

for case in test_cases:
    print(f"Before: {case}")
    print(f"After: {preprocess_amharic(case)}\n")

Before: 💥💥........💥💥

3pc Bottle Stopper 1500ብር 🔥
After: 3pc Bottle Stopper 1500 ETB

Before: 📌 New Product 📌 Price:1000ብር
After: New Product Price 1000 ETB

Before: ⚠️ Limited Offer ⚠️
After: Limited Offer

