# Task 1: Data Ingestion and Preprocessing

This notebook implements Task 1 of the Amharic E-commerce Data Extractor project. It fetches messages from Ethiopian Telegram e-commerce channels, preprocesses the Amharic text, and stores the results as CSV files.

## Objectives
- Scrape messages from 5 Telegram channels (`@ZemenExpress`, `@nevacomputer`, `@aradabrand2`, `@ethio_brand_collection`, `@modernshoppingcenter`).
- Collect text, images, and metadata (`message_id`, `timestamp`, `views`, `sender`).
- Preprocess Amharic text (remove emojis, normalize currency).
- Save data to `data/raw/telegram_data.csv` and `data/processed/telegram_data_final.csv`.
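
The preprocessing objective above can be illustrated with a minimal sketch. This is not the project's `src/preprocess.py`: the function name `clean_text`, the emoji ranges, the currency spellings, and the target token `ETB` are all assumptions chosen for illustration. The removal of dotted separator lines mirrors what the preprocessed output later in this notebook suggests.

```python
import re

# Assumed emoji ranges; the project's actual pattern may differ.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)

# Assumed currency spellings, all mapped to the single token "ETB".
# The lookarounds keep "ብር" inside longer Amharic words (e.g. "ብርጭቆ") untouched.
CURRENCY_PATTERN = re.compile(
    r"(?<![A-Za-z])(?:birr|etb)(?![A-Za-z])"
    r"|(?<![\u1200-\u135A])ብር(?![\u1200-\u135A])",
    re.IGNORECASE,
)

def clean_text(text):
    """Strip emojis and decorative dot runs, then unify currency mentions."""
    text = EMOJI_PATTERN.sub(" ", text)
    text = re.sub(r"\.{3,}", " ", text)         # decorative "......" separators
    text = CURRENCY_PATTERN.sub(" ETB ", text)  # e.g. "1500ብር" -> "1500 ETB"
    return re.sub(r"\s+", " ", text).strip()    # collapse newlines and spaces
```

For example, `clean_text("Price: 200 Birr")` yields `"Price: 200 ETB"`. Mapping every spelling to one token is a design choice that keeps price mentions uniform for later labeling; a different canonical form would work equally well.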

## Setup
- Requires `telethon`, `pandas`, and `pyyaml`.
- Reads the Telegram API credentials (`api_id`, `api_hash`, `phone`) and the channel list from `config.yaml`.
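
Because a missing key in `config.yaml` only surfaces as a `KeyError` when the first cell runs, a small check can make the failure easier to read. The sketch below is a hypothetical helper (`check_config` is not part of the project); it only assumes the keys that the first cell reads: `telegram.api_id`, `telegram.api_hash`, `telegram.phone`, and `channels`.

```python
def check_config(config):
    """Return the config unchanged, or raise ValueError naming missing keys.

    Hypothetical helper; the key names match those read in the first cell.
    """
    missing = []
    telegram = config.get("telegram") or {}
    for key in ("api_id", "api_hash", "phone"):
        if not telegram.get(key):
            missing.append(f"telegram.{key}")
    if not config.get("channels"):
        missing.append("channels")
    if missing:
        raise ValueError("config.yaml is missing: " + ", ".join(missing))
    return config
```

It could wrap the existing load call, for example `config = check_config(yaml.safe_load(f))`.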

In [1]:
# Import libraries
import yaml
import os
import pandas as pd
import re
from telethon.sync import TelegramClient

# Load configuration
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

api_id = config['telegram']['api_id']
api_hash = config['telegram']['api_hash']
phone = config['telegram']['phone']
channels = config['channels']

print(f"Channels: {channels}")

Channels: ['@ZemenExpress', '@nevacomputer', '@aradabrand2', '@ethio_brand_collection', '@modernshoppingcenter']


## Data Ingestion

Run `src/data_ingestion.py` to scrape the configured channels and save the messages to `data/raw/telegram_data.csv`.
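
The script itself is not shown in this notebook. As a rough sketch of what its scraping loop might look like, assuming an already connected synchronous Telethon client (as provided by `telethon.sync`): the helper name `scrape_channel`, the `limit` default, the `photos` directory, and the file naming are illustrative assumptions. Only the dictionary keys are taken from the columns of the raw CSV.

```python
import os

def scrape_channel(client, channel, limit=100, image_dir="photos"):
    """Collect one dict per message, keyed by the raw CSV's column names.

    Hypothetical helper: `limit`, `image_dir`, and the file naming are
    assumptions, not values taken from src/data_ingestion.py.
    """
    os.makedirs(image_dir, exist_ok=True)
    rows = []
    for message in client.iter_messages(channel, limit=limit):
        image_path = None
        if message.photo:
            # download_media returns the path of the saved file (or None)
            target = os.path.join(
                image_dir, f"{channel.lstrip('@')}_{message.id}.jpg"
            )
            image_path = message.download_media(file=target)
        rows.append({
            "message_id": message.id,
            "channel": channel,
            "text": message.text or None,  # posts without text become empty cells
            "timestamp": message.date.isoformat(),
            "views": message.views,
            "sender": message.sender_id,
            "image_path": image_path,
        })
    return rows
```

With the credentials loaded in the first cell, such a helper could be driven by a client created with `TelegramClient('session', api_id, api_hash)` and started with `client.start(phone=phone)` (the session name is an assumption), and the combined rows written out with `pd.DataFrame(rows).to_csv(...)`.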

In [2]:
# Run data ingestion script
%run ../src/data_ingestion.py

# Load and inspect raw data
df = pd.read_csv('../data/raw/telegram_data.csv')
print(df.info())
print(df[['channel', 'text', 'views', 'image_path']].head(5))

2025-06-19 17:13:24,820 - INFO - Connecting to 149.154.167.92:443/TcpFull...
2025-06-19 17:13:24,927 - INFO - Connection to 149.154.167.92:443/TcpFull complete!
2025-06-19 17:13:25,484 - INFO - Scraping @ZemenExpress
2025-06-19 17:13:25,918 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:26,248 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:26,510 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:26,733 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:26,960 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:27,129 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:27,346 - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-06-19 17:13:27,546 - INFO - Starting direct file download

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   message_id  500 non-null    int64 
 1   channel     500 non-null    object
 2   text        238 non-null    object
 3   timestamp   500 non-null    object
 4   views       500 non-null    int64 
 5   sender      500 non-null    int64 
 6   image_path  463 non-null    object
dtypes: int64(3), object(4)
memory usage: 27.5+ KB
None
         channel                                               text  views  \
0  @ZemenExpress  💥💥...................................💥💥\n\n📌Im...   1974   
1  @ZemenExpress  💥💥...................................💥💥\n\n📌 B...   3072   
2  @ZemenExpress                                                NaN   3147   
3  @ZemenExpress                                                NaN   3209   
4  @ZemenExpress                                                NaN   3175   

           

## Data Preprocessing

Run `src/preprocess.py` to preprocess the message text and save the result to `data/processed/telegram_data_final.csv`.

In [3]:
# Run preprocessing script
%run ../src/preprocess.py

# Load and inspect preprocessed data
df = pd.read_csv('../data/processed/telegram_data_final.csv')
print(df[['text', 'preprocessed_text']].head(5))

# Validate data
!python ../src/preprocess.py --validate --output ../data/processed/telegram_data_final.csv

2025-06-19 17:17:01,737 - INFO - Saved 500 preprocessed messages to ../data/processed/telegram_data_final.csv


                                                text  \
0  💥💥...................................💥💥\n\n📌Im...   
1  💥💥...................................💥💥\n\n📌 B...   
2                                                NaN   
3                                                NaN   
4                                                NaN   

                                   preprocessed_text  
0  Imitation Volcano Humidifier with LED Light በኤ...  
1  Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...  
2                                                NaN  
3                                                NaN  
4                                                NaN  


2025-06-19 17:17:02,130 - INFO - CSV validation passed
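
The checks behind `--validate` are not shown in this notebook. As a minimal sketch of what a CSV validation step might cover, assuming it verifies the schema and message identifiers: the function name `validate_csv` and the specific checks are hypothetical. The `text` and `preprocessed_text` columns appear in the output above, while `message_id` and `channel` are assumed to be carried over from the raw CSV.

```python
import csv

# Assumed schema; see the note above on which columns are confirmed.
REQUIRED_COLUMNS = {"message_id", "channel", "text", "preprocessed_text"}

def validate_csv(path):
    """Return a list of problems; an empty list means the file passed."""
    problems = []
    # newline="" lets the csv module handle newlines inside quoted text fields
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        seen = set()
        for row_no, row in enumerate(reader, start=1):
            message_id = row["message_id"] or ""
            key = (row["channel"], message_id)
            if not message_id.isdigit():
                problems.append(f"row {row_no}: message_id is not an integer")
            elif key in seen:
                problems.append(f"row {row_no}: duplicate message {key}")
            seen.add(key)
    return problems
```

Empty `text` cells are deliberately not reported, since the raw data shows that many messages carry no text.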


## Summary

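- Scraped 500 messages from the 5 configured channels; 238 contain text and 463 have an image path.
- Preprocessed the Amharic text for NER; rows without text remain `NaN` in `preprocessed_text`.
- Saved the processed data to `data/processed/telegram_data_final.csv`, which passed CSV validation.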
- Next: Task 2 (labeling in CoNLL format).