# Task 1: Data Ingestion and Preprocessing

This notebook implements Task 1 of the Amharic E-commerce Data Extractor project. It fetches messages from Ethiopian Telegram e-commerce channels, preprocesses the text, and stores the results as CSV files.

## Objectives
- Scrape messages from 5 Telegram channels (@ZemenExpress, @nevacomputer, @aradabrand2, @ethio_brand_collection, @modernshoppingcenter).
- Collect text, images, and metadata (message_id, timestamp, views, sender).
- Preprocess Amharic text (remove emojis, normalize currency).
- Save the raw data to `data/raw/telegram_data.csv` and the preprocessed data to `data/processed/telegram_data_final.csv`.

## Setup
- Requires `telethon`, `pandas`, `pyyaml`.
- Reads the Telegram API credentials (`api_id`, `api_hash`, `phone`) and the channel list from `config.yaml`.

In [3]:
# Install dependencies (for Colab or new environments)
!pip install telethon pandas pyyaml



In [8]:
# Import libraries
import yaml
import os
import pandas as pd
import re
from telethon.sync import TelegramClient

# Load configuration
with open('../config.yaml', 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)

api_id = config['telegram']['api_id']
api_hash = config['telegram']['api_hash']
phone = config['telegram']['phone']
channels = config['channels']

print(f"Channels: {channels}")

Channels: ['@ZemenExpress', '@nevacomputer', '@aradabrand2', '@ethio_brand_collection', '@modernshoppingcenter']
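The cell above reads four values from `config.yaml`: `telegram.api_id`, `telegram.api_hash`, `telegram.phone`, and `channels`. A missing key would otherwise surface as a bare `KeyError`, so a small check can report every problem at once. The following is a minimal sketch; `validate_config` and `REQUIRED_TELEGRAM_KEYS` are hypothetical names and are not part of the project code.

```python
# Hypothetical helper: checks the keys that the loading cell reads from config.yaml.
REQUIRED_TELEGRAM_KEYS = ("api_id", "api_hash", "phone")


def validate_config(config):
    """Return a list of problems found in a loaded config dict (empty if none)."""
    problems = []

    telegram = config.get("telegram")
    if not isinstance(telegram, dict):
        problems.append("missing 'telegram' section")
    else:
        for key in REQUIRED_TELEGRAM_KEYS:
            if not telegram.get(key):
                problems.append(f"missing telegram.{key}")

    channels = config.get("channels")
    if not isinstance(channels, list) or not channels:
        problems.append("'channels' must be a non-empty list")
    else:
        for name in channels:
            if not isinstance(name, str):
                problems.append(f"channel {name!r} should be a string")

    return problems
```

Calling `validate_config(config)` right after `yaml.safe_load` and stopping when the returned list is non-empty keeps credential mistakes from reaching the Telegram connection step.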


## Data Ingestion

Run `src/data_ingestion.py` to scrape messages from the configured channels and save them to `data/raw/telegram_data.csv`.
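The contents of `src/data_ingestion.py` are not shown in this notebook. As an illustration of how one scraped message might be flattened into a CSV row, here is a minimal sketch. The function name `message_to_row` and the exact column names are assumptions based on the metadata listed in the objectives. The attributes `id`, `date`, `views`, `sender_id`, and `message` are the ones Telethon exposes on its message objects. A `SimpleNamespace` stands in for a real message so the example runs without a Telegram connection.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def message_to_row(channel, message, image_path=None):
    """Flatten a Telethon-style message into a dict suitable for a CSV row.

    Hypothetical helper: the column names are assumed, not taken from the script.
    """
    date = getattr(message, "date", None)
    return {
        "channel": channel,
        "message_id": message.id,
        "timestamp": date.isoformat() if date else None,
        "views": getattr(message, "views", None),
        "sender": getattr(message, "sender_id", None),
        "text": getattr(message, "message", None) or "",
        "image_path": image_path,
    }


# Stand-in object with the same attribute names as a Telethon message.
fake_message = SimpleNamespace(
    id=42,
    date=datetime(2025, 6, 19, 12, 58, tzinfo=timezone.utc),
    views=120,
    sender_id=None,
    message="ዋጋ 1500 ብር",
)
row = message_to_row("@ZemenExpress", fake_message)
```

With a connected client, the same function could be applied to each message yielded by `client.iter_messages(channel, limit=...)`, and the resulting dicts collected into a `pandas.DataFrame` before writing the CSV.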

In [None]:
# Run data ingestion script
%run ../src/data_ingestion.py

# Load and inspect raw data
df = pd.read_csv('../data/raw/telegram_data.csv')
df.info()  # info() prints its own summary and returns None, so no print() is needed
print(df[['channel', 'text', 'views', 'image_path']].head(5))

2025-06-19 12:58:18,068 - INFO - Connecting to 149.154.167.51:443/TcpFull...
2025-06-19 12:58:19,548 - INFO - Connection to 149.154.167.51:443/TcpFull complete!


## Data Preprocessing

Run `src/preprocess.py` to preprocess text and save to `data/processed/telegram_data_final.csv`.
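The two preprocessing steps named in the objectives, emoji removal and currency normalization, can be sketched as follows. This is not the code in `src/preprocess.py`, which is not shown here. The emoji ranges, the choice of `ETB` as the normalized currency label, and the names `remove_emojis`, `normalize_currency`, and `preprocess_text` are all assumptions made for illustration.

```python
import re

# Assumed emoji coverage: common pictograph blocks, dingbats, flags, and joiners.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0000FE0F\U0000200D"   # variation selector-16 and zero-width joiner
    "]+"
)

# An amount followed by a currency word, for example "1,500 ብር" or "2500birr".
CURRENCY_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(?:ብር|birr|etb)\b", re.IGNORECASE
)


def remove_emojis(text):
    """Replace each run of emoji characters with a single space."""
    return EMOJI_PATTERN.sub(" ", text)


def normalize_currency(text):
    """Rewrite amounts as '<number> ETB', dropping thousands separators."""
    return CURRENCY_PATTERN.sub(
        lambda m: f"{m.group(1).replace(',', '')} ETB", text
    )


def preprocess_text(text):
    """Remove emojis, normalize currency, and collapse whitespace."""
    text = remove_emojis(text)
    text = normalize_currency(text)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to a column, this would look like `df['preprocessed_text'] = df['text'].fillna('').map(preprocess_text)`. Mapping every spelling of the currency to one label keeps price mentions consistent for the later NER labeling step.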

In [None]:
# Run preprocessing script
%run ../src/preprocess.py

# Load and inspect preprocessed data
df = pd.read_csv('../data/processed/telegram_data_final.csv')
print(df[['text', 'preprocessed_text']].head(5))

# Validate data
!python ../src/preprocess.py --validate --output ../data/processed/telegram_data_final.csv

## Summary

- Scraped messages from 5 channels.
- Preprocessed Amharic text for NER.
- Data saved to `data/processed/telegram_data_final.csv`.
- Next: Task 2 (labeling in CoNLL format).