# Task 1: Data Ingestion and Data Preprocessing

This notebook demonstrates the process of collecting and preparing Amharic e-commerce data from multiple Ethiopian-based Telegram channels. The workflow includes channel selection, data scraping, preprocessing, and storage for downstream entity extraction tasks.

## 1. Channel Selection

We select at least 5 active Ethiopian e-commerce Telegram channels to maximize data diversity for fine-tuning. Example channels:
- @EthiopianDeals
- @AddisMarket
- @ShegerBargains
- @EthioShop
- @BahirDarBazaar

(Replace with actual channel usernames as needed.)
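The list above writes usernames with a leading `@`, while the fetch code in section 2.2 passes them without it. A minimal sketch of a helper that accepts either form follows. The function name `clean_channel_name` is hypothetical, and the pattern encodes an assumed rule for public usernames (5 to 32 characters made of letters, digits, and underscores), so adjust it if your channels differ.

```python
import re

# Assumed rule for public usernames: 5 to 32 letters, digits, or underscores.
USERNAME_PATTERN = re.compile(r'[A-Za-z0-9_]{5,32}')

def clean_channel_name(raw):
    """Strip whitespace and a leading '@', then validate what remains."""
    name = raw.strip().lstrip('@')
    if not USERNAME_PATTERN.fullmatch(name):
        raise ValueError(f'Not a valid channel username: {raw!r}')
    return name

# Example channel names taken from the list above
channels = [clean_channel_name(c) for c in ['@EthiopianDeals', 'AddisMarket', ' @EthioShop ']]
print(channels)
```

Validating names up front means a typo fails immediately instead of partway through a long scraping run.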

## 2. Telegram Scraper Setup

We use a custom scraper built on Telethon (Pyrogram is a comparable alternative) to connect to Telegram and fetch each channel's message history, including attached images and documents.

In [2]:
# Install required libraries (if not already installed)
# python-dotenv provides the `dotenv` module used in section 2.1
!pip install telethon pandas python-dotenv

Collecting pandas
  Using cached pandas-2.3.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Using cached numpy-2.3.1-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.3.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 M

In [3]:
# Import libraries
# The notebook awaits client calls directly, so the asynchronous client is imported
from telethon import TelegramClient
import pandas as pd
import os

### 2.1. Connect to Telegram API

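Set up the Telegram client using your API credentials. You need a valid API ID and API hash from https://my.telegram.org. The cell below reads them, together with your phone number, from the environment variables `TELEGRAM_API_ID`, `TELEGRAM_API_HASH`, and `TELEGRAM_PHONE`, which `load_dotenv()` loads from a local `.env` file.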

In [1]:
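import os
from telethon import TelegramClient
from dotenv import load_dotenv

# Read variables from a local .env file into the process environment
load_dotenv()

# Load credentials from environment variables
api_id = int(os.getenv('TELEGRAM_API_ID'))  # the API ID is an integer
api_hash = os.getenv('TELEGRAM_API_HASH')
phone = os.getenv('TELEGRAM_PHONE')

# 'session_name' names the local session file, which Telethon reuses on later runs
client = TelegramClient('session_name', api_id, api_hash)
# On first sign-in, start() prompts for the login code sent to the phone
await client.start(phone)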

Attempt 1 at connecting failed: TimeoutError: 
Attempt 2 at connecting failed: TimeoutError: 
Invalid code. Please try again.
Invalid code. Please try again.


Signed in successfully as Ashe; remember to not break the ToS or you will risk an account ban!


<telethon.client.telegramclient.TelegramClient at 0x7764d88e56a0>
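Because a missing or malformed credential only surfaces once the client tries to connect, it can help to validate the environment variables first. A minimal sketch, assuming the same three variable names as the cell above; the helper name `load_telegram_credentials` is hypothetical, and the values in the example are placeholders, not real credentials.

```python
import os

def load_telegram_credentials(env=os.environ):
    """Read and validate Telegram credentials from a mapping of environment variables."""
    required = ['TELEGRAM_API_ID', 'TELEGRAM_API_HASH', 'TELEGRAM_PHONE']
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    if not env['TELEGRAM_API_ID'].isdigit():
        raise ValueError('TELEGRAM_API_ID must be an integer')
    return int(env['TELEGRAM_API_ID']), env['TELEGRAM_API_HASH'], env['TELEGRAM_PHONE']

# Placeholder values for illustration only
example_env = {
    'TELEGRAM_API_ID': '123456',
    'TELEGRAM_API_HASH': '0123456789abcdef0123456789abcdef',
    'TELEGRAM_PHONE': '+251900000000',
}
api_id, api_hash, phone = load_telegram_credentials(example_env)
print(api_id, phone)
```

In the notebook, calling `load_telegram_credentials()` with no argument after `load_dotenv()` would read from the real environment.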

### 2.2. Fetch Messages from Selected Channels

We fetch up to the 1,000 most recent messages from each selected channel and record the text, timestamp, view count, and any attached image or document.

In [5]:
from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument

channels = [
    'EthiopianDeals',
    'AddisMarket',
    'ShegerBargains',
    'EthioShop',
    'BahirDarBazaar'
]

raw_data = []

for channel in channels:
    async for message in client.iter_messages(channel, limit=1000):
        data = {
            'channel': channel,
            'text': message.text,
            'timestamp': message.date,
            'views': message.views,
            'image_url': None,
            'document_url': None
        }
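        # Save attachments locally; download_media returns the path of the saved file.
        # The target directories are assumed locations and are created if missing.
        if isinstance(message.media, MessageMediaPhoto):
            os.makedirs('data/raw/photos', exist_ok=True)
            data['image_url'] = await message.download_media(file='data/raw/photos/')
        elif isinstance(message.media, MessageMediaDocument):
            os.makedirs('data/raw/documents', exist_ok=True)
            data['document_url'] = await message.download_media(file='data/raw/documents/')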
        raw_data.append(data)

NameError: name 'client' is not defined
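The `NameError` above means `client` was not defined when this cell ran, so the connection cell in section 2.1 has to be executed first in the same kernel session. To check the record-building logic without a live connection, the mapping from a message to a dictionary can be moved into a plain function. This is a minimal sketch: `message_to_record`, `has_photo`, and `has_document` are hypothetical names, and the stand-in object only mimics the attribute names a Telethon message exposes.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

def message_to_record(channel, message):
    """Map one message-like object to a flat dictionary."""
    return {
        'channel': channel,
        'text': getattr(message, 'text', None) or '',  # media-only posts have no text
        'timestamp': getattr(message, 'date', None),
        'views': getattr(message, 'views', None),
        'has_photo': getattr(message, 'photo', None) is not None,
        'has_document': getattr(message, 'document', None) is not None,
    }

# Stand-in for a fetched message, with illustrative values only
fake_message = SimpleNamespace(
    text=None,
    date=datetime(2025, 1, 1, tzinfo=timezone.utc),
    views=10,
    photo=object(),
    document=None,
)
record = message_to_record('EthioShop', fake_message)
print(record)
```

Inside the fetch loop, `raw_data.append(message_to_record(channel, message))` would then replace the inline dictionary, with media downloads handled separately.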

## 3. Data Preprocessing

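We preprocess the collected text by collapsing line breaks and repeated whitespace and by replacing the Ethiopic word separator (፡), the Ethiopic full stop (።), and the ASCII colon with spaces. The cell below does not yet tokenize the text.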

In [None]:
import re

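def normalize_amharic_text(text):
    """Return text with line breaks, Ethiopic punctuation, and extra spaces removed."""
    # Messages without text (for example, media-only posts) become empty strings
    if not text:
        return ""
    # Flatten multi-line messages into a single line
    text = text.replace('\n', ' ').replace('\r', ' ')
    # Replace the Ethiopic word separator, Ethiopic full stop, and ASCII colon with spaces
    text = re.sub(r'[፡።:]', ' ', text)
    # Collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Normalize every collected message in place
for item in raw_data:
    item['text'] = normalize_amharic_text(item['text'])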
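Beyond the punctuation and whitespace cleanup above, Amharic has several character series that are pronounced identically, so the same word can appear under different spellings. A minimal sketch of folding those variants onto one series and then splitting on whitespace follows. The choice of target series is an assumed convention, the names `fold_homophones` and `tokenize` are hypothetical, and only the seven basic vowel orders of each series are covered.

```python
# Each pair is (first code point of the variant series, first code point of the target series).
# Assumed convention: fold ሐ and ኀ into ሀ, ሠ into ሰ, ዐ into አ, and ፀ into ጸ.
SERIES_PAIRS = [
    (0x1210, 0x1200),  # ሐ -> ሀ
    (0x1280, 0x1200),  # ኀ -> ሀ
    (0x1220, 0x1230),  # ሠ -> ሰ
    (0x12D0, 0x12A0),  # ዐ -> አ
    (0x1340, 0x1338),  # ፀ -> ጸ
]

# Translation table covering the seven basic vowel orders of each series
HOMOPHONE_TABLE = {
    src + order: dst + order
    for src, dst in SERIES_PAIRS
    for order in range(7)
}

def fold_homophones(text):
    """Replace variant characters with their target-series equivalents."""
    return text.translate(HOMOPHONE_TABLE)

def tokenize(text):
    """Split normalized text on whitespace."""
    return text.split()

sample = 'ፀሐይ 1000 ብር'
tokens = tokenize(fold_homophones(sample))
print(tokens)
```

Whitespace splitting is only a baseline. A model-specific tokenizer would normally take over at fine-tuning time, but folding spelling variants first keeps equivalent spellings from being counted as different words.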

## 4. Structuring and Saving the Data

We load the records into a DataFrame in which each row holds the message text alongside its metadata (channel, timestamp, views, and media paths), then save it as JSON and CSV for further analysis.

In [None]:
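# One row per message, one column per field
df = pd.DataFrame(raw_data)
os.makedirs('data/raw', exist_ok=True)
# force_ascii=False keeps Amharic characters readable instead of \u escapes
df.to_json('data/raw/telegram_data.json', orient='records', force_ascii=False)
# utf-8-sig writes a byte order mark so spreadsheet tools detect the encoding
df.to_csv('data/raw/telegram_data.csv', index=False, encoding='utf-8-sig')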
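If message content and metadata need to live in separate tables, one possible approach is to split each record and link the two halves with a shared key. A minimal sketch using only plain dictionaries; `split_records`, `CONTENT_FIELDS`, and the `row_id` key are hypothetical names, and the sample records are illustrative.

```python
# Fields treated as message content; everything else counts as metadata
CONTENT_FIELDS = ('text',)

def split_records(records):
    """Split each record into a content row and a metadata row sharing a row_id."""
    contents, metadata = [], []
    for row_id, record in enumerate(records):
        content_row = {'row_id': row_id}
        meta_row = {'row_id': row_id}
        for key, value in record.items():
            target = content_row if key in CONTENT_FIELDS else meta_row
            target[key] = value
        contents.append(content_row)
        metadata.append(meta_row)
    return contents, metadata

sample_records = [
    {'channel': 'EthioShop', 'text': 'ዋጋ 1000 ብር', 'views': 120},
    {'channel': 'AddisMarket', 'text': '', 'views': 45},
]
contents, metadata = split_records(sample_records)
print(contents)
print(metadata)
```

Each list can then be passed to `pd.DataFrame` and saved to its own file, with `row_id` available for joining them back together.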

## 5. Summary

- Selected five Ethiopian e-commerce Telegram channels and connected to the Telegram API.
- Fetched up to 1,000 recent messages per channel, including attached images and documents.
- Preprocessed Amharic text for downstream tasks.
- Saved structured data for further analysis and entity extraction.