# 🛒 Task 1 – Data Ingestion & Preprocessing  
📘 Version: 2025-06-24

Programmatic data scraping and preprocessing of Amharic e-commerce posts from Telegram. This notebook connects to public Telegram channels, extracts messages and metadata (e.g., views, timestamps), and performs text normalization to prepare a clean dataset for Named Entity Recognition (NER) labeling.

---

**Challenge:** B5W4 – Amharic E-Commerce Data Extractor  
**Company:** EthioMart (Telegram E-Commerce Aggregator)  
**Author:** Nabil Mohamed  
**Branch:** `task-1-ingestion-cleaning`  
**Date:** June 2025  

---

### 📌 This notebook covers:
- API connection to 5+ Amharic Telegram vendor channels
- Ingestion of messages, views, and timestamps
- Basic Amharic-friendly text normalization and filtering
- Structured saving of cleaned messages for Task 2 labeling
- Output saved to: `data/cleaned/telegram_messages.csv`


In [3]:
# ------------------------------------------------------------------------------
# 🛠 Ensure Notebook Runs from Project Root (for src/ imports to work)
# ------------------------------------------------------------------------------

import os
import sys

# If running from /notebooks/, move up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("📂 Changed working directory to project root")

# Add project root to sys.path so `src/` modules can be imported
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✅ Added to sys.path: {project_root}")

# Optional: verify file presence to confirm we're in the right place
expected_path = "data/raw"
print(
    "📁 Output path ready"
    if os.path.exists(expected_path)
    else f"⚠️ Output path not found: {expected_path}"
)

📂 Changed working directory to project root
✅ Added to sys.path: c:\Users\admin\Documents\GIT Repositories\b5w4-amharic-ecommerce-data-extractor-challenge
📁 Output path ready


## 📦 Imports & Environment Setup

This cell loads core libraries required for data ingestion, text cleaning, and structured saving. The imports are grouped by function:

- **Data handling**: `pandas` for tabular processing, `re` for text cleaning, `datetime` for timestamp formatting  
- **Telegram scraping**: `telethon` for connecting to public Telegram channels and retrieving post history  
- **Environment management**: `dotenv` to securely load API credentials from a `.env` file  

These tools form the backbone of your ingestion pipeline and will be reused throughout the notebook.


In [None]:
# ------------------------------------------------------------------------------
# 📦 Core Imports – Data Ingestion, Cleaning, and Saving
# ------------------------------------------------------------------------------

# Standard Library
import os  # File and path handling
import re  # Regex for text normalization
from datetime import datetime  # Timestamp formatting
import warnings  # Suppress benign warnings

# Core Analysis
import pandas as pd  # Structured data handling

# Optional: tidy up output
warnings.filterwarnings("ignore")

## 📡 Telegram API Client Initialization

This section sets up a secure connection to Telegram using the `Telethon` library. API credentials (`TELEGRAM_API_ID`, `TELEGRAM_API_HASH`) are loaded from a `.env` file.

The script checks for missing credentials and handles connection errors gracefully. If successful, a `TelegramClient` is initialized under the session name `"ethio_ingestor"`, ready to fetch message histories from e-commerce channels.


In [13]:
from telethon.sync import TelegramClient
from dotenv import load_dotenv

# Load API credentials
load_dotenv()

api_id = os.getenv("TELEGRAM_API_ID")
api_hash = os.getenv("TELEGRAM_API_HASH")

# Verify that credentials are present
if not api_id or not api_hash:
    raise ValueError("❌ API credentials not found. Please check your .env file.")

# Initialize the Telegram client
try:
    client = TelegramClient("ethio_ingestor", api_id, api_hash)
    print("✅ Telegram client initialized.")
except Exception as e:
    print("❌ Failed to initialize Telegram client.")
    print("Error:", e)

✅ Telegram client initialized.


## 🧲 Channel Selection & Message Scraping

This section defines a list of target Telegram vendor channels and fetches recent messages from each using the `GetHistoryRequest` method.

For each message, the script extracts:
- Raw text (`message`)
- View count (`views`)
- Timestamp (`date`)
- Channel name

Results are stored in a structured format (list of dictionaries), which will later be converted into a pandas DataFrame for cleaning and analysis.


## 🔐 Step 1: Request Telegram Login Code

To authorize your client, you'll need to log in using your Telegram phone number. This step sends a verification code to your Telegram app (not via SMS).

**Instructions:**
- Enter your phone number in international format (e.g., `+2519XXXXXXXX`)
- Telegram will send you a 5-digit code via your **Telegram app** (not SMS)
- You'll use that code in the next step to complete the login


In [15]:
# ------------------------------------------------------------------------------
# 🔐 Request Login Code from Telegram
# ------------------------------------------------------------------------------

# Replace with your phone number in international format
phone_number = "+251711029700"

# Send the code to your Telegram app
await client.send_code_request(phone_number)

print("✅ Code sent. Please check your Telegram app for the verification code.")


✅ Code sent. Please check your Telegram app for the verification code.


Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host
Attempt 1 at connecting failed: TimeoutError: 
Attempt 2 at connecting failed: ConnectionAbortedError: [Errno 10053] Connect call failed ('149.154.167.91', 443)


## ✅ Step 2: Sign In Using the Code

Once you've received your 5-digit verification code from the Telegram app:

1. Paste the code into the next code cell (replace `'12345'`)
2. Run the cell to complete the sign-in
3. This session will be cached, so you won’t need to do this again unless you delete your `.session` file


In [16]:
# ------------------------------------------------------------------------------
# ✅ Sign In with Verification Code
# ------------------------------------------------------------------------------

# Replace with the code sent to your Telegram app
verification_code = "31373"

# Complete sign-in
await client.sign_in(phone_number, code=verification_code)

print("🎉 Authorization successful. You're now logged in.")

🎉 Authorization successful. You're now logged in.


In [22]:
from telethon.tl.functions.messages import GetHistoryRequest
from pathlib import Path
import pandas as pd

# Ensure client is connected
await client.connect()

if not await client.is_user_authorized():
    print("🔐 You're not authorized. This client may require login with code.")

# Define channels and fetch limits
channel_usernames = [
    "ZemenExpress",
    "Shageronlinestore",
    "Leyueqa",
    "marakibrand",
    "MerttEka",
    "Fashiontera",
    "nevacomputer",
    "ethio_brand_collection",
    "Shewabrand",
    "sinayelj",
]

total_limit = 500  # Number of messages to fetch per channel
batch_size = 100  # Telegram max per request
all_messages = []  # Container for results

# Iterate over each vendor channel
for username in channel_usernames:
    offset_id = 0
    collected = 0

    while collected < total_limit:
        try:
            entity = await client.get_entity(username)
            history = await client(
                GetHistoryRequest(
                    peer=entity,
                    limit=batch_size,
                    offset_date=None,
                    offset_id=offset_id,
                    max_id=0,
                    min_id=0,
                    add_offset=0,
                    hash=0,
                )
            )

            messages = history.messages
            if not messages:
                break  # No more to fetch

            for msg in messages:
                if msg.message:
                    all_messages.append(
                        {
                            "channel": username,
                            "message": msg.message,
                            "views": msg.views,
                            "timestamp": msg.date.isoformat(),
                        }
                    )

            offset_id = messages[-1].id
            collected += len(messages)
            print(f"✅ {username}: Collected {collected}/{total_limit}")

        except Exception as e:
            print(f"❌ Failed to fetch from {username} — {str(e)}")
            break

print(f"\n🎯 Total messages scraped: {len(all_messages)}")

# Save to CSV
output_path = Path("data/raw/telegram_messages_raw.csv")
df = pd.DataFrame(all_messages)
df.dropna(subset=["message"], inplace=True)
df.to_csv(output_path, index=False, encoding="utf-8-sig")

print(f"📁 Messages saved to: {output_path.resolve()}")

✅ ZemenExpress: Collected 100/500
✅ ZemenExpress: Collected 200/500
✅ ZemenExpress: Collected 300/500
✅ ZemenExpress: Collected 400/500
✅ ZemenExpress: Collected 500/500
✅ Shageronlinestore: Collected 100/500
✅ Shageronlinestore: Collected 200/500
✅ Shageronlinestore: Collected 300/500
✅ Shageronlinestore: Collected 400/500
✅ Shageronlinestore: Collected 500/500
✅ Leyueqa: Collected 100/500
✅ Leyueqa: Collected 200/500
✅ Leyueqa: Collected 300/500
✅ Leyueqa: Collected 400/500
✅ Leyueqa: Collected 500/500
✅ marakibrand: Collected 100/500
✅ marakibrand: Collected 200/500
✅ marakibrand: Collected 300/500
✅ marakibrand: Collected 400/500
✅ marakibrand: Collected 500/500
✅ MerttEka: Collected 100/500
✅ MerttEka: Collected 200/500
✅ MerttEka: Collected 300/500
✅ MerttEka: Collected 400/500
✅ MerttEka: Collected 500/500
✅ Fashiontera: Collected 100/500
✅ Fashiontera: Collected 200/500
✅ Fashiontera: Collected 300/500
✅ Fashiontera: Collected 400/500
✅ Fashiontera: Collected 500/500
✅ nevacomp

## 🧼 Clean & Normalize Telegram Messages

This step prepares raw messages for Named Entity Recognition (NER) labeling by applying Amharic-specific text cleaning and formatting.

The cleaning logic includes:
- Removing emojis, special characters, and noisy symbols
- Retaining Amharic (`\u1200-\u137F`) characters, basic punctuation, and digits
- Lowercasing and normalizing whitespace

Cleaned messages are saved to `data/cleaned/telegram_messages_cleaned.csv` and are ready for manual annotation in CoNLL format (Task 2).


In [23]:
# ------------------------------------------------------------------------------
# 🧼 Clean and Normalize Telegram Messages for Labeling
# ------------------------------------------------------------------------------

import pandas as pd
import re
from pathlib import Path

# Load raw messages
raw_path = Path("data/raw/telegram_messages_raw.csv")
df = pd.read_csv(raw_path)


# Define Amharic-preserving cleaner
def clean_text(text):
    # Retain Amharic characters, basic Latin text, digits, and punctuation
    cleaned = re.sub(r"[^\u1200-\u137F፡።\dA-Za-z.,:!?\\s]", "", str(text))
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse excessive spacing
    return cleaned.strip().lower()


# Apply cleaning function
df["cleaned_message"] = df["message"].apply(clean_text)

# Save cleaned output
cleaned_path = Path("data/cleaned/telegram_messages_cleaned.csv")
df.to_csv(cleaned_path, index=False, encoding="utf-8-sig")

print(f"✅ Cleaned messages saved to: {cleaned_path.resolve()}")

✅ Cleaned messages saved to: C:\Users\admin\Documents\GIT Repositories\b5w4-amharic-ecommerce-data-extractor-challenge\data\cleaned\telegram_messages_cleaned.csv


## 🏷️ Select Messages for CoNLL Labeling

To begin manual annotation, we’ll extract a representative sample of cleaned messages.  
These messages will be saved in a plain text `.txt` file where each line is one message — making it easier to tokenize and label manually.

We'll target 30–50 diverse examples that are rich in product names, prices, and locations.


In [24]:
# ------------------------------------------------------------------------------
# 🏷️ Sample Messages for Manual NER Labeling (CoNLL Format)
# ------------------------------------------------------------------------------

import pandas as pd
from pathlib import Path

# Load cleaned messages
cleaned_path = Path("data/cleaned/telegram_messages_cleaned.csv")
df = pd.read_csv(cleaned_path)

# Drop any duplicates and short/empty messages
df = df.drop_duplicates(subset="cleaned_message")
df = df[df["cleaned_message"].str.len() > 10]

# Sample 50 candidate messages (or fewer if limited)
sampled = df.sample(n=min(50, len(df)), random_state=42)

# Output path
sample_path = Path("data/labeled/candidate_messages_for_labeling.txt")
sample_path.parent.mkdir(parents=True, exist_ok=True)

# Save to plain text format (one message per line)
sampled["cleaned_message"].to_csv(sample_path, index=False, header=False)

print(f"📝 Sampled messages saved for labeling at: {sample_path.resolve()}")
sampled["cleaned_message"].head()

📝 Sampled messages saved for labeling at: C:\Users\admin\Documents\GIT Repositories\b5w4-amharic-ecommerce-data-extractor-challenge\data\labeled\candidate_messages_for_labeling.txt


1178    nikeairforcemadeinvietnamsize41,42price3900fre...
872     2in1eggslicerየተቀቀለእንቁላልናድንችመሰንጠቂያሁለትአይነትአቆራረጥፅ...
2003    ስለዚህምበመስቀልላይሳለከጥንትጀምሮጌታበመቃብርያሉሙታንንሁሉአስነሣሃይማኖተአ...
2200    niketechherasize4041424344madeinvietnamshewabr...
843     givenchysize404142434445price:8000brfreedelive...
Name: cleaned_message, dtype: object