### Introduction
This notebook demonstrates the initial data ingestion step for the **Ethio Ecom NER Analytics** project: scraping e-commerce posts from Ethiopian Telegram channels. The goal is to collect posts from multiple channels into a dataset for Named Entity Recognition (NER) and vendor analytics. The scraper is a custom module built on the `Telethon` library and tailored to handle Amharic text and multimedia content.

---

### Scripts and Imports
- The `scripts.telegram_scraper` module contains the core scraping logic. The parent directory is appended to `sys.path` so that `scrape_channels` can be imported when the notebook runs from its own folder.

In [None]:
import sys
import os
sys.path.append(os.path.abspath(".."))
from scripts.telegram_scraper import scrape_channels

### Data Collection

The `scrape_channels` coroutine is awaited directly in the notebook cell to fetch messages from the listed Telegram channels. Image downloads are disabled (`download_images=False`) so that this first pass collects text only; they can be enabled later for multimodal analysis.

In [3]:
channels = ['@ethio_brand_collection', '@modernshoppingcenter', '@qnashcom', '@AwasMart', '@maedbet', '@ZemenExpress']
await scrape_channels(channels, download_images=False)

Attempt 1 at connecting failed: TimeoutError: 
Attempt 2 at connecting failed: TimeoutError: 


Signed in successfully as ｲ乇刀丂ﾑ乇ﾘ; remember to not break the ToS or you will risk an account ban!
✅ Scraped data from @ethio_brand_collection
✅ Scraped data from @modernshoppingcenter
✅ Scraped data from @qnashcom
✅ Scraped data from @AwasMart
✅ Scraped data from @maedbet
✅ Scraped data from @ZemenExpress


### Explanation of the Output
The log shows two connection attempts timing out before the client connected and signed in, after which all six channels were scraped. The `scrape_channels` function:
- Connects to Telegram using credentials stored in a `.env` file (`TG_API_ID`, `TG_API_HASH`, `phone`).
- Iterates over each channel, fetching up to 10,000 messages per channel.
- Writes data to `../data/raw/telegram_data.csv` with columns: `Channel Title`, `Channel Username`, `ID`, `Message`, `Date`, `Media Path`.
- Skips image downloads (as `download_images=False`), storing only text and metadata.
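The CSV layout described above can be illustrated with a minimal sketch. It is not the code in `scripts.telegram_scraper`: the `write_rows` helper and the sample row are hypothetical, and only the column names come from this notebook.

```python
import csv
import io

# Column names as listed in this notebook; everything else here is illustrative.
COLUMNS = ["Channel Title", "Channel Username", "ID", "Message", "Date", "Media Path"]


def write_rows(handle, rows):
    """Write message dicts as CSV rows under the COLUMNS header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical sample row. "Media Path" is empty because images are skipped.
sample = [{
    "Channel Title": "Example Shop",
    "Channel Username": "@example_shop",
    "ID": 1,
    "Message": "ዋጋ 1500 ብር",
    "Date": "2024-01-01T00:00:00+00:00",
    "Media Path": "",
}]

# An in-memory buffer keeps the sketch self-contained. For a real file, open it
# with encoding="utf-8" and newline="" so Amharic text and line endings survive.
buffer = io.StringIO()
write_rows(buffer, sample)
print(buffer.getvalue().splitlines()[0])  # header line only
```

Writing through `csv.DictWriter` rather than joining strings by hand means messages that contain commas, quotes, or line breaks are quoted correctly.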

### Insights
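- **Connection Failures:** Both `TimeoutError` messages were raised while connecting, before sign-in, so network instability is the more likely cause; Telegram rate limiting normally surfaces in Telethon as a `FloodWaitError` instead. The client retried on its own and connected on a later attempt. Consider wrapping the scrape in retry logic with exponential backoff in future iterations.
- **Success Rate:** All six channels were scraped successfully after login, confirming that the script works end to end.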
- **Data Storage:** Data is saved in a structured CSV, ready for preprocessing and NER labeling tasks outlined in the project.
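The retry suggestion above could look roughly like the sketch below. The `retry_with_backoff` helper, its parameters, and its defaults are assumptions for illustration and are not part of `scripts.telegram_scraper`; it retries only on timeouts and leaves every other exception untouched.

```python
import asyncio
import random


async def retry_with_backoff(make_call, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Await make_call() until it succeeds, doubling the wait after each timeout.

    make_call is a zero-argument callable that returns a fresh awaitable on
    every invocation (a coroutine object cannot be awaited twice).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except (TimeoutError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last timeout
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter avoids retrying in lockstep with other clients.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In a notebook cell this might be used as `await retry_with_backoff(lambda: scrape_channels(channels, download_images=False))`. Whether a whole scrape is safe to repeat depends on how `scrape_channels` writes its output, which this notebook does not show.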

---

### Next Steps
1. **Data Validation:** Verify the CSV content for completeness and Amharic text integrity.
2. **Preprocessing:** Tokenize and normalize the scraped text in a separate notebook.
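A first pass at the validation step might resemble the following sketch. The `summarize` and `has_ethiopic` helpers are hypothetical, and the checks are deliberately simple: expected columns present, empty messages counted, messages containing Ethiopic-script characters (Unicode block U+1200 to U+137F) counted, and messages containing the Unicode replacement character flagged as possibly mis-decoded.

```python
import csv
import io

# Column names as listed in this notebook.
EXPECTED_COLUMNS = ["Channel Title", "Channel Username", "ID", "Message", "Date", "Media Path"]


def has_ethiopic(text):
    """Return True if text has at least one character in U+1200..U+137F."""
    return any("\u1200" <= ch <= "\u137f" for ch in text)


def summarize(handle):
    """Return simple completeness and encoding counts for a scraped CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    stats = {
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in header],
        "rows": 0,
        "empty_messages": 0,
        "ethiopic_messages": 0,
        "replacement_chars": 0,
    }
    for row in reader:
        message = row.get("Message") or ""
        stats["rows"] += 1
        if not message.strip():
            stats["empty_messages"] += 1
        if has_ethiopic(message):
            stats["ethiopic_messages"] += 1
        if "\ufffd" in message:  # U+FFFD usually signals a decoding problem
            stats["replacement_chars"] += 1
    return stats


# Tiny in-memory example; the two data rows are made up.
demo = io.StringIO(
    "Channel Title,Channel Username,ID,Message,Date,Media Path\n"
    "Example Shop,@example_shop,1,ዋጋ 1500 ብር,2024-01-01,\n"
    "Example Shop,@example_shop,2,,2024-01-02,\n"
)
stats = summarize(demo)
print(stats["rows"], stats["empty_messages"], stats["ethiopic_messages"])
```

For the real file, the handle would come from `open("../data/raw/telegram_data.csv", encoding="utf-8", newline="")`. A low share of Ethiopic messages is not an error in itself, since vendors may also post in English or Latin-script Amharic.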

This notebook serves as the foundation for the data ingestion pipeline, setting the stage for downstream NER model fine-tuning and vendor analytics.