# Task 1 — Telegram Scraping

This notebook imports and runs the existing Telegram scraper at `src/scraper.py`.

Outputs written by the scraper:
- `data/raw/telegram_messages/YYYY-MM-DD/<channel>.json`
- `data/raw/images/<channel>/*` (only if `DOWNLOAD_MEDIA` is enabled)
- `data/raw/csv/YYYY-MM-DD/telegram_data.csv`
- `data/raw/scrape_state.json`
- `logs/scrape_YYYY-MM-DD.log`
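
As a rough illustration of this layout, the sketch below builds the expected output paths for one channel and one day. The helper name `expected_outputs` is hypothetical, and treating the date partition as an ISO-formatted day is an assumption; the real paths are defined by `src/scraper.py`.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path


def expected_outputs(root: Path, channel: str, day: date) -> dict[str, Path]:
    """Hypothetical helper: build the output paths listed above for one channel and day."""
    stamp = day.isoformat()  # YYYY-MM-DD
    raw = root / "data" / "raw"
    return {
        "messages_json": raw / "telegram_messages" / stamp / f"{channel}.json",
        "images_dir": raw / "images" / channel,
        "csv": raw / "csv" / stamp / "telegram_data.csv",
        "state": raw / "scrape_state.json",
        "log": root / "logs" / f"scrape_{stamp}.log",
    }


paths = expected_outputs(Path("."), "tikvahpharma", date(2026, 1, 15))
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```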

In [1]:
# Environment & Imports
from __future__ import annotations

import sys
from pathlib import Path

In [2]:

# We are in: <project_root>/notebooks/task1_telegram_scraping.ipynb
# So we search upwards to find <project_root>/src/scraper.py
NOTEBOOK_CWD: Path = Path.cwd().resolve()


def find_project_root(start: Path) -> Path:
    for p in [start, *start.parents]:
        if (p / "src" / "scraper.py").exists():
            return p
    raise FileNotFoundError(
        "Could not locate project root containing src/scraper.py. "
        "Open this notebook from somewhere inside the repo."
    )


PROJECT_ROOT: Path = find_project_root(NOTEBOOK_CWD)

# Make `src` importable as a package
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("NOTEBOOK_CWD:", NOTEBOOK_CWD)
print("PROJECT_ROOT:", PROJECT_ROOT)
print("sys.path[0]:", sys.path[0])
print("Python:", sys.executable)



NOTEBOOK_CWD: D:\Python\Week 8\Shipping-a-Data-Product\notebooks
PROJECT_ROOT: D:\Python\Week 8\Shipping-a-Data-Product
sys.path[0]: D:\Python\Week 8\Shipping-a-Data-Product
Python: d:\Python\Week 8\Shipping-a-Data-Product\.venv\Scripts\python.exe


In [3]:
# Import scraper and show config
from src import scraper

print("Imported scraper from:", scraper.__file__)

# Show important config (read at import time from env vars)
for name in [
    "MAX_MESSAGES_PER_CHANNEL",
    "LOOKBACK_DAYS",
    "DOWNLOAD_MEDIA",
    "MAX_MEDIA_PER_CHANNEL",
    "PER_CHANNEL_TIMEOUT_SEC",
    "STATE_PATH",
    "DATA_LAKE_DIR",
    "IMAGES_DIR",
    "CSV_DIR",
]:
    print(f"{name} = {getattr(scraper, name, None)}")


Imported scraper from: D:\Python\Week 8\Shipping-a-Data-Product\src\scraper.py
MAX_MESSAGES_PER_CHANNEL = 300
LOOKBACK_DAYS = 14
DOWNLOAD_MEDIA = True
MAX_MEDIA_PER_CHANNEL = 50
PER_CHANNEL_TIMEOUT_SEC = 180
STATE_PATH = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\scrape_state.json
DATA_LAKE_DIR = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\telegram_messages
IMAGES_DIR = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\images
CSV_DIR = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\csv


## Run the scraper

In Jupyter, run the async entrypoint with `await scraper.main()`, as in the next cell.
Do **not** call `python src/scraper.py` from inside the notebook.


In [4]:
await scraper.main()


2026-01-15 19:04:14,765 | INFO | === Starting Telegram scraping ===
2026-01-15 19:04:14,769 | INFO | Connecting to 149.154.167.91:443/TcpFull...
2026-01-15 19:04:14,916 | INFO | Connection to 149.154.167.91:443/TcpFull complete!
2026-01-15 19:04:16,074 | INFO | Scraping @CheMed123 (last_id=97)
2026-01-15 19:04:16,565 | INFO | Scraping @lobelia4cosmetics (last_id=22887)
2026-01-15 19:04:17,013 | INFO | Scraping @rayapharmaceuticals (last_id=0)
2026-01-15 19:04:17,470 | INFO | Scraping @tikvahpharma (last_id=188947)
2026-01-15 19:04:18,281 | INFO | Disconnecting from 149.154.167.91:443/TcpFull...
2026-01-15 19:04:18,282 | INFO | Disconnection from 149.154.167.91:443/TcpFull complete!
2026-01-15 19:04:18,311 | INFO | === Scraping complete ===
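
The cell above can use a bare `await` because the Jupyter kernel already runs an asyncio event loop; calling `asyncio.run()` there would raise a `RuntimeError`. Below is a minimal sketch of a wrapper for running an async entrypoint from a plain script. It uses a hypothetical stand-in coroutine `demo_main` rather than the real `scraper.main`, and the wrapper name `run_entrypoint` is also an assumption.

```python
import asyncio


async def demo_main() -> str:
    """Stand-in for an async entrypoint such as scraper.main (hypothetical)."""
    await asyncio.sleep(0)
    return "done"


def run_entrypoint(coro_fn):
    """Run an async entrypoint from synchronous code when no loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script or terminal): asyncio.run is safe.
        return asyncio.run(coro_fn())
    # A loop is already running (e.g. inside Jupyter): the caller must await.
    raise RuntimeError("Event loop already running; use `await coro_fn()` instead.")


if __name__ == "__main__":
    print(run_entrypoint(demo_main))
```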


## Verification — State File

In [5]:
import json

# Per-channel scrape state written by the scraper
state_path = PROJECT_ROOT / "data" / "raw" / "scrape_state.json"

print("State exists:", state_path.exists(), "->", state_path)

if state_path.exists():
    state = json.loads(state_path.read_text(encoding="utf-8"))
    print("\nChannels in state:")
    for ch, v in (state.get("channels") or {}).items():
        print(f" {ch}: {v}")
else:
    print("No state file found yet. Run the scraper cell first.")

State exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\scrape_state.json

Channels in state:
 CheMed123: {'last_message_id': 97, 'updated_at': '2026-01-14T18:08:14.968611+00:00'}
 lobelia4cosmetics: 22887
 tikvahpharma: 188947
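
The output above shows two shapes for a channel entry: a dict with a `last_message_id` key and a bare integer. Code that reads the state file may want to handle both. The sketch below assumes only those two shapes exist; the helper name `last_message_id` is hypothetical and not part of `src/scraper.py`.

```python
from typing import Any


def last_message_id(entry: Any) -> int:
    """Return the last message ID from either entry shape; 0 if missing."""
    if isinstance(entry, dict):
        return int(entry.get("last_message_id") or 0)
    return int(entry or 0)


# Sample values copied from the state output above
channels = {
    "CheMed123": {"last_message_id": 97, "updated_at": "2026-01-14T18:08:14.968611+00:00"},
    "lobelia4cosmetics": 22887,
    "tikvahpharma": 188947,
}
normalized = {name: last_message_id(entry) for name, entry in channels.items()}
print(normalized)
```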


## Verification — JSON Partitions

In [6]:
data_dir = PROJECT_ROOT / "data" / "raw" / "telegram_messages"

print("Partitions dir exists:", data_dir.exists(), "->", data_dir)

if data_dir.exists():
    # Day partitions are named YYYY-MM-DD, so a reverse name sort is newest-first
    days = sorted([p for p in data_dir.iterdir() if p.is_dir()], reverse=True)
    print("\nNewest day partitions:")
    for d in days[:7]:
        files = sorted(d.glob("*.json"))
        print(f" {d.name}: {len(files)} json files")
        for f in files[:5]:
            print(" -", f.name)
else:
    print("No partitions found yet. Run the scraper cell first.")

Partitions dir exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\telegram_messages

Newest day partitions:
 2026-01-15: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-14: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-13: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-12: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-11: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-10: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
 2026-01-09: 2 json files
 - lobelia4cosmetics.json
 - tikvahpharma.json
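
The listing above shows two JSON files per day, while the scrape log names four channels. That may simply mean the other channels had no messages for those days, but it is worth checking. One way to spot such gaps is to compare an expected channel list against the files in a day partition. The sketch below runs against a temporary directory; the function name `missing_channels` and the hard-coded expected list are assumptions for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def missing_channels(day_dir: Path, expected: list[str]) -> list[str]:
    """Return expected channel names that have no <channel>.json in day_dir."""
    present = {p.stem for p in day_dir.glob("*.json")}
    return sorted(ch for ch in expected if ch not in present)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate one day partition with the two files seen in the listing above
    day_dir = Path(tmp) / "2026-01-15"
    day_dir.mkdir()
    for name in ("lobelia4cosmetics", "tikvahpharma"):
        (day_dir / f"{name}.json").write_text("[]", encoding="utf-8")

    # Channel names taken from the scrape log
    expected = ["CheMed123", "lobelia4cosmetics", "rayapharmaceuticals", "tikvahpharma"]
    gaps = missing_channels(day_dir, expected)
    print("Missing:", gaps)
```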


## Verification — CSV Backups

In [7]:
csv_dir = PROJECT_ROOT / "data" / "raw" / "csv"

print("CSV dir exists:", csv_dir.exists(), "->", csv_dir)

if csv_dir.exists():
    # One sub-folder per day, each expected to hold a single telegram_data.csv
    days = sorted([p for p in csv_dir.iterdir() if p.is_dir()], reverse=True)
    print("\nNewest CSV backups:")
    for d in days[:7]:
        csv_file = d / "telegram_data.csv"
        if csv_file.exists():
            print(f" {d.name}: telegram_data.csv ({csv_file.stat().st_size} bytes)")
        else:
            print(f" {d.name}: telegram_data.csv MISSING")
else:
    print("No CSV backups found yet. Run the scraper cell first.")

CSV dir exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\csv

Newest CSV backups:
 2026-01-15: telegram_data.csv (1668 bytes)
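
File size alone says little about the content. A minimal sketch for checking the header and row count with the standard `csv` module follows. It writes a small sample file first so that it runs on its own; the column names `channel` and `message_id` are placeholders, not the scraper's actual schema, and `summarize_csv` is a hypothetical helper.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


def summarize_csv(path: Path) -> tuple[list[str], int]:
    """Return the header row and the number of data rows in a CSV file."""
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    return header, rows


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder data standing in for a real telegram_data.csv
    sample = Path(tmp) / "telegram_data.csv"
    sample.write_text(
        "channel,message_id\ntikvahpharma,1\ntikvahpharma,2\n", encoding="utf-8"
    )
    header, rows = summarize_csv(sample)
    print("columns:", header, "rows:", rows)
```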


## Verification — Image Downloads 

In [8]:
images_dir = PROJECT_ROOT / "data" / "raw" / "images"

print("Images dir exists:", images_dir.exists(), "->", images_dir)

if images_dir.exists():
    # One sub-folder per channel
    chans = sorted([p for p in images_dir.iterdir() if p.is_dir()])
    print("\nImage folders:")
    for ch in chans:
        n = len(list(ch.glob("*")))
        print(f" {ch.name}: {n} files")
else:
    print("No images found yet. Run the scraper cell first (and enable media download).")

Images dir exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\images

Image folders:
 CheMed123: 69 files
 lobelia4cosmetics: 2513 files
 tikvahpharma: 5589 files
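
Note that `glob("*")` counts every entry in a folder, including any subdirectories or non-image files. A hedged variation that tallies only regular files, grouped by extension, is sketched below. It runs against a temporary folder, the file names are made up for the example, and `count_by_suffix` is a hypothetical helper.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_suffix(folder: Path) -> Counter:
    """Count regular files in a folder, grouped by lower-cased extension."""
    return Counter(p.suffix.lower() for p in folder.iterdir() if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "CheMed123"
    folder.mkdir()
    for name in ("1.jpg", "2.JPG", "3.png"):
        (folder / name).write_bytes(b"")
    (folder / "thumbs").mkdir()  # a subdirectory, ignored by the count

    counts = count_by_suffix(folder)
    print(dict(counts))
```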
