# Task 1 - Telegram Scraping

This notebook imports and runs the existing Telegram scraper at `src/scraper.py`.

Outputs written by the scraper:
- `data/raw/telegram_messages/YYYY-MM-DD/<channel>.json`
- `data/raw/images/<channel>/*` (only when `DOWNLOAD_MEDIA` is enabled)
- `data/raw/scrape_state.json`
- `logs/scraper.log`
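
The partition layout above can be sketched as a small path builder. This is only an illustration of the `YYYY-MM-DD/<channel>.json` convention; `partition_path` is a hypothetical helper and is not taken from `src/scraper.py`. Stripping the leading `@` is an assumption based on the channel names and file names that appear in this notebook's outputs.

```python
from datetime import date
from pathlib import Path


def partition_path(data_lake_dir: Path, day: date, channel: str) -> Path:
    """Return <data_lake_dir>/YYYY-MM-DD/<channel>.json (hypothetical helper)."""
    # Assumption: "@tikvahpharma" and "tikvahpharma" map to the same file name.
    name = channel.lstrip("@")
    # date.isoformat() yields the YYYY-MM-DD partition folder name.
    return data_lake_dir / day.isoformat() / f"{name}.json"


p = partition_path(Path("data/raw/telegram_messages"), date(2026, 1, 15), "@tikvahpharma")
print(p.as_posix())  # data/raw/telegram_messages/2026-01-15/tikvahpharma.json
```

Using the ISO date as the folder name means a reverse lexicographic sort of the folder names lists the newest partitions first.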


In [1]:
from __future__ import annotations

import sys
from pathlib import Path

# We are in: <project_root>/notebooks/task1_telegram_scraping.ipynb
# So we search upwards to find <project_root>/src/scraper.py
NOTEBOOK_CWD: Path = Path.cwd().resolve()


def find_project_root(start: Path) -> Path:
    for p in [start, *start.parents]:
        if (p / "src" / "scraper.py").exists():
            return p
    raise FileNotFoundError(
        "Could not locate project root containing src/scraper.py. "
        "Open this notebook from somewhere inside the repo."
    )


PROJECT_ROOT: Path = find_project_root(NOTEBOOK_CWD)

# Make `src` importable as a package
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("NOTEBOOK_CWD:", NOTEBOOK_CWD)
print("PROJECT_ROOT:", PROJECT_ROOT)
print("sys.path[0]:", sys.path[0])
print("Python:", sys.executable)



NOTEBOOK_CWD: D:\Python\Week 8\Shipping-a-Data-Product\notebooks
PROJECT_ROOT: D:\Python\Week 8\Shipping-a-Data-Product
sys.path[0]: D:\Python\Week 8\Shipping-a-Data-Product
Python: d:\Python\Week 8\Shipping-a-Data-Product\.venv\Scripts\python.exe


In [2]:
from src import scraper

print("Imported scraper from:", scraper.__file__)

# Show important config (read at import time from env vars)
for name in [
    "MAX_MESSAGES_PER_CHANNEL",
    "LOOKBACK_DAYS",
    "DOWNLOAD_MEDIA",
    "MAX_MEDIA_PER_CHANNEL",
    "PER_CHANNEL_TIMEOUT_SEC",
    "STATE_PATH",
    "DATA_LAKE_DIR",
    "IMAGES_DIR",
]:
    print(f"{name} = {getattr(scraper, name, None)}")


Imported scraper from: D:\Python\Week 8\Shipping-a-Data-Product\src\scraper.py
MAX_MESSAGES_PER_CHANNEL = 300
LOOKBACK_DAYS = 14
DOWNLOAD_MEDIA = True
MAX_MEDIA_PER_CHANNEL = 50
PER_CHANNEL_TIMEOUT_SEC = 180
STATE_PATH = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\scrape_state.json
DATA_LAKE_DIR = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\telegram_messages
IMAGES_DIR = D:\Python\Week 8\Shipping-a-Data-Product\data\raw\images
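
The comment in the cell above says the configuration is read from environment variables at import time. A minimal sketch of that pattern follows; `env_int` and `env_bool` are hypothetical helpers, and the accepted truthy strings are an assumption, so the real parsing in `src/scraper.py` may differ.

```python
import os
from typing import Mapping


def env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Read an integer setting, falling back to `default` when unset or blank."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)


def env_bool(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean setting; the accepted truthy strings are an assumption."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# A plain dict stands in for os.environ so the demo does not touch the real environment.
demo_env = {"LOOKBACK_DAYS": "7", "DOWNLOAD_MEDIA": "false"}
print(env_int("LOOKBACK_DAYS", 14, demo_env))              # 7
print(env_int("MAX_MESSAGES_PER_CHANNEL", 300, demo_env))  # 300 (default)
print(env_bool("DOWNLOAD_MEDIA", True, demo_env))          # False
```

If the values are bound to module-level names at import time, changing an environment variable afterwards has no effect on them until the module is re-imported, for example with `importlib.reload(scraper)` or by restarting the kernel.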


## Run the scraper

In Jupyter, run the async entrypoint with a top-level `await`:

```python
await scraper.main()
```

Do **not** call `python src/scraper.py` from inside the notebook.
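
A bare `await` works in a cell because the Jupyter kernel already runs an asyncio event loop; calling `asyncio.run()` from inside that loop raises `RuntimeError`. The sketch below shows one way to tell the two situations apart. The `main` coroutine here is a stand-in for illustration, not the real `scraper.main()`, and `loop_is_running` is a hypothetical helper.

```python
import asyncio


async def main() -> str:
    # Stand-in coroutine; the real entrypoint is scraper.main().
    await asyncio.sleep(0)
    return "done"


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises when no loop is running in this thread.
        return False
    return True


if loop_is_running():
    # Notebook case: the kernel owns the loop, so use `await main()` in a cell.
    print("Event loop already running: use `await main()`.")
else:
    # Plain script case: start a fresh loop for the coroutine.
    print(asyncio.run(main()))
```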


In [3]:
await scraper.main()


2026-01-15 17:55:34,512 | INFO | === Telegram scraper starting ===
2026-01-15 17:55:34,514 | INFO | PROJECT_ROOT=D:\Python\Week 8\Shipping-a-Data-Product
2026-01-15 17:55:34,515 | INFO | STATE_PATH=D:\Python\Week 8\Shipping-a-Data-Product\data\raw\scrape_state.json
2026-01-15 17:55:34,516 | INFO | DATA_LAKE_DIR=D:\Python\Week 8\Shipping-a-Data-Product\data\raw\telegram_messages
2026-01-15 17:55:34,516 | INFO | IMAGES_DIR=D:\Python\Week 8\Shipping-a-Data-Product\data\raw\images
2026-01-15 17:55:34,517 | INFO | CONFIG: MAX_MESSAGES_PER_CHANNEL=300, LOOKBACK_DAYS=14, DOWNLOAD_MEDIA=True, MAX_MEDIA_PER_CHANNEL=50, PER_CHANNEL_TIMEOUT_SEC=180
2026-01-15 17:55:34,518 | INFO | Loaded additional channels from D:\Python\Week 8\Shipping-a-Data-Product\channels.txt
2026-01-15 17:55:34,519 | INFO | Total channels to scrape: 4 -> @CheMed123, @lobelia4cosmetics, @tikvahpharma, @rayapharmaceuticals
2026-01-15 17:55:34,550 | INFO | Connecting to 149.154.167.51:443/TcpFull...
2026-01-15 17:55:36,063 | 

Signed in successfully as Mogassa🇪🇹; remember to not break the ToS or you will risk an account ban!


2026-01-15 17:56:03,211 | INFO | Channel @CheMed123 done: new_messages=0, images_downloaded=0, partitions=0, new_last_message_id=97
2026-01-15 17:56:03,212 | INFO | Scraping channel: @lobelia4cosmetics
2026-01-15 17:56:03,212 | INFO | Channel @lobelia4cosmetics last_message_id=22884
2026-01-15 17:56:03,773 | INFO | Channel @lobelia4cosmetics done: new_messages=0, images_downloaded=0, partitions=0, new_last_message_id=22884
2026-01-15 17:56:03,774 | INFO | Scraping channel: @tikvahpharma
2026-01-15 17:56:03,774 | INFO | Channel @tikvahpharma last_message_id=188947
2026-01-15 17:56:04,578 | INFO | Channel @tikvahpharma done: new_messages=0, images_downloaded=0, partitions=0, new_last_message_id=188947
2026-01-15 17:56:04,579 | INFO | Scraping channel: @rayapharmaceuticals
2026-01-15 17:56:04,579 | INFO | Channel @rayapharmaceuticals last_message_id=0
2026-01-15 17:56:05,268 | INFO | Channel @rayapharmaceuticals done: new_messages=0, images_downloaded=0, partitions=0, new_last_message_id=

## Quick verification (outputs)

This cell checks the state file and lists recent partitions and image folders.
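
The state file may hold channel entries in more than one shape: in the output recorded for this check, two channels store a dict with `last_message_id` and `updated_at`, while `tikvahpharma` stores a bare integer. A minimal sketch of reading both shapes uniformly is shown here; `last_message_id` is a hypothetical helper, and treating anything else as `0` is an assumption rather than the scraper's documented behaviour.

```python
from typing import Any


def last_message_id(entry: Any) -> int:
    """Extract the last message id from either state shape (hypothetical helper)."""
    if isinstance(entry, dict):
        # Newer shape: {"last_message_id": ..., "updated_at": ...}
        return int(entry.get("last_message_id", 0))
    if isinstance(entry, int) and not isinstance(entry, bool):
        # Older shape: the id stored directly as an integer.
        return entry
    # Assumption: unknown or missing entries mean "nothing scraped yet".
    return 0


# Values copied from the state output recorded in this notebook.
channels = {
    "CheMed123": {"last_message_id": 97, "updated_at": "2026-01-14T18:08:14.968611+00:00"},
    "tikvahpharma": 188947,
}
for name, entry in channels.items():
    print(name, last_message_id(entry))
```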


In [4]:
import json
from pathlib import Path

state_path = PROJECT_ROOT / "data" / "raw" / "scrape_state.json"
data_dir = PROJECT_ROOT / "data" / "raw" / "telegram_messages"
images_dir = PROJECT_ROOT / "data" / "raw" / "images"

print("State exists:", state_path.exists(), "->", state_path)
print("Partitions dir exists:", data_dir.exists(), "->", data_dir)
print("Images dir exists:", images_dir.exists(), "->", images_dir)

if state_path.exists():
    st = json.loads(state_path.read_text(encoding="utf-8"))
    print("\nState (channels):")
    for ch, v in (st.get("channels") or {}).items():
        print(f"  {ch}: {v}")

if data_dir.exists():
    days = sorted([p for p in data_dir.iterdir() if p.is_dir()], reverse=True)
    print("\nNewest day partitions:")
    for d in days[:7]:
        files = sorted(d.glob("*.json"))
        print(f"  {d.name}: {len(files)} json files")
        for f in files[:5]:
            print("    -", f.name)

if images_dir.exists():
    chans = sorted([p for p in images_dir.iterdir() if p.is_dir()])
    print("\nImage folders:")
    for ch in chans:
        n = len(list(ch.glob("*")))
        print(f"  {ch.name}: {n} files")


State exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\scrape_state.json
Partitions dir exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\telegram_messages
Images dir exists: True -> D:\Python\Week 8\Shipping-a-Data-Product\data\raw\images

State (channels):
  CheMed123: {'last_message_id': 97, 'updated_at': '2026-01-14T18:08:14.968611+00:00'}
  lobelia4cosmetics: {'last_message_id': 22884, 'updated_at': '2026-01-15T13:20:06.955975+00:00'}
  tikvahpharma: 188947

Newest day partitions:
  2026-01-15: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.json
  2026-01-14: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.json
  2026-01-13: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.json
  2026-01-12: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.json
  2026-01-11: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.json
  2026-01-10: 2 json files
    - lobelia4cosmetics.json
    - tikvahpharma.j