# Telegram Medical Data Scraper

This notebook demonstrates how to scrape medical data from Telegram channels with the project's asynchronous scraper module, `scraper.telegram_scraper`.

## Target Channels
- https://t.me/CheMed123
- https://t.me/lobelia4cosmetics  
- https://t.me/tikvahpharma

## Features
- Asynchronous scraping using Telethon
- Rate limit handling
- Error handling for various Telegram errors
- Data storage in JSON format
- Progress tracking and logging
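
The rate limit handling listed above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not the code in `scraper.telegram_scraper`: `FloodWait` is a hypothetical stand-in for Telethon's `FloodWaitError` (which reports the required wait in its `seconds` attribute), and `with_flood_retry`, `make_call`, and `max_retries` are names invented here.

```python
import asyncio


class FloodWait(Exception):
    """Hypothetical stand-in for telethon.errors.FloodWaitError."""

    def __init__(self, seconds):
        super().__init__(f"wait {seconds}s")
        self.seconds = seconds


async def with_flood_retry(make_call, max_retries=3, sleep=asyncio.sleep):
    """Await make_call(); on FloodWait, sleep for the requested time and retry."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except FloodWait as exc:
            if attempt == max_retries:
                raise  # give up once the retry budget is spent
            await sleep(exc.seconds)
```

In a real scraper the `except` clause would presumably catch `FloodWaitError` itself. The `sleep` parameter exists only so the wait can be replaced in tests.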

----

## Import Libraries

In [1]:
import sys
import os
import asyncio
import json
from datetime import datetime
from pathlib import Path

# Add the src directory to Python path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Import our scraper
from scraper.telegram_scraper import TelegramScraper
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✅ Libraries imported successfully")
print(f"📁 Current working directory: {os.getcwd()}")

✅ Libraries imported successfully
📁 Current working directory: c:\Users\Admin\OneDrive\ACADEMIA\10 Academy\Week 7\GitHub Repository\telegram-medical-data-pipeline\notebooks


## Check Environment

In [2]:
# Check if .env file exists and load environment variables
from dotenv import load_dotenv
load_dotenv()

# Check required environment variables
required_vars = ['TELEGRAM_API_ID', 'TELEGRAM_API_HASH', 'TELEGRAM_PHONE']
missing_vars = []

for var in required_vars:
    value = os.getenv(var)
    if not value:
        missing_vars.append(var)
    else:
        print(f"✅ {var}: {'*' * len(value)}")  # Mask the value; only its length is shown

if missing_vars:
    print(f"❌ Missing environment variables: {missing_vars}")
    print("Please create a .env file with your Telegram API credentials")
else:
    print("✅ All required environment variables are set")

# Check data directory
data_dir = Path("../data/raw")
if not data_dir.exists():
    data_dir.mkdir(parents=True, exist_ok=True)
    print(f"📁 Created data directory: {data_dir}")
else:
    print(f"📁 Data directory exists: {data_dir}")

✅ TELEGRAM_API_ID: ********
✅ TELEGRAM_API_HASH: ********************************
✅ TELEGRAM_PHONE: *************
✅ All required environment variables are set
📁 Data directory exists: ..\data\raw


## Initialize Scraper

Let's create an instance of our TelegramScraper and test the connection.

In [3]:
# Initialize the scraper
try:
    scraper = TelegramScraper()
    print("✅ Scraper initialized successfully")
    print(f"Target channels: {len(scraper.target_channels)}")
    for channel in scraper.target_channels:
        print(f"   - {channel}")
except Exception as e:
    print(f"❌ Error initializing scraper: {e}")
    scraper = None

2025-07-14 23:05:01,965 - scraper.telegram_scraper - INFO - Initialized TelegramScraper with 3 target channels


✅ Scraper initialized successfully
Target channels: 3
   - https://t.me/CheMed123
   - https://t.me/lobelia4cosmetics
   - https://t.me/tikvahpharma


## Test Connection

In [4]:
# Test connection to Telegram
async def test_connection():
    if not scraper:
        print("❌ Scraper not initialized")
        return False
    
    try:
        print("Testing connection to Telegram...")
        connected = await scraper.connect()
        
        if connected:
            print("✅ Successfully connected to Telegram")
            await scraper.disconnect()
            print("✅ Disconnected from Telegram")
            return True
        else:
            print("❌ Failed to connect to Telegram")
            return False
            
    except Exception as e:
        print(f"❌ Connection test failed: {e}")
        return False

# Run the connection test
connection_success = await test_connection()

2025-07-14 23:05:01,981 - telethon.network.mtprotosender - INFO - Connecting to 149.154.167.91:443/TcpFull...


Testing connection to Telegram...


2025-07-14 23:05:02,218 - telethon.network.mtprotosender - INFO - Connection to 149.154.167.91:443/TcpFull complete!
2025-07-14 23:05:03,301 - scraper.telegram_scraper - INFO - Successfully connected to Telegram
2025-07-14 23:05:03,303 - telethon.network.mtprotosender - INFO - Disconnecting from 149.154.167.91:443/TcpFull...
2025-07-14 23:05:03,304 - telethon.network.mtprotosender - INFO - Disconnection from 149.154.167.91:443/TcpFull complete!
2025-07-14 23:05:03,310 - scraper.telegram_scraper - INFO - Disconnected from Telegram


✅ Successfully connected to Telegram
✅ Disconnected from Telegram


## Scrape All Channels

In [5]:
# Function to scrape all channels
async def scrape_all_channels():
    """Scrape all target channels"""
    if not scraper:
        print("❌ Scraper not initialized")
        return []
    
    print("Starting to scrape all channels...")
    print("=" * 50)
    
    try:
        # Connect to Telegram
        if not await scraper.connect():
            print("❌ Failed to connect to Telegram")
            return []
        
        # Scrape all channels
        results = await scraper.scrape_all_channels()
        
        # Print summary
        print("\n📊 Scraping Summary:")
        print("-" * 30)
        
        total_messages = 0
        successful_channels = 0
        
        for result in results:
            status_icon = "✅" if result['status'] == 'success' else "❌"
            print(f"{status_icon} {result['channel_name']}")
            print(f"   Messages: {result['message_count']}")
            print(f"   Status: {result['status']}")
            
            if result['file_path']:
                print(f"   File: {result['file_path']}")
            
            if result['error']:
                print(f"   Error: {result['error']}")
            
            if result['status'] == 'success':
                successful_channels += 1
                total_messages += result['message_count']
            
            print()
        
        print("=" * 50)
        print("📈 Final Summary:")
        print(f"   Successful channels: {successful_channels}/{len(results)}")
        print(f"   Total messages scraped: {total_messages}")
        print(f"   Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        
        await scraper.disconnect()
        return results
        
    except Exception as e:
        print(f"❌ Error in scraping: {e}")
        await scraper.disconnect()
        return []

# Run the scraping
all_results = await scrape_all_channels()

2025-07-14 23:05:03,332 - telethon.network.mtprotosender - INFO - Connecting to 149.154.167.91:443/TcpFull...


Starting to scrape all channels...


2025-07-14 23:05:03,549 - telethon.network.mtprotosender - INFO - Connection to 149.154.167.91:443/TcpFull complete!
2025-07-14 23:05:04,778 - scraper.telegram_scraper - INFO - Successfully connected to Telegram
2025-07-14 23:05:04,779 - scraper.telegram_scraper - INFO - Starting scrape for 3 channels
2025-07-14 23:05:04,780 - scraper.telegram_scraper - INFO - Starting scrape for channel: CheMed123
2025-07-14 23:05:04,780 - scraper.telegram_scraper - INFO - Starting to scrape channel: CheMed123
2025-07-14 23:05:05,622 - scraper.telegram_scraper - INFO - Successfully scraped 63 messages from CheMed123
2025-07-14 23:05:05,626 - scraper.telegram_scraper - INFO - Saved 63 messages to data/raw\2025-07-14\CheMed123\messages.json
2025-07-14 23:05:07,629 - scraper.telegram_scraper - INFO - Starting scrape for channel: lobelia4cosmetics
2025-07-14 23:05:07,631 - scraper.telegram_scraper - INFO - Starting to scrape channel: lobelia4cosmetics
2025-07-14 23:05:08,825 - scraper.telegram_scraper - I


📊 Scraping Summary:
------------------------------
✅ CheMed123
   Messages: 63
   Status: success
   File: data/raw\2025-07-14\CheMed123\messages.json

✅ lobelia4cosmetics
   Messages: 965
   Status: success
   File: data/raw\2025-07-14\lobelia4cosmetics\messages.json

✅ tikvahpharma
   Messages: 946
   Status: success
   File: data/raw\2025-07-14\tikvahpharma\messages.json

📈 Final Summary:
   Successful channels: 3/3
   Total messages scraped: 1974
   Timestamp: 2025-07-14 23:05:30
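
The file paths in the summary follow a `<base>/<date>/<channel>/messages.json` layout. The sketch below reproduces that layout and finds the most recent run directory, so later cells would not need a hardcoded date. Both helper names (`message_file`, `latest_run_dir`) are hypothetical and not part of the scraper module; the layout itself is inferred from the log output above.

```python
from datetime import date
from pathlib import Path


def message_file(base_dir, channel, day):
    """Build <base_dir>/<YYYY-MM-DD>/<channel>/messages.json."""
    return Path(base_dir) / day.isoformat() / channel / "messages.json"


def latest_run_dir(base_dir):
    """Return the most recent date-named subdirectory, or None if there is none."""
    base = Path(base_dir)
    if not base.is_dir():
        return None
    runs = []
    for child in base.iterdir():
        if not child.is_dir():
            continue
        try:
            runs.append((date.fromisoformat(child.name), child))
        except ValueError:
            continue  # skip directories that are not named as ISO dates
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]
```

For example, `latest_run_dir("data/raw")` run from the notebook directory would return the `2025-07-14` folder if that is the newest run on disk.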


## Data Summary

Let's create a summary of all scraped data across all channels.

In [11]:
# Function to create comprehensive data summary
def create_data_summary():
    """Create a comprehensive summary of all scraped data"""
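    # The scraper wrote to data/raw relative to the notebook's working directory
    # (see the log output above), i.e. notebooks/data/raw. The date is hardcoded
    # to this run and must be updated for later runs.
    data_dir = Path("../notebooks/data/raw/2025-07-14/")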
    
    if not data_dir.exists():
        print("❌ Data directory does not exist")
        return
    
    print("📊 Creating comprehensive data summary...")
    print("=" * 50)
    
    # Find all JSON files
    json_files = list(data_dir.rglob("*.json"))
    
    if not json_files:
        print("❌ No data files found")
        return
    
    summary = {
        'total_files': len(json_files),
        'total_messages': 0,
        'channels': {},
        'date_range': {'earliest': None, 'latest': None},
        'file_sizes': {},
        'errors': []
    }
    
    for json_file in json_files:
        try:
            # Get file info
            file_size = json_file.stat().st_size
            file_size_mb = file_size / (1024 * 1024)
            
            # Extract channel name from path
            channel_name = json_file.parent.name
            
            # Load data
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            message_count = len(data) if isinstance(data, list) else 0
            
            # Update summary
            summary['total_messages'] += message_count
            summary['file_sizes'][str(json_file)] = file_size_mb
            
            if channel_name not in summary['channels']:
                summary['channels'][channel_name] = {
                    'files': 0,
                    'messages': 0,
                    'total_size_mb': 0
                }
            
            summary['channels'][channel_name]['files'] += 1
            summary['channels'][channel_name]['messages'] += message_count
            summary['channels'][channel_name]['total_size_mb'] += file_size_mb
            
            # Update date range
            if data:
                dates = [msg.get('message_date', '') for msg in data if msg.get('message_date')]
                if dates:
                    if not summary['date_range']['earliest'] or min(dates) < summary['date_range']['earliest']:
                        summary['date_range']['earliest'] = min(dates)
                    if not summary['date_range']['latest'] or max(dates) > summary['date_range']['latest']:
                        summary['date_range']['latest'] = max(dates)
            
        except Exception as e:
            summary['errors'].append(f"Error reading {json_file}: {e}")
    
    # Print summary
    print(f"📁 Total files: {summary['total_files']}")
    print(f"📝 Total messages: {summary['total_messages']}")
    print(f"📅 Date range: {summary['date_range']['earliest']} to {summary['date_range']['latest']}")
    print(f"📊 Total size: {sum(summary['file_sizes'].values()):.2f} MB")
    
    print("\n📡 Channels:")
    for channel, stats in summary['channels'].items():
        print(f"   {channel}:")
        print(f"     Files: {stats['files']}")
        print(f"     Messages: {stats['messages']}")
        print(f"     Size: {stats['total_size_mb']:.2f} MB")
    
    if summary['errors']:
        print("\n❌ Errors:")
        for error in summary['errors']:
            print(f"   {error}")
    
    return summary

# Create the summary
data_summary = create_data_summary()

📊 Creating comprehensive data summary...
📁 Total files: 3
📝 Total messages: 1974
📅 Date range: 2022-09-05T09:57:09+00:00 to 2025-07-14T18:27:36+00:00
📊 Total size: 2.74 MB

📡 Channels:
   CheMed123:
     Files: 1
     Messages: 63
     Size: 0.05 MB
   lobelia4cosmetics:
     Files: 1
     Messages: 965
     Size: 0.98 MB
   tikvahpharma:
     Files: 1
     Messages: 946
     Size: 1.71 MB
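
The date range above is found by comparing ISO 8601 strings, which orders correctly when every timestamp carries the same UTC offset, as in this run. If offsets could differ, parsing the values first is safer. A minimal sketch, assuming each message is a dict with the `message_date` key used above and that every timestamp includes an offset; `message_date_range` is a name introduced here for illustration.

```python
from datetime import datetime


def message_date_range(messages):
    """Return (earliest, latest) as timezone-aware datetimes, or (None, None)."""
    parsed = [
        datetime.fromisoformat(msg["message_date"])
        for msg in messages
        if msg.get("message_date")  # skip missing or empty dates
    ]
    if not parsed:
        return None, None
    return min(parsed), max(parsed)
```

Mixing timestamps with and without an offset would raise a `TypeError` on comparison, so this sketch only fits data where the offset is always present.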
