# Telegram Medical Data Scraper

This notebook demonstrates how to scrape medical data from Telegram channels using the asynchronous scraper module.

## Target Channels
- https://t.me/CheMed123
- https://t.me/lobelia4cosmetics  
- https://t.me/tikvahpharma

## Features
- Asynchronous scraping using Telethon
- Rate limit handling
- Error handling for various Telegram errors
- Data storage in JSON format with date-based directory structure
- Image attachment downloading
- Enhanced logging with timing information
- Progress tracking and detailed reporting
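
The rate limit handling listed above lives inside the scraper module, which this notebook does not show. As a minimal sketch of the general idea, the helper below retries an awaited call after sleeping for the time the server asks for. `RateLimited` and `call_with_rate_limit_retry` are hypothetical names used only for illustration; with Telethon, the corresponding error is `FloodWaitError`, which exposes the wait time as `seconds`.

```python
import asyncio


class RateLimited(Exception):
    """Hypothetical stand-in for a rate-limit error carrying a wait time."""

    def __init__(self, seconds):
        super().__init__(f"rate limited, retry after {seconds}s")
        self.seconds = seconds


async def call_with_rate_limit_retry(make_call, max_attempts=3, sleep=asyncio.sleep):
    """Await make_call(), sleeping for the requested time after each RateLimited error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # give up once the last allowed attempt has failed
            await sleep(exc.seconds)
```

In a real scraper, `except RateLimited` would become `except FloodWaitError`. The `sleep` parameter is injectable so the helper can be exercised without real delays.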

----

## Import Libraries

In [1]:
import sys
import os
import asyncio
import json
from datetime import datetime
from pathlib import Path

# Add the src directory to Python path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

# Import our scraper
from scraper.telegram_scraper import TelegramScraper
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✅ Libraries imported successfully")
print(f"📁 Current working directory: {os.getcwd()}")

✅ Libraries imported successfully
📁 Current working directory: c:\Users\Admin\OneDrive\ACADEMIA\10 Academy\Week 7\GitHub Repository\telegram-medical-data-pipeline\notebooks


## Check Environment

In [2]:
# Check if .env file exists and load environment variables
from dotenv import load_dotenv
load_dotenv()

# Check required environment variables
required_vars = ['TELEGRAM_API_ID', 'TELEGRAM_API_HASH', 'TELEGRAM_PHONE']
missing_vars = []

for var in required_vars:
    value = os.getenv(var)
    if not value:
        missing_vars.append(var)
    else:
        print(f"✅ {var}: {'*' * len(value)}")  # Mask the value; only its length is shown

if missing_vars:
    print(f"❌ Missing environment variables: {missing_vars}")
    print("Please create a .env file with your Telegram API credentials")
else:
    print("✅ All required environment variables are set")

# Check and create data directories
data_dirs = [
    Path("../data/raw/telegram_messages"),
    Path("../data/raw/telegram_images")
]

for data_dir in data_dirs:
    if not data_dir.exists():
        data_dir.mkdir(parents=True, exist_ok=True)
        print(f"📁 Created data directory: {data_dir}")
    else:
        print(f"📁 Data directory exists: {data_dir}")

✅ TELEGRAM_API_ID: ********
✅ TELEGRAM_API_HASH: ********************************
✅ TELEGRAM_PHONE: *************
✅ All required environment variables are set
📁 Created data directory: ..\data\raw\telegram_messages
📁 Created data directory: ..\data\raw\telegram_images


## Initialize Scraper

Let's create an instance of our TelegramScraper and test the connection.

In [3]:
# Initialize the scraper
try:
    scraper = TelegramScraper()
    print("✅ Scraper initialized successfully")
    print(f"Target channels: {len(scraper.target_channels)}")
    for channel in scraper.target_channels:
        print(f"   - {channel}")
except Exception as e:
    print(f"❌ Error initializing scraper: {e}")
    scraper = None

2025-07-14 23:33:17,272 - scraper.telegram_scraper - INFO - Initialized TelegramScraper with 3 target channels


✅ Scraper initialized successfully
Target channels: 3
   - https://t.me/CheMed123
   - https://t.me/lobelia4cosmetics
   - https://t.me/tikvahpharma


## Test Connection

In [4]:
# Test connection to Telegram
async def test_connection():
    if not scraper:
        print("❌ Scraper not initialized")
        return False
    
    try:
        print("Testing connection to Telegram...")
        connected = await scraper.connect()
        
        if connected:
            print("✅ Successfully connected to Telegram")
            await scraper.disconnect()
            print("✅ Disconnected from Telegram")
            return True
        else:
            print("❌ Failed to connect to Telegram")
            return False
            
    except Exception as e:
        print(f"❌ Connection test failed: {e}")
        return False

# Run the connection test
connection_success = await test_connection()

2025-07-14 23:33:17,291 - telethon.network.mtprotosender - INFO - Connecting to 149.154.167.91:443/TcpFull...


Testing connection to Telegram...


2025-07-14 23:33:17,457 - telethon.network.mtprotosender - INFO - Connection to 149.154.167.91:443/TcpFull complete!
2025-07-14 23:33:18,732 - scraper.telegram_scraper - INFO - Successfully connected to Telegram
2025-07-14 23:33:18,735 - telethon.network.mtprotosender - INFO - Disconnecting from 149.154.167.91:443/TcpFull...
2025-07-14 23:33:18,738 - telethon.network.mtprotosender - INFO - Disconnection from 149.154.167.91:443/TcpFull complete!
2025-07-14 23:33:18,750 - scraper.telegram_scraper - INFO - Disconnected from Telegram


✅ Successfully connected to Telegram
✅ Disconnected from Telegram
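
The connection test above calls `connect()` and `disconnect()` by hand. One possible way to guarantee the disconnect is an async context manager. This is only a sketch: it assumes `connect()` returns a truthy value on success, as it does in the test above, and `telegram_session` is a hypothetical helper, not part of the scraper module.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def telegram_session(scraper):
    """Connect on entry and always disconnect on exit, even if the body raises."""
    if not await scraper.connect():
        raise ConnectionError("Failed to connect to Telegram")
    try:
        yield scraper
    finally:
        # Runs on normal exit and on exceptions alike.
        await scraper.disconnect()
```

In a notebook cell this could be used as `async with telegram_session(scraper) as s: results = await s.scrape_all_channels()`.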


## Scrape All Channels

In [5]:
# Function to scrape all channels
async def scrape_all_channels():
    """Scrape all target channels with enhanced logging and timing"""
    if not scraper:
        print("❌ Scraper not initialized")
        return []
    
    print("Starting to scrape all channels...")
    print("=" * 60)
    
    try:
        # Connect to Telegram
        if not await scraper.connect():
            print("❌ Failed to connect to Telegram")
            return []
        
        # Scrape all channels
        results = await scraper.scrape_all_channels()
        
        # Print detailed summary
        print("\n📊 Detailed Scraping Summary:")
        print("-" * 50)
        
        total_messages = 0
        successful_channels = 0
        total_duration = 0
        
        for result in results:
            status_icon = "✅" if result['status'] == 'success' else "❌"
            print(f"{status_icon} {result['channel_name']}")
            print(f"   📝 Messages: {result['message_count']}")
            print(f"   ⏱️  Duration: {result.get('duration_seconds', 0):.2f}s")
            print(f"   Start: {result.get('start_time', 'N/A')}")
            print(f"   End: {result.get('end_time', 'N/A')}")
            print(f"   📁 Status: {result['status']}")
            
            if result['file_path']:
                print(f"   💾 File: {result['file_path']}")
            
            if result['error']:
                print(f"   ⚠️  Error: {result['error']}")
            
            if result['status'] == 'success':
                successful_channels += 1
                total_messages += result['message_count']
                total_duration += result.get('duration_seconds', 0)
            
            print()
        
        print("=" * 60)
        print(f"📈 Final Summary:")
        print(f"   ✅ Successful channels: {successful_channels}/{len(results)}")
        print(f"   📝 Total messages scraped: {total_messages}")
        print(f"   ⏱️  Total duration: {total_duration:.2f}s")
        print(f"   📅 Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        
        await scraper.disconnect()
        return results
        
    except Exception as e:
        print(f"❌ Error in scraping: {e}")
        await scraper.disconnect()
        return []

# Run the scraping
all_results = await scrape_all_channels()

2025-07-14 23:33:18,769 - telethon.network.mtprotosender - INFO - Connecting to 149.154.167.91:443/TcpFull...
2025-07-14 23:33:18,937 - telethon.network.mtprotosender - INFO - Connection to 149.154.167.91:443/TcpFull complete!


Starting to scrape all channels...


2025-07-14 23:33:19,954 - scraper.telegram_scraper - INFO - Successfully connected to Telegram
2025-07-14 23:33:19,955 - scraper.telegram_scraper - INFO - Starting scrape for 3 channels
2025-07-14 23:33:19,957 - scraper.telegram_scraper - INFO - Starting scrape for channel: CheMed123
2025-07-14 23:33:19,959 - scraper.telegram_scraper - INFO - Starting to scrape channel: CheMed123
2025-07-14 23:33:20,676 - telethon.client.downloads - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-07-14 23:33:21,212 - scraper.telegram_scraper - INFO - Downloaded image: data/raw/telegram_images\2025-07-14\CheMed123\97.jpg
2025-07-14 23:33:21,214 - telethon.client.downloads - INFO - Starting direct file download in chunks of 131072 at 0, stride 131072
2025-07-14 23:33:22,282 - scraper.telegram_scraper - INFO - Downloaded image: data/raw/telegram_images\2025-07-14\CheMed123\96.jpg
2025-07-14 23:33:22,285 - telethon.client.downloads - INFO - Starting direct file download in


📊 Detailed Scraping Summary:
--------------------------------------------------
✅ CheMed123
   📝 Messages: 63
   ⏱️  Duration: 65.40s
   Start: 2025-07-14T23:33:19.957364
   End: 2025-07-14T23:34:25.352824
   📁 Status: success
   💾 File: data/raw/telegram_messages\2025-07-14\CheMed123\CheMed123.json

✅ lobelia4cosmetics
   📝 Messages: 965
   ⏱️  Duration: 532.65s
   Start: 2025-07-14T23:34:27.377658
   End: 2025-07-14T23:43:20.023883
   📁 Status: success
   💾 File: data/raw/telegram_messages\2025-07-14\lobelia4cosmetics\lobelia4cosmetics.json

✅ tikvahpharma
   📝 Messages: 946
   ⏱️  Duration: 217.46s
   Start: 2025-07-14T23:43:22.088668
   End: 2025-07-14T23:46:59.544562
   📁 Status: success
   💾 File: data/raw/telegram_messages\2025-07-14\tikvahpharma\tikvahpharma.json

📈 Final Summary:
   ✅ Successful channels: 3/3
   📝 Total messages scraped: 1974
   ⏱️  Total duration: 815.50s
   📅 Timestamp: 2025-07-14 23:47:01
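
The file paths in the summary above follow a `<date>/<channel>/<channel>.json` layout. A small helper that builds such a path might look like the following. `message_file_path` is a hypothetical name, and the actual scraper may construct its paths differently.

```python
from datetime import date
from pathlib import Path


def message_file_path(base_dir, channel_name, day=None):
    """Return <base_dir>/<YYYY-MM-DD>/<channel>/<channel>.json for the given day."""
    day = day or date.today()  # default to today's date folder
    return Path(base_dir) / day.isoformat() / channel_name / f"{channel_name}.json"
```

Passing the date explicitly, rather than always using today's date, makes it possible to read back data from an earlier scrape.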


## Data Summary

Let's create a summary of all scraped data across all channels.

In [9]:
# Function to create comprehensive data summary
def create_data_summary():
    """Create a comprehensive summary of all scraped data including messages and images"""
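    # Only data scraped today is found, because files are stored under a date folder.
    today = datetime.now().strftime('%Y-%m-%d')
    # The scraper saved its output relative to the notebook's working directory
    # (see the file paths in the scraping summary), hence notebooks/data/raw.
    messages_dir = Path(f"../notebooks/data/raw/telegram_messages/{today}")
    images_dir = Path(f"../notebooks/data/raw/telegram_images/{today}")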
    
    print("📊 Creating comprehensive data summary...")
    print("=" * 60)
    
    summary = {
        'date': today,
        'messages': {
            'total_files': 0,
            'total_messages': 0,
            'channels': {},
            'file_sizes': {}
        },
        'images': {
            'total_files': 0,
            'total_size_mb': 0,
            'channels': {}
        },
        'errors': []
    }
    
    # Process message files
    if messages_dir.exists():
        json_files = list(messages_dir.rglob("*.json"))
        summary['messages']['total_files'] = len(json_files)
        
        for json_file in json_files:
            try:
                # Get file info
                file_size = json_file.stat().st_size
                file_size_mb = file_size / (1024 * 1024)
                
                # Extract channel name from path
                channel_name = json_file.parent.name
                
                # Load data
                with open(json_file, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                
                message_count = len(data) if isinstance(data, list) else 0
                
                # Update summary
                summary['messages']['total_messages'] += message_count
                summary['messages']['file_sizes'][str(json_file)] = file_size_mb
                
                if channel_name not in summary['messages']['channels']:
                    summary['messages']['channels'][channel_name] = {
                        'files': 0,
                        'messages': 0,
                        'total_size_mb': 0
                    }
                
                summary['messages']['channels'][channel_name]['files'] += 1
                summary['messages']['channels'][channel_name]['messages'] += message_count
                summary['messages']['channels'][channel_name]['total_size_mb'] += file_size_mb
                
            except Exception as e:
                summary['errors'].append(f"Error reading {json_file}: {e}")
    
    # Process image files
    if images_dir.exists():
        image_files = list(images_dir.rglob("*.jpg")) + list(images_dir.rglob("*.png")) + list(images_dir.rglob("*.jpeg"))
        summary['images']['total_files'] = len(image_files)
        
        for image_file in image_files:
            try:
                # Get file info
                file_size = image_file.stat().st_size
                file_size_mb = file_size / (1024 * 1024)
                
                # Extract channel name from path
                channel_name = image_file.parent.name
                
                # Update summary
                summary['images']['total_size_mb'] += file_size_mb
                
                if channel_name not in summary['images']['channels']:
                    summary['images']['channels'][channel_name] = {
                        'files': 0,
                        'total_size_mb': 0
                    }
                
                summary['images']['channels'][channel_name]['files'] += 1
                summary['images']['channels'][channel_name]['total_size_mb'] += file_size_mb
                
            except Exception as e:
                summary['errors'].append(f"Error reading {image_file}: {e}")
    
    # Print summary
    print(f"📅 Date: {summary['date']}")
    print(f"\n📝 Messages:")
    print(f"   📁 Total files: {summary['messages']['total_files']}")
    print(f"   📊 Total messages: {summary['messages']['total_messages']}")
    print(f"   💾 Total size: {sum(summary['messages']['file_sizes'].values()):.2f} MB")
    
    print(f"\n🖼️  Images:")
    print(f"   📁 Total files: {summary['images']['total_files']}")
    print(f"   💾 Total size: {summary['images']['total_size_mb']:.2f} MB")
    
    print(f"\n📡 Channels Summary:")
    all_channels = set(summary['messages']['channels'].keys()) | set(summary['images']['channels'].keys())
    
    for channel in sorted(all_channels):
        print(f"   {channel}:")
        
        # Message stats
        if channel in summary['messages']['channels']:
            msg_stats = summary['messages']['channels'][channel]
            print(f"     📝 Messages: {msg_stats['messages']} (files: {msg_stats['files']}, size: {msg_stats['total_size_mb']:.2f} MB)")
        
        # Image stats
        if channel in summary['images']['channels']:
            img_stats = summary['images']['channels'][channel]
            print(f"     🖼️  Images: {img_stats['files']} files ({img_stats['total_size_mb']:.2f} MB)")
    
    if summary['errors']:
        print(f"\n❌ Errors:")
        for error in summary['errors']:
            print(f"   {error}")
    
    return summary

# Create the summary
data_summary = create_data_summary()

📊 Creating comprehensive data summary...
📅 Date: 2025-07-14

📝 Messages:
   📁 Total files: 3
   📊 Total messages: 1974
   💾 Total size: 2.86 MB

🖼️  Images:
   📁 Total files: 1272
   💾 Total size: 100.27 MB

📡 Channels Summary:
   CheMed123:
     📝 Messages: 63 (files: 1, size: 0.06 MB)
     🖼️  Images: 59 files (7.06 MB)
   lobelia4cosmetics:
     📝 Messages: 965 (files: 1, size: 1.06 MB)
     🖼️  Images: 965 files (62.04 MB)
   tikvahpharma:
     📝 Messages: 946 (files: 1, size: 1.74 MB)
     🖼️  Images: 248 files (31.18 MB)


## Data Validation

In [10]:
# Function to validate scraped data
def validate_scraped_data():
    """Validate the scraped data for consistency and completeness"""
    # Only data scraped today is validated, because files are stored under a date folder.
    today = datetime.now().strftime('%Y-%m-%d')
    messages_dir = Path(f"../notebooks/data/raw/telegram_messages/{today}")
    
    print("Validating scraped data...")
    print("=" * 40)
    
    validation_results = {
        'total_files': 0,
        'valid_files': 0,
        'total_messages': 0,
        'messages_with_media': 0,
        'issues': []
    }
    
    if not messages_dir.exists():
        print("❌ Messages directory does not exist")
        return validation_results
    
    json_files = list(messages_dir.rglob("*.json"))
    validation_results['total_files'] = len(json_files)
    
    for json_file in json_files:
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            if not isinstance(data, list):
                validation_results['issues'].append(f"{json_file}: Data is not a list")
                continue
            
            validation_results['valid_files'] += 1
            validation_results['total_messages'] += len(data)
            
            # Check for required fields
            for i, message in enumerate(data):
                required_fields = ['message_id', 'message_text', 'message_date', 'channel_name']
                missing_fields = [field for field in required_fields if field not in message]
                
                if missing_fields:
                    validation_results['issues'].append(f"{json_file}: Message {i} missing fields: {missing_fields}")
                
                # Count messages with media
                if message.get('has_media') and message.get('media_path'):
                    validation_results['messages_with_media'] += 1
            
            print(f"✅ {json_file.name}: {len(data)} messages")
            
        except Exception as e:
            validation_results['issues'].append(f"{json_file}: {e}")
            print(f"❌ {json_file.name}: Error - {e}")
    
    # Print validation summary
    print(f"\nValidation Summary:")
    print(f"   📁 Total files: {validation_results['total_files']}")
    print(f"   ✅ Valid files: {validation_results['valid_files']}")
    print(f"   📝 Total messages: {validation_results['total_messages']}")
    print(f"   🖼️  Messages with media: {validation_results['messages_with_media']}")
    
    if validation_results['issues']:
        print(f"\n⚠️  Issues found:")
        for issue in validation_results['issues']:
            print(f"   - {issue}")
    else:
        print(f"\n✅ No validation issues found!")
    
    return validation_results

# Run validation
validation_results = validate_scraped_data()

Validating scraped data...
✅ CheMed123.json: 63 messages
✅ lobelia4cosmetics.json: 965 messages
✅ tikvahpharma.json: 946 messages

Validation Summary:
   📁 Total files: 3
   ✅ Valid files: 3
   📝 Total messages: 1974
   🖼️  Messages with media: 1272

✅ No validation issues found!
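
The validation above counts messages that declare a `media_path` but does not check that the referenced files exist. A possible extra check is sketched below. It assumes `media_path` is stored relative to the directory the scraper ran in, which the notebook does not confirm; `missing_media_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_media_files(messages, base_dir="."):
    """Return media paths referenced by messages that do not exist on disk."""
    missing = []
    for message in messages:
        media_path = message.get("media_path")
        if message.get("has_media") and media_path:
            # Stored paths may use Windows separators, so normalise them first.
            candidate = Path(base_dir) / str(media_path).replace("\\", "/")
            if not candidate.is_file():
                missing.append(media_path)
    return missing
```

An empty result would mean every message flagged with media points at a file that is actually present.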
