## ⚡ MULTI-SESSION STRATEGY (RECOMMENDED)

### **Problem:** Scraping 4.75 years takes 4-6 hours. Colab may disconnect.

### **Solution:** Split into 3 sessions!

**Session 1:** `START_DATE = (2021, 1, 1)`, `END_DATE = (2022, 6, 30)` (~550 days)  
→ Run overnight, download JSON (~10,000 articles)

**Session 2:** `START_DATE = (2022, 7, 1)`, `END_DATE = (2024, 1, 1)` (~550 days)  
→ Run next night, download JSON (~10,000 articles)

**Session 3:** `START_DATE = (2024, 1, 2)`, `END_DATE = (2025, 10, 9)` (~650 days)  
→ Run third night, download JSON (~12,000 articles)

**Result:** ~30,000+ articles total! 🎯
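
The length of each session window can be checked with standard-library Python. This is a minimal sketch: `SESSIONS` and `days_inclusive` are illustrative names, not part of the notebook, and the windows are copied from the plan above.

```python
from datetime import date

# Session windows from the plan, as (start, end) tuples of (year, month, day).
SESSIONS = [
    ((2021, 1, 1), (2022, 6, 30)),
    ((2022, 7, 1), (2024, 1, 1)),
    ((2024, 1, 2), (2025, 10, 9)),
]

def days_inclusive(start: tuple, end: tuple) -> int:
    """Number of calendar days in the window, counting both endpoints."""
    return (date(*end) - date(*start)).days + 1

for number, (start, end) in enumerate(SESSIONS, start=1):
    print(f"Session {number}: {days_inclusive(start, end)} days")

total_days = sum(days_inclusive(s, e) for s, e in SESSIONS)
print(f"Total: {total_days} days (~{total_days / 365.25:.2f} years)")
```

If you move a session boundary, rerun the check so the three windows still cover the full range with no gap or overlap.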

### **On Your PC:**
```powershell
# Extract all 3 ZIPs to data/raw/news/
# Then merge:
python src/scraping/2_merge_data.py  # Auto-removes duplicates!
```
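
The contents of `2_merge_data.py` are not shown here, so the following is only a hypothetical sketch of what a merge-and-deduplicate step could look like. It assumes every session file is a JSON list of article dicts with a `url` field, as produced by this notebook; `merge_articles` and its parameters are illustrative names.

```python
import json
from pathlib import Path

def merge_articles(input_dir: str, output_file: str) -> int:
    """Merge every *.json list under input_dir, keeping the first article per URL.

    Returns the number of unique articles written to output_file.
    """
    seen_urls = set()
    merged = []
    for path in sorted(Path(input_dir).rglob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            continue  # not a list of articles
        for article in data:
            if not isinstance(article, dict):
                continue  # e.g. the URL-hash strings in scraped_cache.json
            url = article.get("url")
            if not url or url in seen_urls:
                continue  # missing URL or duplicate
            seen_urls.add(url)
            merged.append(article)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, ensure_ascii=False)
    return len(merged)
```

Writing the merged file outside `input_dir` keeps it from being read back in on a second run.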

---

# 📰 StockBus News Scraper - Google Colab PRODUCTION Edition

**Created:** October 9, 2025  
**Updated:** Production version with checkpoint saving  
**Purpose:** Scrape 20-30 articles/day for 2021-2025 (30,000+ articles)

---

## 🎯 KEY FEATURES:
- ✅ **AUTO-SAVE every 10 articles** - Never lose progress!
- ✅ **Resume from crash** - Picks up where it left off
- ✅ **25 articles per topic** - More data per scrape
- ✅ **6 topics** - Better coverage
- ✅ **Download JSONs** - Process on your PC

---

## WHAT YOU'LL GET:
- **Target:** 20-30 articles/day
- **Topics:** 6 Indian stock market topics
- **Time Period:** Custom (recommend 2021-2025)
- **Total Expected:** 30,000-35,000 articles
- **Time:** 4-6 hours (can run overnight!)

---

## ⚡ STRATEGY:
1. Run this in **2-3 Colab sessions** (split date ranges)
2. Each session scrapes 550-650 days
3. Auto-saves every 10 articles (crash-proof!)
4. Download JSONs after each session
5. Merge on your PC

---

## 1️⃣ Install Required Packages

In [None]:
# Install dependencies
!pip install -q gnews==0.4.2
!pip install -q selenium==4.35.0
!pip install -q webdriver-manager==4.0.2
!pip install -q beautifulsoup4==4.12.3
!pip install -q newspaper3k==0.2.8
!pip install -q lxml==5.3.0
!pip install -q lxml_html_clean==0.4.3
!pip install -q tqdm

print("✅ All packages installed!")
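
Because `pip install -q` hides most output, it can help to confirm which versions actually ended up in the runtime. A minimal sketch using the standard library's `importlib.metadata`; `installed_versions` and `PINNED` are illustrative names, and the package list is copied from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names from the install cell above.
PINNED = ["gnews", "selenium", "webdriver-manager", "beautifulsoup4",
          "newspaper3k", "lxml", "lxml_html_clean", "tqdm"]

for name, ver in installed_versions(PINNED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```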

## 2️⃣ Setup Chrome WebDriver for Colab

In [None]:
# Install Chrome and ChromeDriver for Colab
!apt-get update
!apt-get install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# Note: Selenium locates chromedriver via PATH (/usr/bin, copied above).
# sys.path only affects Python imports, so this entry is optional.
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

print("✅ Chrome WebDriver ready!")
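
Package paths differ between Colab images, so a quick check that the binaries are reachable can save a failed run later. A minimal sketch with `shutil.which`; `find_executable` is an illustrative helper and the browser names are guesses at common binary names, not something the notebook guarantees.

```python
import shutil

def find_executable(candidates):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

driver_path = find_executable(["chromedriver"])
# Assumed binary names; adjust to whatever the apt package installs.
browser_path = find_executable(["chromium-browser", "chromium", "google-chrome"])

print(f"chromedriver: {driver_path or 'not found'}")
print(f"browser: {browser_path or 'not found'}")
```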

## 3️⃣ Create News Scraper Code

In [None]:
import json
import time
import logging
import hashlib
from datetime import datetime, date
from pathlib import Path
from typing import List, Dict, Optional
from tqdm import tqdm

from gnews import GNews
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from newspaper import Article

print("✅ All imports successful!")

In [None]:
class NewsScraperColab:
    """
    PRODUCTION Google Colab Scraper
    Features:
    - Auto-save every 10 articles (crash-proof!)
    - Resume from checkpoint
    - Cache to avoid duplicates
    - Better extraction logic
    """
    
    def __init__(self, checkpoint_interval=10):
        self.gnews = None
        self.driver = None
        self.results = []
        self.checkpoint_interval = checkpoint_interval
        self.cache = set()  # URLs already scraped
        self.cache_file = 'scraped_cache.json'
        self._load_cache()
        
    def _load_cache(self):
        """Load cache of already-scraped URLs"""
        try:
            with open(self.cache_file, 'r') as f:
                cache_data = json.load(f)
                self.cache = set(cache_data)
            print(f"✅ Loaded cache: {len(self.cache)} URLs already scraped")
        except (FileNotFoundError, json.JSONDecodeError):
            print("📝 Starting fresh (no cache found)")
            
    def _save_cache(self):
        """Save cache to disk"""
        with open(self.cache_file, 'w') as f:
            json.dump(list(self.cache), f)
            
    def _is_cached(self, url):
        """Check if URL already scraped"""
        url_hash = hashlib.md5(url.encode()).hexdigest()
        return url_hash in self.cache
        
    def _mark_cached(self, url):
        """Mark URL as scraped"""
        url_hash = hashlib.md5(url.encode()).hexdigest()
        self.cache.add(url_hash)
        
    def _init_driver(self):
        """Initialize Chrome driver for Colab"""
        if self.driver:
            return
            
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--disable-blink-features=AutomationControlled')
        chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
        
        self.driver = webdriver.Chrome(options=chrome_options)
        print("✅ Chrome driver initialized")
    
    def _extract_with_newspaper3k(self, url: str) -> Optional[Dict]:
        """Extract article using newspaper3k"""
        try:
            article = Article(url)
            article.download()
            article.parse()
            
            if article.text and len(article.text) > 100:
                return {
                    'title': article.title or 'No title',
                    'body': article.text,
                    'published_date': article.publish_date.strftime('%d/%m/%Y') if article.publish_date else None,
                    'authors': article.authors,
                    'extraction_method': 'newspaper3k'
                }
        except Exception:
            # Download or parse failed; the caller falls back to Selenium
            pass
        return None
    
    def _extract_with_selenium(self, url: str) -> Optional[Dict]:
        """Fallback extraction with Selenium + BeautifulSoup"""
        try:
            self._init_driver()
            self.driver.get(url)
            time.sleep(2)
            
            soup = BeautifulSoup(self.driver.page_source, 'html.parser')
            
            # Extract text from paragraphs
            paragraphs = soup.find_all('p')
            body = ' '.join([p.get_text().strip() for p in paragraphs if p.get_text().strip()])
            
            if len(body) > 100:
                title = soup.find('h1')
                return {
                    'title': title.get_text().strip() if title else 'No title',
                    'body': body,
                    'published_date': None,
                    'authors': [],
                    'extraction_method': 'beautifulsoup'
                }
        except Exception:
            # Page load or parsing failed; report no article for this URL
            pass
        return None
    
    def _save_checkpoint(self, articles: List[Dict], topic: str):
        """Save checkpoint (every N articles)"""
        if not articles:
            return
            
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"checkpoint_{topic.replace(' ', '_').lower()}_{timestamp}.json"
        
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(articles, f, indent=2, ensure_ascii=False)
        
        # Also save cache
        self._save_cache()
        
        print(f"💾 Checkpoint saved: {filename} ({len(articles)} articles)")
    
    def scrape_topic(self, topic: str, start_date: tuple, end_date: tuple, max_articles: int = 25) -> List[Dict]:
        """
        Scrape articles for a topic with checkpoint saving
        
        Args:
            topic: Search topic
            start_date: (year, month, day)
            end_date: (year, month, day)
            max_articles: Max articles to scrape per topic
        """
        print(f"\n{'='*70}")
        print(f"🔍 Topic: {topic}")
        print(f"📅 Date: {start_date} to {end_date}")
        print(f"{'='*70}")
        
        # Initialize GNews with date range
        self.gnews = GNews(
            language='en',
            country='IN',
            max_results=max_articles,
            start_date=start_date,
            end_date=end_date
        )
        
        try:
            news_items = self.gnews.get_news(topic)
            print(f"📰 Found {len(news_items)} news items from GNews")
        except Exception as e:
            print(f"❌ Search failed: {e}")
            return []
        
        articles = []

        for item in tqdm(news_items, desc="Scraping", unit="item"):
            try:
                # Get full article URL
                full_url = item.get('url', '')
                if not full_url:
                    continue
                
                # Skip if already scraped (resume capability!)
                if self._is_cached(full_url):
                    print(f"⏭️  Skipping cached: {full_url[:50]}...")
                    continue
                
                # Try newspaper3k first
                article_data = self._extract_with_newspaper3k(full_url)
                
                # Fallback to Selenium if needed
                if not article_data:
                    article_data = self._extract_with_selenium(full_url)
                
                if article_data:
                    # Add metadata
                    article_data.update({
                        'url': full_url,
                        'original_url': item.get('url', ''),
                        'scraped_date': datetime.now().strftime('%d/%m/%Y'),
                        'topic': topic,
                        'publisher': item.get('publisher', {}).get('title', 'Unknown'),
                        'gnews_title': item.get('title', ''),
                        'body_length': len(article_data['body']),
                        'word_count': len(article_data['body'].split())
                    })
                    
                    articles.append(article_data)
                    self._mark_cached(full_url)
                    
                    # CHECKPOINT SAVE (crash-proof!)
                    if len(articles) % self.checkpoint_interval == 0:
                        self._save_checkpoint(articles, topic)
                
                time.sleep(1)  # Rate limiting
                
            except Exception as e:
                print(f"❌ Error: {str(e)[:100]}")
                continue
        
        # Final save
        if articles:
            self._save_checkpoint(articles, topic)
        
        print(f"\n✅ Scraped {len(articles)} articles for '{topic}'")
        return articles
    
    def cleanup(self):
        """Close driver and save cache"""
        if self.driver:
            self.driver.quit()
            print("✅ Chrome driver closed")
        self._save_cache()
        print("✅ Cache saved")

print("✅ NewsScraperColab class created!")
print("💾 Auto-saves every 10 articles")
print("🔄 Can resume from crash!")
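
`_save_cache` opens the cache file in `'w'` mode, which truncates it before the new contents are written. If the runtime disconnects mid-write, the file could be left incomplete and the next `_load_cache` would start fresh. One common way to reduce that risk is to write to a temporary file and rename it into place. The sketch below shows that pattern; `atomic_write_json` is an illustrative helper, not part of the class above.

```python
import json
import os
import tempfile

def atomic_write_json(data, path):
    """Write JSON to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside the class, `_save_cache` could call `atomic_write_json(list(self.cache), self.cache_file)` in place of the plain `open(..., 'w')` block.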

## 4️⃣ Define Topics to Scrape

In [None]:
# Topics for Indian stock market news (6 topics for better coverage)
TOPICS = [
    "Nifty 50 stock market India",
    "BSE Sensex India stock market",
    "Indian stock market news",
    "NSE India trading",
    "Nifty 50 India today",
    "Indian stock market today"
]

# Date range configuration
# RECOMMENDED SPLITS FOR OVERNIGHT SCRAPING:
# Session 1: 2021-01-01 to 2022-06-30 (~550 days)
# Session 2: 2022-07-01 to 2024-01-01 (~550 days)
# Session 3: 2024-01-02 to 2025-10-09 (~650 days)

START_DATE = (2021, 1, 1)   # Change this per session
END_DATE = (2022, 6, 30)     # Change this per session
MAX_ARTICLES_PER_TOPIC = 25  # Increased for 20-30 articles/day

print(f"✅ Configuration:")
print(f"   Topics: {len(TOPICS)}")
print(f"   Date Range: {START_DATE} to {END_DATE}")
print(f"   Max per topic: {MAX_ARTICLES_PER_TOPIC}")
print(f"   Expected: {len(TOPICS) * MAX_ARTICLES_PER_TOPIC} articles per scrape")
print(f"\n⚠️  REMEMBER: Change START_DATE and END_DATE for each session!")
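
As the configuration print shows, one pass over the topics yields at most `len(TOPICS) * MAX_ARTICLES_PER_TOPIC` articles for the whole date range. Getting closer to a per-day target would mean querying shorter windows and calling `scrape_topic` once per window. The sketch below generates such windows. `date_windows` is an illustrative helper, and letting each window end on the day the next one starts is an assumption about how the GNews date filter treats its endpoints; the URL cache would skip any article returned twice.

```python
from datetime import date, timedelta

def date_windows(start: tuple, end: tuple, step_days: int = 1):
    """Yield consecutive (window_start, window_end) tuples covering start..end.

    Each window spans step_days; the last one is clamped to `end`.
    """
    if step_days < 1:
        raise ValueError("step_days must be at least 1")
    current = date(*start)
    last = date(*end)
    while current < last:
        window_end = min(current + timedelta(days=step_days), last)
        yield ((current.year, current.month, current.day),
               (window_end.year, window_end.month, window_end.day))
        current = window_end

# Example: the first three one-day windows of Session 1.
for window_start, window_end in list(date_windows((2021, 1, 1), (2022, 6, 30)))[:3]:
    print(window_start, "->", window_end)
    # In the run cell this would become, per topic:
    # scraper.scrape_topic(topic, window_start, window_end, MAX_ARTICLES_PER_TOPIC)
```

Shorter windows multiply the number of requests, so total run time grows with the number of windows.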

## 5️⃣ Run the Scraper!

In [None]:
# Create output directory
!mkdir -p scraped_news

# Initialize scraper with auto-save every 10 articles
scraper = NewsScraperColab(checkpoint_interval=10)

print("\n" + "="*70)
print("🚀 STARTING PRODUCTION NEWS SCRAPER")
print("="*70)
print(f"📅 Date Range: {START_DATE} to {END_DATE}")
print(f"💾 Auto-save: Every 10 articles")
print(f"🔄 Resume: Will skip cached URLs")
print("="*70)

all_results = {}
all_articles = []

try:
    for topic in TOPICS:
        # Scrape topic with date range
        articles = scraper.scrape_topic(
            topic, 
            start_date=START_DATE,
            end_date=END_DATE,
            max_articles=MAX_ARTICLES_PER_TOPIC
        )
        
        if articles:
            # Save final topic file
            filename = f"scraped_news/{topic.replace(' ', '_').lower()}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
            
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(articles, f, indent=2, ensure_ascii=False)
            
            all_results[topic] = len(articles)
            all_articles.extend(articles)
            print(f"💾 Final save: {filename}")
        
        # Pause between topics
        time.sleep(5)

finally:
    scraper.cleanup()

# Save combined file
combined_file = f"scraped_news/combined_{START_DATE[0]}_{START_DATE[1]:02d}_{START_DATE[2]:02d}_to_{END_DATE[0]}_{END_DATE[1]:02d}_{END_DATE[2]:02d}.json"
with open(combined_file, 'w', encoding='utf-8') as f:
    json.dump(all_articles, f, indent=2, ensure_ascii=False)

print("\n" + "="*70)
print("✅ SCRAPING COMPLETE!")
print("="*70)
print("\n📊 Results by Topic:")
total = 0
for topic, count in all_results.items():
    print(f"   {topic}: {count} articles")
    total += count

print(f"\n🎯 Total Articles: {total}")
print(f"💾 Combined file: {combined_file}")
print(f"📂 Individual files: scraped_news/")
print("\n💡 Files saved with checkpoints - crash-proof!")
print("="*70)

## 6️⃣ Create ZIP for Download

In [None]:
# Create ZIP file
!zip -r scraped_news.zip scraped_news/
!zip -u scraped_news.zip checkpoint_*.json scraped_cache.json

print("\n✅ ZIP file created: scraped_news.zip")
print("\n📥 DOWNLOAD STEPS:")
print("="*70)
print("1. Click the 📁 folder icon (left sidebar)")
print("2. Find 'scraped_news.zip'")
print("3. Right-click → Download")
print("4. On your PC, extract to: data/raw/news/")
print("\n🔄 TO CONTINUE THIS SESSION LATER:")
print("1. Download scraped_cache.json")
print("2. Upload it before running cell 5 again")
print("3. Scraper will skip already-scraped articles!")
print("\n🎯 NEXT STEPS ON YOUR PC:")
print("   python src/scraping/2_merge_data.py  # Removes duplicates")
print("   python src/processing/parallel_summarizer.py")
print("="*70)
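
As an alternative to the sidebar download, the archive can be built and offered for download from Python. This is a sketch under assumptions: `make_zip` is an illustrative helper, and `google.colab.files.download` is only available inside a Colab runtime, so the import is guarded. Unlike `zip -r scraped_news.zip scraped_news/`, `shutil.make_archive` with `root_dir` stores the files without the leading `scraped_news/` folder.

```python
import shutil
from pathlib import Path

def make_zip(source_dir: str, archive_stem: str) -> str:
    """Zip the contents of source_dir into '<archive_stem>.zip'; return its path."""
    if not Path(source_dir).is_dir():
        raise FileNotFoundError(f"{source_dir} does not exist")
    return shutil.make_archive(archive_stem, "zip", root_dir=source_dir)

if Path("scraped_news").is_dir():
    archive = make_zip("scraped_news", "scraped_news")
    try:
        from google.colab import files  # importable only inside Colab
        files.download(archive)
    except ImportError:
        print(f"Not running in Colab; archive is at {archive}")
```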

## 7️⃣ Preview Scraped Data (Optional)

In [None]:
# Show sample article
import os

json_files = [f for f in os.listdir('scraped_news') if f.endswith('.json')]

if json_files:
    with open(f'scraped_news/{json_files[0]}', 'r', encoding='utf-8') as f:
        sample = json.load(f)
    
    print("📰 Sample Article:")
    print("="*70)
    print(f"Title: {sample[0]['title']}")
    print(f"Publisher: {sample[0]['publisher']}")
    print(f"Word Count: {sample[0]['word_count']}")
    print(f"\nBody Preview: {sample[0]['body'][:300]}...")
    print("="*70)
else:
    print("No articles found!")
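
Beyond a single sample, a short summary of a scraped file shows how the articles are distributed. A minimal sketch, assuming each article dict carries the `topic`, `extraction_method`, and `word_count` fields that `scrape_topic` adds; `summarize` is an illustrative name.

```python
from collections import Counter

def summarize(articles):
    """Count articles per topic and per extraction method, plus mean word count."""
    by_topic = Counter(a.get("topic", "Unknown") for a in articles)
    by_method = Counter(a.get("extraction_method", "Unknown") for a in articles)
    word_counts = [a.get("word_count", 0) for a in articles]
    mean_words = sum(word_counts) / len(word_counts) if word_counts else 0.0
    return {
        "by_topic": dict(by_topic),
        "by_method": dict(by_method),
        "mean_words": mean_words,
    }

# Usage with a list loaded from one of the JSON files, e.g. summarize(sample)
demo = [
    {"topic": "NSE India trading", "extraction_method": "newspaper3k", "word_count": 400},
    {"topic": "NSE India trading", "extraction_method": "beautifulsoup", "word_count": 200},
]
print(summarize(demo))
```

A high share of `beautifulsoup` extractions suggests the first extractor is failing often for those sources.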

---

## 🎉 Done!

**What you have:**
- ✅ All scraped articles in JSON format
- ✅ ZIP file ready to download
- ✅ No cost to you (free Colab compute!)

**Next steps on your PC:**
1. Extract `scraped_news.zip` → `data/raw/news/`
2. Merge: `python src/scraping/2_merge_data.py`
3. Summarize: `python src/processing/summarizer.py`
4. FinBERT sentiment analysis (Day 5)

---

**💡 Pro Tip:** Run this notebook weekly to get fresh news data!