<a href="https://colab.research.google.com/github/Dumi-coder/CeylonPulse/blob/main/CeylonPulse_DataCollection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CeylonPulse: Data Collection & Signal Detection

**Real-Time Situational Awareness System for Sri Lanka**

This notebook implements **Step 2** of the workflow:
- Data Collection from multiple sources
- Signal Detection using 40 PESTLE-based signals
- Integration with TensorFlow models (for future steps)

## Three Data Collection Methods:
1. **Scraping** - RSS feeds, web scraping
2. **API Responses** - Twitter, Google Trends
3. **LLM Extraction** - Structure data + generate signals


## Setup & Installation


In [None]:
# Install required packages
!pip install -q requests beautifulsoup4 feedparser lxml
!pip install -q pytrends python-dateutil
!pip install -q pandas numpy

# For TensorFlow (for future ML models)
!pip install -q tensorflow

# For Mistral 7B (optional - for local model, API doesn't need this)
# !pip install -q transformers torch accelerate

print("✅ All packages installed successfully!")


✅ All packages installed successfully!


In [2]:
# Mount Google Drive (optional - to save data)
from google.colab import drive
drive.mount('/content/drive')

# Set working directory
import os
os.chdir('/content')

print("✅ Setup complete!")


Mounted at /content/drive
✅ Setup complete!


## Import Libraries & Load 40 Signals


In [3]:
import sys
import os
import json
import re
from datetime import datetime
from typing import List, Dict
from collections import Counter
import pandas as pd
import numpy as np

# The 40 PESTLE signals defined in the SSD (Signal Specification Document)
SIGNALS = [
    "Government Policy Announcements", "Cabinet/Parliament Decisions",
    "Government Sector Strike Warnings", "Police/Security Alerts",
    "Election-related Discussions", "Foreign Policy / International Agreements",
    "Tax Revision Rumors", "Public Protests & Demonstrations",
    "Inflation Mentions", "Fuel Shortage Mentions", "Dollar Rate Discussions",
    "Tourism Search Trend (Google Trends)", "Food Price Spikes",
    "Stock Market Volatility", "Foreign Investment News",
    "Currency Black Market Mentions", "Crime & Safety Alerts",
    "Public Sentiment (Social Media)", "Migration / Visa Interest",
    "Public Health Discussions", "Viral Social Trends",
    "Cultural Event Mentions", "Power Outages (CEB)",
    "Telecom Outages", "Cyberattack Mentions",
    "E-commerce Growth Indicators", "Digital Payments Failure Reports",
    "New Regulations Affecting Businesses", "Court Rulings Impacting Industries",
    "Import/Export Restriction Changes", "Customs/Port Delays",
    "Rainfall Alerts", "Flood Warnings", "Heat Wave Alerts",
    "Landslide Warnings", "Cyclone Updates", "Air Quality Index Changes",
    "Drought Warnings", "Water Supply Cuts (NWSDB)",
    "Coastal Erosion / Tsunami Alerts"
]

print(f"✅ Loaded {len(SIGNALS)} PESTLE signals")


✅ Loaded 40 PESTLE signals


In [4]:
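# RSS Feed Scraping
from datetime import datetime, timezone

import feedparser
import requests

def scrape_rss_feed(url):
    """Scrape an RSS feed and return its entries as a list of article dicts."""
    try:
        feed = feedparser.parse(url)
        # feedparser does not raise on fetch or parse problems; it sets `bozo` instead
        if feed.bozo and not feed.entries:
            print(f"Warning: could not parse {url}: {feed.get('bozo_exception')}")
        articles = []

        for entry in feed.entries:
            article = {
                'title': entry.get('title', ''),
                'link': entry.get('link', ''),
                'description': entry.get('description', ''),
                'published': entry.get('published', ''),
                'source': feed.feed.get('title', 'Unknown'),
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'scraped_at': datetime.now(timezone.utc).isoformat()
            }
            articles.append(article)

        return articles
    except Exception as e:
        print(f"Error scraping RSS feed {url}: {str(e)}")
        return []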

# Data source URLs
ADA_DERANA_RSS = 'https://www.adaderana.lk/rss.php'
ECONOMYNEXT_RSS = 'https://economynext.com/rss'

# Scrape RSS feeds
ada_articles = scrape_rss_feed(ADA_DERANA_RSS)
econ_articles = scrape_rss_feed(ECONOMYNEXT_RSS)

all_scraped_articles = ada_articles + econ_articles
print(f"✅ Scraped {len(ada_articles)} from Ada Derana, {len(econ_articles)} from EconomyNext")
print(f"📊 Total articles: {len(all_scraped_articles)}")


✅ Scraped 0 from Ada Derana, 20 from EconomyNext
📊 Total articles: 20
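
The `published` field above is stored exactly as the feed supplies it. RSS feeds typically use RFC 822 date strings, so a normalization step along the following lines could make it comparable with `scraped_at`. This is a minimal sketch under that assumption; `normalize_published` is a hypothetical helper, not part of the notebook, and treating offset-less dates as UTC is also an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def normalize_published(raw: str) -> Optional[str]:
    """Convert an RFC 822 date string to ISO 8601 in UTC, or None if unparseable."""
    if not raw:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return None
    if parsed.tzinfo is None:
        # Assumption: a date without a usable offset is treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()

# Sri Lanka time is UTC+05:30, so the result is shifted back by 5.5 hours
print(normalize_published("Sat, 29 Nov 2025 15:41:43 +0530"))
```

Returning `None` for unparseable values keeps bad dates visible downstream instead of silently replacing them with the scrape time.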




In [5]:
# Google Trends API
from datetime import datetime, timezone

from pytrends.request import TrendReq

def get_google_trends(geo='LK'):
    """Get Google Trends trending searches for Sri Lanka"""
    try:
        pytrends = TrendReq(hl='en-US', tz=360)
        # Note: pytrends documents `pn` as a country name (e.g. 'united_states'),
        # so a lowercased ISO code such as 'lk' may be rejected
        trending = pytrends.trending_searches(pn=geo.lower())

        trends = []
        for idx, trend in enumerate(trending[0].head(20).values):
            trend_data = {
                'rank': idx + 1,
                'keyword': trend[0] if isinstance(trend, list) else str(trend),
                'geo': geo,
                'source': 'Google Trends',
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'scraped_at': datetime.now(timezone.utc).isoformat()
            }
            trends.append(trend_data)

        return trends
    except Exception as e:
        print(f"Error getting Google Trends: {str(e)}")
        return []

# Get trending searches
trends = get_google_trends('LK')
print(f"✅ Retrieved {len(trends)} trending searches")

# Display top trends
if trends:
    df_trends = pd.DataFrame(trends)
    print("\n📈 Top 10 Trending Searches in Sri Lanka:")
    print(df_trends[['rank', 'keyword']].head(10).to_string(index=False))


Error getting Google Trends: The request failed: Google returned a response with code 404
✅ Retrieved 0 trending searches


In [6]:
# Signal keywords mapping (from SSD - Signal Specification Document)
SIGNAL_KEYWORDS = {
    "Government Policy Announcements": ["policy", "tax", "cabinet approves", "budget", "government policy"],
    "Fuel Shortage Mentions": ["fuel shortage", "petrol shortage", "diesel shortage", "fuel crisis", "fuel queues"],
    "Inflation Mentions": ["inflation", "price increase", "cost of living", "inflation rate", "cpi"],
    "Dollar Rate Discussions": ["dollar rate", "usd rate", "exchange rate", "rupee dollar", "currency rate"],
    "Power Outages (CEB)": ["power outage", "power cut", "load shedding", "ceb", "electricity cut"],
    "Flood Warnings": ["flood", "flooding", "flood warning", "flood alert", "flash flood"],
    "Public Protests & Demonstrations": ["protest", "demonstration", "rally", "march", "protesters"],
    "Rainfall Alerts": ["rainfall", "heavy rain", "rain alert", "rainfall warning", "monsoon"],
    "Crime & Safety Alerts": ["crime", "robbery", "theft", "murder", "safety alert"],
    "Tourism Search Trend (Google Trends)": ["tourism", "tourist", "visitor", "travel sri lanka", "hotel booking"],
    # Only 10 of the 40 signals have keywords so far; add the remaining 30 as needed
}

def detect_signals(text, title=""):
    """Detect signals from text using keyword matching (SSD-based)"""
    full_text = f"{title} {text}".lower()
    detected = []

    for signal_name, keywords in SIGNAL_KEYWORDS.items():
        matches = []
        for keyword in keywords:
            pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
            if re.search(pattern, full_text):
                matches.append(keyword)

        if matches:
            confidence = min(0.5 + (len(matches) * 0.15), 1.0)
            detected.append({
                'signal_name': signal_name,
                'confidence': round(confidence, 2),
                'matched_keywords': matches[:5]
            })

    return detected

# Detect signals in articles
for article in all_scraped_articles[:10]:  # Test on first 10
    signals = detect_signals(article.get('description', ''), article.get('title', ''))
    article['detected_signals'] = signals

print("✅ Signal detection completed!")
print(f"📊 Articles with signals: {sum(1 for a in all_scraped_articles[:10] if a.get('detected_signals'))}")


✅ Signal detection completed!
📊 Articles with signals: 1
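
The matching and scoring rule in `detect_signals` can be checked in isolation. The sketch below repeats the same word-boundary regex and the same `min(0.5 + 0.15 * n, 1.0)` formula on made-up headlines; `matched_keywords` and `keyword_confidence` are hypothetical helper names introduced only for illustration.

```python
import re
from typing import List

def matched_keywords(text: str, keywords: List[str]) -> List[str]:
    """Return the keywords that occur in text as whole words or phrases."""
    lowered = text.lower()
    return [
        kw for kw in keywords
        if re.search(r'\b' + re.escape(kw.lower()) + r'\b', lowered)
    ]

def keyword_confidence(num_matches: int) -> float:
    """Same scoring rule as detect_signals: 0.5 base plus 0.15 per match, capped at 1.0."""
    if num_matches <= 0:
        return 0.0  # detect_signals reports nothing when there are no matches
    return round(min(0.5 + num_matches * 0.15, 1.0), 2)

POWER_KEYWORDS = ["power outage", "power cut", "load shedding", "ceb", "electricity cut"]

# Made-up headlines, for illustration only
hits = matched_keywords("CEB announces power cut schedule", POWER_KEYWORDS)
print(hits, keyword_confidence(len(hits)))

# The \b boundaries stop "ceb" from matching inside "Facebook"
print(matched_keywords("Outage rumours spread on Facebook", POWER_KEYWORDS))
```

With five keywords per signal, the score saturates at 1.0 from four matches onward, so the formula mostly separates one, two, and three matches.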


## Optional: LLM Extraction (if API key available)


In [7]:
# Mistral 7B Instruct LLM (FREE - via Hugging Face)
USE_LLM = True  # Set to True to use Mistral 7B (free!)
HUGGINGFACE_API_TOKEN = os.getenv('HUGGINGFACE_API_TOKEN', '')  # Optional but recommended

if USE_LLM:
    try:
        import requests

        MISTRAL_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
        API_URL = f"https://api-inference.huggingface.co/models/{MISTRAL_MODEL}"

        def extract_signals_mistral(text, title=""):
            """Extract signals using Mistral 7B Instruct (FREE!)"""
            prompt = f"""Analyze this news article and extract relevant signals from the 40 PESTLE signals.

Title: {title}
Content: {text[:1000]}

Available signals: {', '.join(SIGNALS)}

Return a JSON object with a "signals" array. Each signal should have:
- signal_name (must match one from the list)
- confidence (0-1)
- pestle_category
- swot_category
- severity_estimate (0-1)

Format: {{"signals": [{{"signal_name": "...", "confidence": 0.8, ...}}]}}"""

            # Format for Mistral Instruct
            formatted_prompt = f"<s>[INST] {prompt} [/INST]"

            headers = {}
            if HUGGINGFACE_API_TOKEN:
                headers["Authorization"] = f"Bearer {HUGGINGFACE_API_TOKEN}"

            payload = {
                "inputs": formatted_prompt,
                "parameters": {
                    "max_new_tokens": 1000,
                    "temperature": 0.3,
                    "return_full_text": False
                }
            }

            try:
                response = requests.post(API_URL, headers=headers, json=payload, timeout=60)

                if response.status_code == 200:
                    result = response.json()
                    if isinstance(result, list) and len(result) > 0:
                        content = result[0].get('generated_text', '')
                    else:
                        content = str(result)

                    # Extract JSON from response (`re` is already imported at the top of the notebook)
                    json_match = re.search(r'\{.*\}', content, re.DOTALL)
                    if json_match:
                        parsed = json.loads(json_match.group())
                        return parsed.get('signals', [])
                    return []
                elif response.status_code == 503:
                    print("⚠️ Model is loading, please wait a moment and try again")
                    return []
                else:
                    print(f"⚠️ API error: {response.status_code}")
                    return []
            except Exception as e:
                print(f"⚠️ LLM extraction error: {str(e)}")
                return []

        # Test on one article
        if all_scraped_articles:
            test_article = all_scraped_articles[0]
            print(f"Testing Mistral 7B on: {test_article.get('title', '')[:50]}...")
            llm_signals = extract_signals_mistral(
                test_article.get('description', ''),
                test_article.get('title', '')
            )
            print(f"✅ Mistral 7B extracted {len(llm_signals)} signals")
            if llm_signals:
                print(f"   Example: {llm_signals[0].get('signal_name', 'N/A')}")
    except ImportError:
        print("⚠️ Requests library not available")
else:
    print("⚠️ LLM extraction disabled")


Testing Mistral 7B on: Access to Kandy, Gampola towns in Sri Lanka restor...
⚠️ API error: 410
✅ Mistral 7B extracted 0 signals
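
Because the model replies in free text, the parsed `signals` array may contain names outside the signal list or confidences outside the 0-1 range. The following is a minimal sketch of a validation step that could sit between the API call and downstream use. `parse_llm_signals`, the three-item `ALLOWED` list (a stand-in for `SIGNALS`), and the sample reply are all hypothetical.

```python
import json
import re
from typing import Dict, List

# Stand-in for the notebook's SIGNALS list (assumption, shortened for the example)
ALLOWED = ["Flood Warnings", "Rainfall Alerts", "Landslide Warnings"]

def parse_llm_signals(content: str, allowed: List[str]) -> List[Dict]:
    """Parse the outermost {...} span and keep only well-formed, known signals."""
    match = re.search(r'\{.*\}', content, re.DOTALL)
    if not match:
        return []
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return []
    signals = parsed.get("signals") if isinstance(parsed, dict) else None
    if not isinstance(signals, list):
        return []
    valid = []
    for item in signals:
        if not isinstance(item, dict) or item.get("signal_name") not in allowed:
            continue  # drop names the model invented
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        # Clamp to the 0-1 range requested in the prompt
        valid.append({**item, "confidence": max(0.0, min(confidence, 1.0))})
    return valid

# Made-up model reply, for illustration only
sample = ('Sure! {"signals": [{"signal_name": "Flood Warnings", "confidence": 1.4}, '
          '{"signal_name": "Made Up Signal", "confidence": 0.9}]}')
print(parse_llm_signals(sample, ALLOWED))
```

Filtering by name here means a prompt that lists only some signals cannot introduce labels that the keyword detector would never emit.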


## Save Data & Prepare for TensorFlow


In [8]:
# Combine all data
all_data = all_scraped_articles + trends

# Save to JSON
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_file = f'/content/collected_data_{timestamp}.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(all_data, f, indent=2, ensure_ascii=False)

print(f"✅ Saved {len(all_data)} items to {output_file}")

# Create DataFrame
df = pd.DataFrame(all_data)
print(f"\n📊 Data Summary:")
print(f"Total items: {len(df)}")
if 'source' in df.columns:
    print(f"\nSources:\n{df['source'].value_counts()}")


✅ Saved 20 items to /content/collected_data_20251129_154143.json

📊 Data Summary:
Total items: 20

Sources:
source
EconomyNext    20
Name: count, dtype: int64
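
Each saved article carries its detections as a nested `detected_signals` list, which is awkward to count or filter in a DataFrame. A possible flattening step, producing one row per (article, signal) pair, is sketched below. `flatten_signals` is a hypothetical helper and the sample records are invented for the example.

```python
from typing import Dict, List

def flatten_signals(articles: List[Dict]) -> List[Dict]:
    """Produce one flat row per detected signal; articles without detections yield no rows."""
    rows = []
    for article in articles:
        for signal in article.get("detected_signals") or []:
            rows.append({
                "title": article.get("title", ""),
                "source": article.get("source", "Unknown"),
                "signal_name": signal["signal_name"],
                "confidence": signal["confidence"],
            })
    return rows

# Invented sample records in the same shape the notebook produces
sample_articles = [
    {"title": "Heavy rain floods town", "source": "EconomyNext",
     "detected_signals": [
         {"signal_name": "Flood Warnings", "confidence": 0.65, "matched_keywords": ["flood"]},
         {"signal_name": "Rainfall Alerts", "confidence": 0.65, "matched_keywords": ["heavy rain"]},
     ]},
    {"title": "Market update", "source": "EconomyNext"},  # no detections
]

for row in flatten_signals(sample_articles):
    print(row)
```

The resulting list can be passed straight to `pd.DataFrame(...)` to count signals per source.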


In [9]:
# Import TensorFlow for future ML models
import tensorflow as tf
from tensorflow import keras

print(f"✅ TensorFlow {tf.__version__} imported")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")

# Text preprocessing for TensorFlow
def preprocess_text(text):
    """Basic text preprocessing"""
    if not text:
        return ""
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text.lower().strip()

# Preprocess text
if 'description' in df.columns:
    df['processed_text'] = df['description'].apply(preprocess_text)
elif 'text' in df.columns:
    df['processed_text'] = df['text'].apply(preprocess_text)

print("✅ Text preprocessing completed - ready for TensorFlow models!")


✅ TensorFlow 2.19.0 imported
GPU Available: True
✅ Text preprocessing completed - ready for TensorFlow models!
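
Before the cleaned text can feed a model, it has to become fixed-length integer sequences. The notebook does not specify how that will be done, so the following is only a plain-Python sketch of one common approach (a frequency-ranked vocabulary plus padding). `build_vocab`, `encode`, and the reserved ids `PAD` and `UNK` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words (assumption)

def build_vocab(texts: List[str], max_size: int = 10000) -> Dict[str, int]:
    """Map the most frequent whitespace-separated words to ids starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx + 2 for idx, (word, _) in enumerate(counts.most_common(max_size))}

def encode(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Turn text into exactly `length` ids, truncating or right-padding as needed."""
    ids = [vocab.get(word, UNK) for word in text.split()][:length]
    return ids + [PAD] * (length - len(ids))

# Invented examples of already-preprocessed text
texts = ["flood warning issued", "flood alert"]
vocab = build_vocab(texts)
print(vocab)
print(encode("flood warning today", vocab, 5))
```

Sequences of this shape can be wrapped in `tf.constant(...)` and passed to an embedding layer in a later step.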


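## Summary

✅ **The data collection pipeline ran end to end, with partial results in this run.**

- **Method 1 (Scraping)**: RSS feeds from Ada Derana and EconomyNext (0 and 20 articles returned, respectively)
- **Method 2 (API)**: Google Trends for Sri Lanka (the request returned a 404, so no trends were collected)
- **Method 3 (LLM Extraction)**: Mistral 7B via the Hugging Face Inference API (the request returned a 410, so no signals were extracted)
- **Signal Detection**: Keyword-based detection from the SSD, currently covering 10 of the 40 signals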

**Next Steps** (from Workflow.md):
- Step 3: NLP Preprocessing (SBERT embeddings, clustering)
- Step 4: Deep Learning Models (BERT, LSTM)
- Step 5: Model Training
- Step 6: Model Evaluation
