# Notebook 3: Yahoo Finance Message Board Scraping
## Social Media-Driven Stock Manipulation and Tail Risk Research

---

**Research Project:** Social Media-Driven Stock Manipulation and Tail Risk

**Purpose:** Scrape social media discussion data from Yahoo Finance message boards. Compute message volume baselines and identify social media bursts.

**Data Source:** Yahoo Finance Community/Conversations (`finance.yahoo.com/quote/{TICKER}/community`)

**Output:** 
- Message-level data (timestamp, username, text)
- Daily message volume aggregates
- Social media burst flags

**Ethics:**
- Respect robots.txt and rate limits
- Only scrape publicly visible content
- Anonymize usernames in final analysis
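
The scrapers later in this notebook hash usernames with unsalted, truncated MD5, which can be reversed by hashing a list of candidate usernames. As a minimal sketch of a stronger alternative, the block below uses a keyed hash (HMAC-SHA256). The function name `pseudonymize` and the `PROJECT_KEY` variable are illustrative assumptions, not part of the notebook.

```python
import hashlib
import hmac
import secrets


def pseudonymize(username: str, key: bytes, length: int = 16) -> str:
    """Return a keyed, truncated SHA-256 digest of a username."""
    if not username:
        return "anonymous"
    digest = hmac.new(key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical project key: generate once, store it outside the dataset,
# and reuse it so the same user maps to the same pseudonym across runs.
PROJECT_KEY = secrets.token_bytes(32)

print(pseudonymize("example_user", PROJECT_KEY))
```

Without the key, the pseudonyms cannot be recomputed from a list of known usernames, yet the same user still maps to one identifier, so per-user counts and concentration measures are unaffected.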

---

**Last Updated:** 2025

## 1. Environment Setup

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================

!pip install pandas==2.0.3
!pip install numpy==1.24.3
!pip install requests==2.31.0
!pip install beautifulsoup4==4.12.2
!pip install lxml==4.9.3
!pip install selenium==4.15.2
!pip install webdriver-manager==4.0.1
!pip install tqdm==4.66.1
!pip install pyarrow==14.0.1
!pip install fake-useragent==1.4.0

# For Colab: Install Chrome and ChromeDriver
!apt-get update
!apt-get install -y chromium-chromedriver

print("All packages installed successfully.")

In [None]:
# =============================================================================
# IMPORT LIBRARIES
# =============================================================================

import os
import re
import json
import time
import random
import hashlib
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple
from collections import defaultdict

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import requests
from bs4 import BeautifulSoup

# Selenium for dynamic content
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

try:
    from fake_useragent import UserAgent
    ua = UserAgent()
except Exception:  # fake-useragent missing or failed to initialize
    ua = None

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

print(f"Environment setup complete. Timestamp: {datetime.now()}")

## 2. Configuration

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

class ResearchConfig:
    """Configuration for social media scraping."""
    
    # Sample Period
    START_DATE = "2019-01-01"
    END_DATE = "2025-12-31"
    
    # Social Media Parameters
    ROLLING_WINDOW = 60  # days for baseline
    MIN_PERIODS = 20
    SOCIAL_ZSCORE_THRESHOLD = 3.0
    
    # Data Storage Paths
    BASE_PATH = "/content/drive/MyDrive/Research/PumpDump/"
    RAW_DATA_PATH = BASE_PATH + "data/raw/"
    PROCESSED_DATA_PATH = BASE_PATH + "data/processed/"
    
    # Scraping Rate Limits (BE POLITE!)
    MIN_DELAY = 3.0  # lower bound (seconds) of the randomized delay between tickers
    MAX_DELAY = 7.0  # upper bound (seconds) of the randomized delay between tickers
    MAX_PAGES_PER_TICKER = 50  # limit pages to scrape
    MAX_RETRIES = 3
    
    # User Agents (rotate to avoid detection)
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
    ]

config = ResearchConfig()

# Handle Colab vs local
try:
    from google.colab import drive
    drive.mount('/content/drive')
    IN_COLAB = True
except ImportError:
    print("Not running in Colab - using local paths")
    IN_COLAB = False
    config.BASE_PATH = "./research_data/"
    config.RAW_DATA_PATH = config.BASE_PATH + "data/raw/"
    config.PROCESSED_DATA_PATH = config.BASE_PATH + "data/processed/"

os.makedirs(config.RAW_DATA_PATH, exist_ok=True)
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)

In [None]:
# =============================================================================
# LOAD STOCK UNIVERSE
# =============================================================================

def load_universe(data_path: str) -> pd.DataFrame:
    """Load stock universe from Notebook 1 output."""
    universe_path = os.path.join(data_path, 'stock_universe.parquet')
    
    if os.path.exists(universe_path):
        universe = pd.read_parquet(universe_path)
        print(f"Loaded universe: {len(universe)} tickers")
    else:
        print("Universe file not found - creating sample universe")
        sample_tickers = [
            'GME', 'AMC', 'BB', 'NOK', 'BBBY', 'KOSS', 'CLOV', 'WISH',
            'PLTR', 'SPCE', 'TLRY', 'SNDL', 'MULN', 'FFIE'
        ]
        universe = pd.DataFrame({
            'ticker': sample_tickers,
            'is_confirmed_manipulation': [False] * len(sample_tickers),
            'source': ['sample'] * len(sample_tickers)
        })
    
    return universe

universe_df = load_universe(config.PROCESSED_DATA_PATH)
print(f"\nTickers to scrape: {len(universe_df)}")

## 3. Yahoo Finance Message Board Scraper

### 3.1 Scraper Implementation

**IMPORTANT NOTES:**
- Yahoo Finance message boards use dynamic JavaScript loading
- We use Selenium for JavaScript rendering
- Always respect rate limits and robots.txt
- Only scrape publicly visible content
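
The robots.txt requirement can be checked in code before any page is fetched. The sketch below uses the standard library's `urllib.robotparser`. The helper name `is_allowed`, the `research-bot` user agent, and the robots.txt text are made up for illustration and do not reflect Yahoo's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; not the real file of any site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "research-bot",
                 "https://example.com/quote/GME/community"))
print(is_allowed(EXAMPLE_ROBOTS, "research-bot",
                 "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string. The check could then run once per session before the scraping loop starts.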

In [None]:
# =============================================================================
# YAHOO MESSAGE BOARD SCRAPER
# =============================================================================

class YahooMessageBoardScraper:
    """Scrapes Yahoo Finance message boards/conversations.
    
    Yahoo Finance community pages require JavaScript rendering.
    Uses Selenium with headless Chrome for scraping.
    
    IMPORTANT: This scraper respects rate limits and only collects
    publicly available information. Usernames are hashed for privacy.
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
        self.driver = None
        self.messages_collected = []
        self.failed_tickers = []
        
    def _get_random_user_agent(self) -> str:
        """Get a random user agent."""
        if ua:
            return ua.random
        return random.choice(self.config.USER_AGENTS)
    
    def _rate_limit(self):
        """Implement polite rate limiting."""
        delay = random.uniform(self.config.MIN_DELAY, self.config.MAX_DELAY)
        time.sleep(delay)
    
    def _hash_username(self, username: str) -> str:
        """Hash username for privacy."""
        if not username:
            return 'anonymous'
        return hashlib.md5(username.encode()).hexdigest()[:12]
    
    def setup_driver(self):
        """Initialize Selenium WebDriver."""
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument(f'user-agent={self._get_random_user_agent()}')
        chrome_options.add_argument('--window-size=1920,1080')
        
        # Disable images for faster loading
        prefs = {'profile.managed_default_content_settings.images': 2}
        chrome_options.add_experimental_option('prefs', prefs)
        
        try:
            # For Colab
            self.driver = webdriver.Chrome(options=chrome_options)
        except Exception as e:
            print(f"Error initializing driver: {e}")
            print("Trying with webdriver-manager...")
            from webdriver_manager.chrome import ChromeDriverManager
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=chrome_options)
        
        self.driver.set_page_load_timeout(30)
        print("WebDriver initialized")
    
    def close_driver(self):
        """Close the WebDriver."""
        if self.driver:
            self.driver.quit()
            self.driver = None
    
    def scrape_ticker_conversations(self, ticker: str, 
                                     max_scrolls: int = 20) -> List[Dict]:
        """Scrape conversations for a single ticker.
        
        Args:
            ticker: Stock ticker symbol
            max_scrolls: Maximum page scrolls for lazy loading
            
        Returns:
            List of message dictionaries
        """
        messages = []
        url = f"https://finance.yahoo.com/quote/{ticker}/community"
        
        try:
            self.driver.get(url)
            time.sleep(3)  # Wait for initial load
            
            # Scroll to load more content (lazy loading)
            for scroll in range(max_scrolls):
                # Scroll down
                self.driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);"
                )
                time.sleep(1.5)  # Wait for content to load
                
                # Stop early once the page shows an end-of-thread marker
                try:
                    end_marker = self.driver.find_elements(
                        By.XPATH, 
                        "//div[contains(text(), 'No more comments')]|//div[contains(text(), 'End of')]"
                    )
                    if end_marker:
                        break
                except Exception:
                    # A failed lookup should not abort scrolling
                    pass
            
            # Parse page content
            soup = BeautifulSoup(self.driver.page_source, 'lxml')
            
            # Yahoo Finance conversation structure (may change - inspect page)
            # Common selectors to try:
            comment_selectors = [
                'div[data-test="comment-content"]',
                'li[class*="comment"]',
                'div[class*="Comment"]',
                'article[class*="comment"]',
                'div[class*="conversation"]'
            ]
            
            comments = []
            for selector in comment_selectors:
                comments = soup.select(selector)
                if comments:
                    break
            
            # Extract message details
            for comment in comments:
                try:
                    # Extract text content
                    text_elem = comment.find(['p', 'div', 'span'], 
                                             class_=lambda x: x and 'content' in x.lower() if x else False)
                    if not text_elem:
                        text_elem = comment
                    text = text_elem.get_text(strip=True)
                    
                    # Extract timestamp
                    time_elem = comment.find(['time', 'span'], 
                                             attrs={'datetime': True})
                    if time_elem:
                        timestamp = time_elem.get('datetime')
                    else:
                        # Try to find relative time
                        time_text = comment.find(string=re.compile(r'\d+\s*(hour|day|week|month|min)'))
                        timestamp = str(time_text) if time_text else None
                    
                    # Extract username
                    user_elem = comment.find(['a', 'span'], 
                                             class_=lambda x: x and ('author' in x.lower() or 'user' in x.lower()) if x else False)
                    username = user_elem.get_text(strip=True) if user_elem else 'anonymous'
                    
                    # Extract reaction counts if available
                    likes_elem = comment.find(string=re.compile(r'^\d+$'))
                    likes = int(likes_elem) if likes_elem else 0
                    
                    if text and len(text) > 5:  # Filter out very short/empty posts
                        messages.append({
                            'ticker': ticker,
                            'timestamp_raw': timestamp,
                            'username_hash': self._hash_username(username),
                            'text': text[:5000],  # Truncate very long posts
                            'likes': likes,
                            'scrape_time': datetime.now().isoformat()
                        })
                        
                except Exception:
                    # Skip comments whose markup does not match the expected structure
                    continue
            
        except TimeoutException:
            print(f"  Timeout loading {ticker}")
        except Exception as e:
            print(f"  Error scraping {ticker}: {e}")
        
        return messages
    
    def scrape_all_tickers(self, tickers: List[str], 
                           checkpoint_every: int = 10) -> pd.DataFrame:
        """Scrape message boards for all tickers.
        
        Args:
            tickers: List of ticker symbols
            checkpoint_every: Save checkpoint every N tickers
            
        Returns:
            DataFrame with all messages
        """
        print(f"Scraping {len(tickers)} tickers")
        print("IMPORTANT: This process respects rate limits and will take time.")
        print("="*60)
        
        if not self.driver:
            self.setup_driver()
        
        all_messages = []
        
        for i, ticker in enumerate(tqdm(tickers, desc="Scraping tickers")):
            print(f"\n  [{i+1}/{len(tickers)}] Scraping {ticker}...")
            
            # Scrape messages
            messages = self.scrape_ticker_conversations(ticker)
            all_messages.extend(messages)
            
            print(f"    Found {len(messages)} messages")
            
            # Rate limiting
            self._rate_limit()
            
            # Checkpoint save
            if (i + 1) % checkpoint_every == 0:
                checkpoint_df = pd.DataFrame(all_messages)
                checkpoint_path = os.path.join(
                    self.config.RAW_DATA_PATH, 
                    f'messages_checkpoint_{i+1}.parquet'
                )
                checkpoint_df.to_parquet(checkpoint_path, index=False)
                print(f"  Checkpoint saved: {checkpoint_path}")
        
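        self.close_driver()
        
        df = pd.DataFrame(all_messages)
        
        print("\n" + "="*60)
        print("SCRAPING COMPLETE")
        print(f"Total messages: {len(df):,}")
        # An empty DataFrame has no 'ticker' column, so guard the lookup
        n_tickers = df['ticker'].nunique() if not df.empty else 0
        print(f"Unique tickers with messages: {n_tickers}")
        
        return df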


# Initialize scraper
message_scraper = YahooMessageBoardScraper(config)
print("Yahoo Message Board Scraper initialized")

### 3.2 Alternative: Simpler Request-Based Scraper

If Selenium is too slow or unreliable, this simpler approach uses requests + BeautifulSoup.
Note: This may not capture all content due to JavaScript loading.
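
`ResearchConfig` defines `MAX_RETRIES`, but neither scraper in this notebook uses it, so a single failed request loses that ticker. As a rough sketch of how retries could wrap any fetch call, the helper below retries with exponential backoff plus jitter. The name `fetch_with_retries` and its parameters are assumptions for illustration, not notebook code.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T],
                       max_retries: int = 3,
                       base_delay: float = 3.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fetch() up to max_retries times, backing off between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_retries:
                print(f"Giving up after {attempt} attempts: {exc}")
                return None
            # Delay doubles each attempt; jitter avoids a fixed request rhythm
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            sleep(delay)
    return None
```

A call such as `fetch_with_retries(lambda: session.get(url, timeout=30), max_retries=config.MAX_RETRIES)` would be one way to use it. The `sleep` argument is injectable so the logic can be tested without waiting.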

In [None]:
# =============================================================================
# SIMPLE REQUEST-BASED SCRAPER (FALLBACK)
# =============================================================================

class SimpleYahooScraper:
    """Simplified scraper using requests only.
    
    This is faster but may miss dynamically loaded content.
    Use as fallback when Selenium is not available.
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
        self.session = requests.Session()
        
    def _get_headers(self) -> Dict:
        return {
            'User-Agent': random.choice(self.config.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
        }
    
    def _rate_limit(self):
        time.sleep(random.uniform(self.config.MIN_DELAY, self.config.MAX_DELAY))
    
    def _hash_username(self, username: str) -> str:
        if not username:
            return 'anonymous'
        return hashlib.md5(username.encode()).hexdigest()[:12]
    
    def scrape_ticker(self, ticker: str) -> List[Dict]:
        """Scrape basic page content for a ticker."""
        messages = []
        url = f"https://finance.yahoo.com/quote/{ticker}/community"
        
        try:
            response = self.session.get(url, headers=self._get_headers(), timeout=30)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.content, 'lxml')
            
            # Try to extract any visible messages
            # Note: Most content requires JavaScript
            text_blocks = soup.find_all(['p', 'div'], 
                                        class_=lambda x: x and 'comment' in x.lower() if x else False)
            
            for block in text_blocks:
                text = block.get_text(strip=True)
                if text and len(text) > 10:
                    messages.append({
                        'ticker': ticker,
                        'timestamp_raw': None,
                        'username_hash': 'unknown',
                        'text': text[:5000],
                        'likes': 0,
                        'scrape_time': datetime.now().isoformat()
                    })
                    
        except Exception as e:
            print(f"Error scraping {ticker}: {e}")
        
        self._rate_limit()
        return messages


# Alternative scraper instance
simple_scraper = SimpleYahooScraper(config)
print("Simple Yahoo Scraper initialized (fallback)")

## 4. Execute Scraping

**WARNING:** This section will take significant time due to rate limiting.
Consider running overnight or in batches.
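
To plan batches, the run time can be bounded from the waits built into the Selenium scraper: a 3 s initial load wait, up to 20 scrolls at 1.5 s each, and a randomized 3-7 s delay between tickers. The sketch below turns those values into a rough range. The function name is hypothetical, and page-load time itself is not included.

```python
from typing import Tuple


def estimate_runtime_minutes(n_tickers: int,
                             initial_wait: float = 3.0,
                             max_scrolls: int = 20,
                             scroll_wait: float = 1.5,
                             min_delay: float = 3.0,
                             max_delay: float = 7.0) -> Tuple[float, float]:
    """Return (best_case, worst_case) run time in minutes, excluding page loads."""
    # Best case: the end marker appears after the first scroll
    best = initial_wait + scroll_wait + min_delay
    # Worst case: every scroll is used and the delay hits its upper bound
    worst = initial_wait + max_scrolls * scroll_wait + max_delay
    return n_tickers * best / 60, n_tickers * worst / 60


low, high = estimate_runtime_minutes(20)
print(f"20 tickers: {low:.1f} to {high:.1f} minutes")
```

With these defaults the per-ticker cost ranges from 7.5 s to 40 s, so the scroll limit, not the inter-ticker delay, dominates the worst case.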

In [None]:
# =============================================================================
# EXECUTE SCRAPING (CAUTION: TIME-CONSUMING)
# =============================================================================

# Get tickers to scrape (limit for demo)
tickers_to_scrape = universe_df['ticker'].unique().tolist()

# For demo purposes, keep only the first MAX_TICKERS_DEMO tickers (universe order).
# Remove this limit for full research
MAX_TICKERS_DEMO = 20
if len(tickers_to_scrape) > MAX_TICKERS_DEMO:
    print(f"Limiting to {MAX_TICKERS_DEMO} tickers for demonstration")
    print("Remove MAX_TICKERS_DEMO limit for full research")
    tickers_to_scrape = tickers_to_scrape[:MAX_TICKERS_DEMO]

print(f"\nTickers to scrape: {len(tickers_to_scrape)}")
print(tickers_to_scrape)

In [None]:
# =============================================================================
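# RUN SCRAPING
# =============================================================================

# WARNING: This will take a long time!
# Per ticker, the Selenium scraper waits 3 s for the initial load, up to
# 20 scrolls x 1.5 s, plus a 3-7 s randomized delay: roughly 8-40 s,
# excluding page-load time.
# For 20 tickers: up to ~13 minutes
# For 200 tickers: up to ~2.2 hours

print("Starting message board scraping...")
worst_case_seconds = 3 + 20 * 1.5 + config.MAX_DELAY
print(f"Estimated time: up to {len(tickers_to_scrape) * worst_case_seconds / 60:.1f} minutes")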
print("="*60)

try:
    # Try Selenium scraper first
    messages_df = message_scraper.scrape_all_tickers(tickers_to_scrape)
except Exception as e:
    print(f"Selenium scraper failed: {e}")
    print("Falling back to simple scraper...")
    
    # Fallback to simple scraper
    all_messages = []
    for ticker in tqdm(tickers_to_scrape, desc="Scraping (simple)"):
        messages = simple_scraper.scrape_ticker(ticker)
        all_messages.extend(messages)
    messages_df = pd.DataFrame(all_messages)

print(f"\nScraping complete. Total messages: {len(messages_df):,}")

In [None]:
# =============================================================================
# CREATE SYNTHETIC DATA FOR DEMONSTRATION
# =============================================================================

# If scraping failed or for demonstration, create synthetic data
# This simulates the structure of scraped message data

def create_synthetic_messages(tickers: List[str], 
                               start_date: str, 
                               end_date: str,
                               base_messages_per_day: int = 5) -> pd.DataFrame:
    """Create synthetic message data for demonstration.
    
    This generates realistic-looking message patterns including:
    - Normal baseline activity
    - Burst periods with high message volume
    - User concentration patterns
    """
    np.random.seed(42)
    messages = []
    
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    
    # Promotional message templates
    promo_templates = [
        "Buy now before it's too late! {} is about to moon!",
        "{} to the moon! Get in while you can!",
        "This is the next GME! {} squeeze incoming!",
        "Easy money on {}. Don't miss this!",
        "{} rocket launching soon!",
        "Huge gains coming on {}. Last chance!"
    ]
    
    # Normal message templates
    normal_templates = [
        "What do you think about {} earnings?",
        "Holding {} long term.",
        "Anyone else watching {}?",
        "Good entry point for {}?",
        "{} looks oversold.",
        "DD on {}: fundamentals look solid."
    ]
    
    for ticker in tickers:
        # Generate burst periods (2-4 bursts per ticker, each lasting 2-4 days)
        num_bursts = np.random.randint(2, 5)
        burst_starts = np.random.choice(len(date_range) - 10, num_bursts, replace=False)
        burst_periods = set()
        for start_idx in burst_starts:
            for i in range(np.random.randint(2, 5)):
                if start_idx + i < len(date_range):
                    burst_periods.add(start_idx + i)
        
        for day_idx, date in enumerate(date_range):
            # Determine message count
            if day_idx in burst_periods:
                # Burst period: 5-19x normal volume (randint excludes the upper bound)
                msg_count = np.random.poisson(base_messages_per_day * np.random.randint(5, 20))
                promo_ratio = 0.6  # Higher promotional content during bursts
            else:
                # Normal period
                msg_count = np.random.poisson(base_messages_per_day)
                promo_ratio = 0.1
            
            # Generate messages
            for _ in range(msg_count):
                # Choose template
                is_promo = np.random.random() < promo_ratio
                if is_promo:
                    template = np.random.choice(promo_templates)
                else:
                    template = np.random.choice(normal_templates)
                
                text = template.format(ticker)
                
                # User ID (concentrated during bursts)
                if day_idx in burst_periods:
                    # Few users dominate during pumps
                    user_id = np.random.choice(10)  # Only 10 active users
                else:
                    user_id = np.random.randint(0, 100)
                
                messages.append({
                    'ticker': ticker,
                    'date': date.date(),
                    'timestamp_raw': date + timedelta(hours=np.random.randint(9, 16)),
                    'username_hash': hashlib.md5(f"user_{user_id}".encode()).hexdigest()[:12],
                    'text': text,
                    'is_promotional': is_promo,
                    'likes': np.random.poisson(3) if is_promo else np.random.poisson(1)
                })
    
    return pd.DataFrame(messages)


# Check if we have scraped data, otherwise create synthetic
if 'messages_df' not in globals() or len(messages_df) == 0:
    print("Creating synthetic message data for demonstration...")
    messages_df = create_synthetic_messages(
        tickers=tickers_to_scrape,
        start_date='2020-01-01',
        end_date='2023-12-31',
        base_messages_per_day=3
    )
    print(f"Created {len(messages_df):,} synthetic messages")

print(f"\nMessages DataFrame:")
print(messages_df.head(10))

## 5. Compute Social Media Metrics
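
Burst detection in this section standardizes each day's message count against a rolling baseline and flags days whose z-score exceeds a threshold. The sketch below shows the arithmetic with the standard library only. Note one deliberate difference: it excludes the current day from its own baseline (a lagged window), whereas the calculator below includes it. The function name `lagged_zscores` is an assumption for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def lagged_zscores(counts: List[float],
                   window: int = 60,
                   min_periods: int = 20) -> List[Optional[float]]:
    """Z-score of each value against the preceding `window` values."""
    zscores: List[Optional[float]] = []
    for i, value in enumerate(counts):
        history = counts[max(0, i - window):i]  # excludes day i itself
        if len(history) < max(min_periods, 2):
            zscores.append(None)  # not enough history for a baseline
            continue
        mu, sigma = mean(history), stdev(history)  # sample std (ddof=1)
        zscores.append((value - mu) / sigma if sigma > 0 else None)
    return zscores


daily_counts = [4, 6, 5, 5, 4, 6, 5, 5, 40]
z = lagged_zscores(daily_counts, window=8, min_periods=4)
print([None if v is None else round(v, 2) for v in z])
```

Excluding the current day keeps a large spike from inflating the mean and standard deviation it is compared against, which would otherwise dampen its own z-score.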

### 5.1 Aggregate to Daily Level
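
Daily aggregation needs an absolute date for every message, but the Selenium scraper may store a relative string such as "2 hours ago" in `timestamp_raw` when no `datetime` attribute is present. As a minimal sketch, the helper below resolves both forms against the recorded scrape time. The name `resolve_timestamp` is hypothetical, and a month is approximated as 30 days.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Seconds per unit; "month" is an approximation (30 days)
_UNIT_SECONDS = {"min": 60, "hour": 3600, "day": 86400,
                 "week": 604800, "month": 2592000}
_RELATIVE = re.compile(r"(\d+)\s*(min|hour|day|week|month)")


def resolve_timestamp(raw: Optional[str],
                      scrape_time: datetime) -> Optional[datetime]:
    """Convert an ISO or relative timestamp string to a datetime."""
    if not raw:
        return None
    try:
        # Older fromisoformat() versions reject a trailing "Z"
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        pass
    match = _RELATIVE.search(raw.lower())
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return scrape_time - timedelta(seconds=amount * _UNIT_SECONDS[unit])
    return None


scraped_at = datetime(2024, 1, 15, 12, 0, 0)
print(resolve_timestamp("2 hours ago", scraped_at))
print(resolve_timestamp("2024-01-15T10:30:00Z", scraped_at))
```

Relative timestamps are only as precise as their unit, so a "3 weeks ago" message could fall on the wrong day. ISO values come back timezone-aware while relative ones inherit the timezone of `scrape_time`, so they should be normalized to one convention before grouping by date.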

In [None]:
# =============================================================================
# SOCIAL MEDIA METRICS CALCULATOR
# =============================================================================

class SocialMetricsCalculator:
    """Computes social media metrics for episode detection.
    
    Metrics computed:
    - Daily message count
    - Unique users per day
    - User concentration (Gini coefficient)
    - Promotional message share
    - Rolling baselines and z-scores
    """
    
    # Keywords indicating promotional content
    PROMO_KEYWORDS = [
        'buy now', 'get in', 'moon', 'rocket', 'to the moon',
        'guaranteed', 'easy money', 'next gme', 'squeeze', 'huge gains',
        'dont miss', "don't miss", 'last chance', 'about to explode',
        'yolo', 'all in', 'going viral', 'insider', 'manipulation',
        'short squeeze', 'gamma squeeze', 'diamond hands', 'hold the line'
    ]
    
    def __init__(self, window: int = 60, min_periods: int = 20):
        self.window = window
        self.min_periods = min_periods
    
    def classify_promotional(self, text: str) -> bool:
        """Classify if a message is promotional."""
        if not text or not isinstance(text, str):
            return False
        text_lower = text.lower()
        return any(kw in text_lower for kw in self.PROMO_KEYWORDS)
    
    def compute_gini(self, values: List[int]) -> float:
        """Compute Gini coefficient for user concentration.
        
        Gini = 0: Perfect equality (all users post equally)
        Gini = 1: Perfect inequality (one user dominates)
        """
        if not values or len(values) == 0:
            return np.nan
        
        values = np.array(values)
        values = values[values > 0]  # Remove zeros
        
        if len(values) == 0:
            return np.nan
        
        sorted_values = np.sort(values)
        n = len(sorted_values)
        cumsum = np.cumsum(sorted_values)
        gini = (2 * np.sum((np.arange(1, n+1) * sorted_values))) / (n * cumsum[-1]) - (n + 1) / n
        
        return max(0, min(1, gini))  # Clamp to [0, 1]
    
    def aggregate_daily(self, messages_df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate message-level data to daily ticker-level."""
        df = messages_df.copy()
        
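        # Ensure date column exists
        if 'date' not in df.columns:
            if 'timestamp_raw' in df.columns:
                # Scraped timestamps may be missing or relative ("2 hours ago");
                # coerce unparseable values to NaT and drop those rows
                parsed = pd.to_datetime(df['timestamp_raw'], errors='coerce', utc=True)
                df = df.loc[parsed.notna()].copy()
                df['date'] = parsed.loc[df.index].dt.date
            else:
                raise ValueError("No date or timestamp column found")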
        
        # Classify promotional messages
        if 'is_promotional' not in df.columns:
            df['is_promotional'] = df['text'].apply(self.classify_promotional)
        
        # Aggregate
        daily_agg = df.groupby(['ticker', 'date']).agg(
            msg_count=('text', 'count'),
            unique_users=('username_hash', 'nunique'),
            promo_count=('is_promotional', 'sum'),
            total_likes=('likes', 'sum')
        ).reset_index()
        
        # Compute promotional share
        daily_agg['promo_share'] = daily_agg['promo_count'] / daily_agg['msg_count']
        daily_agg['promo_share'] = daily_agg['promo_share'].fillna(0)
        
        # Compute user concentration per day
        user_counts = df.groupby(['ticker', 'date', 'username_hash']).size().reset_index(name='user_msg_count')
        
        gini_scores = []
        for (ticker, date), group in user_counts.groupby(['ticker', 'date']):
            gini = self.compute_gini(group['user_msg_count'].tolist())
            gini_scores.append({'ticker': ticker, 'date': date, 'user_concentration': gini})
        
        gini_df = pd.DataFrame(gini_scores)
        daily_agg = daily_agg.merge(gini_df, on=['ticker', 'date'], how='left')
        
        return daily_agg
    
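    def compute_rolling_baselines(self, daily_df: pd.DataFrame) -> pd.DataFrame:
        """Compute rolling baseline statistics.
        
        Windows count rows per ticker, i.e. days with at least one message,
        because days without messages are absent from the daily aggregate.
        """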
        df = daily_df.copy()
        df = df.sort_values(['ticker', 'date'])
        
        # Rolling mean and std for message count
        df['msg_mean'] = df.groupby('ticker')['msg_count'].transform(
            lambda x: x.rolling(window=self.window, min_periods=self.min_periods).mean()
        )
        df['msg_std'] = df.groupby('ticker')['msg_count'].transform(
            lambda x: x.rolling(window=self.window, min_periods=self.min_periods).std()
        )
        df['msg_median'] = df.groupby('ticker')['msg_count'].transform(
            lambda x: x.rolling(window=self.window, min_periods=self.min_periods).median()
        )
        
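        # Z-score for message volume (a zero std would yield inf, so treat it as missing)
        df['msg_zscore'] = (df['msg_count'] - df['msg_mean']) / df['msg_std'].replace(0, np.nan)
        
        # Message ratio to median (guard against a zero median)
        df['msg_ratio'] = df['msg_count'] / df['msg_median'].replace(0, np.nan)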
        
        # Rolling promotional share baseline
        df['promo_share_mean'] = df.groupby('ticker')['promo_share'].transform(
            lambda x: x.rolling(window=self.window, min_periods=self.min_periods).mean()
        )
        
        return df
    
    def flag_social_bursts(self, daily_df: pd.DataFrame, 
                           threshold: float = 3.0) -> pd.DataFrame:
        """Flag days with social media burst activity."""
        df = daily_df.copy()
        
        # Primary burst condition: message z-score > threshold
        df['is_social_burst'] = df['msg_zscore'] > threshold
        
        # Alternative conditions for robustness
        df['is_volume_burst'] = df['msg_ratio'] > 3.0  # 3x median
        df['is_concentrated'] = df['user_concentration'] > 0.5  # High Gini
        df['is_promo_heavy'] = df['promo_share'] > 0.3  # >30% promotional
        
        return df
    
    def process_all(self, messages_df: pd.DataFrame, 
                    zscore_threshold: float = 3.0) -> pd.DataFrame:
        """Run complete social metrics pipeline."""
        print("="*60)
        print("COMPUTING SOCIAL MEDIA METRICS")
        print("="*60)
        
        # Step 1: Aggregate to daily
        print("Aggregating to daily level...")
        daily = self.aggregate_daily(messages_df)
        print(f"  Daily observations: {len(daily):,}")
        
        # Step 2: Rolling baselines
        print("Computing rolling baselines...")
        daily = self.compute_rolling_baselines(daily)
        
        # Step 3: Flag bursts
        print("Flagging social bursts...")
        daily = self.flag_social_bursts(daily, threshold=zscore_threshold)
        
        # Summary
        burst_count = daily['is_social_burst'].sum()
        print(f"\nSocial bursts detected: {burst_count}")
        print(f"Burst rate: {100*burst_count/len(daily):.2f}%")
        
        return daily


# Initialize calculator
social_calc = SocialMetricsCalculator(
    window=config.ROLLING_WINDOW,
    min_periods=config.MIN_PERIODS
)
print("Social Metrics Calculator initialized")

In [None]:
# =============================================================================
# COMPUTE SOCIAL METRICS
# =============================================================================

# Process messages
daily_social = social_calc.process_all(
    messages_df, 
    zscore_threshold=config.SOCIAL_ZSCORE_THRESHOLD
)

print("\nDaily Social Data Sample:")
print(daily_social.head(10))

print("\nSocial Metrics Summary:")
print(daily_social[['msg_count', 'msg_zscore', 'unique_users', 
                    'user_concentration', 'promo_share']].describe())

## 6. Visualizations

In [None]:
# =============================================================================
# VISUALIZATIONS
# =============================================================================

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')

def plot_social_distributions(daily_df: pd.DataFrame):
    """Plot distributions of social metrics."""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Message count distribution
    ax1 = axes[0, 0]
    data = daily_df['msg_count'].dropna()
    data = data[data.between(0, data.quantile(0.99))]
    ax1.hist(data, bins=50, edgecolor='black', alpha=0.7)
    ax1.set_xlabel('Daily Message Count')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Distribution of Daily Message Counts')
    
    # Message z-score distribution
    ax2 = axes[0, 1]
    data = daily_df['msg_zscore'].dropna()
    data = data[data.between(-5, 10)]
    ax2.hist(data, bins=50, edgecolor='black', alpha=0.7, color='orange')
    ax2.axvline(x=config.SOCIAL_ZSCORE_THRESHOLD, color='red', linestyle='--', 
                label=f'Threshold ({config.SOCIAL_ZSCORE_THRESHOLD})')
    ax2.set_xlabel('Message Z-Score')
    ax2.set_ylabel('Frequency')
    ax2.set_title('Distribution of Message Z-Scores')
    ax2.legend()
    
    # User concentration (Gini)
    ax3 = axes[1, 0]
    data = daily_df['user_concentration'].dropna()
    ax3.hist(data, bins=50, edgecolor='black', alpha=0.7, color='green')
    ax3.axvline(x=0.5, color='red', linestyle='--', label='Concentration threshold')
    ax3.set_xlabel('User Concentration (Gini)')
    ax3.set_ylabel('Frequency')
    ax3.set_title('Distribution of User Concentration')
    ax3.legend()
    
    # Promotional share
    ax4 = axes[1, 1]
    data = daily_df['promo_share'].dropna()
    ax4.hist(data, bins=50, edgecolor='black', alpha=0.7, color='purple')
    ax4.axvline(x=0.3, color='red', linestyle='--', label='High promo threshold')
    ax4.set_xlabel('Promotional Share')
    ax4.set_ylabel('Frequency')
    ax4.set_title('Distribution of Promotional Content Share')
    ax4.legend()
    
    plt.tight_layout()
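    # Make sure the output directory exists before saving the figure
    os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)
    plt.savefig(os.path.join(config.PROCESSED_DATA_PATH, 'social_distributions.png'), dpi=150)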
    plt.show()


def plot_ticker_social_activity(daily_df: pd.DataFrame, ticker: Optional[str] = None):
    """Plot social activity for a specific ticker."""
    if ticker is None:
        # Pick ticker with most bursts
        burst_counts = daily_df.groupby('ticker')['is_social_burst'].sum()
        if burst_counts.max() > 0:
            ticker = burst_counts.idxmax()
        else:
            ticker = daily_df['ticker'].iloc[0]
    
    ticker_data = daily_df[daily_df['ticker'] == ticker].copy()
    ticker_data['date'] = pd.to_datetime(ticker_data['date'])
    ticker_data = ticker_data.sort_values('date')
    
    fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)
    
    # Message count
    ax1 = axes[0]
    ax1.bar(ticker_data['date'], ticker_data['msg_count'], alpha=0.5, color='blue', label='Messages')
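    # Label the baseline with the configured window instead of a hard-coded 60 days
    ax1.plot(ticker_data['date'], ticker_data['msg_mean'], color='red', linewidth=2,
             label=f'{config.ROLLING_WINDOW}d Mean')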
    burst_days = ticker_data[ticker_data['is_social_burst']]
    ax1.scatter(burst_days['date'], burst_days['msg_count'], color='red', s=100, marker='^', 
                label='Social Burst', zorder=5)
    ax1.set_ylabel('Message Count')
    ax1.set_title(f'{ticker} - Daily Message Activity')
    ax1.legend()
    
    # Message z-score
    ax2 = axes[1]
    ax2.plot(ticker_data['date'], ticker_data['msg_zscore'], color='green', linewidth=1)
    ax2.axhline(y=config.SOCIAL_ZSCORE_THRESHOLD, color='red', linestyle='--', 
                label=f'Threshold ({config.SOCIAL_ZSCORE_THRESHOLD})')
    ax2.scatter(burst_days['date'], burst_days['msg_zscore'], color='red', s=100, marker='^', zorder=5)
    ax2.set_ylabel('Message Z-Score')
    ax2.set_title(f'{ticker} - Message Volume Z-Score')
    ax2.legend()
    
    # Promotional share
    ax3 = axes[2]
    ax3.plot(ticker_data['date'], ticker_data['promo_share'], color='purple', linewidth=1)
    ax3.axhline(y=0.3, color='red', linestyle='--', label='High promo threshold')
    ax3.scatter(burst_days['date'], burst_days['promo_share'], color='red', s=100, marker='^', zorder=5)
    ax3.set_ylabel('Promotional Share')
    ax3.set_xlabel('Date')
    ax3.set_title(f'{ticker} - Promotional Content Share')
    ax3.legend()
    
    plt.tight_layout()
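    # Make sure the output directory exists before saving the figure
    os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)
    plt.savefig(os.path.join(config.PROCESSED_DATA_PATH, f'social_activity_{ticker}.png'), dpi=150)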
    plt.show()
    
    return ticker


# Generate visualizations
print("Generating social metrics visualizations...")
plot_social_distributions(daily_social)
example_ticker = plot_ticker_social_activity(daily_social)
print(f"\nExample ticker plotted: {example_ticker}")
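
The user concentration histogram above is easier to read with the Gini calculation in mind. The sketch below applies the standard sorted-values formula to per-user message counts for a single day. It is an illustration only: `gini` is a hypothetical helper, and the notebook's own `user_concentration` calculation may differ in detail.

```python
import numpy as np

def gini(counts) -> float:
    """Gini coefficient of non-negative counts (0 = evenly spread, near 1 = concentrated)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0  # no users or no messages: treat as no concentration
    ranks = np.arange(1, n + 1)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

# Four users posting equally vs. one user posting 17 of 20 messages
even_day = gini([5, 5, 5, 5])
skewed_day = gini([1, 1, 1, 17])
print(round(even_day, 3), round(skewed_day, 3))
```

In this toy case the evenly spread day scores 0 and the skewed day scores about 0.6, which would sit above the 0.5 line drawn on the histogram.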

## 7. Save Outputs

In [None]:
# =============================================================================
# SAVE OUTPUTS
# =============================================================================

def save_social_data(messages_df: pd.DataFrame, 
                     daily_df: pd.DataFrame,
                     output_dir: str):
    """Save social media data outputs."""
    os.makedirs(output_dir, exist_ok=True)
    
    # Save raw messages (usernames are expected to be hashed upstream; no hashing happens here)
    messages_path = os.path.join(output_dir, 'yahoo_messages_raw.parquet')
    messages_df.to_parquet(messages_path, index=False)
    print(f"Saved raw messages: {messages_path}")
    
    # Save daily aggregates
    daily_path = os.path.join(output_dir, 'daily_social_metrics.parquet')
    daily_df.to_parquet(daily_path, index=False)
    print(f"Saved daily metrics: {daily_path}")
    
    # Save burst events only
    bursts = daily_df[daily_df['is_social_burst']].copy()
    bursts_path = os.path.join(output_dir, 'social_burst_events.parquet')
    bursts.to_parquet(bursts_path, index=False)
    print(f"Saved burst events: {bursts_path}")
    
    # Save summary
    summary = {
        'total_messages': len(messages_df),
        'daily_observations': len(daily_df),
        'unique_tickers': int(messages_df['ticker'].nunique()),
        'date_range': [str(daily_df['date'].min()), str(daily_df['date'].max())],
        'social_bursts': int(daily_df['is_social_burst'].sum()),
        'burst_rate': float(daily_df['is_social_burst'].mean()),
        'tickers_with_bursts': int(daily_df[daily_df['is_social_burst']]['ticker'].nunique()),
        'avg_messages_per_day': float(daily_df['msg_count'].mean()),
        'avg_promo_share': float(daily_df['promo_share'].mean()),
        'config': {
            'rolling_window': config.ROLLING_WINDOW,
            'zscore_threshold': config.SOCIAL_ZSCORE_THRESHOLD
        },
        'created_at': datetime.now().isoformat()
    }
    
    summary_path = os.path.join(output_dir, 'notebook03_summary.json')
    with open(summary_path, 'w', encoding='utf-8') as f:
        json.dump(summary, f, indent=2)
    print(f"Saved summary: {summary_path}")
    
    return summary


# Save outputs
output_summary = save_social_data(
    messages_df=messages_df,
    daily_df=daily_social,
    output_dir=config.PROCESSED_DATA_PATH
)

print("\n" + "="*60)
print("Output Summary:")
print(json.dumps(output_summary, indent=2))
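
Because the raw message file is meant to contain hashed usernames, it may help to see what such a transform could look like. This is a minimal sketch, assuming a salted SHA-256 digest truncated to 16 hex characters. The function name `hash_username`, the salt value, and the truncation length are illustrative assumptions, not necessarily what the scraper earlier in this notebook does.

```python
import hashlib

def hash_username(username: str, salt: str = "replace-with-project-secret") -> str:
    """Return a stable pseudonym for a username (hypothetical helper)."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    # Truncation keeps the column compact; 16 hex characters = 64 bits
    return digest[:16]

# The same input always maps to the same pseudonym; different inputs differ
alias_a = hash_username("example_user")
alias_b = hash_username("example_user")
alias_c = hash_username("another_user")
print(alias_a == alias_b, alias_a != alias_c)
```

Keeping the salt out of the saved outputs makes it harder to reverse a pseudonym by hashing guessed usernames, while per-user counts and concentration metrics still work on the hashed column.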

## 8. Summary and Next Steps

In [None]:
# =============================================================================
# NOTEBOOK 3 SUMMARY
# =============================================================================

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║          NOTEBOOK 3: SOCIAL MEDIA SCRAPING COMPLETE                          ║
╚══════════════════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• yahoo_messages_raw.parquet      - Raw message data (usernames hashed)
• daily_social_metrics.parquet    - Daily aggregated metrics
• social_burst_events.parquet     - Flagged social burst days
• social_distributions.png        - Distribution plots
• social_activity_{ticker}.png    - Example ticker activity
• notebook03_summary.json         - Summary statistics

KEY METRICS COMPUTED:
─────────────────────
• Daily message count per ticker
• Unique users per day
• User concentration (Gini coefficient)
• Promotional content share
• Rolling baselines (60-day mean, std)
• Message volume z-scores
• Social burst flags

SOCIAL BURST CRITERIA:
──────────────────────
• Message z-score > 3.0 (primary condition)
• Alternative flags: volume ratio > 3x, Gini > 0.5, promo share > 30%

NEXT STEPS:
───────────
→ Notebook 4: Episode Detection
  - Merge price-volume and social data
  - Identify joint events (price spike + social burst)
  - Define episode windows
  - Apply news filters

IMPORTANT NOTES:
────────────────
1. Usernames are hashed for privacy - do not attempt to de-anonymize
2. Scraping respects rate limits - be patient with large universes
3. Social data may have survivorship bias (deleted posts not captured)
4. Yahoo message boards have lower volume than Twitter/Reddit

""")

In [None]:
# =============================================================================
# ENVIRONMENT INFO FOR REPRODUCIBILITY
# =============================================================================

import sys
import platform

print("Environment Information:")
print(f"  Python: {sys.version}")
print(f"  Platform: {platform.platform()}")
print(f"  Pandas: {pd.__version__}")
print(f"  NumPy: {np.__version__}")
print(f"  Timestamp: {datetime.now().isoformat()}")