# Notebook 2: Yahoo Finance Market Data Collection
## Social Media-Driven Stock Manipulation and Tail Risk Research

---

**Research Project:** Social Media-Driven Stock Manipulation and Tail Risk

**Purpose:** Collect and process daily OHLCV (Open, High, Low, Close, Volume) data from Yahoo Finance for the stock universe. Compute baseline statistics for episode detection.

**Data Source:** Yahoo Finance via the yfinance library (no API key required)

**Output:** 
- Daily price-volume data for all universe tickers
- Rolling baseline statistics (mean, std, percentiles)
- Candidate price-volume spike events

---

**Last Updated:** 2025

## 1. Environment Setup

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================

!pip install yfinance==0.2.33
!pip install pandas==2.0.3
!pip install numpy==1.24.3
!pip install scipy==1.11.4
!pip install tqdm==4.66.1
!pip install pyarrow==14.0.1
!pip install matplotlib==3.8.2
!pip install seaborn==0.13.0

print("All packages installed successfully.")

In [None]:
# =============================================================================
# IMPORT LIBRARIES
# =============================================================================

import os
import json
import time
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple

import pandas as pd
import numpy as np
from scipy import stats
from tqdm.notebook import tqdm
import yfinance as yf

import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print(f"Environment setup complete. Timestamp: {datetime.now()}")

## 2. Configuration and Load Universe

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

class ResearchConfig:
    """Configuration for market data collection."""
    
    # Sample Period
    START_DATE = "2019-01-01"
    END_DATE = "2025-12-31"
    
    # Rolling Statistics Parameters
    ROLLING_WINDOW = 60  # days for baseline calculation
    MIN_PERIODS = 20     # minimum observations for rolling stats
    
    # Episode Detection Thresholds (used for candidate flags in Section 5; applied again in Notebook 4)
    RETURN_ZSCORE_THRESHOLD = 3.0
    VOLUME_PERCENTILE_THRESHOLD = 95
    PRICE_THRESHOLD = 10.0  # Max price for penny stock filter
    
    # Data Storage Paths
    BASE_PATH = "/content/drive/MyDrive/Research/PumpDump/"
    RAW_DATA_PATH = BASE_PATH + "data/raw/"
    PROCESSED_DATA_PATH = BASE_PATH + "data/processed/"
    
    # Scraping Parameters
    BATCH_SIZE = 20  # tickers per batch for yfinance
    SLEEP_BETWEEN_BATCHES = 2  # seconds

config = ResearchConfig()

# Handle Colab vs local
try:
    from google.colab import drive
    drive.mount('/content/drive')
    IN_COLAB = True
except ImportError:
    print("Not running in Colab - using local paths")
    IN_COLAB = False
    config.BASE_PATH = "./research_data/"
    config.RAW_DATA_PATH = config.BASE_PATH + "data/raw/"
    config.PROCESSED_DATA_PATH = config.BASE_PATH + "data/processed/"

os.makedirs(config.RAW_DATA_PATH, exist_ok=True)
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)

In [None]:
# =============================================================================
# LOAD STOCK UNIVERSE FROM NOTEBOOK 1
# =============================================================================

def load_universe(data_path: str) -> pd.DataFrame:
    """Load stock universe from Notebook 1 output."""
    
    universe_path = os.path.join(data_path, 'stock_universe.parquet')
    
    if os.path.exists(universe_path):
        universe = pd.read_parquet(universe_path)
        print(f"Loaded universe: {len(universe)} tickers")
    else:
        print("Universe file not found - creating sample universe")
        # Create sample universe for demonstration
        sample_tickers = [
            'GME', 'AMC', 'BB', 'NOK', 'BBBY', 'KOSS', 'CLOV', 'WISH',
            'PLTR', 'SPCE', 'TLRY', 'SNDL', 'LCID', 'RIVN', 'MULN',
            'FFIE', 'OCGN', 'NVAX', 'INO', 'ATER'
        ]
        universe = pd.DataFrame({
            'ticker': sample_tickers,
            'is_confirmed_manipulation': [False] * len(sample_tickers),
            'source': ['sample'] * len(sample_tickers)
        })
    
    return universe

universe_df = load_universe(config.PROCESSED_DATA_PATH)
print(f"\nUniverse composition:")
print(universe_df['source'].value_counts())

## 3. Yahoo Finance Data Collection

### 3.1 Price-Volume Data Scraper
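
The collector in this section splits the ticker list into fixed-size batches and pauses between requests. As a minimal sketch of the batching step alone, with no network access, the helper below yields consecutive slices of a list. The name `chunked` is hypothetical and is not part of yfinance or of this notebook.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Five tickers in batches of two: the last batch is shorter
batches = list(chunked(["GME", "AMC", "BB", "NOK", "KOSS"], size=2))
print(batches)
```

A generator keeps the batching logic separate from the download call, so the batch size can be changed without touching the request code.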

In [None]:
# =============================================================================
# YAHOO FINANCE DATA COLLECTOR
# =============================================================================

class YahooFinanceCollector:
    """Collects daily OHLCV data from Yahoo Finance.
    
    Uses yfinance library which scrapes Yahoo Finance without API key.
    Implements batching and rate limiting for stability.
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
        self.failed_tickers = []
        
    def get_single_ticker_data(self, ticker: str, 
                                start: str, end: str) -> Optional[pd.DataFrame]:
        """Get OHLCV data for a single ticker.
        
        Args:
            ticker: Stock ticker symbol
            start: Start date (YYYY-MM-DD)
            end: End date (YYYY-MM-DD); yfinance treats this bound as exclusive
            
        Returns:
            DataFrame with OHLCV data or None if failed
        """
        try:
            stock = yf.Ticker(ticker)
            df = stock.history(start=start, end=end, auto_adjust=True)
            
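            if len(df) == 0:
                # No rows returned: record the ticker so it shows up in the failure count
                self.failed_tickers.append(ticker)
                return None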
            
            # Clean up DataFrame
            df = df[['Open', 'High', 'Low', 'Close', 'Volume']].copy()
            df['ticker'] = ticker
            df = df.reset_index()
            df.columns = ['date', 'open', 'high', 'low', 'close', 'volume', 'ticker']
            df['date'] = pd.to_datetime(df['date']).dt.date
            
            return df
            
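        except Exception as e:
            # Record the failure instead of discarding it silently
            print(f"Download failed for {ticker}: {e}")
            self.failed_tickers.append(ticker)
            return None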
    
    def get_batch_data(self, tickers: List[str], 
                       start: str, end: str) -> pd.DataFrame:
        """Get OHLCV data for multiple tickers using yfinance batch download.
        
        Args:
            tickers: List of ticker symbols
            start: Start date
            end: End date
            
        Returns:
            Combined DataFrame with all ticker data
        """
        try:
            # Download batch
            data = yf.download(
                tickers=tickers,
                start=start,
                end=end,
                auto_adjust=True,
                progress=False,
                threads=True
            )
            
            if len(data) == 0:
                return pd.DataFrame()
            
            # Reshape from multi-index columns to long format
            records = []
            
            # Handle single vs multiple ticker response
            if isinstance(data.columns, pd.MultiIndex):
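                for ticker in tickers:
                    try:
                        ticker_data = data.xs(ticker, axis=1, level=1)
                    except KeyError:
                        self.failed_tickers.append(ticker)
                        continue
                    # Batch downloads share one date index, so a ticker has all-NaN
                    # rows on dates where it has no data (e.g. before listing).
                    # Drop those rows and treat an empty result as a failure.
                    ticker_data = ticker_data.dropna(how='all')
                    if ticker_data.empty:
                        self.failed_tickers.append(ticker)
                        continue
                    ticker_data = ticker_data.reset_index()
                    ticker_data['ticker'] = ticker
                    records.append(ticker_data)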
            else:
                # Single ticker case
                data = data.reset_index()
                data['ticker'] = tickers[0]
                records.append(data)
            
            if not records:
                return pd.DataFrame()
            
            df = pd.concat(records, ignore_index=True)
            
            # Standardize column names
            df.columns = df.columns.str.lower()
            if 'date' not in df.columns and 'index' in df.columns:
                df = df.rename(columns={'index': 'date'})
            
            df['date'] = pd.to_datetime(df['date']).dt.date
            
            return df
            
        except Exception as e:
            print(f"Batch download error: {e}")
            return pd.DataFrame()
    
    def collect_all_data(self, tickers: List[str]) -> pd.DataFrame:
        """Collect data for all tickers with batching and rate limiting.
        
        Args:
            tickers: List of all ticker symbols
            
        Returns:
            Combined DataFrame with all price-volume data
        """
        print(f"Collecting data for {len(tickers)} tickers")
        print(f"Period: {self.config.START_DATE} to {self.config.END_DATE}")
        print(f"Batch size: {self.config.BATCH_SIZE}")
        
        all_data = []
        
        # Split into batches
        batches = [tickers[i:i+self.config.BATCH_SIZE] 
                   for i in range(0, len(tickers), self.config.BATCH_SIZE)]
        
        for batch in tqdm(batches, desc="Downloading batches"):
            batch_data = self.get_batch_data(
                tickers=batch,
                start=self.config.START_DATE,
                end=self.config.END_DATE
            )
            
            if len(batch_data) > 0:
                all_data.append(batch_data)
            
            # Rate limiting
            time.sleep(self.config.SLEEP_BETWEEN_BATCHES)
        
        if not all_data:
            print("No data collected!")
            return pd.DataFrame()
        
        combined = pd.concat(all_data, ignore_index=True)
        
        # Remove duplicates
        combined = combined.drop_duplicates(subset=['ticker', 'date'])
        
        # Sort
        combined = combined.sort_values(['ticker', 'date']).reset_index(drop=True)
        
        print(f"\nCollection complete:")
        print(f"  Total records: {len(combined):,}")
        print(f"  Unique tickers: {combined['ticker'].nunique()}")
        print(f"  Date range: {combined['date'].min()} to {combined['date'].max()}")
        print(f"  Failed tickers: {len(self.failed_tickers)}")
        
        return combined


# Initialize collector
yf_collector = YahooFinanceCollector(config)
print("Yahoo Finance Collector initialized")

In [None]:
# =============================================================================
# EXECUTE DATA COLLECTION
# =============================================================================

# Get list of tickers from universe
tickers = universe_df['ticker'].unique().tolist()

print(f"Collecting market data for {len(tickers)} tickers...")
print("This may take several minutes depending on universe size.")
print("="*60)

# Collect data
price_data = yf_collector.collect_all_data(tickers)

# Display sample
print("\nSample of collected data:")
print(price_data.head(10))
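
The batch download returns a wide frame whose columns are (field, ticker) pairs, and the collector reshapes it into one row per ticker and date. The sketch below reproduces that reshape on a small synthetic frame so it can be run without network access. The function name `wide_to_long` and the tickers `AAA` and `ZZZ` are made up for illustration, and the column layout is an assumption about the shape of a multi-ticker download.

```python
import numpy as np
import pandas as pd


def wide_to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape (field, ticker) columns into one row per ticker and date."""
    frames = []
    for ticker in wide.columns.get_level_values(1).unique():
        # Select this ticker's fields and drop dates where it has no data at all
        part = wide.xs(ticker, axis=1, level=1).dropna(how="all")
        if part.empty:
            continue
        part = part.reset_index()
        part["ticker"] = ticker
        frames.append(part)
    if not frames:
        return pd.DataFrame()
    long = pd.concat(frames, ignore_index=True)
    long.columns = [str(c).lower() for c in long.columns]
    return long


# Synthetic stand-in for a two-ticker download; "ZZZ" has no data
dates = pd.date_range("2024-01-02", periods=3, freq="B", name="Date")
columns = pd.MultiIndex.from_product([["Close", "Volume"], ["AAA", "ZZZ"]])
wide = pd.DataFrame(np.nan, index=dates, columns=columns)
wide[("Close", "AAA")] = [1.0, 1.1, 1.2]
wide[("Volume", "AAA")] = [100.0, 200.0, 300.0]

long_df = wide_to_long(wide)
print(long_df)
```

Dropping all-NaN rows per ticker keeps placeholder rows out of the long table, which matters later because rolling windows count rows rather than calendar days.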

## 4. Compute Baseline Statistics

### 4.1 Rolling Statistics for Each Stock
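
The calculator in this section stores both simple and log returns and builds its 5-day and 20-day features by summing simple returns. The toy example below, on a made-up three-day price path, shows how that sum differs from the compounded return and why log returns add up exactly. It is an illustration only and does not change the pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical price path: up 10%, then down 10%
close = pd.Series([100.0, 110.0, 99.0])

simple = close.pct_change().dropna()
log_ret = np.log(close / close.shift(1)).dropna()

summed_simple = simple.sum()               # what a rolling sum of simple returns gives
compounded = (1.0 + simple).prod() - 1.0   # true holding-period return
from_logs = np.expm1(log_ret.sum())        # log returns add, then convert back

print(f"sum of simple returns: {summed_simple:.4f}")
print(f"compounded return:     {compounded:.4f}")
print(f"from summed logs:      {from_logs:.4f}")
```

For small daily moves the sum is a close approximation, but around large spikes the gap grows, which is worth keeping in mind when reading `return_5d` and `return_20d`.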

In [None]:
# =============================================================================
# BASELINE STATISTICS CALCULATOR
# =============================================================================

class BaselineCalculator:
    """Computes rolling baseline statistics for episode detection.
    
    For each stock on each day, computes:
    - Rolling mean and std of returns (60-day)
    - Rolling percentiles of volume (60-day)
    - Z-scores for anomaly detection
    """
    
    def __init__(self, window: int = 60, min_periods: int = 20):
        self.window = window
        self.min_periods = min_periods
        
    def compute_returns(self, df: pd.DataFrame) -> pd.DataFrame:
        """Compute daily returns for each ticker."""
        df = df.copy()
        df = df.sort_values(['ticker', 'date'])
        
        # Simple return
        df['return'] = df.groupby('ticker')['close'].pct_change()
        
        # Log return (additive over time and closer to normally distributed)
        df['log_return'] = np.log(df['close'] / df.groupby('ticker')['close'].shift(1))
        
        return df
    
    def compute_rolling_stats(self, df: pd.DataFrame) -> pd.DataFrame:
        """Compute rolling statistics for returns and volume."""
        df = df.copy()
        
        print("Computing rolling statistics...")
        
        # Group by ticker and compute rolling stats
        for col in tqdm(['return', 'volume'], desc="Computing stats"):
            # Rolling mean
            df[f'{col}_mean_{self.window}d'] = df.groupby('ticker')[col].transform(
                lambda x: x.rolling(window=self.window, min_periods=self.min_periods).mean()
            )
            
            # Rolling std
            df[f'{col}_std_{self.window}d'] = df.groupby('ticker')[col].transform(
                lambda x: x.rolling(window=self.window, min_periods=self.min_periods).std()
            )
            
            # Rolling median (more robust)
            df[f'{col}_median_{self.window}d'] = df.groupby('ticker')[col].transform(
                lambda x: x.rolling(window=self.window, min_periods=self.min_periods).median()
            )
        
        # Volume percentiles
        for pct in [90, 95, 99]:
            df[f'volume_pct{pct}_{self.window}d'] = df.groupby('ticker')['volume'].transform(
                lambda x: x.rolling(window=self.window, min_periods=self.min_periods).quantile(pct/100)
            )
        
        return df
    
    def compute_zscores(self, df: pd.DataFrame) -> pd.DataFrame:
        """Compute z-scores for anomaly detection."""
        df = df.copy()
        
        # Return z-score
        df['return_zscore'] = (
            (df['return'] - df[f'return_mean_{self.window}d']) / 
            df[f'return_std_{self.window}d']
        )
        
        # Volume z-score
        df['volume_zscore'] = (
            (df['volume'] - df[f'volume_mean_{self.window}d']) / 
            df[f'volume_std_{self.window}d']
        )
        
        # Volume as ratio to median (more interpretable)
        df['volume_ratio'] = df['volume'] / df[f'volume_median_{self.window}d']
        
        # Volume relative to the rolling mean (not share turnover, which needs shares outstanding)
        df['turnover_ratio'] = df['volume'] / df[f'volume_mean_{self.window}d']
        
        return df
    
    def compute_price_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Compute additional price-based features."""
        df = df.copy()
        
        # Intraday range
        df['intraday_range'] = (df['high'] - df['low']) / df['close']
        
        # Gap (open vs previous close)
        df['gap'] = (df['open'] - df.groupby('ticker')['close'].shift(1)) / df.groupby('ticker')['close'].shift(1)
        
        # Close position in range
        df['close_position'] = (df['close'] - df['low']) / (df['high'] - df['low'] + 1e-10)
        
        # Multi-day returns, approximated as the sum of simple daily returns (not compounded)
        df['return_5d'] = df.groupby('ticker')['return'].transform(
            lambda x: x.rolling(5, min_periods=1).sum()
        )
        df['return_20d'] = df.groupby('ticker')['return'].transform(
            lambda x: x.rolling(20, min_periods=1).sum()
        )
        
        return df
    
    def process_all(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run complete baseline calculation pipeline."""
        print("="*60)
        print("COMPUTING BASELINE STATISTICS")
        print("="*60)
        
        # Step 1: Returns
        df = self.compute_returns(df)
        print(f"Returns computed: {df['return'].notna().sum():,} observations")
        
        # Step 2: Rolling statistics
        df = self.compute_rolling_stats(df)
        
        # Step 3: Z-scores
        df = self.compute_zscores(df)
        print("Z-scores computed")
        
        # Step 4: Price features
        df = self.compute_price_features(df)
        print("Price features computed")
        
        print("\nBaseline computation complete")
        print(f"Total columns: {len(df.columns)}")
        
        return df


# Initialize calculator
baseline_calc = BaselineCalculator(
    window=config.ROLLING_WINDOW, 
    min_periods=config.MIN_PERIODS
)
print("Baseline Calculator initialized")

In [None]:
# =============================================================================
# COMPUTE BASELINES
# =============================================================================

# Process data
if len(price_data) > 0:
    market_data = baseline_calc.process_all(price_data)
    
    print("\nData Summary:")
    print(market_data[['return', 'return_zscore', 'volume', 'volume_zscore', 'volume_ratio']].describe())
else:
    print("No price data to process")
    market_data = pd.DataFrame()
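
The rolling statistics above include the current day in its own window, so an extreme return raises the mean and standard deviation it is compared against. With a 60-day window this caps the z-score at 59/sqrt(60), roughly 7.6. One possible variant, not used in this notebook, shifts the baseline by one day so each observation is compared with the preceding window only. The sketch below contrasts the two on a simulated single series; `rolling_zscores` is a hypothetical helper, and a per-ticker version would need the same `groupby('ticker')` treatment as the calculator.

```python
import numpy as np
import pandas as pd


def rolling_zscores(returns: pd.Series, window: int = 60, min_periods: int = 20):
    """Return (inclusive, lagged) z-scores for a single return series."""
    roll = returns.rolling(window=window, min_periods=min_periods)
    # Inclusive: today's return is part of its own baseline
    inclusive = (returns - roll.mean()) / roll.std()
    # Lagged: baseline ends on the previous day
    lagged = (returns - roll.mean().shift(1)) / roll.std().shift(1)
    return inclusive, lagged


# Simulated quiet series with one large spike on the last day
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, size=100))
returns.iloc[-1] = 0.20

inclusive, lagged = rolling_zscores(returns)
print(f"inclusive z-score on spike day: {inclusive.iloc[-1]:.2f}")
print(f"lagged z-score on spike day:    {lagged.iloc[-1]:.2f}")
```

Either definition can be defended. The point is that a fixed threshold such as 3.0 is stricter under the inclusive definition than it looks.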

## 5. Identify Candidate Price-Volume Events

### 5.1 Flag Days with Extreme Price-Volume Activity
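
The detector in this section combines three boolean conditions with a logical AND. A small hand-built frame makes the behavior easy to check, including the fact that comparisons against NaN evaluate to False, so warm-up rows without a baseline are never flagged. The values below are invented for illustration; the column names follow the ones produced earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Four hypothetical ticker-days
days = pd.DataFrame({
    "return_zscore":    [4.0, 4.0, 1.0, np.nan],
    "volume":           [1000.0, 1000.0, 1000.0, 1000.0],
    "volume_pct95_60d": [500.0, 500.0, 500.0, np.nan],
    "prev_close":       [2.5, 25.0, 2.5, np.nan],
})

cond_return = days["return_zscore"] > 3.0
cond_volume = days["volume"] > days["volume_pct95_60d"]
cond_price = days["prev_close"] < 10.0

days["is_candidate_event"] = cond_return & cond_volume & cond_price
print(days)
```

Only the first row passes all three tests. The second fails the price filter, the third fails the return test, and the fourth has no baseline yet.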

In [None]:
# =============================================================================
# CANDIDATE EVENT DETECTOR
# =============================================================================

class CandidateEventDetector:
    """Identifies candidate pump-and-dump events based on price-volume anomalies.
    
    A day is flagged as a candidate if ALL conditions hold:
    1. Return z-score > threshold (extreme positive return)
    2. Volume > rolling percentile of its own history (default: 95th percentile, 60-day window)
    3. Previous close < price threshold (default: $10, penny stock filter)
    """
    
    def __init__(self, 
                 return_threshold: float = 3.0,
                 volume_percentile: int = 95,
                 price_threshold: float = 10.0):
        self.return_threshold = return_threshold
        self.volume_percentile = volume_percentile
        self.price_threshold = price_threshold
        
    def flag_candidates(self, df: pd.DataFrame, window: int = 60) -> pd.DataFrame:
        """Flag candidate price-volume spike days.
        
        Args:
            df: DataFrame with baseline statistics
            window: Rolling window used for baselines
            
        Returns:
            DataFrame with candidate flags
        """
        df = df.copy()
        
        # Volume threshold column name
        vol_col = f'volume_pct{self.volume_percentile}_{window}d'
        
        # Previous day close (for price filter)
        df['prev_close'] = df.groupby('ticker')['close'].shift(1)
        
        # Condition 1: Extreme positive return
        cond_return = df['return_zscore'] > self.return_threshold
        
        # Condition 2: Extreme volume
        if vol_col in df.columns:
            cond_volume = df['volume'] > df[vol_col]
        else:
            # Fallback to z-score based
            cond_volume = df['volume_zscore'] > 2.0
        
        # Condition 3: Penny stock price
        cond_price = df['prev_close'] < self.price_threshold
        
        # Combined candidate flag
        df['is_price_spike'] = cond_return
        df['is_volume_spike'] = cond_volume
        df['is_penny_stock'] = cond_price
        df['is_candidate_event'] = cond_return & cond_volume & cond_price
        
        # Also flag without penny stock filter (for robustness)
        df['is_candidate_any_price'] = cond_return & cond_volume
        
        return df
    
    def summarize_candidates(self, df: pd.DataFrame) -> Dict:
        """Generate summary statistics for candidate events."""
        summary = {
            'total_days': len(df),
            'unique_tickers': df['ticker'].nunique(),
            'price_spikes': int(df['is_price_spike'].sum()),
            'volume_spikes': int(df['is_volume_spike'].sum()),
            'penny_stock_days': int(df['is_penny_stock'].sum()),
            'candidate_events': int(df['is_candidate_event'].sum()),
            'candidate_any_price': int(df['is_candidate_any_price'].sum()),
            'tickers_with_candidates': int(df[df['is_candidate_event']]['ticker'].nunique())
        }
        
        # Candidates by ticker
        ticker_counts = df[df['is_candidate_event']].groupby('ticker').size()
        summary['top_candidate_tickers'] = ticker_counts.nlargest(10).to_dict()
        
        return summary
    
    def get_candidate_events(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract DataFrame of candidate events only."""
        candidates = df[df['is_candidate_event']].copy()
        candidates = candidates.sort_values(['ticker', 'date'])
        return candidates


# Initialize detector
event_detector = CandidateEventDetector(
    return_threshold=config.RETURN_ZSCORE_THRESHOLD,
    volume_percentile=config.VOLUME_PERCENTILE_THRESHOLD,
    price_threshold=config.PRICE_THRESHOLD
)
print("Candidate Event Detector initialized")

In [None]:
# =============================================================================
# FLAG CANDIDATE EVENTS
# =============================================================================

if len(market_data) > 0:
    # Flag candidates
    market_data = event_detector.flag_candidates(market_data, window=config.ROLLING_WINDOW)
    
    # Summarize
    summary = event_detector.summarize_candidates(market_data)
    
    print("\n" + "="*60)
    print("CANDIDATE EVENT DETECTION RESULTS")
    print("="*60)
    print(f"Total trading days: {summary['total_days']:,}")
    print(f"Unique tickers: {summary['unique_tickers']}")
    print(f"\nSpike Detection:")
    print(f"  Price spikes (z > {config.RETURN_ZSCORE_THRESHOLD}): {summary['price_spikes']:,}")
    print(f"  Volume spikes (> {config.VOLUME_PERCENTILE_THRESHOLD}th pct): {summary['volume_spikes']:,}")
    print(f"  Penny stock days (< ${config.PRICE_THRESHOLD}): {summary['penny_stock_days']:,}")
    print(f"\nCandidate Events (joint):")
    print(f"  With penny stock filter: {summary['candidate_events']:,}")
    print(f"  Without price filter: {summary['candidate_any_price']:,}")
    print(f"  Unique tickers with candidates: {summary['tickers_with_candidates']}")
    print(f"\nTop tickers by candidate count:")
    for ticker, count in list(summary['top_candidate_tickers'].items())[:10]:
        print(f"    {ticker}: {count}")
else:
    print("No market data to analyze")

## 6. Visualizations

In [None]:
# =============================================================================
# VISUALIZATIONS
# =============================================================================

def plot_candidate_distribution(df: pd.DataFrame):
    """Plot return z-score, volume ratio, monthly candidate counts, and candidate prices."""
    if len(df) == 0:
        print("No data to plot")
        return
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Return z-score distribution
    ax1 = axes[0, 0]
    data = df['return_zscore'].dropna()
    data = data[data.between(-10, 10)]  # Drop values outside [-10, 10] for visualization
    ax1.hist(data, bins=100, edgecolor='black', alpha=0.7)
    ax1.axvline(x=config.RETURN_ZSCORE_THRESHOLD, color='red', linestyle='--', label=f'Threshold ({config.RETURN_ZSCORE_THRESHOLD})')
    ax1.set_xlabel('Return Z-Score')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Distribution of Return Z-Scores')
    ax1.legend()
    
    # 2. Volume ratio distribution
    ax2 = axes[0, 1]
    data = df['volume_ratio'].dropna()
    data = data[data.between(0, 20)]  # Drop values outside [0, 20] for visualization
    ax2.hist(data, bins=100, edgecolor='black', alpha=0.7, color='orange')
    ax2.axvline(x=2.0, color='red', linestyle='--', label='2x Median')
    ax2.set_xlabel('Volume / Median Volume')
    ax2.set_ylabel('Frequency')
    ax2.set_title('Distribution of Volume Ratios')
    ax2.legend()
    
    # 3. Candidate events over time
    ax3 = axes[1, 0]
    df['date'] = pd.to_datetime(df['date'])
    monthly_candidates = df.groupby(df['date'].dt.to_period('M'))['is_candidate_event'].sum()
    monthly_candidates.plot(ax=ax3, kind='bar', color='green', alpha=0.7)
    ax3.set_xlabel('Month')
    ax3.set_ylabel('Candidate Events')
    ax3.set_title('Candidate Events Over Time')
    ax3.tick_params(axis='x', rotation=45)
    # Show only every 6th label
    for i, label in enumerate(ax3.xaxis.get_ticklabels()):
        if i % 6 != 0:
            label.set_visible(False)
    
    # 4. Price at candidate events
    ax4 = axes[1, 1]
    candidates = df[df['is_candidate_event']]
    if len(candidates) > 0:
        ax4.hist(candidates['prev_close'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
        ax4.set_xlabel('Price ($)')
        ax4.set_ylabel('Frequency')
        ax4.set_title('Price Distribution at Candidate Events')
    else:
        ax4.text(0.5, 0.5, 'No candidate events', ha='center', va='center', transform=ax4.transAxes)
    
    plt.tight_layout()
    plt.savefig(os.path.join(config.PROCESSED_DATA_PATH, 'candidate_distributions.png'), dpi=150)
    plt.show()


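def plot_example_spike(df: pd.DataFrame, ticker: Optional[str] = None) -> Optional[str]:
    """Plot price, volume, and return z-score for one ticker, marking candidate events.

    Returns the plotted ticker, or None if there was nothing to plot.
    """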
    if len(df) == 0:
        print("No data to plot")
        return
    
    # Find a ticker with candidate events
    if ticker is None:
        candidates = df[df['is_candidate_event']]
        if len(candidates) == 0:
            print("No candidate events to plot")
            return
        ticker = candidates['ticker'].value_counts().index[0]
    
    ticker_data = df[df['ticker'] == ticker].copy()
    ticker_data = ticker_data.sort_values('date')
    
    fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)
    
    # Price
    ax1 = axes[0]
    ax1.plot(ticker_data['date'], ticker_data['close'], color='blue', linewidth=1)
    candidate_days = ticker_data[ticker_data['is_candidate_event']]
    ax1.scatter(candidate_days['date'], candidate_days['close'], color='red', s=100, marker='^', label='Candidate Event', zorder=5)
    ax1.set_ylabel('Close Price ($)')
    ax1.set_title(f'{ticker} - Price with Candidate Events')
    ax1.legend()
    
    # Volume
    ax2 = axes[1]
    ax2.bar(ticker_data['date'], ticker_data['volume'], color='gray', alpha=0.5, width=1)
    ax2.scatter(candidate_days['date'], candidate_days['volume'], color='red', s=100, marker='^', zorder=5)
    ax2.set_ylabel('Volume')
    ax2.set_title(f'{ticker} - Volume')
    
    # Return Z-Score
    ax3 = axes[2]
    ax3.plot(ticker_data['date'], ticker_data['return_zscore'], color='green', linewidth=1)
    ax3.axhline(y=config.RETURN_ZSCORE_THRESHOLD, color='red', linestyle='--', label=f'Threshold ({config.RETURN_ZSCORE_THRESHOLD})')
    ax3.axhline(y=-config.RETURN_ZSCORE_THRESHOLD, color='red', linestyle='--')
    ax3.scatter(candidate_days['date'], candidate_days['return_zscore'], color='red', s=100, marker='^', zorder=5)
    ax3.set_ylabel('Return Z-Score')
    ax3.set_xlabel('Date')
    ax3.set_title(f'{ticker} - Return Z-Score')
    ax3.legend()
    
    plt.tight_layout()
    plt.savefig(os.path.join(config.PROCESSED_DATA_PATH, f'example_spike_{ticker}.png'), dpi=150)
    plt.show()
    
    return ticker


# Generate visualizations
if len(market_data) > 0:
    print("Generating visualizations...")
    plot_candidate_distribution(market_data)
    example_ticker = plot_example_spike(market_data)
    print(f"\nExample ticker plotted: {example_ticker}")
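
The monthly panel above counts candidate events per calendar month, and the plotting function gets there by converting the `date` column of the frame it receives. One way to compute the same counts without touching the input, shown here as a sketch on invented dates, is to build the month key as a separate Series and group by it.

```python
from datetime import date

import pandas as pd

# Hypothetical flags for three trading days
flags = pd.DataFrame({
    "date": [date(2021, 1, 27), date(2021, 1, 28), date(2021, 2, 2)],
    "is_candidate_event": [True, True, False],
})

# Month key lives outside the frame, so `flags` keeps its original dtypes
months = pd.to_datetime(flags["date"]).dt.to_period("M")
monthly = flags.groupby(months)["is_candidate_event"].sum()
print(monthly)
```

Whether the in-place conversion matters depends on what later cells expect from the `date` column, so this is an option rather than a correction.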

## 7. Save Outputs

In [None]:
# =============================================================================
# SAVE OUTPUTS
# =============================================================================

def save_market_data(df: pd.DataFrame, output_dir: str):
    """Save market data with baseline statistics."""
    os.makedirs(output_dir, exist_ok=True)
    
    # Save full dataset
    full_path = os.path.join(output_dir, 'market_data_with_baselines.parquet')
    df.to_parquet(full_path, index=False)
    print(f"Saved full market data: {full_path}")
    
    # Save candidate events only
    candidates = df[df['is_candidate_event']].copy()
    candidates_path = os.path.join(output_dir, 'candidate_price_volume_events.parquet')
    candidates.to_parquet(candidates_path, index=False)
    print(f"Saved candidate events: {candidates_path}")
    
    # Save summary
    summary = {
        'total_observations': len(df),
        'unique_tickers': int(df['ticker'].nunique()),
        'date_range': [str(df['date'].min()), str(df['date'].max())],
        'candidate_events': int(df['is_candidate_event'].sum()),
        'tickers_with_candidates': int(df[df['is_candidate_event']]['ticker'].nunique()),
        'thresholds': {
            'return_zscore': config.RETURN_ZSCORE_THRESHOLD,
            'volume_percentile': config.VOLUME_PERCENTILE_THRESHOLD,
            'price': config.PRICE_THRESHOLD,
            'rolling_window': config.ROLLING_WINDOW
        },
        'created_at': datetime.now().isoformat()
    }
    
    summary_path = os.path.join(output_dir, 'notebook02_summary.json')
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2)
    print(f"Saved summary: {summary_path}")
    
    return summary


# Save outputs
if len(market_data) > 0:
    output_summary = save_market_data(market_data, config.PROCESSED_DATA_PATH)
    print("\n" + "="*60)
    print("Output Summary:")
    print(json.dumps(output_summary, indent=2))
else:
    print("No data to save")
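
The save step casts its counts with `int(...)` before writing the JSON summary. The reason is that the standard-library encoder rejects NumPy integer scalars such as the result of a boolean column sum. The sketch below shows the failure and an alternative that uses the `default=` hook of `json.dumps`; `to_builtin` is a hypothetical helper name and the values are invented.

```python
import json

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars to plain Python numbers for json.dumps."""
    if isinstance(value, np.generic):
        return value.item()
    raise TypeError(f"Not JSON serializable: {type(value)!r}")


summary = {"candidate_events": np.int64(3), "mean_zscore": np.float32(3.5)}

# Without a converter the encoder raises TypeError on the NumPy integer
try:
    json.dumps(summary)
    failed = False
except TypeError:
    failed = True

encoded = json.dumps(summary, default=to_builtin)
print(failed, encoded)
```

Explicit casts, as used in the notebook, are simpler for a handful of fields. A `default=` hook scales better if the summary grows.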

## 8. Summary and Next Steps

In [None]:
# =============================================================================
# NOTEBOOK 2 SUMMARY
# =============================================================================

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║          NOTEBOOK 2: MARKET DATA COLLECTION COMPLETE                         ║
╚══════════════════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• market_data_with_baselines.parquet   - Full OHLCV data with rolling stats
• candidate_price_volume_events.parquet - Flagged candidate events
• candidate_distributions.png           - Distribution plots
• example_spike_{ticker}.png            - Example spike visualization
• notebook02_summary.json               - Summary statistics

KEY FEATURES COMPUTED:
──────────────────────
• Daily returns (simple and log)
• 60-day rolling mean/std/median for returns and volume
• Return z-scores
• Volume z-scores and ratios
• Volume percentiles (90th, 95th, 99th)
• Price features (intraday range, gap, close position, 5-/20-day returns)
• Candidate event flags

CANDIDATE EVENT CRITERIA:
─────────────────────────
1. Return z-score > 3.0 (extreme positive return)
2. Volume > 95th percentile of 60-day history
3. Previous close < $10 (penny stock filter)

NEXT STEPS:
───────────
→ Notebook 3: Yahoo Message Board Scraping
  - Scrape social media discussion data
  - Compute message volume baselines
  - Identify social media bursts

NOTE: These candidate events are based on PRICE-VOLUME only.
Final episode detection will require joint price-volume AND social conditions.

""")

In [None]:
# =============================================================================
# ENVIRONMENT INFO FOR REPRODUCIBILITY
# =============================================================================

import sys
import platform

print("Environment Information:")
print(f"  Python: {sys.version}")
print(f"  Platform: {platform.platform()}")
print(f"  Pandas: {pd.__version__}")
print(f"  NumPy: {np.__version__}")
print(f"  yfinance: {yf.__version__}")
print(f"  Timestamp: {datetime.now().isoformat()}")