# Notebook 4: Episode Detection
## Social Media-Driven Stock Manipulation and Tail Risk Research

---

**Research Project:** Social Media-Driven Stock Manipulation and Tail Risk

**Purpose:** Identify pump-and-dump episodes by combining price-volume anomalies with social media bursts. Apply filters to reduce false positives.

**Inputs:**
- Market data with baselines (Notebook 2)
- Social media metrics (Notebook 3)
- SEC enforcement labels (Notebook 1)

**Output:**
- Episode-level dataset with event windows
- Ground truth labels
- Placebo test results

---

**Last Updated:** 2025

## 1. Environment Setup

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================

!pip install pandas==2.0.3
!pip install numpy==1.24.3
!pip install scipy==1.11.4
!pip install tqdm==4.66.1
!pip install pyarrow==14.0.1
!pip install matplotlib==3.8.2
!pip install seaborn==0.13.0

print("All packages installed successfully.")

In [None]:
# =============================================================================
# IMPORT LIBRARIES
# =============================================================================

import os
import json
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple

import pandas as pd
import numpy as np
from scipy import stats
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print(f"Environment setup complete. Timestamp: {datetime.now()}")

## 2. Configuration and Load Data

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

class ResearchConfig:
    """Configuration for episode detection."""
    
    # Episode Detection Thresholds
    RETURN_ZSCORE_THRESHOLD = 3.0
    SOCIAL_ZSCORE_THRESHOLD = 3.0
    SOCIAL_WINDOW = 1  # days before/after for social burst matching
    
    # Episode Windows (trading days)
    PRE_WINDOW = 20    # days before event
    POST_SHORT = 5     # immediate post-event
    POST_MEDIUM = 20   # medium-term post-event
    POST_LONG = 60     # long-term post-event
    
    # Filtering
    MIN_CLUSTER_GAP = 5  # days between distinct episodes
    
    # Data Paths
    BASE_PATH = "/content/drive/MyDrive/Research/PumpDump/"
    PROCESSED_DATA_PATH = BASE_PATH + "data/processed/"
    RESULTS_PATH = BASE_PATH + "results/"

config = ResearchConfig()

# Handle Colab vs local
try:
    from google.colab import drive
    drive.mount('/content/drive')
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    config.BASE_PATH = "./research_data/"
    config.PROCESSED_DATA_PATH = config.BASE_PATH + "data/processed/"
    config.RESULTS_PATH = config.BASE_PATH + "results/"

os.makedirs(config.RESULTS_PATH, exist_ok=True)

In [None]:
# =============================================================================
# LOAD DATA FROM PREVIOUS NOTEBOOKS
# =============================================================================

def load_data(data_path: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Load data from previous notebooks."""
    
    # Market data with baselines (Notebook 2)
    market_path = os.path.join(data_path, 'market_data_with_baselines.parquet')
    if os.path.exists(market_path):
        market_data = pd.read_parquet(market_path)
        print(f"Loaded market data: {len(market_data):,} rows")
    else:
        print("Market data not found - creating sample")
        market_data = create_sample_market_data()
    
    # Social data (Notebook 3)
    social_path = os.path.join(data_path, 'daily_social_metrics.parquet')
    if os.path.exists(social_path):
        social_data = pd.read_parquet(social_path)
        print(f"Loaded social data: {len(social_data):,} rows")
    else:
        print("Social data not found - creating sample")
        social_data = create_sample_social_data()
    
    # SEC labels (Notebook 1)
    labels_path = os.path.join(data_path, 'ticker_manipulation_labels.parquet')
    if os.path.exists(labels_path):
        labels = pd.read_parquet(labels_path)
        print(f"Loaded labels: {len(labels)} tickers")
    else:
        print("Labels not found - using empty DataFrame")
        labels = pd.DataFrame(columns=['ticker', 'enforcement_date', 'label'])
    
    return market_data, social_data, labels


def create_sample_market_data() -> pd.DataFrame:
    """Create sample market data for demonstration."""
    np.random.seed(42)
    tickers = ['GME', 'AMC', 'BB', 'NOK', 'CLOV', 'WISH', 'MULN', 'FFIE']
    dates = pd.date_range('2020-01-01', '2023-12-31', freq='B')
    
    records = []
    for ticker in tickers:
        for date in dates:
            close = np.random.lognormal(1, 0.5)
            volume = np.random.lognormal(15, 1)
            ret = np.random.normal(0, 0.03)
            
            # Add some spikes
            if np.random.random() < 0.01:
                ret = np.random.uniform(0.1, 0.5)
                volume *= np.random.uniform(5, 20)
            
            records.append({
                'ticker': ticker,
                'date': date.date(),
                'close': close,
                'volume': volume,
                'return': ret,
                'return_zscore': ret / 0.03,
                'volume_zscore': (volume - 1000000) / 500000,
                'is_candidate_event': ret > 0.1 and volume > 5000000,
                'prev_close': close / (1 + ret)  # invert ret = close / prev_close - 1
            })
    
    return pd.DataFrame(records)


def create_sample_social_data() -> pd.DataFrame:
    """Create sample social data for demonstration."""
    np.random.seed(42)
    tickers = ['GME', 'AMC', 'BB', 'NOK', 'CLOV', 'WISH', 'MULN', 'FFIE']
    dates = pd.date_range('2020-01-01', '2023-12-31', freq='B')
    
    records = []
    for ticker in tickers:
        for date in dates:
            msg_count = np.random.poisson(10)
            
            # Add some bursts
            if np.random.random() < 0.02:
                msg_count *= np.random.randint(5, 20)
            
            records.append({
                'ticker': ticker,
                'date': date.date(),
                'msg_count': msg_count,
                'msg_zscore': (msg_count - 10) / 5,
                'promo_share': np.random.uniform(0, 0.5),
                'user_concentration': np.random.uniform(0, 1),
                'is_social_burst': msg_count > 50
            })
    
    return pd.DataFrame(records)


# Load data
market_data, social_data, sec_labels = load_data(config.PROCESSED_DATA_PATH)

## 3. Merge Price-Volume and Social Data

In [None]:
# =============================================================================
# DATA MERGER
# =============================================================================

class DataMerger:
    """Merges market and social data for joint event detection."""
    
    def __init__(self, config: ResearchConfig):
        self.config = config
    
    def merge_market_social(self, market_df: pd.DataFrame, 
                            social_df: pd.DataFrame) -> pd.DataFrame:
        """Merge market and social data on ticker and date."""
        
        # Ensure date columns are the same type
        market_df = market_df.copy()
        social_df = social_df.copy()
        
        market_df['date'] = pd.to_datetime(market_df['date']).dt.date
        social_df['date'] = pd.to_datetime(social_df['date']).dt.date
        
        # Select columns to merge
        social_cols = ['ticker', 'date', 'msg_count', 'msg_zscore', 
                       'promo_share', 'user_concentration', 'is_social_burst']
        social_cols = [c for c in social_cols if c in social_df.columns]
        
        # Merge
        merged = market_df.merge(
            social_df[social_cols],
            on=['ticker', 'date'],
            how='left'
        )
        
        # Fill missing social data (no messages = 0)
        merged['msg_count'] = merged['msg_count'].fillna(0)
        merged['msg_zscore'] = merged['msg_zscore'].fillna(0)
        # Cast back to bool: the left merge leaves an object column when rows are unmatched
        merged['is_social_burst'] = merged['is_social_burst'].fillna(False).astype(bool)
        
        print(f"Merged data: {len(merged):,} rows")
        print(f"Rows with social data: {(merged['msg_count'] > 0).sum():,}")
        
        return merged
    
    def add_social_burst_window(self, df: pd.DataFrame, 
                                 window: int = 1) -> pd.DataFrame:
        """Add flag for social burst within window of each day.
        
        For each ticker-day, check whether a social burst occurs anywhere in
        [-window, +window] rows (trading days present in the data).
        """
        df = df.copy()
        df = df.sort_values(['ticker', 'date'])
        
        # Centered rolling max over 2*window + 1 rows per ticker.
        # min_periods=1 keeps the first and last `window` rows of each
        # ticker valid instead of turning them into NaN.
        span = 2 * window + 1
        df['social_burst_in_window'] = (
            df.groupby('ticker')['is_social_burst']
            .transform(lambda s: s.astype(int)
                       .rolling(span, center=True, min_periods=1)
                       .max())
            .astype(bool)
        )
        
        return df


# Initialize merger
merger = DataMerger(config)

# Merge data
print("Merging market and social data...")
merged_data = merger.merge_market_social(market_data, social_data)
merged_data = merger.add_social_burst_window(merged_data, window=config.SOCIAL_WINDOW)

print("\nMerged data sample:")
print(merged_data.head())
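
The windowed burst flag can be checked on a toy series. The sketch below is illustrative only: the ticker `AAA` and its dates are made up, and it uses a centered rolling maximum as one way to express "any burst within ±`window` rows", which is not necessarily how the class above computes it.

```python
import pandas as pd

# Hypothetical panel: one ticker, six business days, a single burst on row 2.
toy = pd.DataFrame({
    "ticker": ["AAA"] * 6,
    "date": pd.date_range("2021-01-04", periods=6, freq="B"),
    "is_social_burst": [False, False, True, False, False, False],
})

window = 1  # rows on each side, playing the role of SOCIAL_WINDOW

# A centered rolling max over 2*window + 1 rows is 1 whenever any row in
# [t - window, t + window] is a burst; min_periods=1 keeps the edge rows.
toy["social_burst_in_window"] = (
    toy.groupby("ticker")["is_social_burst"]
    .transform(lambda s: s.astype(int)
               .rolling(2 * window + 1, center=True, min_periods=1)
               .max())
    .astype(bool)
)

print(toy[["date", "is_social_burst", "social_burst_in_window"]])
```

Only the burst day and its two neighbours are flagged. Note that the window counts rows, so a weekend between two rows still counts as a distance of one.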

## 4. Joint Event Detection

### 4.1 Define Episode Criteria

In [None]:
# =============================================================================
# EPISODE DETECTOR
# =============================================================================

class EpisodeDetector:
    """Detects pump-and-dump episodes from merged data.
    
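    An episode is defined as:
    1. Price spike (return z-score > RETURN_ZSCORE_THRESHOLD) AND
    2. Volume spike (volume > 95th percentile) AND
    3. Social burst within +/- SOCIAL_WINDOW trading days
    
    Conditions 1 and 2 are expected to be encoded in the `is_candidate_event`
    flag; when that column is missing, the fallback checks the return
    z-score only.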
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
    
    def identify_episodes(self, df: pd.DataFrame) -> pd.DataFrame:
        """Identify joint price-volume-social episodes."""
        df = df.copy()
        
        # Condition 1: Price spike (from candidate_event flag or z-score)
        if 'is_candidate_event' in df.columns:
            price_spike = df['is_candidate_event']
        else:
            price_spike = df['return_zscore'] > self.config.RETURN_ZSCORE_THRESHOLD
        
        # Condition 2: Social burst in window
        if 'social_burst_in_window' in df.columns:
            social_burst = df['social_burst_in_window']
        else:
            social_burst = df['is_social_burst']
        
        # Joint event
        df['is_episode'] = price_spike & social_burst
        
        # Also flag without social requirement (for comparison)
        df['is_price_only_event'] = price_spike
        
        return df
    
    def cluster_episodes(self, df: pd.DataFrame, 
                          min_gap: int = 5) -> pd.DataFrame:
        """Cluster consecutive episode days into single episodes.
        
        If a qualifying day falls within min_gap calendar days of the
        previous qualifying day, both belong to the same episode
        (the first day of the cluster is the event day).
        """
        df = df.copy()
        df = df.sort_values(['ticker', 'date'])
        
        def assign_cluster(group):
            group = group.copy()
            group['episode_cluster'] = np.nan
            
            episode_days = group[group['is_episode']].index.tolist()
            
            cluster_id = 0
            last_episode_idx = None
            
            for idx in episode_days:
                if last_episode_idx is None:
                    cluster_id += 1
                    group.loc[idx, 'episode_cluster'] = cluster_id
                else:
                    # Check gap
                    days_since_last = (group.loc[idx, 'date'] - group.loc[last_episode_idx, 'date']).days
                    if days_since_last > min_gap:
                        cluster_id += 1
                    group.loc[idx, 'episode_cluster'] = cluster_id
                
                last_episode_idx = idx
            
            return group
        
        # Apply clustering
        df['date'] = pd.to_datetime(df['date'])
        df = df.groupby('ticker').apply(assign_cluster).reset_index(drop=True)
        
        # Mark first day of each cluster as the episode event day
        df['is_episode_start'] = False
        first_idx = (
            df[df['is_episode']]
            .groupby(['ticker', 'episode_cluster'])
            .head(1)
            .index
        )
        df.loc[first_idx, 'is_episode_start'] = True
        
        return df
    
    def extract_episodes(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract episode-level DataFrame."""
        episodes = df[df['is_episode_start']].copy()
        
        # Rename date to event_date
        episodes = episodes.rename(columns={'date': 'event_date'})
        
        # Select relevant columns
        cols = ['ticker', 'event_date', 'close', 'volume', 'return', 'return_zscore',
                'msg_count', 'msg_zscore', 'promo_share', 'user_concentration',
                'episode_cluster']
        cols = [c for c in cols if c in episodes.columns]
        
        episodes = episodes[cols].reset_index(drop=True)
        
        # Add episode ID
        episodes['episode_id'] = range(1, len(episodes) + 1)
        
        return episodes
    
    def summarize_detection(self, df: pd.DataFrame, episodes: pd.DataFrame) -> Dict:
        """Generate detection summary statistics."""
        summary = {
            'total_observations': len(df),
            'unique_tickers': int(df['ticker'].nunique()),
            'price_only_events': int(df['is_price_only_event'].sum()),
            'joint_events': int(df['is_episode'].sum()),
            'distinct_episodes': len(episodes),
            'tickers_with_episodes': int(episodes['ticker'].nunique()),
            'episodes_per_ticker': episodes.groupby('ticker').size().describe().to_dict()
        }
        
        return summary


# Initialize detector
detector = EpisodeDetector(config)
print("Episode Detector initialized")

In [None]:
# =============================================================================
# DETECT EPISODES
# =============================================================================

print("Identifying episodes...")

# Identify joint events
merged_data = detector.identify_episodes(merged_data)

# Cluster consecutive episodes
merged_data = detector.cluster_episodes(merged_data, min_gap=config.MIN_CLUSTER_GAP)

# Extract episode-level data
episodes_df = detector.extract_episodes(merged_data)

# Summary
detection_summary = detector.summarize_detection(merged_data, episodes_df)

print("\n" + "="*60)
print("EPISODE DETECTION RESULTS")
print("="*60)
print(f"Total observations: {detection_summary['total_observations']:,}")
print(f"Price-only events: {detection_summary['price_only_events']:,}")
print(f"Joint events (price + social): {detection_summary['joint_events']:,}")
print(f"Distinct episodes (after clustering): {detection_summary['distinct_episodes']}")
print(f"Tickers with episodes: {detection_summary['tickers_with_episodes']}")

print("\nEpisodes DataFrame:")
print(episodes_df.head(10))
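
The gap-based clustering rule can also be written without an explicit loop. The following is a minimal sketch on made-up dates for a hypothetical ticker `AAA`, assuming the same rule as `cluster_episodes`: a new cluster starts whenever the calendar-day gap to the previous qualifying day exceeds `min_gap`.

```python
import pandas as pd

# Hypothetical qualifying days for one ticker.
days = pd.DataFrame({
    "ticker": ["AAA"] * 5,
    "date": pd.to_datetime(
        ["2021-01-04", "2021-01-05", "2021-01-08", "2021-01-20", "2021-01-22"]
    ),
})

min_gap = 5  # calendar days, mirroring MIN_CLUSTER_GAP

days = days.sort_values(["ticker", "date"])

# Calendar-day gap to the previous qualifying day of the same ticker
# (NaN for the first qualifying day of each ticker).
gap = days.groupby("ticker")["date"].diff().dt.days

# A cluster starts on the first qualifying day or after a gap > min_gap.
new_cluster = gap.isna() | (gap > min_gap)

# Running count of cluster starts per ticker gives the cluster id.
days["episode_cluster"] = new_cluster.astype(int).groupby(days["ticker"]).cumsum()
days["is_episode_start"] = new_cluster

print(days)
```

The gaps here are 1, 3, 12 and 2 days, so the first three days form cluster 1 and the last two form cluster 2. Because the gap is measured from the previous qualifying day rather than from the cluster start, a long run of closely spaced days stays in one cluster.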

## 5. Add Ground Truth Labels

In [None]:
# =============================================================================
# ADD SEC ENFORCEMENT LABELS
# =============================================================================

class LabelAssigner:
    """Assigns ground truth labels to episodes.
    
    Labels:
    - 1 = Confirmed pump (ticker in SEC enforcement)
    - 0 = Control (high-volatility but no enforcement)
    """
    
    def __init__(self, labels_df: pd.DataFrame):
        # Work on a copy so date conversion in assign_labels does not modify the caller's DataFrame
        self.labels = labels_df.copy()
        self.confirmed_tickers = set(labels_df['ticker'].unique()) if len(labels_df) > 0 else set()
    
    def assign_labels(self, episodes_df: pd.DataFrame, 
                      lookback_days: int = 365) -> pd.DataFrame:
        """Assign labels based on SEC enforcement.
        
        An episode is labeled as confirmed pump if:
        - Ticker appears in SEC enforcement AND
        - Episode date is within lookback_days before enforcement date
        """
        episodes = episodes_df.copy()
        episodes['label'] = 0  # Default: control
        episodes['enforcement_date'] = pd.NaT  # datetime column, so matched dates keep a consistent dtype
        episodes['enforcement_release'] = None
        
        if len(self.labels) == 0:
            print("No enforcement labels available - all episodes labeled as control")
            return episodes
        
        # Convert dates
        episodes['event_date'] = pd.to_datetime(episodes['event_date'])
        self.labels['enforcement_date'] = pd.to_datetime(self.labels['enforcement_date'])
        
        # Match episodes to enforcement
        for idx, episode in episodes.iterrows():
            ticker = episode['ticker']
            event_date = episode['event_date']
            
            # Check if ticker is in enforcement
            ticker_enforcement = self.labels[self.labels['ticker'] == ticker]
            
            if len(ticker_enforcement) > 0:
                # Check if episode is within lookback window before enforcement
                for _, enf_row in ticker_enforcement.iterrows():
                    enf_date = enf_row['enforcement_date']
                    days_before = (enf_date - event_date).days
                    
                    if 0 <= days_before <= lookback_days:
                        episodes.loc[idx, 'label'] = 1
                        episodes.loc[idx, 'enforcement_date'] = enf_date
                        if 'release_number' in enf_row:
                            episodes.loc[idx, 'enforcement_release'] = enf_row['release_number']
                        break
        
        # Summary
        confirmed = (episodes['label'] == 1).sum()
        control = (episodes['label'] == 0).sum()
        
        print("Label Assignment:")
        print(f"  Confirmed pump (label=1): {confirmed}")
        print(f"  Control (label=0): {control}")
        
        return episodes


# Assign labels
label_assigner = LabelAssigner(sec_labels)
episodes_df = label_assigner.assign_labels(episodes_df)

print("\nLabeled Episodes Sample:")
print(episodes_df[['episode_id', 'ticker', 'event_date', 'label', 'enforcement_date']].head(10))
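
The lookback rule (an episode is confirmed when it falls 0 to `lookback_days` before an enforcement date for the same ticker) can be illustrated with a join instead of nested loops. This is a sketch on made-up rows; the tickers `AAA` and `BBB` and all dates are hypothetical.

```python
import pandas as pd

# Hypothetical episodes and a single hypothetical enforcement action.
episodes = pd.DataFrame({
    "episode_id": [1, 2, 3],
    "ticker": ["AAA", "AAA", "BBB"],
    "event_date": pd.to_datetime(["2021-03-01", "2019-01-15", "2021-06-01"]),
})
enforcement = pd.DataFrame({
    "ticker": ["AAA"],
    "enforcement_date": pd.to_datetime(["2021-09-01"]),
})
lookback_days = 365

# One row per (episode, enforcement action) pair; tickers without any
# action keep a single row with enforcement_date = NaT.
pairs = episodes.merge(enforcement, on="ticker", how="left")

# Days from the episode to the enforcement date (NaN when there is none).
days_before = (pairs["enforcement_date"] - pairs["event_date"]).dt.days

# between() is False for NaN, so unmatched tickers fall to label 0.
pairs["label"] = days_before.between(0, lookback_days).astype(int)

# An episode is confirmed if any enforcement action matches it.
labels = pairs.groupby("episode_id")["label"].max()
print(labels)
```

Episode 1 precedes the action by 184 days and is labeled 1. Episode 2 is outside the one-year lookback and episode 3 has no enforcement record, so both stay at 0.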

## 6. Define Episode Windows

In [None]:
# =============================================================================
# EPISODE WINDOW CONSTRUCTOR
# =============================================================================

class WindowConstructor:
    """Constructs event windows for each episode.
    
    Windows:
    - Pre-episode: t0-20 to t0-1 (baseline)
    - Event: t0 (spike day)
    - Post-short: t0+1 to t0+5 (immediate reversal)
    - Post-medium: t0+1 to t0+20 (full reversal)
    - Post-long: t0+1 to t0+60 (extended)
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
    
    def extract_window_data(self, daily_df: pd.DataFrame, 
                            ticker: str, 
                            event_date: datetime,
                            pre_days: int,
                            post_days: int) -> pd.DataFrame:
        """Extract data for a specific window around an event."""
        
        event_date = pd.to_datetime(event_date)
        
        # Filter to ticker first, then normalize dates on the copy so the
        # caller's DataFrame is not modified on every call
        ticker_data = daily_df[daily_df['ticker'] == ticker].copy()
        ticker_data['date'] = pd.to_datetime(ticker_data['date'])
        ticker_data = ticker_data.sort_values('date')
        
        # Find event index
        event_idx = ticker_data[ticker_data['date'] == event_date].index
        if len(event_idx) == 0:
            return pd.DataFrame()
        
        event_idx = event_idx[0]
        ticker_idx = ticker_data.index.tolist()
        event_pos = ticker_idx.index(event_idx)
        
        # Get window indices
        start_pos = max(0, event_pos - pre_days)
        end_pos = min(len(ticker_idx) - 1, event_pos + post_days)
        
        window_indices = ticker_idx[start_pos:end_pos + 1]
        
        window_data = ticker_data.loc[window_indices].copy()
        window_data['days_from_event'] = range(-event_pos + start_pos, end_pos - event_pos + 1)
        
        return window_data
    
    def compute_window_metrics(self, daily_df: pd.DataFrame,
                                episodes_df: pd.DataFrame) -> pd.DataFrame:
        """Compute metrics for each episode window."""
        
        episodes = episodes_df.copy()
        
        # Initialize metric columns
        metrics_cols = [
            # Event day metrics
            'event_return', 'event_volume_ratio',
            
            # Reversal metrics
            'return_5d', 'return_20d', 'return_60d',
            
            # Drawdown metrics
            'max_drawdown_5d', 'max_drawdown_20d', 'max_drawdown_60d',
            
            # Pre-event baseline
            'pre_avg_return', 'pre_avg_volume'
        ]
        
        for col in metrics_cols:
            episodes[col] = np.nan
        
        print(f"Computing window metrics for {len(episodes)} episodes...")
        
        for idx, episode in tqdm(episodes.iterrows(), total=len(episodes)):
            ticker = episode['ticker']
            event_date = episode['event_date']
            
            # Get full window data
            window = self.extract_window_data(
                daily_df, ticker, event_date,
                pre_days=self.config.PRE_WINDOW,
                post_days=self.config.POST_LONG
            )
            
            if len(window) == 0:
                continue
            
            # Event day metrics
            event_row = window[window['days_from_event'] == 0]
            if len(event_row) > 0:
                episodes.loc[idx, 'event_return'] = event_row['return'].iloc[0]
                if 'volume_ratio' in event_row.columns:
                    episodes.loc[idx, 'event_volume_ratio'] = event_row['volume_ratio'].iloc[0]
            
            # Pre-event baseline
            pre_window = window[window['days_from_event'] < 0]
            if len(pre_window) > 5:
                episodes.loc[idx, 'pre_avg_return'] = pre_window['return'].mean()
                episodes.loc[idx, 'pre_avg_volume'] = pre_window['volume'].mean()
            
            # Post-event returns and drawdowns
            post_window = window[window['days_from_event'] > 0]
            
            if len(post_window) > 0:
                event_close = event_row['close'].iloc[0] if len(event_row) > 0 else np.nan
                
                for days, suffix in [(5, '5d'), (20, '20d'), (60, '60d')]:
                    period = post_window[post_window['days_from_event'] <= days]
                    
                    # Require the full horizon: a window cut short by the end of
                    # the sample would otherwise be reported under a longer label
                    if len(period) == days and not np.isnan(event_close):
                        # Cumulative return
                        final_close = period['close'].iloc[-1]
                        cum_return = (final_close / event_close) - 1
                        episodes.loc[idx, f'return_{suffix}'] = cum_return
                        
                        # Max drawdown along the path starting at the event close
                        path = np.concatenate([[event_close], period['close'].values])
                        running_max = np.maximum.accumulate(path)
                        drawdowns = (running_max - path) / running_max
                        episodes.loc[idx, f'max_drawdown_{suffix}'] = drawdowns.max()
        
        return episodes


# Construct windows
window_constructor = WindowConstructor(config)
episodes_df = window_constructor.compute_window_metrics(merged_data, episodes_df)

print("\nEpisodes with Window Metrics:")
print(episodes_df[['episode_id', 'ticker', 'event_return', 
                   'return_5d', 'return_20d', 'max_drawdown_20d']].head(10))
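
A small worked example makes the return and drawdown definitions concrete. The prices below are invented for illustration; the calculation follows the same running-maximum approach used in `compute_window_metrics`.

```python
import numpy as np

# Hypothetical closes: the event-day close, then four post-event closes.
event_close = 10.0
post_closes = np.array([12.0, 9.0, 6.0, 8.0])

# Price path starting at the event close.
path = np.concatenate(([event_close], post_closes))

# Highest close seen so far at each step: [10, 12, 12, 12, 12].
running_max = np.maximum.accumulate(path)

# Fractional decline from the running peak at each step.
drawdowns = (running_max - path) / running_max

max_drawdown = drawdowns.max()            # (12 - 6) / 12 = 0.5
cum_return = path[-1] / event_close - 1   # 8 / 10 - 1 = -0.2

print(f"max drawdown: {max_drawdown:.2f}, cumulative return: {cum_return:.2f}")
```

The drawdown is measured from the post-event peak of 12, not from the event close, which is why it (50%) is larger in magnitude than the cumulative loss (20%).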

## 7. Placebo Tests

In [None]:
# =============================================================================
# PLACEBO TESTS
# =============================================================================

class PlaceboTester:
    """Runs placebo tests to validate episode detection.
    
    Tests:
    1. Large-cap placebo: Apply filter to S&P 500 stocks
    2. Time shuffle: Randomly permute social data dates
    3. Random universe: Apply to random stocks
    
    Only the time shuffle test (2) is implemented in this class.
    """
    
    def __init__(self, config: ResearchConfig):
        self.config = config
    
    def time_shuffle_test(self, merged_df: pd.DataFrame, 
                          n_iterations: int = 100) -> Dict:
        """Test detection rate with shuffled social data.
        
        If our detection is meaningful, shuffling social dates
        should dramatically reduce joint event detection.
        """
        np.random.seed(42)
        
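        # Sort so the per-ticker rolling window below runs in date order
        df = merged_df.sort_values(['ticker', 'date']).copy()
        original_rate = df['is_episode'].mean()
        
        # The price condition is unaffected by the shuffle, so compute it once
        if 'is_candidate_event' in df.columns:
            price_spike = df['is_candidate_event']
        else:
            price_spike = df['return_zscore'] > self.config.RETURN_ZSCORE_THRESHOLD
        
        span = 2 * self.config.SOCIAL_WINDOW + 1
        shuffled_rates = []
        
        print(f"Running time shuffle test ({n_iterations} iterations)...")
        
        for _ in tqdm(range(n_iterations)):
            # Shuffle social burst flags within each ticker
            df_shuffled = df.copy()
            
            for ticker in df_shuffled['ticker'].unique():
                ticker_mask = df_shuffled['ticker'] == ticker
                social_flags = df_shuffled.loc[ticker_mask, 'is_social_burst'].values.copy()
                np.random.shuffle(social_flags)
                df_shuffled.loc[ticker_mask, 'is_social_burst'] = social_flags
            
            # Rebuild the +/- SOCIAL_WINDOW flag from the shuffled series so
            # the null uses the same criterion as `is_episode`; comparing a
            # windowed original rate with a same-day shuffled rate would
            # overstate the ratio
            burst_in_window = (
                df_shuffled.groupby('ticker')['is_social_burst']
                .transform(lambda s: s.astype(int)
                           .rolling(span, center=True, min_periods=1)
                           .max())
                .astype(bool)
            )
            
            shuffled_episodes = price_spike & burst_in_window
            shuffled_rates.append(shuffled_episodes.mean())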
        
        results = {
            'original_rate': original_rate,
            'shuffled_mean': np.mean(shuffled_rates),
            'shuffled_std': np.std(shuffled_rates),
            'z_score': (original_rate - np.mean(shuffled_rates)) / np.std(shuffled_rates) if np.std(shuffled_rates) > 0 else np.inf,
            'ratio': original_rate / np.mean(shuffled_rates) if np.mean(shuffled_rates) > 0 else np.inf
        }
        
        return results
    
    def print_placebo_results(self, results: Dict):
        """Print formatted placebo test results."""
        print("\n" + "="*60)
        print("PLACEBO TEST RESULTS")
        print("="*60)
        print(f"Original episode rate: {results['original_rate']:.6f}")
        print(f"Shuffled mean rate: {results['shuffled_mean']:.6f}")
        print(f"Shuffled std: {results['shuffled_std']:.6f}")
        print(f"Z-score: {results['z_score']:.2f}")
        print(f"Original/Shuffled ratio: {results['ratio']:.2f}x")
        
        if results['z_score'] > 3:
            print("\n*** PASS: Original rate significantly higher than random ***")
        elif results['z_score'] > 2:
            print("\n** MARGINAL: Original rate moderately higher than random **")
        else:
            print("\n* WARNING: Original rate not significantly different from random *")


# Run placebo tests
placebo_tester = PlaceboTester(config)
placebo_results = placebo_tester.time_shuffle_test(merged_data, n_iterations=100)
placebo_tester.print_placebo_results(placebo_results)
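
The z-score above assumes the shuffled rates are roughly normal. An empirical permutation p-value is an alternative that avoids that assumption. The sketch below is not part of the notebook's pipeline: it uses synthetic same-day flags for a single series, with overlap built in by construction, purely to show the calculation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic flags for 1,000 days: 20 price spikes, 15 social bursts,
# with every burst placed on a spike day by construction.
n = 1000
price_spike = np.zeros(n, dtype=bool)
social_burst = np.zeros(n, dtype=bool)
price_spike[:20] = True
social_burst[:15] = True

# Observed joint rate: 15 joint days out of 1,000.
observed = (price_spike & social_burst).mean()

# Null distribution: permute the burst flags and recompute the joint rate.
n_iter = 200
null_rates = np.array([
    (price_spike & rng.permutation(social_burst)).mean()
    for _ in range(n_iter)
])

# One-sided empirical p-value with the usual +1 correction, so the
# estimate is never exactly zero.
p_value = (1 + (null_rates >= observed).sum()) / (1 + n_iter)

print(f"observed={observed:.4f}, null mean={null_rates.mean():.5f}, p={p_value:.4f}")
```

With `n_iter` permutations the smallest attainable p-value is `1 / (n_iter + 1)`, so the iteration count sets the resolution of the test.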

## 8. Visualizations

In [None]:
# =============================================================================
# VISUALIZATIONS
# =============================================================================

def plot_episode_characteristics(episodes_df: pd.DataFrame):
    """Plot characteristics of detected episodes."""
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Event returns distribution
    ax1 = axes[0, 0]
    data = episodes_df['event_return'].dropna() * 100
    ax1.hist(data, bins=30, edgecolor='black', alpha=0.7)
    ax1.axvline(x=data.median(), color='red', linestyle='--', label=f'Median: {data.median():.1f}%')
    ax1.set_xlabel('Event Day Return (%)')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Distribution of Event Day Returns')
    ax1.legend()
    
    # Post-event reversals
    ax2 = axes[0, 1]
    for col, label, color in [('return_5d', '5-Day', 'blue'), 
                               ('return_20d', '20-Day', 'orange'),
                               ('return_60d', '60-Day', 'green')]:
        if col in episodes_df.columns:
            data = episodes_df[col].dropna() * 100
            ax2.hist(data, bins=30, alpha=0.5, label=f'{label} (med: {data.median():.1f}%)', color=color)
    ax2.axvline(x=0, color='black', linestyle='-', linewidth=2)
    ax2.set_xlabel('Post-Event Return (%)')
    ax2.set_ylabel('Frequency')
    ax2.set_title('Post-Event Return Distributions')
    ax2.legend()
    
    # Max drawdowns
    ax3 = axes[1, 0]
    for col, label, color in [('max_drawdown_5d', '5-Day', 'blue'),
                               ('max_drawdown_20d', '20-Day', 'orange'),
                               ('max_drawdown_60d', '60-Day', 'green')]:
        if col in episodes_df.columns:
            data = episodes_df[col].dropna() * 100
            ax3.hist(data, bins=30, alpha=0.5, label=f'{label} (med: {data.median():.1f}%)', color=color)
    ax3.set_xlabel('Maximum Drawdown (%)')
    ax3.set_ylabel('Frequency')
    ax3.set_title('Post-Event Maximum Drawdown Distributions')
    ax3.legend()
    
    # Episodes over time
    ax4 = axes[1, 1]
    # Use a local Series so plotting does not modify the caller's DataFrame
    event_dates = pd.to_datetime(episodes_df['event_date'])
    monthly = episodes_df.groupby(event_dates.dt.to_period('M')).size()
    monthly.plot(ax=ax4, kind='bar', color='purple', alpha=0.7)
    ax4.set_xlabel('Month')
    ax4.set_ylabel('Episode Count')
    ax4.set_title('Episodes Over Time')
    ax4.tick_params(axis='x', rotation=45)
    for i, label in enumerate(ax4.xaxis.get_ticklabels()):
        if i % 6 != 0:
            label.set_visible(False)
    
    plt.tight_layout()
    plt.savefig(os.path.join(config.RESULTS_PATH, 'episode_characteristics.png'), dpi=150)
    plt.show()


def plot_confirmed_vs_control(episodes_df: pd.DataFrame):
    """Compare confirmed pumps vs control episodes."""
    if 'label' not in episodes_df.columns or episodes_df['label'].sum() == 0:
        print("No confirmed pump labels available for comparison")
        return
    
    confirmed = episodes_df[episodes_df['label'] == 1]
    control = episodes_df[episodes_df['label'] == 0]
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Event returns
    ax1 = axes[0]
    ax1.hist(confirmed['event_return'].dropna()*100, bins=20, alpha=0.5, label='Confirmed', color='red')
    ax1.hist(control['event_return'].dropna()*100, bins=20, alpha=0.5, label='Control', color='blue')
    ax1.set_xlabel('Event Return (%)')
    ax1.set_title('Event Day Returns')
    ax1.legend()
    
    # 20-day reversal
    ax2 = axes[1]
    ax2.hist(confirmed['return_20d'].dropna()*100, bins=20, alpha=0.5, label='Confirmed', color='red')
    ax2.hist(control['return_20d'].dropna()*100, bins=20, alpha=0.5, label='Control', color='blue')
    ax2.axvline(x=0, color='black', linestyle='--')
    ax2.set_xlabel('20-Day Return (%)')
    ax2.set_title('20-Day Post-Event Returns')
    ax2.legend()
    
    # Max drawdown
    ax3 = axes[2]
    ax3.hist(confirmed['max_drawdown_20d'].dropna()*100, bins=20, alpha=0.5, label='Confirmed', color='red')
    ax3.hist(control['max_drawdown_20d'].dropna()*100, bins=20, alpha=0.5, label='Control', color='blue')
    ax3.set_xlabel('Max Drawdown (%)')
    ax3.set_title('20-Day Maximum Drawdown')
    ax3.legend()
    
    plt.tight_layout()
    os.makedirs(config.RESULTS_PATH, exist_ok=True)  # savefig fails if the directory is missing
    plt.savefig(os.path.join(config.RESULTS_PATH, 'confirmed_vs_control.png'), dpi=150)
    plt.show()


# Generate visualizations
if len(episodes_df) > 0:
    print("Generating episode visualizations...")
    plot_episode_characteristics(episodes_df)
    plot_confirmed_vs_control(episodes_df)
else:
    print("No episodes to visualize")
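
The drawdown histograms above read columns such as `max_drawdown_20d`, which this notebook's summary describes as the worst peak-to-trough move in each post-event window. As a minimal sketch of that calculation (not necessarily how the columns were built upstream), the following assumes a plain list of positive closing prices and reports drawdown as a non-positive fraction. The function name `max_drawdown` and the sample prices are hypothetical.

```python
from typing import Sequence


def max_drawdown(prices: Sequence[float]) -> float:
    """Worst peak-to-trough decline as a fraction (0.0 if prices never fall).

    Assumption: drawdown is reported as a non-positive number, e.g. -0.25
    for a 25% decline from the running peak.
    """
    if not prices:
        raise ValueError("prices must be non-empty")
    running_peak = prices[0]
    worst = 0.0
    for price in prices:
        # Track the highest price seen so far, then measure the drop from it
        running_peak = max(running_peak, price)
        worst = min(worst, price / running_peak - 1.0)
    return worst


# Hypothetical closing prices for a short post-event window
window = [10.0, 12.0, 9.0, 11.0, 6.0, 8.0]
print(f"{max_drawdown(window):.1%}")
```

With these sample prices the running peak is 12.0 and the lowest later price is 6.0, so the sketch reports a 50% decline. If the upstream columns store drawdown as a positive magnitude instead, the sign convention here would need to be flipped.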

## 9. Save Outputs

In [None]:
# =============================================================================
# SAVE OUTPUTS
# =============================================================================

def save_episode_data(episodes_df: pd.DataFrame,
                      merged_df: pd.DataFrame,
                      detection_summary: Dict,
                      placebo_results: Dict,
                      output_dir: str):
    """Save episode detection outputs."""
    os.makedirs(output_dir, exist_ok=True)
    
    # Save episodes
    episodes_path = os.path.join(output_dir, 'episodes.parquet')
    episodes_df.to_parquet(episodes_path, index=False)
    print(f"Saved episodes: {episodes_path}")
    
    # Save episodes CSV
    episodes_csv = os.path.join(output_dir, 'episodes.csv')
    episodes_df.to_csv(episodes_csv, index=False)
    print(f"Saved episodes CSV: {episodes_csv}")
    
    # Save merged daily data with episode flags
    merged_path = os.path.join(output_dir, 'merged_daily_data.parquet')
    merged_df.to_parquet(merged_path, index=False)
    print(f"Saved merged daily data: {merged_path}")
    
    # Save summary
    summary = {
        'detection': detection_summary,
        'placebo': placebo_results,
        'episode_stats': {
            'total_episodes': len(episodes_df),
            'confirmed_pumps': int((episodes_df['label'] == 1).sum()) if 'label' in episodes_df.columns else 0,
            'control_episodes': int((episodes_df['label'] == 0).sum()) if 'label' in episodes_df.columns else len(episodes_df),
            'unique_tickers': int(episodes_df['ticker'].nunique()),
            # Use None (serialized as JSON null) rather than np.nan when a column is missing
            'avg_event_return': float(episodes_df['event_return'].mean()) if 'event_return' in episodes_df.columns else None,
            'avg_20d_reversal': float(episodes_df['return_20d'].mean()) if 'return_20d' in episodes_df.columns else None,
            'avg_max_drawdown_20d': float(episodes_df['max_drawdown_20d'].mean()) if 'max_drawdown_20d' in episodes_df.columns else None
        },
        'config': {
            'return_threshold': config.RETURN_ZSCORE_THRESHOLD,
            'social_threshold': config.SOCIAL_ZSCORE_THRESHOLD,
            'social_window': config.SOCIAL_WINDOW,
            'cluster_gap': config.MIN_CLUSTER_GAP
        },
        'created_at': datetime.now().isoformat()
    }
    
    summary_path = os.path.join(output_dir, 'notebook04_summary.json')
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    print(f"Saved summary: {summary_path}")
    
    return summary


# Save outputs
output_summary = save_episode_data(
    episodes_df=episodes_df,
    merged_df=merged_data,
    detection_summary=detection_summary,
    placebo_results=placebo_results,
    output_dir=config.RESULTS_PATH
)

print("\n" + "="*60)
print("Output Summary:")
print(json.dumps(output_summary, indent=2, default=str))
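
One detail of the summary file is worth checking: by default Python's `json` module writes a float NaN as the bare token `NaN`, which strict JSON parsers reject. A mean over an empty or all-missing column is NaN, so the summary statistics above can hit this case. The sketch below shows one possible way to guard against it, replacing non-finite floats with `None` before serializing. The helper `nan_to_none` and the sample dictionary are hypothetical and not part of the notebook's pipeline.

```python
import json
import math
from typing import Any


def nan_to_none(value: Any) -> Any:
    """Recursively replace non-finite floats with None so json emits null."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


# Hypothetical summary fragment with a missing statistic
summary = {"episode_stats": {"total_episodes": 0, "avg_event_return": float("nan")}}

# allow_nan=False makes json.dumps raise ValueError if a NaN slips through
text = json.dumps(nan_to_none(summary), allow_nan=False)
print(text)
```

Passing `allow_nan=False` turns any remaining NaN into an immediate error instead of a silently invalid file, which may be preferable when the JSON is read by tools outside Python.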

## 10. Summary and Next Steps

In [None]:
# =============================================================================
# NOTEBOOK 4 SUMMARY
# =============================================================================

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║               NOTEBOOK 4: EPISODE DETECTION COMPLETE                         ║
╚══════════════════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• episodes.parquet                - Episode-level dataset with window metrics
• episodes.csv                    - CSV for inspection
• merged_daily_data.parquet       - Full daily data with episode flags
• episode_characteristics.png     - Visualizations
• confirmed_vs_control.png        - Comparison plots
• notebook04_summary.json         - Summary statistics

EPISODE DETECTION CRITERIA:
───────────────────────────
1. Return z-score > 3.0 (price spike)
2. Volume > 95th percentile (volume spike)
3. Social burst within ±1 day (message z-score > 3.0)
4. Consecutive events clustered (5-day gap)

EPISODE WINDOWS:
────────────────
• Pre-event: t-20 to t-1 (baseline)
• Event: t0 (spike day)
• Post-short: t+1 to t+5
• Post-medium: t+1 to t+20
• Post-long: t+1 to t+60

KEY METRICS:
────────────
• Event return: Return on spike day
• Reversal returns: Cumulative returns in post windows
• Max drawdown: Worst peak-to-trough decline in each window

NEXT STEPS:
───────────
→ Notebook 5: Feature Engineering & Classification
  - Engineer features for pump classification
  - Train Random Forest classifier
  - Generate Pump Likelihood Scores (PLS)

IMPORTANT NOTES:
────────────────
1. Episodes require all detection conditions jointly - a price spike alone is not an episode
2. Placebo test checks that joint detections are not explained by chance
3. Ground truth labels come from SEC enforcement actions and are incomplete
4. PLS will be computed in Notebook 5 as a continuous proxy

""")

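Criterion 4 in the summary above (consecutive events clustered with a 5-day gap) can be illustrated with a small standalone sketch. It is not the notebook's detection code. It assumes the gap is counted in calendar days, that a flagged day joins the current cluster when it falls within the gap of the previous flagged day, and that the first day of each cluster is used as the episode date. The name `cluster_event_dates` and the sample dates are hypothetical.

```python
from datetime import date
from typing import List


def cluster_event_dates(dates: List[date], min_gap_days: int = 5) -> List[List[date]]:
    """Group flagged dates; a new cluster starts when the gap exceeds min_gap_days."""
    clusters: List[List[date]] = []
    for current in sorted(dates):
        # Compare against the most recent flagged day in the open cluster
        if clusters and (current - clusters[-1][-1]).days <= min_gap_days:
            clusters[-1].append(current)
        else:
            clusters.append([current])
    return clusters


# Hypothetical flagged days for one ticker
flagged = [date(2021, 1, 4), date(2021, 1, 5), date(2021, 1, 8), date(2021, 2, 1)]
clusters = cluster_event_dates(flagged)

# Assumption: the first flagged day of each cluster is the episode date
event_dates = [cluster[0] for cluster in clusters]
print(event_dates)
```

Here the first three flagged days collapse into a single episode dated 2021-01-04, and 2021-02-01 starts a second one. Counting the gap in trading days rather than calendar days would change which events merge around weekends and holidays.
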
In [None]:
# =============================================================================
# ENVIRONMENT INFO
# =============================================================================

import sys
import platform

print("Environment Information:")
print(f"  Python: {sys.version}")
print(f"  Platform: {platform.platform()}")
print(f"  Pandas: {pd.__version__}")
print(f"  NumPy: {np.__version__}")
print(f"  Timestamp: {datetime.now().isoformat()}")