# Notebook 1: Universe Construction & SEC Enforcement Scraping
## Social Media-Driven Stock Manipulation and Tail Risk Research

---

**Research Project:** Social Media-Driven Stock Manipulation and Tail Risk

**Purpose:** Build the stock universe for analysis using freely available web sources and extract ground truth labels from SEC enforcement releases.

**Data Sources:**
- SEC EDGAR Litigation Releases
- OTC Markets Stock Screener
- Yahoo Finance Screener

**Output:**
- Ticker universe with metadata
- SEC enforcement cases (ground truth labels)

---

**Last Updated:** 2025

## 1. Environment Setup

In [9]:
!pip install --upgrade numpy pandas
import pandas as pd
import numpy as np



In [10]:
# =============================================================================
# IMPORT LIBRARIES
# =============================================================================

import os
import re
import json
import time
import random
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Set, Optional, Tuple
from collections import defaultdict
import pandas as pd
from tqdm.notebook import tqdm
import requests
from bs4 import BeautifulSoup
import yfinance as yf

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

print(f"Environment setup complete. Timestamp: {datetime.now()}")

Environment setup complete. Timestamp: 2025-12-12 06:56:16.084773


## 2. Configuration

In [11]:
# =============================================================================
# RESEARCH CONFIGURATION
# =============================================================================

class ResearchConfig:
    """Configuration for Social Media Stock Manipulation Research.

    This research focuses on web-scrapeable data only:
    - Yahoo Finance (prices, volume, message boards)
    - SEC EDGAR (filings, enforcement releases)
    - Public news archives
    """

    # Sample Period
    START_DATE = "2019-01-01"
    END_DATE = "2025-12-31"

    # Universe Filters
    MAX_MARKET_CAP = 500_000_000  # $500M
    MAX_PRICE = 10.0  # $10
    MIN_AVG_VOLUME = 10_000  # shares/day

    # Episode Detection Thresholds
    RETURN_ZSCORE_THRESHOLD = 3.0
    VOLUME_PERCENTILE_THRESHOLD = 95
    SOCIAL_ZSCORE_THRESHOLD = 3.0
    ROLLING_WINDOW = 60  # days

    # Data Storage Paths (Google Drive mount for Colab)
    BASE_PATH = "/content/drive/MyDrive/Research/PumpDump/"
    RAW_DATA_PATH = BASE_PATH + "data/raw/"
    PROCESSED_DATA_PATH = BASE_PATH + "data/processed/"
    RESULTS_PATH = BASE_PATH + "results/"

    # Scraping Rate Limits
    MIN_DELAY = 2.0  # seconds
    MAX_DELAY = 5.0  # seconds

    # User Agent for requests
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

    @classmethod
    def print_config(cls):
        print("="*60)
        print("RESEARCH CONFIGURATION")
        print("="*60)
        print(f"Sample Period: {cls.START_DATE} to {cls.END_DATE}")
        print(f"Max Market Cap: ${cls.MAX_MARKET_CAP:,.0f}")
        print(f"Max Price: ${cls.MAX_PRICE}")
        print(f"Min Avg Volume: {cls.MIN_AVG_VOLUME:,} shares/day")
        print(f"Return Z-Score Threshold: {cls.RETURN_ZSCORE_THRESHOLD}")
        print(f"Volume Percentile Threshold: {cls.VOLUME_PERCENTILE_THRESHOLD}%")
        print(f"Social Z-Score Threshold: {cls.SOCIAL_ZSCORE_THRESHOLD}")
        print("="*60)

config = ResearchConfig()
config.print_config()

RESEARCH CONFIGURATION
Sample Period: 2019-01-01 to 2025-12-31
Max Market Cap: $500,000,000
Max Price: $10.0
Min Avg Volume: 10,000 shares/day
Return Z-Score Threshold: 3.0
Volume Percentile Threshold: 95%
Social Z-Score Threshold: 3.0


In [12]:
# =============================================================================
# MOUNT GOOGLE DRIVE (for Colab)
# =============================================================================

try:
    from google.colab import drive
    drive.mount('/content/drive')
    IN_COLAB = True
except ImportError:
    print("Not running in Colab - using local paths")
    IN_COLAB = False
    # Override paths for local execution
    config.BASE_PATH = "./research_data/"
    config.RAW_DATA_PATH = config.BASE_PATH + "data/raw/"
    config.PROCESSED_DATA_PATH = config.BASE_PATH + "data/processed/"
    config.RESULTS_PATH = config.BASE_PATH + "results/"

# Create directory structure
os.makedirs(config.RAW_DATA_PATH, exist_ok=True)
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)
os.makedirs(config.RESULTS_PATH, exist_ok=True)

print(f"Data directories created at: {config.BASE_PATH}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data directories created at: /content/drive/MyDrive/Research/PumpDump/


## 3. SEC Enforcement Release Scraper

### 3.1 Scrape SEC Litigation Releases

We scrape SEC litigation releases to identify confirmed pump-and-dump cases. These serve as ground truth labels for our classification model.

In [13]:
# =============================================================================
# SEC ENFORCEMENT SCRAPER
# =============================================================================

class SECEnforcementScraper:
    """Scrapes SEC litigation releases for pump-and-dump enforcement actions.

    Sources:
    - SEC Litigation Releases: sec.gov/litigation/litreleases.htm
    - SEC Press Releases: sec.gov/news/pressreleases
    - Administrative Proceedings: sec.gov/litigation/admin.htm
    """

    # Keywords indicating pump-and-dump or market manipulation
    MANIPULATION_KEYWORDS = [
        'pump and dump', 'pump-and-dump', 'market manipulation',
        'manipulative trading', 'touting', 'promotional campaign',
        'artificially inflate', 'artificially inflated',
        'scalping', 'front running', 'spoofing',
        'wash trading', 'matched orders', 'marking the close',
        'penny stock', 'microcap fraud', 'stock promotion scheme',
        'social media manipulation', 'coordinated trading'
    ]

    def __init__(self, config: ResearchConfig):
        self.config = config
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': config.USER_AGENT,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        })
        self.enforcement_cases = []

    def _rate_limit(self):
        """Implement polite rate limiting."""
        time.sleep(random.uniform(self.config.MIN_DELAY, self.config.MAX_DELAY))

    def scrape_litigation_releases_index(self, year: int) -> List[Dict]:
        """Scrape litigation releases index for a given year.

        Args:
            year: Year to scrape

        Returns:
            List of release metadata dictionaries
        """
        releases = []

        # SEC changed URL structure over time
        if year >= 2020:
            url = f"https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive{year}.htm"
        else:
            url = f"https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive{year}.htm"

        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'lxml')

            # Find all release links
            # SEC uses various HTML structures - try multiple selectors
            links = soup.find_all('a', href=re.compile(r'/litigation/litreleases/'))

            for link in links:
                href = link.get('href', '')
                text = link.get_text(strip=True)

                # Extract release number from URL
                match = re.search(r'lr(\d+)', href)
                if match:
                    releases.append({
                        'release_number': match.group(1),
                        'url': 'https://www.sec.gov' + href if href.startswith('/') else href,
                        'title': text,
                        'year': year
                    })

            print(f"Year {year}: Found {len(releases)} litigation releases")

        except requests.RequestException as e:
            print(f"Error scraping year {year}: {e}")

        self._rate_limit()
        return releases

    def scrape_release_content(self, url: str) -> Dict:
        """Scrape the full content of a litigation release.

        Args:
            url: URL of the litigation release

        Returns:
            Dictionary with release content and extracted metadata
        """
        content = {
            'url': url,
            'full_text': '',
            'date': None,
            'tickers_mentioned': [],
            'companies_mentioned': [],
            'is_manipulation_case': False,
            'manipulation_type': [],
            'defendants': []
        }

        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'lxml')

            # Extract main content
            # SEC uses different structures - try multiple selectors
            main_content = soup.find('div', {'id': 'main-content'})
            if not main_content:
                main_content = soup.find('div', {'class': 'article-content'})
            if not main_content:
                main_content = soup.find('body')

            if main_content:
                content['full_text'] = main_content.get_text(separator=' ', strip=True)

            # Extract date
            date_match = re.search(r'(\w+ \d{1,2}, \d{4})', content['full_text'])
            if date_match:
                try:
                    content['date'] = datetime.strptime(date_match.group(1), '%B %d, %Y').date()
                except ValueError:
                    pass

            # Extract ticker symbols (pattern: uppercase letters in parentheses or with $)
            # Common patterns: (NASDAQ: XXXX), (OTC: XXXX), (NYSE: XXX), stock symbol XXXX
            ticker_patterns = [
                r'\((?:NASDAQ|NYSE|OTC|OTCBB|OTC Markets)[:\s]+([A-Z]{1,5})\)',
                r'stock symbol[:\s]+([A-Z]{1,5})',
                r'ticker symbol[:\s]+([A-Z]{1,5})',
                r'trading under[:\s]+([A-Z]{1,5})',
                r'\$([A-Z]{1,5})\b'
            ]

            for pattern in ticker_patterns:
                matches = re.findall(pattern, content['full_text'])
                content['tickers_mentioned'].extend(matches)

            content['tickers_mentioned'] = list(set(content['tickers_mentioned']))

            # Check for manipulation keywords
            text_lower = content['full_text'].lower()
            for keyword in self.MANIPULATION_KEYWORDS:
                if keyword in text_lower:
                    content['is_manipulation_case'] = True
                    content['manipulation_type'].append(keyword)

            content['manipulation_type'] = list(set(content['manipulation_type']))

        except requests.RequestException as e:
            print(f"Error scraping {url}: {e}")

        self._rate_limit()
        return content

    def scrape_all_years(self, start_year: int = 2019, end_year: int = 2025) -> pd.DataFrame:
        """Scrape all litigation releases for the sample period.

        Args:
            start_year: First year to scrape
            end_year: Last year to scrape

        Returns:
            DataFrame of manipulation cases
        """
        all_releases = []

        # First, collect all release URLs
        print("Phase 1: Collecting litigation release URLs...")
        for year in range(start_year, end_year + 1):
            releases = self.scrape_litigation_releases_index(year)
            all_releases.extend(releases)

        print(f"\nTotal releases found: {len(all_releases)}")

        # Phase 2: Scrape each release for manipulation cases
        print("\nPhase 2: Scraping individual releases (this will take time)...")
        manipulation_cases = []

        for release in tqdm(all_releases, desc="Scraping releases"):
            content = self.scrape_release_content(release['url'])

            if content['is_manipulation_case']:
                case = {
                    'release_number': release['release_number'],
                    'release_url': release['url'],
                    'release_title': release['title'],
                    'release_year': release['year'],
                    'release_date': content['date'],
                    'tickers': content['tickers_mentioned'],
                    'manipulation_types': content['manipulation_type'],
                    'full_text': content['full_text'][:5000]  # Truncate for storage
                }
                manipulation_cases.append(case)
                self.enforcement_cases.append(case)

        df = pd.DataFrame(manipulation_cases)
        print(f"\nFound {len(df)} manipulation-related enforcement cases")

        return df

    def extract_ticker_date_labels(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract ticker-level labels from enforcement cases.

        Creates a lookup table: (ticker, date_range) -> enforcement case
        """
        labels = []

        for _, row in df.iterrows():
            for ticker in row['tickers']:
                labels.append({
                    'ticker': ticker,
                    'enforcement_date': row['release_date'],
                    'release_number': row['release_number'],
                    'manipulation_types': row['manipulation_types'],
                    'label': 1  # Confirmed manipulation
                })

        return pd.DataFrame(labels)


# Initialize scraper
sec_scraper = SECEnforcementScraper(config)
print("SEC Enforcement Scraper initialized")

SEC Enforcement Scraper initialized


In [14]:
# =============================================================================
# EXECUTE SEC SCRAPING
# =============================================================================

# Scrape SEC enforcement releases
# NOTE: This can take 1-2 hours due to rate limiting

print("Starting SEC enforcement scraping...")
print("This will take approximately 1-2 hours due to polite rate limiting.")
print("="*60)

# Extract start and end years from config
start_year = int(config.START_DATE[:4])
end_year = int(config.END_DATE[:4])

# Scrape all years
enforcement_df = sec_scraper.scrape_all_years(start_year, end_year)

# Display results
print("\n" + "="*60)
print("SEC ENFORCEMENT SCRAPING COMPLETE")
print("="*60)
print(f"Total manipulation cases: {len(enforcement_df)}")
if len(enforcement_df) > 0:
    print(f"Date range: {enforcement_df['release_date'].min()} to {enforcement_df['release_date'].max()}")
    print(f"\nManipulation types found:")
    all_types = [t for types in enforcement_df['manipulation_types'] for t in types]
    type_counts = pd.Series(all_types).value_counts()
    print(type_counts.head(10))

Starting SEC enforcement scraping...
This will take approximately 1-2 hours due to polite rate limiting.
Phase 1: Collecting litigation release URLs...
Error scraping year 2019: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2019.htm
Error scraping year 2020: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2020.htm
Error scraping year 2021: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2021.htm
Error scraping year 2022: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2022.htm
Error scraping year 2023: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2023.htm
Error scraping year 2024: 403 Client Error: Forbidden for url: https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2024.htm
Error scraping

Scraping releases: 0it [00:00, ?it/s]


Found 0 manipulation-related enforcement cases

SEC ENFORCEMENT SCRAPING COMPLETE
Total manipulation cases: 0


In [15]:
# =============================================================================
# EXTRACT TICKER-LEVEL LABELS
# =============================================================================

if len(enforcement_df) > 0:
    # Create ticker-level labels
    ticker_labels = sec_scraper.extract_ticker_date_labels(enforcement_df)

    print("Ticker-Level Labels:")
    print(f"Total labeled tickers: {len(ticker_labels)}")
    print(f"Unique tickers: {ticker_labels['ticker'].nunique()}")
    print(f"\nSample labels:")
    print(ticker_labels.head(10))
else:
    print("No enforcement cases found - using sample data for demonstration")
    # Create sample data for demonstration
    ticker_labels = pd.DataFrame({
        'ticker': ['XXXX', 'YYYY', 'ZZZZ'],
        'enforcement_date': [datetime(2021, 6, 15).date(),
                             datetime(2022, 3, 22).date(),
                             datetime(2023, 11, 8).date()],
        'release_number': ['25001', '25123', '25456'],
        'manipulation_types': [['pump and dump'], ['touting'], ['market manipulation']],
        'label': [1, 1, 1]
    })

No enforcement cases found - using sample data for demonstration


## 4. Universe Construction

### 4.1 Build Ticker Universe from Multiple Sources

Since we cannot access comprehensive listing databases, we build our universe iteratively:
1. Seed from SEC enforcement tickers
2. Expand via Yahoo Finance screeners
3. Cross-reference OTC Markets

In [16]:
# =============================================================================
# UNIVERSE BUILDER
# =============================================================================

class UniverseBuilder:
    """Builds the stock universe for pump-and-dump research.

    Universe criteria:
    - Market cap < $500M (small-cap focus)
    - Price < $10 (penny stock territory)
    - Average volume > 10,000 shares/day (tradeable)
    """

    def __init__(self, config: ResearchConfig):
        self.config = config
        self.universe = set()
        self.ticker_metadata = {}
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': config.USER_AGENT})

    def add_sec_enforcement_tickers(self, ticker_labels: pd.DataFrame):
        """Add tickers from SEC enforcement cases."""
        tickers = set(ticker_labels['ticker'].unique())
        print(f"Adding {len(tickers)} tickers from SEC enforcement cases")
        self.universe.update(tickers)

        for ticker in tickers:
            self.ticker_metadata[ticker] = {
                'source': 'sec_enforcement',
                'is_confirmed_manipulation': True
            }

    def add_known_meme_stocks(self):
        """Add known meme stocks and pump targets."""
        meme_stocks = {
            # 2021 Meme Stock Saga
            'GME': 'GameStop Corp',
            'AMC': 'AMC Entertainment',
            'BB': 'BlackBerry Limited',
            'NOK': 'Nokia Corporation',
            'BBBY': 'Bed Bath & Beyond',
            'KOSS': 'Koss Corporation',
            'EXPR': 'Express Inc',
            'NAKD': 'Cenntro Electric',

            # Other Notable Pump Targets
            'CLOV': 'Clover Health',
            'WISH': 'ContextLogic Inc',
            'WKHS': 'Workhorse Group',
            'RIDE': 'Lordstown Motors',
            'NKLA': 'Nikola Corporation',
            'SPCE': 'Virgin Galactic',
            'PLTR': 'Palantir Technologies',
            'TLRY': 'Tilray Brands',
            'SNDL': 'Sundial Growers',

            # 2024-2025 Notable Cases
            'DJT': 'Trump Media & Technology',
            'SMCI': 'Super Micro Computer',
            'FFIE': 'Faraday Future',
        }

        print(f"Adding {len(meme_stocks)} known meme/pump stocks")

        for ticker, name in meme_stocks.items():
            self.universe.add(ticker)
            if ticker not in self.ticker_metadata:
                self.ticker_metadata[ticker] = {
                    'source': 'known_meme_stock',
                    'company_name': name,
                    'is_confirmed_manipulation': False
                }

    def scrape_yahoo_screener_smallcaps(self, max_pages: int = 10) -> List[str]:
        """Scrape small-cap stocks from Yahoo Finance screener.

        Note: Yahoo Finance screener has rate limits and may require
        alternative approaches (e.g., using yfinance Ticker lists).
        """
        tickers = []

        # Yahoo Finance doesn't have a direct screener API
        # We'll use a list of known small-cap indexes/ETFs holdings as proxy

        # IWM (Russell 2000) and IWC (Russell Microcap) holdings approximation
        small_cap_proxies = [
            'IWM',   # iShares Russell 2000 ETF
            'IWC',   # iShares Microcap ETF
            'SLYV',  # SPDR S&P 600 Small Cap Value
            'VBR',   # Vanguard Small-Cap Value
        ]

        print("Note: Yahoo Finance screener requires workarounds.")
        print("Using ETF holdings as proxy for small-cap universe.")

        return tickers

    def validate_tickers_with_yfinance(self, tickers: List[str],
                                       batch_size: int = 50) -> pd.DataFrame:
        """Validate tickers and get metadata using yfinance.

        Args:
            tickers: List of ticker symbols
            batch_size: Number of tickers per batch

        Returns:
            DataFrame with ticker metadata
        """
        validated = []

        ticker_list = list(tickers)
        batches = [ticker_list[i:i+batch_size] for i in range(0, len(ticker_list), batch_size)]

        print(f"Validating {len(ticker_list)} tickers in {len(batches)} batches...")

        for batch in tqdm(batches, desc="Validating tickers"):
            for ticker in batch:
                try:
                    stock = yf.Ticker(ticker)
                    info = stock.info

                    # Extract key metadata
                    validated.append({
                        'ticker': ticker,
                        'company_name': info.get('longName', info.get('shortName', '')),
                        'market_cap': info.get('marketCap', np.nan),
                        'current_price': info.get('currentPrice', info.get('regularMarketPrice', np.nan)),
                        'avg_volume': info.get('averageVolume', np.nan),
                        'exchange': info.get('exchange', ''),
                        'sector': info.get('sector', ''),
                        'industry': info.get('industry', ''),
                        'is_valid': True
                    })

                except Exception as e:
                    validated.append({
                        'ticker': ticker,
                        'company_name': '',
                        'market_cap': np.nan,
                        'current_price': np.nan,
                        'avg_volume': np.nan,
                        'exchange': '',
                        'sector': '',
                        'industry': '',
                        'is_valid': False
                    })

            # Rate limiting
            time.sleep(1)

        return pd.DataFrame(validated)

    def filter_universe(self, metadata_df: pd.DataFrame) -> pd.DataFrame:
        """Filter universe based on research criteria.

        Criteria:
        - Market cap < $500M OR unknown (include penny stocks)
        - Price < $10 OR unknown
        - Average volume > 10,000 shares/day OR unknown
        """
        df = metadata_df.copy()

        # Apply filters (allow NaN values through - might be valid stocks)
        mask = (
            (df['is_valid']) &
            (
                (df['market_cap'].isna()) |
                (df['market_cap'] <= self.config.MAX_MARKET_CAP) |
                (df['market_cap'] == 0)
            )
        )

        filtered = df[mask].copy()

        print(f"\nUniverse Filtering Results:")
        print(f"  Original: {len(df)} tickers")
        print(f"  Valid: {df['is_valid'].sum()} tickers")
        print(f"  After filters: {len(filtered)} tickers")

        return filtered

    def build_universe(self, ticker_labels: pd.DataFrame) -> pd.DataFrame:
        """Build complete universe.

        Args:
            ticker_labels: DataFrame from SEC enforcement scraping

        Returns:
            Final universe DataFrame with metadata
        """
        print("="*60)
        print("BUILDING STOCK UNIVERSE")
        print("="*60)

        # Step 1: Add SEC enforcement tickers
        self.add_sec_enforcement_tickers(ticker_labels)

        # Step 2: Add known meme/pump stocks
        self.add_known_meme_stocks()

        # Step 3: Validate all tickers
        print(f"\nTotal candidate tickers: {len(self.universe)}")
        metadata_df = self.validate_tickers_with_yfinance(self.universe)

        # Step 4: Filter universe
        final_universe = self.filter_universe(metadata_df)

        # Step 5: Add source information
        final_universe['source'] = final_universe['ticker'].map(
            lambda x: self.ticker_metadata.get(x, {}).get('source', 'other')
        )
        final_universe['is_confirmed_manipulation'] = final_universe['ticker'].map(
            lambda x: self.ticker_metadata.get(x, {}).get('is_confirmed_manipulation', False)
        )

        print("\n" + "="*60)
        print("UNIVERSE CONSTRUCTION COMPLETE")
        print("="*60)
        print(f"Final universe size: {len(final_universe)} tickers")
        print(f"Confirmed manipulation: {final_universe['is_confirmed_manipulation'].sum()} tickers")

        return final_universe


# Initialize builder
universe_builder = UniverseBuilder(config)
print("Universe Builder initialized")

Universe Builder initialized


In [17]:
# =============================================================================
# BUILD THE UNIVERSE
# =============================================================================

# Build universe using SEC labels
universe_df = universe_builder.build_universe(ticker_labels)

# Display universe summary
print("\nUniverse Summary:")
print(universe_df.describe())

print("\nSample of universe:")
print(universe_df.head(20))

BUILDING STOCK UNIVERSE
Adding 3 tickers from SEC enforcement cases
Adding 20 known meme/pump stocks

Total candidate tickers: 23
Validating 23 tickers in 1 batches...


Validating tickers:   0%|          | 0/1 [00:00<?, ?it/s]

ERROR:yfinance:HTTP Error 404: {"quoteSummary":{"result":null,"error":{"code":"Not Found","description":"Quote not found for symbol: YYYY"}}}
ERROR:yfinance:HTTP Error 404: {"quoteSummary":{"result":null,"error":{"code":"Not Found","description":"Quote not found for symbol: ZZZZ"}}}



Universe Filtering Results:
  Original: 23 tickers
  Valid: 23 tickers
  After filters: 14 tickers

UNIVERSE CONSTRUCTION COMPLETE
Final universe size: 14 tickers
Confirmed manipulation: 3 tickers

Universe Summary:
         market_cap  current_price    avg_volume
count  5.000000e+00       7.000000  6.000000e+00
mean   2.357461e+08       8.801857  1.705971e+06
std    2.097180e+08      13.314364  1.726147e+06
min    1.401670e+07       0.183000  3.841900e+04
25%    4.463438e+07       2.660000  3.149438e+05
50%    2.243609e+08       4.720000  1.354190e+06
75%    4.399435e+08       6.425000  2.723287e+06
max    4.557750e+08      38.540000  4.340846e+06

Sample of universe:
   ticker                    company_name   market_cap  current_price  \
0    SNDL                       SNDL Inc.  455775008.0          1.770   
1    NKLA              Nikola Corporation          NaN          0.183   
3    EXPR                                          NaN            NaN   
5    KOSS                Koss

## 5. Expand Universe with Additional Volatile Small-Caps

To ensure we capture potential pump-and-dump candidates not yet in SEC enforcement, we add high-volatility small-caps.

In [18]:
# =============================================================================
# ADD HIGH-VOLATILITY PENNY STOCKS
# =============================================================================

# Additional small-cap/penny stocks known for high volatility
# These are stocks commonly discussed in pump-and-dump contexts

additional_volatile_stocks = [
    # Recent high-volatility small caps
    'MULN', 'BBIG', 'ATER', 'PROG', 'CENN', 'GNUS', 'SAVA', 'PHUN',
    'DWAC', 'IRNT', 'OPAD', 'TMC', 'LIDR', 'PTRA', 'GOEV', 'ARVL',
    'LCID', 'RIVN', 'FSR', 'HYLN', 'XL', 'BLNK', 'CHPT', 'QS',

    # OTC/Pink Sheet frequent movers (tickers may vary)
    'EEENF', 'OZSC', 'ALPP', 'ABML', 'USMJ', 'HCMC', 'AITX', 'DPLS',

    # Cannabis sector (frequent pump targets)
    'CGC', 'ACB', 'TLRY', 'HEXO', 'OGI', 'VFF', 'GRWG',

    # Biotech small caps
    'OCGN', 'VXRT', 'INO', 'NVAX', 'SRNE', 'ATOS', 'CTRM',

    # SPACs and De-SPACs (common pump targets)
    'PSTH', 'CCIV', 'IPOE', 'SOFI', 'IPOF', 'PSFE', 'UWMC',
]

print(f"Adding {len(additional_volatile_stocks)} additional volatile stocks...")

# Validate and add to universe
additional_metadata = universe_builder.validate_tickers_with_yfinance(additional_volatile_stocks)
additional_filtered = universe_builder.filter_universe(additional_metadata)
additional_filtered['source'] = 'volatile_smallcap'
additional_filtered['is_confirmed_manipulation'] = False

# Combine with main universe
universe_df = pd.concat([universe_df, additional_filtered], ignore_index=True)
universe_df = universe_df.drop_duplicates(subset=['ticker'], keep='first')

print(f"\nExpanded universe size: {len(universe_df)} tickers")

Adding 53 additional volatile stocks...
Validating 53 tickers in 2 batches...


Validating tickers:   0%|          | 0/2 [00:00<?, ?it/s]

ERROR:yfinance:HTTP Error 404: {"quoteSummary":{"result":null,"error":{"code":"Not Found","description":"Quote not found for symbol: MULN"}}}



Universe Filtering Results:
  Original: 53 tickers
  Valid: 53 tickers
  After filters: 45 tickers

Expanded universe size: 59 tickers


## 6. Save Outputs

In [19]:
# =============================================================================
# SAVE OUTPUTS
# =============================================================================

def save_outputs(universe_df: pd.DataFrame,
                 enforcement_df: pd.DataFrame,
                 ticker_labels: pd.DataFrame,
                 output_dir: str):
    """Save all outputs from Notebook 1."""

    os.makedirs(output_dir, exist_ok=True)

    # Save universe
    universe_path = os.path.join(output_dir, 'stock_universe.parquet')
    universe_df.to_parquet(universe_path, index=False)
    print(f"Saved universe: {universe_path}")

    # Save as CSV for inspection
    universe_csv = os.path.join(output_dir, 'stock_universe.csv')
    universe_df.to_csv(universe_csv, index=False)
    print(f"Saved universe CSV: {universe_csv}")

    # Save SEC enforcement cases
    if len(enforcement_df) > 0:
        enforcement_path = os.path.join(output_dir, 'sec_enforcement_cases.parquet')
        enforcement_df.to_parquet(enforcement_path, index=False)
        print(f"Saved enforcement cases: {enforcement_path}")

    # Save ticker labels (ground truth)
    labels_path = os.path.join(output_dir, 'ticker_manipulation_labels.parquet')
    ticker_labels.to_parquet(labels_path, index=False)
    print(f"Saved ticker labels: {labels_path}")

    # Save summary statistics
    summary = {
        'universe_size': len(universe_df),
        'confirmed_manipulation_tickers': int(universe_df['is_confirmed_manipulation'].sum()),
        'sec_enforcement_cases': len(enforcement_df) if len(enforcement_df) > 0 else 0,
        'unique_labeled_tickers': ticker_labels['ticker'].nunique(),
        'sources': universe_df['source'].value_counts().to_dict(),
        'created_at': datetime.now().isoformat(),
        'config': {
            'start_date': config.START_DATE,
            'end_date': config.END_DATE,
            'max_market_cap': config.MAX_MARKET_CAP,
            'max_price': config.MAX_PRICE
        }
    }

    summary_path = os.path.join(output_dir, 'notebook01_summary.json')
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2)
    print(f"Saved summary: {summary_path}")

    return summary


# Save all outputs
summary = save_outputs(
    universe_df=universe_df,
    enforcement_df=enforcement_df if 'enforcement_df' in dir() and len(enforcement_df) > 0 else pd.DataFrame(),
    ticker_labels=ticker_labels,
    output_dir=config.PROCESSED_DATA_PATH
)

print("\n" + "="*60)
print("Summary:")
print(json.dumps(summary, indent=2))

Saved universe: /content/drive/MyDrive/Research/PumpDump/data/processed/stock_universe.parquet
Saved universe CSV: /content/drive/MyDrive/Research/PumpDump/data/processed/stock_universe.csv
Saved ticker labels: /content/drive/MyDrive/Research/PumpDump/data/processed/ticker_manipulation_labels.parquet
Saved summary: /content/drive/MyDrive/Research/PumpDump/data/processed/notebook01_summary.json

Summary:
{
  "universe_size": 59,
  "confirmed_manipulation_tickers": 3,
  "sec_enforcement_cases": 0,
  "unique_labeled_tickers": 3,
  "sources": {
    "volatile_smallcap": 45,
    "known_meme_stock": 11,
    "sec_enforcement": 3
  },
  "created_at": "2025-12-12T06:57:18.129001",
  "config": {
    "start_date": "2019-01-01",
    "end_date": "2025-12-31",
    "max_market_cap": 500000000,
    "max_price": 10.0
  }
}


## 7. Summary and Next Steps

In [20]:
# =============================================================================
# NOTEBOOK 1 SUMMARY
# =============================================================================

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║         NOTEBOOK 1: UNIVERSE CONSTRUCTION & SEC SCRAPING COMPLETE            ║
╚══════════════════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• stock_universe.parquet          - Complete ticker universe with metadata
• stock_universe.csv              - CSV for inspection
• sec_enforcement_cases.parquet   - SEC litigation releases (manipulation cases)
• ticker_manipulation_labels.parquet - Ground truth labels (ticker, date, label)
• notebook01_summary.json         - Summary statistics

UNIVERSE COMPOSITION:
─────────────────────
• SEC enforcement tickers (confirmed manipulation)
• Known meme stocks (potential manipulation)
• High-volatility small caps (control group candidates)

GROUND TRUTH LABELS:
────────────────────
• Label 1: Ticker + date range from SEC enforcement action
• Label 0: To be assigned in Notebook 4 (high-volatility without enforcement)

NEXT STEPS:
───────────
→ Notebook 2: Yahoo Finance Market Data Collection
  - Scrape daily OHLCV data for universe
  - Compute baseline statistics
  - Identify price-volume anomalies

IMPORTANT NOTES:
────────────────
1. SEC scraping respects rate limits - may take 1-2 hours
2. Some tickers may be delisted - handle gracefully in downstream analysis
3. Ground truth is incomplete - SEC enforcement is tip of iceberg
4. Use PLS (Pump Likelihood Score) as continuous proxy in final analysis

""")


╔══════════════════════════════════════════════════════════════════════════════╗
║         NOTEBOOK 1: UNIVERSE CONSTRUCTION & SEC SCRAPING COMPLETE            ║
╚══════════════════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• stock_universe.parquet          - Complete ticker universe with metadata
• stock_universe.csv              - CSV for inspection
• sec_enforcement_cases.parquet   - SEC litigation releases (manipulation cases)
• ticker_manipulation_labels.parquet - Ground truth labels (ticker, date, label)
• notebook01_summary.json         - Summary statistics

UNIVERSE COMPOSITION:
─────────────────────
• SEC enforcement tickers (confirmed manipulation)
• Known meme stocks (potential manipulation)
• High-volatility small caps (control group candidates)

GROUND TRUTH LABELS:
────────────────────
• Label 1: Ticker + date range from SEC enforcement action
• Label 0: To be assigned in Notebook 4 (high-volatility without enforcement)

NE

In [21]:
# =============================================================================
# ENVIRONMENT INFO FOR REPRODUCIBILITY
# =============================================================================

import sys
import platform

print("Environment Information:")
print(f"  Python: {sys.version}")
print(f"  Platform: {platform.platform()}")
print(f"  Pandas: {pd.__version__}")
print(f"  NumPy: {np.__version__}")
print(f"  yfinance: {yf.__version__}")
print(f"  Timestamp: {datetime.now().isoformat()}")

Environment Information:
  Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
  Platform: Linux-6.6.105+-x86_64-with-glibc2.35
  Pandas: 2.3.3
  NumPy: 2.3.5
  yfinance: 0.2.66
  Timestamp: 2025-12-12T06:57:18.158000
