# Notebook 3: Financial Data Collection
## Stock Prices, Returns, and Earnings Announcements

---

**Research Project:** Retail Sentiment, Earnings Quality, and Stock Returns

**Purpose:** Collect financial market data and earnings announcement information for the stock universe.

**Data Sources:**
- Yahoo Finance (yfinance) - Daily stock prices
- SEC EDGAR - Earnings announcements
- Alternative APIs for analyst forecasts

**Output:**
- Daily stock returns panel
- Earnings announcement events with surprise measures
- Abnormal return calculations (CAR)

---

## 1. Environment Setup

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================

!pip install yfinance==0.2.31
!pip install pandas==2.0.3
!pip install numpy==1.24.3
!pip install pandas-datareader==0.10.0
!pip install requests==2.31.0
!pip install beautifulsoup4==4.12.2
!pip install lxml==4.9.3
!pip install pyarrow==14.0.1
!pip install scipy==1.11.3
!pip install statsmodels==0.14.0
!pip install tqdm==4.66.1
!pip install sec-edgar-downloader==5.0.0

print("All packages installed successfully.")

In [None]:
# =============================================================================
# IMPORT LIBRARIES
# =============================================================================

import os
import re
import json
import time
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Tuple, Optional
from collections import defaultdict

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from tqdm.notebook import tqdm

# Financial data
import yfinance as yf
import pandas_datareader as pdr
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

print(f"Environment setup complete. Timestamp: {datetime.now()}")

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

class FinancialDataConfig:
    """Configuration for financial data collection."""
    
    # Data paths
    BASE_PATH = "/content/drive/MyDrive/Research/RetailSentiment/"
    RAW_DATA_PATH = BASE_PATH + "data/raw/"
    PROCESSED_DATA_PATH = BASE_PATH + "data/processed/"
    
    # Sample period
    START_DATE = "2017-01-01"  # Extra year for estimation window
    END_DATE = "2023-12-31"
    
    # Event study parameters
    ESTIMATION_WINDOW = 120  # Trading days for market model estimation
    ESTIMATION_GAP = 10  # Gap between estimation and event window
    
    # Market indices
    MARKET_INDEX = "SPY"  # S&P 500 ETF as market proxy
    RISK_FREE_PROXY = "^IRX"  # 13-week T-Bill rate
    
    # Fama-French factors
    USE_FF_FACTORS = True
    FF_FACTORS = ['Mkt-RF', 'SMB', 'HML', 'RF']  # 3-factor model
    
    # API limits
    YAHOO_RATE_LIMIT = 2000  # Assumed requests-per-hour budget (not an official Yahoo limit)
    BATCH_SIZE = 50  # Tickers per batch
    SLEEP_BETWEEN_BATCHES = 1.0
    
    @classmethod
    def print_config(cls):
        print("="*60)
        print("FINANCIAL DATA CONFIGURATION")
        print("="*60)
        print(f"Period: {cls.START_DATE} to {cls.END_DATE}")
        print(f"Market Index: {cls.MARKET_INDEX}")
        print(f"Estimation Window: {cls.ESTIMATION_WINDOW} days")
        print(f"Use Fama-French: {cls.USE_FF_FACTORS}")
        print("="*60)

config = FinancialDataConfig()
config.print_config()

In [None]:
# =============================================================================
# MOUNT GOOGLE DRIVE
# =============================================================================

from google.colab import drive
drive.mount('/content/drive')

os.makedirs(config.RAW_DATA_PATH, exist_ok=True)
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)
print("Data directories ready.")

## 2. Load Ticker Universe

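Load the ticker universe from the firm-day panel (`wsb_firm_day_panel.parquet`) produced by the social media data collection. If the panel is missing, the loader falls back to the current S&P 500 constituents listed on Wikipedia.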

In [None]:
# =============================================================================
# LOAD TICKER UNIVERSE
# =============================================================================

def load_ticker_universe(data_path: str) -> List[str]:
    """Load ticker universe from firm-day panel.
    
    Args:
        data_path: Path to processed data
        
    Returns:
        List of unique tickers
    """
    # Load firm-day panel
    filepath = os.path.join(data_path, 'wsb_firm_day_panel.parquet')
    
    if os.path.exists(filepath):
        df = pd.read_parquet(filepath)
        tickers = df['ticker'].unique().tolist()
        print(f"Loaded {len(tickers)} tickers from firm-day panel")
    else:
        # Fallback: Load S&P 500
        print("Firm-day panel not found. Loading S&P 500 tickers...")
        tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
        sp500 = tables[0]
        # Literal replacement: Yahoo uses '-' for share classes (BRK.B -> BRK-B)
        tickers = sp500['Symbol'].str.replace('.', '-', regex=False).tolist()
        print(f"Loaded {len(tickers)} S&P 500 tickers")
    
    return sorted(tickers)

# Load tickers
tickers = load_ticker_universe(config.PROCESSED_DATA_PATH)
print(f"\nSample tickers: {tickers[:10]}")
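
The S&P 500 fallback replaces `.` with `-` because Yahoo Finance writes share-class tickers with a hyphen (`BRK-B`), while the Wikipedia table uses a period (`BRK.B`). A minimal sketch of that normalization follows. The helper `to_yahoo_symbol` is hypothetical and not part of the notebook.

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Convert a share-class ticker such as 'BRK.B' to Yahoo's 'BRK-B' form.

    Hypothetical helper for illustration. Python's str.replace is a literal
    (non-regex) replacement, so '.' is not treated as a wildcard.
    """
    return symbol.strip().upper().replace(".", "-")


# Hypothetical raw symbols with mixed case and stray whitespace
raw_symbols = ["BRK.B", "bf.b", " AAPL "]
yahoo_symbols = sorted({to_yahoo_symbol(s) for s in raw_symbols})
print(yahoo_symbols)
```

Deduplicating through a set before sorting mirrors the `unique()` plus `sorted()` steps in `load_ticker_universe`.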

## 3. Stock Price Data Collection

### 3.1 Yahoo Finance Price Collector

In [None]:
# =============================================================================
# YAHOO FINANCE DATA COLLECTOR
# =============================================================================

class YahooFinanceCollector:
    """Collects stock price data from Yahoo Finance.
    
    Features:
    - Batch downloading for efficiency
    - Failed tickers recorded in `failed_tickers` (no automatic retry)
    - Prices adjusted for splits/dividends (auto_adjust=True)
    """
    
    def __init__(self, config: FinancialDataConfig):
        self.config = config
        self.failed_tickers = []
        
    def download_prices(
        self,
        tickers: List[str],
        start_date: str,
        end_date: str
    ) -> pd.DataFrame:
        """Download daily price data for multiple tickers.
        
        Args:
            tickers: List of ticker symbols
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
            
        Returns:
            DataFrame with daily OHLCV data
        """
        print(f"Downloading prices for {len(tickers)} tickers...")
        print(f"Period: {start_date} to {end_date}")
        
        all_data = []
        self.failed_tickers = []
        
        # Process in batches
        batches = [tickers[i:i+self.config.BATCH_SIZE] 
                   for i in range(0, len(tickers), self.config.BATCH_SIZE)]
        
        for batch in tqdm(batches, desc="Downloading batches"):
            try:
                # Download batch
                data = yf.download(
                    batch,
                    start=start_date,
                    end=end_date,
                    auto_adjust=True,  # Adjust for splits/dividends
                    progress=False,
                    threads=True
                )
                
                if len(batch) == 1:
                    # Single ticker returns different format
                    data.columns = pd.MultiIndex.from_product(
                        [data.columns, batch]
                    )
                
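                # Reshape to long format
                for ticker in batch:
                    if ticker not in data.columns.get_level_values(1):
                        self.failed_tickers.append(ticker)
                        continue
                    # Failed downloads can appear as all-NaN columns
                    ticker_data = data.xs(ticker, level=1, axis=1).dropna(how='all')
                    if ticker_data.empty:
                        self.failed_tickers.append(ticker)
                        continue
                    # Rename by name, not position: multi-ticker frames may
                    # order the price fields alphabetically (Close, High, ...)
                    ticker_data = ticker_data.rename_axis('date').reset_index()
                    ticker_data.columns = [str(c).lower() for c in ticker_data.columns]
                    ticker_data['ticker'] = ticker
                    all_data.append(
                        ticker_data[['date', 'open', 'high', 'low',
                                     'close', 'volume', 'ticker']]
                    )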
                        
            except Exception as e:
                print(f"Error downloading batch: {e}")
                self.failed_tickers.extend(batch)
            
            # Rate limiting
            time.sleep(self.config.SLEEP_BETWEEN_BATCHES)
        
        # Combine all data
        if all_data:
            df = pd.concat(all_data, ignore_index=True)
            df['date'] = pd.to_datetime(df['date'])
            
            print(f"\nDownload complete:")
            print(f"  Total observations: {len(df):,}")
            print(f"  Tickers downloaded: {df['ticker'].nunique()}")
            print(f"  Failed tickers: {len(self.failed_tickers)}")
            
            return df
        else:
            print("No data downloaded!")
            return pd.DataFrame()
    
    def download_market_data(
        self,
        start_date: str,
        end_date: str
    ) -> pd.DataFrame:
        """Download market index data.
        
        Returns:
            DataFrame with market returns
        """
        print(f"Downloading market data ({self.config.MARKET_INDEX})...")
        
        market = yf.download(
            self.config.MARKET_INDEX,
            start=start_date,
            end=end_date,
            auto_adjust=True,
            progress=False
        )
        
        market = market.rename_axis('date').reset_index()
        # Rename by name rather than position to stay robust to column order
        market.columns = [str(c).lower() for c in market.columns]
        market['market_return'] = market['close'].pct_change()
        
        print(f"Market data: {len(market)} observations")
        return market[['date', 'close', 'market_return']].rename(
            columns={'close': 'market_price'}
        )

# Initialize collector
price_collector = YahooFinanceCollector(config)
print("Yahoo Finance collector initialized")
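
The collector records failed tickers in `failed_tickers` but does not retry them. One possible way to add retries is a small exponential-backoff wrapper around each batch download. The sketch below is an assumption, not the notebook's method: `with_retries`, `max_attempts`, `base_delay`, and the simulated `flaky_download` are all hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], object] = time.sleep) -> T:
    """Call func(), retrying with exponential backoff on any exception.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


# Simulated download that fails twice before succeeding
calls = {"n": 0}
delays: List[float] = []


def flaky_download() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


# Record the delays instead of actually sleeping
result = with_retries(flaky_download, max_attempts=3, base_delay=1.0,
                      sleep=delays.append)
print(result, calls["n"], delays)
```

In the collector, `func` could be a lambda wrapping the `yf.download(...)` call for one batch, so that a batch is added to `failed_tickers` only after the final attempt fails.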

In [None]:
# =============================================================================
# DOWNLOAD PRICE DATA
# =============================================================================

# Download stock prices
stock_prices = price_collector.download_prices(
    tickers,
    config.START_DATE,
    config.END_DATE
)

# Download market data
market_data = price_collector.download_market_data(
    config.START_DATE,
    config.END_DATE
)

# Save checkpoint
stock_prices.to_parquet(
    os.path.join(config.RAW_DATA_PATH, 'stock_prices_raw.parquet'),
    index=False
)
print("Price data saved.")

### 3.2 Calculate Returns

In [None]:
# =============================================================================
# RETURN CALCULATIONS
# =============================================================================

class ReturnCalculator:
    """Calculates various return measures.
    
    Following standard event study methodology:
    - Simple returns
    - Log returns
    - Market-adjusted returns
    - Risk-adjusted returns (market model)
    """
    
    def __init__(self):
        pass
    
    def calculate_returns(self, prices: pd.DataFrame) -> pd.DataFrame:
        """Calculate daily returns from price data.
        
        Args:
            prices: DataFrame with columns [date, ticker, close]
            
        Returns:
            DataFrame with return columns added
        """
        print("Calculating returns...")
        df = prices.sort_values(['ticker', 'date']).copy()
        
        # Simple return
        df['ret'] = df.groupby('ticker')['close'].pct_change()
        
        # Log return
        df['ret_log'] = np.log(df['close'] / df.groupby('ticker')['close'].shift(1))
        
        # Lagged prices for return calculations
        df['price_lag1'] = df.groupby('ticker')['close'].shift(1)
        df['price_lag2'] = df.groupby('ticker')['close'].shift(2)
        
        # Trading volume in dollars
        df['dollar_volume'] = df['close'] * df['volume']
        
        # Volatility (rolling 20-day)
        df['volatility_20d'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(window=20, min_periods=10).std() * np.sqrt(252)
        )
        
        print(f"Returns calculated for {df['ticker'].nunique()} tickers")
        return df
    
    def merge_market_returns(self, stock_returns: pd.DataFrame,
                             market_data: pd.DataFrame) -> pd.DataFrame:
        """Merge market returns with stock returns.
        
        Args:
            stock_returns: Stock return DataFrame
            market_data: Market index DataFrame
            
        Returns:
            Merged DataFrame
        """
        print("Merging with market data...")
        
        df = stock_returns.merge(
            market_data[['date', 'market_return']],
            on='date',
            how='left'
        )
        
        # Market-adjusted return
        df['ret_mktadj'] = df['ret'] - df['market_return']
        
        return df
    
    def estimate_market_model(
        self,
        df: pd.DataFrame,
        estimation_window: int = 120,
        min_observations: int = 60
    ) -> pd.DataFrame:
        """Estimate market model parameters for each ticker.
        
        Model: R_it = alpha_i + beta_i * R_mt + epsilon_it
        
        Args:
            df: Stock returns DataFrame with market returns
            estimation_window: Days for estimation
            min_observations: Minimum required observations
            
        Returns:
            DataFrame with model parameters
        """
        print(f"Estimating market model (window={estimation_window} days)...")
        
        params_list = []
        
        for ticker in tqdm(df['ticker'].unique(), desc="Estimating"):
            ticker_data = df[df['ticker'] == ticker].sort_values('date')
            
            if len(ticker_data) < min_observations:
                continue
            
            # Use rolling estimation
            for i in range(estimation_window, len(ticker_data)):
                window_data = ticker_data.iloc[i-estimation_window:i]
                
                # Clean data
                clean_data = window_data[['ret', 'market_return']].dropna()
                
                if len(clean_data) < min_observations:
                    continue
                
                # Estimate market model
                X = sm.add_constant(clean_data['market_return'])
                y = clean_data['ret']
                
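                try:
                    model = sm.OLS(y, X).fit()
                    
                    # Look up coefficients by label rather than position
                    params_list.append({
                        'ticker': ticker,
                        'date': ticker_data.iloc[i]['date'],
                        'alpha': model.params['const'],
                        'beta': model.params['market_return'],
                        'r_squared': model.rsquared,
                        'residual_std': model.resid.std()
                    })
                except Exception:
                    # Skip windows where the regression cannot be estimated
                    continue
        
        # Fix the column set so an empty result still has a 'ticker' column
        params_df = pd.DataFrame(
            params_list,
            columns=['ticker', 'date', 'alpha', 'beta', 'r_squared', 'residual_std']
        )
        print(f"Model parameters estimated for {params_df['ticker'].nunique()} tickers")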
        
        return params_df

# Initialize calculator
return_calc = ReturnCalculator()

# Calculate returns
stock_returns = return_calc.calculate_returns(stock_prices)
stock_returns = return_calc.merge_market_returns(stock_returns, market_data)
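
As a quick check on the return definitions used in `ReturnCalculator`, the sketch below computes simple, log, and market-adjusted returns for a hypothetical three-day price series. The prices and market returns are made up for illustration.

```python
import numpy as np

# Hypothetical closing prices for one ticker over three days
close = np.array([100.0, 110.0, 99.0])

simple_ret = close[1:] / close[:-1] - 1    # same definition as pct_change()
log_ret = np.log(close[1:] / close[:-1])   # same definition as ret_log

# Simple returns compound multiplicatively; log returns add up
total_simple = np.prod(1 + simple_ret) - 1
total_log = log_ret.sum()

# Market-adjusted return: stock return minus same-day market return
market_ret = np.array([0.02, -0.03])       # hypothetical market returns
ret_mktadj = simple_ret - market_ret

print(simple_ret, total_simple, np.expm1(total_log), ret_mktadj)
```

Both aggregation routes recover the same two-day return of about -1%, which is why summed log returns and compounded simple returns can be used interchangeably for holding-period returns.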

In [None]:
# =============================================================================
# ESTIMATE MARKET MODEL (Optional - computationally intensive)
# =============================================================================

# Note: This is computationally intensive. Run if needed for CAR calculations.
# For large samples, consider estimating only around event windows.

ESTIMATE_MARKET_MODEL = False  # Set to True if needed

if ESTIMATE_MARKET_MODEL:
    market_model_params = return_calc.estimate_market_model(
        stock_returns,
        estimation_window=config.ESTIMATION_WINDOW,
        min_observations=60
    )
    
    # Save parameters
    market_model_params.to_parquet(
        os.path.join(config.PROCESSED_DATA_PATH, 'market_model_params.parquet'),
        index=False
    )
else:
    print("Skipping market model estimation (set ESTIMATE_MARKET_MODEL=True to run)")
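
The note above suggests estimating the market model only around event windows, and the configuration defines `ESTIMATION_GAP` for the spacing between the estimation and event windows. A minimal sketch of that approach on synthetic data follows. The function `estimate_market_model_for_event` and the simulated return series are assumptions for illustration, and the regression uses `numpy.linalg.lstsq` rather than the `statsmodels` call in the notebook.

```python
import numpy as np


def estimate_market_model_for_event(stock_ret, market_ret, event_idx,
                                    estimation_window=120, gap=10):
    """Estimate alpha and beta on the window ending `gap` days before the event.

    Hypothetical helper: the estimation window covers positions
    [event_idx - gap - estimation_window, event_idx - gap).
    """
    end = event_idx - gap
    start = end - estimation_window
    if start < 0:
        raise ValueError("not enough history before the event")
    x = np.asarray(market_ret[start:end], dtype=float)
    y = np.asarray(stock_ret[start:end], dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # intercept + market return
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(alpha), float(beta)


# Synthetic, noise-free data with known parameters alpha=0.001, beta=1.5
rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, size=200)
stock = 0.001 + 1.5 * market
event_idx = 150
stock[event_idx] += 0.05                           # simulated event-day shock

alpha_hat, beta_hat = estimate_market_model_for_event(stock, market, event_idx)

# Abnormal return on the event day: AR = R - (alpha + beta * R_m)
abnormal_return = stock[event_idx] - (alpha_hat + beta_hat * market[event_idx])
print(alpha_hat, beta_hat, abnormal_return)
```

Because the shock sits outside the estimation window, it does not contaminate the parameter estimates and shows up entirely in the abnormal return.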

### 3.3 Download Fama-French Factors

In [None]:
# =============================================================================
# FAMA-FRENCH FACTORS
# =============================================================================

class FamaFrenchLoader:
    """Loads Fama-French factor data from Ken French's data library."""
    
    FF_URL = "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
    
    def __init__(self):
        pass
    
    def load_factors(self, start_date: str, end_date: str,
                    model: str = '3factor') -> pd.DataFrame:
        """Load Fama-French factors.
        
        Args:
            start_date: Start date
            end_date: End date
            model: '3factor' or '5factor'
            
        Returns:
            DataFrame with daily factor returns
        """
        print(f"Loading Fama-French {model} factors...")
        
        try:
            # Use pandas_datareader to fetch FF factors
            if model == '3factor':
                ff = pdr.get_data_famafrench(
                    'F-F_Research_Data_Factors_daily',
                    start=start_date,
                    end=end_date
                )[0]
            else:
                ff = pdr.get_data_famafrench(
                    'F-F_Research_Data_5_Factors_2x3_daily',
                    start=start_date,
                    end=end_date
                )[0]
            
            # Convert to proper format
            ff = ff.reset_index()
            ff.columns = ff.columns.str.strip()
            ff = ff.rename(columns={'Date': 'date'})
            
            # Convert percentages to decimals
            for col in ff.columns:
                if col != 'date':
                    ff[col] = ff[col] / 100
            
            ff['date'] = pd.to_datetime(ff['date'])
            
            print(f"Loaded {len(ff)} observations")
            print(f"Columns: {list(ff.columns)}")
            
            return ff
            
        except Exception as e:
            print(f"Error loading FF factors: {e}")
            print("Attempting alternative method...")
            return self._load_from_url(start_date, end_date, model)
    
    def _load_from_url(self, start_date: str, end_date: str,
                      model: str) -> pd.DataFrame:
        """Fallback method to load from URL directly."""
        # This is a fallback - implement if pandas_datareader fails
        print("URL fallback not implemented - returning empty DataFrame")
        return pd.DataFrame()

# Load Fama-French factors
ff_loader = FamaFrenchLoader()
ff_factors = ff_loader.load_factors(
    config.START_DATE,
    config.END_DATE,
    model='3factor'
)

# Merge with stock returns
if len(ff_factors) > 0:
    stock_returns = stock_returns.merge(
        ff_factors,
        on='date',
        how='left'
    )
    print("Fama-French factors merged.")
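
The loader divides the factor columns by 100 because Ken French's data library reports returns in percent. The sketch below shows the conversion and the merge on a hypothetical two-day sample, then derives an excess return from `RF`. All values, the ticker `AAA`, and the column name `ret_excess` are made up for illustration.

```python
import pandas as pd

# Hypothetical factor rows in percent, as published in the data library
ff = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-01-04"]),
    "Mkt-RF": [-0.50, 0.80],
    "RF": [0.02, 0.02],
})
factor_cols = [c for c in ff.columns if c != "date"]
ff[factor_cols] = ff[factor_cols] / 100            # percent -> decimal

# Hypothetical stock returns in decimal form
stocks = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-01-04"]),
    "ticker": ["AAA", "AAA"],
    "ret": [0.0102, -0.0048],
})

# Left merge keeps every stock-day, even when a factor row is missing
merged = stocks.merge(ff, on="date", how="left")
merged["ret_excess"] = merged["ret"] - merged["RF"]
print(merged)
```

Keeping stock returns and factors in the same units matters here: subtracting an unconverted `RF` of 0.02 would overstate the daily risk-free rate by a factor of 100.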

## 4. Earnings Announcement Data

### 4.1 Collect Earnings Dates and Actual EPS

In [None]:
# =============================================================================
# EARNINGS DATA COLLECTOR
# =============================================================================

class EarningsDataCollector:
    """Collects earnings announcement data.
    
    Implemented source:
    - Yahoo Finance earnings calendar and quarterly financials
    
    Listed as project data sources but not implemented in this class:
    - SEC EDGAR filings
    - Alternative APIs
    """
    
    def __init__(self):
        self.earnings_data = []
        
    def get_earnings_yahoo(self, tickers: List[str]) -> pd.DataFrame:
        """Get earnings data from Yahoo Finance.
        
        Args:
            tickers: List of ticker symbols
            
        Returns:
            DataFrame with earnings announcements
        """
        print(f"Collecting earnings data for {len(tickers)} tickers...")
        earnings_list = []
        
        for ticker in tqdm(tickers, desc="Fetching earnings"):
            try:
                stock = yf.Ticker(ticker)
                
                # Get earnings dates (announcement timestamps plus, where
                # Yahoo provides them, estimated and reported EPS)
                earnings = stock.earnings_dates
                
                if earnings is not None and len(earnings) > 0:
                    earnings_df = earnings.reset_index()
                    earnings_df['ticker'] = ticker
                    earnings_list.append(earnings_df)
                    
            except Exception as e:
                # Report the failure instead of skipping silently
                print(f"Could not fetch earnings for {ticker}: {e}")
            
            time.sleep(0.1)  # Rate limiting
        
        if earnings_list:
            df = pd.concat(earnings_list, ignore_index=True)
            print(f"Collected earnings for {df['ticker'].nunique()} tickers")
            return df
        else:
            return pd.DataFrame()
    
    def get_earnings_history(self, tickers: List[str]) -> pd.DataFrame:
        """Get quarterly net income and an approximate EPS per ticker.
        
        Args:
            tickers: List of ticker symbols
            
        Returns:
            DataFrame with historical earnings
        """
        print(f"Collecting earnings history...")
        earnings_list = []
        
        for ticker in tqdm(tickers, desc="Fetching earnings history"):
            try:
                stock = yf.Ticker(ticker)
                
                # Quarterly financials
                quarterly = stock.quarterly_financials
                if quarterly is not None and not quarterly.empty:
                    # Get relevant rows
                    if 'Net Income' in quarterly.index:
                        net_income = quarterly.loc['Net Income']
                        
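                        # Get shares outstanding for EPS calculation.
                        # Note: the current share count is applied to every
                        # historical quarter, so eps_calculated is approximate.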
                        info = stock.info
                        shares = info.get('sharesOutstanding', None)
                        
                        for date, ni in net_income.items():
                            earnings_list.append({
                                'ticker': ticker,
                                'period_end': date,
                                'net_income': ni,
                                'shares_outstanding': shares,
                                'eps_calculated': ni / shares if shares else None
                            })
                            
            except Exception as e:
                continue
            
            time.sleep(0.1)
        
        if earnings_list:
            df = pd.DataFrame(earnings_list)
            df['period_end'] = pd.to_datetime(df['period_end'])
            return df
        else:
            return pd.DataFrame()

# Initialize collector
earnings_collector = EarningsDataCollector()

In [None]:
# =============================================================================
# COLLECT EARNINGS DATA
# =============================================================================

# Collect earnings dates
earnings_dates = earnings_collector.get_earnings_yahoo(tickers[:100])  # Subset for demo

if len(earnings_dates) > 0:
    print(f"\nEarnings data collected:")
    print(earnings_dates.head())
    
    # Save
    earnings_dates.to_parquet(
        os.path.join(config.RAW_DATA_PATH, 'earnings_dates_raw.parquet'),
        index=False
    )

### 4.2 Calculate Earnings Surprise

In [None]:
# =============================================================================
# EARNINGS SURPRISE CALCULATION
# =============================================================================

class EarningsSurpriseCalculator:
    """Calculates earnings surprise measures.
    
    Methods:
    1. Analyst forecast-based: (Actual - Consensus) / Price
    2. Seasonal random walk: (EPS_t - EPS_{t-4}) / Price
    3. Time-series model: AR(1) residuals
    """
    
    def __init__(self):
        pass
    
    def seasonal_random_walk_surprise(
        self,
        earnings: pd.DataFrame,
        prices: pd.DataFrame
    ) -> pd.DataFrame:
        """Calculate earnings surprise using seasonal random walk.
        
        SUE = (EPS_q - EPS_{q-4}) / Price_{t-2}
        
        This is a common proxy when analyst forecasts are unavailable.
        
        Args:
            earnings: DataFrame with [ticker, period_end, eps_calculated]
            prices: DataFrame with [ticker, date, close] (not used yet)
            
        Returns:
            DataFrame with the unscaled seasonal EPS change (eps_change)
        """
        print("Calculating seasonal random walk surprise...")
        
        # Sort and compute lagged EPS
        df = earnings.sort_values(['ticker', 'period_end']).copy()
        df['eps_lag4'] = df.groupby('ticker')['eps_calculated'].shift(4)
        
        # EPS change
        df['eps_change'] = df['eps_calculated'] - df['eps_lag4']
        
        # TODO: scale eps_change by the price 2 trading days before the
        # announcement. This needs an announcement-date mapping, so the
        # method currently returns the unscaled change.
        
        return df
    
    def standardized_unexpected_earnings(
        self,
        earnings: pd.DataFrame
    ) -> pd.DataFrame:
        """Calculate Standardized Unexpected Earnings (SUE).
        
        SUE = (EPS_t - E[EPS_t]) / σ(EPS)
        
        Where expectation is based on seasonal random walk.
        """
        df = earnings.sort_values(['ticker', 'period_end']).copy()
        
        # Expected EPS (same quarter last year)
        df['eps_expected'] = df.groupby('ticker')['eps_calculated'].shift(4)
        
        # Forecast error
        df['forecast_error'] = df['eps_calculated'] - df['eps_expected']
        
        # Rolling standard deviation of forecast errors
        df['forecast_error_std'] = df.groupby('ticker')['forecast_error'].transform(
            lambda x: x.rolling(window=8, min_periods=4).std()
        )
        
        # SUE (a zero standard deviation yields NaN rather than +/-inf)
        df['SUE'] = df['forecast_error'] / df['forecast_error_std'].replace(0, np.nan)
        
        return df

# Initialize calculator
surprise_calc = EarningsSurpriseCalculator()
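
To make the SUE definition concrete, the sketch below reproduces the same steps as `standardized_unexpected_earnings` on eight hypothetical quarters for a single made-up ticker `AAA`. With a four-quarter lag and `min_periods=4`, the first SUE value appears only in the eighth quarter.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly EPS for two fiscal years
df = pd.DataFrame({
    "ticker": "AAA",
    "period_end": pd.to_datetime([
        "2021-03-31", "2021-06-30", "2021-09-30", "2021-12-31",
        "2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31",
    ]),
    "eps_calculated": [1.00, 1.10, 0.90, 1.20, 1.10, 1.15, 1.00, 1.40],
})
df = df.sort_values(["ticker", "period_end"])

# Seasonal random walk: expected EPS is the same quarter of the prior year
df["eps_expected"] = df.groupby("ticker")["eps_calculated"].shift(4)
df["forecast_error"] = df["eps_calculated"] - df["eps_expected"]

# Rolling sample standard deviation of forecast errors (needs 4 values)
df["forecast_error_std"] = df.groupby("ticker")["forecast_error"].transform(
    lambda x: x.rolling(window=8, min_periods=4).std()
)
df["SUE"] = df["forecast_error"] / df["forecast_error_std"]

print(df[["period_end", "forecast_error", "forecast_error_std", "SUE"]])
```

Note that the rolling window includes the current quarter's forecast error. Some event-study designs instead scale by the dispersion of prior quarters only, which would require an extra `shift(1)` on the standard deviation.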

## 5. Cumulative Abnormal Returns (CAR)

### 5.1 Event Study Framework

In [None]:
# =============================================================================
# CUMULATIVE ABNORMAL RETURN CALCULATOR
# =============================================================================

class CARCalculator:
    """Calculates Cumulative Abnormal Returns for event studies.
    
    Implemented methodology:
    1. Market-adjusted returns: AR = R - R_m
    
    Not implemented in this class:
    2. Market model: AR = R - (alpha + beta * R_m)
    3. Fama-French 3-factor model
    """
    
    def __init__(self, stock_returns: pd.DataFrame):
        """Initialize with stock returns data.
        
        Args:
            stock_returns: DataFrame with daily returns
        """
        self.returns = stock_returns.copy()
        self.returns['date'] = pd.to_datetime(self.returns['date'])
        
    def get_trading_days_around_event(
        self,
        ticker: str,
        event_date: datetime,
        window_start: int,
        window_end: int
    ) -> pd.DataFrame:
        """Get trading days around an event date.
        
        Args:
            ticker: Stock ticker
            event_date: Event date
            window_start: Start of window (negative = before)
            window_end: End of window
            
        Returns:
            DataFrame with trading days in window
        """
        ticker_data = self.returns[self.returns['ticker'] == ticker].copy()
        ticker_data = ticker_data.sort_values('date')
        
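        # Find event date index (first trading day on or after the event date)
        dates = ticker_data['date'].values
        event_idx = np.searchsorted(dates, np.datetime64(event_date))
        
        # No trading day on or after the event date: return an empty window
        if event_idx >= len(dates):
            return ticker_data.iloc[0:0].copy()
        
        # Get window indices (windows are truncated at the sample boundaries;
        # clamping end_idx at 0 prevents a negative index from wrapping around)
        start_idx = max(0, event_idx + window_start)
        end_idx = max(0, min(len(ticker_data), event_idx + window_end + 1))
        
        return ticker_data.iloc[start_idx:end_idx].copy()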
    
    def calculate_car_market_adjusted(
        self,
        ticker: str,
        event_date: datetime,
        window_start: int = -1,
        window_end: int = 1
    ) -> Dict:
        """Calculate market-adjusted CAR.
        
        CAR = Σ (R_it - R_mt) for t in [window_start, window_end]
        
        Args:
            ticker: Stock ticker
            event_date: Event date
            window_start: Start of event window
            window_end: End of event window
            
        Returns:
            Dictionary with CAR and related statistics
        """
        window_data = self.get_trading_days_around_event(
            ticker, event_date, window_start, window_end
        )
        
        if len(window_data) == 0:
            return {
                'ticker': ticker,
                'event_date': event_date,
                'car': np.nan,
                'window': f'[{window_start},{window_end}]',
                'n_days': 0
            }
        
        # Calculate abnormal returns
        window_data['ar'] = window_data['ret'] - window_data['market_return']
        
        # Cumulative abnormal return (NaN if every AR in the window is missing)
        car = window_data['ar'].sum(min_count=1)
        
        return {
            'ticker': ticker,
            'event_date': event_date,
            'car': car,
            'window': f'[{window_start},{window_end}]',
            'n_days': len(window_data),
            'avg_ar': window_data['ar'].mean(),
            'raw_return': window_data['ret'].sum()
        }
    
    def calculate_cars_for_events(
        self,
        events: pd.DataFrame,
        windows: Optional[List[Tuple[int, int]]] = None
    ) -> pd.DataFrame:
        """Calculate CARs for multiple events and windows.
        
        Args:
            events: DataFrame with [ticker, event_date]
            windows: List of (start, end) tuples;
                defaults to [(-1, 1), (0, 2), (2, 20)]
            
        Returns:
            DataFrame with one row per event and one CAR column per window
        """
        # Avoid a mutable default argument
        if windows is None:
            windows = [(-1, 1), (0, 2), (2, 20)]
        
        print(f"Calculating CARs for {len(events)} events...")
        print(f"Windows: {windows}")
        
        results = []
        
        for _, event in tqdm(events.iterrows(), total=len(events), desc="Processing events"):
            ticker = event['ticker']
            event_date = pd.to_datetime(event['event_date'])
            
            event_results = {'ticker': ticker, 'event_date': event_date}
            
            for window_start, window_end in windows:
                car_result = self.calculate_car_market_adjusted(
                    ticker, event_date, window_start, window_end
                )
                window_name = f'CAR_{window_start}_{window_end}'
                event_results[window_name] = car_result['car']
            
            results.append(event_results)
        
        df = pd.DataFrame(results)
        print(f"CAR calculations complete: {len(df)} events")
        
        return df

# Initialize CAR calculator
car_calculator = CARCalculator(stock_returns)
print("CAR calculator initialized")

In [None]:
# =============================================================================
# EXAMPLE: CALCULATE CARS FOR SAMPLE EVENTS
# =============================================================================

# Create sample events (replace with actual earnings dates)
sample_events = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META'],
    'event_date': ['2023-01-26', '2023-01-24', '2023-02-02', 
                   '2023-02-02', '2023-02-01']
})
sample_events['event_date'] = pd.to_datetime(sample_events['event_date'])

# Define event windows
event_windows = [
    (-1, 1),   # EA window: CAR[-1,+1]
    (0, 2),    # Post-EA: CAR[0,+2]
    (2, 20),   # Drift: CAR[+2,+20]
    (-10, -2)  # Pre-EA: CAR[-10,-2]
]

# Calculate CARs
sample_cars = car_calculator.calculate_cars_for_events(
    sample_events,
    windows=event_windows
)

print("\nSample CAR Results:")
print(sample_cars.to_string())

## 6. Firm Characteristics

In [None]:
# =============================================================================
# FIRM CHARACTERISTICS
# =============================================================================

class FirmCharacteristicsCollector:
    """Collects firm-level characteristics for control variables.
    
    Variables:
    - Market capitalization (size)
    - Book-to-market ratio (derived from price-to-book)
    - Prior returns (momentum)
    - Volatility
    - Industry/sector
    """
    
    def __init__(self):
        pass
    
    def get_firm_info(self, tickers: List[str]) -> pd.DataFrame:
        """Get firm characteristics from Yahoo Finance.
        
        Args:
            tickers: List of ticker symbols
            
        Returns:
            DataFrame with firm characteristics
        """
        print(f"Collecting firm characteristics for {len(tickers)} tickers...")
        
        firm_data = []
        
        for ticker in tqdm(tickers, desc="Fetching firm info"):
            try:
                stock = yf.Ticker(ticker)
                info = stock.info
                
                firm_data.append({
                    'ticker': ticker,
                    'company_name': info.get('longName', ''),
                    'sector': info.get('sector', ''),
                    'industry': info.get('industry', ''),
                    'market_cap': info.get('marketCap', np.nan),
                    'enterprise_value': info.get('enterpriseValue', np.nan),
                    'book_value': info.get('bookValue', np.nan),
                    'price_to_book': info.get('priceToBook', np.nan),
                    'trailing_pe': info.get('trailingPE', np.nan),
                    'forward_pe': info.get('forwardPE', np.nan),
                    'beta': info.get('beta', np.nan),
                    'avg_volume': info.get('averageVolume', np.nan),
                    'shares_outstanding': info.get('sharesOutstanding', np.nan),
                    'float_shares': info.get('floatShares', np.nan),
                    'short_ratio': info.get('shortRatio', np.nan)
                })
                
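            except Exception as e:
                # Keep the ticker so the output stays aligned with the input list,
                # but report the failure instead of swallowing it silently
                print(f"Warning: could not fetch info for {ticker}: {e}")
                firm_data.append({'ticker': ticker})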
            
            time.sleep(0.1)
        
        df = pd.DataFrame(firm_data)
        print(f"Collected info for {len(df)} firms")
        
        return df
    
    def calculate_time_varying_characteristics(
        self,
        returns: pd.DataFrame
    ) -> pd.DataFrame:
        """Calculate time-varying characteristics.
        
        Args:
            returns: Stock returns DataFrame
            
        Returns:
            DataFrame with rolling characteristics
        """
        print("Calculating time-varying characteristics...")
        df = returns.sort_values(['ticker', 'date']).copy()
        
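        # Prior returns (momentum): rolling sums of simple daily returns, which only
        # approximate compounded cumulative returns over each horizon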
        df['ret_1m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(21).sum()
        )
        df['ret_3m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(63).sum()
        )
        df['ret_6m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(126).sum()
        )
        df['ret_12m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(252).sum()
        )
        
        # Volatility (annualized)
        df['vol_1m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(21).std() * np.sqrt(252)
        )
        df['vol_3m'] = df.groupby('ticker')['ret'].transform(
            lambda x: x.rolling(63).std() * np.sqrt(252)
        )
        
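        # Illiquidity (Amihud): |return| per $1M of dollar volume. Zero-volume days are
        # set to NaN so the ratio does not become inf through division by zero
        dollar_volume_mm = df['dollar_volume'].replace(0, np.nan) / 1e6
        df['illiquidity'] = df['ret'].abs() / dollar_volume_mm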
        df['illiquidity_avg'] = df.groupby('ticker')['illiquidity'].transform(
            lambda x: x.rolling(21).mean()
        )
        
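        # Size proxy (NOT true market capitalization): log of close price times the
        # ticker's full-sample average share volume, scaled by 1e6. Market cap proper is
        # price x shares outstanding; `shares_outstanding` is collected in get_firm_info().
        df['log_mcap'] = np.log(df['close'] * df.groupby('ticker')['volume'].transform('mean') * 1e6)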
        
        return df

# Initialize collector
char_collector = FirmCharacteristicsCollector()

# Collect firm info (subset for demo)
firm_characteristics = char_collector.get_firm_info(tickers[:50])
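
The class docstring lists size and book-to-market, while `get_firm_info` returns `market_cap` and `price_to_book`. One way to derive the two controls from those columns is sketched below, under the assumption that book-to-market is taken as the reciprocal of price-to-book and is left missing when book equity is non-positive. The helper name `add_size_and_bm` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def add_size_and_bm(firms: pd.DataFrame) -> pd.DataFrame:
    """Add log market cap and book-to-market columns (hypothetical helper)."""
    out = firms.copy()
    # Log size is defined only for strictly positive market capitalization
    mcap = pd.to_numeric(out["market_cap"], errors="coerce")
    out["log_size"] = np.log(mcap.where(mcap > 0))
    # Book-to-market as 1 / price-to-book; non-positive ratios become NaN
    ptb = pd.to_numeric(out["price_to_book"], errors="coerce")
    out["book_to_market"] = 1.0 / ptb.where(ptb > 0)
    return out

# Toy input with one negative price-to-book and one missing market cap
toy_firms = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "market_cap": [2.0e9, 5.0e8, np.nan],
    "price_to_book": [4.0, -1.5, 2.0],
})
toy_out = add_size_and_bm(toy_firms)
```

Because `stock.info` is a snapshot taken at download time, values derived this way are cross-sectional and should not be treated as point-in-time characteristics for historical event dates.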

In [None]:
# =============================================================================
# ADD TIME-VARYING CHARACTERISTICS TO RETURNS
# =============================================================================

# Calculate time-varying characteristics
stock_returns = char_collector.calculate_time_varying_characteristics(stock_returns)

print("Time-varying characteristics added.")
print(f"Columns: {list(stock_returns.columns)}")
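
The momentum columns above are rolling sums of simple returns. If compounded cumulative returns are preferred, one possible variant sums `log(1 + r)` inside the window and converts back. This is a sketch under that assumption, not the notebook's method; `rolling_compound_return`, the two-day window, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_compound_return(ret: pd.Series, window: int) -> pd.Series:
    """Compounded return over a rolling window: prod(1 + r) - 1.
    Assumes every return is greater than -1 so that log1p is finite."""
    return np.expm1(np.log1p(ret).rolling(window).sum())

# Toy panel with two tickers and four trading days each
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "date": list(pd.bdate_range("2023-01-02", periods=4)) * 2,
    "ret": [0.10, 0.10, -0.05, 0.02, 0.01, 0.01, 0.01, 0.01],
})

# Rolling sum (as in the notebook) next to the compounded alternative
toy["ret_sum_2d"] = toy.groupby("ticker")["ret"].transform(
    lambda x: x.rolling(2).sum()
)
toy["ret_cmp_2d"] = toy.groupby("ticker")["ret"].transform(
    lambda x: rolling_compound_return(x, 2)
)
```

For two consecutive 10% days the sum gives 0.20 while compounding gives 0.21; the gap widens with longer horizons and more volatile stocks.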

## 7. Save Final Output
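
Before writing the panel to disk, a quick sanity check can catch duplicated ticker-date rows and infinite values (for example from ratios with a zero denominator). The sketch below is one hypothetical way to do this; `panel_sanity_report` is not part of the notebook, and it assumes the panel has `ticker`, `date`, and `ret` columns.

```python
import numpy as np
import pandas as pd

def panel_sanity_report(panel: pd.DataFrame, keys=("ticker", "date")) -> dict:
    """Count common data problems in a long returns panel (hypothetical helper)."""
    numeric = panel.select_dtypes(include=[np.number])
    return {
        "rows": len(panel),
        # Rows that repeat an earlier (ticker, date) key
        "duplicate_keys": int(panel.duplicated(subset=list(keys)).sum()),
        # +inf / -inf anywhere in the numeric columns
        "infinite_values": int(np.isinf(numeric.to_numpy(dtype=float)).sum()),
        # Missing daily returns
        "missing_ret": int(panel["ret"].isna().sum()),
    }

# Toy panel with one duplicated key, one inf, and one missing return
toy_panel = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB"],
    "date": pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-04", "2023-01-03"]),
    "ret": [0.01, np.nan, 0.02, -0.01],
    "illiquidity": [0.5, np.inf, 0.3, 0.2],
})
report = panel_sanity_report(toy_panel)
```

In the notebook this could be applied to `stock_returns` ahead of the save cell, with the counts printed or asserted to be zero depending on how strict the pipeline should be.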

In [None]:
# =============================================================================
# SAVE ALL OUTPUTS
# =============================================================================

def save_financial_data(output_dir: str):
    """Save all financial data with documentation."""
    
    os.makedirs(output_dir, exist_ok=True)
    
    # Stock returns panel
    stock_returns.to_parquet(
        os.path.join(output_dir, 'stock_returns_panel.parquet'),
        index=False
    )
    print(f"Saved: stock_returns_panel.parquet ({len(stock_returns):,} rows)")
    
    # Market data
    market_data.to_parquet(
        os.path.join(output_dir, 'market_returns.parquet'),
        index=False
    )
    print("Saved: market_returns.parquet")
    
    # Fama-French factors
    if len(ff_factors) > 0:
        ff_factors.to_parquet(
            os.path.join(output_dir, 'ff_factors.parquet'),
            index=False
        )
        print("Saved: ff_factors.parquet")
    
    # Firm characteristics
    firm_characteristics.to_parquet(
        os.path.join(output_dir, 'firm_characteristics.parquet'),
        index=False
    )
    print("Saved: firm_characteristics.parquet")
    
    # Data dictionary
    data_dict = {
        'stock_returns_panel': {
            'ticker': 'Stock ticker symbol',
            'date': 'Trading date',
            'open/high/low/close': 'OHLC prices (adjusted)',
            'volume': 'Trading volume',
            'ret': 'Simple daily return',
            'ret_log': 'Log daily return',
            'market_return': 'Market (SPY) return',
            'ret_mktadj': 'Market-adjusted return',
            'Mkt-RF/SMB/HML/RF': 'Fama-French factors',
            'volatility_20d': '20-day rolling volatility (annualized)',
            'ret_1m/3m/6m/12m': 'Rolling sum of daily returns over 21/63/126/252 trading days (approx. 1/3/6/12 months)',
            'vol_1m/3m': 'Annualized rolling volatility over 21/63 trading days (approx. 1/3 months)',
            'illiquidity_avg': 'Amihud illiquidity (21-day avg)'
        },
        'firm_characteristics': {
            'ticker': 'Stock ticker symbol',
            'sector/industry': 'Sector and industry as classified by Yahoo Finance',
            'market_cap': 'Market capitalization',
            'price_to_book': 'Price-to-book ratio',
            'beta': 'Market beta',
            'short_ratio': 'Short interest ratio (days to cover) as reported by Yahoo Finance'
        }
    }
    
    with open(os.path.join(output_dir, 'financial_data_dictionary.json'), 'w') as f:
        json.dump(data_dict, f, indent=2)
    print("Saved: financial_data_dictionary.json")

# Save all data
save_financial_data(config.PROCESSED_DATA_PATH)

## 8. Summary

In [None]:
# =============================================================================
# NOTEBOOK SUMMARY
# =============================================================================

print("""
╔══════════════════════════════════════════════════════════════════╗
║        NOTEBOOK 3: FINANCIAL DATA COLLECTION COMPLETE            ║
╚══════════════════════════════════════════════════════════════════╝

OUTPUT FILES:
─────────────
• stock_returns_panel.parquet  - Daily returns with characteristics
• market_returns.parquet       - Market index returns
• ff_factors.parquet           - Fama-French factor returns
• firm_characteristics.parquet - Cross-sectional firm data

KEY VARIABLES:
──────────────
Returns:
  • ret, ret_log (raw returns)
  • ret_mktadj (market-adjusted)
  • CAR calculations available

Characteristics:
  • Size (market cap)
  • Book-to-market (via price_to_book)
  • Momentum (1m, 3m, 6m, 12m)
  • Volatility
  • Liquidity (Amihud)

Factors:
  • Mkt-RF, SMB, HML (Fama-French)
  • Risk-free rate

NEXT STEPS:
───────────
→ Notebook 4: Earnings Quality Measures
  - Financial statement data from SEC EDGAR
  - Accrual-based quality metrics
  - Dechow-Dichev model implementation

""")