Phase 1 — Binary Target
Post-earnings return over N trading days (start with 1-day, also test 5-day):

In [2]:
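def compute_target(df, n_days=1):
    # df holds one ticker in date order, with a 'close' column and an 'earnings_date' flag
    df = df.copy()  # leave the caller's frame untouched
    # Forward return from today's close to the close n_days trading days later
    df['post_ret'] = df['close'].shift(-n_days) / df['close'] - 1
    # Keep the label missing where the forward return is unknown (last n_days rows);
    # a plain astype(int) would label those rows 0
    df['target_binary'] = (df['post_ret'] > 0).astype(float).where(df['post_ret'].notna())
    # Only rows where 'earnings_date' is set are post-earnings observations
    return df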
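
A quick way to sanity-check the labeling is to run it on a toy frame and keep only the earnings rows. The sketch below is self-contained and assumes `earnings_date` is a boolean column; `label_earnings_rows` and the prices are made up for illustration, not part of the original notebook.

```python
import pandas as pd

def label_earnings_rows(df, n_days=1):
    """Hypothetical helper: forward return and binary label, earnings rows only."""
    out = df.copy()
    # Forward return over n_days trading days
    out["post_ret"] = out["close"].shift(-n_days) / out["close"] - 1
    # Keep earnings rows whose forward return is known
    out = out[out["earnings_date"] & out["post_ret"].notna()].copy()
    out["target_binary"] = (out["post_ret"] > 0).astype(int)
    return out

# Toy data: six business days, earnings flagged on the 2nd and 5th rows
prices = pd.DataFrame(
    {
        "close": [100.0, 102.0, 106.0, 99.0, 105.0, 104.0],
        "earnings_date": [False, True, False, False, True, False],
    },
    index=pd.bdate_range("2024-01-01", periods=6),
)

labeled_1d = label_earnings_rows(prices, n_days=1)
labeled_5d = label_earnings_rows(prices, n_days=5)
print(labeled_1d[["close", "post_ret", "target_binary"]])
print(len(labeled_5d), "rows survive the 5-day horizon")
```

With a longer horizon, earnings rows near the end of the history drop out because their forward return is not yet known, which is worth remembering when comparing 1-day and 5-day sample sizes.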

Phase 2 — 4-Class Magnitude

In [3]:
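import math

def compute_magnitude_class(ret, threshold=0.04):
    # threshold ~ median expected move for S&P500 stocks
    # A NaN return fails every comparison below and would fall through to "Big Up"
    if math.isnan(ret):      return None  # unknown return: no label
    if ret <= -threshold:    return 0   # Big Down
    elif ret <= 0:           return 1   # Small Down
    elif ret <= threshold:   return 2   # Small Up
    else:                    return 3   # Big Up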

# Or: use the options-implied expected move as the threshold
# so "big" = move > priced-in vol
def compute_vs_expected(ret, expected_move):
    return 1 if abs(ret) > expected_move else 0
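
For a whole column of returns, the same binning can be done in one pass. This is a sketch using `pd.cut` with right-closed bins, which reproduces the `<=` boundaries of the scalar function; `magnitude_classes` and `exceeds_expected` are illustrative names, and the expected moves are assumed to be fractions (0.05 = 5%).

```python
import numpy as np
import pandas as pd

def magnitude_classes(returns, threshold=0.04):
    """Vectorized 4-class label: 0 Big Down, 1 Small Down, 2 Small Up, 3 Big Up."""
    r = pd.Series(returns, dtype=float)
    bins = [-np.inf, -threshold, 0.0, threshold, np.inf]
    # right=True makes each bin (low, high], matching the <= comparisons
    return pd.cut(r, bins=bins, labels=[0, 1, 2, 3], right=True)

def exceeds_expected(returns, expected_moves):
    """1 where the absolute move is larger than the expected move, else 0."""
    r = pd.Series(returns, dtype=float)
    e = pd.Series(expected_moves, dtype=float)
    return (r.abs() > e).astype(int)

rets = [-0.06, -0.04, -0.01, 0.0, 0.02, 0.04, 0.07]
classes = magnitude_classes(rets)
beat = exceeds_expected([0.03, -0.08], [0.05, 0.05])
print(classes.tolist())
print(beat.tolist())
```

Missing returns stay missing in the `pd.cut` output rather than landing in a class.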

Building the Dataset
Each ticker reports earnings only ~4 times per year, so you need breadth. Target 200–500 tickers across sectors, going back 5–8 years. That gives you roughly 4,000 (200 × 4 × 5) to 16,000 (500 × 4 × 8) observations.

Ticker Universe

In [2]:
import yfinance as yf
import pandas as pd
import numpy as np

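# Good starting universes
# pd.read_html needs an HTML parser installed, e.g. `pip install lxml`
SP500_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
sp500 = pd.read_html(SP500_URL)[0]['Symbol'].tolist()
# Wikipedia writes share classes with a dot (BRK.B); Yahoo Finance uses a hyphen (BRK-B)
sp500 = [s.replace('.', '-') for s in sp500]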

# Or sector-specific ETF holdings for Russell 2000 small caps
# Include a mix: large cap, mid cap, small cap, different sectors
# Avoid: very illiquid tickers, recent IPOs (<2yr history)

def get_price_data(ticker, start="2017-01-01"):
    t = yf.Ticker(ticker)
    hist = t.history(start=start, auto_adjust=True)
    hist.index = hist.index.tz_localize(None)
    return hist

ImportError: `Import lxml` failed.  Use pip or conda to install the lxml package.
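
Once a per-ticker loader works, the frames can be stacked into one long table while skipping tickers that fail or have too little history. The sketch below takes the loader as an argument so it can be tried offline; `build_price_panel`, `fake_fetch`, and the 504-row cutoff (about two years of trading days) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def build_price_panel(tickers, fetch, min_rows=504):
    """Stack per-ticker price frames into one long frame.

    fetch: callable mapping a ticker to a DataFrame (e.g. a yfinance-based loader).
    min_rows: minimum history length; shorter tickers are skipped.
    """
    frames, skipped = [], []
    for ticker in tickers:
        try:
            hist = fetch(ticker)
        except Exception:
            # Delisted or unknown symbols should not abort the whole run
            skipped.append(ticker)
            continue
        if hist is None or len(hist) < min_rows:
            skipped.append(ticker)
            continue
        frames.append(hist.assign(ticker=ticker))
    panel = pd.concat(frames) if frames else pd.DataFrame()
    return panel, skipped

def fake_fetch(ticker):
    """Stand-in loader with synthetic prices so the example runs without network access."""
    n = {"AAA": 600, "BBB": 100}.get(ticker)
    if n is None:
        raise ValueError(f"no data for {ticker}")
    idx = pd.bdate_range("2020-01-01", periods=n)
    return pd.DataFrame({"Close": np.linspace(10.0, 20.0, n)}, index=idx)

panel, skipped = build_price_panel(["AAA", "BBB", "CCC"], fake_fetch)
print(len(panel), "rows kept; skipped:", skipped)
```

In the notebook you would pass `get_price_data` in place of `fake_fetch`, and it is worth logging the skipped list to see how much of the universe was lost.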