# Staggered DiD Event-Study Analysis: Major Patches Impact on Steam Games

This notebook performs a **staggered difference-in-differences** (DiD) event-study analysis to estimate the causal impact of major game patches on:
- **Player counts** (concurrent players)
- **Ownership/Sales** (estimated owners from SteamDB)
- **Game ratings** (Metacritic scores)

**Key Features:**
- **Event-study design**: Events are staggered, so each game's "treatment" occurs on its own last major patch date
- **Panel structure**: Monthly observations from 4 months before to 4 months after the patch (rel_month = -4..+4)
- **Control variables**: Metacritic scores, review counts, review sentiment (% positive)
- **Baseline specification**: Standard DiD with event-time dummies
- **Visualization**: Event-study plots showing coefficient dynamics across the pre- and post-patch periods

**Interpretation:**
- The coefficient for rel_month = k is the estimated effect k months from the patch, measured relative to the omitted baseline month (rel_month = -1)
- Pre-period coefficients (k < 0) test the parallel-trends assumption; they should be close to zero if it holds
- Post-period coefficients (k ≥ 0) trace the treatment effect over time
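
Written out, the baseline regression fitted in this notebook can be sketched as below. The symbols ($\alpha$, $\beta$, $\gamma_k$, $\delta$, $X_{it}$) are notation assumed here for illustration; they do not appear in the code.

$$
y_{it} = \alpha + \beta\,\mathrm{treatment}_i + \sum_{\substack{k=-4 \\ k \neq -1}}^{4} \gamma_k\,\mathbf{1}[\mathrm{rel\_month}_{it} = k] + X_{it}'\delta + \varepsilon_{it}
$$

Here $y_{it}$ is the outcome for game $i$ in month $t$, $X_{it}$ collects the control variables, and each $\gamma_k$ is read relative to the omitted month $k = -1$. For a log outcome, $\gamma_k$ corresponds to a change of $100\,(e^{\gamma_k} - 1)$ percent.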

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import warnings

warnings.filterwarnings('ignore')

# Configure plotting
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Libraries loaded successfully")

Libraries loaded successfully


## 1. Load and Prepare Panel Data

In [2]:
# Load the panel data created by collect_panel_for_did.py
panel = pd.read_csv('../did_panel.csv')

print(f"Loaded panel with {len(panel)} rows and {len(panel.columns)} columns")
print(f"\nColumn names: {list(panel.columns)}")
print(f"\nFirst few rows:")
print(panel.head(10))

print(f"\n\nPanel structure:")
print(f"  Unique apps: {panel['appid'].nunique()}")
print(f"  Observations per app: {len(panel) / panel['appid'].nunique():.0f}")
print(f"  Relative months range: {panel['rel_month'].min()} to {panel['rel_month'].max()}")
print(f"  Treatment group (treatment=1): {(panel['treatment']==1).sum()} rows")
print(f"  Control group (treatment=0): {(panel['treatment']==0).sum()} rows")

# Check for missing data
print(f"\n\nMissing data:")
print(panel.isnull().sum())

# Outcome variable preparation: log(1 + x) keeps zero counts finite
panel['log_avg_players'] = np.log1p(panel['avg_players'])
panel['log_peak_players'] = np.log1p(panel['peak_players'])
panel['log_owners'] = np.log1p(panel['owners_estimate'])

print(f"\n\nOutcome variables created (log-transformed)")
print(f"  Log avg players: {panel['log_avg_players'].notna().sum()} non-null")
print(f"  Log peak players: {panel['log_peak_players'].notna().sum()} non-null")
print(f"  Log owners: {panel['log_owners'].notna().sum()} non-null")

Loaded panel with 270 rows and 12 columns

Column names: ['appid', 'name', 'event_date', 'rel_month', 'month', 'avg_players', 'peak_players', 'owners_estimate', 'metacritic_score', 'review_count', 'review_positive_pct', 'treatment']

First few rows:
     appid              name  event_date  rel_month    month  avg_players  \
0      730  Counter-Strike 2  2024-12-01         -4  2024-08    897337.88   
1      730  Counter-Strike 2  2024-12-01         -3  2024-09    836306.66   
2      730  Counter-Strike 2  2024-12-01         -2  2024-10    829438.76   
3      730  Counter-Strike 2  2024-12-01         -1  2024-11    852164.30   
4      730  Counter-Strike 2  2024-12-01          0  2024-12    913953.36   
5      730  Counter-Strike 2  2024-12-01          1  2025-01    914092.22   
6      730  Counter-Strike 2  2024-12-01          2  2025-02   1003570.56   
7      730  Counter-Strike 2  2024-12-01          3  2025-03   1039662.81   
8      730  Counter-Strike 2  2024-12-01          4  2025
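
The "Observations per app" figure printed by the cell above is an average, so it can hide games whose window is incomplete. A minimal sketch of an explicit check, assuming the -4..+4 window described in the introduction; the helper name `find_incomplete_apps` and the toy data are hypothetical:

```python
import pandas as pd

def find_incomplete_apps(panel, window=range(-4, 5)):
    """Return {appid: sorted missing rel_month values} for apps lacking part of the window."""
    expected = set(window)
    incomplete = {}
    for appid, months in panel.groupby('appid')['rel_month']:
        missing = expected - set(months)
        if missing:
            incomplete[appid] = sorted(missing)
    return incomplete

# Toy data (hypothetical): app 10 has all nine months, app 20 lacks rel_month 3 and 4
toy = pd.DataFrame({
    'appid': [10] * 9 + [20] * 7,
    'rel_month': list(range(-4, 5)) + list(range(-4, 3)),
})
print(find_incomplete_apps(toy))
```

An empty dictionary would indicate a balanced panel; any entries point to games that may need to be dropped or handled separately before fitting.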

## 2. Staggered DiD with Event-Time Dummies

In [3]:
# Create event-time dummy variables for the event-study design
# We'll include dummies for each rel_month, excluding one as the baseline

panel['post'] = (panel['rel_month'] >= 0).astype(int)

# Create treatment × post interaction for DiD
panel['treatment_post'] = panel['treatment'] * panel['post']

# For event-study, we'll create lead/lag dummies
# Set rel_month = -1 as baseline (one month before patch)
baseline_rel_month = -1
event_months = sorted(panel['rel_month'].unique())
event_months = [m for m in event_months if m != baseline_rel_month]

for rel_m in event_months:
    panel[f'rel_month_{rel_m}'] = (panel['rel_month'] == rel_m).astype(int)

print(f"Created {len(event_months)} event-time dummies (baseline: rel_month = {baseline_rel_month})")
print(f"Event times included: {event_months}")

# Control variables
# metacritic_score is stored as a stringified dict, e.g. "{'score': 79, 'url': ...}".
# Averaging it directly raises "TypeError: Could not convert string ... to numeric",
# so extract the numeric score first (plain numeric values are kept as they are).
raw_meta = panel['metacritic_score']
panel['metacritic_score'] = pd.to_numeric(
    raw_meta.astype(str).str.extract(r"'score':\s*(\d+)")[0], errors='coerce'
).fillna(pd.to_numeric(raw_meta, errors='coerce'))

# Fill missing control values with the game-specific mean, then the overall mean
control_vars = ['metacritic_score', 'review_count', 'review_positive_pct']
for var in control_vars:
    if var in panel.columns:
        panel[var] = pd.to_numeric(panel[var], errors='coerce')
        panel[f'{var}_filled'] = panel.groupby('appid')[var].transform(
            lambda x: x.fillna(x.mean())
        ).fillna(panel[var].mean())

print(f"\nControl variables prepared (filled missing with group means)")

Created 8 event-time dummies (baseline: rel_month = -1)
Event times included: [np.int64(-4), np.int64(-3), np.int64(-2), np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4)]

## 3. Fit Event-Study Regressions

In [None]:
# Function to run event-study regression and extract coefficients
def run_event_study(data, outcome_var, control_vars=None, event_months=None):
    """
    Fit event-study regression with event-time dummies.
    Returns: (model, event_coefs_df)
    """
    if control_vars is None:
        control_vars = []
    if event_months is None:
        event_months = []
    
    # Build formula: outcome ~ treatment + sum(rel_month_t) for all t + controls
    dummy_cols = [f'rel_month_{m}' for m in event_months if f'rel_month_{m}' in data.columns]
    # Never control for the outcome itself (metacritic_score is both an outcome and a control)
    control_terms = [f'{c}_filled' for c in control_vars
                     if c != outcome_var and f'{c}_filled' in data.columns]
    
    # Names such as 'rel_month_-4' are not valid identifiers; without Q() the
    # formula parser would read them as 'rel_month_' minus 4 and fail.
    dummy_terms = [f'Q("{col}")' for col in dummy_cols]
    
    formula = f"{outcome_var} ~ treatment + " + " + ".join(dummy_terms)
    if control_terms:
        formula += " + " + " + ".join(control_terms)
    
    # Keep only non-missing observations
    data_clean = data[[outcome_var, 'treatment'] + dummy_cols + control_terms].dropna()
    
    if len(data_clean) < 10:
        print(f"    Insufficient data ({len(data_clean)} rows) for {outcome_var}")
        return None, None
    
    model = ols(formula, data=data_clean).fit()
    conf_int = model.conf_int()  # compute once instead of once per coefficient
    
    # Extract event-study coefficients
    event_coefs = []
    for rel_m in event_months:
        coef_name = f'Q("rel_month_{rel_m}")'
        if coef_name in model.params.index:
            p_val = model.pvalues[coef_name]
            event_coefs.append({
                'rel_month': rel_m,
                'coef': model.params[coef_name],
                'se': model.bse[coef_name],
                't_stat': model.tvalues[coef_name],
                'p_val': p_val,
                'ci_lower': conf_int.loc[coef_name, 0],
                'ci_upper': conf_int.loc[coef_name, 1],
                'sig': '***' if p_val < 0.01 else ('**' if p_val < 0.05 else ('*' if p_val < 0.10 else ''))
            })
    
    event_coefs_df = pd.DataFrame(event_coefs).sort_values('rel_month')
    return model, event_coefs_df

# Fit models for each outcome
outcomes = [
    ('log_avg_players', 'Log(Avg Players)'),
    ('log_peak_players', 'Log(Peak Players)'),
    ('log_owners', 'Log(Owners)'),
    ('metacritic_score', 'Metacritic Score')
]

control_vars = ['metacritic_score', 'review_count', 'review_positive_pct']

models = {}
event_study_results = {}

for outcome_col, outcome_label in outcomes:
    print(f"\nFitting event-study regression for {outcome_label}...")
    model, event_coefs = run_event_study(panel, outcome_col, control_vars=control_vars, event_months=event_months)
    
    if model is not None:
        models[outcome_col] = model
        event_study_results[outcome_col] = event_coefs
        print(f"  R-squared: {model.rsquared:.4f}")
        print(f"  N obs: {len(model.fittedvalues)}")
    else:
        print(f"  Model fitting failed")

print("\n" + "="*70)
print("Event-study regressions completed")
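
The regression above enters the event-time dummies on their own, so their coefficients describe the average path of every game in the sample around its event date. A difference-in-differences reading usually needs those dummies interacted with treatment. The following is a minimal sketch of that alternative on synthetic data, not the specification used in this notebook. It assumes control games also carry an event date so that `rel_month` is defined for them, and it adds game fixed effects and standard errors clustered by `appid`. All names and numbers (`toy`, `TRUE_EFFECT`, the `treat_k…` columns) are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.3  # assumed post-patch jump in the log outcome for treated games

# Synthetic balanced panel: 40 games x 9 months, first 20 games treated
rows = []
for appid in range(40):
    treated = int(appid < 20)
    level = rng.normal(8.0, 1.0)  # game-specific baseline
    for k in range(-4, 5):
        effect = TRUE_EFFECT if (treated and k >= 0) else 0.0
        y = level + 0.02 * k + effect + rng.normal(0.0, 0.05)
        rows.append({'appid': appid, 'treatment': treated, 'rel_month': k, 'y': y})
toy = pd.DataFrame(rows)

# Treatment x event-time dummies, omitting the baseline month k = -1.
# 'm' replaces the minus sign so the names stay valid inside a formula.
months = [k for k in range(-4, 5) if k != -1]
names = {k: f"treat_k{'m' if k < 0 else ''}{abs(k)}" for k in months}
for k, name in names.items():
    toy[name] = ((toy['treatment'] == 1) & (toy['rel_month'] == k)).astype(int)

# Game and event-time fixed effects absorb levels and the common time path
formula = "y ~ C(appid) + C(rel_month) + " + " + ".join(names.values())
fit = smf.ols(formula, data=toy).fit(
    cov_type='cluster', cov_kwds={'groups': toy['appid']}
)

# Treated-minus-control gap at each event time, relative to k = -1
did_coefs = pd.Series({k: fit.params[name] for k, name in names.items()})
print(did_coefs.round(3))
```

In this setup the pre-period interaction coefficients should sit near zero and the post-period ones near the assumed effect, which is the pattern the event-study plot is meant to reveal.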

## 4. Visualize Event-Study Results

In [None]:
# Plot event-study coefficients with 95% confidence intervals
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

plot_idx = 0
for outcome_col, outcome_label in outcomes:
    if outcome_col not in event_study_results:
        continue
    
    ax = axes[plot_idx]
    coefs_df = event_study_results[outcome_col]
    
    # Plot coefficients with CIs
    ax.errorbar(coefs_df['rel_month'], coefs_df['coef'], 
                yerr=1.96*coefs_df['se'], fmt='o-', linewidth=2, markersize=8,
                capsize=5, capthick=2, color='#2E86AB', label='Coef ± 95% CI')
    
    # Add zero line
    ax.axhline(0, color='red', linestyle='--', linewidth=1.5, alpha=0.7, label='Zero')
    
    # Add shading for post-period
    ax.axvspan(-0.5, coefs_df['rel_month'].max() + 0.5, alpha=0.1, color='green', label='Post-period')
    
    ax.set_xlabel('Months Relative to Patch', fontsize=11, fontweight='bold')
    ax.set_ylabel('Coefficient', fontsize=11, fontweight='bold')
    ax.set_title(f'Event Study: {outcome_label}', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend(loc='best', fontsize=9)
    
    plot_idx += 1

plt.tight_layout()
plt.savefig('../staggered_did_event_study.png', dpi=300, bbox_inches='tight')
plt.show()

print("Event-study plot saved to staggered_did_event_study.png")

## 5. Regression Coefficients Table

In [None]:
# Compile and display all event-study coefficients
all_results = []

for outcome_col, outcome_label in outcomes:
    if outcome_col not in event_study_results:
        continue
    coefs_df = event_study_results[outcome_col]
    coefs_df_copy = coefs_df.copy()
    coefs_df_copy['outcome'] = outcome_label
    coefs_df_copy['outcome_col'] = outcome_col
    all_results.append(coefs_df_copy)

if all_results:
    combined_results = pd.concat(all_results, ignore_index=True)
    
    # Reorder columns for display
    display_cols = ['outcome', 'rel_month', 'coef', 'se', 't_stat', 'p_val', 'sig', 'ci_lower', 'ci_upper']
    combined_results = combined_results[display_cols]
    
    # Format for display
    combined_results['coef'] = combined_results['coef'].round(6)
    combined_results['se'] = combined_results['se'].round(6)
    combined_results['t_stat'] = combined_results['t_stat'].round(4)
    combined_results['p_val'] = combined_results['p_val'].round(6)
    combined_results['ci_lower'] = combined_results['ci_lower'].round(6)
    combined_results['ci_upper'] = combined_results['ci_upper'].round(6)
    
    print("\n" + "="*130)
    print("EVENT-STUDY REGRESSION RESULTS: Treatment Effect by Relative Month")
    print("="*130)
    print(combined_results.to_string(index=False))
    
    # Export to CSV
    combined_results.to_csv('../staggered_did_coefficients.csv', index=False)
    print(f"\n\nResults exported to staggered_did_coefficients.csv")
    
    # Summary statistics
    print("\n" + "="*130)
    print("SUMMARY: Average Treatment Effects")
    print("="*130)
    for outcome_col, outcome_label in outcomes:
        if outcome_col not in event_study_results:
            continue
        coefs_df = event_study_results[outcome_col]
        # Average post-period effect (rel_month >= 0)
        post_coefs = coefs_df[coefs_df['rel_month'] >= 0]
        if len(post_coefs) > 0:
            mean_effect = post_coefs['coef'].mean()
            # p-values cannot be averaged into a valid test, so report how many
            # post-period coefficients are individually significant instead
            n_sig = int((post_coefs['p_val'] < 0.05).sum())
            print(f"\n{outcome_label}:")
            print(f"  Average post-period coefficient: {mean_effect:.6f}")
            print(f"  Post-period coefficients significant at 5%: {n_sig} of {len(post_coefs)}")
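
A mean of p-values is not itself a p-value, so it cannot establish that the post-period effects are jointly significant. One standard alternative is an F-test of the hypothesis that all post-period coefficients are zero. A minimal sketch on synthetic data, using statsmodels' `f_test` with a restriction matrix; the helper name `joint_test` and the toy column names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def joint_test(model, coef_names):
    """F-test of H0: every coefficient in coef_names equals zero."""
    params = list(model.params.index)
    # One restriction row per coefficient, with a 1 in that coefficient's column
    R = np.zeros((len(coef_names), len(params)))
    for row, name in enumerate(coef_names):
        R[row, params.index(name)] = 1.0
    return model.f_test(R)

# Toy data (hypothetical): two dummies with a real effect, one irrelevant dummy
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    'post_0': rng.integers(0, 2, n),
    'post_1': rng.integers(0, 2, n),
    'noise': rng.integers(0, 2, n),
})
toy['y'] = 0.5 * toy['post_0'] + 0.5 * toy['post_1'] + rng.normal(0, 1, n)
model = smf.ols('y ~ post_0 + post_1 + noise', data=toy).fit()

result = joint_test(model, ['post_0', 'post_1'])
f_value = float(np.squeeze(result.fvalue))
p_value = float(np.squeeze(result.pvalue))
print(f"F = {f_value:.2f}, p = {p_value:.4g}")
```

Passing the pre-period coefficient names to the same helper would give a joint check of the pre-trend coefficients.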

## 6. Parallel Trends Test (Pre-Period Analysis)