Step 1: Define Social Media Spikes
Identify spike days, weeks, and months in which social media post volume exceeds its rolling mean by more than 1.5 rolling standard deviations (the threshold applied in the code below), marking unusually large bursts of attention.

In [None]:
import pandas as pd

# Load the merged Bluesky mention counts
df = pd.read_csv('bluesky_merged_mentions.csv')

# Parse dates and sort so each company's rows are in chronological order
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['company', 'date'])

# Helper that computes per-company rolling statistics and flags spikes

def flag_spikes(df, window_size, count_col, flag_col):
    """Flag rows where count_col exceeds rolling mean + 1.5 * rolling std.

    window_size counts rows (not days), and the window includes the current row.
    """
    grouped = df.groupby('company')[count_col]
    rolling_mean = grouped.transform(lambda x: x.rolling(window=window_size, min_periods=1).mean())
    # std is NaN for a single observation, so treat it as 0
    rolling_std = grouped.transform(lambda x: x.rolling(window=window_size, min_periods=1).std().fillna(0))
    df[flag_col] = df[count_col] > (rolling_mean + 1.5 * rolling_std)
    return df

# 1. Daily counts vs 7-day rolling mean/std for spike detection
df = flag_spikes(df, 7, 'all_keywords_mentions', 'daily_spike')

daily_spikes = df[df['daily_spike']][['date', 'company', 'all_keywords_mentions']]
print("Daily Spikes:")
print(daily_spikes)

# 2. Weekly counts need aggregation first
df_weekly = df.set_index('date').groupby('company')['all_keywords_mentions'].resample('W').sum().reset_index()

# Flag weekly spikes. The window counts rows, so 21 here spans 21 weekly observations, not 21 days
df_weekly = df_weekly.sort_values(['company', 'date'])
df_weekly = flag_spikes(df_weekly, 21, 'all_keywords_mentions', 'weekly_spike')

weekly_spikes = df_weekly[df_weekly['weekly_spike']][['date', 'company', 'all_keywords_mentions']]
print("Weekly Spikes:")
print(weekly_spikes)

# 3. Monthly counts similarly
# 'ME' (month end) replaces the deprecated 'M' alias in pandas 2.2 and later
df_monthly = df.set_index('date').groupby('company')['all_keywords_mentions'].resample('ME').sum().reset_index()

# Flag monthly spikes. The window counts rows, so 30 here spans up to 30 monthly observations, not 30 days
# With min_periods=1 it behaves like an expanding window until 30 months of data are available
df_monthly = df_monthly.sort_values(['company', 'date'])
df_monthly = flag_spikes(df_monthly, 30, 'all_keywords_mentions', 'monthly_spike')

monthly_spikes = df_monthly[df_monthly['monthly_spike']][['date', 'company', 'all_keywords_mentions']]

Daily Spikes:
            date      company  all_keywords_mentions
27    2024-08-28         AT&T                   2621
32    2024-09-02         AT&T                   4830
69    2024-10-09         AT&T                   4882
77    2024-10-17         AT&T                   4832
78    2024-10-18         AT&T                   4991
...          ...          ...                    ...
18142 2025-06-03  Wells Fargo                     71
18143 2025-06-04  Wells Fargo                    241
18171 2025-07-02  Wells Fargo                     77
18173 2025-07-04  Wells Fargo                     91
18184 2025-07-15  Wells Fargo                    141

[1492 rows x 3 columns]

Weekly Spikes:
           date      company  all_keywords_mentions
5    2024-09-08         AT&T                  24609
7    2024-09-22         AT&T                  30778
14   2024-11-10         AT&T                  32763
15   2024-11-17         AT&T                  37023
16   2024-11-24         AT&T                  383
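
Because the rolling window in flag_spikes includes the current row, a burst raises its own mean and standard deviation. With a window of w rows, a value can sit at most (w - 1) / sqrt(w) sample standard deviations above the window mean, which is about 2.27 for w = 7. A hedged variant that compares each count with only the preceding rows is sketched below. The name flag_spikes_prior, its n_std parameter, and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def flag_spikes_prior(df, window_size, count_col, flag_col, n_std=1.5):
    """Flag rows whose count exceeds mean + n_std * std of the PREVIOUS window_size rows."""
    out = df.copy()  # leave the caller's DataFrame untouched
    grouped = out.groupby('company')[count_col]
    # shift(1) removes the current row from its own baseline
    prior_mean = grouped.transform(
        lambda x: x.shift(1).rolling(window_size, min_periods=2).mean())
    prior_std = grouped.transform(
        lambda x: x.shift(1).rolling(window_size, min_periods=2).std())
    # Comparisons against NaN (too little history) evaluate to False
    out[flag_col] = out[count_col] > (prior_mean + n_std * prior_std)
    return out

# Toy data for illustration only
demo = pd.DataFrame({
    'company': ['A'] * 6,
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'all_keywords_mentions': [10, 12, 11, 12, 11, 50],
})
demo = flag_spikes_prior(demo, 3, 'all_keywords_mentions', 'spike')
```

Switching to this variant would change which rows are flagged, so the spike tables above would need to be regenerated.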



Step 2: Estimate Post-Spike Returns and Volatility

To estimate log returns, use the ??????
To estimate weekly volatility, use the average volatility over the previous 5 trading days.
To estimate monthly volatility, use the average volatility over the previous 21 trading days.
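
A minimal sketch of the volatility estimates, under stated assumptions: the prices DataFrame, its close column, and the trailing_vol helper are hypothetical, and volatility is taken to be the standard deviation of daily log returns over the trailing window, which the text above does not specify. The baseline for expected log returns is left open above, so the sketch only computes the realized daily log return ln(P_t / P_{t-1}).

```python
import numpy as np
import pandas as pd

# Hypothetical price data; column names and values are illustrative only
prices = pd.DataFrame({
    'company': ['A'] * 8,
    'date': pd.bdate_range('2024-01-01', periods=8),
    'close': [100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 104.0, 105.0],
})
prices = prices.sort_values(['company', 'date'])

# Daily log return ln(P_t / P_{t-1}), computed within each company
prices['log_ret'] = prices.groupby('company')['close'].transform(
    lambda p: np.log(p / p.shift(1)))

def trailing_vol(returns, window):
    # Std of the previous `window` returns; shift(1) keeps day t out of its own estimate
    return returns.shift(1).rolling(window).std()

by_company = prices.groupby('company')['log_ret']
prices['est_vol_5d'] = by_company.transform(lambda r: trailing_vol(r, 5))
prices['est_vol_21d'] = by_company.transform(lambda r: trailing_vol(r, 21))
```

Rows without a full window of prior returns come out as NaN, so the first few days of each company carry no estimate.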

Step 3: Compare realized volatility/returns with estimated returns/volatility
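
One way to make this comparison concrete is to define the abnormal value as realized minus estimated and then contrast spike days with other days. The sketch below assumes hypothetical realized_vol and est_vol columns. The numbers are toy values, not results from this analysis.

```python
import pandas as pd

# Toy values for illustration only; not results from the analysis
panel = pd.DataFrame({
    'company': ['A', 'A', 'A', 'A'],
    'date': pd.bdate_range('2024-01-01', periods=4),
    'realized_vol': [0.010, 0.012, 0.030, 0.011],
    'est_vol': [0.011, 0.011, 0.012, 0.013],
    'daily_spike': [False, False, True, False],
})

# Abnormal volatility: realized minus estimated
panel['abn_vol'] = panel['realized_vol'] - panel['est_vol']

# Mean abnormal volatility on spike days versus all other days
spike_mean = panel.loc[panel['daily_spike'], 'abn_vol'].mean()
other_mean = panel.loc[~panel['daily_spike'], 'abn_vol'].mean()
```

The same subtraction applies to returns once an expected-return estimate is chosen.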


Step 4: Regress abnormal returns and abnormal volatility on social media spike indicators, controlling for the market index
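
A hedged sketch of such a regression using ordinary least squares in NumPy. All data here are synthetic and generated with known coefficients so the fit can be checked. The variable names and the single-equation specification (intercept, spike indicator, market return) are assumptions, not the study's final model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
spike = (rng.random(n) < 0.1).astype(float)   # synthetic 0/1 spike indicator
market = rng.normal(0.0, 0.01, n)             # synthetic market index return
# Synthetic outcome built from known coefficients plus noise
abn_ret = 0.001 + 0.02 * spike + 0.8 * market + rng.normal(0.0, 0.002, n)

# Design matrix: intercept, spike indicator, market control
X = np.column_stack([np.ones(n), spike, market])
beta, *_ = np.linalg.lstsq(X, abn_ret, rcond=None)
intercept, b_spike, b_market = beta
```

np.linalg.lstsq returns point estimates only. For standard errors and significance tests, a statistics package such as statsmodels would be the more usual choice.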