# Notebook 06: Main regression analysis (OLS vs IV)

**Objective**
1. **Baseline OLS:** Estimate the naive relationship between analyst coverage and market quality.
2. **IV Estimation (2SLS):** Isolate the causal effect using the brokerage closure instrument.
3. **Comparison:** Demonstrate how addressing endogeneity changes the coefficients.
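
For reference, the three objectives can be written as the specifications below. This is a sketch inferred from the regression formulas used in this notebook's helper function; the symbols are notation introduced here, not taken from the paper. $Y_{it}$ is the outcome for panel unit $i$ (`Panel_ID`) in month $t$ (`Month_ID`), $X_{it}$ is the control vector, and $\alpha_i$ and $\delta_t$ are entity and time fixed effects.

$$
\begin{aligned}
\text{OLS:} \quad & Y_{it} = \beta^{OLS}\,\text{Coverage}_{it} + \gamma' X_{it} + \alpha_i + \delta_t + \varepsilon_{it} \\
\text{First stage:} \quad & \text{Coverage}_{it} = \pi\,(\text{Treated} \times \text{Post})_{it} + \theta' X_{it} + \alpha_i + \delta_t + v_{it} \\
\text{Second stage:} \quad & Y_{it} = \beta^{IV}\,\widehat{\text{Coverage}}_{it} + \gamma' X_{it} + \alpha_i + \delta_t + u_{it}
\end{aligned}
$$

The comparison of interest is between $\beta^{OLS}$ and $\beta^{IV}$, and the strength of the instrument is judged from $\pi$ in the first stage.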

In [39]:
import pandas as pd
import numpy as np
from linearmodels.panel import PanelOLS

In [40]:
file_path = 'data/final_regression_panel.csv'
df = pd.read_csv(file_path, low_memory=False)

In [41]:
# formatting
df['Month_ID'] = pd.to_datetime(df['Month_ID'])
df['Shock_Date'] = pd.to_datetime(df['Shock_Date'])

# create unique entity ID for panel methods
df['Panel_ID'] = df['CUSIP'] + "-" + df['Event_ID'].astype(str)

# needed for linearmodels
df = df.set_index(['Panel_ID', 'Month_ID'])

# create the instrument interaction term
df['Treated_Post'] = df['Treated'] * df['Post']

Define regression configuration

In [42]:
controls = ['Size', 'ROA', 'Leverage', 'MTB', 'Opaqueness']
cols_to_check = controls + ['Price_Delay', 'Avg_Spread', 'NCSKEW', 'Coverage', 'Treated_Post']
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=cols_to_check)
print(f"Observations after cleaning NaNs/Infs: {len(df)}")

Observations after cleaning NaNs/Infs: 27558


Helper function: OLS & Manual 2SLS

In [43]:
def run_comparison(outcome_var, df_panel, entity_fe=True, time_fe=True):
    """
    Runs naive OLS and manual 2SLS for a given outcome variable and prints a comparison.
    """
    print(f"\n{'='*40}")
    print(f"ANALYSIS: {outcome_var}")
    print(f"{'='*40}")

    # build fixed effects string
    fe_terms = []
    if entity_fe: fe_terms.append('EntityEffects')
    if time_fe: fe_terms.append('TimeEffects')
    fe_formula = ' + '.join(fe_terms)
    control_formula = ' + '.join(controls)

    #  Naive OLS: Outcome ~ Coverage + Controls
    formula_ols = f"{outcome_var} ~ Coverage + {control_formula} + {fe_formula}"
    mod_ols = PanelOLS.from_formula(formula_ols, data=df_panel)
    res_ols = mod_ols.fit(cov_type='clustered', cluster_entity=True)

    # ---------------------------------------

    # Manual 2SLS (Panel IV)
    # Stage 1: regress coverage on instrument (Treated_Post) + Controls + FEs
    formula_stage1 = f"Coverage ~ Treated_Post + {control_formula} + {fe_formula}"
    mod_stage1 = PanelOLS.from_formula(formula_stage1, data=df_panel)
    res_stage1 = mod_stage1.fit(cov_type='clustered', cluster_entity=True)

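    # get predicted coverage (first column of the fitted-values frame);
    # assign() returns a new frame, so the caller's DataFrame is not mutated
    df_panel = df_panel.assign(Coverage_Hat=res_stage1.fitted_values.iloc[:, 0])

    # Stage 2: regress outcome on predicted coverage + Controls + FEs
    # NOTE: standard errors from this manual second stage use residuals based on
    # Coverage_Hat rather than Coverage, so they are not the proper 2SLS standard errors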
    formula_stage2 = f"{outcome_var} ~ Coverage_Hat + {control_formula} + {fe_formula}"
    mod_stage2 = PanelOLS.from_formula(formula_stage2, data=df_panel)
    res_stage2 = mod_stage2.fit(cov_type='clustered', cluster_entity=True)

    # ---------------------------------------

    # display comparison
    beta_ols = res_ols.params['Coverage']
    t_ols = res_ols.tstats['Coverage']

    beta_iv = res_stage2.params['Coverage_Hat']
    t_iv = res_stage2.tstats['Coverage_Hat']

    # first stage strength: with a single instrument, the F-stat on the
    # excluded instrument is the squared t-stat of Treated_Post
    fs_t_stat = res_stage1.tstats['Treated_Post']
    f_stat_approx = fs_t_stat ** 2

    print("-" * 30)
    print(f"OLS Coefficient: {beta_ols:.4f} (t={t_ols:.2f})")
    print(f"IV Coefficient:  {beta_iv:.4f} (t={t_iv:.2f})")
    print(f"1st Stage F-stat:{f_stat_approx:.2f}")
    print("-" * 30)

    return {'OLS': res_ols, 'IV': res_stage2, 'FS': res_stage1}
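
The helper above runs the two stages by hand. The sketch below uses synthetic data and plain NumPy (no fixed effects, no clustering, homoskedastic standard errors, all of which are simplifying assumptions) to illustrate two points about that approach: the manual second stage reproduces the IV point estimate, but its standard error is built from the wrong residuals, and with a single instrument the first-stage F-statistic equals the squared t-statistic. The variable names and the data-generating process are hypothetical and are not the notebook's panel.

```python
import numpy as np

# Synthetic data (hypothetical stand-ins for the notebook's variables).
rng = np.random.default_rng(0)
n = 5000
z = rng.binomial(1, 0.5, n).astype(float)    # instrument (stand-in for Treated_Post)
u = rng.normal(size=n)                       # unobserved confounder
x = 1.0 * z + u + rng.normal(size=n)         # endogenous regressor (stand-in for Coverage)
y = -0.5 * x + 2.0 * u + rng.normal(size=n)  # outcome; the true effect of x is -0.5


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.solve(X.T @ X, X.T @ y)


const = np.ones(n)
Z = np.column_stack([const, z])
X = np.column_stack([const, x])

# Stage 1: regress x on the instrument and keep the fitted values.
pi = ols(Z, x)
x_hat = Z @ pi
resid_fs = x - x_hat
se_pi = np.sqrt(resid_fs @ resid_fs / (n - 2) * np.linalg.inv(Z.T @ Z)[1, 1])
t_first = pi[1] / se_pi

# F-test of the single excluded instrument (restricted model: constant only).
rss_restricted = np.sum((x - x.mean()) ** 2)
rss_unrestricted = resid_fs @ resid_fs
f_first = (rss_restricted - rss_unrestricted) / (rss_unrestricted / (n - 2))

# Stage 2: regress y on the fitted values.
X_hat = np.column_stack([const, x_hat])
beta = ols(X_hat, y)
bread = np.linalg.inv(X_hat.T @ X_hat)[1, 1]

# Naive SE uses residuals from the fitted regressor; the 2SLS SE uses
# residuals computed with the actual endogenous regressor.
resid_naive = y - X_hat @ beta
resid_2sls = y - X @ beta
se_naive = np.sqrt(resid_naive @ resid_naive / (n - 2) * bread)
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - 2) * bread)

print(f"IV slope: {beta[1]:.4f}")
print(f"naive second-stage SE: {se_naive:.4f}, 2SLS SE: {se_2sls:.4f}")
print(f"first-stage t^2: {t_first ** 2:.2f}, F: {f_first:.2f}")
```

In the notebook itself the standard errors are clustered by entity, so the squared t-statistic is a cluster-robust analogue of the F-statistic rather than the classical one computed here. A dedicated IV estimator that corrects the second-stage residuals would be a safer source of inference than the manual second stage.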

In [44]:
results_store = {}

## Analysis 1: Price Efficiency (Price Delay)

**Hypothesis:** Higher analyst coverage reduces price delay (improves efficiency).
* **Expected OLS:** Negative.
* **Expected IV:** Negative (and likely larger magnitude if selection bias is present).

In [45]:
res_delay = run_comparison('Price_Delay', df)
results_store['Price_Delay'] = res_delay


ANALYSIS: Price_Delay
------------------------------
OLS Coefficient: -0.0006 (t=-0.30)
IV Coefficient:  -0.0556 (t=-3.41)
1st Stage F-stat:53.62
------------------------------


## Analysis 2: Liquidity (Bid-Ask Spread)

**Hypothesis:** Higher coverage reduces information asymmetry, leading to lower spreads.
* **Expected OLS:** Negative.
* **Expected IV:** Negative.

In [46]:
res_spread = run_comparison('Avg_Spread', df)
results_store['Avg_Spread'] = res_spread


ANALYSIS: Avg_Spread
------------------------------
OLS Coefficient: 0.0000 (t=0.53)
IV Coefficient:  -0.0005 (t=-2.39)
1st Stage F-stat:53.62
------------------------------


## Analysis 3: Crash Risk (NCSKEW)

**Hypothesis:** Does coverage prevent (monitoring) or exacerbate (herding) crash risk?
* **Metric:** NCSKEW (Higher values = Higher Crash Risk).

In [47]:
res_crash = run_comparison('NCSKEW', df)
results_store['NCSKEW'] = res_crash


ANALYSIS: NCSKEW
------------------------------
OLS Coefficient: 0.0015 (t=0.24)
IV Coefficient:  -0.0296 (t=-0.61)
1st Stage F-stat:53.62
------------------------------


## Summary Table Output
We compile the key coefficients into a clean format for the paper.

In [48]:
summary_rows = []
outcomes = ['Price_Delay', 'Avg_Spread', 'NCSKEW']

for out in outcomes:
    # OLS Stats
    ols_beta = results_store[out]['OLS'].params['Coverage']
    ols_se = results_store[out]['OLS'].std_errors['Coverage']
    ols_t = results_store[out]['OLS'].tstats['Coverage']

    # IV Stats
    iv_beta = results_store[out]['IV'].params['Coverage_Hat']
    iv_se = results_store[out]['IV'].std_errors['Coverage_Hat']
    iv_t = results_store[out]['IV'].tstats['Coverage_Hat']

    summary_rows.append({
        'Outcome': out,
        'Model': 'OLS',
        'Coef': ols_beta,
        'Std Err': ols_se,
        't-stat': ols_t
    })

    summary_rows.append({
        'Outcome': out,
        'Model': 'IV (2SLS)',
        'Coef': iv_beta,
        'Std Err': iv_se,
        't-stat': iv_t
    })

df_results = pd.DataFrame(summary_rows)
print(df_results)

       Outcome      Model      Coef   Std Err    t-stat
0  Price_Delay        OLS -0.000638  0.002152 -0.296538
1  Price_Delay  IV (2SLS) -0.055626  0.016330 -3.406477
2   Avg_Spread        OLS  0.000018  0.000033  0.527486
3   Avg_Spread  IV (2SLS) -0.000468  0.000196 -2.394699
4       NCSKEW        OLS  0.001496  0.006215  0.240630
5       NCSKEW  IV (2SLS) -0.029577  0.048493 -0.609929


In [49]:
df_results.to_csv('data/final_regression_results.csv', index=False)
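
For a paper-style table, the long-format results can be pivoted so that each outcome is a row and each model is a column. The sketch below rebuilds the frame from the values printed above so that it runs on its own. The `stars` helper and its thresholds (two-sided normal critical values at the 10%, 5%, and 1% levels) are an assumed convention, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Values copied from the summary table printed above.
rows = [
    ("Price_Delay", "OLS", -0.000638, 0.002152, -0.296538),
    ("Price_Delay", "IV (2SLS)", -0.055626, 0.016330, -3.406477),
    ("Avg_Spread", "OLS", 0.000018, 0.000033, 0.527486),
    ("Avg_Spread", "IV (2SLS)", -0.000468, 0.000196, -2.394699),
    ("NCSKEW", "OLS", 0.001496, 0.006215, 0.240630),
    ("NCSKEW", "IV (2SLS)", -0.029577, 0.048493, -0.609929),
]
df_results = pd.DataFrame(rows, columns=["Outcome", "Model", "Coef", "Std Err", "t-stat"])


def stars(t_stat):
    """Hypothetical helper: significance stars from two-sided normal critical values."""
    a = abs(t_stat)
    if a >= 2.576:
        return "***"
    if a >= 1.960:
        return "**"
    if a >= 1.645:
        return "*"
    return ""


# One formatted cell per (outcome, model): coefficient, stars, standard error.
df_results["Cell"] = [
    f"{coef:.4f}{stars(t)} ({se:.4f})"
    for coef, se, t in zip(df_results["Coef"], df_results["Std Err"], df_results["t-stat"])
]

# Pivot to wide format and restore the notebook's row and column order.
wide = df_results.pivot(index="Outcome", columns="Model", values="Cell")
wide = wide.loc[["Price_Delay", "Avg_Spread", "NCSKEW"], ["OLS", "IV (2SLS)"]]
print(wide)
```

Because the IV standard errors come from the manual second stage, any stars attached to the IV column should be read as approximate.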

## Price Efficiency (Price_Delay)

* In the naive OLS model, analyst coverage appears unrelated to price delay (coefficient -0.0006, t = -0.30).
* Once we instrument for coverage, the coefficient grows to -0.0556 and becomes statistically significant (t = -3.41).
* This is consistent with analysts causing prices to update faster. The OLS estimate was likely biased toward zero because analysts prefer to cover stocks that are already hard to value or inefficient (selection bias), which masks their beneficial effect. The brokerage closure shock isolates that effect.
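
The selection story above can be illustrated with a small simulation. This is a sketch on synthetic data under assumed parameters, not a replication of the notebook's estimates: an unobserved "hard to value" trait `h` raises both coverage and price delay, the true effect of coverage on delay is set to -0.5, and `z` is a hypothetical instrument that shifts coverage but is unrelated to `h`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

z = rng.binomial(1, 0.5, n).astype(float)                # hypothetical instrument
h = rng.normal(size=n)                                   # unobserved hard-to-value trait
coverage = 0.8 * z + 0.6 * h + rng.normal(size=n)        # analysts select into high-h stocks
delay = -0.5 * coverage + 1.25 * h + rng.normal(size=n)  # true effect of coverage is -0.5

# Naive OLS slope: cov(coverage, delay) / var(coverage).
cov_cd = np.cov(coverage, delay)
b_ols = cov_cd[0, 1] / cov_cd[0, 0]

# Just-identified IV slope: cov(z, delay) / cov(z, coverage).
b_iv = np.cov(z, delay)[0, 1] / np.cov(z, coverage)[0, 1]

print(f"OLS slope: {b_ols:.3f}")
print(f"IV slope:  {b_iv:.3f}")
```

With these parameters the positive selection term roughly cancels the true negative effect in OLS, so the naive slope sits near zero while the IV slope stays close to -0.5. That pattern mirrors the qualitative gap between the OLS and IV estimates reported for `Price_Delay`, although it does not show that this mechanism is what drives the actual data.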

## Liquidity (Avg_Spread)
* Naive OLS shows a null, slightly positive relationship (coefficient 0.0000, t = 0.53). This is counterintuitive, since analysts should help liquidity.
* The IV estimate flips the sign and becomes significant (coefficient -0.0005, t = -2.39).
* This is consistent with analysts reducing bid-ask spreads. The upward bias in OLS suggests that coverage is correlated with unobserved characteristics that widen spreads (for example, volatility or risk). Addressing endogeneity indicates that losing an analyst widens spreads.

## Crash Risk (NCSKEW)
* IV: The coefficient is negative (-0.0296), which would suggest that coverage reduces crash risk, but it is statistically insignificant (t = -0.61).
* We cannot reject the null hypothesis of no effect. This suggests that while analysts improve efficiency and liquidity, they do not necessarily prevent (or cause) stock price crashes. This is a valid and interesting "non-result" that contrasts with some literature suggesting analysts herd and cause crashes.