
Homework Reflection 9

1. Write some code that will use a simulation to estimate the standard deviation of the coefficient when there is heteroskedasticity.  
Compare these standard errors to those found via statsmodels OLS or a similar linear regression model.

2. Write some code that will use a simulation to estimate the standard deviation of the coefficient when errors are highly correlated / non-independent.
Compare these standard errors to those found via statsmodels OLS or a similar linear regression model.

Show that if the correlation between coefficients is high enough, then the estimated standard deviation of the coefficient, using bootstrap errors,
might not match that found by a full simulation of the Data Generating Process.  (This can be fixed if you have a huge amount of data for the bootstrap simulation.)

Estimate standard error under heteroskedasticity

Heteroskedasticity means the error variance is not constant: it changes with X or W.

In [4]:
import numpy as np
import statsmodels.api as sm

# Set simulation parameters
np.random.seed(42)
n_simulations = 1000
n = 1000
beta_estimates = []

for _ in range(n_simulations):
    W = np.random.normal(0, 1, n)
    X = W + np.random.normal(0, 1, n)
    # Heteroskedastic noise: depends on W
    noise = np.random.normal(0, 1 + 0.5 * np.abs(W))
    Y = 1 * X - W + noise

    # Add constant for OLS
    XW = sm.add_constant(np.column_stack([X, W]))
    model = sm.OLS(Y, XW).fit()
    beta_estimates.append(model.params[1])  # coefficient for X

# Simulation-based standard deviation of X's coefficient
simulated_std = np.std(beta_estimates)
print(f"Simulated std of beta (X): {simulated_std:.4f}")


Simulated std of beta (X): 0.0474


In [11]:
print(np.std(beta_estimates))


0.047403283183051995


In [12]:
from scipy.stats import skew
print(skew(beta_estimates))


-0.09132191293582789


Compare to statsmodels’ OLS standard error (one run)

In [5]:
# Just one regression to compare OLS standard error
W = np.random.normal(0, 1, n)
X = W + np.random.normal(0, 1, n)
noise = np.random.normal(0, 1 + 0.5 * np.abs(W))
Y = 1 * X - W + noise

XW = sm.add_constant(np.column_stack([X, W]))
model = sm.OLS(Y, XW).fit()
ols_std = model.bse[1]  # standard error of X
print(f"OLS std of beta (X): {ols_std:.4f}")


OLS std of beta (X): 0.0439


Estimate std error when errors are correlated

Non-independent errors: each error term is correlated with nearby error terms.




In [6]:
beta_corr_estimates = []
for _ in range(n_simulations):
    W = np.random.normal(0, 1, n)
    X = W + np.random.normal(0, 1, n)

    # Correlated errors: add AR(1)-style autocorrelation
    epsilon = np.random.normal(0, 1, n)
    for i in range(1, n):
        epsilon[i] += 0.9 * epsilon[i-1]  # correlation with previous error

    Y = 1 * X - W + epsilon

    XW = sm.add_constant(np.column_stack([X, W]))
    model = sm.OLS(Y, XW).fit()
    beta_corr_estimates.append(model.params[1])

# Standard deviation from simulation
simulated_corr_std = np.std(beta_corr_estimates)
print(f"Simulated std with correlated errors: {simulated_corr_std:.4f}")


Simulated std with correlated errors: 0.0743


Show bootstrap error can mismatch under high collinearity

If X and W are highly correlated, the bootstrap standard error might not match the true sampling variability of the coefficient.

In [7]:
from sklearn.utils import resample

# Generate one dataset with highly correlated X and W
W = np.random.normal(0, 1, n)
X = W + np.random.normal(0, 0.01, n)  # very little noise → X ≈ W
Y = 1 * X - W + np.random.normal(0, 1, n)

# Run bootstrap
bootstrap_betas = []
for _ in range(500):
    Y_sample, X_sample, W_sample = resample(Y, X, W)
    XW_sample = sm.add_constant(np.column_stack([X_sample, W_sample]))
    model = sm.OLS(Y_sample, XW_sample).fit()
    bootstrap_betas.append(model.params[1])

# Bootstrap std
bootstrap_std = np.std(bootstrap_betas)
print(f"Bootstrap std with high collinearity: {bootstrap_std:.4f}")

# Compare to statsmodels
XW = sm.add_constant(np.column_stack([X, W]))
model = sm.OLS(Y, XW).fit()
ols_std = model.bse[1]
print(f"OLS std with high collinearity: {ols_std:.4f}")


Bootstrap std with high collinearity: 3.2284
OLS std with high collinearity: 3.2500


In [9]:
n_simulations = 1000
significant_count = 0
n = 1000

for _ in range(n_simulations):
    # Simulate data using the same logic as in previous cells
    W = np.random.normal(0, 1, n)
    X = W + np.random.normal(0, 1, n)
    # Heteroskedastic noise: depends on W
    noise = np.random.normal(0, 1 + 0.5 * np.abs(W))
    Y = 1 * X - W + noise

    XW = sm.add_constant(np.column_stack([X, W]))
    model = sm.OLS(Y, XW).fit()

    t_value = model.tvalues[1]  # t-value of X
    if abs(t_value) > 1.96:
        significant_count += 1

power = significant_count / n_simulations
print(f"Power ≈ {power:.2%}")

Power ≈ 100.00%


In [14]:


np.random.seed(42)

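def simulate_power(B, A=1, C=10, D=1000, n_simulations=500):
    """Return the share of simulations in which X's coefficient has |t| > 1.96.

    A: true coefficient on X; B: std of the part of X independent of W;
    C: std of the noise in Y; D: sample size.
    """
    detected = 0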
    for _ in range(n_simulations):
        W = np.random.normal(0, 1, D)
        X = W + np.random.normal(0, B, D)
        Y = A * X - W + np.random.normal(0, C, D)
        XW = sm.add_constant(np.column_stack([X, W]))
        model = sm.OLS(Y, XW).fit()
        t_value = model.tvalues[1]  # t-value for X
        if abs(t_value) > 1.96:
            detected += 1
    return detected / n_simulations  # return proportion detected

# Try different values of B
B_values = [0.2, 0.6, 1.8, 5.4]
for B in B_values:
    power = simulate_power(B)
    print(f"B = {B}: Detection rate ≈ {power:.2f}")


B = 0.2: Detection rate ≈ 0.11
B = 0.6: Detection rate ≈ 0.47
B = 1.8: Detection rate ≈ 1.00
B = 5.4: Detection rate ≈ 1.00


In [15]:


np.random.seed(42)
n_simulations = 1000
n = 100  # D = 100

detection_rates = []

for A in [0.5, 1.0, 2.0, 4.0]:
    detections = 0
    for _ in range(n_simulations):
        W = np.random.normal(0, 1, n)
        X = W + np.random.normal(0, 1, n)  # B = 1
        Y = A * X - W + np.random.normal(0, 10, n)  # C = 10

        XW = sm.add_constant(np.column_stack([X, W]))
        model = sm.OLS(Y, XW).fit()
        t_value = model.tvalues[1]  # t-stat for X
        if abs(t_value) > 1.96:
            detections += 1
    detection_rate = detections / n_simulations
    detection_rates.append((A, detection_rate))
    print(f"A = {A}, Detection rate = {detection_rate:.2f}")


A = 0.5, Detection rate = 0.08
A = 1.0, Detection rate = 0.16
A = 2.0, Detection rate = 0.51
A = 4.0, Detection rate = 0.98
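
As a rough cross-check on the two detection-rate simulations, the power can be approximated analytically. Under this setup the standard error of X's coefficient is about C / (B * sqrt(D)), because B is the standard deviation of the part of X not explained by W. The function below is a sketch under a normal approximation; `approx_power` is a hypothetical helper, not something from the homework.

```python
import numpy as np
from scipy.stats import norm

def approx_power(A, B, C, D, z_crit=1.96):
    """Normal-approximation power for the two-sided test on X's coefficient."""
    se = C / (B * np.sqrt(D))  # approximate standard error of the coefficient
    shift = A / se             # expected t-statistic
    # P(|Z + shift| > z_crit) for standard normal Z
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Grid over B with A = 1, C = 10, D = 1000
for B in [0.2, 0.6, 1.8, 5.4]:
    print(f"B = {B}: approx power = {approx_power(1, B, 10, 1000):.2f}")

# Grid over A with B = 1, C = 10, D = 100
for A in [0.5, 1.0, 2.0, 4.0]:
    print(f"A = {A}: approx power = {approx_power(A, 1, 10, 100):.2f}")
```

The approximation ignores the difference between the t and normal distributions and the sampling noise in the estimated standard error, so small gaps from the simulated rates are expected.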
