# Problem Set 7
### Nikhilesh Belulkar

## Problem 3

In [None]:
import pandas as pd 
from statsmodels.tsa.statespace.sarimax import SARIMAX

In [10]:
df = pd.read_stata('stockprice.dta')
df = df.set_index('ttt')
df.index.freq = 'MS' 
df

ttt,exreturn,ln_divyield
1931-01-01,5.9650,-2.8223
1931-02-01,10.3053,-2.9321
1931-03-01,-6.8408,-2.8786
1931-04-01,-10.4481,-2.7825
1931-05-01,-14.3581,-2.6547
...,...,...
2002-08-01,1.0056,-3.9459
2002-09-01,-10.4640,-3.8359
2002-10-01,5.9082,-3.8861
2002-11-01,4.7428,-3.9347


In [11]:
x = df.isna().sum()
x

exreturn       0
ln_divyield    0
dtype: int64

a) Autoregressive forecasts. Estimate AR(1), AR(2), and AR(4) models of one-month returns. Construct a table that reports the coefficients, the F-statistics and p-values testing the hypothesis that the lag(s) of the one-month return do not help to predict the one-month return, and the adjusted $R^2$ of each regression. What are your conclusions on whether autoregressive models are helpful in predicting stock returns?

In [15]:
## AR Models Table
import statsmodels.api as sm

# Helper to run AR(p) via OLS and return metrics
def run_ar_model(data, p):
    # Create lags
    temp_df = pd.DataFrame({'y': data})
    lag_cols = []
    for i in range(1, p + 1):
        col_name = f'L{i}'
        temp_df[col_name] = temp_df['y'].shift(i)
        lag_cols.append(col_name)
    
    # Drop missing values created by lags
    temp_df = temp_df.dropna()
    
    # Define X (lags + constant) and y
    X = sm.add_constant(temp_df[lag_cols])
    y = temp_df['y']
    
    # Fit OLS
    model = sm.OLS(y, X).fit()
    
    # Extract results
    res = {
        'Model': f'AR({p})',
        'Adj. R2': model.rsquared_adj,
        'F-stat': model.fvalue,
        'Prob(F-stat)': model.f_pvalue,
        'Intercept': model.params['const']
    }
    
    # Add lag coefficients
    for col in lag_cols:
        res[col] = model.params[col]
        
    return res

# Run for AR(1), AR(2), AR(4)
results = []
for p in [1, 2, 4]:
    results.append(run_ar_model(df['exreturn'], p))

# Create Table
ar_table = pd.DataFrame(results)
ar_table = ar_table.set_index('Model')

# Reorder columns
cols = [c for c in ar_table.columns if c.startswith('L')] + ['Intercept', 'F-stat', 'Prob(F-stat)', 'Adj. R2']
ar_table = ar_table[cols]

# Display
ar_table

Model,L1,L2,L3,L4,Intercept,F-stat,Prob(F-stat),Adj. R2
AR(1),0.082242,,,,0.454026,5.864287,0.015657,0.005611
AR(2),0.081587,-0.016811,,,0.451952,2.910158,0.055004,0.004417
AR(4),0.083958,-0.002732,-0.101778,0.039725,0.494774,3.885751,0.003902,0.01326


Conclusion about their ability to predict stock returns:
1. The lags are jointly significant at the 5% level in the AR(1) and AR(4) models (p = 0.016 and p = 0.004) but only marginally so in the AR(2) model (p = 0.055). Even so, none of the models explains much of the variation in stock returns: the adjusted $R^2$ is below 1.5% in every regression.
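
As a cross-check on how the table's columns relate to each other, the sketch below computes the joint F-statistic and the adjusted $R^2$ by hand from the restricted (intercept-only) and unrestricted sums of squared residuals. It runs on simulated data, not on `stockprice.dta`: the AR(1) coefficient `phi = 0.08`, the sample size, and the helper name `ar_fit_stats` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a weakly persistent AR(1) series (phi = 0.08 is an assumption)
n_obs, phi = 860, 0.08
eps = rng.standard_normal(n_obs)
y = np.zeros(n_obs)
for t in range(1, n_obs):
    y[t] = phi * y[t - 1] + eps[t]

def ar_fit_stats(series, p):
    """OLS fit of an AR(p) with intercept; returns (F, R2, adjusted R2)."""
    target = series[p:]
    # Column i holds the i-th lag, aligned with `target`
    lags = [series[p - i : len(series) - i] for i in range(1, p + 1)]
    X = np.column_stack([np.ones(len(target))] + lags)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    ssr_u = np.sum((target - X @ beta) ** 2)        # unrestricted: all p lags
    ssr_r = np.sum((target - target.mean()) ** 2)   # restricted: intercept only
    n = len(target)
    f_stat = ((ssr_r - ssr_u) / p) / (ssr_u / (n - p - 1))
    r2 = 1 - ssr_u / ssr_r
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return f_stat, r2, adj_r2

for p in (1, 2, 4):
    f_stat, r2, adj_r2 = ar_fit_stats(y, p)
    print(f"AR({p}): F = {f_stat:.3f}, R2 = {r2:.4f}, adj. R2 = {adj_r2:.4f}")
```

With several hundred observations, a very small $R^2$ can still produce an F-statistic large enough to reject the null, which is why statistical significance and explanatory power can diverge as they do in the table.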

In [16]:
# Part (b) Regression Implementation

# 1. Create 6-month rolling average return: R6_t = 1/6 * sum(R1_t ... R1_{t-5})
# Note: rolling(6).mean() computes exactly this.
df['R6'] = df['exreturn'].rolling(window=6).mean()

# 2. Create lags t-6, t-7, t-8
df['R6_L6'] = df['R6'].shift(6)
df['R6_L7'] = df['R6'].shift(7)
df['R6_L8'] = df['R6'].shift(8)

# 3. Run Regression (1): R6_t = b0 + b6*R6_{t-6} + b7*R6_{t-7} + b8*R6_{t-8}
reg_data = df[['R6', 'R6_L6', 'R6_L7', 'R6_L8']].dropna()
X_b = sm.add_constant(reg_data[['R6_L6', 'R6_L7', 'R6_L8']])
y_b = reg_data['R6']

model_b = sm.OLS(y_b, X_b).fit()
print(model_b.summary())

# Parts (c) and (d): Empirical Checks

# Check for Strict Exogeneity: Are residuals correlated with FUTURE regressors (leads)?
# If Strictly Exogenous, E[u_t | X_{t+k}] = 0.

residuals = model_b.resid
# We can check correlation with, say, R6_{t+1} (Lead 1)
# Note: We need to align the index.
check_df = pd.DataFrame({'resid': residuals, 'R6_Lead1': df['R6'].shift(-1)}).dropna()

print("\n--- Empirical Check for Strict Exogeneity ---")
print("Correlation between residuals and R6_{t+1}:", check_df.corr().iloc[0,1])

# Regression of residuals on Lead regressor
X_check = sm.add_constant(check_df['R6_Lead1'])
model_check = sm.OLS(check_df['resid'], X_check).fit()
print(model_check.summary())

                            OLS Regression Results                            
Dep. Variable:                     R6   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.070
Date:                Fri, 28 Nov 2025   Prob (F-statistic):              0.361
Time:                        15:46:59   Log-Likelihood:                -1869.5
No. Observations:                 851   AIC:                             3747.
Df Residuals:                     847   BIC:                             3766.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5267      0.077      6.818      0.0

### Part (b): 6-month return autoregressions

**Question:** Why wouldn't you use a more recent 6-month return, for example, why wouldn't you include $R6_{t-1}$ and $R6_{t-2}$ as regressors?

**Answer:**
If we are standing at time $t-6$ attempting to forecast the return over the next six months (which will be realized at time $t$ as $R6_t$), we can only use information available at time $t-6$ or earlier.

- $R6_{t-1}$ is the 6-month return ending at $t-1$. It averages the one-month returns $R1_{t-1}, R1_{t-2}, \dots, R1_{t-6}$, and all of these except $R1_{t-6}$ are realized after $t-6$.
- Similarly, $R6_{t-2}$ includes $R1_{t-2}, \dots, R1_{t-5}$, which are realized after $t-6$.

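Therefore, these variables are not observable at the time the forecast is made ($t-6$) and cannot be included in a predictive regression. They also share monthly returns with $R6_t$ itself (for example, $R6_{t-1}$ and $R6_t$ both contain $R1_{t-1}, \dots, R1_{t-5}$), so any correlation they showed with $R6_t$ would be mechanical rather than predictive.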
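
The timing argument can be checked by simply counting months. The sketch below uses an arbitrary integer date index (`t = 100` is an assumption, as are the helper names) and, for each candidate regressor, counts how many of its six monthly returns are still unobserved at $t-6$ and how many it shares with $R6_t$.

```python
def months_in_r6(end):
    """Months whose one-month returns enter the 6-month average ending at `end`."""
    return set(range(end - 5, end + 1))

t = 100                  # arbitrary illustrative date index
forecast_date = t - 6    # the forecast of R6_t is made with information up to here

def unobserved_months(lag):
    """Months in R6_{t-lag} that fall after the forecast date."""
    return sorted(m for m in months_in_r6(t - lag) if m > forecast_date)

def shared_months(lag):
    """Months that R6_{t-lag} has in common with R6_t."""
    return sorted(months_in_r6(t - lag) & months_in_r6(t))

for lag in (1, 2, 6, 7, 8):
    print(f"R6_(t-{lag}): {len(unobserved_months(lag))} months not yet observed, "
          f"{len(shared_months(lag))} months shared with R6_t")
```

Only the lags 6, 7, and 8 used in the regression have no unobserved months and no overlap with the dependent variable.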

### Part (c): Exogeneity

**Question:** Under the null hypothesis that lagged returns cannot be used to forecast future returns, are the regressors in Equation (1) plausibly exogenous?

**Answer:**
**Yes.** Under the null hypothesis (e.g., the Efficient Market Hypothesis), stock returns are unpredictable and serially uncorrelated. The error term $u_t$ is the "surprise" in the six-month return $R6_t$, built from the return innovations in months $t-5$ through $t$. The regressors ($R6_{t-6}$, $R6_{t-7}$, $R6_{t-8}$) consist entirely of returns realized at or before $t-6$. Since future return innovations cannot be predicted from past information under the null, the error term has mean zero conditional on the regressors of the same equation: $E[u_t \mid X_t] = 0$, where $X_t = (R6_{t-6}, R6_{t-7}, R6_{t-8})$.
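
The argument can be written out under one simple version of the null. The symbols $\mu$ and $\varepsilon_t$ are assumptions introduced here, not part of the problem statement: suppose $R1_t = \mu + \varepsilon_t$ with $E[\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots] = 0$, so that the slope coefficients in Equation (1) are zero and the intercept is $\mu$. Then

$$
R6_t = \frac{1}{6}\sum_{j=0}^{5} R1_{t-j} = \mu + \frac{1}{6}\sum_{j=0}^{5}\varepsilon_{t-j}
\quad\Longrightarrow\quad
u_t = \frac{1}{6}\sum_{j=0}^{5}\varepsilon_{t-j}.
$$

The regressors $R6_{t-6}$, $R6_{t-7}$, $R6_{t-8}$ depend only on $\varepsilon_{t-6}, \dots, \varepsilon_{t-13}$, all of which precede every innovation in $u_t$. Writing $X_t$ for these three regressors and applying the law of iterated expectations term by term gives

$$
E[u_t \mid X_t] = \frac{1}{6}\sum_{j=0}^{5} E[\varepsilon_{t-j} \mid X_t] = 0.
$$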

### Part (d): Strict Exogeneity

**Question:** Under the null hypothesis that lagged returns cannot be used to forecast future returns, are the regressors in Equation (1) plausibly strictly exogenous?

**Answer:**
**No.** Strict exogeneity requires the error term $u_t$ to be uncorrelated with the regressors at *all* points in time (past, present, and future). In time series models with lagged dependent variables (or autoregressive structures), this condition almost always fails.

Specifically, the error term $u_t$ is the unexpected part of $R6_t$, so the two are correlated by construction. However, $R6_t$ will appear as a *regressor* in future equations: it is a regressor in the equations for $R6_{t+6}$, $R6_{t+7}$, and $R6_{t+8}$. Because the current error is correlated with these future regressors (a feedback effect), strict exogeneity does not hold.
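
A small simulation makes the contrast between parts (c) and (d) visible. This is a sketch on simulated data, not on the problem-set data: it assumes monthly returns are i.i.d. standard normal with mean zero, so that under the null the true error of Equation (1) is simply $u_t = R6_t$. The helper `corr_at_offset` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
r1 = rng.standard_normal(50_000)                     # i.i.d. monthly returns (the null)
r6 = np.convolve(r1, np.ones(6) / 6, mode="valid")   # trailing 6-month averages

def corr_at_offset(u, x, k):
    """Sample correlation between u_t and x_{t+k}; k may be negative."""
    if k >= 0:
        a, b = u[: len(u) - k], x[k:]
    else:
        a, b = u[-k:], x[: len(x) + k]
    return np.corrcoef(a, b)[0, 1]

u = r6  # with zero mean and zero slopes, the true error equals R6_t

past = corr_at_offset(u, r6, -6)   # u_t vs. R6_{t-6}, a regressor in the same equation
lead1 = corr_at_offset(u, r6, 1)   # u_t vs. R6_{t+1}, a regressor in later equations

print(f"corr(u_t, R6_(t-6)) = {past:.3f}")
print(f"corr(u_t, R6_(t+1)) = {lead1:.3f}")
```

The first correlation should be close to zero, consistent with exogeneity. The second should be close to $5/6$, because $R6_t$ and $R6_{t+1}$ share five of their six monthly returns, which is the failure of strict exogeneity described above.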