**Fin 585**  
**Diether**  
**Problem Set**  

**Testing the CAPM using Analyst Disagreement Portfolios**

The primary purpose of this problem set is to give you a portfolio formation task that makes you go through all five steps of our portfolio formation framework including testing the CAPM as a model.

1. Data Preparation.

2. Create portfolio formation or criterion variable.

3. Bin the data based on the formation variable.

4. Portfolio creation using the bins.

5. Test the historical performance and test a model.

A secondary goal is to introduce another interesting portfolio strategy. It produces a large spread in average return. Given that, it's a good set of portfolios for testing the CAPM.

To accomplish the programming tasks, you should be able to adapt a lot of code we've used before and apply it to this situation. <br><br>

**Overview**

In this problem set you reproduce another important empirical result in academic finance. Specifically, you reproduce the **dispersion effect** (or the analyst disagreement effect) of Diether, Malloy, and Scherbina (2002). This empirical result spawned a large literature in academic finance, and certainly some quant funds have traded on this effect.

Dispersion (or analyst disagreement) portfolios are formed based on the standard deviation of analyst EPS (earnings per share) forecasts over a given period. Here the standard deviation of analyst EPS forecasts is the standard deviation across analysts for a given stock and month (most stocks have between 3 and 13 analysts covering them). Diether, Malloy, and Scherbina don't use the raw standard deviation. Instead, they scale the standard deviation of analyst forecasts by the absolute value of the mean forecast. Therefore, for a given month ($t$), dispersion for stock $i$ is defined as follows:
\begin{align*}
disp_{it} &= \frac{stdev_{it}}{|mean_{it}|}
\end{align*}
DMS form dispersion portfolios using $disp_{i,t-1}$; in other words, they lag dispersion one month. In this homework you will do the same.
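
The definition and the one-month lag can be illustrated on a toy panel. This is only a sketch: the stock `AAA`, the column name `month`, and all of the numbers are made up for illustration and are not taken from the IBES file.

```python
import numpy as np
import pandas as pd

# Toy IBES-style panel: one hypothetical stock, three consecutive months.
toy = pd.DataFrame({
    'cusip':   ['AAA', 'AAA', 'AAA'],
    'month':   pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'meanest': [2.00, -0.50, 1.00],
    'stdev':   [0.10, 0.05, 0.20],
})

# Dispersion: forecast standard deviation scaled by |mean forecast|.
toy['disp'] = toy['stdev'] / toy['meanest'].abs()

# Lag within each stock so that month t uses dispersion from month t-1.
toy = toy.sort_values(['cusip', 'month'])
toy['displag'] = toy.groupby('cusip')['disp'].shift(1)

print(toy)
```

Note that the absolute value keeps dispersion positive when the mean forecast is negative, and that the first month of each stock has no lagged value. A mean forecast of exactly zero would make the ratio infinite, so real data may need a check for that case.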

There are three datasets for this problem set. The first is the CRSP data (security prices and returns) covering January 1980 to September 2024. The second is the analyst earnings per share data from IBES, which covers the same period. Both are monthly. The third, a file of market and riskfree returns used for the CAPM tests, is described further below. The stock-level identifier in the IBES data is the CUSIP, so I also included CUSIPs in the CRSP data. Together, the CUSIP and the calendar month uniquely identify the analyst earnings per share observations.

You can download the CRSP data directly using the following link: [the CRSP data](https://diether.org/prephd/08-mstk_80-24.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                              |
|---------|----------------------------------------------------------|
|permno   | stock identifier                                         |
|cusip    | stock identifier also in IBES data                       |
|caldt    | calendar date (the day is not truncated to 1)            |
|ret      | monthly return                                           |
|prc      | stock price (not lagged, contemporaneous with returns)   |   


You can download the IBES data directly using the following link: [the IBES data](https://diether.org/prephd/08-ibes_eps_analyst.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                          |
|---------|------------------------------------------------------|
|cusip    | stock identifier also in CRSP data                   |
|caldt    | calendar date (the day is not truncated to 1)        |
|meanest  | average analyst forecast for that month/stock        |
|stdev    | standard deviation of forecasts for that month/stock |


Finally, to test the CAPM you are going to need a proxy for the market portfolio and for the riskfree rate. Data for these can be found at [Ken French's Data Library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). For your convenience I have created a csv file that contains both these variables, and it can be loaded directly into a dataframe from my website (see the code below). The dataframe contains the excess return on a proxy for the market portfolio (`exmkt`), a proxy for the riskfree rate (`rf`), and some other portfolios you can ignore. The returns from Ken French's library are in percent (raw returns multiplied by 100), so after forming your portfolios, multiply your portfolio returns by 100 to match the units of the market return and riskfree rate.<br><br>
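
The unit conversion can be sketched as follows. The frames `port` and `fac_toy` and every number in them are illustrative assumptions. Only the column names `caldt`, `exmkt`, and `rf` come from the description above.

```python
import pandas as pd

# Hypothetical portfolio returns in decimal form (0.012 means 1.2%).
port = pd.DataFrame({
    'caldt': pd.to_datetime(['2024-05-31', '2024-06-28']),
    'p0': [0.012, -0.004],
})

# Illustrative factor rows in percent units, like the factor file.
fac_toy = pd.DataFrame({
    'caldt': pd.to_datetime(['2024-05-31', '2024-06-28']),
    'exmkt': [4.34, 2.77],
    'rf': [0.44, 0.41],
})

# Convert the portfolio return to percent so that all series share the same units.
port['p0'] = port['p0'] * 100

# Merge on date and form the portfolio's excess return.
merged = port.merge(fac_toy, on='caldt', how='inner')
merged['exret_p0'] = merged['p0'] - merged['rf']

print(merged)
```

Forgetting the conversion does not change the estimated t-statistics much, but it shrinks the estimated beta by a factor of 100, which is usually the first sign that the units are mismatched.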


**Tasks**

1. Form quintile based equal-weight dispersion portfolios where dispersion is lagged one month. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). You should exclude low price stocks from your portfolios (price below $5). 

2. Test the CAPM by running a time series CAPM regression for each of the analyst dispersion portfolios:
$$
r_{pt} - r_{ft} = \alpha_p + \beta_{pM}( r_{Mt} - r_{ft}) + \epsilon_{pt}
$$
Consolidate all your regression results into one table using the `Regtable` function in the BYU Finance library: [Regtable Docs](https://fin-library.readthedocs.io/en/latest/regtables.html)

3. Interpret the regression results from question 2. What can you infer? Can you reject that the CAPM holds? Is the market portfolio the tangency portfolio? Explain your answers.

4. Create a spread portfolio that goes 100% long in portfolio 0 and 100% short in portfolio 4. Test the CAPM using this portfolio. Can you reject the CAPM? Explain your answers.

5. Estimate the security market line using the data available for this homework. Specifically, estimate the following line:
$$
E(r_p) = r_f + \beta_{p}\bigl[E(r_M) - r_f\bigr]
$$
You don't need to plot the estimated line, but report your estimates of $r_f$ and $E(r_M) - r_f$ as a line. So something like:
$$
\overline{r}_p = 4\% + \hat{\beta}_p(6\%)
$$

6. Why is the intercept in a time series CAPM regression called an *average abnormal return*? Explain.
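
The time series regression in Tasks 2 and 4 can be sketched on simulated data. This is a minimal illustration, not the required solution: it uses plain NumPy least squares instead of `statsmodels` and `Regtable`, and the return series, the "true" alpha and beta, and all variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 240  # months of simulated data

# Simulated market excess return and one portfolio with known alpha and beta
# (percent per month); these are not real data.
exmkt = rng.normal(0.6, 4.5, T)
true_alpha, true_beta = 0.3, 1.2
exret = true_alpha + true_beta * exmkt + rng.normal(0.0, 1.0, T)

# OLS of the portfolio excess return on a constant and the market excess return.
X = np.column_stack([np.ones(T), exmkt])
coef, *_ = np.linalg.lstsq(X, exret, rcond=None)
alpha_hat, beta_hat = coef

# Conventional (homoskedastic) standard errors and the t-statistic on alpha.
resid = exret - X @ coef
sigma2 = resid @ resid / (T - 2)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_alpha = alpha_hat / se[0]

# With an intercept, OLS alpha equals the average excess return minus
# beta times the average market excess return.
alpha_from_means = exret.mean() - beta_hat * exmkt.mean()

print(f"alpha = {alpha_hat:.3f} (t = {t_alpha:.2f}), beta = {beta_hat:.3f}")
```

The last identity is one way to think about Task 6: the intercept is the part of the portfolio's average excess return that is left over after subtracting what its market exposure would predict.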

In [15]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from scipy import stats
from finance_byu.summarize import summary
from finance_byu.regtables import Regtable

In [16]:
fac = pd.read_csv('https://diether.org/prephd/08-factors.csv',parse_dates=['caldt'])
fac

Unnamed: 0,caldt,exmkt,smb,hml,umd,rf
0,1927-01-31,-0.06,-0.37,4.54,0.36,0.25
1,1927-02-28,4.18,0.04,2.94,-2.14,0.26
2,1927-03-31,0.13,-1.65,-2.61,3.61,0.30
3,1927-04-30,0.46,0.30,0.81,4.30,0.25
4,1927-05-31,5.44,1.53,4.73,3.00,0.30
...,...,...,...,...,...,...
1168,2024-05-31,4.34,0.78,-1.67,-0.02,0.44
1169,2024-06-28,2.77,-3.06,-3.31,0.90,0.41
1170,2024-07-31,1.24,6.80,5.74,-2.42,0.45
1171,2024-08-30,1.61,-3.55,-1.13,4.79,0.48


In [17]:
stk = pd.read_csv('08-mstk_80-24.csv',parse_dates=['caldt'])
stk

Unnamed: 0,permno,caldt,cusip,ret,prc,me
0,10000,1986-01-31,68391610,,4.37500,16.1000
1,10000,1986-02-28,68391610,-0.257143,3.25000,11.9600
2,10000,1986-03-31,68391610,0.365385,4.43750,16.3300
3,10000,1986-04-30,68391610,-0.098592,4.00000,15.1720
4,10000,1986-05-30,68391610,-0.222656,3.10938,11.7939
...,...,...,...,...,...,...
2741076,93436,2024-05-31,88160R10,-0.028372,178.08000,567932.0000
2741077,93436,2024-06-28,88160R10,0.111186,197.88000,632155.0000
2741078,93436,2024-07-31,88160R10,0.172781,232.07000,741380.0000
2741079,93436,2024-08-30,88160R10,-0.077391,214.11000,684004.0000


In [18]:
ibes = pd.read_csv("08-ibes_eps_analyst.csv",parse_dates=['caldt'])
ibes

Unnamed: 0,cusip,caldt,meanest,stdev
0,00000000,2010-06-17,1.00,0.01
1,00000000,2010-07-15,0.98,0.02
2,00000000,2016-04-14,0.25,0.08
3,00000000,2016-05-19,0.31,0.01
4,00000000,2016-06-16,0.31,0.01
...,...,...,...,...
1827951,ZNPRICES,2024-07-18,1.19,0.05
1827952,ZNPRICES,2024-08-15,1.20,0.06
1827953,ZNPRICES,2024-09-19,1.20,0.06
1827954,ZNPRICES,2024-10-17,1.21,0.05


<br>

**Hint About Merging the two Datasets**

In the datasets I've included the full calendar dates of the observations. Even though the frequency for both is monthly, the timing is not the same. The CRSP data are from the last trading day in the month, and the IBES data tend to be around the middle of the month. Therefore, to merge these dataframes you need to create a new date variable that only preserves uniqueness at the year-month level. Here is a shortcut way to accomplish that:

In [19]:
stk['mdt'] = stk['caldt'].values.astype('datetime64[M]')
stk.head(5)

Unnamed: 0,permno,caldt,cusip,ret,prc,me,mdt
0,10000,1986-01-31,68391610,,4.375,16.1,1986-01-01
1,10000,1986-02-28,68391610,-0.257143,3.25,11.96,1986-02-01
2,10000,1986-03-31,68391610,0.365385,4.4375,16.33,1986-03-01
3,10000,1986-04-30,68391610,-0.098592,4.0,15.172,1986-04-01
4,10000,1986-05-30,68391610,-0.222656,3.10938,11.7939,1986-05-01


In [20]:
ibes['mdt'] = ibes['caldt'].values.astype('datetime64[M]')
ibes.drop('caldt', axis=1, inplace=True)
ibes.head(5)

Unnamed: 0,cusip,meanest,stdev,mdt
0,0,1.0,0.01,2010-06-01
1,0,0.98,0.02,2010-07-01
2,0,0.25,0.08,2016-04-01
3,0,0.31,0.01,2016-05-01
4,0,0.31,0.01,2016-06-01


What is the code above doing? Pandas stores dates as full timestamps (traditionally with nanosecond precision). NumPy, the library pandas builds on for its date functionality, also provides datetime types at coarser levels of precision, including monthly. The code above casts the original dates to NumPy's monthly type (`datetime64[M]`), which discards all information finer than the month. When pandas converts the result back to its own timestamp type, the day is set to 1 for every observation.
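
The effect is easy to confirm on a small example. In this sketch the two dates are hypothetical: one is a month-end date like those in CRSP, and the other is a mid-month date like those in IBES.

```python
import pandas as pd

# Two hypothetical dates in the same month: a CRSP-style month end
# and an IBES-style mid-month date.
toy = pd.DataFrame({'caldt': pd.to_datetime(['1986-01-31', '1986-01-16'])})

# Cast to NumPy's month-precision type; assigning the result back to the
# dataframe converts it to a pandas timestamp column with the day set to 1.
toy['mdt'] = toy['caldt'].values.astype('datetime64[M]')

print(toy)
```

Both rows end up with the same `mdt` value, which is what makes the month usable as a merge key.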

Now you should be able to merge the two datasets.
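
Before merging, it can be worth checking that the IBES side really has one row per stock and month. The sketch below uses two made-up frames (`crsp_toy` and `ibes_toy`). The `validate` argument of `merge` is a standard pandas option, but whether you want it in your own solution is a judgment call.

```python
import pandas as pd

# Hypothetical CRSP-style rows: stock BBB has no analyst coverage.
crsp_toy = pd.DataFrame({
    'permno': [1, 1, 2],
    'cusip': ['AAA', 'AAA', 'BBB'],
    'mdt': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-01-01']),
    'ret': [0.01, -0.02, 0.03],
})

# Hypothetical IBES-style rows keyed by cusip and month.
ibes_toy = pd.DataFrame({
    'cusip': ['AAA', 'AAA'],
    'mdt': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'meanest': [1.00, 1.10],
    'stdev': [0.05, 0.11],
})

# Check that the IBES side has at most one row per stock-month.
has_dupes = ibes_toy.duplicated(['cusip', 'mdt']).any()

# A left merge keeps every CRSP row; validate='m:1' raises an error
# if the IBES keys are not unique.
merged = crsp_toy.merge(ibes_toy, on=['cusip', 'mdt'], how='left', validate='m:1')

print(merged)
```

Rows without a matching forecast get missing values for `meanest` and `stdev`. The key columns also need matching dtypes in both frames, so it is worth confirming that `cusip` is read as a string in each file.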

In [24]:
# merge on cusip and month-date
df = stk.merge(ibes, on = ['mdt', 'cusip'], how = 'left') 

# create dispersion variable
df['disp'] = df['stdev']/np.abs(df['meanest'])

# lag dispersion: only information available last month can be used to form this month's portfolios
# (shift() assumes rows are sorted by date within permno and that months are consecutive)
df['displag'] = df.groupby('permno')['disp'].shift()

# lag price
df['prclag'] = df.groupby('permno')['prc'].shift()

# keep only rows with a displag observation and a lagged price of $5 or more
df = df.query('displag == displag and prclag >= 5').reset_index(drop = True)

# create quintile bins each month based on lagged dispersion
df['bin'] = df.groupby('mdt')['displag'].transform(pd.qcut, 5, labels = False)

# equal-weight portfolio returns by month and bin, converted to percent
ew = (
    df.groupby(['caldt', 'bin'])['ret']
      .mean()
      .unstack(level='bin')
      .rename(columns=lambda x: 'p{:.0f}'.format(x))
) * 100



# initialize somewhere to store information
t_test_results = {} 

for bin_id in range(5):  # iterates through all bins
    # lagged dispersion values (not returns) for the stocks in this bin
    bin_data = df[df['bin'] == bin_id]['displag'].dropna()

    # t-test of whether mean lagged dispersion in the bin differs from zero
    # (the portfolio-return t-tests asked for in Task 1 would use the columns of `ew` instead)
    t_stat, p_value = stats.ttest_1samp(bin_data, 0)
    # Compute summary statistics
    summary_stats = bin_data.describe()  # Pandas describe() gives count, mean, std, min, etc.
    t_test_results[bin_id] = {
        't-stat': t_stat,
        'p-value': p_value,
        'n': int(summary_stats['count']),
        'mean': summary_stats['mean'],
        'std': summary_stats['std'],
        'min': summary_stats['min'],
        '25%': summary_stats['25%'],
        'median': summary_stats['50%'],
        '75%': summary_stats['75%'],
        'max': summary_stats['max'],
        }
    
for bin_id in range(5):
    if bin_id in t_test_results:
        print(f"bin: {bin_id}\n t-stat: {t_test_results[bin_id]['t-stat']}\n mean: {t_test_results[bin_id]['mean']}\n min: {t_test_results[bin_id]['min']}\n median: {t_test_results[bin_id]['median']}\n max: {t_test_results[bin_id]['max']}\n standard deviation: {t_test_results[bin_id]['std']}")
    else:
        print(f"No data for bin {bin_id}")

df.head(5)

bin: 0
 t-stat: 742.9875490563843
 mean: 0.009925252027212209
 min: 0.0
 median: 0.009259259259259259
 max: 0.06349206349206349
 standard deviation: 0.006836600354077372
bin: 1
 t-stat: 1165.4194475814404
 mean: 0.024113491172515242
 min: 0.007246376811594204
 median: 0.021739130434782608
 max: 0.13004484304932734
 standard deviation: 0.010572303544923679
bin: 2
 t-stat: 1221.1210740642464
 mean: 0.04378566810532664
 min: 0.014084507042253521
 median: 0.04
 max: 0.21777777777777776
 standard deviation: 0.01832272217752572
bin: 3
 t-stat: 1203.2667525364427
 mean: 0.08885194873303247
 min: 0.02678571428571428
 median: 0.0819672131147541
 max: 0.4423076923076923
 standard deviation: 0.0377237892672656
bin: 4
 t-stat: 130.91459628919662
 mean: 0.756005508227327
 min: 0.06410256410256411
 median: 0.2695865302642796
 max: 206.99999999999997
 standard deviation: 2.9501820933583662


Unnamed: 0,permno,caldt,cusip,ret,prc,me,mdt,meanest,stdev,disp,displag,prclag,bin
0,10001,1990-05-31,39040610,-0.012658,9.75,10.0132,1990-05-01,1.05,0.07,0.066667,0.14,9.875,3
1,10001,1990-06-29,39040610,0.014103,9.75,10.0523,1990-06-01,1.1,0.14,0.127273,0.066667,9.75,2
2,10001,1990-07-31,39040610,0.025641,10.0,10.31,1990-07-01,1.1,0.14,0.127273,0.127273,9.75,3
3,10001,1990-08-31,39040610,-0.05,9.5,9.7945,1990-08-01,1.05,0.08,0.07619,0.127273,10.0,3
4,10001,1990-09-28,39040610,0.04079,9.75,10.179,1990-09-01,,,,0.07619,9.5,3
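
Task 1 asks for t-tests on the average portfolio returns, and Task 4 asks for a spread portfolio. One way to compute both from a wide frame of portfolio returns is sketched below. The frame `ew_toy` is simulated to have the same layout as `ew` (one column per bin, in percent), so none of the numbers it produces are results from the actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
months = pd.date_range('2000-01-01', periods=120, freq='MS')

# Simulated monthly portfolio returns in percent, one column per bin.
ew_toy = pd.DataFrame(
    rng.normal(loc=[1.2, 1.1, 1.0, 0.9, 0.6], scale=5.0, size=(120, 5)),
    index=months,
    columns=['p0', 'p1', 'p2', 'p3', 'p4'],
)

# Spread portfolio: long the low-dispersion bin, short the high-dispersion bin.
ew_toy['spread'] = ew_toy['p0'] - ew_toy['p4']

# One-sample t-statistic for H0: mean return = 0, computed column by column.
n = ew_toy.count()
mean = ew_toy.mean()
tstat = mean / (ew_toy.std(ddof=1) / np.sqrt(n))

table = pd.DataFrame({'mean': mean, 'tstat': tstat, 'n': n})
print(table.round(3))
```

Because the spread is a zero-investment position (the long and short legs offset), it is already an excess return, so the riskfree rate does not need to be subtracted before running the CAPM regression on it.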
