# Data Exercise on Estimation of Risk Premia
Read the background paper by Jonathan Lewellen (2015), “The cross section of expected stock returns”, Critical Finance Review, pp. 1-44, as well as Fama and French (1992), covered in class. See also “Predicting Stock Returns Using Firm Characteristics” (alphaarchitect.com).
The data set GPEX1set1.csv, obtained from WRDS, contains 2,290,809 rows and 6 columns: GVKEY (firm identifier), Date, Return (monthly stock return in %), LogSize_-1 (log market value of equity at the end of the prior month), LogB/M_-1 (log book value of equity minus log market value of equity at the end of the prior month), and Return_-2,-12 (stock return from month −12 to month −2).
The rows are arranged by firms, then by dates. Note that different firms’ data may start and end at different dates. 
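Because firms enter and leave the sample at different dates, the number of firms available differs from month to month. A minimal sketch of how one might count firms per month and flag months below a minimum count is given here. The toy panel, its values, and the threshold of 2 are made up for illustration; the exercise itself uses a threshold of 30 on the real file.

```python
import pandas as pd

# Hypothetical toy panel: three firms with different start dates (values are made up).
panel = pd.DataFrame({
    "GVKEY": [1000, 1000, 1000, 2000, 2000, 3000],
    "Date": pd.to_datetime([
        "1972-04-30", "1972-05-31", "1972-06-30",
        "1972-05-31", "1972-06-30",
        "1972-06-30",
    ]),
    "Return": [26.67, -7.02, -9.43, 1.50, -2.00, 0.75],
})

# Number of distinct firms observed in each month.
firms_per_month = panel.groupby("Date")["GVKEY"].nunique()

# Months that would be skipped under a minimum-firm threshold
# (2 for this toy panel; the exercise uses 30).
threshold = 2
skipped = firms_per_month[firms_per_month < threshold].index

print(firms_per_month)
print("Skipped months:", list(skipped.strftime("%Y-%m-%d")))
```

On the real data the same `groupby` count gives a quick view of how thin the early cross-sections are before any regression is run.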

In Q1, we find and report the time series of monthly cross-sectional regression estimates for the entire data set. The regressand is the stock return and the regressors are the three given firm characteristics. Include a constant in each regression. Skip any month in which fewer than 30 firms are available for the cross-sectional regression.
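To make the single-month step concrete, the sketch below fits one cross-sectional regression with a constant and three regressors. It uses `numpy.linalg.lstsq` on synthetic, noise-free data as a stand-in. The firm count, the characteristics, and the coefficient values are all made up, so this is an illustration of the mechanics rather than the notebook's own routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-month cross-section: 40 firms, 3 characteristics.
n_firms = 40
characteristics = rng.normal(size=(n_firms, 3))

# Made-up coefficients: constant followed by three slopes.
true_coefs = np.array([0.01, -0.002, 0.005, 0.01])

# Design matrix with a leading column of ones for the constant.
design = np.column_stack([np.ones(n_firms), characteristics])

# Noise-free returns, so OLS should recover true_coefs up to rounding.
returns = design @ true_coefs

# OLS estimates for this one month.
coefs, *_ = np.linalg.lstsq(design, returns, rcond=None)
print(coefs)
```

Repeating this fit month by month and stacking the coefficient vectors gives the time series of estimates that Q1 asks for.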

In Q2, compute the time-series averages of the monthly slope estimates (the risk premium estimates) and their standard errors. This follows the Fama-MacBeth procedure. Then perform a t-test of whether each average slope is significantly different from zero.
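The two steps can be written out as follows. The notation is an assumption introduced here for illustration: $\gamma_{k,t}$ denotes the coefficient on regressor $k$ in month $t$, $T$ is the number of months with a valid cross-sectional regression, and the standard error treats the monthly estimates as serially uncorrelated.

```latex
\begin{align}
R_{i,t} &= \gamma_{0,t} + \gamma_{1,t}\,\mathrm{LogSize}_{i,t-1}
          + \gamma_{2,t}\,\mathrm{LogB/M}_{i,t-1}
          + \gamma_{3,t}\,\mathrm{Return}_{i,(-2,-12)} + \varepsilon_{i,t} \\
\bar{\gamma}_k &= \frac{1}{T}\sum_{t=1}^{T}\hat{\gamma}_{k,t} \\
\mathrm{SE}(\bar{\gamma}_k) &= \frac{1}{\sqrt{T}}
          \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(\hat{\gamma}_{k,t}-\bar{\gamma}_k\right)^2} \\
t_k &= \frac{\bar{\gamma}_k}{\mathrm{SE}(\bar{\gamma}_k)}
\end{align}
```

Under the null hypothesis $\bar{\gamma}_k = 0$, $t_k$ is compared with a t-distribution with $T-1$ degrees of freedom.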

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats
import statsmodels.api as sm
import datetime as dt
import warnings
# warnings.resetwarnings()
warnings.filterwarnings("ignore")


In [2]:
df = pd.read_csv('GPEX1set1.csv')
df['Date'] = pd.to_datetime(df['Date']) #convert to pandas datetime for indexing
df.set_index(['Date','GVKEY'],inplace = True)
df['Return'] = df['Return']/100 #convert to decimal
df

Date,GVKEY,Return,LogSize_-1,LogB/M_-1,"Return_-2,-12"
1972-04-30,1000,0.266667,2.819480,-0.266121,-0.461539
1972-05-31,1000,-0.070175,3.053069,-0.934455,-0.476744
1972-06-30,1000,-0.094340,2.977503,-0.861695,-0.173913
1972-07-31,1000,0.000000,2.875596,-0.762604,-0.196970
1972-08-31,1000,-0.020833,2.875935,-0.762604,-0.020408
...,...,...,...,...,...
2020-08-31,328795,0.096400,7.620326,-0.130132,0.129600
2020-09-30,328795,-0.047526,7.712358,-0.222164,0.305948
2020-10-31,328795,0.048310,7.663665,-0.173471,0.359853
2020-11-30,328795,0.123890,7.709071,-0.219568,0.152096


In [3]:
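def cs_ols(df, threshold=30):
    """Run one cross-sectional OLS per month; skip months with too few observations."""
    cs_stats = []   # one row of estimates per month
    cs_result = {}  # fitted results keyed by date string

    for dates, datapoints in df.groupby('Date'):
        # Require at least `threshold` non-missing observations in every column
        if datapoints.count().min() >= threshold:
            # Regress Return on a constant and the three characteristics
            OLS = sm.OLS(datapoints['Return'], sm.add_constant(datapoints.iloc[:,1:])).fit()
            OLS_stats = OLS.params
            OLS_stats.name = dates
            # Raw string avoids the invalid '\m' escape sequence
            OLS_stats[r'Adj $\mathbb{R}^2$'] = OLS.rsquared_adj
            OLS_stats['N.Obs'] = OLS.nobs
            cs_stats.append(OLS_stats)
            cs_result[dates.strftime('%Y-%m-%d')] = OLS

    cs_stats = pd.DataFrame(cs_stats)
    return cs_stats, cs_result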

#  <a id='Q1'><font color = 'black'>Question 1</font></a>

In [4]:
GPEX1_summary,GPEX1_results = cs_ols(df)

In [5]:
GPEX1_summary.head(2)

Date,const,LogSize_-1,LogB/M_-1,"Return_-2,-12",Adj $\mathbb{R}^2$,N.Obs
1967-04-30,-0.007385,-0.000439,0.004839,0.127069,0.260407,39.0
1967-05-31,-0.036995,0.001527,-0.007863,-0.146485,0.16926,41.0


In [6]:
GPEX1_summary.tail(20)

Date,const,LogSize_-1,LogB/M_-1,"Return_-2,-12",Adj $\mathbb{R}^2$,N.Obs
2019-08-31,-0.036783,-0.001712,-0.001916,0.053777,0.015879,3702.0
2019-09-30,-0.013596,0.008546,0.030158,-0.038796,0.067908,3685.0
2019-10-31,-0.01968,0.004526,-0.000637,0.02999,0.015054,3612.0
2019-11-30,-0.002562,0.003886,-0.013128,-0.036935,0.00889,3629.0
2019-12-31,0.064604,-0.001202,0.017621,-0.020152,0.013839,3614.0
2020-01-31,0.061434,-0.015367,-0.027279,-0.000291,0.022052,3577.0
2020-02-29,-0.03324,-0.007448,-0.005352,-0.007484,0.004634,3564.0
2020-03-31,-0.252737,-0.002991,-0.050584,0.005838,0.046268,3546.0
2020-04-30,0.256116,-0.013799,-0.0105,-0.037973,0.016429,3696.0
2020-05-31,0.094314,-0.00817,-0.033578,-0.011722,0.025951,3620.0


In [7]:
GPEX1_results

{'1967-04-30': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b1ed5970>,
 '1967-05-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b2ab7470>,
 '1967-06-30': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b3ffb710>,
 '1967-07-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b3ffbf20>,
 '1967-08-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4044a70>,
 '1967-09-30': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4045130>,
 '1967-10-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4045250>,
 '1967-11-30': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4045fa0>,
 '1967-12-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4046150>,
 '1968-01-31': <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x203b4046b70>,
 '1968-02-29': <statsmodels.regression.l

# <a id='Q2'><font color = 'black'>Question 2</font></a>

In [8]:
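# Time-series mean and sample standard deviation of the monthly estimates
# (the last two columns, adjusted R^2 and N.Obs, are excluded)
mean = GPEX1_summary.iloc[:, :-2].mean()
std = GPEX1_summary.iloc[:, :-2].std()
nobs = len(GPEX1_summary)  # number of monthly regressions
# Fama-MacBeth t-statistic: mean divided by its standard error
tstat = mean / (std / np.sqrt(nobs))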

In [9]:
df_t1 = pd.concat([mean, tstat], axis=1)
df_t1.columns = ['average', 't-statistic']
# two-tailed p-value from a t-distribution with nobs - 1 degrees of freedom
df_t1['p-value'] = scipy.stats.t.sf(abs(tstat), df=nobs-1)*2
print('No. of Obs:',nobs)
df_t1

No. of Obs: 648


Variable,average,t-statistic,p-value
const,0.017795,5.582568,3.489432e-08
LogSize_-1,-0.001086,-2.93076,0.003500591
LogB/M_-1,0.004723,7.565198,1.335177e-13
"Return_-2,-12",0.009976,6.626166,7.259303e-11
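Q2 also asks for the standard errors, which the table above does not list explicitly. One way to report them alongside the averages is sketched below. The helper name `fama_macbeth_summary` and the toy slope series are hypothetical, and the standard error assumes the monthly estimates are serially uncorrelated.

```python
import numpy as np
import pandas as pd
from scipy import stats


def fama_macbeth_summary(slopes: pd.DataFrame) -> pd.DataFrame:
    """Average, standard error, t-statistic and two-sided p-value for each column."""
    n_months = len(slopes)
    average = slopes.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    std_error = slopes.std(ddof=1) / np.sqrt(n_months)
    t_stat = average / std_error
    # Two-sided p-value from a t-distribution with n_months - 1 degrees of freedom.
    p_value = pd.Series(
        2 * stats.t.sf(t_stat.abs().to_numpy(), df=n_months - 1),
        index=slopes.columns,
    )
    return pd.DataFrame({
        "average": average,
        "std error": std_error,
        "t-statistic": t_stat,
        "p-value": p_value,
    })


# Hypothetical monthly estimates (made-up numbers, not the exercise data).
rng = np.random.default_rng(1)
toy_slopes = pd.DataFrame({
    "const": rng.normal(0.01, 0.05, size=120),
    "slope": rng.normal(0.0, 0.02, size=120),
})

summary = fama_macbeth_summary(toy_slopes)
print(summary)
```

Applied to the notebook's estimates, the call would be `fama_macbeth_summary(GPEX1_summary.iloc[:, :-2])`, which drops the adjusted R² and N.Obs columns before averaging.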
