
# Empirical Asset Pricing - PS1
## Task 1

In [5]:
using CSV, DataFrames, NullableArrays, Plots; gr();

In [6]:
df = CSV.read("./Data/a5986cc31aad4a02.csv", delim=',', 
        types=[Date, Float64, Float64, Float64], 
        dateformat = DateFormat("yyyymmdd"), nullable=false);

### (b)
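As a sketch of what the next cell computes, monthly dividends are backed out from the gap between the return with and without distributions, scaled by the previous month's market value. The symbols $R^{d}_t$ (`vwretd`), $R^{x}_t$ (`vwretx`) and $V_t$ (`totval`) are labels introduced here for illustration:

$$D_t=\big(R^{d}_t-R^{x}_t\big)\,V_{t-1},\qquad t\geq 2,\qquad D_1=\big(R^{d}_1-R^{x}_1\big)\,V_1$$

The first observation has no lagged market value, so the code falls back on the same-month value $V_1$.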

In [9]:
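# Dividends paid in month t: the income return (vwretd - vwretx) times the
# previous month's market value; the first row falls back on its own totval.
df[:div] = collect((df[:vwretd]-df[:vwretx]).*df[:totval][vcat(1, 1:(size(df, 1)-1))])
df[:year] = Dates.value.(Dates.Year.(df[:DATE]))
df[:month] = Dates.value.(Dates.Month.(df[:DATE]));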

In [316]:
head(df)

| | DATE | vwretd | vwretx | totval | div | year | month |
|---|---|---|---|---|---|---|---|
| 1 | 1945-01-31 | 0.020218 | 0.018951 | 47861970.3 | 60641.11637010003 | 1945 | 1 |
| 2 | 1945-02-28 | 0.064477 | 0.059894 | 50725183.6 | 219351.40988490015 | 1945 | 2 |
| 3 | 1945-03-31 | -0.039177 | -0.043164 | 48551743.8 | 202241.30701320025 | 1945 | 3 |
| 4 | 1945-04-30 | 0.078232 | 0.076981 | 52320643.2 | 60738.2314938001 | 1945 | 4 |
| 5 | 1945-05-31 | 0.018185 | 0.012439 | 53195465.9 | 300634.4158272 | 1945 | 5 |
| 6 | 1945-06-30 | 0.004676 | 0.001087 | 53280582.2 | 190918.52711509995 | 1945 | 6 |


### (c)
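The two annual dividend series built in the next cell can be summarized as follows. This is a sketch of what the code does, with $D_{y,m}$ the dividend and $R^{d}_{y,m}$ the `vwretd` return in month $m$ of year $y$, and $M_{y,m}$ a running balance; all three symbols are introduced here for illustration:

$$\text{cash\_div}_y=\sum_{m=1}^{12}D_{y,m}$$

$$M_{y,1}=D_{y,1},\qquad M_{y,m}=M_{y,m-1}\big(1+R^{d}_{y,m-1}\big)+D_{y,m}\quad(m=2,\dots,12),\qquad \text{mkt\_div}_y=M_{y,12}$$

As the loop is written, the balance is compounded at the previous month's return before the current month's dividend is added.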

In [317]:
yearly_data = by(df, [:year], df -> sum(df[:div]))

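# One slot per calendar year (assumes the sample consists of complete years).
yearly_data[:mkt_div] = zeros(floor(Int64, size(df, 1)/12))

for t in 1:size(df, 1)
    y = 1 + floor(Int64, (t-1)/12)   # index of the year that month t belongs to
    if (t-1) % 12 == 0               # January: start the year's balance afresh
        yearly_data[:mkt_div][y] = df[:div][t]
    else                             # other months: compound at vwretd[t-1], then add this month's dividend
        yearly_data[:mkt_div][y] =
            yearly_data[:mkt_div][y]*(1+df[:vwretd][t-1]) + df[:div][t]
    end
end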

names!(yearly_data, [:year, :cash_div, :mkt_div]);

In [318]:
(df[df[:year] .== 1945, :div])

12-element DataArrays.DataArray{Float64,1}:
  60641.1      
      2.19351e5
      2.02241e5
  60738.2      
      3.00634e5
      1.90919e5
  66014.6      
 243777.0      
      1.83325e5
  65865.3      
      4.18723e5
      2.37665e5

In [319]:
2.2498955541939e6 - 60641.1 

2.1892544541939e6

In [320]:
plot(yearly_data[:year], yearly_data[:cash_div], label="Cash", title="Annual Dividends invested in ...")
plot!(yearly_data[:year], yearly_data[:mkt_div], label="Market")

### (d)
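The annual series constructed in the cells below can be written compactly as follows. This is a sketch, in which $R^{d}_{y,m}$ is the monthly `vwretd` return, $D_y$ is either annual dividend series and $V_{y,12}$ is the December `totval`; these symbols are labels chosen here:

$$r_y=\log\prod_{m=1}^{12}\big(1+R^{d}_{y,m}\big),\qquad \Delta d_y=\log\frac{D_y}{D_{y-1}},\qquad pd_y=\log\frac{V_{y,12}}{D_y}$$

For the first year there is no $D_{y-1}$, and the code sets $\Delta d_y$ to 0 as a placeholder.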

In [321]:
yearly_data[:ret] = log.(by(df, [:year], df -> prod(1 .+ df[:vwretd]))[:x1]);

In [322]:
yearly_data[:cash_div_growth] = 
    vcat(0, log.(yearly_data[:cash_div][2:end]./yearly_data[:cash_div][1:end-1]))

yearly_data[:mkt_div_growth] = 
    vcat(0, log.(yearly_data[:mkt_div][2:end]./yearly_data[:mkt_div][1:end-1]));

In [323]:
yearly_data[:totval] = by(df, [:year], df -> df[:totval][12])[:x1]
yearly_data[:cash_logPD] = log.(yearly_data[:totval] ./ yearly_data[:cash_div])
yearly_data[:mkt_logPD] = log.(yearly_data[:totval] ./ yearly_data[:mkt_div]);

In [324]:
head(yearly_data)

| | year | cash_div | mkt_div | ret | cash_div_growth | mkt_div_growth | totval | cash_logPD | mkt_logPD |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1945 | 2249895.5541939 | 2647709.078443113 | 0.3297315675814011 | 0.0 | 0.0 | 64330622.1 | 3.3531519610447926 | 3.1903409884224825 |
| 2 | 1946 | 2599129.628707599 | 2319925.542087249 | -0.066366312101153 | 0.144292836013417 | -0.1321596761676265 | 59684383.6 | 3.1338937741466646 | 3.247535313705398 |
| 3 | 1947 | 3231865.534292101 | 3263680.9326087963 | 0.0324011382287758 | 0.2178829045105071 | 0.3413205875984371 | 59885360.7 | 2.919372544287957 | 2.909576400758761 |
| 4 | 1948 | 3757383.750905099 | 3610419.3620426296 | 0.0210895363296691 | 0.1506633689790038 | 0.1009682535350394 | 58301258.2 | 2.741900770315167 | 2.7817997422299343 |
| 5 | 1949 | 4139718.261552899 | 4534289.215466398 | 0.1832323109661117 | 0.0969048284729966 | 0.2278444058309379 | 67344403.4 | 2.7891920693939216 | 2.698151463950748 |
| 6 | 1950 | 5265457.501587398 | 5784492.814004975 | 0.2659995037158896 | 0.2405403039426631 | 0.243512346001775 | 84833973.9 | 2.779528061478855 | 2.6855154139765696 |


The mean and standard deviation of cash- and market-invested dividend growth (both series include the placeholder value of 0 for the first year) are:

In [325]:
colwise(mean, yearly_data[[:cash_div_growth, :mkt_div_growth]])

2-element Array{Any,1}:
 [0.0784899]
 [0.0771895]

In [326]:
colwise(std, yearly_data[[:cash_div_growth, :mkt_div_growth]])

2-element Array{Any,1}:
 [0.0686968]
 [0.133035] 

Investing the dividends in the market portfolio does not yield higher average dividend growth (0.077 versus 0.078 for cash), due to the [January effect](https://en.wikipedia.org/wiki/January_effect). The standard deviation of market-invested dividend growth is also much higher (0.133 versus 0.069), due to stock market fluctuations.

### (e)
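The regressions estimated below are predictive regressions of next year's log return and next year's cash-dividend growth on the current log price-dividend ratio. As a sketch of the specification (the intercept $a_r$ and the error terms $\varepsilon^{r}$, $\varepsilon^{d}$ are labels introduced here; $a_d$, $b_d$ and $b_r$ match the notation used later on):

$$r_{t+1}=a_r+b_r\,pd_t+\varepsilon^{r}_{t+1},\qquad \Delta d_{t+1}=a_d+b_d\,pd_t+\varepsilon^{d}_{t+1}$$

Each is run on the full sample, on years before 1990, and on years from 1990 onward.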

In [327]:
using GLM

In [328]:
data_prep = yearly_data[2:end, [:year, :ret, :cash_logPD, :cash_div_growth]]
data_prep[:lagged_cash_logPD] = yearly_data[:cash_logPD][1:end-1];

Full Sample:

In [329]:
model = lm(@formula(ret ~ lagged_cash_logPD), data_prep)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: ret ~ 1 + lagged_cash_logPD

Coefficients:
                    Estimate Std.Error  t value Pr(>|t|)
(Intercept)         0.530521  0.157918  3.35947   0.0013
lagged_cash_logPD  -0.122306 0.0446251 -2.74074   0.0078


In [330]:
r2(model)

0.09817662392924986

In [331]:
model_d = lm(@formula(cash_div_growth ~ lagged_cash_logPD), data_prep)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: cash_div_growth ~ 1 + lagged_cash_logPD

Coefficients:
                      Estimate Std.Error  t value Pr(>|t|)
(Intercept)          0.0986845  0.069243  1.42519   0.1586
lagged_cash_logPD  -0.00543238 0.0195669 -0.27763   0.7821


In [332]:
r2(model_d)

0.0011158350008704243

First part of the sample:

In [333]:
model_early = lm(@formula(ret ~ lagged_cash_logPD), data_prep[data_prep[:year] .< 1990, :])

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: ret ~ 1 + lagged_cash_logPD

Coefficients:
                   Estimate Std.Error  t value Pr(>|t|)
(Intercept)         1.08774  0.266012  4.08904   0.0002
lagged_cash_logPD  -0.30019 0.0812821 -3.69319   0.0006


In [334]:
r2(model_early)

0.24514226762662428

In [286]:
model_d_early = lm(@formula(cash_div_growth ~ lagged_cash_logPD), data_prep[data_prep[:year] .< 1990, :])

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: cash_div_growth ~ 1 + lagged_cash_logPD

Coefficients:
                      Estimate Std.Error   t value Pr(>|t|)
(Intercept)           0.101015  0.108621  0.929978   0.3577
lagged_cash_logPD  -0.00394508   0.03319 -0.118863   0.9060


In [287]:
r2(model_d_early)

0.0003362798055228655

Second part:

In [288]:
model_late = lm(@formula(ret ~ lagged_cash_logPD), data_prep[data_prep[:year] .>= 1990, :])

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: ret ~ 1 + lagged_cash_logPD

Coefficients:
                    Estimate Std.Error  t value Pr(>|t|)
(Intercept)         0.858665  0.450902  1.90433   0.0684
lagged_cash_logPD  -0.196313  0.114617 -1.71278   0.0991


In [289]:
r2(model_late)

0.10502071705460247

In [290]:
model_d_late = lm(@formula(cash_div_growth ~ lagged_cash_logPD), data_prep[data_prep[:year] .>= 1990, :])

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: cash_div_growth ~ 1 + lagged_cash_logPD

Coefficients:
                    Estimate Std.Error  t value Pr(>|t|)
(Intercept)        -0.259776  0.215559 -1.20513   0.2394
lagged_cash_logPD  0.0829554 0.0547938  1.51396   0.1426


In [291]:
r2(model_d_late)

0.08398287215184552

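In the 1990s the price-dividend ratio rose dramatically while stocks continued to perform well. This looks like a structural break: splitting the sample at 1990 raises the $R^2$ of the return regression in both subsamples (0.245 before 1990 and 0.105 from 1990 onward, against 0.098 for the full sample).

Dividend growth is hardly predictable: the slope on the lagged log price-dividend ratio is statistically insignificant in every sample, and the full-sample $R^2$ is about 0.001.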

### Preliminaries for (f) 
The [Campbell-Shiller decomposition](http://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.1988.tb04598.x/abstract) starts from the definition of the gross return on a stock: $R_{t+1}=\frac{P_{t+1}+D_{t+1}}{P_{t}}$.

Taking logs:

$$r_{t+1}=p_{t+1}-p_{t}+\log\Big(1+\exp(\underset{dp_{t+1}}{\underbrace{d_{t+1}-p_{t+1}}})\Big)$$
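
The intermediate steps can be sketched as follows, assuming the usual convention that lower-case letters denote logs ($p_t=\log P_t$, $d_t=\log D_t$, $r_{t+1}=\log R_{t+1}$):

$$\begin{aligned}
r_{t+1}&=\log\big(P_{t+1}+D_{t+1}\big)-\log P_t\\
&=\log\Big(P_{t+1}\big(1+\tfrac{D_{t+1}}{P_{t+1}}\big)\Big)-p_t\\
&=p_{t+1}-p_t+\log\Big(1+\exp(d_{t+1}-p_{t+1})\Big)
\end{aligned}$$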

The Taylor approximation of the last term around $\bar{dp}$ is:

$$\log\Big(1+\exp(dp_{t+1})\Big)\approx\underset{\kappa}{\underbrace{\log\Big(1+\exp(\bar{dp})\Big)}}+\underset{1-\rho}{\underbrace{\frac{\exp(\bar{dp})}{1+\exp(\bar{dp})}}}\Big(dp_{t+1}-\bar{dp}\Big)+\frac{1}{2}\frac{\exp(\tilde{dp})}{\Big(1+\exp(\tilde{dp})\Big)^{2}}\Big(dp_{t+1}-\bar{dp}\Big)^{2}$$

where $\tilde{dp}$ is between $dp_{t+1}$ and $\bar{dp}$ (this is the [Lagrange form](https://en.wikipedia.org/wiki/Taylor%27s_theorem#Explicit_formulas_for_the_remainder) of the remainder).

Note also that the remainder coefficient $\frac{\exp(\tilde{dp})}{\big(1+\exp(\tilde{dp})\big)^{2}}$ is an increasing function of $\tilde{dp}$ for all $\tilde{dp} \leq 0$.
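
A short check of the monotonicity claim, writing $g(x)=\frac{e^{x}}{(1+e^{x})^{2}}$ for the remainder coefficient ($g$ is a label introduced here):

$$g'(x)=\frac{e^{x}}{(1+e^{x})^{2}}-\frac{2e^{2x}}{(1+e^{x})^{3}}=\frac{e^{x}\big(1-e^{x}\big)}{(1+e^{x})^{3}}\;\geq 0\quad\Longleftrightarrow\quad x\leq 0$$

Since every observed log dividend-price ratio is negative, evaluating $g$ at the larger or the smaller of $dp_{t+1}$ and $\bar{dp}$ gives an upper or a lower bound on the remainder.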

In [292]:
dp = -yearly_data[:cash_logPD]
dp_bar = mean(dp)
κ = log(1 + exp(dp_bar))
ρ = 1 - exp(dp_bar)/(1 + exp(dp_bar))

0.9712135004103063

In [293]:
exp(2.8)

16.444646771097048

In [294]:
dp_sample = linspace(minimum(dp), maximum(dp), 100)
# exact term, and its first- and second-order expansions around dp_bar
f(dp) = log(1 + exp(dp))
f_approx(dp) = κ + exp(dp_bar)/(1 + exp(dp_bar))*(dp-dp_bar)
f_approx2(dp) = κ + exp(dp_bar)/(1 + exp(dp_bar))*(dp-dp_bar) + 1/2 * exp(dp_bar)/(1 + exp(dp_bar))^2 * (dp - dp_bar)^2
# Lagrange remainder with the coefficient evaluated at the larger / smaller of dp and dp_bar;
# these bound the first-order error because the coefficient is increasing for dp ≤ 0
local_max_remainder(dp) = 1/2 * exp(max(dp, dp_bar))/(1 + exp(max(dp, dp_bar)))^2 * (dp - dp_bar)^2
local_min_remainder(dp) = 1/2 * exp(min(dp, dp_bar))/(1 + exp(min(dp, dp_bar)))^2 * (dp - dp_bar)^2

plot(dp_sample, f.(dp_sample) .- f_approx.(dp_sample), label="Error from 1st order")
plot!(dp_sample, local_max_remainder.(dp_sample), label="Max Error")
plot!(dp_sample, local_min_remainder.(dp_sample), label="Min Error")
plot!(dp_sample, f.(dp_sample) .- f_approx2.(dp_sample), label="Error from 2nd order")

In [295]:
(f.(dp_sample) .- f_approx.(dp_sample))' * fit(Histogram, dp, vcat(0, dp_sample), closed=:right).weights/length(dp)

0.002284290406909674

The first-order approximation understates the true return by at most 0.015 and by 0.0026 on average (over the observed sample).

In [296]:
summarystats(yearly_data[:ret])

Summary Stats:
Mean:           0.103926
Minimum:        -0.481478
1st Quartile:   0.003057
Median:         0.133459
3rd Quartile:   0.225562
Maximum:        0.408210


Given that the mean return is around 0.1, this approximation is okay!

Hence,

$$pd_t=\rho pd_{t+1}+\kappa - (1-\rho)\bar{dp}+\Delta d_{t+1} - r_{t+1}$$
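
To see where this comes from, here is a sketch of the substitution. It drops the second-order remainder and uses $pd_t=p_t-d_t=-dp_t$, so that $p_{t+1}-p_t=pd_{t+1}-pd_t+\Delta d_{t+1}$:

$$\begin{aligned}
r_{t+1}&\approx p_{t+1}-p_t+\kappa+(1-\rho)\big(dp_{t+1}-\bar{dp}\big)\\
&=pd_{t+1}-pd_t+\Delta d_{t+1}+\kappa-(1-\rho)\,pd_{t+1}-(1-\rho)\,\bar{dp}\\
&=\rho\,pd_{t+1}-pd_t+\Delta d_{t+1}+\kappa-(1-\rho)\,\bar{dp}
\end{aligned}$$

Solving the last line for $pd_t$ gives the expression above.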

Iterating this forward, and assuming $\lim_{j\to\infty}\rho^{j}pd_{t+j}=0$, gives:

$$pd_t=\frac{\kappa}{1-\rho}-\bar{dp}+\sum_{j=0}^{\infty}\rho^{j}\Big(\Delta d_{t+j+1}-r_{t+j+1}\Big)$$

The average approximation error accumulates to:

In [297]:
1/(1-ρ)*0.002577539284226769

0.08953986490075345
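
The factor $1/(1-\rho)$ comes from summing a constant per-period error over the infinite horizon. As a sketch, with $\bar{e}$ denoting the average one-period approximation error (a label introduced here) and treating that error as the same in every period:

$$\sum_{j=0}^{\infty}\rho^{j}\,\bar{e}=\frac{\bar{e}}{1-\rho},\qquad 0<\rho<1$$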

In [298]:
summarystats(yearly_data[:cash_logPD])

Summary Stats:
Mean:           3.518640
Minimum:        2.741901
1st Quartile:   3.223868
Median:         3.454982
3rd Quartile:   3.842705
Maximum:        4.502235


Given that the mean log price-dividend ratio is around 3.5, this error is acceptable.

### (f)

In [335]:
model_pd = lm(@formula(cash_logPD ~ lagged_cash_logPD), data_prep)

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: cash_logPD ~ 1 + lagged_cash_logPD

Coefficients:
                   Estimate Std.Error t value Pr(>|t|)
(Intercept)        0.325237  0.180611 1.80076   0.0761
lagged_cash_logPD  0.909443 0.0510376 17.8191   <1e-26


The variance of $pd_t$ is then:

$$\mathbb{V}\Big[pd_t\Big]=\mathbb{C}\Big[\sum_{s=1}^\infty \rho^{s-1} \mathbb{E}_t\big[\Delta d_{t+s}\big], pd_t\Big] + \mathbb{C}\Big[-\sum_{s=0}^\infty \rho^{s-1} \mathbb{E}_t\big[r_{t+s}\big], pd_t\Big]$$

$$1=\frac{\mathbb{C}\Big[\sum_{s=1}^\infty \rho^{s-1} \mathbb{E}_t\big[\Delta d_{t+s}\big], pd_t\Big]}{\mathbb{V}\Big[pd_t\Big]} + \frac{\mathbb{C}\Big[-\sum_{s=0}^\infty \rho^{s-1} \mathbb{E}_t\big[r_{t+s}\big], pd_t\Big]}{\mathbb{V}\Big[pd_t\Big]}$$

The first term on the right-hand side is (note that all constants in the first argument of the covariance drop out):

$$\frac{\sum_{s=1}^{\infty}\rho^{s-1}\mathbb{C}\Big[\mathbb{E}_{t}\big[\Delta d_{t+s}\big],pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}=\frac{\sum_{s=1}^{\infty}\rho^{s-1}\mathbb{C}\Big[a_{d}+b_{d}\mathbb{E}_{t}\big[pd_{t+s-1}\big],pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}=\frac{b_{d}\sum_{s=0}^{\infty}\big(\rho\phi\big)^{s}\mathbb{V}\Big[pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}=\frac{b_{d}}{1-\rho\phi} $$
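
The step from the second to the third expression uses the AR(1) forecast of the price-dividend ratio. As a sketch, assume $pd_{t+1}=a_{pd}+\phi\,pd_t+u_{t+1}$ with $\mathbb{E}_t[u_{t+1}]=0$, where $a_{pd}$ and $u$ are labels introduced here and $\phi$ is the slope of the regression above:

$$\mathbb{E}_t\big[pd_{t+s-1}\big]=a_{pd}\,\frac{1-\phi^{s-1}}{1-\phi}+\phi^{s-1}pd_t\quad\Longrightarrow\quad\mathbb{C}\Big[\mathbb{E}_t\big[pd_{t+s-1}\big],pd_t\Big]=\phi^{s-1}\,\mathbb{V}\Big[pd_t\Big]$$

$$\sum_{s=1}^{\infty}\rho^{s-1}\phi^{s-1}=\sum_{s=0}^{\infty}\big(\rho\phi\big)^{s}=\frac{1}{1-\rho\phi},\qquad|\rho\phi|<1$$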

In [338]:
b_d = coef(model_d)[2]
b_r = coef(model)[2]
ϕ = coef(model_pd)[2]

b_d/ (1 - ρ*ϕ)

-0.046535188778414484

The dividend-growth share is about $-0.05$, so the variation of the price-dividend ratio is driven almost entirely by variation in expected returns.

### (g)

The second term in the variance decomposition, using the linearization for the 0-th element of the sum, is:

$$\frac{\mathbb{C}\Big[-r_{t},pd_{t}\Big]+\mathbb{C}\Big[-\sum_{s=1}^{\infty}\rho^{s-1}\mathbb{E}_{t}\big[r_{t+s}\big],pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}\approx\frac{-\frac{(1-\rho)}{\rho}\mathbb{V}\Big[pd_{t}\Big]+\mathbb{C}\Big[-\sum_{s=0}^{\infty}\rho^{s}\mathbb{E}_{t}\big[r_{t+s+1}\big],pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}=\frac{-\frac{(1-\rho)}{\rho}\mathbb{V}\Big[pd_{t}\Big]-b_{r}\sum_{s=0}^{\infty}\rho^{s}\mathbb{C}\Big[\mathbb{E}_{t}\big[pd_{t+s}\big],pd_{t}\Big]}{\mathbb{V}\Big[pd_{t}\Big]}=-\frac{1-\rho}{\rho}-\frac{b_{r}}{1-\rho\phi}$$


Hence, the variance decomposition implies:

$$\frac{1}{\rho}=b_{d}+\phi-b_{r}$$
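
The algebra behind this line, taking the two shares derived above as given, can be sketched as:

$$\begin{aligned}
1&=\frac{b_d}{1-\rho\phi}-\frac{1-\rho}{\rho}-\frac{b_r}{1-\rho\phi}\\
\frac{1}{\rho}&=\frac{b_d-b_r}{1-\rho\phi}\\
\frac{1}{\rho}-\phi&=b_d-b_r\\
\frac{1}{\rho}&=b_d+\phi-b_r
\end{aligned}$$

The second line moves $\frac{1-\rho}{\rho}$ to the left-hand side, using $1+\frac{1-\rho}{\rho}=\frac{1}{\rho}$. The third multiplies both sides by $1-\rho\phi$, using $\frac{1-\rho\phi}{\rho}=\frac{1}{\rho}-\phi$.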

In [336]:
1/ρ

1.0296397234774148

In [337]:
coef(model_d)[2]+coef(model_pd)[2]-coef(model)[2]

1.0263161028426322