# Homework 1.X. - Assessing the OLS Model
## Data Analysis
### FINM August Review 

Mark Hendricks

hendricks@uchicago.edu

# 1 Multivariate Regression

This problem utilizes the data in `../data/multi_asset_etf_data.xlsx`.
* Return data on various asset classes.
* This data comes via ETFs, which we will discuss in the Markets series.

## 1.1 Correlation

Calculate and display the correlation matrix of the returns.

Consider displaying it with `seaborn.heatmap`.

Which pair has the highest correlation? Which has the lowest (most negative)?
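One way to find the extreme pairs is sketched below. The DataFrame `rets` and its column names `A` through `D` are simulated stand-ins (assumptions, not the ETF data); in the homework, `rets` would be loaded from the Excel file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ETF returns (simulated, not the real data).
rng = np.random.default_rng(0)
common = rng.normal(size=200)
rets = pd.DataFrame({
    "A": common + 0.1 * rng.normal(size=200),
    "B": common + 0.1 * rng.normal(size=200),
    "C": -common + 0.5 * rng.normal(size=200),
    "D": rng.normal(size=200),
})

corr = rets.corr()

# Keep only the strict upper triangle so that each pair appears once
# and the diagonal of ones is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

max_pair, min_pair = pairs.idxmax(), pairs.idxmin()
print("highest:", max_pair, round(pairs.max(), 3))
print("lowest: ", min_pair, round(pairs.min(), 3))

# For the plot, something like seaborn.heatmap(corr, annot=True) could be used.
```

Masking the diagonal matters: without it, the "highest" correlation would trivially be each asset with itself.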

## 1.2 Multivariate Regression

Suppose that we want to decompose `PSP` into a linear combination of other asset classes.
* `PSP` is a benchmark of private equity returns.
* There is substantial research (and controversy) as to whether private equity returns can be produced from other simple assets.
* We will see.

$$r_t^{\text{PSP}} = \alpha + \boldsymbol{\beta}'\boldsymbol{r}_t + \epsilon_t$$

where $\boldsymbol{r}_t$ denotes the vector of all the other returns (excluding PSP) at time $t$.

Report
* the estimated alpha
* the estimated betas
* the r-squared

#### Python tip
Consider forming `X = rets.drop(columns=['PSP'])`.

Consider using one of the following for the regression.
* `statsmodels.api.OLS`
* `sklearn.linear_model.LinearRegression()`

The former will include various regression statistics. The latter will just produce the estimates.

## 1.3 Interpretation

Based on your estimates, do you think it is feasible to replicate `PSP` with these other assets? Be specific, citing your answers to the previous question. What does $\alpha$ indicate? What does the r-squared statistic indicate?

## 1.4 Multicollinearity

Should we be worried about multicollinearity in this case?

Calculate some metrics for $X'X$, noting that in our case $X$ is the array of return data, excluding `PSP`:

* determinant
* condition number

What do these metrics indicate?

#### Python tip
You may find these `numpy` functions helpful:
* `numpy.linalg.cond()`
* `numpy.linalg.det()`
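As a rough illustration of how these two functions behave, the sketch below compares a nearly collinear design with an independent one. Both arrays are simulated, and the helper `xtx_metrics` is a hypothetical name introduced here.

```python
import numpy as np

def xtx_metrics(X):
    """Return (determinant, condition number) of X'X."""
    XtX = X.T @ X
    return np.linalg.det(XtX), np.linalg.cond(XtX)

rng = np.random.default_rng(2)
n = 240
base = rng.normal(scale=0.04, size=n)

# Hypothetical regressors: the second column is almost a copy of the first.
X_collinear = np.column_stack([
    base,
    base + rng.normal(scale=0.001, size=n),
    rng.normal(scale=0.04, size=n),
])
# Hypothetical regressors with no built-in dependence.
X_independent = rng.normal(scale=0.04, size=(n, 3))

det_c, cond_c = xtx_metrics(X_collinear)
det_i, cond_i = xtx_metrics(X_independent)

print(f"near-collinear: det={det_c:.3e}, cond={cond_c:.1f}")
print(f"independent:    det={det_i:.3e}, cond={cond_i:.1f}")
```

The determinant depends on the scale of the data (multiplying $X$ by a constant changes it), whereas the condition number is unaffected by a uniform rescaling, so the latter is usually easier to interpret.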

## 1.5 Impact of multicollinearity

With multicollinearity, we are concerned that the regression estimates
* are imprecise.
* will change a lot in response to small changes in new data.
* will perform badly out of sample.

To investigate...
* report the t-stats of the betas
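The link between multicollinearity and imprecise estimates can be sketched with simulated data. The helper `ols_tstats` below is a hypothetical name; it computes classic (homoskedastic) OLS standard errors by hand, which should match what `statsmodels` reports by default.

```python
import numpy as np

def ols_tstats(X, y):
    """Return (coefficients, t-stats) for y regressed on [1, X],
    using classic homoskedastic OLS standard errors."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    dof = Z.shape[0] - Z.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return coef, coef / np.sqrt(np.diag(cov))

# Simulated regressors (assumption): x2 is either independent of x1
# or nearly identical to it.
rng = np.random.default_rng(3)
n = 240
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)
x2_coll = x1 + 0.05 * rng.normal(size=n)
eps = rng.normal(size=n)

tstats = {}
for label, x2 in [("independent", x2_indep), ("near-collinear", x2_coll)]:
    y = 0.5 + x1 + x2 + eps  # same true betas in both cases
    _, t = ols_tstats(np.column_stack([x1, x2]), y)
    tstats[label] = t
    print(label, np.round(t, 2))
```

Both cases use the same true betas; only the dependence between the regressors differs, so any drop in the t-stats comes from the inflated standard errors.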

## 1.X Extra
Estimate the regression, but this time using only data through 2019. 
* Apply these estimated betas to the data in 2020-2022 to construct the replication of `PSP` ($\hat{y}$) out of sample.
* What is the correlation of PSP in 2020-2022 versus this out-of-sample regression estimate?

Graph `PSP` against the regression estimate, both through 2019 (in sample) and 2020-2022 (out of sample).
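The split-and-apply step might look like the sketch below. The monthly dates, the regressor names `ETF1` and `ETF2`, and the simulated returns are all assumptions; the helper `add_const` is likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated monthly data with a date index (assumption, not the real file).
rng = np.random.default_rng(4)
dates = pd.date_range("2012-01-01", "2022-12-01", freq="MS")
X = pd.DataFrame(
    rng.normal(scale=0.04, size=(len(dates), 2)),
    index=dates,
    columns=["ETF1", "ETF2"],
)
y = 0.002 + 0.9 * X["ETF1"] + 0.4 * X["ETF2"] + rng.normal(scale=0.01, size=len(dates))
y.name = "PSP"

def add_const(df):
    """Prepend a column of ones so the regression includes an intercept."""
    return np.column_stack([np.ones(len(df)), df.to_numpy()])

# Partial-string slicing on a DatetimeIndex includes the whole end year.
X_in, y_in = X.loc[:"2019"], y.loc[:"2019"]
X_out, y_out = X.loc["2020":"2022"], y.loc["2020":"2022"]

# Estimate on the in-sample window only.
coef, *_ = np.linalg.lstsq(add_const(X_in), y_in.to_numpy(), rcond=None)

# Apply the in-sample estimates to the out-of-sample regressors.
y_hat_in = pd.Series(add_const(X_in) @ coef, index=X_in.index)
y_hat_out = pd.Series(add_const(X_out) @ coef, index=X_out.index)

oos_corr = y_out.corr(y_hat_out)
print("out-of-sample correlation:", round(oos_corr, 3))
```

The key point is that `coef` never sees the 2020-2022 observations. For the graph, plotting `y_in` against `y_hat_in` and `y_out` against `y_hat_out` (for example with `DataFrame.plot`) would cover both windows.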

# Appendix: Condition number of a matrix

$\newcommand{\olsb}{\boldsymbol{b}}$
$\newcommand{\olsy}{\boldsymbol{y}}$

Consider the linear equation

$$\olsy = a + X\olsb + e$$

The OLS estimate of $\olsb$ solves the normal equations,
$$(X'X)\olsb = X'\olsy$$

Denote the condition number of $X'X$ as $\kappa$.

Then,
$$\frac{||\delta \olsb||}{||\olsb||} \le \kappa \frac{||\delta X'\olsy||}{||X'\olsy||}$$

#### This says that 
- a relative error of size $\delta$ in the covariation of $X$ and $\olsy$ (that is, in $X'\olsy$)
- can lead to a relative error of up to $\kappa\delta$ in the estimate of $\olsb$.
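The bound can be checked numerically. The sketch below uses simulated, nearly collinear regressors and a small hypothetical perturbation `dc` to $X'\olsy$; norms are Euclidean, matching the default 2-norm condition number from `numpy.linalg.cond`.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated, nearly collinear regressors (assumption for illustration).
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=200)

A = X.T @ X              # X'X
c = X.T @ y              # X'y
kappa = np.linalg.cond(A)

b = np.linalg.solve(A, c)

# Hypothetical small perturbation of X'y, and the resulting change in b.
dc = 1e-3 * rng.normal(size=2)
db = np.linalg.solve(A, c + dc) - b

lhs = np.linalg.norm(db) / np.linalg.norm(b)
rhs = kappa * np.linalg.norm(dc) / np.linalg.norm(c)

print(f"kappa               = {kappa:.1f}")
print(f"relative change in b = {lhs:.3e}")
print(f"upper bound          = {rhs:.3e}")
```

The bound is a worst case: the realized change in $\olsb$ depends on the direction of the perturbation and can be far smaller than $\kappa\delta$.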

***

# 2 Heteroskedasticity \& Serial Correlation

$$\newcommand{\rspyt}{r_{\text{spy},t}}$$

## Data

This problem uses the file `../data/spy_rates_data.xlsx`.
* Returns for SPY, which tracks the S\&P 500. Denote this as $\rspyt$.
* Dividend-price ratio for the S\&P 500.
* 10-year yields on US Treasuries.

## 2.1
Use linear regression to calculate whether S\&P 500 returns (SPY) are impacted by 10-year yields and the dividend-price ratio.

$$\begin{align}
\rspyt = \alpha + \boldsymbol{\beta}'\boldsymbol{X}_t + \epsilon_t
\label{eq:spy_on_macro}
\end{align}$$

where $\boldsymbol{X}_t$ denotes the vector of time-$t$ values of the 10-year yield and the dividend-price ratio.

Report the betas.

## 2.2
Use `statsmodels.api.OLS` to estimate the regression, and print the `summary()` of the results, which shows the t-stats, p-values, etc. Is either of the regressors statistically significant?

## 2.3
Calculate the correlation between the sample residuals, $e_t$, and their lagged value, $e_{t-1}$. Are they highly correlated? You may find it helpful to use `.shift()` in pandas to get the lagged series.
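The lagged correlation can be sketched as follows. The series `resid` is a simulated AR(1) process standing in for the regression residuals (an assumption); in the homework it would come from the fitted model instead.

```python
import numpy as np
import pandas as pd

# Simulated AR(1) series standing in for the regression residuals.
rng = np.random.default_rng(6)
n = 500
shocks = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + shocks[t]
resid = pd.Series(e, name="e")

# shift(1) moves each value down one row, so row t holds e_{t-1}.
# Series.corr ignores the row where the lagged value is NaN.
lag_corr = resid.corr(resid.shift(1))
print("corr(e_t, e_{t-1}):", round(lag_corr, 3))

# pandas also offers a shortcut for the same calculation.
print("autocorr(lag=1):   ", round(resid.autocorr(lag=1), 3))
```

The first observation has no lagged value, so the correlation is computed over the remaining `n - 1` pairs.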

## 2.4
Calculate the regression of
$$e_t = \alpha + \boldsymbol{\beta}'\boldsymbol{X}_t + u_t$$

## 2.5
What do the previous two calculations have to do with identifying serial correlation and heteroskedasticity?