# Assignment 3.3 - Regularized Regression

## DRW & UofC Quant Foundations
### Summer 2024
#### Mark Hendricks
#### hendricks@uchicago.edu

***

# Penalized Regression

$$\newcommand{\nsecs}{450}$$
$$\newcommand{\target}{GLD}$$
$$\newcommand{\spy}{\text{spy}}$$
$$\newcommand{\hyg}{\text{hyg}}$$

## Data
* This homework uses the file `data/spx_weekly_returns_single_names.xlsx`.
* Find the data in the GitHub repo associated with the module (link on Canvas).

The data file contains...
* Return rates, $r^{\target}_t$, for $\target$ (an ETF), which tracks the returns on gold.
* Return rates, $r^i_t$, for $\nsecs$ single-name equities. 

#### Note
There are fewer than 500 return series in the sample because securities with insufficient return histories were filtered out.

## Model
Consider a regression of the $\target$ return, denoted $r^{\target}$, on the returns of all $k = \nsecs$ S&P 500 stocks in the sample.

$$
r^{\target}_t = \alpha + \sum_{j=1}^k \beta^j r^j_t + \epsilon_t
\label{eq:REG}
$$

We refer to this equation below as the `MODEL`.

***

# 1. 

Estimate the `MODEL` with OLS.

#### Note
For this OLS estimation, along with the estimations below, try using scikit-learn in Python:

`from sklearn.linear_model import LinearRegression, Lasso, Ridge`

For OLS specifically, try

`model_ols = LinearRegression().fit(X,y)`
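
As a minimal sketch of the OLS step, the example below fits `LinearRegression` and reads off the R-squared, intercept, and betas. The data are simulated stand-ins (an assumption, not the homework file), and the names `T`, `k`, and `true_beta` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the homework data (assumption): T weeks, k regressors.
rng = np.random.default_rng(0)
T, k = 200, 5
X = rng.normal(scale=0.03, size=(T, k))
true_beta = np.array([0.5, -0.2, 0.0, 0.1, 0.3])
y = 0.001 + X @ true_beta + rng.normal(scale=0.01, size=T)

model_ols = LinearRegression().fit(X, y)

r_squared = model_ols.score(X, y)   # in-sample R-squared
alpha_hat = model_ols.intercept_    # estimated intercept
beta_hat = model_ols.coef_          # estimated betas, one per column of X
print(f"R-squared: {r_squared:.4f}")
```

With the actual data, `X` would hold the single-name returns and `y` the target return series.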

### 1.1. 
Report the R-squared.
### 1.2. 
Which factors have the largest betas in explaining $r^{\target}$?
### 1.3. 
Calculate $\beta^j \sigma^j$ for each regressor. Which of these is largest in magnitude, and thus most influential in explaining $r^{\target}$?

#### Note
A beta may be large simply because the regressor's volatility is small. Scaling each beta by the regressor's volatility gives a better idea of which regressor drives the most variation.
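
The scaling argument can be illustrated with a small simulated example. The sketch below assumes two hypothetical regressors: one with low volatility and a large true beta, and one with high volatility and a smaller true beta. It then compares the ranking by $\beta^j$ with the ranking by $\beta^j \sigma^j$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
T = 500
# Two synthetic regressors (assumption): the first has low volatility, the second high.
sigma = np.array([0.01, 0.05])
X = rng.normal(size=(T, 2)) * sigma
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.005, size=T)

beta = LinearRegression().fit(X, y).coef_
scaled = beta * X.std(axis=0)       # beta^j * sigma^j for each regressor

largest_beta = int(np.argmax(np.abs(beta)))        # index of the largest |beta|
most_influential = int(np.argmax(np.abs(scaled)))  # index of the largest |beta * sigma|
print(largest_beta, most_influential)
```

By construction, the low-volatility regressor carries the larger beta while the high-volatility regressor carries the larger scaled beta, so the two rankings disagree.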

### 1.4. 
Report the matrix condition number of $R'R$, where $R$ denotes the matrix of single-name equity return data. Why should this condition number give us pause about trusting the OLS estimates out-of-sample?

#### Note
To get the matrix condition number, consider using, in Python, `numpy.linalg.cond()`.
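
A minimal sketch of the condition-number calculation, using a simulated return matrix in place of the homework data (an assumption). One column is made nearly identical to another to mimic highly correlated stocks, which inflates the condition number of $R'R$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 300, 4
R = rng.normal(scale=0.03, size=(T, k))
# Make the last column nearly a copy of the first (assumption, to mimic collinearity).
R[:, -1] = R[:, 0] + rng.normal(scale=1e-4, size=T)

cond_RR = np.linalg.cond(R.T @ R)

# For comparison: the same matrix without the near-duplicate column.
cond_clean = np.linalg.cond(R[:, :-1].T @ R[:, :-1])
print(f"with near-duplicate: {cond_RR:.3e}, without: {cond_clean:.3e}")
```

A large condition number means that inverting $R'R$ amplifies small changes in the data, so the OLS betas can move sharply from one sample to the next.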

***

# 2. 

Estimate `MODEL` with Ridge Regression. 
* Use a penalty of `ALPHA=0.5` in the estimation.
* Try using `est = Ridge(alpha=ALPHA).fit(X,y)`
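
A minimal sketch of the Ridge step on simulated data (an assumption, standing in for the homework file), with OLS fit alongside for comparison. The variable `shrinkage` is an illustrative name for the ratio of the coefficient norms.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

ALPHA = 0.5
rng = np.random.default_rng(4)
T, k = 200, 10
X = rng.normal(scale=0.03, size=(T, k))    # synthetic returns (assumption)
y = X @ rng.normal(size=k) + rng.normal(scale=0.01, size=T)

ols = LinearRegression().fit(X, y)
est = Ridge(alpha=ALPHA).fit(X, y)

print(f"OLS   R-squared: {ols.score(X, y):.4f}")
print(f"Ridge R-squared: {est.score(X, y):.4f}")

# Ridge shrinks coefficients toward zero but typically leaves them non-zero.
n_nonzero = int(np.count_nonzero(est.coef_))
shrinkage = np.linalg.norm(est.coef_) / np.linalg.norm(ols.coef_)
print(f"Non-zero betas: {n_nonzero} of {k}, norm ratio vs OLS: {shrinkage:.3f}")
```

In-sample, the Ridge R-squared cannot exceed the OLS R-squared, since OLS minimizes the in-sample sum of squared residuals.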

### 2.1.
Report the R-squared.
### 2.2.
Based on $\beta^j \sigma^j$, which factor is most influential for $r^{\target}$?

### 2.3.
Report the matrix condition number of $R'R$.
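
For context, Ridge with penalty $\alpha$ solves a system built on $R'R + \alpha I$ rather than $R'R$ (using demeaned data, since scikit-learn does not penalize the intercept). The sketch below compares the two condition numbers on simulated data. It assumes the penalized matrix is the relevant one for this comparison, and the near-collinear column is an illustrative construction.

```python
import numpy as np

ALPHA = 0.5
rng = np.random.default_rng(3)
T, k = 300, 4
R = rng.normal(scale=0.03, size=(T, k))
R[:, -1] = R[:, 0] + rng.normal(scale=1e-4, size=T)  # near-collinear column (assumption)

Rc = R - R.mean(axis=0)     # demean each column, matching a fit with an intercept
gram = Rc.T @ Rc

cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + ALPHA * np.eye(k))
print(f"R'R: {cond_ols:.3e}, R'R + alpha*I: {cond_ridge:.3e}")
```

Adding $\alpha I$ raises every eigenvalue by $\alpha$, which lowers the ratio of the largest to the smallest eigenvalue.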

### 2.4.
How many regressors have non-zero beta estimates?

***

# 3. 

Estimate `MODEL` with LASSO Regression. 
* Use a penalty of `ALPHA=7e-5` in the estimation.
* Try using `est = Lasso(alpha=ALPHA).fit(X,y)`
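
A minimal sketch of the LASSO step, again on simulated data (an assumption). Only three of the hypothetical regressors have a true effect, so the example shows how the L1 penalty sets many betas exactly to zero and how to count the survivors.

```python
import numpy as np
from sklearn.linear_model import Lasso

ALPHA = 7e-5
rng = np.random.default_rng(5)
T, k = 200, 20
X = rng.normal(scale=0.03, size=(T, k))
true_beta = np.zeros(k)
true_beta[:3] = [0.8, -0.5, 0.3]    # only three regressors matter (assumption)
y = X @ true_beta + rng.normal(scale=0.005, size=T)

est = Lasso(alpha=ALPHA).fit(X, y)

n_nonzero = int(np.count_nonzero(est.coef_))
print(f"R-squared: {est.score(X, y):.4f}")
print(f"Non-zero betas: {n_nonzero} of {k}")
```

The appropriate size of `ALPHA` depends on the scale of the data; weekly returns are small numbers, which is why such a small penalty still has an effect.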

### 3.1.
Report the R-squared.

### 3.2.
Based on $\beta^j \sigma^j$, which factor is most influential for $r^{\target}$?

### 3.3.
Report the matrix condition number of $R'R$.

### 3.4.
How many regressors have non-zero beta estimates?

***

# 4.

How do the estimations compare across the three methods?

### 4.1.
Create a histogram of estimated betas across the three methods (OLS, Ridge, LASSO).

Are they all nonzero? Are there positive and negative values? Do they range widely in magnitude? 
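
One way to make the three histograms comparable is to bin all betas on a shared grid. The sketch below does this with `numpy.histogram` on simulated data (an assumption); the penalties are the ones from the earlier problems, and the dictionary names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(6)
T, k = 200, 30
X = rng.normal(scale=0.03, size=(T, k))    # synthetic returns (assumption)
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.005, size=T)

betas = {
    "OLS": LinearRegression().fit(X, y).coef_,
    "Ridge": Ridge(alpha=0.5).fit(X, y).coef_,
    "LASSO": Lasso(alpha=7e-5).fit(X, y).coef_,
}

# Shared bin edges so the three histograms are directly comparable.
all_betas = np.concatenate(list(betas.values()))
edges = np.linspace(all_betas.min(), all_betas.max(), 21)
counts = {name: np.histogram(b, bins=edges)[0] for name, b in betas.items()}

for name, b in betas.items():
    print(f"{name:5s} nonzero={np.count_nonzero(b):3d} "
          f"min={b.min():+.3f} max={b.max():+.3f}")
```

The same `edges` can be passed to a plotting call such as matplotlib's `plt.hist(b, bins=edges)` to draw the figure.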

### 4.2.
Which has the largest R-squared? Is this a surprise?

***

# 5.

Try using cross-validation (with K-folds) to estimate the penalty parameter for Ridge and LASSO.

Run the cross-validation using two estimators from `sklearn.linear_model`:
* `RidgeCV`
* `LassoCV`

Feel free to use the default parameters, including the default number of folds.

Report the CV penalty parameter for Lasso and Ridge.
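
A minimal sketch of the cross-validation step on simulated data (an assumption). With default arguments, `LassoCV` runs 5-fold cross-validation over an automatically chosen grid of penalties, while `RidgeCV` uses an efficient leave-one-out scheme over the grid `(0.1, 1.0, 10.0)`; passing `cv=5` or a custom `alphas` grid to `RidgeCV` is an option if K-fold or a finer grid is wanted.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(7)
T, k = 200, 20
X = rng.normal(scale=0.03, size=(T, k))    # synthetic returns (assumption)
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.005, size=T)

ridge_cv = RidgeCV().fit(X, y)   # default grid of penalties, leave-one-out CV
lasso_cv = LassoCV().fit(X, y)   # automatic grid of penalties, 5-fold CV

# The selected penalty is stored in the fitted attribute `alpha_`.
print(f"Ridge CV penalty: {ridge_cv.alpha_}")
print(f"LASSO CV penalty: {lasso_cv.alpha_}")
```

The fitted `alpha_` values can then be passed to `Ridge(alpha=...)` and `Lasso(alpha=...)` in later estimations.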

***

# 6.

Estimate the models using data through 2022, then use those estimates to fit the data for 2023-2024. Use the CV penalty parameters (from the previous problem) for Ridge and LASSO.

* What is the R-squared in these out-of-sample fits?


#### Note
This is straightforward in Python. For instance, for the LASSO estimation, you could try

`est = Lasso(alpha=ALPHA).fit(X_insamp,y_insamp)`

`score_is = est.score(X_insamp,y_insamp)`

`score_oos = est.score(X_oos,y_oos)`

Which method does better out-of-sample?
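
The in-sample versus out-of-sample comparison can be sketched as follows. The data and weekly dates are simulated (an assumption), and the penalties are placeholders for the CV values from the previous problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(8)
# Synthetic weekly dates and returns (assumption, standing in for the homework file).
dates = np.arange(np.datetime64("2015-01-02"), np.datetime64("2024-07-01"),
                  np.timedelta64(7, "D"))
T, k = len(dates), 20
X = rng.normal(scale=0.03, size=(T, k))
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.005, size=T)

# In-sample: through 2022. Out-of-sample: 2023 onward.
insamp = dates < np.datetime64("2023-01-01")
X_insamp, y_insamp = X[insamp], y[insamp]
X_oos, y_oos = X[~insamp], y[~insamp]

# Penalties are placeholders (assumption); in the homework they come from the CV step.
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=0.1),
    "LASSO": Lasso(alpha=7e-5),
}
scores = {}
for name, model in models.items():
    est = model.fit(X_insamp, y_insamp)    # estimate on in-sample data only
    scores[name] = (est.score(X_insamp, y_insamp), est.score(X_oos, y_oos))
    print(f"{name:5s} in-sample R2={scores[name][0]:.3f} "
          f"out-of-sample R2={scores[name][1]:.3f}")
```

If the returns are held in a pandas DataFrame with a `DatetimeIndex`, the same split can be written with label slicing, for example `.loc[:'2022']` and `.loc['2023':]`.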

***