**Work completed:** Implemented missing functions in `Reg.py` and `auxiliary_func.py` (proximal-gradient solvers, Huber gradient/loss, PCR/PLS, OLS/OLSH, PCAR/PLSR, oracle regression, and VIP utilities).

# Part 0: Setup
Run the following import statements. Make sure this cell runs without errors, since the rest of the notebook depends on it.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

%matplotlib inline

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

# Part 1: R2 Average of models
To assess predictive performance for individual excess stock return forecasts,
we calculate the out-of-sample $R^2$ as

$$ R^{2}_{OOS} = 1-\frac
{\sum_{(i,t) \in \mathcal{T}_3} \;\;(r_{i,t+1} - \hat{r}_{i,t+1})^{2}}
{\sum_{(i,t) \in \mathcal{T}_3} \;\; r_{i,t+1}^{2}}
$$

where $\mathcal{T}_3$ indicates that fits are only assessed on the testing subsample, whose data never enter into model estimation or tuning. $R^{2}_{OOS}$ pools prediction errors across firms and over time into a grand panel-level assessment of each model.

A subtle but important aspect of our $R^2$ metric is that the denominator is the sum of squared excess returns without demeaning. In many out-of-sample forecasting applications, predictions are compared against historical mean returns. Although this approach is sensible for the aggregate index or long-short portfolios, for example, it is flawed when it comes to analyzing individual stock returns. Predicting future excess stock returns with historical averages typically underperforms a naive forecast of zero by a large margin. That is, the historical mean stock return is so noisy that it artificially lowers the bar for “good” forecasting performance. We avoid this pitfall by benchmarking our $R^2$ against a forecast value of zero. To give an indication of the importance of this choice, when we benchmark model predictions against historical mean stock returns, the out-of-sample monthly $R^2$ of all methods rises by roughly three percentage points.
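As a minimal sketch of the formula above, the function below computes $R^2_{OOS}$ against a zero-forecast benchmark. The function name `r2_oos` and the toy return values are assumptions made for illustration; they are not taken from `Table_R2.py`.

```python
import numpy as np

def r2_oos(r_true, r_pred):
    """Out-of-sample R^2 benchmarked against a forecast of zero."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    sse = np.sum((r_true - r_pred) ** 2)  # pooled squared prediction errors
    sst = np.sum(r_true ** 2)             # squared returns, NOT demeaned
    return 1.0 - sse / sst

# Made-up pooled test-sample returns and predictions (illustration only)
r_true = np.array([0.02, -0.01, 0.03, -0.02])
r_pred = np.array([0.01, 0.00, 0.02, -0.01])

print(r2_oos(r_true, r_pred))
print(r2_oos(r_true, np.zeros_like(r_true)))  # a zero forecast scores exactly 0
```

Because the denominator is not demeaned, a forecast of zero for every stock scores exactly zero, so a positive value means the model beats that naive benchmark.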

We have now fitted all models, including the regressions and the trees, and stored their in-sample and out-of-sample $R^{2}$ values. To compare the models, we first compute the mean $R^{2}$ of each one across simulations. Running more simulations gives a more accurate evaluation, because the variability of the average shrinks as the number of runs grows. Open the file `Table_R2.py` and read through the function. You don't need to modify or add code, but make sure you understand what the function is doing.
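The averaging step can be sketched as follows. This is not the code in `Table_R2.py`: the model labels, the "true" $R^2$ values, and the noise level are all hypothetical, and random draws stand in for the stored results. It only illustrates that, for independent runs, the standard error of the mean falls like $1/\sqrt{MM}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-run R^2 values: rows are simulation runs, columns are models.
# In the notebook these would come from the stored results, not random draws.
model_names = ["OLS", "Lasso", "Enet"]      # illustrative labels
assumed_r2 = np.array([0.010, 0.030, 0.035])  # assumed values, not real results
MM = 10
r2_runs = assumed_r2 + 0.01 * rng.standard_normal((MM, len(model_names)))

mean_r2 = r2_runs.mean(axis=0)
# Standard error of the mean across independent runs
se_r2 = r2_runs.std(axis=0, ddof=1) / np.sqrt(MM)

for name, m, s in zip(model_names, mean_r2, se_r2):
    print(f"{name}: mean R2 = {m:.4f} (s.e. {s:.4f})")
```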

This section draws on the paper [Empirical Asset Pricing via Machine Learning](https://dachxiu.chicagobooth.edu/download/ML.pdf).

In [None]:
from Table_R2 import table_r2
MM = 10
table_r2(MM)

In [None]:
for hh in [1, 3, 6, 12]:
    path = "TableR2_%d.csv" % hh
    if os.path.exists(path):
        data = pd.read_csv(path, delimiter=',', header=None)
        print(data)

# Part 4: Variable Selection Frequencies
In this section, we generate average variable selection frequencies. We want to understand which variables the six models **Lasso, Lasso+H, Enet, Enet+H, GLasso, GLasso+H** select, and in particular which ones are chosen most frequently. To do so, we read through all of our previous results and average the selection frequencies. Open the file `Table_selection.py` and go through the function `selection()` carefully to understand what it outputs. Then run it and check the results.
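One simple way to think about a selection frequency is the fraction of runs in which a variable receives a nonzero coefficient. The sketch below assumes that definition; the helper `selection_frequency` and the coefficient matrix are hypothetical, and `selection()` in `Table_selection.py` may compute it differently.

```python
import numpy as np

def selection_frequency(coefs, tol=0.0):
    """Fraction of runs in which each coefficient is nonzero.

    coefs: array of shape (n_runs, n_features), one fitted vector per run.
    tol:   magnitudes at or below this threshold count as "not selected".
    """
    coefs = np.asarray(coefs, dtype=float)
    selected = np.abs(coefs) > tol
    return selected.mean(axis=0)

# Made-up coefficients from four runs of a sparse model (illustration only)
coefs = np.array([
    [0.5, 0.0,  0.0],
    [0.4, 0.0, -0.1],
    [0.6, 0.0,  0.0],
    [0.5, 0.0,  0.2],
])
print(selection_frequency(coefs))
```

Here the first variable is selected in every run, the second never, and the third in half of the runs.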

This section draws on the paper [Empirical Asset Pricing via Machine Learning](https://dachxiu.chicagobooth.edu/download/ML.pdf).

In [None]:
from Table_selection import selection
selection()

In [None]:
if os.path.exists("TableSelect.csv"):
    data = pd.read_csv("TableSelect.csv", delimiter=',', header=None)
    print(data)

# Part 5: Average Variable Importance

Our goal in interpreting machine learning models is modest. We aim to identify covariates that have an important influence on the cross-section of expected returns while simultaneously controlling for the many other predictors in the system.

We discover influential covariates by ranking them according to a notion of variable importance, which we denote as $\text{VI}_{j}$ for the $j$th input variable. We consider two different notions of importance. The first is the reduction in panel predictive $R^2$ from setting all values of predictor $j$ to zero, while holding the remaining model estimates fixed (used, e.g., in the context of dimension reduction by Kelly, Pruitt, and Su 2019). The second, proposed in the neural networks literature by Dimopoulos, Bourret, and Lek (1995), is the sum of squared partial derivatives (SSD) of the model to each input variable $j$, which summarizes the sensitivity of model fits to changes in that variable.
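Both notions of importance can be sketched for any fitted prediction function. The code below is an illustration under stated assumptions, not the implementation in `Table_VIP.py`: the helper names, the simulated data, and the fixed linear `predict` function are hypothetical, and the partial derivatives are approximated with central finite differences.

```python
import numpy as np

def r2_zero_benchmark(y, yhat):
    """R^2 with an un-demeaned denominator (zero-forecast benchmark)."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum(y ** 2)

def vi_r2_reduction(predict, X, y):
    """Drop in R^2 when predictor j is set to zero, model held fixed."""
    base = r2_zero_benchmark(y, predict(X))
    vi = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X0 = X.copy()
        X0[:, j] = 0.0
        vi[j] = base - r2_zero_benchmark(y, predict(X0))
    return vi

def vi_ssd(predict, X, eps=1e-5):
    """Sum over observations of squared partial derivatives w.r.t. input j."""
    ssd = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        step = np.zeros(X.shape[1])
        step[j] = eps
        grad_j = (predict(X + step) - predict(X - step)) / (2.0 * eps)
        ssd[j] = np.sum(grad_j ** 2)
    return ssd

# Simulated data and a fixed linear "model" (assumptions for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta = np.array([1.0, 0.0, -0.5])
y = X @ beta + 0.1 * rng.standard_normal(200)

def predict(Z):
    return Z @ beta

vi = vi_r2_reduction(predict, X, y)
ssd = vi_ssd(predict, X)
print("R2 reduction:", vi)
print("SSD:         ", ssd)
```

For a linear model the partial derivative with respect to input $j$ is just its coefficient, so the SSD equals the number of observations times the squared coefficient, and a variable with a zero coefficient has zero importance under both measures.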

As part of our analysis, we also trace out the marginal relationship between expected returns and each characteristic. Despite obvious limitations, such a plot is an effective tool for visualizing the first-order impact of covariates in a machine learning model.

In this section, we generate the average variable importance, defined as the decrease in in-sample $R^2$ when a given variable is excluded. We want to understand which variables matter most for prediction in the six models **Lasso, Lasso+H, Enet, Enet+H, GLasso, GLasso+H**. To do so, we read through all of our previous results and average the variable importance. Open the file `Table_VIP.py` and go through the function `importance()` carefully to understand what it outputs. Then run it and check the results.


This section draws on the paper [Empirical Asset Pricing via Machine Learning](https://dachxiu.chicagobooth.edu/download/ML.pdf).

In [None]:
from Table_VIP import importance
importance()

In [None]:
if os.path.exists("TableVIP.csv"):
    data = pd.read_csv("TableVIP.csv", delimiter=',', header=None)
    print(data)