### Part 1: Prediction quality vs feature selection

#### Tasks

In this task you are supposed to

1. repeatedly simulate new datasets and, for each dataset, carry out Steps 2 and 3
2. determine $λ_{min}$ and $λ_{1se}$ using cross-validation on the training data
3. compare the two resulting Lasso models with respect to
    - Mean squared error on the training as well as test data
    - Feature selection quality in comparison to the original simulated regression coefficients


Let us look a bit closer at each task.

1. Simulate data
Simulating useful data for the Lasso isn't complicated, but to make sure you can focus on the interesting tasks I included a Python and an R version of a data-generating function. You can ignore them and follow the description here, use them as they are, or take them as inspiration to write your own. The ideas are

    1. Generate n observations from an isotropic normal distribution $N(0, I_p)$, i.e. theoretically uncorrelated features.
    2. Generate regression coefficients beta as a vector that contains ceiling((1 - sparsity) p) non-zero elements [ceiling = smallest integer greater than or equal to] that are normally distributed with a standard deviation of beta_scale. All other elements are zero.
    3. At this point $X\beta$ is the noise-less response and $\sqrt{\|X\beta\|_2^2 / (n - 1)}$ is its sample standard deviation. The signal-to-noise ratio (SNR) is the ratio of the standard deviation of the signal (the noise-less response) to the standard deviation of the noise, i.e. SNR = sd_signal / sd_noise. Specifying the SNR is a convenient way to determine a reasonable noise standard deviation by choosing sd_noise = sd_signal / SNR. As an example, SNR = 2 means that the standard deviation of the noise-less response is twice as large as the standard deviation of the noise, and the noise-less response will be more "pronounced" the larger the SNR is. If 0 < SNR < 1, the noise is stronger than the signal, which is hard to deal with for most methods.
    4. Finally, the response is created in the form of a linear model $y = X\beta + \sigma\varepsilon$, where $\sigma$ = sd_noise and $\varepsilon \sim N(0, I_n)$.

In [1]:
import numpy as np

def simulate_data(n, p, rng, *, sparsity=0.95, SNR=2.0, beta_scale=5.0):
    """Simulate data for Project 3, Part 1.

    Parameters
    ----------
    n : int
        Number of samples
    p : int
        Number of features
    rng : numpy.random.Generator
        Random number generator (e.g. from `numpy.random.default_rng`)
    sparsity : float in (0, 1)
        Fraction of zero elements in simulated regression coefficients
    SNR : positive float
        Signal-to-noise ratio (see explanation above)
    beta_scale : float
        Standard deviation of the non-zero coefficients, chosen to make sure they are large

    Returns
    -------
    X : `n x p` numpy.array
        Matrix of features
    y : `n` numpy.array
        Vector of responses
    beta : `p` numpy.array
        Vector of regression coefficients
    """
    X = rng.standard_normal(size=(n, p))
    
    q = int(np.ceil((1.0 - sparsity) * p))
    beta = np.zeros((p,), dtype=float)
    beta[:q] = beta_scale * rng.standard_normal(size=(q,))
    
    sigma = np.sqrt(np.sum(np.square(X @ beta)) / (n - 1)) / SNR

    y = X @ beta + sigma * rng.standard_normal(size=(n,))

    # Shuffle columns so that non-zero features appear
    # not simply in the first (1 - sparsity) * p columns
    idx_col = rng.permutation(p)
    
    return X[:, idx_col], y, beta[idx_col]

In [3]:
p = 500     # Fix p at something large, e.g. 500 or 1000
n_list = [125, 250, 375]    # Let n vary compared to p, e.g. iterate through [200, 500, 750] if you set p = 1000. What truly matters here is the ratio p / n, so if you choose p differently, adjust your choices for n
sparsities = [0.75, 0.9, 0.95, 0.99]    # Let sparsity vary for a few choices, e.g. [0.75, 0.9, 0.95, 0.99]
SNR = 2     # You can fix SNR at something reasonable like 2 or 5 throughout
beta_scale = 5      # Same holds for beta_scale, maybe 5 or 10

In addition, it will help you tremendously in the interpretation of the results if you **repeat the simulations a few times**, say 5 or 10 times, for each choice of n and sparsity. Here is why you should be careful with your choices: if you choose three values for n and four for sparsity, as well as 5 repeats, then you need to run your simulations for 60 datasets. A setup with 50 datasets took about 2 minutes on a 2017 MacBook, so if it takes hours, you did something wrong :-)

It can be good to include intermediate print-outs or clock output throughout the code, e.g. for iteration numbers or cross-validation, so you can detect if there is a time sink somewhere.
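One way to organise the repeated runs with timing print-outs is sketched below. The names `run_grid` and `run_one` are hypothetical: `run_one` stands for whatever function you write to simulate one dataset and fit the models, and it is assumed to accept `(n, sparsity, rep)`.

```python
import time
from itertools import product


def run_grid(n_list, sparsities, n_repeats, run_one):
    """Call run_one(n, sparsity, rep) for every combination and time each call.

    run_one is a placeholder (assumption) for your own per-dataset routine.
    """
    results = []
    t_start = time.perf_counter()
    for n, sparsity, rep in product(n_list, sparsities, range(n_repeats)):
        t0 = time.perf_counter()
        out = run_one(n, sparsity, rep)
        results.append({"n": n, "sparsity": sparsity, "rep": rep, "out": out})
        # Per-iteration timing makes slow settings easy to spot
        print(f"n={n} sparsity={sparsity} rep={rep}: "
              f"{time.perf_counter() - t0:.2f}s", flush=True)
    print(f"total: {time.perf_counter() - t_start:.2f}s for {len(results)} datasets")
    return results
```

With three values for n, four sparsity levels and 5 repeats, the loop visits 3 × 4 × 5 = 60 combinations.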

In [None]:
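# Example call: one dataset for the first (n, sparsity) combination.
# The seed 0 is an arbitrary choice to make the run reproducible.
rng = np.random.default_rng(0)
X, y, beta = simulate_data(n_list[0], p, rng, sparsity=sparsities[0], SNR=SNR, beta_scale=beta_scale)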

2. Determine hyperparameters

This works as described above the "Tasks" section. Depending on the package you are using, you will have to perform different steps to get the coefficients and the predictions.
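As an illustration, here is a minimal sketch for Python, assuming scikit-learn's `LassoCV` (where the penalty λ is called `alpha`). scikit-learn only reports the minimiser directly, so the one-standard-error rule is applied by hand to `mse_path_`. The helper name `lambda_min_and_1se` and the choice of 5 folds are assumptions, and the standard error is computed as the fold-wise standard deviation divided by the square root of the number of folds, which is one common convention.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def lambda_min_and_1se(X, y, n_folds=5):
    """Return (lambda_min, lambda_1se) from a cross-validated Lasso path."""
    cv_fit = LassoCV(cv=n_folds).fit(X, y)

    # mse_path_ has shape (n_alphas, n_folds): one test MSE per alpha and fold
    mean_mse = cv_fit.mse_path_.mean(axis=1)
    se_mse = cv_fit.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

    # lambda_min: penalty with the smallest mean cross-validation error
    i_min = int(np.argmin(mean_mse))
    lambda_min = cv_fit.alphas_[i_min]

    # lambda_1se: largest penalty whose mean CV error lies within one
    # standard error of the minimum
    threshold = mean_mse[i_min] + se_mse[i_min]
    lambda_1se = cv_fit.alphas_[mean_mse <= threshold].max()
    return lambda_min, lambda_1se
```

The two values can then be used to refit `sklearn.linear_model.Lasso(alpha=...)` on the training data and read off `coef_` and `predict`. By construction, the second value is never smaller than the first.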

### 3. Comparing the two Lasso models

#### Computation of train/test MSE

Whenever you use LassoCV or cv.glmnet you can, at best, access the test error for each fold. To compute both the training and test MSE of the model, you have two options:

1. Write your own cross-validation loop. Then you can evaluate the training and test error and save both.
2. Instead of only creating a training dataset with the options given in Part 1, Task 1 you can do the following:
    - Fix n_test at some larger value, say, n_test = 500 or n_test = 1000
    - Instead of simulating n samples, you now generate n + n_test samples and split the dataset. Train on the n samples with cross-validation and use the remaining n_test samples for testing. You do not need to adapt the size of the test set to the size of the training dataset as long as you choose it large enough.

Option 2 is probably less tedious to implement, but the choice is up to you.
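Option 2 could be sketched as follows, assuming scikit-learn and a dataset of n + n_test rows that has already been simulated. The helper name `split_and_score` and the `max_iter` value are assumptions for illustration. Because the simulated rows are independent, taking the first n rows for training is as good as a random split.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error


def split_and_score(X_all, y_all, n_train, alpha):
    """Fit a Lasso on the first n_train rows, score on train and test parts."""
    X_train, y_train = X_all[:n_train], y_all[:n_train]
    X_test, y_test = X_all[n_train:], y_all[n_train:]

    # max_iter is raised from the default as a precaution (assumption)
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)

    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    return model, mse_train, mse_test
```

Calling it once with $λ_{min}$ and once with $λ_{1se}$ as `alpha` gives the four MSE values needed for one dataset.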

When reporting the results for the train/test MSE, **please do not simply report a table of numbers. Find a nice way to visualise the results.**

#### Question
 - How does the MSE of the $λ_{min}$ and $λ_{1se}$ models behave for different n and sparsity levels?

#### Evaluating the feature selection capability

Compute sensitivity and specificity using the 0-1 codings of the true and estimated coefficient vectors and plot them in a scatter plot.
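A minimal sketch of this computation with NumPy is shown below. The 0-1 coding is taken to mean 1 for a non-zero coefficient and 0 otherwise. The helper name `sensitivity_specificity` and the optional threshold `tol` are assumptions.

```python
import numpy as np


def sensitivity_specificity(beta_true, beta_hat, tol=0.0):
    """Sensitivity and specificity of the estimated support.

    A coefficient counts as selected if its absolute value exceeds tol.
    """
    true_nz = np.abs(np.asarray(beta_true)) > 0
    est_nz = np.abs(np.asarray(beta_hat)) > tol

    tp = np.sum(true_nz & est_nz)    # truly non-zero and selected
    fn = np.sum(true_nz & ~est_nz)   # truly non-zero but missed
    tn = np.sum(~true_nz & ~est_nz)  # truly zero and not selected
    fp = np.sum(~true_nz & est_nz)   # truly zero but selected

    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    return sensitivity, specificity
```

For example, `sensitivity_specificity([1, 0, 0, 2], [0.5, 0.1, 0, 0])` finds one of the two true features and wrongly selects one of the two zero features, so both values are 0.5.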


#### Questions

 - What differences between sensitivity/specificity computed from the $λ_{min}$ and the $λ_{1se}$ models can you observe?
 - How do different choices for n and sparsity affect the relationship of sensitivity/specificity?