# LAB3: Sparsity
Author: Mathurin Massias (mathurin.massias@gmail.com)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

from scipy.io import loadmat

from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import train_test_split

from lab3_utils import create_random_data

## Dataset generation and model fitting

In [None]:
def train_test_data(n_samples, n_features, n_informative_features, 
                    noise_level):
    X, y = create_random_data(n_samples, n_features, n_informative_features, 
                          noise_level=noise_level)
    print("X shape:", X.shape)
    print("y shape:", y.shape)
    train_size = 0.8  # proportion of dataset used for training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, shuffle=False, train_size=train_size)
    return X_train, X_test, y_train, y_test

In [None]:
n_samples = 100
n_features = 200
n_informative_features = 50

# first, noiseless data, split into train and test/validation parts
X_train, X_test, y_train, y_test = train_test_data(
    n_samples, n_features, n_informative_features, noise_level=0.)
print("Training dataset shape:", X_train.shape)
print("Testing dataset shape:", X_test.shape)

In sklearn, the objective function of the ElasticNet optimization is:
$$\frac{1}{2 \times \text{n_samples}} \Vert y - X \beta \Vert_2^2 + \alpha \times \left( \text{l1_ratio} \times \Vert \beta \Vert_1 + \frac{1 - \text{l1_ratio}}{2} \Vert \beta \Vert_2^2\right)$$
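
As a sanity check of the formula above, here is a minimal sketch that evaluates the objective with NumPy and compares its value at the fitted coefficients with its value at two other points. The helper `enet_objective` and the toy data are hypothetical and introduced only for illustration. The model is fitted with `fit_intercept=False`, since the formula as written has no intercept term.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def enet_objective(X, y, beta, alpha, l1_ratio):
    """Elastic net objective as written above (hypothetical helper, no intercept)."""
    n_samples = X.shape[0]
    datafit = np.sum((y - X @ beta) ** 2) / (2 * n_samples)
    penalty = alpha * (l1_ratio * np.abs(beta).sum()
                       + 0.5 * (1 - l1_ratio) * (beta @ beta))
    return datafit + penalty


# toy data, only for illustration (not the lab's dataset)
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((50, 20))
beta_true = np.zeros(20)
beta_true[:3] = 1.
y_toy = X_toy @ beta_true + 0.1 * rng.standard_normal(50)

alpha, l1_ratio = 0.1, 0.5
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                   fit_intercept=False, tol=1e-8)
model.fit(X_toy, y_toy)

# objective at the fitted coefficients, at zero, and at a perturbed point
obj_fit = enet_objective(X_toy, y_toy, model.coef_, alpha, l1_ratio)
obj_zero = enet_objective(X_toy, y_toy, np.zeros(20), alpha, l1_ratio)
beta_perturbed = model.coef_ + 0.01 * rng.standard_normal(20)
obj_perturbed = enet_objective(X_toy, y_toy, beta_perturbed, alpha, l1_ratio)
print(obj_fit, obj_zero, obj_perturbed)
```

The fitted coefficients should give the smallest of the three values, since the solver minimizes this objective.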

See the docstring for more information in the next cell:

In [None]:
ElasticNet?

In [None]:
# create an ElasticNet estimator (a regressor, despite the variable name clf)
# with arbitrary values for the L1 and L2 penalization
clf = ElasticNet(alpha=0.1, l1_ratio=0.1)

In [None]:
# fit the model and print some of its first coefficients
# beware that sklearn fits an intercept by default
clf.fit(X_train, y_train)
print("50 first coefficients of estimated w:\n", clf.coef_[:50])
print("Intercept: %f" % clf.intercept_)
print("Nonzero coefficients: %d" % (clf.coef_ != 0.).sum())
print("Training error: %.4f" % np.mean((y_train - clf.predict(X_train)) ** 2))
# TODO compute testing error on left out data
# print("Testing error: %.4f" % )
# TODO bonus: why is clf.predict(X_train) not equal to X_train @ clf.coef_? 
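
One possible way to approach the two TODOs above, shown on hypothetical toy data rather than on the lab's dataset: the test error is the mean squared error on the held-out samples, and the prediction can be rebuilt by hand from `coef_` and `intercept_`. All names below (`X_toy`, `model`, ...) are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# toy data with a constant offset of 3, only for illustration
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((100, 30))
beta_true = np.zeros(30)
beta_true[:5] = 2.
y_toy = X_toy @ beta_true + 3. + 0.1 * rng.standard_normal(100)

# same convention as in the lab: first 80% for training, no shuffling
X_tr, X_te = X_toy[:80], X_toy[80:]
y_tr, y_te = y_toy[:80], y_toy[80:]

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_tr, y_tr)

# mean squared error on left-out data
test_err = np.mean((y_te - model.predict(X_te)) ** 2)
print("Testing error: %.4f" % test_err)

# predict() adds the fitted intercept to the linear part
manual_pred = X_te @ model.coef_ + model.intercept_
print(np.allclose(manual_pred, model.predict(X_te)))
```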

In [None]:
# test the influence of l1_ratio on the sparsity of the solution
# and on the train/test error
l1_ratios = [0., 0., 0., 0., 0.]  # TODO choose your own values between 0 and 1

train_errs = np.zeros(len(l1_ratios))
test_errs = np.zeros_like(train_errs)
sparsity = np.zeros_like(train_errs)

for i, l1_ratio in enumerate(l1_ratios):
    clf = # TODO; you may need to tune alpha a bit too.
    # TODO fit the model on train data
    # TODO compute train and test errors
    train_errs[i] = 
    test_errs[i] = 
    sparsity[i] = # number of non-zero elements in clf.coef_
    
plt.figure()
plt.plot(l1_ratios, test_errs, label='Test error')
plt.plot(l1_ratios, train_errs, label='Train error')
plt.xlabel("l1_ratio")
plt.legend()

plt.figure()
plt.plot(l1_ratios, sparsity)
plt.ylabel(r'$||w||_0$')
plt.xlabel('l1_ratio')

In [None]:
# TODO do the same for the influence of alpha, with a fixed l1_ratio
# What happens when alpha becomes too big?
alphas = np.geomspace(1e-4, 1e4, num=9)
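
A hint for the question above, as a minimal sketch on hypothetical toy data. With `fit_intercept=False`, the solution of the sklearn objective is exactly zero as soon as `alpha` reaches `max(|X.T @ y|) / (n_samples * l1_ratio)`, called `alpha_max` here. The toy data and the chosen factors are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# noiseless toy data, only for illustration
rng = np.random.default_rng(0)
n, p = 80, 200
X_toy = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.
y_toy = X_toy @ beta_true

l1_ratio = 0.5
# smallest alpha for which all coefficients are zero (no intercept)
alpha_max = np.max(np.abs(X_toy.T @ y_toy)) / (n * l1_ratio)

n_nonzero = {}
for factor in [0.01, 0.1, 0.5, 1.01]:
    model = ElasticNet(alpha=factor * alpha_max, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=10_000)
    model.fit(X_toy, y_toy)
    # number of non-zero coefficients for this value of alpha
    n_nonzero[factor] = int(np.sum(model.coef_ != 0))

print(n_nonzero)
```

When an intercept is fitted, the same reasoning applies to the centered `X` and `y`.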

In [None]:
# TODO check again the influence of regularization when there is noise in the data.
X_train, X_test, y_train, y_test = train_test_data(
    n_samples, n_features, n_informative_features, noise_level=0.5)
# TODO: plot train/test curves

## Influence of dataset size

- I.E (Prediction and selection): Considering the elastic net regularization, we want to study the sensitivity of the training and test errors with respect to the choice of regularization parameters and with respect to the number of relevant features of the solution, by changing one parameter at a time. To that end, you should try to pick some parameters, or run them in a loop, by exploiting the code in `l1l2demosimple.m`. Namely, study what happens as:
    - I.E1: you change the regularization parameter `L1par` associated with the $\ell_1$-norm.
    - I.E2: you change (increase or decrease) the regularization parameter `L2par` associated with the $\ell_2$-norm. Hint: try the following fixed parameters: 20 points of dimension 100, with 15 relevant features, a noise level equal to 1 and `L1par=0.1`.
    - I.E3: the size of the training set grows (this is not the same as generating different training sets of increasing size!). Hint: try the same parameters as above, with `L2par=0`.
    - I.E4: the amount of noise on the generated data grows (the test set is generated with the same parameters as the training set).
- I.F (Large $p$, small $n$): Perform experiments similar to those above, but now changing $p$ (dimensionality of the points), $n$ (number of training points) and $s$ (number of relevant variables). In particular, look at how the results behave when $p \gg n$, depending on whether $s < n$ holds or not (e.g. try $n = 80$ and $p = 300$). Try to identify different regimes.

Finally, vary $n, p$ and $s$
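
For the experiment where the training set grows, one possible reading is to draw a single dataset once and train on nested subsets of it, with a fixed test set. The sketch below does this on hypothetical toy data. The choice `alpha=0.1, l1_ratio=1.0` (pure $\ell_1$ penalty) is only an assumption standing in for the hint's `L1par=0.1`, `L2par=0`, since the MATLAB parameterization may differ from sklearn's.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# one dataset drawn once: dimension 100, 15 relevant features, noise level 1
rng = np.random.default_rng(0)
n_total, p, s = 200, 100, 15
X_all = rng.standard_normal((n_total, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.
y_all = X_all @ beta_true + 1.0 * rng.standard_normal(n_total)

# the last 50 samples are kept aside as a fixed test set
X_te, y_te = X_all[150:], y_all[150:]

train_sizes = [20, 40, 80, 150]
test_errs = []
for n_train in train_sizes:
    # nested subsets: each training set contains the previous one
    model = ElasticNet(alpha=0.1, l1_ratio=1.0, max_iter=10_000)
    model.fit(X_all[:n_train], y_all[:n_train])
    test_errs.append(np.mean((y_te - model.predict(X_te)) ** 2))

for n_train, err in zip(train_sizes, test_errs):
    print("n_train = %3d, test error = %.3f" % (n_train, err))
```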

In [None]:
n_samples = # TODO
n_features = # TODO
n_informative_features = # TODO

X_train, X_test, y_train, y_test = train_test_data(
    n_samples, n_features, n_informative_features, noise_level=0.)

## Parameter selection with cross-validation
In this section, we use scikit-learn's built-in functions to perform cross-validated selection of alpha and l1_ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_data(
    n_samples, n_features, n_informative_features, noise_level=0.5)
clf = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1])
clf.fit(X_train, y_train)

In [None]:
print("Optimal values for l1_ratio and alpha: %s, %.2e" % (clf.l1_ratio_, clf.alpha_))

## Classification data

Load some data satisfying $y = \text{sign}(X \beta^\star)$, where $\beta^\star$ is unknown and $s$-sparse, but you do not know $s$. The goal of this part is to infer it.

In [None]:
data = loadmat("../../data/part3-data.mat")

In [None]:
X = data["X"]
y = data["Y"][:, 0]
print(X.shape, y.shape)
# TODO check numerically that y only contains values equal to 1 or -1

Now you must infer $s$.
A first possible approach is to use the cross-validation procedure of the previous part: find the sparsity of the optimal $\beta$ obtained by cross-validation on a grid of values for $\alpha$ and l1_ratio.
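
A minimal sketch of this first approach, on hypothetical toy data generated as $y = \text{sign}(X \beta^\star)$ with a known support, so that the estimate can be compared with the truth. The grid of `l1_ratio` values and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# toy sign data with a known 5-sparse beta, only for illustration
rng = np.random.default_rng(0)
n, p, s_true = 200, 50, 5
X_toy = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s_true] = 1.
y_toy = np.sign(X_toy @ beta_true)

# cross-validate over alpha (automatic grid) and a few l1_ratio values
cv_model = ElasticNetCV(l1_ratio=[.5, .9, 1.], cv=5, max_iter=10_000)
cv_model.fit(X_toy, y_toy)

# sparsity of the cross-validated solution
support_cv = np.flatnonzero(cv_model.coef_)
s_cv = support_cv.size
print("Selected alpha: %.2e, l1_ratio: %s" % (cv_model.alpha_, cv_model.l1_ratio_))
print("Estimated sparsity: %d (true value: %d)" % (s_cv, s_true))
```

Cross-validation optimizes prediction error, so the selected model often keeps more features than the true support. It is worth comparing `s_cv` with other estimates.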

In [None]:
# TODO find optimal s from a CV point of view

Another way to estimate $s$ is to measure the correlation between
each column of $X$ and $y$. Indeed, the columns of $X$ associated with zero coefficients in $\beta^\star$ play no role
in the generation of $y$, so their correlation with $y$ should be close to zero. Can you also identify the indices of the relevant features?
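
The correlation idea can be sketched as follows on hypothetical toy data with a known support. Picking the cutoff at the largest gap between consecutive sorted correlations is only one possible heuristic, assumed here for illustration. The names (`corr_toy`, `s_hat`, ...) are not part of the lab code.

```python
import numpy as np

# toy sign data with a known support, only for illustration
rng = np.random.default_rng(0)
n, p = 2000, 50
true_support = np.array([3, 11, 20, 34, 42])
X_toy = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[true_support] = 1.
y_toy = np.sign(X_toy @ beta_true)

# absolute Pearson correlation between each column of X and y
Xc = X_toy - X_toy.mean(axis=0)
yc = y_toy - y_toy.mean()
corr_toy = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# sort in decreasing order and cut at the largest gap (a heuristic)
order = np.argsort(corr_toy)[::-1]
sorted_corr = corr_toy[order]
gaps = sorted_corr[:-1] - sorted_corr[1:]
s_hat = int(np.argmax(gaps)) + 1
top_feats = np.sort(order[:s_hat])

print("Estimated s:", s_hat)
print("Most correlated features:", top_feats)
```

On real data the gap may be far less clear than in this well-separated toy example, for instance when the columns of $X$ are correlated with each other.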

In [None]:
# TODO compute correlation
# corr = 

In [None]:
# sort:
idx = np.argsort(corr)
plt.plot(corr[idx[::-1]])

In [None]:
# TODO identify the cutoff numerically, get s 
# and the indices of highest correlated features
# highly_corr_feats =

Finally, use again the code of the first part, and tune the sparsity parameter l1_ratio so that
it selects only $s$ features ($s$ being your sparsity estimate from the previous
question). Look at which are the selected features in your solution. Do they
correspond to the ones you identified with the correlation approach? 
If they do not, can you figure out why this happens?