## Lasso regression

#### Week 4, set problem 5

The file <b><i>mystery.dat</i></b> contains pairs <b>($x$, $y$)</b>, where $x \in \mathbb{R}^{100}$ and $y \in \mathbb{R}$. There is one data point per line, with comma-separated values; the last number on each line is the $y$-value.

In this data set, $y$ is a linear function of just ten of the features in $x$, plus some noise. Your job is to identify those ten features.

Which of the following contain only relevant features?

To select the relevant features I will use Lasso linear regression, which adds an $L1$ penalty on the coefficients to the squared-error loss.
This $L1$ regularization can drive some coefficients to exactly zero, i.e. the corresponding features are ignored entirely when predicting the output.
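
To see why an $L1$ penalty produces exact zeros, here is a minimal sketch of the soft-thresholding operator. It assumes the idealized case of an orthonormal design matrix and the objective $\frac{1}{2}\|y - Xw\|^2 + \lambda\|w\|_1$, where the Lasso solution is the least-squares solution shrunk toward zero by $\lambda$. The function name `soft_threshold` and the coefficient values are made up for illustration; they do not come from mystery.dat, and scikit-learn scales its objective differently.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries with |z| <= lam become exactly 0."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical least-squares coefficients (illustration only, not from mystery.dat).
ols_coef = np.array([2.5, -0.3, 0.8, -1.7, 0.1])

# With lam = 1.0, every coefficient whose magnitude is at most 1.0 is zeroed out,
# and the remaining ones are pulled toward zero by 1.0.
lasso_coef = soft_threshold(ols_coef, lam=1.0)
print(lasso_coef)
print("surviving features:", np.nonzero(lasso_coef)[0])
```

A larger `lam` zeroes out more coefficients, which is the same effect the `alpha` parameter has in the cells below.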

In [16]:
# libraries for data manipulation or data representation
import math
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# libraries for learning
from sklearn.linear_model import Lasso
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [31]:
# read and store data (the file has no header row; every column is numeric)
df = pd.read_csv('mystery.dat',
                 sep=",",  # comma-separated values
                 header=None)
data = df.values
print(data)

[[-1.14558 -1.29249  0.84911 ...,  1.5532  -1.42135  1.19238]
 [ 1.38724 -1.00201 -0.3337  ...,  0.81903  0.39286 -3.44094]
 [ 1.47233  0.8488  -0.33866 ...,  0.08911 -1.72476  3.75006]
 ..., 
 [-0.83673  0.80514  0.00807 ..., -1.64165  2.04662  1.84121]
 [ 1.12062  0.68561 -1.08    ...,  1.1926   0.33696  3.53143]
 [-0.28943 -0.22213 -0.52226 ...,  0.22531  1.72576 -0.55118]]


In [36]:
# separate data into features (first 100 columns) and labels (last column)
X = data[:,:100]
y = data[:,100]
print(data.shape)
print(X.shape)
print(y.shape)

(101, 101)
(101, 100)
(101,)


In [37]:
# create test and train data sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

In [39]:
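# init. lasso object and fit data
lasso_reg = Lasso() # default alpha param. = 1
lasso_reg.fit(train_X, train_y)
# note: Lasso.score returns the R^2 coefficient of determination (higher is better),
# so the values labelled "error" in the printout are actually R^2 scores
train_err = lasso_reg.score(train_X, train_y)
test_err = lasso_reg.score(test_X, test_y)
num_coeff = np.sum(lasso_reg.coef_ != 0)  # number of non-zero coefficients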

print(f"Training error: {train_err}\nTest error: {test_err}\nNumber of features: {num_coeff}")
print(lasso_reg.coef_)
print(np.nonzero(lasso_reg.coef_))

Training error: 0.2465093128230823
Test error: 0.14380730422921983
Number of features: 5
[-0.          0.          0.          0.          0.         -0.
  0.08841838 -0.         -0.         -0.          0.34479077  0.          0.
  0.         -0.         -0.          0.26230792 -0.          0.22236035
  0.         -0.         -0.          0.26659661  0.         -0.         -0.
 -0.          0.          0.          0.         -0.          0.          0.
 -0.         -0.         -0.          0.          0.         -0.         -0.
 -0.          0.         -0.         -0.          0.         -0.         -0.
 -0.          0.          0.         -0.          0.          0.          0.
 -0.          0.         -0.         -0.         -0.          0.          0.
  0.          0.         -0.         -0.          0.         -0.         -0.
  0.          0.         -0.          0.          0.          0.          0.
  0.         -0.          0.         -0.          0.          0.         -0.
 -0
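
The task asks for ten features, so one simple way to read off candidates from a fitted model is to rank the coefficients by absolute value. The helper below is a minimal sketch of that idea; `top_k_features` is a hypothetical name, the demo coefficients are invented, and the ranking is only meaningful if the features are on comparable scales.

```python
import numpy as np

def top_k_features(coef, k=10):
    """Return indices of the k largest-magnitude coefficients, strongest first.

    Coefficients that are exactly zero are never returned, so fewer than k
    indices come back when the model kept fewer than k features.
    """
    coef = np.asarray(coef, dtype=float)
    order = np.argsort(-np.abs(coef), kind="stable")  # descending by |coef|
    top = order[:k]
    return top[coef[top] != 0]

# Hypothetical coefficient vector (illustration only, not the fitted values).
demo_coef = np.array([0.0, 0.09, 0.0, 0.34, -0.26, 0.0, 0.22])
print(top_k_features(demo_coef, k=3))
```

In the notebook this could be called as `top_k_features(lasso_reg.coef_)` after fitting; that usage is an assumption about how the helper would be applied, not part of the original solution.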

In [46]:
# try different alpha values that are smaller than 1
for alpha in [0.0005, 0.005, 0.05, 0.5]:
    lasso_reg = Lasso(alpha=alpha, max_iter=1_000_000) # max_iter must be an integer
    lasso_reg.fit(train_X, train_y)
    # score returns R^2 (higher is better), not an error
    train_err = lasso_reg.score(train_X, train_y)
    test_err = lasso_reg.score(test_X, test_y)
    num_coeff = np.sum(lasso_reg.coef_ != 0)
    
    print(f"Alpha: {alpha}")
    print(f"Training error: {train_err}\nTest error: {test_err}\nNumber of features: {num_coeff}")
    print(np.nonzero(lasso_reg.coef_))
    print("*************************")
    print("*************************")

Alpha: 0.0005
Training error: 0.999994920022373
Test error: 0.21822918175525285
Number of features: 75
(array([ 1,  2,  3,  4,  5,  6,  7, 10, 12, 13, 14, 15, 16, 18, 19, 20, 21,
       22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 38, 40, 41, 42,
       43, 45, 47, 48, 49, 52, 53, 54, 57, 58, 59, 60, 61, 62, 63, 66, 67,
       68, 70, 71, 72, 73, 75, 77, 78, 80, 81, 83, 84, 85, 86, 88, 89, 90,
       91, 92, 93, 94, 95, 96, 98], dtype=int64),)
*************************
*************************
Alpha: 0.005
Training error: 0.9995483097739087
Test error: 0.31364868671332624
Number of features: 71
(array([ 1,  2,  3,  4,  5,  6,  7, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20,
       21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 34, 36, 38, 40, 42, 43,
       45, 47, 48, 49, 52, 53, 54, 57, 58, 60, 61, 62, 63, 66, 67, 68, 70,
       71, 72, 73, 75, 77, 78, 80, 81, 83, 84, 85, 86, 88, 89, 90, 91, 93,
       94, 96, 98], dtype=int64),)
*************************
*************************
Al