# Regularized Regression on House Pricing Dataset
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

[https://www.kaggle.com/harlfoxem/housesalesprediction]

For each house we know 19 features (e.g., number of bedrooms, number of bathrooms) plus its price, which is the quantity we would like to predict.

In [None]:
#import all packages needed
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data, remove the samples with missing values (NaN), and take a look at the summary statistics.

In [None]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

Extract input and output data. We want to predict the price using the other features (other than id) as input.

In [None]:
Data = df.values
#N = number of input samples
N = 3164
Y = Data[:N,2]
X = Data[:N,3:]

## Data Pre-Processing

Split the data into a training set of $N_{tr}$ samples and a test set of $N_{te}:=N-N_{tr}$ samples.

In [None]:
# Split data into train (50 samples) and test data (the rest)
Ntr = 50
Nte = N - Ntr
from sklearn.cross_validation import train_test_split

#PUT YOUR STUDENT ID NUMBER (NUMERO DI MATRICOLA) BELOW!
numero_di_matricola = 1115

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=Nte/N, random_state=numero_di_matricola)
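
The role of the seed in a random split can be illustrated with a small NumPy sketch. This is not scikit-learn's implementation: the helper `random_split` and the toy arrays are hypothetical, and the sketch only shows that a fixed seed makes the shuffled split reproducible.

```python
import numpy as np

def random_split(X, y, n_train, seed):
    """Shuffle the sample indices with a seeded generator and cut them in two."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    tr, te = perm[:n_train], perm[n_train:]
    return X[tr], X[te], y[tr], y[te]

# Toy data (assumed): 10 samples, 2 features
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

Xa, Xb, ya, yb = random_split(X_demo, y_demo, n_train=4, seed=1115)
# Same seed, same split
Xa_again, _, _, _ = random_split(X_demo, y_demo, n_train=4, seed=1115)
```

Using a personal seed means every student gets a different, but repeatable, partition of the data.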

Standardize the data.

In [None]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtr)
Xtr = scaler.transform(Xtr)
Xte = scaler.transform(Xte)
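
The standardization step can be sketched by hand in NumPy. The arrays below are synthetic stand-ins (an assumption, not the house data); the point is that the mean and standard deviation are estimated on the training set only and then reused on the test set.

```python
import numpy as np

# Synthetic stand-ins for the training and test design matrices (assumed data)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Column-wise statistics, estimated on the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # biased estimator (ddof=0)

# The same shift and scale are applied to both sets
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

After this transformation the training columns have exactly zero mean and unit variance, while the test columns are only approximately standardized.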

## Least-Squares Solution

The routine LinearRegression.score(X,y) computes the *Coefficient of determination* $R^2$, defined as:

$$R^2 = 1- \frac{RSS}{TSS}$$

where $RSS$ is the *Residual Sum of Squares* and $TSS$ is the *Total Sum of Squares*. Denoting by $\hat{y}_i$ the $i$-th predicted output value, they are defined as:

\begin{align*}
RSS &= \sum_{i=1}^N (y_i - \hat{y}_i)^2\\
TSS &= \sum_{i=1}^N (y_i -\bar{y})^2, \qquad \qquad \bar{y}=\frac{1}{N} \sum_{i=1}^N y_i
\end{align*}

In this notebook we will mostly use the coefficient of determination $R^2$ (instead of the RSS) as a measure to compare models and choose tuning parameters.
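
The definition above translates directly into a few lines of NumPy. The function name `r_squared` and the toy values are assumptions made for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Toy example (assumed values): RSS = 0.10, TSS = 5.0
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
r2 = r_squared(y, y_hat)
```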

### TODO 1

Answer the following: are we interested in models with low $R^2$ or high $R^2$? Why? (max 5 lines)

Now compute the Least-Squares estimate using LinearRegression() in Scikit-learn, and print the corresponding score in training and test data.

In [None]:
# Least-Squares
from sklearn import linear_model as lm
#OLS is the linear regression model
OLS = ????

#fit the model on training data
????

#obtain predictions on training data
Yhat_tr = ????

#coefficients from the model
b_LS = np.hstack((OLS.intercept_, OLS.coef_))

print "Coefficient of determination on training data:", ????
print "Coefficient of determination on test data:", ????


### TODO 2

Compute the confidence interval for each coefficient.

In [None]:
# Least-Squares: Confidence Intervals
from scipy.stats import t

#add the column of all ones for the intercept of the model
Xtr_intercept = np.hstack((np.ones((Xtr.shape[0],1)), Xtr))

#alpha for confidence intervals
alpha = 0.05

#quantile from t-student distribution
tperc = ????
sigma2 = ????

R = ????

Ri = ????
v = ????
Delta = ????
CI = np.transpose(np.vstack((b_LS,b_LS))) + np.transpose(np.vstack((-Delta,+Delta) ))

Plot the LS coefficients and their confidence interval.

In [None]:
# Plot confidence intervals
plt.figure(1)
plt.plot(b_LS[1:], 'r', marker='o', ms=7.0)
plt.plot(CI[1:,0], 'b--')
plt.plot(CI[1:,1], 'b--')
plt.plot(np.zeros(b_LS.shape[0],), 'k', linewidth=2.0)
plt.xlabel('Coefficient Index')
plt.ylabel('LS Coefficient')
plt.title('Coefficients and Confidence Sets')
plt.show()

### Question: based on the results above, if you had to choose at most 5 features for a linear regression model, which ones would you choose? Why?

### TODO 3
Answer the question above (max 5 lines)

## Best-Subset Selection

Split the training data into a training and a validation dataset. For $k$ going from 1 to $n_{sub}=5$:
1. Compute the LS estimate using all the possible subsets of $k$ features
2. Compute the prediction error on the validation dataset

Choose the subset of $k^*$ features giving the lowest validation error.


In [None]:
import itertools
Xtr_cv, Xva_cv, Ytr_cv, Yva_cv = train_test_split(Xtr, Ytr, test_size=0.33)
nsub = 5 # Xtr.shape[1]
features_idx_dict = {}
validation_err_dict = {}
validation_err_min = np.zeros(nsub,)
validation_err_min_idx = np.zeros(nsub, dtype=np.int64)
for k in range(1,nsub+1):
    features_idx = list(itertools.combinations(range(Xtr.shape[1]),k))
    validation_error = np.zeros(len(features_idx),)
    for j in range(len(features_idx)):
        OLS_subset = lm.LinearRegression()
        OLS_subset.fit(Xtr_cv[:,features_idx[j]], Ytr_cv)
        validation_error[j] = 1 - OLS_subset.score(Xva_cv[:,features_idx[j]], Yva_cv)
    validation_err_min[k-1] = np.min(validation_error)    
    validation_err_min_idx[k-1] = np.argmin(validation_error)
    features_idx_dict.update({k: features_idx})
    validation_err_dict.update({k: validation_error})
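
The same procedure can be sketched on a small synthetic problem where the answer is known. Everything here is an assumption made for illustration: the data are generated so that only features 0 and 2 matter, the helper `validation_rss` is hypothetical, and the least-squares fit uses `np.linalg.lstsq` instead of scikit-learn.

```python
import itertools
import numpy as np

# Synthetic data (assumed): only features 0 and 2 influence the output
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=60)
X_tr, X_va, y_tr, y_va = X[:40], X[40:], y[:40], y[40:]

def validation_rss(cols):
    """Fit LS with an intercept on the chosen columns, return the validation RSS."""
    cols = list(cols)
    A_tr = np.column_stack([np.ones(len(y_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(y_va)), X_va[:, cols]])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return np.sum((y_va - A_va @ beta) ** 2)

# Enumerate every subset of k features and keep the one with the lowest error
k = 2
subsets = list(itertools.combinations(range(X.shape[1]), k))
errors = [validation_rss(s) for s in subsets]
best_subset = subsets[int(np.argmin(errors))]
```

The number of subsets grows combinatorially with the number of features, which is why the search in the notebook is limited to small values of $k$.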

Plot the validation error as a function of the number of retained features.

In [None]:
# Plot
plt.figure(2)
for k in range(1,nsub+1):
    plt.scatter(k*np.ones(validation_err_dict[k].shape), validation_err_dict[k], color='k', alpha=0.5)
    #plt.scatter(k, validation_err_min[k-1], color='r', alpha=0.8)
    if k > 1:
        plt.plot([k-1, k], [validation_err_min[k-2], validation_err_min[k-1]], color='r',marker='o', 
            markeredgecolor='k', markerfacecolor = 'r', markersize = 10)
plt.xlabel('Number of retained features')
plt.ylabel('RSS/TSS')
plt.title('Best-Subset Selection')
plt.show()

Compute the LS estimate using the selected subset of features.

### TODO 4: pick the number of features for the best subset according to figure above, learn the model on the entire training data, and compute score on training and on test data

In [None]:
OLS_best_subset = lm.LinearRegression()

# now pick the number of features according to best subset
opt_num_features = ????

#opt_features_idx contains the indices of the features from best subset
opt_features_idx = features_idx_dict[opt_num_features][validation_err_min_idx[opt_num_features - 1]]

#let's print the indices of the features from best subset
print opt_features_idx

#fit the best subset on the entire training set
????

#print the coefficient of determination on training and on test data
print "Coefficient of determination on training data:", ????
print "Coefficient of determination on test data:", ????

### TODO 5: do the features from best subset selection correspond to the ones you would have chosen based on confidence intervals for the linear regression coefficients? Comment (max 5 lines)

## Ridge Regression

### TODO 6: Shrinkage Evaluation

Compute the ridge regression coefficients on the training data (write the formula manually) using different values of the regularization parameter $\lambda$.


In [None]:
#these are the values of lambda that you are going to use
lam_values = np.logspace(0, 4, 300)

#ridge_coeff will contain the solutions; note that we include \beta_0 in the model
ridge_coeff = np.zeros((Xtr_intercept.shape[1], len(lam_values)))

#norm will contain the norm of the solutions
norm_ridge_coeff = np.zeros(len(lam_values),)

for i in range(len(lam_values)):
    ridge_coeff[:,i] = ????
    norm_ridge_coeff[i] = ????

Plot the norm of the estimated coefficient vector versus the regularization parameter $\lambda$. This lets you evaluate the coefficient shrinkage achieved through ridge regression.

In [None]:
plt.figure(3)
plt.xscale('log')
plt.plot(lam_values, norm_ridge_coeff/Xtr_intercept.shape[1])
plt.xlabel('Lambda')
plt.ylabel('Coefficients norm')
plt.title('Average norm of the Ridge Regression coefficients vs Regularization Parameter')
plt.show()
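
As a point of comparison, here is a sketch of ridge regression in closed form on synthetic data. It makes assumptions that differ from the cells above: the inputs and outputs are centered so that the intercept drops out and is not penalized, and the helper name `ridge_closed_form` is hypothetical.

```python
import numpy as np

# Synthetic regression problem (assumed data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)

# Center inputs and output so that no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_closed_form(Xc, yc, lam):
    """Solve (Xc^T Xc + lam * I) beta = Xc^T yc."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Euclidean norm of the solution for increasing values of lambda
lams = np.logspace(0, 4, 5)
norms = np.array([np.linalg.norm(ridge_closed_form(Xc, yc, lam)) for lam in lams])
```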

### TODO 7: explain the results shown in the figure above (max 5 lines)

### TODO 8: Use k-fold Cross-Validation to fix the regularization parameter

Use the scikit-learn built-in routine *Ridge* (from the *linear_model* module) to compute the ridge regression coefficients.

Use *KFold* from *sklearn.cross_validation* to split the data into the desired number of folds.

Then pick $lam\_opt$ to be the chosen value for the regularization parameter.

In [None]:
from sklearn.cross_validation import KFold
num_folds = 5
kf = KFold(n=Ntr, n_folds=num_folds)

#loss_ridge_kfold will contain the value of the loss
loss_ridge_kfold = np.zeros(len(lam_values),)

for i in range(len(lam_values)):
    
    #define a ridge regressor using Ridge() for the i-th value of lam_values
    ridge_kfold = lm.Ridge(alpha=lam_values[i])
    for train_index, validation_index in kf:
        Xtr_kfold, Xva_kfold = Xtr[train_index], Xtr[validation_index]
        Ytr_kfold, Yva_kfold = Ytr[train_index], Ytr[validation_index]
        
        #learn the model using the training data from the k-fold
        ????
        
        #compute the loss using the validation data from the k-fold
        ????

loss_ridge_kfold /= Ntr

#choose the regularization parameter that minimizes the loss
lam_opt = ????
print "Best value of the regularization parameter:", lam_opt
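
The fold structure behind k-fold cross-validation can be sketched without scikit-learn. The generator `kfold_indices` is a hypothetical helper that uses contiguous, unshuffled folds; it is meant only to show that every sample lands in the validation set exactly once.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs built from contiguous folds."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

# 50 samples and 5 folds, matching Ntr and num_folds above
splits = list(kfold_indices(n_samples=50, n_folds=5))
all_val = np.concatenate([val for _, val in splits])
```

With 50 samples and 5 folds, each model is trained on 40 samples and validated on the remaining 10.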

Plot the cross-validation estimate of the prediction error as a function of the regularization parameter.

In [None]:
plt.figure(4)
plt.xscale('log')
plt.plot(lam_values, loss_ridge_kfold, color='b')
plt.scatter(lam_opt, loss_ridge_kfold[np.argmin(loss_ridge_kfold)], color='b', marker='o', linewidths=5)
plt.xlabel('Lambda')
plt.ylabel('Validation Error')
plt.title('Ridge Regression: choice of regularization parameter')
plt.show()

### TODO 9: now estimate the ridge regression coefficients using all the training data and the optimal regularization parameter (chosen at previous step)

In [None]:
# Estimate the ridge regression coefficients with all the training data for the optimal value lam_opt of the regularization parameter

#define the model using the optimal value lam_opt
????
#fit using the training data
????

print "Coefficient of determination on training data:", ????
print "Coefficient of determination on test data:", ????

Compare the LS and the ridge regression coefficients.

In [None]:
# Compare LS and ridge coefficients
ind = np.arange(1,len(OLS.coef_)+1)  # the x locations for the groups
width = 0.35       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, OLS.coef_, width, color='r')
rects2 = ax.bar(ind + width, ridge_reg.coef_, width, color='y')
ax.legend((rects1[0], rects2[0]), ('LS', 'Ridge'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('LS and Ridge Coefficient')
plt.show()

### TODO 10: comment on the comparison among the LS and Ridge Regression coefficients (max 5 lines)


## Lasso

Use the routine *lasso_path* from *sklearn.linear_model* to compute the "lasso path" for different values of the regularization parameter $\lambda$.


In [None]:
from sklearn.linear_model import lasso_path

# To pass a specific range of lambda values
#lasso_lams, lasso_coefs, _ = lasso_path(Xtr, Ytr, alphas=lam_values)  

# If no value is passed, the routine automatically selects a range of values
lasso_lams, lasso_coefs, _ = lasso_path(Xtr, Ytr) 



Evaluate the sparsity in the estimated coefficients as a function of the regularization parameter $\lambda$: to this purpose, compute the number of non-zero entries in the estimated coefficient vector.

In [None]:
l0_coef_norm = np.zeros(len(lasso_lams),)

for i in range(len(lasso_lams)):
    l0_coef_norm[i] = sum(lasso_coefs[:,i]!=0)


plt.figure(6)
plt.plot(lasso_lams, l0_coef_norm, marker='o', markersize=5)
plt.xlabel('Lambda')
plt.ylabel('Number of non-zero coefficients')
plt.title('Sparsity Degree')
plt.show()
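
The mechanism that produces exact zeros can be seen in a special case. Assuming an orthonormal design matrix, the lasso solution is the soft-thresholded least-squares estimate, with a threshold proportional to $\lambda$ (the exact constant depends on how the objective is scaled). The coefficients and thresholds below are made-up values, and `soft_threshold` is a hypothetical helper.

```python
import numpy as np

def soft_threshold(b, t):
    """Shrink each entry toward zero by t and clip at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

# Made-up least-squares coefficients and increasing thresholds
b_ls = np.array([3.0, -1.5, 0.5, 0.2])
thresholds = [0.1, 1.0, 2.0, 4.0]

# Number of non-zero coefficients for each threshold
n_nonzero = [int(np.count_nonzero(soft_threshold(b_ls, t))) for t in thresholds]
```

Here `n_nonzero` evaluates to `[4, 2, 1, 0]`: each coefficient is set exactly to zero once the threshold exceeds its magnitude.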

### TODO 11: explain the results in the figure above (max 5 lines)

### TODO 12: Use k-fold Cross-Validation to fix the regularization parameter

Use the routine *LassoCV* from *sklearn.linear_model* to fix the regularization parameter $\lambda$ through k-fold cross-validation.

In [None]:
# use LassoCV passing num_folds for the number of folds in CV
lasso_kfold = ????

#fit using the training set
????

print "Total number of coefficients:", len(lasso_kfold.coef_)
print "Number of non-zero coefficients:", sum(lasso_kfold.coef_ != 0)
print "Best value of regularization parameter:", lasso_kfold.alpha_
loss_lasso_kfold = np.sum(lasso_kfold.mse_path_, axis=1)


### TODO 13: Plot the Cross-Validation estimate of the prediction error  as a function of the regularization parameter $\lambda$

In [None]:
plt.figure(7)
plt.xscale('log')

#plot the lasso k-fold loss as a function of the lasso_kfold.alphas_
????

#this plots the best value of the regularization parameter
plt.scatter(lasso_kfold.alpha_, loss_lasso_kfold[np.where(lasso_kfold.alphas_ == lasso_kfold.alpha_)], 
    color='b', marker='o', linewidths=5)
plt.xlabel('Lambda')
plt.ylabel('Validation Error')
plt.title('Lasso: choice of regularization parameter')
plt.show()

### TODO 14: describe the results in the figure above and its relation to the best lambda chosen by CV

### TODO 15: Now estimate the LASSO regression coefficients using all the training data and the optimal regularization parameter (chosen through k-fold cross-validation).

Use the routine *Lasso* from *sklearn.linear_model* to do it.


In [None]:
#define a lasso model with Lasso() using the best value of the regularization parameter
lasso_reg = ????
#fit the model using the entire training set
????

print "Coefficient of determination on training data:", ????
print "Coefficient of determination on test data:", ????

## Compare LS, Ridge and Lasso coefficients

Use a bar plot to compare the estimated coefficients by means of these three estimators.

In [None]:
ind = np.arange(1,len(OLS.coef_)+1)  # the x locations for the groups
width = 0.25       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, OLS.coef_, width, color='r')
rects2 = ax.bar(ind + width, ridge_reg.coef_, width, color='y')
rects3 = ax.bar(ind + 2*width, lasso_reg.coef_, width, color='g')
ax.legend((rects1[0], rects2[0], rects3[0]), ('LS', 'Ridge', 'Lasso'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('LS, Ridge and Lasso Coefficient')
plt.show()

### TODO 16: how do the coefficients from the Lasso model compare to those from LS and Ridge Regression? (max 5 lines)

## Evaluate the performance on the test set



In [None]:
print "Coefficient of determination of LS on test data:", OLS.score(Xte,Yte)
print "Coefficient of determination of LS (with subset selection) on test data:", OLS_best_subset.score(Xte[:,opt_features_idx],Yte)
print "Coefficient of determination of Ridge Regression on test data:", ridge_reg.score(Xte,Yte)
print "Coefficient of determination of LASSO on test data:", lasso_reg.score(Xte,Yte)

### TODO 17: comment and compare the results obtained by the different methods (max 10 lines)

### TODO 18: use different data size

Perform the same estimation procedures using more points in the training data, that is, set Ntr = 100. You can simply copy and paste the code above into the cell below.

In [None]:
#put in this cell the code to do the same analysis as before but with Ntr=100

????

### TODO 19: how do the results change with Ntr=100? (max 10 lines)

# Regularized Classification on Titanic Dataset

We are going to use a dataset from a Kaggle competition (https://www.kaggle.com/c/titanic/data).
 
### Dataset description

>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.  This sensational tragedy shocked the international community and led to better safety regulations for ships.

>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.  Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

>In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. 

[From the competition [homepage](http://www.kaggle.com/c/titanic-gettingStarted).]




In [None]:
%matplotlib inline  
import matplotlib.pyplot as plt

Load the data from a .csv file

In [None]:
from __future__ import division
import pandas as pd
import numpy as np

df = pd.read_csv("titanicData.csv") 
df = df.drop(['Ticket','Cabin','Name'], axis=1)
# Remove missing values
df = df.dropna() 
df.describe()

Create the data matrices: many of the features (the columns with indices 0, 1, 3, 4, 6 in Xcat below) are categorical, so we first encode them as integers with LabelEncoder() and then obtain the indicator variables with OneHotEncoder().

In [None]:
Data = df.values
Xcat = Data[:,2:]
Y = Data[:,1]
n = Xcat.shape[1]  # number of features

num_samples = Xcat.shape[0]

#now encode categorical variables using integers and one-hot-encoder

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
onehot_encoder = OneHotEncoder()

#one-hot encode the first column directly (no LabelEncoder step is applied
#to it, so its values must already be integers)

X = onehot_encoder.fit_transform(Xcat[:,0].reshape(-1,1)).toarray()

#repeat for the other categorical input variables

index_categorical = [1,3,4,6]

for i in range(1,7):
    if i in index_categorical:
        X_tmp = label_encoder.fit_transform(Xcat[:,i])
        X_tmp = X_tmp.reshape(X_tmp.shape[0],1)
        X_tmp = onehot_encoder.fit_transform(X_tmp[:,0].reshape(-1,1)).toarray()
        X = np.hstack((X,X_tmp))
    else:
        X_tmp = Xcat[:,i]
        X_tmp = X_tmp.reshape(X_tmp.shape[0],1)
        X = np.hstack((X,X_tmp))
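
The two encoding steps can be sketched in NumPy alone. The helper `one_hot` and the toy column are assumptions for illustration: `np.unique` plays the role of the integer encoding, and indexing an identity matrix produces the indicator variables.

```python
import numpy as np

def one_hot(column):
    """Return the sorted categories and the indicator matrix of a 1-D column."""
    # codes[i] is the integer label of column[i] (position in sorted categories)
    categories, codes = np.unique(column, return_inverse=True)
    # Row codes[i] of the identity matrix is the indicator vector of sample i
    return categories, np.eye(len(categories))[codes]

# Toy categorical column (assumed values)
sex = np.array(["male", "female", "female", "male"])
cats, sex_onehot = one_hot(sex)
```

Each categorical feature with $m$ distinct values therefore contributes $m$ columns to the final matrix X.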

## Data Preprocessing

The class labels are already 0-1, so we can use them directly.

In [None]:
# Rename the class labels
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
K = max(Y) + 1 # number of classes

print "Number of classes: "+str(K)

Given $N$ total data points, keep $N_{tr}$ data points as data for training and validation and $N_{te}:=N-N_{tr}$ as test data. The split is random; use your student ID number ("numero di matricola") as the seed (see below).

In [None]:
# Split data into training (+ validation) data and test data
from sklearn.cross_validation import train_test_split
N = np.shape(X)[0]

#put here your student ID number ("numero di matricola")
Numero_di_Matricola = 11 

Ntrain = 50  # use 50 samples for training + validation...
Ntest = N-Ntrain # and the rest for testing

Xtr, Xtest, Ytr, Ytest = train_test_split(X, Y, test_size=Ntest/N, random_state = Numero_di_Matricola)

Ntr = Xtr.shape[0]
Ntest = Xtest.shape[0]

Design matrix is standardized to have zero-mean and unit variance (columnwise):

In [None]:
# Standardize the Features Matrix
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
Xtr = scaler.transform(Xtr)
Xtest = scaler.transform(Xtest)  # use the same transformation on test data

### Perform Logistic Regression

We now perform logistic regression using the function provided by Scikit-learn.

Note: as provided by Scikit-learn, logistic regression is always implemented using regularization. However, the impact of regularization can be dampened to have almost no regularization by changing the parameter $C$. In particular, using a very high value of $C$ reduces the impact from regularization. ($C$ is the inverse of the regularization parameter $\lambda$ - see TODO 4.)

Note that the intercept is estimated in the model.

In [None]:
from sklearn import linear_model
# define a logistic regression model with very high C parameter -> low impact from regularization
reg = linear_model.LogisticRegression(C=100000000, solver='newton-cg')

#fit the model on training data
reg.fit(Xtr, Ytr)

Note that the logistic regression function in Scikit-learn has many optional parameters. Read the documentation to understand what they do!

## TODO 1
### Examine the coefficients from Logistic Regression (by printing and plotting them)

In [None]:
#print the coefficients from the logistic regression model.
print ????

# Plot the coefficients
reg_coef = reg.coef_.reshape(reg.coef_.shape[1],)
plt.figure()
ind = np.arange(1,len(reg_coef)+1)  # the x locations for the groups
width = 0.45       # the width of the bars
plt.bar(ind, reg_coef, width, color='r')
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficients')
plt.show()

## TODO 2

### Questions: How many coefficients do you get? Why? How many of them are equal to 0? (max 5 lines)

## TODO 3
### Predict labels on training and test data

- Compute the predicted labels on training and test data using reg.predict.
- Evaluate the accuracy on training and test data using metrics.accuracy_score from scikit-learn (it returns the fraction of samples correctly classified, which the code below multiplies by 100 to obtain a percentage).
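
Accuracy can also be computed by hand as the fraction of predicted labels that match the true ones. The label vectors below are made-up values used only for illustration.

```python
import numpy as np

# Made-up true and predicted labels: 3 of the 4 predictions are correct
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

accuracy = np.mean(y_true == y_pred)  # fraction in [0, 1]
accuracy_percent = 100 * accuracy     # same quantity as a percentage
```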

In [None]:
from sklearn import metrics

#prediction on training data
Yhat_tr_LR = ????

#prediction on test data
Yhat_test_LR = ????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn for training dataset
print "Training Accuracy:", 100*????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn for test dataset
print "Test Accuracy:", 100*????

## TODO 4
### Use L2 regularized logistic regression with cross-validation

We perform the L2 regularization for different values of the regularization parameter $C$, and use the Scikit-learn function to perform cross-validation (CV).

In L2 regularized logistic regression, the following L2 regularization term is subtracted from the log-likelihood:

$$
    \lambda \sum_{j=1}^p \beta_j^2
$$

where $\lambda >0 $ is the complexity or regularization parameter. Note that the term above is *subtracted* since for logistic regression we want to maximize the log-likelihood.

The parameter $C$ used by Scikit-learn corresponds to the inverse of $\lambda$, that is $C = \frac{1}{\lambda}$.

Note: the CV in Scikit-learn is by default a *stratified* CV, which means that the data is split into train-validation folds while maintaining the proportion of the different classes in each fold.
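
The idea of stratification can be sketched as follows. This is not scikit-learn's implementation: `stratified_fold_ids` is a hypothetical helper that shuffles the samples of each class separately and deals them out to the folds in turn, so that every fold receives the same class proportions.

```python
import numpy as np

def stratified_fold_ids(y, n_folds, seed=0):
    """Assign a fold id to every sample, class by class, in round-robin order."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids

# Toy labels (assumed): 30 samples of class 0 and 20 of class 1
y = np.array([0] * 30 + [1] * 20)
fold_ids = stratified_fold_ids(y, n_folds=10)
```

With these toy labels every fold contains 3 samples of class 0 and 2 of class 1, the same 60/40 ratio as the full set.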

In the code below:
- use LogisticRegressionCV() to select the best value of C with a 10-fold CV with L2 penalty;
- use LogisticRegression() to learn the best model for the best C with L2 penalty on the entire training set
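
The penalized objective described above can be written out in NumPy. This sketch follows the text's convention $C = 1/\lambda$ and leaves the intercept unpenalized; scikit-learn's internal scaling of the objective may differ, and the function names and toy values are assumptions.

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood: sum_i [ y_i z_i - log(1 + exp(z_i)) ]."""
    z = beta0 + X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))  # logaddexp keeps this stable

def l2_penalized_log_likelihood(beta0, beta, X, y, C):
    """Log-likelihood minus lambda * sum_j beta_j^2, with lambda = 1 / C."""
    lam = 1.0 / C
    return log_likelihood(beta0, beta, X, y) - lam * np.sum(beta ** 2)

# Toy data and coefficients (assumed values)
X = np.array([[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]])
y = np.array([1, 1, 0])
beta0, beta = 0.1, np.array([1.0, -2.0])

ll = log_likelihood(beta0, beta, X, y)
pen_strong = l2_penalized_log_likelihood(beta0, beta, X, y, C=0.1)  # lambda = 10
pen_weak = l2_penalized_log_likelihood(beta0, beta, X, y, C=1e8)    # lambda ~ 0
```

A very large $C$ makes the penalty negligible, which is how the unregularized model earlier in the notebook was obtained.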

In [None]:
#define the model using LogisticRegressionCV passing an appropriate solver, cv value, and choice of penalty
regL2 = ????

#fit the model on training data
????

#print the best C
print ????

#define the model using the best C and an appropriate solver
regL2_final = ????

#fit the model using the best C on the entire training set
????

### TODO 5: print and plot the coefficients from logistic regression and the regularized version.

In [None]:
#print the coefficients from logistic regression
print ????

#print the coefficients from L2 regularized logistic regression
print ????


# Plot the coefficients
regL2_final_coef = regL2_final.coef_.reshape(regL2_final.coef_.shape[1],)
ind = np.arange(1,len(reg_coef)+1)  # the x locations for the groups
width = 0.35       # the width of the bars
fig, ax = plt.subplots()

rects1 = ax.bar(ind, reg_coef, width, color='r')
rects2 = ax.bar(ind + width, regL2_final_coef, width, color='y')
ax.legend((rects1[0], rects2[0]), ('Log Regr', 'Log Regr + L2 Regul'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficients: Standard and Regularized Version')
plt.show()

### TODO 6: how do the coefficients from L2 regularization compare to the ones from logistic regression? (max 5 lines)

### TODO 7: obtain classification accuracy on training and test data for the L2 regularized model

In [None]:
#now get training and test error and print training and test accuracy

# predictions on training data 
Yhat_tr_LR_L2 = ????

# predictions on test data 
Yhat_test_LR_L2 = ????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn on training data
print "Training Accuracy:", 100*????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn on test data
print "Test Accuracy:",100*????

### TODO 8: how does accuracy compare to logistic regression? Comment (max 5 lines).

### TODO 9: Use L1 regularized logistic regression with cross-validation

We perform the L1 regularization for different values of the regularization parameter $C$, and use the Scikit-learn function to perform cross-validation (CV).

In L1 regularized logistic regression, the following L1 regularization term is subtracted from the log-likelihood:

$$
    \lambda \sum_{j=1}^p |\beta_j|
$$

where $\lambda >0 $ is the complexity or regularization parameter. Note that the term above is *subtracted* since for logistic regression we want to maximize the log-likelihood.

The parameter $C$ used by Scikit-learn corresponds to the inverse of $\lambda$, that is $C = \frac{1}{\lambda}$.

Note: the CV in Scikit-learn is by default a *stratified* CV, which means that the data is split into train-validation folds while maintaining the proportion of the different classes in each fold.

In the code below:
- use LogisticRegressionCV() to select the best value of C with a 10-fold CV with L1 penalty;
- use LogisticRegression() to learn the best model for the best C with L1 penalty on the entire training set

Note: not all the solvers in LogisticRegressionCV() and LogisticRegression() can be used for L1 regularization! See the documentation and choose an appropriate solver.

In [None]:
#define the model using LogisticRegressionCV passing an appropriate solver, cv value, and choice of penalty
regL1 = ????

#fit the model on training data
????

#print the best C
print ????


#define the model using the best C and an appropriate solver
regL1_final = ????

#fit the model using the best C on the entire training set
????

### TODO 10: plot the coefficients from logistic regression and the regularized version.

In [None]:
#print the coefficients from logistic regression
print ????

#print the coefficients from L2 regularized logistic regression
print ????

#print the coefficients from L1 regularized logistic regression
print ????

# Plot the coefficients
regL1_final_coef = regL1_final.coef_.reshape(regL1_final.coef_.shape[1],)

ind = np.arange(1,len(reg_coef)+1)  # the x locations for the groups
width = 0.25       # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, reg_coef, width, color='r')
rects2 = ax.bar(ind + width, regL2_final_coef, width, color='y')
rects3 = ax.bar(ind + 2*width, regL1_final_coef, width, color='g')
ax.legend((rects1[0], rects2[0], rects3[0]), ('Log Regr', 'Log Regr + L2 Regul', 'Log Regr + L1 Regul'))
plt.xlabel('Coefficient Idx')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficients: Standard, Regularized L2 and L1 Version')
plt.show()

### TODO 11: how do the coefficients from L1 regularization compare to the ones from logistic regression and to the ones from L2 regularization? (max 5 lines)

### TODO 12: obtain classification accuracy on training and test data for the best L1 regularized model

In [None]:
#now get training and test error and print training and test accuracy

# predictions on training data 
Yhat_tr_LR_L1 = ????

# predictions on test data 
Yhat_test_LR_L1 = ????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn on training data
print "Training Accuracy:", 100*????

# compute accuracy as suggested above using metrics.accuracy_score from scikit-learn on test data
print "Test Accuracy:",100*????

### TODO 13: how does accuracy compare to logistic regression and to L2 regularization? (max 5 lines)


### TODO 14: use different data size

Perform the same estimation procedures using more points in the training data, that is, set Ntr = 100. You can simply copy and paste the code above into the cell below.

In [None]:
#put in this cell the code to do the same analysis as before but with Ntr=100

????

### TODO 15: how do the results change with Ntr=100? (max 10 lines)