# Section 04 
### Introduction to Data Science EN.553.436/EN.553.636 - Fall 2021

# Ridge Regression and LASSO Regression 

# Problem 1
Below we load the [Boston house prices dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset). We store the predictor labels for you and split the dataset into a training set and a test set, using roughly 1/3 (0.33) as the test size and a random state of 553.

In [53]:
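# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston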
boston_bunch = load_boston()
X = boston_bunch.data
y = boston_bunch.target
labels = boston_bunch.feature_names

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=553)

print(X_train.shape)

(339, 13)


## 1.1
In Homework 1, we fit three linear models by OLS to predict the house price (MEDV) and computed their $R^{2}$ values. The model using all 13 predictor variables and the model using polynomial combinations of the 13 predictor variables both achieved a high $R^{2}$. But is it optimal to use all the predictor variables? Compute the variance, bias, and mean squared error (MSE) of the three models from Homework 1. What do you observe?

(Import the function `bias_variance_decomp()` from `mlxtend.evaluate` to compute the variance, bias, and MSE.
See the documentation here: http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/#api.
The source code of the function is here: https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/bias_variance_decomp.py )
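To make clear what the three returned numbers mean, here is a minimal NumPy sketch of a bootstrap bias-variance decomposition for squared loss. It is an illustration on synthetic data, not mlxtend's implementation; the helper name `bootstrap_bias_variance` and the toy data are assumptions made for this example. In this scheme the "bias" term is the mean squared difference between the average prediction and the test labels (so it also absorbs label noise), and MSE = bias + variance holds by construction.

```python
import numpy as np

def bootstrap_bias_variance(X_train, y_train, X_test, y_test, num_rounds=100, seed=0):
    """Estimate MSE, squared bias, and variance of OLS via bootstrap resampling."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    # Prepend a column of ones so the least-squares fit includes an intercept
    A_test = np.column_stack([np.ones(X_test.shape[0]), X_test])
    all_pred = np.empty((num_rounds, X_test.shape[0]))
    for r in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # draw n training rows with replacement
        A_boot = np.column_stack([np.ones(n), X_train[idx]])
        coef, *_ = np.linalg.lstsq(A_boot, y_train[idx], rcond=None)
        all_pred[r] = A_test @ coef
    main_pred = all_pred.mean(axis=0)                 # average prediction per test point
    mse = np.mean((all_pred - y_test) ** 2)           # average loss over rounds and points
    bias_sq = np.mean((main_pred - y_test) ** 2)      # squared bias term
    variance = np.mean((all_pred - main_pred) ** 2)   # spread around the average prediction
    return mse, bias_sq, variance

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(553)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=300)
mse, bias_sq, variance = bootstrap_bias_variance(X[:200], y[:200], X[200:], y[200:])
print("MSE, Bias, Var: %f, %f, %f" % (mse, bias_sq, variance))
```

The same identity can be checked against the values printed in this section: for Model 1, 29.074385 + 1.255546 = 30.329931.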

In [25]:
import numpy as np
from sklearn.linear_model import LinearRegression

# MODEL 1: using all predictor variables
reg1 = LinearRegression().fit(X_train, y_train)

# MODEL 2: using only AGE, NOX, DIS, and RAD as predictor variables
ind2 = np.where([a in ['AGE','NOX','DIS','RAD'] for a in boston_bunch.feature_names])[0]
reg2 = LinearRegression().fit(X_train[:,ind2], y_train)

# MODEL 3: using all polynomial combinations of degree  ≤2  of the original thirteen predictor variables
import sklearn.preprocessing as prepro
poly = prepro.PolynomialFeatures(2)
X_train_enhanced = poly.fit_transform(X_train)
# Reuse the transformer fitted on the training set instead of refitting it on the test set
X_test_enhanced = poly.transform(X_test)
reg3 = LinearRegression().fit(X_train_enhanced, y_train)

In [22]:
#!pip install mlxtend
from mlxtend.evaluate import bias_variance_decomp
# Model 1
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    reg1, X_train, y_train, X_test, y_test,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 1's MSE, Bias, Var: %f, %f, %f " % (avg_expected_loss, avg_bias, avg_var))

# Model 2
avg_expected_loss2, avg_bias2, avg_var2 = bias_variance_decomp(
    reg2, X_train[:, ind2], y_train, X_test[:, ind2], y_test,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 2's MSE, Bias, Var: %f, %f, %f " % (avg_expected_loss2, avg_bias2, avg_var2))

# Model 3
avg_expected_loss3, avg_bias3, avg_var3 = bias_variance_decomp(
    reg3, X_train_enhanced, y_train, X_test_enhanced, y_test,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 3's MSE, Bias, Var: %f, %f, %f " % (avg_expected_loss3, avg_bias3, avg_var3))

Model 1's MSE, Bias, Var: 30.329931, 29.074385, 1.255546 
Model 2's MSE, Bias, Var: 76.069107, 75.267040, 0.802066 
Model 3's MSE, Bias, Var: 375.233667, 20.588502, 354.645166 


## 1.2
As we add more and more parameters to a model, its complexity increases, which tends to decrease the bias and increase the variance; when the variance dominates, the model overfits. See the picture: <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2017/06/05153332/model-complex.png" alt="image info" /> We therefore need to balance the variance and the bias. In practice, regularization is commonly used to counter overfitting.

Implement Ridge Regression and LASSO Regression on the enhanced dataset. Compute the $R^{2}$, variance, bias, and MSE. Which one performs better? (Use alpha = 0.1.)
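As a reminder of what the penalty does, Ridge adds `alpha * ||w||_2^2` to the least-squares objective, which has a closed-form solution. The sketch below is a NumPy illustration on synthetic data, not scikit-learn's solver; the helper name `ridge_closed_form` and the toy data are assumptions for this example. It shows the coefficient norm shrinking as `alpha` grows.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_{w,b} ||y - b - Xw||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept from the problem
    p = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    b = y_mean - x_mean @ w                  # recover the intercept afterwards
    return w, b

# Synthetic data (hypothetical): two nearly collinear columns make plain OLS unstable
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)
y = X[:, 0] + rng.normal(scale=0.5, size=50)

norms = []
for alpha in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w, b = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print("alpha = %6.1f   ||w|| = %.4f" % (alpha, norms[-1]))
```

With `alpha = 0` the function reduces to ordinary least squares; larger values trade a little bias for lower variance.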

In [34]:
from sklearn.linear_model import Ridge,Lasso
import matplotlib.pyplot as plt
from sklearn.exceptions import ConvergenceWarning
import warnings

reg4 = Ridge(alpha=0.1).fit(X_train_enhanced, y_train)
R2_train = reg4.score(X_train_enhanced, y_train)
R2_test = reg4.score(X_test_enhanced, y_test)
avg_expected_loss4, avg_bias4, avg_var4 = bias_variance_decomp(
    reg4, X_train_enhanced, y_train, X_test_enhanced, y_test,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 4's training set R2, test set R2, MSE, Bias, Var: %f,%f,%f, %f, %f " % (R2_train, R2_test, avg_expected_loss4, avg_bias4, avg_var4))

# Suppress warnings related to the optimization in the training of Lasso
warnings.filterwarnings("ignore", category=ConvergenceWarning)
reg5 = Lasso(alpha=0.1).fit(X_train_enhanced, y_train)
R2_train_5 = reg5.score(X_train_enhanced, y_train)
R2_test_5 = reg5.score(X_test_enhanced, y_test)
avg_expected_loss5, avg_bias5, avg_var5 = bias_variance_decomp(
    reg5, X_train_enhanced, y_train, X_test_enhanced, y_test,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 5's training set R2, test set R2, MSE, Bias, Var: %f,%f,%f, %f, %f " % (R2_train_5, R2_test_5, avg_expected_loss5, avg_bias5, avg_var5))


Model 4's training set R2, test set R2, MSE, Bias, Var: 0.942414,0.818889,30.761943, 16.741151, 14.020792 
Model 5's training set R2, test set R2, MSE, Bias, Var: 0.922681,0.807759,22.543665, 17.879390, 4.664275 


## 1.3
Now implement Ridge Regression and LASSO Regression on the original dataset. As you change the values of the hyperparameters of Ridge and LASSO, how does the magnitude of the coefficients change? Is there any difference between the two methods? If we have a large dataset with 10,000 features, some of which are correlated with one another, which regression would you use, Ridge or LASSO?
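One way to see the qualitative difference between the two penalties is to solve both problems by hand on a small example. The sketch below implements LASSO by coordinate descent with soft-thresholding and Ridge by its closed form, using NumPy only. The helper names, the synthetic data, and the chosen `alpha` values are assumptions for illustration; this is not scikit-learn's implementation, and the two `alpha` scales are not directly comparable because the objectives are normalized differently.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, num_sweeps=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 on centered data."""
    n, p = X.shape
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(num_sweeps):
        for j in range(p):
            r_j = yc - Xc @ w + Xc[:, j] * w[j]   # residual that leaves out feature j
            rho = Xc[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

# Synthetic data (hypothetical): only three of the eight features matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.5])
y = X @ true_w + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 0.1, 0.5, 1.0]:
    w_lasso = lasso_coordinate_descent(X, y, alpha)
    w_ridge = ridge_closed_form(X, y, alpha)
    print("alpha = %.2f   exact zeros: lasso = %d, ridge = %d"
          % (alpha, np.sum(w_lasso == 0.0), np.sum(w_ridge == 0.0)))

# Above this threshold every LASSO coefficient is exactly zero
Xc, yc = X - X.mean(axis=0), y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)
w_big = lasso_coordinate_descent(X, y, 1.01 * alpha_max)
```

The soft-thresholding step is what lets LASSO set coefficients exactly to zero, so it acts as a form of feature selection, whereas Ridge shrinks coefficients but generally keeps all of them nonzero.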

In [38]:
reg6 = Ridge(alpha=0.5).fit(X_train, y_train)
reg7 = Lasso(alpha=0.5).fit(X_train, y_train)
print(reg1.coef_)
print(reg6.coef_)
print(reg7.coef_)

[-1.09694695e-01  4.87796793e-02  6.00389621e-02  2.33204686e+00
 -1.66912244e+01  4.06068613e+00  1.50662965e-02 -1.25022813e+00
  3.71520546e-01 -1.52345646e-02 -9.44628972e-01  1.01469212e-02
 -5.41447857e-01]
[-1.07243922e-01  4.96802447e-02  3.16703636e-02  2.20691944e+00
 -1.10650624e+01  4.09992200e+00  1.01624370e-02 -1.17629314e+00
  3.60534305e-01 -1.58028537e-02 -8.84461645e-01  1.03427764e-02
 -5.47704300e-01]
[-0.08286329  0.05152916 -0.          0.         -0.          2.6772327
  0.01897004 -0.75779683  0.32597057 -0.01746293 -0.77193667  0.00921241
 -0.65953976]


# 2. Principal Component Analysis

## Problem 2


## 2.1
For the Boston house prices dataset, split the dataset into a training and test set using 2/5 as the test size and a random state of 553, and use polynomial features of degree 3, then run standard LinearRegression. Can you interpret the resulting test and training error in the context of the bias-variance tradeoff? 
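Before interpreting the errors, it helps to count how many columns the polynomial expansion creates relative to the number of training rows. The short calculation below uses only the standard library; the helper name `n_poly_features` is hypothetical, and the rounding rule for the test size is an assumption about how `train_test_split` behaves (it is consistent with the `(339, 13)` training shape printed in Problem 1).

```python
from math import ceil, comb

n_samples, n_inputs = 506, 13   # size of the Boston dataset

def n_poly_features(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables (constant term included)."""
    return comb(n_inputs + degree, degree)

# Assumption: the test set size is rounded up and the rest goes to training
n_test = ceil(n_samples * 2 / 5)
n_train = n_samples - n_test

print("degree 2 features:", n_poly_features(n_inputs, 2))   # 105
print("degree 3 features:", n_poly_features(n_inputs, 3))   # 560
print("training rows:", n_train)                            # 303
```

With 560 columns and about 303 training rows, the degree-3 model has more parameters than observations, so ordinary least squares can fit the training set essentially exactly while its predictions vary wildly across resampled training sets.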

In [46]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=2/5, random_state=553)
poly1 = prepro.PolynomialFeatures(3)
X_train1_enhanced = poly1.fit_transform(X_train1)
# Reuse the transformer fitted on the training set instead of refitting it on the test set
X_test1_enhanced = poly1.transform(X_test1)
reg8 = LinearRegression().fit(X_train1_enhanced, y_train1)
avg_expected_loss6, avg_bias6, avg_var6 = bias_variance_decomp(
    reg8, X_train1_enhanced, y_train1, X_test1_enhanced, y_test1,
    loss='mse', num_rounds=100, random_seed=123)
print("Model 8's MSE, Bias, Var:%f, %f, %f " % (avg_expected_loss6, avg_bias6, avg_var6))


Model 8's MSE, Bias, Var:85952.579571, 1882.715186, 84069.864385 


## 2.2
Apply PCA with 5, 50, 100, and 200 principal components, and then run LinearRegression on the resulting principal components. What do you observe?

In [55]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=X_train[:,12].size)
# sklearn uses a different convention
pca.fit(X_train.T) # note the transpose
# pca.transform(X.T)
print (pca.components_.T, pca.explained_variance_)

ValueError: n_components=339 must be between 0 and min(n_samples, n_features)=13 with svd_solver='full'
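The error above arises because scikit-learn's PCA treats rows as samples and columns as features: after the transpose the array has 13 rows and 339 columns, so at most 13 components can be requested. A minimal NumPy sketch of the pipeline the problem describes (estimate the principal directions on the training set, project both sets onto the top k of them, then run least squares on the scores) is given below. It uses synthetic data and a hypothetical helper `pca_regression`; it is an illustration under those assumptions, not the notebook's solution.

```python
import numpy as np

def pca_regression(X_train, y_train, X_test, y_test, k):
    """Project onto the top-k principal directions of X_train, then fit OLS on the scores."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    k = min(k, Vt.shape[0])                 # cannot exceed min(n_samples, n_features)
    Z_train = (X_train - mean) @ Vt[:k].T
    Z_test = (X_test - mean) @ Vt[:k].T     # test data reuses the training mean and directions
    A_train = np.column_stack([np.ones(len(Z_train)), Z_train])
    A_test = np.column_stack([np.ones(len(Z_test)), Z_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((A_train @ coef - y_train) ** 2)
    test_mse = np.mean((A_test @ coef - y_test) ** 2)
    return train_mse, test_mse

# Synthetic data (hypothetical): 250 noisy features driven by 5 latent factors
rng = np.random.default_rng(553)
latent = rng.normal(size=(400, 5))
X = latent @ rng.normal(size=(5, 250)) + 0.1 * rng.normal(size=(400, 250))
y = latent @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=400)

results = {}
for k in [5, 50, 100, 200]:
    results[k] = pca_regression(X[:300], y[:300], X[300:], y[300:], k)
    print("k = %3d   train MSE = %.4f   test MSE = %.4f" % ((k,) + results[k]))
```

Because the score matrices are nested as k grows, the training MSE can only decrease with more components; whether the test MSE follows depends on how much of the added flexibility is fitting noise.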