# Linear Regression with Regularization

In this notebook, you will write **gradient descent** solvers for **ridge regression** and **lasso regression**, then compare them with the built-in solvers in `sklearn.linear_model`.

## Set up notebook and load data set

After loading some standard packages, we load a synthetic data set consisting of data points `(x,y)`:
* `x`: 100-dimensional vector whose coordinates are independent draws from a standard normal (Gaussian) distribution
* `y`: response value given by `y = w·x + e`, where `w` is the weight vector of the target regression function and `e` is Gaussian noise. **`y` was generated using only 10 of the 100 features (which are unknown to you, at least for now!).**
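
To make the description above concrete, here is a minimal sketch of how a data set of this kind could be generated. The seed, the noise scale, the weight values, and every variable name are assumptions chosen for illustration; this is not the procedure that produced the CSV files used in this notebook.

```python
import numpy as np

# Hypothetical generator: seed, noise scale, and weights are assumptions.
rng = np.random.default_rng(0)
n_points, n_features, n_relevant = 100, 100, 10

# Each coordinate of x is an independent standard normal draw.
x_demo = rng.standard_normal((n_points, n_features))

# Only n_relevant randomly chosen features receive a nonzero weight.
relevant = rng.choice(n_features, size=n_relevant, replace=False)
w_true = np.zeros(n_features)
w_true[relevant] = rng.standard_normal(n_relevant)

# Response: a linear function of x plus Gaussian noise.
noise = rng.standard_normal(n_points)
y_demo = x_demo @ w_true + noise
```

Because `w_true` is zero on 90 of the 100 coordinates, a method that favors sparse weight vectors has a chance of recovering which features matter.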

In [None]:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

The next snippet of code loads the data set. The training set and the test set each contain 100 data points, each with 100 predictor variables (which we'll denote `x`) and one response variable (which we'll denote `y`).

Make sure the files `'trainx.csv'`,`'trainy.csv'`,`'testx.csv'`, and `'testy.csv'` are in the same directory as this notebook.


In [None]:
trainx = np.genfromtxt('trainx.csv', delimiter=',')
trainy = np.genfromtxt('trainy.csv', delimiter=',')
testx = np.genfromtxt('testx.csv', delimiter=',')
testy = np.genfromtxt('testy.csv', delimiter=',')
trainx.shape, trainy.shape,testx.shape,testy.shape

## 1. Gradient descent solver for ridge regression

<font color="magenta">**For you to do:**</font> Define a procedure, **ridge_regression_GD**, that uses gradient descent to solve the ridge regression problem. It is invoked as follows:

* `w,b,losses = ridge_regression_GD(x,y,C)`

Here, the input consists of:
* training data `x, y`, where `x` and `y` are numpy arrays of dimension `m`-by-`n` and `m`, respectively (if there are `m` training points and `n` features)
* regularization constant `C` (often written as **lambda**).

The function should find the `n`-dimensional vector `w` and offset `b` that minimize the ridge regression loss (the MSE loss plus an L2 penalty on `w` weighted by the regularization constant `C`), and return:
* `w` and `b`
* `losses`, an array containing the MSE loss at each iteration
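
For reference, one common way to write the ridge objective is shown below. The notebook does not fix the normalization, so averaging the squared error over the `m` points (rather than summing it) and leaving the offset `b` unpenalized are assumptions here; the superscript `(i)` indexes training points.

$$
L(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \left( w \cdot x^{(i)} + b \right) \right)^2 + C \, \lVert w \rVert_2^2
$$

Whichever convention you choose, use the same one when computing the gradient and when recording `losses`.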

<font color="magenta">Advice:</font> First figure out the derivative, which has a relatively simple form. Next, when implementing gradient descent, think carefully about two issues.

1. What is the step size (learning rate)?
2. When has the procedure converged?

Take the time to experiment with different ways of handling these.

**Note:** You can use additional methods as helpers if you feel the need.
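
To illustrate the two issues above without giving away the exercise, here is a minimal sketch of gradient descent on a one-dimensional toy function, `f(t) = (t - 3)^2`. The function `gd_1d_demo`, the fixed step size, and the gradient-norm stopping rule are all assumptions chosen for illustration; other stopping rules (for example, a small change in the loss) are equally reasonable.

```python
def gd_1d_demo(step, tol=1e-8, max_iter=1000):
    """Minimize f(t) = (t - 3)^2 by gradient descent, starting from t = 0."""
    t = 0.0
    values = []
    for _ in range(max_iter):
        values.append((t - 3.0) ** 2)   # record the loss at this iterate
        grad = 2.0 * (t - 3.0)          # derivative of f at t
        if abs(grad) < tol:             # stopping rule: gradient is tiny
            break
        t -= step * grad                # fixed-step gradient update
    return t, values

# A small step converges toward the minimizer t = 3.
t_small, vals_small = gd_1d_demo(step=0.1)

# A step that is too large overshoots further on every iteration.
t_large, vals_large = gd_1d_demo(step=1.1, max_iter=50)
```

Plotting `vals_small` and `vals_large` side by side shows the difference: the first curve decays toward zero, while the second grows. The same behavior appears in higher dimensions, where the largest safe step depends on the curvature of the loss.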

In [None]:
def ridge_regression_GD(x,y,C):
    ### START CODE HERE ###
    return w,b,losses
    ### END CODE HERE ###

Let's try it out and plot the loss values recorded during the optimization process.

In [None]:
# Set regularization constant
C = 1.0 # you can try different values for C
# Run gradient descent solver
w, b, losses = ridge_regression_GD(trainx,trainy,C)
# Plot the losses
plt.plot(losses,'r')
plt.xlabel('Iterations', fontsize=14)
plt.ylabel('Loss', fontsize=14)
plt.show()

## 2. Evaluate the gradient descent solver for ridge regression

Now let's compare the regressor found by your gradient descent procedure to that returned by the built-in ridge regression solver in `sklearn`. We will compare them by their resulting MSE values. We will also compare the results of the built-in linear regression (without regularization).
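
When comparing, keep in mind that `alpha` in `sklearn` and `C` in your solver may not be on the same scale. According to the `sklearn` documentation, `linear_model.Ridge` minimizes a sum (not a mean) of squared errors, with the intercept fitted but not penalized:

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

As an assumption-dependent example: if your solver averages the squared error over `m` training points and adds `C` times the squared L2 norm of `w`, then multiplying your objective by `m` shows that it matches the expression above when `alpha = m * C`. With a different normalization in your solver, the correspondence changes accordingly.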

Complete the following code to compute the MSE value given `w`, `b`, `x`, and `y`.

In [None]:
def compute_mse(w,b,x,y):
    ### START CODE HERE ###
    return None
    ### END CODE HERE ###

In [None]:
# Set regularization constant
C = 1.0 # you can change it
# Run gradient descent solver and compute its MSE
w, b, losses = ridge_regression_GD(trainx, trainy, C)
# Use built-in routine for linear regression and compute MSE
lin_regr = linear_model.LinearRegression()
lin_regr.fit(trainx, trainy)
# Use built-in routine for ridge regression and compute MSE
ridge_regr = linear_model.Ridge(alpha=1.0) # you can try different values
ridge_regr.fit(trainx, trainy)
# Print MSE values
print("MSE of built-in linear regression (training): ", mean_squared_error(trainy, lin_regr.predict(trainx)))
print("MSE of gradient descent solver for ridge regression (training): ", compute_mse(w, b, trainx, trainy))
print("MSE of built-in solver for ridge regression (training): ", mean_squared_error(trainy, ridge_regr.predict(trainx)))
print("MSE of built-in linear regression (test): ", mean_squared_error(testy, lin_regr.predict(testx)))
print("MSE of gradient descent solver for ridge regression (test): ", compute_mse(w, b, testx, testy))
print("MSE of built-in solver for ridge regression (test): ", mean_squared_error(testy, ridge_regr.predict(testx)))

## 3. Gradient descent solver for lasso regression

<font color="magenta">**For you to do:**</font> Define a procedure, **lasso_regression_GD**, that uses gradient descent to solve the lasso regression problem. It is invoked as follows:

* `w,b,losses = lasso_regression_GD(x,y,C)`

Here, the input consists of:
* training data `x, y`, where `x` and `y` are numpy arrays of dimension `m`-by-`n` and `m`, respectively (if there are `m` training points and `n` features)
* regularization constant `C` (often written as **lambda**).

The function should find the `n`-dimensional vector `w` and offset `b` that minimize the lasso regression loss (the MSE loss plus an L1 penalty on `w` weighted by the regularization constant `C`), and return:
* `w` and `b`
* `losses`, an array containing the MSE loss at each iteration
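
For reference, one common way to write the lasso objective is shown below. As with ridge, the normalization is not fixed by the notebook: averaging the squared error over the `m` points and leaving the offset `b` unpenalized are assumptions here.

$$
L(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \left( w \cdot x^{(i)} + b \right) \right)^2 + C \, \lVert w \rVert_1
$$

The only change from ridge is the penalty: the L1 norm (sum of absolute values) replaces the squared L2 norm.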

<font color="magenta">Advice:</font> First figure out the derivative, which has a relatively simple form. Next, when implementing gradient descent, think carefully about two issues.

1. What is the step size (learning rate)?
2. When has the procedure converged?

Take the time to experiment with different ways of handling these.

**Note:** You can use additional methods as helpers if you feel the need.
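
One subtlety for lasso: the absolute value is not differentiable at zero, so the "derivative" of the L1 penalty is really a subgradient. The sketch below shows one valid choice, `np.sign(w)`, and, for contrast, the soft-thresholding operator used by proximal gradient methods. Soft-thresholding is a different technique from plain (sub)gradient descent and is shown only as an optional alternative; the function names are hypothetical helpers, not part of the assignment.

```python
import numpy as np

def l1_subgradient(w):
    """One valid subgradient of ||w||_1: sign(w), taking 0 where w == 0."""
    return np.sign(w)

def soft_threshold(w, t):
    """Shrink each entry of w toward zero by t; entries with |w| <= t become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w_demo = np.array([1.5, -0.2, 0.0])
g_demo = l1_subgradient(w_demo)        # entries: 1, -1, 0
shrunk = soft_threshold(w_demo, 0.5)   # the two small entries are set to zero
```

With plain subgradient steps, small weights tend to hover near zero rather than land on it exactly, so you may want a small tolerance when deciding which weights count as zero.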

In [None]:
def lasso_regression_GD(x,y,C):
    ### START CODE HERE ###
    return w,b,losses
    ### END CODE HERE ###

Let's try it out and plot the loss values recorded during the optimization process.

In [None]:
# Set regularization constant
C = 1.0 # you can try different values for C
# Run gradient descent solver
w, b, losses = lasso_regression_GD(trainx,trainy,C)
# Plot the losses
plt.plot(losses,'r')
plt.xlabel('Iterations', fontsize=14)
plt.ylabel('Loss', fontsize=14)
plt.show()

## 4. Evaluate the gradient descent solver for lasso regression

Now let's compare the regressor found by your gradient descent procedure to that returned by the built-in lasso regression solver in `sklearn`. We will compare them by their resulting MSE values.
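
As with ridge, `alpha` in `sklearn` and `C` in your solver may be on different scales. According to the `sklearn` documentation, `linear_model.Lasso` minimizes the following objective, where `m` is the number of training points:

$$
\min_{w} \; \frac{1}{2m} \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_1
$$

As an assumption-dependent example: if your solver averages the squared error over `m` points and adds `C` times the L1 norm of `w`, then halving your objective shows that it matches the expression above when `alpha = C / 2`.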

In [None]:
# Set regularization constant
C = 1.0 # you can change it
# Run gradient descent solver and compute its MSE
w, b, losses = lasso_regression_GD(trainx,trainy,C)
# Use built-in routine for lasso regression and compute MSE
lasso_regr = linear_model.Lasso(alpha=1.0) # you can try different values
lasso_regr.fit(trainx,trainy)
# Print MSE values
print("MSE of built-in linear regression (training): ", mean_squared_error(trainy, lin_regr.predict(trainx)))
print("MSE of gradient descent solver for lasso regression (training): ", compute_mse(w, b, trainx, trainy))
print("MSE of built-in solver for lasso regression (training): ", mean_squared_error(trainy, lasso_regr.predict(trainx)))
print("MSE of built-in linear regression (test): ", mean_squared_error(testy, lin_regr.predict(testx)))
print("MSE of gradient descent solver for lasso regression (test): ", compute_mse(w, b, testx, testy))
print("MSE of built-in solver for lasso regression (test): ", mean_squared_error(testy, lasso_regr.predict(testx)))

## Questions 

1. Document all the results in the report.
2. Try a large value of `C` (e.g., 20) for lasso and check the weights and MSE. What do you observe?
3. Compare the coefficients (parameter values) for ridge and lasso for the best setup. What do you observe? Can you explain?
4. Compare the MSE of linear, ridge, and lasso regression. What do you observe?
5. Which of ridge and lasso gives the best results on the test set? Can you explain why?
6. Can the lasso regression retrieve the 10 features that were used in the equation for `y`? List them.
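
For the last question, a small helper along the following lines may be convenient for listing which coefficients are effectively nonzero. The name `selected_features` and the tolerance value are assumptions; pick a tolerance that suits the scale of your weights.

```python
import numpy as np

def selected_features(coef, tol=1e-6):
    """Return the indices of coefficients whose magnitude exceeds tol."""
    return np.flatnonzero(np.abs(coef) > tol)

# Toy example: only the entries at positions 1 and 3 are clearly nonzero.
coef_demo = np.array([0.0, 0.8, 1e-9, -1.2])
idx_demo = selected_features(coef_demo)
```

It could be applied to the weights returned by your solver, e.g. `selected_features(w)`, or to the fitted `sklearn` model via `selected_features(lasso_regr.coef_)`.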