# Regression with Ridge and Lasso

In this part of the assignment, you need to predict the price of a house (column 4 of the CSV file) from the three features provided: 'len', 'width', and 'rooms' (the first 3 columns of the CSV file). You can use ``LinearRegression``, ``Ridge``, and ``Lasso`` from ``sklearn.linear_model``. If you want to expand the features to polynomials (say, degree 2), you can either transform the CSV file manually or use ``PolynomialFeatures`` from ``sklearn.preprocessing``. Adding polynomial features can improve the results, but you have to be careful about overfitting.


In [37]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Routines for linear regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Set label size for plots
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

In [38]:
data = np.genfromtxt('LandPriceTrain.csv', delimiter=',')
features = ['len', 'width', 'rooms']
x_train = data[:,0:3] # predictors
y_train = data[:,3] # response variable

In [39]:
data = np.genfromtxt('LandPriceTest.csv', delimiter=',')
x_test = data[:,0:3] # predictors
y_test = data[:,3] # response variable

### 1. What is the best we can achieve with no predictors, using only the response (house price) values in the training data? What will be the mean error?

In [42]:
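### START CODE HERE ###
# With no predictors, the constant that minimizes squared error is the mean of the response.
# The MSE of that prediction is the variance; its square root (the standard
# deviation) is what is reported below as "Mean error".
print ('The best result if we have no predictors:')
print ("Prediction (train): %0.4f "% np.mean(y_train))
print ("Mean squared error (train): %0.4f "% np.var(y_train))
print ("Mean error (train): %0.4f "% np.std(y_train))
# Note: the test lines use the test set's own mean; a baseline fitted on the
# training data would predict np.mean(y_train) for the test set instead.
print ("Prediction (test): %0.4f "% np.mean(y_test))
print ("Mean squared error (test): %0.4f "% np.var(y_test))
print ("Mean error (test): %0.4f "% np.std(y_test))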

The best result if we have no predictors:
Prediction (train): 84779.4500 
Mean squared error (train): 3210718511.6475 
Mean error (train): 56663.2024 
Prediction (test): 87341.6000 
Mean squared error (test): 3536459658.6400 
Mean error (test): 59468.1399 
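
The no-predictor baseline can also be evaluated by fitting the constant on the training responses and reusing it on the test responses. The sketch below assumes synthetic prices in place of the CSV data, and the names `y_tr`, `y_te`, and `baseline` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the price column (assumption: not the real CSV data).
rng = np.random.default_rng(0)
y_tr = rng.normal(85000, 55000, size=20)
y_te = rng.normal(85000, 55000, size=10)

# The constant that minimizes squared error on the training set is its mean.
baseline = np.mean(y_tr)

# Training MSE of the constant predictor equals the (population) variance.
mse_tr = np.mean((y_tr - baseline) ** 2)
# Test error uses the *training* mean, so it is never below np.var(y_te).
mse_te = np.mean((y_te - baseline) ** 2)

print("Baseline prediction: %0.4f" % baseline)
print("MSE (train): %0.4f, RMSE (train): %0.4f" % (mse_tr, np.sqrt(mse_tr)))
print("MSE (test):  %0.4f, RMSE (test):  %0.4f" % (mse_te, np.sqrt(mse_te)))
```

The square root of the MSE (RMSE) is in the same units as the price, which is why it is the quantity reported as the "mean error" above.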


### 2. Let's now use the features and see what we can observe  

In [43]:
def feature_subset_regression(x,y,flist):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 2):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of LinearRegression
    regr = linear_model.LinearRegression()
    regr.fit(x[:,flist], y)
    return regr

In [44]:
flist = [0,1,2]
regr = feature_subset_regression(x_train,y_train,flist)

print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)

print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [ 3010.83212779  2914.47821951 -2420.72225879]
b =  -77044.07528278603
Mean squared error (train):  217514383.15439194
Mean error (train):  12629.08643427137
Mean squared error (test):  158706743.7864198
Mean error (test):  9999.972138004141


### 3. It seems we are underfitting, as both the train and test errors are high. Let's try polynomial features.

Try incorporating polynomial features (say of degree 2) and see how you perform on the train and the test set. You can either transform the CSV file manually or use the ``PolynomialFeatures`` from ``sklearn.preprocessing``.

In [45]:
from sklearn.preprocessing import PolynomialFeatures

In [79]:
### START CODE HERE ###
# Try to expand the features, fit the linear regression, and report the results

def PolynomialFeatures_subset_regression(x,y,flist):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):  # 9 expanded features, so valid indices are 0..8
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of LinearRegression
    regr = linear_model.LinearRegression()
    regr.fit(x[:,flist], y)
    return regr

In [80]:
flist = [0,1,2]
poly = PolynomialFeatures(2, include_bias=False)
x_train = poly.fit_transform(x_train[:,flist])
x_test = poly.transform(x_test[:,flist])  # reuse the transformer fitted on the training data

x_train.shape, x_test.shape

((20, 9), (10, 9))

In [81]:
### UPDATE THE CODE BELOW ###
flist = [0,1,2,3,4,5,6,7,8]
regr = PolynomialFeatures_subset_regression(x_train, y_train, flist)

print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)

print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [ 0.00000000e+00 -7.95807864e-13  9.28826930e+02  0.00000000e+00
 -7.09407348e-44  9.28826930e+02 -1.99136489e-59  9.28826930e+02
  1.37837894e+01]
b =  -1413.5787105344498
Mean squared error (train):  1179955104.4133008
Mean error (train):  28386.18577169961
Mean squared error (test):  2025065980.4057744
Mean error (test):  38135.04041554661


### 4. It seems we are overfitting as the train error is significantly lower than the test error. Let's try some regularization techniques.

### Ridge Regression

In [31]:
from sklearn.linear_model import Ridge

In [84]:
def feature_subset_ridge(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Ridge, be careful of the parameter
    regr = Ridge(alpha = alp) 
    regr.fit(x[:,flist], y)
    return regr

In [85]:
flist = [0,1,2,3,4,5,6,7,8]
regr = feature_subset_ridge(x_train,y_train,flist, 0.05)

print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)

print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [  0.           0.         928.66087025   0.           0.
 928.66087026   0.         928.66087026  13.79187221]
b =  -1407.2999446951872
Mean squared error (train):  1179955105.5696177
Mean error (train):  28386.300099264976
Mean squared error (test):  2025017860.8386319
Mean error (test):  38134.58438110156
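
The strength of the Ridge penalty depends on the scale of the features, so a small `alpha` on unscaled polynomial features may barely change the fit. A minimal sketch of sweeping `alpha` with standardized features follows. It assumes synthetic data, and `synthetic_price`, `X_tr`, `X_te`, and `results` are hypothetical names not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

def synthetic_price(X, rng):
    # Made-up relationship: price grows with area and falls slightly with rooms.
    noise = rng.normal(0, 5000, size=len(X))
    return 150.0 * X[:, 0] * X[:, 1] - 2000.0 * X[:, 2] + noise

X_tr = rng.uniform(5, 30, size=(20, 3))
X_te = rng.uniform(5, 30, size=(10, 3))
y_tr = synthetic_price(X_tr, rng)
y_te = synthetic_price(X_te, rng)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Expand, standardize, then fit Ridge; the scaler is fitted on training data only.
    model = make_pipeline(PolynomialFeatures(2, include_bias=False),
                          StandardScaler(),
                          Ridge(alpha=alpha))
    model.fit(X_tr, y_tr)
    results[alpha] = (mean_squared_error(y_tr, model.predict(X_tr)),
                      mean_squared_error(y_te, model.predict(X_te)))
    print("alpha = %6.2f  MSE (train): %.1f  MSE (test): %.1f"
          % (alpha, results[alpha][0], results[alpha][1]))
```

The training error can only grow as `alpha` increases. In practice `alpha` would be chosen by cross-validation on the training data rather than by looking at the test error.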


### Lasso Regression

In [57]:
from sklearn.linear_model import Lasso

In [86]:
def feature_subset_lasso(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Lasso, be careful of the parameters  
    regr = Lasso(alpha = alp)
    regr.fit(x[:,flist], y)
    return regr

In [87]:
flist = [0,1,2,3,4,5,6,7,8]
regr = feature_subset_lasso(x_train,y_train,flist, 1150)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [0.00000000e+00 0.00000000e+00 2.55181954e+03 0.00000000e+00
 0.00000000e+00 3.92972587e-13 0.00000000e+00 7.48519213e-14
 1.75891198e+01]
b =  1545.6786983396014
Mean squared error (train):  1180211667.8754098
Mean error (train):  28478.725999645114
Mean squared error (test):  2002699444.2636254
Mean error (test):  37919.915186784994


## Document your observation and understanding

* We use polynomial features to address underfitting.
* To deal with overfitting, we use two types of regularization:
    1. Ridge
    2. Lasso
* Ridge (L2) regularization shrinks the coefficients toward zero but generally does not make them exactly zero.
* Lasso (L1) regularization performs feature selection by setting some coefficients exactly to zero.
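
The difference between the two penalties can be illustrated by counting exact zeros in the fitted coefficients. This is a sketch on synthetic data where only three of nine features matter. The data, the `alpha` values, and the names `true_w`, `ridge_demo`, and `lasso_demo` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Nine synthetic features, of which only three influence the response.
X = rng.normal(size=(200, 9))
true_w = np.array([5.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 2.0, 0.0])
y = X @ true_w + rng.normal(0, 0.5, size=200)

# Standardize so the penalty treats every feature on the same scale.
X_std = StandardScaler().fit_transform(X)

ridge_demo = Ridge(alpha=10.0).fit(X_std, y)
lasso_demo = Lasso(alpha=0.5).fit(X_std, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```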

# Additional Questions (Optional-Extra)

1. Implement the closed-form solution for Ridge
2. Implement the iterative solution (Gradient Descent) for Ridge
3. Implement the iterative solution for Lasso
4. Use ``sklearn.linear_model.ElasticNet`` and try it on the above problem.

Compare your implemented solutions with the built-in solutions on the above problem.
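
As one possible starting point for the first question, here is a minimal sketch of the closed-form Ridge solution with an unpenalized intercept, checked against ``Ridge`` on synthetic data. The function name `ridge_closed_form` and the data are hypothetical, and this is only one way to set the problem up.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw - b||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    # Centering X and y removes the intercept from the penalized problem.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Normal equations with the ridge term: (Xc^T Xc + alpha I) w = Xc^T yc
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    # Recover the intercept from the means.
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data for the comparison (assumption: not the assignment's CSV).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + 3.0 + rng.normal(0, 0.1, size=50)

w, b = ridge_closed_form(X, y, alpha=1.0)
ref = Ridge(alpha=1.0).fit(X, y)

print("closed form w:", w, "b:", b)
print("sklearn     w:", ref.coef_, "b:", ref.intercept_)
```

Using `np.linalg.solve` rather than forming an explicit inverse is the usual choice for numerical stability.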