# Regression with Ridge and Lasso

In this part of the assignment, you need to predict the price of a house (column 4 of the CSV file) from the features 'len', 'width', and 'rooms' (the first 3 columns of the CSV file). You can use ``LinearRegression``, ``Ridge``, and ``Lasso`` from ``sklearn.linear_model``. If you want to expand the features to polynomials (say, degree 2), you can either transform the CSV file manually or use ``PolynomialFeatures`` from ``sklearn.preprocessing``. Adding polynomial features can improve the results, but be careful about overfitting.


In [1]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Routines for linear regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Set label size for plots
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

In [2]:
data = np.genfromtxt('LandPriceTrain.csv', delimiter=',')
features = ['len', 'width', 'rooms']
x_train = data[:,0:3] # predictors
y_train = data[:,3] # response variable

In [3]:
data = np.genfromtxt('LandPriceTest.csv', delimiter=',')
x_test = data[:,0:3] # predictors
y_test = data[:,3] # response variable

### 1. What is the best we can achieve with no predictors, using only the response (house price) values in the training data? What will the mean error be?

In [4]:
### START CODE HERE ###
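mean = np.mean(y_train)  # best constant prediction under squared error
# Signed mean error: deviations from the mean cancel, so this is ~0 by construction.
# Use np.mean(np.abs(y_train - mean)) for a value comparable to mean_absolute_error below.
ME = np.mean(y_train - mean)
print ("Prediction: ", mean)
print ("Mean squared error: ", np.var(y_train))  # MSE of the mean predictor equals the variance
print ("Mean error: ", ME)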

Prediction:  84779.45
Mean squared error:  3210718511.6474996
Mean error:  1.4551915228366853e-12
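
The near-zero "mean error" above is the signed error, which cancels for any mean predictor. A minimal sketch on made-up prices (the array `y` is hypothetical, not taken from `LandPriceTrain.csv`) shows how the signed error, the squared error, and the absolute error differ for the same baseline:

```python
import numpy as np

# Hypothetical prices for illustration only
y = np.array([50000.0, 70000.0, 90000.0, 130000.0])

baseline = np.mean(y)          # constant prediction for every house
residuals = y - baseline

signed_mean_error = np.mean(residuals)   # positive and negative misses cancel
mse = np.mean(residuals ** 2)            # same value as np.var(y)
mae = np.mean(np.abs(residuals))         # typical size of a miss

print("Prediction:", baseline)
print("Signed mean error:", signed_mean_error)
print("Mean squared error:", mse)
print("Mean absolute error:", mae)
```

The absolute error is the quantity that `mean_absolute_error` reports in the later cells, so it is the fairer baseline to compare against.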


### 2. Let's now use the features and see what we can observe  

In [5]:
def feature_subset_regression(x,y,flist):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 2):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of LinearRegression
    regr = linear_model.LinearRegression()
    regr.fit(x[:,flist], y)
    return regr

In [6]:
flist = [0,1,2]
regr = feature_subset_regression(x_train,y_train,flist)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [ 3010.83212779  2914.47821951 -2420.72225879]
b =  -77044.07528278617
Mean squared error (train):  217514383.15439194
Mean error (train):  12629.086434271376
Mean squared error (test):  158706743.7864192
Mean error (test):  9999.972138004121


### 3. It seems we are underfitting, as both the train and test errors are high. Let's try polynomial features.

Try incorporating polynomial features (say of degree 2) and see how you perform on the train and the test set. You can either transform the CSV file manually or use the ``PolynomialFeatures`` from ``sklearn.preprocessing``.
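
As a rough illustration of what the degree-2 expansion produces, the sketch below applies ``PolynomialFeatures`` to two made-up rows (the values in `X` are hypothetical). It also shows how `include_bias` changes the column count, which matters when features are later selected by index:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical rows with columns (len, width, rooms)
X = np.array([[10.0, 20.0, 3.0],
              [12.0, 18.0, 4.0]])

# include_bias=True prepends a constant column of ones
with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
# include_bias=False keeps only the 3 linear and 6 quadratic terms
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(with_bias.shape)   # 10 columns
print(no_bias.shape)     # 9 columns
# Column order: len, width, rooms, len^2, len*width, len*rooms,
#               width^2, width*rooms, rooms^2
print(no_bias[0])
```

Since ``LinearRegression`` fits its own intercept by default, the constant column adds no information and its weight comes out as zero.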

In [7]:
### START CODE HERE ###
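# Expand the features, fit the linear regression, and report the results
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error

# The recorded run passed include_bias="False"; a non-empty string is truthy, so the
# constant column was kept (10 columns, leading 0 in w below). Recent scikit-learn
# versions may reject the string, so the same behavior is written as True here.
# Pass include_bias=False to drop the constant column (9 columns).
poly = PolynomialFeatures(degree=2, include_bias=True)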
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)

regr = linear_model.LinearRegression()
regr.fit(x_train_poly, y_train)


y_train_pred = regr.predict(x_train_poly)
y_test_pred = regr.predict(x_test_poly)

train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_me = mean_absolute_error(y_train, y_train_pred)
test_me = mean_absolute_error(y_test, y_test_pred)

In [8]:
### UPDATE THE CODE BELOW ###
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", train_mse)
print ("Mean error (train): ", train_me)
print ("Mean squared error (test): ", test_mse)
print ("Mean error (test): ", test_me)

w =  [ 0.00000000e+00  3.38880737e+02  1.79182339e+03  2.45415349e+03
 -1.76818784e+00  8.72187773e+01 -2.26660380e+01 -2.74067566e-02
 -3.10041171e+02  9.42215698e+02]
b =  -38854.243235337024
Mean squared error (train):  21545554.458582617
Mean error (train):  3338.283508427424
Mean squared error (test):  94462869.08423862
Mean error (test):  7759.405168100534


### 4. It seems we are overfitting as the train error is significantly lower than the test error. Let's try some regularization techniques.

### Ridge Regression

In [9]:
from sklearn.linear_model import Ridge

In [10]:
def feature_subset_ridge(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Ridge, be careful of the parameters
    X = x[:, flist]
    regr = Ridge(alpha=alp)  # alpha is the L2 penalty strength
    regr.fit(X, y)
    return regr

In [11]:
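# x_train_poly has 10 columns (constant + 9 terms), so indices 0-8 keep the
# constant column and leave out the last term (rooms^2)
flist = [0,1,2,3,4,5,6,7,8]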
x_train = x_train_poly
x_test = x_test_poly
regr = feature_subset_ridge(x_train,y_train,flist, 0.05)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [ 0.00000000e+00  4.12101764e+02  1.57654547e+03  9.23423066e+03
 -2.19052832e+00  8.80722027e+01 -4.10782058e+01  1.71336832e+00
 -2.96897086e+02]
b =  -47065.30819461866
Mean squared error (train):  22198039.813325174
Mean error (train):  3311.9241870382257
Mean squared error (test):  94344288.83366063
Mean error (test):  7840.610425059407


### Lasso Regression

In [12]:
from sklearn.linear_model import Lasso

In [13]:
def feature_subset_lasso(x,y,flist, alp):
    if len(flist) < 1:
        print ("Need at least one feature")
        return
    for f in flist:
        if (f < 0) or (f > 8):
            print ("Feature index is out of bounds")
            return
    ### COMPLETE CODE BELOW by creating an instance of Lasso, be careful of the parameters
    regr = Lasso(alpha=alp)  # alpha is the L1 penalty strength
    regr.fit(x[:,flist], y)
    return regr

In [14]:
flist = [0,1,2,3,4,5,6,7,8]
regr = feature_subset_lasso(x_train,y_train,flist, 1150)
print ("w = ", regr.coef_)
print ("b = ", regr.intercept_)
print ("Mean squared error (train): ", mean_squared_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean error (train): ", mean_absolute_error(y_train, regr.predict(x_train[:,flist])))
print ("Mean squared error (test): ", mean_squared_error(y_test, regr.predict(x_test[:,flist])))
print ("Mean error (test): ", mean_absolute_error(y_test, regr.predict(x_test[:,flist])))

w =  [  0.           0.           0.           0.          -2.47207915
  94.95087354  50.45538117  10.62218684 -65.22359262]
b =  -2800.9911741478863
Mean squared error (train):  32376302.916847847
Mean error (train):  4501.866664200646
Mean squared error (test):  51356247.313329
Mean error (test):  6399.21983148757


## Document your observations and understanding
(What do you learn from the results, what do the model parameters tell you, ...)

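# Add your observations and understanding:
1- In this problem Lasso appears to be the better regularization technique: its test MSE (about 5.1e7) is well below Ridge's (about 9.4e7), although its train MSE is higher (about 3.2e7 versus 2.2e7). The two models use very different alpha values (0.05 and 1150), so this compares the chosen settings rather than the methods in general.

2- Several Lasso weights are exactly zero (the constant column and the three linear terms). The L1 penalty drives weights to zero, which acts as a form of feature selection.

3- Ridge does not set weights to zero: its only zero weight belongs to the constant column, which is redundant once an intercept is fit.

4- Feature expansion helped (test MSE fell from about 1.6e8 to about 9.4e7), but the train MSE fell much further (from about 2.2e8 to about 2.2e7), which indicates overfitting.

5- Regularization techniques can help reduce the effects of overfitting.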
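
The difference in zero weights between the two penalties can be reproduced on synthetic data. The sketch below is an illustration under assumed settings (random features, a response that depends on only two of them, and arbitrarily chosen alpha values), not a rerun of the house price experiment:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns influence the response
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can zero weights out

print("Ridge w =", ridge.coef_)
print("Lasso w =", lasso.coef_)
print("Exact zeros (ridge):", int(np.sum(ridge.coef_ == 0)))
print("Exact zeros (lasso):", int(np.sum(lasso.coef_ == 0)))
```

In this setup Lasso discards the three irrelevant columns, while Ridge assigns them small but nonzero weights.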

# Additional Questions (Optional-Extra)

1. Implement the closed-form solution for Ridge
2. Implement the iterative solution (Gradient Descent) for Ridge
3. Implement the iterative solution for Lasso
4. Use sklearn's ``linear_model.ElasticNet`` and try it on the above problem.

Compare your implemented solutions with the built-in solutions on the above problem.
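
As a possible starting point for the first question, here is a minimal sketch of the closed-form Ridge solution, checked against ``sklearn.linear_model.Ridge`` on synthetic data. The helper name `ridge_closed_form` and the data are assumptions made for illustration. The sketch centers the data so that the intercept is not penalized, which is how ``Ridge`` behaves with its default `fit_intercept=True`:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw - b||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centering removes the intercept
    yc = y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)    # solve (Xc^T Xc + alpha I) w = Xc^T yc
    b = y_mean - x_mean @ w              # recover the intercept
    return w, b

# Synthetic data, not the house price data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

w, b = ridge_closed_form(X, y, alpha=0.05)
ref = Ridge(alpha=0.05).fit(X, y)

print("max |w - sklearn w| =", np.max(np.abs(w - ref.coef_)))
print("|b - sklearn b|     =", abs(b - ref.intercept_))
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the usual choice for numerical stability. The same comparison pattern can be reused for the gradient descent and Lasso implementations.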