# Regression Week 5: Feature Selection and LASSO (Interpretation)

In this notebook, you will use LASSO to select features, building on a pre-implemented solver for LASSO (using GraphLab Create, though you can use other solvers). You will:
* Run LASSO with different L1 penalties.
* Choose best L1 penalty using a validation set.
* Choose best L1 penalty using a validation set, with additional constraint on the size of subset.

In the second notebook, you will implement your own LASSO solver, using coordinate descent. 

# Fire up Packages

In [156]:
import pandas as pd
import numpy as np

from sklearn.linear_model import Lasso
from math import log, sqrt

# Load in house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [157]:
sales = pd.read_csv('kc_house_data.csv')

# Create new features

In [158]:
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']

# In the dataset, 'floors' was defined with type string, 
# so we'll convert them to float, before creating a new feature.
sales['floors'] = sales['floors'].astype(float) 
sales['floors_square'] = sales['floors']*sales['floors']

In [159]:
sales

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sqft_living_sqrt,sqft_lot_sqrt,bedrooms_square,floors_square
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,0,98178,47.5112,-122.257,1340,5650,34.351128,75.166482,9,1.0
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,1991,98125,47.7210,-122.319,1690,7639,50.695167,85.099941,9,4.0
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,...,0,98028,47.7379,-122.233,2720,8062,27.748874,100.000000,4,1.0
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,...,0,98136,47.5208,-122.393,1360,5000,44.271887,70.710678,16,1.0
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,0,98074,47.6168,-122.045,1800,7503,40.987803,89.888820,9,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,0,98103,47.6993,-122.346,1530,1509,39.115214,33.630343,9,9.0
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,...,0,98146,47.5107,-122.362,1830,7200,48.062459,76.243032,16,4.0
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,...,0,98144,47.5944,-122.299,1020,2007,31.937439,36.742346,4,4.0
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,...,0,98027,47.5345,-122.069,1410,1287,40.000000,48.867167,9,4.0


* Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
* On the other hand, taking square root of sqft_living will decrease the separation between big house and small house. The owner may not be exactly twice as happy for getting a house that is twice as big.

# Learn regression weights with L1 penalty

Let us fit a model with all the features available, plus the features we just created above.

In [160]:
all_features = ['bedrooms', 'bedrooms_square',
           'bathrooms',
           'sqft_living', 'sqft_living_sqrt',
          'sqft_lot', 'sqft_lot_sqrt',
           'floors', 'floors_square',
          'waterfront', 'view', 'condition', 'grade',
          'sqft_above',
          'sqft_basement',
         'yr_built', 'yr_renovated',
          'sqft_lot']

Applying L1 penalty requires adding an extra parameter (`l1_penalty`) to the linear regression call in GraphLab Create. (Other tools may have separate implementations of LASSO.)  Note that it's important to set `l2_penalty=0` to ensure we don't introduce an additional L2 penalty.

In [174]:
# Create a model variables using all the features and the outcome price
price = np.array(sales['price']).reshape(-1,1)
mod_all_features = sales[all_features]

In [181]:
# create base lasso model
lasso_model = Lasso(alpha = 1.0)
mod_all = lasso_model.fit(price, mod_all_features)
mod_all.coef_

array([[7.81150159e-07],
       [5.71613588e-06],
       [1.10163137e-06],
       [1.75627877e-03],
       [1.71930895e-05],
       [1.01158365e-02],
       [2.62779250e-05],
       [3.77697771e-07],
       [1.17348885e-06],
       [6.27652030e-08],
       [8.29277316e-07],
       [6.44448509e-08],
       [2.13696704e-06],
       [1.36591573e-03],
       [3.90363034e-04],
       [4.32139104e-06],
       [1.38333057e-04],
       [1.01158365e-02]])

In [186]:
# create pseduo-ols lasso model
ols_lasso_model = Lasso(alpha = 0.01)
ols_mod_all = ols_lasso_model.fit(price, mod_all_features)
ols_mod_all.coef_

array([[7.81157505e-07],
       [5.71614323e-06],
       [1.10163871e-06],
       [1.75627878e-03],
       [1.71930968e-05],
       [1.01158365e-02],
       [2.62779323e-05],
       [3.77705116e-07],
       [1.17349620e-06],
       [6.27725485e-08],
       [8.29284662e-07],
       [6.44521964e-08],
       [2.13697438e-06],
       [1.36591574e-03],
       [3.90363042e-04],
       [4.32139839e-06],
       [1.38333064e-04],
       [1.01158365e-02]])