# Validation and Feature Selection Lab

### Introduction

In this lab, we'll use what we've learned about feature selection and validation sets to select better features for our model and to evaluate its performance.

### Loading our Data

In [46]:
import pandas as pd

housing_df = pd.read_csv('./kc_house_data.csv')

In [47]:
housing_df[:2]

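| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |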


Let's take a look at the dataframe's `info`.

In [48]:

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 21613 entries, 0 to 21612
# Data columns (total 21 columns):
# id               21613 non-null int64
# date             21613 non-null object
# price            21613 non-null float64
# bedrooms         21613 non-null int64
# bathrooms        21613 non-null float64
# sqft_living      21613 non-null int64
# sqft_lot         21613 non-null int64
# floors           21613 non-null float64
# waterfront       21613 non-null int64
# view             21613 non-null int64
# condition        21613 non-null int64
# grade            21613 non-null int64
# sqft_above       21613 non-null int64
# sqft_basement    21613 non-null int64
# yr_built         21613 non-null int64
# yr_renovated     21613 non-null int64
# zipcode          21613 non-null int64
# lat              21613 non-null float64
# long             21613 non-null float64
# sqft_living15    21613 non-null int64
# sqft_lot15       21613 non-null int64
# dtypes: float64(5), int64(15), object(1)
# memory usage: 3.5+ MB


As we can see, none of our columns has missing (NaN) values, and only the `date` column is non-numeric.  Let's convert the `date` column to a datetime column and rename it to `sale_date`.

In [449]:
housing_df[:1]
# 	id	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	...	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15	sale_date	sale_month	sale_year
# 0	7129300520	221900.0	3	1.0	1180	5650	1.0	0	0	3	...	1955	0	98178	47.5112	-122.257	1340	5650	2014-10-13	10	2014

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date,sale_month,sale_year
0,7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,...,1955,0,98178,47.5112,-122.257,1340,5650,2014-10-13,10,2014


In [51]:
housing_df['sale_date'].dtype
# dtype('<M8[ns]')

dtype('<M8[ns]')

In [52]:
'date' in housing_df.columns
# False

False
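
The conversion described above could look roughly like the following. This is a minimal sketch on a two-row stand-in for `housing_df`; the toy frame and the explicit `format` string are assumptions, and the real file may parse fine without the format.

```python
import pandas as pd

# Toy stand-in for housing_df (assumption); the lab loads kc_house_data.csv.
toy_df = pd.DataFrame({
    'id': [7129300520, 6414100192],
    'date': ['20141013T000000', '20141209T000000'],
    'price': [221900.0, 538000.0],
})

# Parse the string column into datetimes under the new name,
# then drop the original string column.
toy_df['sale_date'] = pd.to_datetime(toy_df['date'], format='%Y%m%dT%H%M%S')
toy_df = toy_df.drop(columns=['date'])

print(toy_df['sale_date'].dtype)
print('date' in toy_df.columns)
```

Creating the new column and dropping the old one, rather than renaming in place, keeps the raw strings available until the parsed values have been checked.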

### Missing Value Analysis and Feature Engineering

> In future lessons, we will learn how to look for additional missing data in our dataset, and extract further features from our dataset.  But we haven't learned the best ways of doing that yet.

Still, let's perform some basic feature engineering with the `sale_date`.  Our machine learning model cannot handle datetimes directly, so we'll need to extract what we can from the date.  Let's add a `sale_month` column and a `sale_year` column.

In [450]:
sale_month = None
sale_month[:2]
# 0    10
# 1    12
# Name: sale_date, dtype: int64

0    10
1    12
Name: sale_date, dtype: int64

In [451]:
sale_year = None
sale_year[:2]
# 0    2014
# 1    2014
# Name: sale_date, dtype: int64

0    2014
1    2014
Name: sale_date, dtype: int64
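
One possible way to pull the month and year out of a datetime column is the `.dt` accessor. The sketch below uses a small hypothetical series rather than the real `housing_df['sale_date']`.

```python
import pandas as pd

# Hypothetical two-row datetime series standing in for housing_df['sale_date'].
sale_dates = pd.Series(pd.to_datetime(['2014-10-13', '2014-12-09']),
                       name='sale_date')

# The .dt accessor exposes datetime components as integer series.
sale_month = sale_dates.dt.month
sale_year = sale_dates.dt.year

print(sale_month.tolist())
print(sale_year.tolist())
```

The resulting series keep the name `sale_date`, so they still need to be assigned to new `sale_month` and `sale_year` columns on the dataframe.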

Let's sort our dataframe by `sale_date`, and then remove the `sale_date` column.

In [57]:
sorted_housing_df = None

In [61]:
sorted_housing_df['sale_date'][:5]
# 16768   2014-05-02
# 9596    2014-05-02
# 9587    2014-05-02
# 20602   2014-05-02
# 11577   2014-05-02
# Name: sale_date, dtype: datetime64[ns]


16768   2014-05-02
9596    2014-05-02
9587    2014-05-02
20602   2014-05-02
11577   2014-05-02
Name: sale_date, dtype: datetime64[ns]

In [63]:
'sale_date' in sorted_housing_df.columns
# False

False
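
A sketch of the sort-then-drop step on a toy frame (the data here is made up for illustration). The order matters: sorting has to happen while `sale_date` is still present.

```python
import pandas as pd

# Made-up three-row frame; only the column names mirror the lab.
toy_df = pd.DataFrame({
    'price': [538000.0, 221900.0, 437500.0],
    'sale_date': pd.to_datetime(['2014-12-09', '2014-10-13', '2014-05-02']),
})

# Sort chronologically; the original index labels travel with their rows.
sorted_toy_df = toy_df.sort_values('sale_date')
print(sorted_toy_df.index.tolist())

# Only now drop the datetime column, which the model cannot use directly.
sorted_toy_df = sorted_toy_df.drop(columns=['sale_date'])
print(sorted_toy_df.columns.tolist())
```

Because the index labels are preserved, the sorted frame shows shuffled labels such as 16768 and 9596 in the lab's output.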

In [66]:
sorted_housing_df[['sale_month', 'sale_year']][:3]
# 	sale_month	sale_year
# 16768	5	2014
# 9596	5	2014
# 9587	5	2014

Unnamed: 0,sale_month,sale_year
16768,5,2014
9596,5,2014
9587,5,2014


In [147]:
sorted_housing_df[:2]

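| | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | ... | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | sale_month | sale_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16768 | 5561000190 | 437500.0 | 3 | 2.25 | 1970 | 35100 | 2.0 | 0 | 0 | 4 | ... | 0 | 1977 | 0 | 98027 | 47.4635 | -121.991 | 2340 | 35100 | 5 | 2014 |
| 9596 | 472000620 | 790000.0 | 3 | 2.5 | 2600 | 4750 | 1.0 | 0 | 0 | 4 | ... | 900 | 1951 | 0 | 98117 | 47.6833 | -122.4 | 2380 | 4750 | 5 | 2014 |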


Let's do some initial feature selection.  In general our strategy will be to include as many features as possible.  So right now, let's eliminate `id` as it doesn't contain information, and `price` as this will be our target.

In [452]:
X = None

Let's also remove our geographic columns (`zipcode`, `lat`, and `long`).  We'll learn how to work with geographic data later.  For now, these columns are not linearly related to price, so a linear model can't use them well and they mostly contribute variance.

In [453]:
X = None

In [454]:
X.columns
# Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
#        'waterfront', 'view', 'condition', 'grade', 'sqft_above',
#        'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15',
#        'sqft_lot15', 'sale_month', 'sale_year'],
#       dtype='object')

Next, let's assign our variable `y` to be the price.

In [455]:
y = None
y[:3]
# 16768    437500.0
# 9596     790000.0
# 9587     675000.0
# Name: price, dtype: float64

16768    437500.0
9596     790000.0
9587     675000.0
Name: price, dtype: float64
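
The feature and target selection above might be sketched as follows. The two-row frame is a stand-in with only a few of the real columns, so treat it as an illustration of the `drop` calls rather than of the dataset.

```python
import pandas as pd

# Stand-in for sorted_housing_df with a handful of its columns (assumption).
toy_df = pd.DataFrame({
    'id': [5561000190, 472000620],
    'price': [437500.0, 790000.0],
    'bedrooms': [3, 3],
    'sqft_living': [1970, 2600],
    'zipcode': [98027, 98117],
    'lat': [47.4635, 47.6833],
    'long': [-121.991, -122.4],
}, index=[16768, 9596])

# Remove the identifier and the target from the features.
X = toy_df.drop(columns=['id', 'price'])
# Remove the geographic columns as a second step.
X = X.drop(columns=['zipcode', 'lat', 'long'])
# The target is the price column on its own.
y = toy_df['price']

print(X.columns.tolist())
```

`drop` returns a new frame by default, so `toy_df` itself is left unchanged and `y` can still be read from it.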

### Splitting our Data

Ok, now that our data is transformed, we can split it into training, validation, and test sets, and then perform feature selection.  Before dividing our data, let's get an overview of how large our dataset is.

In [108]:
X_scaled_df.shape
# (21613, 20)

(21613, 20)

So we have 21,613 records.  That's a good amount.  Let's get a sense of when these records are from, to see how spread out our data is over time.  Plot a histogram of the years of the records.

> Answer: <img src="./year_dist.png" width="50%">

So our data doesn't span much time: every sale is from 2014 or 2015.  This means that we don't have to be too concerned about spending our most recent data on validation and testing.  But because we only have two years of data, we should be careful not to use up all of the data for specific months in our holdout sets.  This is our most recent row.

In [187]:
X[-1:]

# bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	sqft_living15	sqft_lot15	sale_month	sale_year
# 16594	4	2.25	3750	5000	2.0	0	0	5	8	2440	1310	1924	0	2170	4590	5	2015

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,sale_month,sale_year
16594,4,2.25,3750,5000,2.0,0,0,5,8,2440,1310,1924,0,2170,4590,5,2015


And let's see how far back 4000 rows takes us.

In [185]:
X[-4001:-4000]
# 	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	sqft_living15	sqft_lot15	sale_month	sale_year
# 7433	5	3.5	4050	20925	2.0	0	3	3	10	3020	1030	1973	2005	3880	18321	3	2015

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,sale_month,sale_year
7433,5,3.5,4050,20925,2.0,0,3,3,10,3020,1030,1973,2005,3880,18321,3,2015


Another 4000 rows takes us into the previous year.

In [186]:
X[-8001:-8000]
# 	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	sqft_living15	sqft_lot15	sale_month	sale_year
# 4579	4	1.75	1650	6900	1.0	0	0	3	7	910	740	1978	1993	1540	7645	12	2014

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,sale_month,sale_year
4579,4,1.75,1650,6900,1.0,0,0,3,7,910,740,1978,1993,1540,7645,12,2014


It would be nice to allow our model to see some 2015 data, so that it can determine if the year is influential.  So let's allocate the last 2000 observations to our test set, and the 2000 before that to the validation set.  The rest is the training set.

In [195]:
X_test = None
y_test = None

X_validate = None
y_validate = None

X_train = None
y_train = None

In [196]:
sale_cols = ['sale_year', 'sale_month']
X_train[-1:][sale_cols], X_validate[-1:][sale_cols], X_test[-1:][sale_cols]
# (      sale_year  sale_month
#  7433       2015           3,
#        sale_year  sale_month
#  1021       2015           4,
#         sale_year  sale_month
#  16594       2015           5)


(      sale_year  sale_month
 7433       2015           3,
       sale_year  sale_month
 1021       2015           4,
        sale_year  sale_month
 16594       2015           5)
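
Because the rows are already sorted by sale date, a chronological split can be done with positional slicing. Here is a minimal sketch on ten toy rows, with a holdout size of 2 standing in for the lab's 2000; the toy data and the name `n_holdout` are assumptions.

```python
import pandas as pd

# Ten toy rows, already in chronological order.
X_toy = pd.DataFrame({'sale_month': range(1, 11)})
y_toy = pd.Series(range(100, 110), name='price')

n_holdout = 2  # the lab uses 2000

# Most recent rows go to the test set.
X_test, y_test = X_toy.iloc[-n_holdout:], y_toy.iloc[-n_holdout:]

# The block just before that is the validation set.
X_validate = X_toy.iloc[-2 * n_holdout:-n_holdout]
y_validate = y_toy.iloc[-2 * n_holdout:-n_holdout]

# Everything earlier is the training set.
X_train, y_train = X_toy.iloc[:-2 * n_holdout], y_toy.iloc[:-2 * n_holdout]

print(len(X_train), len(X_validate), len(X_test))
```

Slicing by position rather than sampling at random keeps every validation and test row later in time than the training rows.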

Before moving on, let's train a linear regression model on the training set.

In [197]:
model = None

Check the score on the validation data.

In [199]:

# 0.6190954414632961

0.6190954414632961

Ok, not too bad.
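
For reference, fitting and scoring a linear regression generally follows the pattern below. The data is synthetic (an assumption made so the block runs on its own), so the score it prints has nothing to do with the housing score above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 3))
y_all = X_all @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Earlier rows for training, later rows for validation.
X_train, y_train = X_all[:150], y_all[:150]
X_validate, y_validate = X_all[150:], y_all[150:]

# Fit on the training rows only, then score on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
validation_score = model.score(X_validate, y_validate)

print(validation_score)
```

For a regressor, `score` returns the coefficient of determination (R-squared), where 1.0 is a perfect fit and 0.0 matches always predicting the mean.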

### Scaling our Data

Now that we've performed some basic feature engineering and divided our data, let's move on to scaling our data.

In [458]:
from sklearn.preprocessing import StandardScaler
transformer = StandardScaler()
X_train_scaled = transformer.fit_transform(X_train)
X_train_scaled[:2]

array([[-0.40007305,  0.16654127, -0.12714512,  0.52118016,  0.92982032,
        -0.08788627, -0.30543791,  0.87668489,  1.13039067,  0.21054702,
        -0.65827427,  0.20447298, -0.21264288,  0.50954597,  0.85687474,
        -0.69872446, -0.45127518],
       [-0.40007305,  0.49016125,  0.55362948, -0.26377938, -0.91859112,
        -0.08788627, -0.30543791,  0.87668489,  1.13039067, -0.11326886,
         1.36222672, -0.68078079, -0.21264288,  0.56768548, -0.3003021 ,
        -0.69872446, -0.45127518]])

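Once the scaler has been fit on the training data, call `transform` (not `fit_transform`) on the validation data using the same `transformer`, so that both sets are scaled with the training set's means and standard deviations.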

> Look through the rightmost column of data to check your work.

In [457]:
X_validate_scaled = None
X_validate_scaled[:2]
# array([[-0.40007305,  0.16654127,  1.06150893, -0.13717717,  0.92982032,
#         -0.08788627, -0.30543791, -0.63862924,  1.97775068,  0.37845156,
#          1.49692679,  0.54495519, -0.21264288,  1.65780117, -0.12110173,
#         -1.3379471 ,  2.21594284],
#        [-0.40007305, -0.1570787 , -0.46212944, -0.18080903, -0.91859112,
#         -0.08788627, -0.30543791, -0.63862924, -0.56432936, -0.79688018,
#          0.53157631,  0.40876231, -0.21264288, -0.63870922, -0.18984604,
#         -1.3379471 ,  2.21594284]])

array([[-0.40007305,  0.16654127,  1.06150893, -0.13717717,  0.92982032,
        -0.08788627, -0.30543791, -0.63862924,  1.97775068,  0.37845156,
         1.49692679,  0.54495519, -0.21264288,  1.65780117, -0.12110173,
        -1.3379471 ,  2.21594284],
       [-0.40007305, -0.1570787 , -0.46212944, -0.18080903, -0.91859112,
        -0.08788627, -0.30543791, -0.63862924, -0.56432936, -0.79688018,
         0.53157631,  0.40876231, -0.21264288, -0.63870922, -0.18984604,
        -1.3379471 ,  2.21594284]])
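
The fit-on-train, transform-on-validation pattern can be checked by hand on a tiny example. The arrays below are made up; the second column plays the role of `sale_year`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training rows: a generic feature and a year-like feature.
X_train = np.array([[1.0, 2014.0],
                    [2.0, 2014.0],
                    [3.0, 2015.0],
                    [4.0, 2015.0]])
# One validation row, from the later year.
X_validate = np.array([[2.5, 2015.0]])

transformer = StandardScaler()
# fit_transform learns each column's mean and standard deviation from train.
X_train_scaled = transformer.fit_transform(X_train)
# transform reuses those training statistics; nothing is re-fit.
X_validate_scaled = transformer.transform(X_validate)

print(X_validate_scaled)
```

The validation row's first feature equals the training mean, so it scales to 0, while its year sits one training standard deviation above the training mean, so it scales to 1. This is also why every validation row in the lab shares the same value in the rightmost column: they all have the same sale year.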

One annoying thing about scaling our data is that we then lose a sense of our columns.

In [203]:
X_train_scaled[:2]
# array([[-0.40007305,  0.16654127, -0.12714512,  0.52118016,  0.92982032,
#         -0.08788627, -0.30543791,  0.87668489,  1.13039067,  0.21054702,
#         -0.65827427,  0.20447298, -0.21264288,  0.50954597,  0.85687474,
#         -0.69872446, -0.45127518],
#        [-0.40007305,  0.49016125,  0.55362948, -0.26377938, -0.91859112,
#         -0.08788627, -0.30543791,  0.87668489,  1.13039067, -0.11326886,
#          1.36222672, -0.68078079, -0.21264288,  0.56768548, -0.3003021 ,
#         -0.69872446, -0.45127518]])

Change our array back to a dataframe, and assign the columns in X to be the columns here.

In [204]:
X_train_scaled_df = None

In [205]:
X_train_scaled_df[:2]
# 	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	sqft_living15	sqft_lot15	sale_month	sale_year
# 0	-0.400073	0.166541	-0.127145	0.521180	0.929820	-0.087886	-0.305438	0.876685	1.130391	0.210547	-0.658274	0.204473	-0.212643	0.509546	0.856875	-0.698724	-0.451275
# 1	-0.400073	0.490161	0.553629	-0.263779	-0.918591	-0.087886	-0.305438	0.876685	1.130391	-0.113269	1.362227	-0.680781	-0.212643	0.567685	-0.300302	-0.698724	-0.451275

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,sale_month,sale_year
0,-0.400073,0.166541,-0.127145,0.52118,0.92982,-0.087886,-0.305438,0.876685,1.130391,0.210547,-0.658274,0.204473,-0.212643,0.509546,0.856875,-0.698724,-0.451275
1,-0.400073,0.490161,0.553629,-0.263779,-0.918591,-0.087886,-0.305438,0.876685,1.130391,-0.113269,1.362227,-0.680781,-0.212643,0.567685,-0.300302,-0.698724,-0.451275


And do the same with the validation data.

In [206]:
X_validate_scaled_df = None

In [207]:
X_validate_scaled_df[:2]
# 	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	sqft_living15	sqft_lot15	sale_month	sale_year
# 0	-0.400073	0.166541	1.061509	-0.137177	0.929820	-0.087886	-0.305438	-0.638629	1.977751	0.378452	1.496927	0.544955	-0.212643	1.657801	-0.121102	-1.337947	2.215943
# 1	-0.400073	-0.157079	-0.462129	-0.180809	-0.918591	-0.087886	-0.305438	-0.638629	-0.564329	-0.796880	0.531576	0.408762	-0.212643	-0.638709	-0.189846	-1.337947	2.215943

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,sale_month,sale_year
0,-0.400073,0.166541,1.061509,-0.137177,0.92982,-0.087886,-0.305438,-0.638629,1.977751,0.378452,1.496927,0.544955,-0.212643,1.657801,-0.121102,-1.337947,2.215943
1,-0.400073,-0.157079,-0.462129,-0.180809,-0.918591,-0.087886,-0.305438,-0.638629,-0.564329,-0.79688,0.531576,0.408762,-0.212643,-0.638709,-0.189846,-1.337947,2.215943
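
Wrapping a scaled array back into a dataframe might look like this. The small `X` and the scaled values are hypothetical; only the pattern of reusing `X.columns` comes from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical unscaled features, with shuffled index labels as in the lab.
X = pd.DataFrame({'bedrooms': [3, 4], 'bathrooms': [1.0, 2.25]},
                 index=[16768, 9596])
# Hypothetical scaled values in the same column order.
X_scaled = np.array([[-1.0, -1.0],
                     [1.0, 1.0]])

# Reattach the column names; the index restarts at 0 unless one is passed.
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

print(X_scaled_df)
```

The new frame gets a fresh 0-based index, which matches the 0 and 1 row labels in the lab's scaled output. Passing `index=X.index` would keep the original labels instead.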


### Feature Selection

Ok, now it's time to perform feature selection.  Let's fit our model again on the scaled data, and first check that the score is not drastically different.  Then we'll use the coefficients of our model to sort our features in order of importance.

 

In [403]:
scaled_model = None

# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now check the score on the validation set.

In [404]:

# 0.6190954414632183

0.6190954414632183

Ok, now we can use the coefficients to perform feature selection.  Return the columns and the related coefficients, sorted from largest to smallest by absolute value.

> Begin by getting the sorted indices.

In [405]:
sorted_idcs = None
sorted_idcs
# array([15, 16,  3, 12, 13, 14,  7,  4,  1,  6, 10,  0,  5,  9,  2, 11,  8])

array([15, 16,  3, 12, 13, 14,  7,  4,  1,  6, 10,  0,  5,  9,  2, 11,  8])

Then display the sorted coefficients and columns side by side.

> But if you get really stuck, just display them individually.

In [407]:
sorted_coef_and_cols = None
sorted_coef_and_cols

# array([['grade', 138836.31679376948],
#        ['yr_built', -101790.58028249719],
#        ['sqft_living', 78287.60048048769],
#        ['sqft_above', 67873.70025402229],
#        ['waterfront', 49773.09979959703],
#        ['bedrooms', -36493.80787921594],
#        ['sqft_basement', 35594.618127523405],
#        ['view', 35147.109199198654],
#        ['bathrooms', 34642.41372185266],
#        ['floors', 14596.913311944081],
#        ['condition', 13624.900481106748],
#        ['sqft_lot15', -13270.10570317243],
#        ['sqft_living15', 12748.83638370339],
#        ['yr_renovated', 5011.776176784931],
#        ['sqft_lot', -4060.1168984997607],
#        ['sale_year', 3368.824213386116],
#        ['sale_month', -780.5262792850269]], dtype=object)

array([['grade', 138836.31679376948],
       ['yr_built', -101790.58028249719],
       ['sqft_living', 78287.60048048769],
       ['sqft_above', 67873.70025402229],
       ['waterfront', 49773.09979959703],
       ['bedrooms', -36493.80787921594],
       ['sqft_basement', 35594.618127523405],
       ['view', 35147.109199198654],
       ['bathrooms', 34642.41372185266],
       ['floors', 14596.913311944081],
       ['condition', 13624.900481106748],
       ['sqft_lot15', -13270.10570317243],
       ['sqft_living15', 12748.83638370339],
       ['yr_renovated', 5011.776176784931],
       ['sqft_lot', -4060.1168984997607],
       ['sale_year', 3368.824213386116],
       ['sale_month', -780.5262792850269]], dtype=object)
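
One way to produce this kind of ranking is `np.argsort` on the absolute coefficients. The sketch below uses four of the features, with coefficients rounded from the output above; the variable name `descending_idcs` is an assumption.

```python
import numpy as np

# Four of the lab's features, with coefficients rounded from its output.
columns = np.array(['bedrooms', 'grade', 'sale_month', 'yr_built'])
coefs = np.array([-36493.8, 138836.3, -780.5, -101790.6])

# argsort returns ascending order, so the smallest magnitude comes first.
sorted_idcs = np.argsort(np.abs(coefs))
# Reverse to go from largest to smallest magnitude.
descending_idcs = sorted_idcs[::-1]

# Pair each column name with its signed coefficient.
sorted_coef_and_cols = np.array(
    list(zip(columns[descending_idcs], coefs[descending_idcs])),
    dtype=object)

print(sorted_coef_and_cols)
```

Sorting on the absolute value keeps large negative coefficients such as `yr_built` near the top, while the signed value is what gets displayed.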

Now order the columns of the `X_train_scaled_df` and `X_validate_scaled_df` in order of feature importance.

In [460]:
sorted_cols = None

In [459]:
sorted_cols
# array(['grade', 'yr_built', 'sqft_living', 'sqft_above', 'waterfront',
#        'bedrooms', 'sqft_basement', 'view', 'bathrooms', 'floors',
#        'condition', 'sqft_lot15', 'sqft_living15', 'yr_renovated',
#        'sqft_lot', 'sale_year', 'sale_month'], dtype=object)

In [409]:
X_train_sorted_cols_df = X_train_scaled_df[sorted_cols]
X_train_sorted_cols_df[:1]
# 	grade	yr_built	sqft_living	sqft_above	waterfront	bedrooms	sqft_basement	view	bathrooms	floors	condition	sqft_lot15	sqft_living15	yr_renovated	sqft_lot	sale_year	sale_month
# 0	1.130391	0.204473	-0.127145	0.210547	-0.087886	-0.400073	-0.658274	-0.305438	0.166541	0.92982	0.876685	0.856875	0.509546	-0.212643	0.52118	-0.451275	-0.698724

Unnamed: 0,grade,yr_built,sqft_living,sqft_above,waterfront,bedrooms,sqft_basement,view,bathrooms,floors,condition,sqft_lot15,sqft_living15,yr_renovated,sqft_lot,sale_year,sale_month
0,1.130391,0.204473,-0.127145,0.210547,-0.087886,-0.400073,-0.658274,-0.305438,0.166541,0.92982,0.876685,0.856875,0.509546,-0.212643,0.52118,-0.451275,-0.698724


In [411]:
X_validate_sorted_cols_df = X_validate_scaled_df[sorted_cols]

In [412]:
X_validate_sorted_cols_df[:1]

# 	grade	yr_built	sqft_living	sqft_above	waterfront	bedrooms	sqft_basement	view	bathrooms	floors	condition	sqft_lot15	sqft_living15	yr_renovated	sqft_lot	sale_year	sale_month
# 0	1.977751	0.544955	1.061509	0.378452	-0.087886	-0.400073	1.496927	-0.305438	0.166541	0.92982	-0.638629	-0.121102	1.657801	-0.212643	-0.137177	2.215943	-1.337947

Unnamed: 0,grade,yr_built,sqft_living,sqft_above,waterfront,bedrooms,sqft_basement,view,bathrooms,floors,condition,sqft_lot15,sqft_living15,yr_renovated,sqft_lot,sale_year,sale_month
0,1.977751,0.544955,1.061509,0.378452,-0.087886,-0.400073,1.496927,-0.305438,0.166541,0.92982,-0.638629,-0.121102,1.657801,-0.212643,-0.137177,2.215943,-1.337947


In [348]:
X_validate_sorted_cols_df.shape

(2000, 17)

Let's begin by creating a list of training datasets, taking the first 1 to n columns of `X_train_sorted_cols_df` in order of importance.

> There should be 17 datasets when we're done, one for each number of features.

In [413]:
X_train_datasets = [X_train_sorted_cols_df.iloc[:, :i] for i in range(1, 18)]

In [461]:
X_train_datasets[0][:2]

# grade
# 0	1.130391
# 1	1.130391

Unnamed: 0,grade
0,1.130391
1,1.130391


In [462]:
X_train_datasets[-1][:2]
# 	grade	yr_built	sqft_living	sqft_above	waterfront	bedrooms	sqft_basement	view	bathrooms	floors	condition	sqft_lot15	sqft_living15	yr_renovated	sqft_lot	sale_year	sale_month
# 0	1.130391	0.204473	-0.127145	0.210547	-0.087886	-0.400073	-0.658274	-0.305438	0.166541	0.929820	0.876685	0.856875	0.509546	-0.212643	0.521180	-0.451275	-0.698724
# 1	1.130391	-0.680781	0.553629	-0.113269	-0.087886	-0.400073	1.362227	-0.305438	0.490161	-0.918591	0.876685	-0.300302	0.567685	-0.212643	-0.263779	-0.451275	-0.698724

Unnamed: 0,grade,yr_built,sqft_living,sqft_above,waterfront,bedrooms,sqft_basement,view,bathrooms,floors,condition,sqft_lot15,sqft_living15,yr_renovated,sqft_lot,sale_year,sale_month
0,1.130391,0.204473,-0.127145,0.210547,-0.087886,-0.400073,-0.658274,-0.305438,0.166541,0.92982,0.876685,0.856875,0.509546,-0.212643,0.52118,-0.451275,-0.698724
1,1.130391,-0.680781,0.553629,-0.113269,-0.087886,-0.400073,1.362227,-0.305438,0.490161,-0.918591,0.876685,-0.300302,0.567685,-0.212643,-0.263779,-0.451275,-0.698724


In [417]:
X_validation_datasets = [X_validate_sorted_cols_df.iloc[:, :i] for i in range(1, 18)]

In [418]:
X_validation_datasets[0].shape

(2000, 1)

In [463]:
X_validation_datasets[0][:2]
# 	grade
# 0	1.977751
# 1	-0.564329

Unnamed: 0,grade
0,1.977751
1,-0.564329


In [419]:
y_validate.shape

(2000,)

Ok, now let's create a separate model for each of our training datasets, and then evaluate the performance on the validation datasets.

In [424]:
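from sklearn.linear_model import LinearRegression

# Fit one model per feature subset (top 1 feature, top 2, ..., all 17).
trained_models = [LinearRegression().fit(X_subset, y_train) for X_subset in X_train_datasets]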

In [425]:
model_scores = [train_model.score(X_validation, y_validate) 
                for X_validation, train_model 
                in zip(X_validation_datasets, trained_models)]

In [426]:
model_scores
# [0.4053380440665897,
#  0.4899088511946995,
#  0.5717998433450644,
#  0.5728928677207712,
#  0.6017730203588975,
#  0.6053559741628769,
#  0.6053559741628769,
#  0.6129705564283209,
#  0.6156319529519583,
#  0.6153500371327996,
#  0.6153681296984064,
#  0.6143169643835164,
#  0.6164987325065734,
#  0.6159903523685051,
#  0.6148762075573468,
#  0.6192834942955159,
#  0.6190954414632186]

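It looks like our validation scores level off at an R-squared of roughly 0.62, meaning our best models explain about 62 percent of the variance that simply predicting the mean price would leave unexplained.  The top nine features get us nearly all the way there (0.6156), and the remaining eight features add less than 0.004.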

In [428]:
model_scores[:9]
# [0.4053380440665897,
#  0.4899088511946995,
#  0.5717998433450644,
#  0.5728928677207712,
#  0.6017730203588975,
#  0.6053559741628769,
#  0.6053559741628769,
#  0.6129705564283209,
#  0.6156319529519583]

[0.4053380440665897,
 0.4899088511946995,
 0.5717998433450644,
 0.5728928677207712,
 0.6017730203588975,
 0.6053559741628769,
 0.6053559741628769,
 0.6129705564283209,
 0.6156319529519583]
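
The choice of nine features can also be expressed as a rule: take the smallest number of features whose score is within some tolerance of the best score. The sketch below applies that rule to the validation scores listed above; the tolerance of 0.004 is an assumption chosen for illustration, not a value from the lab.

```python
# Validation scores from the lab, for the top 1, 2, ..., 17 features.
model_scores = [
    0.4053380440665897, 0.4899088511946995, 0.5717998433450644,
    0.5728928677207712, 0.6017730203588975, 0.6053559741628769,
    0.6053559741628769, 0.6129705564283209, 0.6156319529519583,
    0.6153500371327996, 0.6153681296984064, 0.6143169643835164,
    0.6164987325065734, 0.6159903523685051, 0.6148762075573468,
    0.6192834942955159, 0.6190954414632186,
]

tolerance = 0.004  # assumed: how much score we give up for a simpler model
best_score = max(model_scores)

# Smallest feature count whose score is within the tolerance of the best.
n_features = next(i + 1 for i, score in enumerate(model_scores)
                  if score >= best_score - tolerance)

print(n_features)
```

With a tighter tolerance the rule would pick more features, since the single best score here comes from the model with 16 features.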

First assign these top nine columns as `selected_columns`.

In [429]:
selected_columns = None

# Index(['grade', 'yr_built', 'sqft_living', 'sqft_above', 'waterfront',
#        'bedrooms', 'sqft_basement', 'view', 'bathrooms'],
#       dtype='object')

Index(['grade', 'yr_built', 'sqft_living', 'sqft_above', 'waterfront',
       'bedrooms', 'sqft_basement', 'view', 'bathrooms'],
      dtype='object')

Then select them from `X_train` and `X_validate`.

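> Note that we are done with feature selection, so we can go back to our unscaled data.  Scaling changes the size of the coefficients but not the score of a linear regression, as the nearly identical validation scores earlier showed.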

In [431]:
selected_X_train = None

In [432]:
selected_X_validate = None

In [433]:
selected_X_train[:2]

# grade	yr_built	sqft_living	sqft_above	waterfront	bedrooms	sqft_basement	view	bathrooms
# 16768	9	1977	1970	1970	0	3	0	0	2.25
# 9596	9	1951	2600	1700	0	3	900	0	2.50

Unnamed: 0,grade,yr_built,sqft_living,sqft_above,waterfront,bedrooms,sqft_basement,view,bathrooms
16768,9,1977,1970,1970,0,3,0,0,2.25
9596,9,1951,2600,1700,0,3,900,0,2.5


Now we can combine our training and validation datasets.

> Do so using the `pd.concat` method.

In [434]:
combined_X_df = None

In [435]:
y_combined = None

Then we can train another model on the combined data, and score it on the selected columns of `X_test` and on `y_test`.

In [436]:
selected_model = None

In [437]:

# 0.6338842028361624

0.6338842028361624

Ok, not too bad.
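
The combine, refit, and test sequence can be sketched end to end on toy data. Everything below is hypothetical, including the exactly linear `toy_price` helper, so the perfect score it produces says nothing about the housing model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def toy_price(df):
    # Hypothetical noise-free linear price, used only to make the demo run.
    return 50000 * df['grade'] + 100 * df['sqft_living']


selected_X_train = pd.DataFrame(
    {'grade': [7, 9, 9], 'sqft_living': [1180, 1970, 2600]})
selected_X_validate = pd.DataFrame(
    {'grade': [8, 10], 'sqft_living': [1500, 3000]}, index=[3, 4])
selected_X_test = pd.DataFrame(
    {'grade': [7, 8], 'sqft_living': [1340, 2170]}, index=[5, 6])
y_train = toy_price(selected_X_train)
y_validate = toy_price(selected_X_validate)
y_test = toy_price(selected_X_test)

# Stack training and validation rows; features and targets stay aligned.
combined_X_df = pd.concat([selected_X_train, selected_X_validate])
y_combined = pd.concat([y_train, y_validate])

# Refit on the combined rows, then score once on the untouched test set.
selected_model = LinearRegression().fit(combined_X_df, y_combined)
test_score = selected_model.score(selected_X_test, y_test)

print(test_score)
```

Folding the validation rows back in gives the final model more data to learn from, and the test set stays unseen until this last score.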

And if we remove grade we get the following:

In [438]:
selected_model = LinearRegression().fit(combined_X_df.iloc[:, 1:], y_combined)

In [439]:
selected_model.score(X_test[selected_columns[1:]], y_test)

0.5613837635074814

In [448]:
selected_model.coef_, X_test[selected_columns].columns[1:]

(array([-2.88065107e+03,  1.69713453e+02,  1.13550841e+02,  5.42766415e+05,
        -5.42665987e+04,  5.61626121e+01,  6.46936831e+04,  7.78885350e+04]),
 Index(['yr_built', 'sqft_living', 'sqft_above', 'waterfront', 'bedrooms',
        'sqft_basement', 'view', 'bathrooms'],
       dtype='object'))

### Summary

Take a break!