# Feature Engineering

In this lesson we'll cover automated ways to select features for modeling.

Feature selection is only one part of feature engineering!

In [22]:
import pandas as pd
import numpy as np
import wrangle

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [2]:
# Here's the source for the dataset and data dictionary https://archive.ics.uci.edu/ml/datasets/student+performance
path = "https://gist.githubusercontent.com/ryanorsinger/55ccfd2f7820af169baea5aad3a9c60d/raw/da6c5a33307ed7ee207bd119d3361062a1d1c07e/student-mat.csv"

df, X_train_explore, \
    X_train_scaled, y_train, \
    X_validate_scaled, y_validate, \
    X_test_scaled, y_test = wrangle.wrangle_student_math(path)

In [6]:
X_train_scaled

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0.000000 | 1.00 | 1.00 | 0.000000 | 0.666667 | 0.000000 | 0.75 | 0.25 | 0.25 | 0.00 | 0.00 | 1.00 | 0.035714 | 0.357143 | 0.578947 |
| 326 | 0.333333 | 0.75 | 0.75 | 0.000000 | 0.000000 | 0.000000 | 0.75 | 0.50 | 1.00 | 0.50 | 1.00 | 1.00 | 0.053571 | 0.714286 | 0.789474 |
| 88 | 0.166667 | 0.50 | 0.50 | 0.333333 | 0.333333 | 0.333333 | 0.75 | 0.75 | 0.25 | 0.00 | 0.00 | 0.50 | 0.214286 | 0.500000 | 0.526316 |
| 118 | 0.333333 | 0.25 | 0.75 | 0.666667 | 0.333333 | 0.333333 | 1.00 | 0.25 | 0.75 | 0.00 | 0.75 | 1.00 | 0.357143 | 0.357143 | 0.368421 |
| 312 | 0.666667 | 0.25 | 0.50 | 0.000000 | 0.333333 | 0.333333 | 0.75 | 1.00 | 0.25 | 0.25 | 0.25 | 0.75 | 0.053571 | 0.642857 | 0.578947 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 229 | 0.333333 | 0.50 | 0.25 | 0.333333 | 0.666667 | 0.000000 | 0.50 | 0.25 | 0.50 | 0.00 | 0.25 | 0.50 | 0.178571 | 0.571429 | 0.526316 |
| 61 | 0.166667 | 0.25 | 0.25 | 1.000000 | 0.000000 | 0.000000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.107143 | 0.428571 | 0.421053 |
| 38 | 0.000000 | 0.75 | 1.00 | 0.000000 | 0.666667 | 0.000000 | 0.75 | 0.50 | 0.25 | 0.00 | 0.00 | 1.00 | 0.035714 | 0.571429 | 0.631579 |
| 243 | 0.166667 | 1.00 | 1.00 | 0.000000 | 0.000000 | 0.000000 | 1.00 | 0.50 | 0.25 | 0.00 | 0.25 | 1.00 | 0.000000 | 0.642857 | 0.631579 |


## SelectKBest

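SelectKBest scores each feature individually against the target and keeps the k highest-scoring features. With f_regression as the score function, the score comes from an [F Test][1] of how well that feature alone linearly predicts the target variable.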

[1]: https://en.wikipedia.org/wiki/F-test#Formula_and_calculation
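
To make the per-feature F test concrete, here is a minimal NumPy-only sketch. It assumes the usual simple-regression form of the statistic, F = r^2 / (1 - r^2) * (n - 2), where r is the correlation between one feature and the target and n is the number of rows. The function name `univariate_f_scores` and the `_demo` data are made up for illustration and are not part of the lesson's dataset.

```python
import numpy as np

def univariate_f_scores(X, y):
    """F statistic for regressing y on each column of X, one column at a time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Center the features and the target so the dot products below are covariances.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Pearson correlation between each column and the target.
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

    # Convert each correlation into an F statistic with (1, n - 2) degrees of freedom.
    return r ** 2 / (1 - r ** 2) * (n - 2)

# Hypothetical data: column 0 matters most, column 1 a little, column 2 not at all.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(size=200)

f_scores = univariate_f_scores(X_demo, y_demo)
top_two = np.argsort(f_scores)[::-1][:2]  # indices of the two highest-scoring columns
print(f_scores, top_two)
```

Each score depends only on one column and the target, so the other columns never enter the calculation.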

In [15]:
from sklearn.feature_selection import SelectKBest, f_regression

f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X_train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f9bf4c07160>)

In [16]:
mask = f_selector.get_support()
X_train_scaled.columns[mask]

Index(['G1', 'G2'], dtype='object')

In [20]:
X_train_kbest = f_selector.transform(X_train_scaled)

model = LinearRegression().fit(X_train_kbest, y_train)
# ...
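
Because the selector is fit on the training split only, the same fitted selector should be reused to transform the other splits. The sketch below shows one way to do that and to keep column names on the result. It runs on synthetic data, and every name ending in `_demo` is a stand-in rather than something returned by `wrangle`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: only columns "a" and "c" drive the target.
rng = np.random.default_rng(42)
X_all = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_all = 3.0 * X_all["a"] - 2.0 * X_all["c"] + rng.normal(scale=0.5, size=300)

X_train_demo, X_validate_demo = X_all.iloc[:200], X_all.iloc[200:]
y_train_demo, y_validate_demo = y_all.iloc[:200], y_all.iloc[200:]

# Fit the selector on the training split only.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_train_demo, y_train_demo)
selected = X_train_demo.columns[selector.get_support()]

# Reuse the fitted selector on both splits and restore the column labels.
X_train_kbest_demo = pd.DataFrame(
    selector.transform(X_train_demo), columns=selected, index=X_train_demo.index
)
X_validate_kbest_demo = pd.DataFrame(
    selector.transform(X_validate_demo), columns=selected, index=X_validate_demo.index
)

model_demo = LinearRegression().fit(X_train_kbest_demo, y_train_demo)
r2_validate = model_demo.score(X_validate_kbest_demo, y_validate_demo)  # R^2 on unseen rows
print(list(selected), round(r2_validate, 3))
```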

## Recursive Feature Elimination (RFE)

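Fits a model on all of the features, eliminates the worst performing feature(s), then refits on the remaining features and repeats until only the requested number of features is left.

Only works for models that can rank features, that is, models that expose coef_ or feature_importances_ after fitting.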

In [21]:
model = LinearRegression().fit(X_train_scaled, y_train)
model.coef_

array([-1.57248067e+00,  5.91784593e-01, -2.21242008e-01,  1.01893087e+00,
       -1.75520671e-02,  5.02414426e-01,  1.07179785e+00,  3.27646012e-02,
        3.27458627e-01, -7.51580441e-01,  2.57014436e-01,  6.14409605e-01,
        2.55747873e+00,  2.57686922e+00,  1.88234926e+01])

In [24]:
model = DecisionTreeRegressor().fit(X_train_scaled, y_train)
model.feature_importances_

array([6.44434032e-03, 7.09796491e-04, 1.24543648e-02, 2.25668148e-03,
       3.19045396e-03, 1.57436387e-03, 1.40717310e-03, 1.06422013e-03,
       4.95553456e-04, 4.32854631e-04, 1.44571677e-03, 1.53596732e-02,
       1.32441623e-01, 2.44546674e-02, 7.96268517e-01])
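
The `coef_` and `feature_importances_` attributes above are what make elimination possible. As a rough illustration of the idea (not scikit-learn's actual implementation), the loop below refits a linear model and drops the feature with the smallest absolute coefficient until the requested number remains. The helper `eliminate_recursively` and the synthetic data are hypothetical. Comparing coefficient magnitudes this way assumes the features are on a comparable scale, which is why scaled data is used in this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def eliminate_recursively(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    dropped = []                         # column indices in the order they were removed
    while len(remaining) > n_keep:
        coefs = LinearRegression().fit(X[:, remaining], y).coef_
        weakest = int(np.argmin(np.abs(coefs)))  # position within `remaining`
        dropped.append(remaining.pop(weakest))
    return remaining, dropped

# Synthetic stand-in data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 6))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

kept, dropped = eliminate_recursively(X_demo, y_demo, n_keep=2)
print("kept:", kept, "dropped in order:", dropped)
```

The order in which columns are dropped plays the same role as a ranking: the earlier a feature is removed, the less useful the model found it.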

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [28]:
rfe.support_

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True,  True])

In [29]:
X_train_scaled.columns[rfe.support_]

Index(['G1', 'G2'], dtype='object')

In [33]:
pd.Series(dict(zip(X_train_scaled.columns, rfe.ranking_))).sort_values()

G1             1
G2             1
absences       2
age            3
famrel         4
traveltime     5
health         6
failures       7
Medu           8
Dalc           9
goout         10
Walc          11
Fedu          12
freetime      13
studytime     14
dtype: int64

## Recap

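- SelectKBest scores each feature against the target in isolation, so it is cheap but cannot see redundancy between features
- RFE ranks features jointly by repeatedly fitting a model (usually a decision tree or a linear model) and dropping the weakest feature(s) each round
- RFE generally gives more robust results because each feature is judged alongside the others, but it is more expensive because the model is refit many times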
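
The difference between the two approaches shows up most clearly when two features carry nearly the same information. The toy example below is a sketch on synthetic data, and all column and variable names in it are invented for illustration: `near_copy` is a noisy duplicate of `signal`, while `independent` has a smaller but separate effect on the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)

X_toy = pd.DataFrame({
    "signal": signal,
    "near_copy": signal + rng.normal(scale=0.3, size=n),  # redundant with "signal"
    "independent": rng.normal(size=n),                    # weaker, but adds new information
})
y_toy = 3.0 * X_toy["signal"] + 1.0 * X_toy["independent"] + rng.normal(scale=0.5, size=n)

# Univariate selection: each column is scored on its own.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)
kbest_cols = list(X_toy.columns[kbest.get_support()])

# Model-based selection: columns are judged together inside one linear model.
rfe_toy = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_toy, y_toy)
rfe_cols = list(X_toy.columns[rfe_toy.support_])

print("SelectKBest:", kbest_cols)
print("RFE:        ", rfe_cols)
```

In this setup SelectKBest should keep `signal` and `near_copy`, since both correlate strongly with the target on their own, while RFE should keep `signal` and `independent`, since the duplicate adds little once `signal` is in the model.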