# Feature Engineering

In this lesson we discuss some methods for *automated* feature engineering, specifically feature selection.

While these methods can produce useful results, they are only one piece of the feature engineering puzzle.

## Setup

In [2]:
import pandas as pd
import numpy as np
import pydataset

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the tips data and encode the two categorical columns as 0/1 flags
tips = pydataset.data('tips')
tips['smoker'] = (tips.smoker == 'Yes').astype(int)
tips['dinner'] = (tips.time == 'Dinner').astype(int)

In [21]:
X = tips[['total_bill', 'size', 'smoker', 'dinner']]
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Select K Best

- Uses an F-test for regression (`f_regression`) to score each feature
- Looks at each feature in isolation
- Asks: is a model with just that feature better than no model at all (that is, always predicting the mean)?
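To make the "each feature in isolation" idea concrete, here is a minimal pure-Python sketch of the univariate F-statistic for a one-feature linear regression. To my understanding this matches what `f_regression` computes per column with its default centering, but treat it as an illustration rather than the library's code. The helper name `univariate_f` and the toy data are made up for this example.

```python
def univariate_f(x, y):
    """F-statistic comparing the model 'y ~ x' against an intercept-only model."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r2 = sxy ** 2 / (sxx * syy)       # squared Pearson correlation
    return r2 / (1 - r2) * (n - 2)    # F with 1 and n - 2 degrees of freedom


# Toy data (hypothetical): one feature tracks y closely, the other barely at all
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]
features = {
    "strong": [1, 2, 3, 4, 5, 6],
    "weak": [2, 5, 1, 6, 3, 4],
}

# Each feature is scored on its own; the other feature never enters the calculation
f_scores = {name: univariate_f(values, y) for name, values in features.items()}
best = max(f_scores, key=f_scores.get)
print(f_scores, best)
```

Selecting the `k` best then amounts to keeping the `k` columns with the largest scores, which is why two strongly correlated features can both be selected even if the second adds little.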

In [22]:
# k=1: keep only the single highest-scoring feature
kbest = SelectKBest(f_regression, k=1)
kbest.fit(X_train_scaled, y_train)

SelectKBest(k=1, score_func=<function f_regression at 0x7f8e3a673a60>)

In [28]:
kbest.pvalues_

array([1.28577891e-28, 6.93874955e-14, 9.90121928e-01, 1.97866553e-01])

In [23]:
kbest.get_support()

array([ True, False, False, False])

In [24]:
X_train.columns[kbest.get_support()]

Index(['total_bill'], dtype='object')

In [26]:
X_kbest = kbest.transform(X_train_scaled)
X_kbest.shape

(195, 1)
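Note that `transform` returns a plain NumPy array, so the column names are lost. One possible way to keep them is to rebuild a DataFrame from the support mask. The sketch below uses synthetic data so it runs on its own; the column names `signal`, `noise_a`, and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for this example): only `signal` drives the target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})
target = 3 * demo["signal"] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(f_regression, k=1).fit(demo, target)

# Use the boolean mask to recover the surviving column names,
# then wrap the transformed array back into a labeled DataFrame
kept = demo.columns[selector.get_support()]
selected = pd.DataFrame(selector.transform(demo), columns=kept, index=demo.index)
print(selected.head())
```

The same pattern should work with the scaled training array from this lesson, using `X_train.columns` and `X_train.index` for the labels.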

<div style="border: 1px solid black; border-radius: 3px; background: palegreen; padding: .5em 1em;">
    <p style="font-size: 1.3em; font-weight: bold">Mini Exercise</p>
    <ol>
        <li>Use <code>pydataset</code> to load the <code>swiss</code> dataset.</li>
        <li>Split the swiss dataset into X and y, then into train and test sets. The goal is to predict <code>Fertility</code>.</li>
        <li>Use <code>SelectKBest</code> to find the top 3 features that predict fertility in the swiss dataset.</li>
    </ol>
</div>

In [29]:
swiss = pydataset.data('swiss')
y = swiss.Fertility
X = swiss.drop(columns='Fertility')

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.85, random_state=123)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

kbest = SelectKBest(score_func=f_regression, k=3)
kbest.fit(X_train_scaled, y_train)
X_train.columns[kbest.get_support()]

Index(['Examination', 'Education', 'Catholic'], dtype='object')

## Recursive Feature Elimination

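- Fits a model, eliminates the worst-performing feature, and repeats until the requested number of features remain
- More computationally expensive than `SelectKBest`, since the model is refit at every step
- Looks at all the features together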
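The elimination loop described above can be sketched with NumPy alone. This is a simplified illustration, not scikit-learn's implementation: it assumes a linear model, drops one feature per round, and judges "worst" by the smallest absolute coefficient. The function name `backward_eliminate` and the synthetic data are made up for the example.

```python
import numpy as np


def backward_eliminate(X, y, n_keep):
    """Rank the columns of X by repeatedly dropping the smallest |coefficient|.

    Returns ranks in the style of RFE's ranking_: 1 means kept,
    larger numbers mean eliminated earlier.
    """
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)
    while len(remaining) > n_keep:
        # Least-squares fit (with an intercept column) on the surviving features
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs = np.linalg.lstsq(A, y, rcond=None)[0][1:]
        worst = remaining[int(np.argmin(np.abs(coefs)))]
        ranking[worst] = len(remaining) - n_keep + 1
        remaining.remove(worst)
    return ranking


# Synthetic, comparably scaled features with known effect sizes 5, 2, 0.5 and 0
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.5 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

ranking = backward_eliminate(X_demo, y_demo, n_keep=1)
print(ranking)
```

Because coefficients are compared by magnitude, the features need to be on a comparable scale, which is one reason the lesson scales the data before fitting.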

In [31]:
# Switch back to the tips data; the exercise above reassigned X and y
X = tips[['total_bill', 'size', 'smoker', 'dinner']]
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [35]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)
rfe.fit(X_train_scaled, y_train)
rfe.get_support()

array([ True, False, False, False])

In [36]:
X_train.columns[rfe.get_support()]

Index(['total_bill'], dtype='object')

In [38]:
pd.Series(rfe.ranking_, index=X_train.columns)

total_bill    1
size          2
smoker        3
dinner        4
dtype: int64

<div style="border: 1px solid black; border-radius: 3px; background: palegreen; padding: .5em 1em;">
    <p style="font-size: 1.3em; font-weight: bold">Mini Exercise</p>
    <ol>
        <li>Use <code>RFE</code> and <code>LinearRegression</code> to find the top 3 features that predict fertility in the swiss dataset.</li>
        <li>Are the results different from what <code>SelectKBest</code> gave you?</li>
    </ol>
</div>

In [46]:
swiss = pydataset.data('swiss')
y = swiss.Fertility
X = swiss.drop(columns='Fertility')

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.85, random_state=123)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

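# n_features_to_select=1 makes ranking_ order every feature from best (1) to worst,
# so the top 3 features are the ones ranked 1 through 3
rfe = RFE(LinearRegression(), n_features_to_select=1)
rfe.fit(X_train_scaled, y_train)
# X_train.columns[rfe.get_support()]  # would show only the single top-ranked feature
pd.Series(rfe.ranking_, index=X_train.columns).sort_values()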

Education           1
Catholic            2
Agriculture         3
Infant.Mortality    4
Examination         5
dtype: int64
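To answer the second exercise question, the two results can be compared as sets. The sketch below hard-codes the outputs shown in this lesson rather than refitting anything, so it only holds for this particular split and `random_state`.

```python
# Results copied from the two exercise outputs in this lesson
kbest_top3 = {"Examination", "Education", "Catholic"}
rfe_ranking = {
    "Education": 1,
    "Catholic": 2,
    "Agriculture": 3,
    "Infant.Mortality": 4,
    "Examination": 5,
}

# The RFE top 3 are the features ranked 1 through 3
rfe_top3 = {name for name, rank in rfe_ranking.items() if rank <= 3}

shared = kbest_top3 & rfe_top3
only_kbest = kbest_top3 - rfe_top3
only_rfe = rfe_top3 - kbest_top3
print("both:", sorted(shared))
print("SelectKBest only:", sorted(only_kbest))
print("RFE only:", sorted(only_rfe))
```

The two methods agree on `Education` and `Catholic` but differ on the third feature: `SelectKBest` picks `Examination`, which RFE ranks last once the other features are in the model.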