<a id='TOP'></a>
# Feature Engineering

Under some definitions, handling outliers and missing values, scaling, and encoding are all considered feature engineering. Here we'll draw a distinction between data preparation, data preprocessing, and feature engineering.

- **data preparation**: the basic data cleaning necessary to get our data ready for exploration/analysis, e.g. correcting data types, fixing typos
- **data preprocessing**: further data transformation done for the sake of modeling, as opposed to exploration/analysis, e.g. scaling, imputing, encoding
- **feature engineering**: adding, combining, or removing features; usually with the help of domain knowledge

Feature engineering can happen as part of data exploration or modeling, and engineered features are also commonly explored.

||[__--TOP--__](#TOP)||[__--SELECT K BEST--__](#SELECT_K)||[__--RFE--__](#RFE)||

Some examples of feature engineering by this definition:

- domain-based conversion (example: Fahrenheit to Celsius, BMI calculation, log transformation)
- domain-based cutoffs (example: age >= 18 = is_adult; also dates)

- add / subtract (example: zillow dataset: beds + baths = room_count; total_sqft - 200 * bedrooms - 40 * bathrooms = living_area)
- combine booleans as a count (example: telco_churn: streaming + backups + ... = service_count)
- multiply / divide (example: tips dataset: total_bill / size = price_per_person)
- ratios (example: tips dataset: tip / total_bill = tip percentage)
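
A few of the transformations listed above can be sketched in pandas. This is only an illustration: the rows below are invented stand-ins shaped like the tips and telco_churn datasets, and the column names (`temp_f`, `streaming_tv`, and so on) are assumptions rather than the real schemas.

```python
import pandas as pd

# Hypothetical rows shaped like the tips dataset (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [20.0, 45.0, 12.5],
    "tip": [3.0, 9.0, 2.5],
    "size": [2, 3, 1],
})
# divide: cost per diner
tips["price_per_person"] = tips["total_bill"] / tips["size"]
# ratio: tip as a fraction of the bill
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# Hypothetical telco-style boolean columns combined into a count
telco = pd.DataFrame({
    "streaming_tv": [True, False, True],
    "online_backup": [True, False, False],
    "tech_support": [False, False, True],
})
service_cols = ["streaming_tv", "online_backup", "tech_support"]
telco["service_count"] = telco[service_cols].sum(axis=1)

# Domain-based conversion and cutoff on invented data
people = pd.DataFrame({"age": [15, 18, 42], "temp_f": [32.0, 212.0, 98.6]})
people["temp_c"] = (people["temp_f"] - 32) * 5 / 9
people["is_adult"] = people["age"] >= 18
```

Each new column is an ordinary vectorized expression, so the same lines work unchanged on the full dataset once the real column names are substituted.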

Simplify!

- categorical with many unique values to top 3 + "Other"
- categorical to boolean: pool count -> has pool
- continuous -> categorical via binning (aka quantization or discretization) (example: income -> high, medium, low earner)
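
The three simplifications above might look like this in pandas. The data is invented, and the income cutoffs (40,000 and 90,000) are arbitrary assumptions chosen only to make the bins visible.

```python
import numpy as np
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin", "Houston", "Waco",
             "Dallas", "Austin", "Laredo", "Houston"],
    "pool_count": [0, 1, 0, 2, 0, 0, 1, 0, 0],
    "income": [28_000, 52_000, 75_000, 130_000, 41_000,
               95_000, 60_000, 33_000, 88_000],
})

# categorical with many unique values -> top 3 + "Other"
top3 = df["city"].value_counts().nlargest(3).index
df["city_simple"] = df["city"].where(df["city"].isin(top3), "Other")

# count -> boolean
df["has_pool"] = df["pool_count"] > 0

# continuous -> categorical via binning (cutoffs are assumptions)
df["earner"] = pd.cut(
    df["income"],
    bins=[0, 40_000, 90_000, np.inf],
    labels=["low", "medium", "high"],
)
```

`pd.cut` uses fixed bin edges; `pd.qcut` is the alternative when bins with roughly equal counts are preferred over domain-based cutoffs.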

In this lesson we'll cover some *automated* **feature selection** methods, that is, methods for determining which features are the most important.

- SelectKBest
- Recursive Feature Elimination
- Sequential Feature Selection

## Setup

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from wrangle_students import wrangle_grades

In [2]:
from prepare import split_data_continuous

In [3]:
df = wrangle_grades()
df

Unnamed: 0,student_id,exam1,exam2,exam3,final_grade
0,1,100,90,95,96
1,2,98,93,96,95
2,3,85,83,87,87
3,4,83,80,86,85
4,5,93,90,96,97
...,...,...,...,...,...
99,100,70,65,78,77
100,101,62,70,79,70
101,102,58,65,70,68
102,103,57,65,75,65


In [5]:
train, validate, test = split_data_continuous(df)

Prepared df: (102, 5)

Train: (60, 5)
Validate: (21, 5)
Test: (21, 5)


In [6]:
train

Unnamed: 0,student_id,exam1,exam2,exam3,final_grade
60,61,70,65,78,77
59,60,73,70,75,76
21,22,70,65,78,77
2,3,85,83,87,87
51,52,70,75,78,72
24,25,57,65,75,65
57,58,79,70,85,81
92,93,98,93,96,95
81,82,83,80,86,85
42,43,83,80,86,85


In [8]:
X_cols = ['exam1', 'exam2', 'exam3']
# train.columns.tolist()
y_col = 'final_grade'

In [9]:
X_train, y_train = train[X_cols], train[y_col]

In [10]:
X_validate, y_validate = validate[X_cols], validate[y_col]

In [11]:
X_test, y_test = test[X_cols], test[y_col]

<a id='SELECT_K'></a>
## Select K Best

- scores each feature in isolation against the target with a univariate statistical test (here `f_regression`, which is based on correlation)
- fastest of all approaches covered in this lesson
- doesn't consider feature interactions
- After fitting: `.scores_`, `.pvalues_`, `.get_support()`, and `.transform`



In [12]:
kbest = SelectKBest(f_regression, k=2)

In [13]:
kbest.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x127805670>)

In [15]:
kbest.get_feature_names_out()

array(['exam1', 'exam3'], dtype=object)

In [16]:
kbest.scores_

array([1961.95852406,  328.39519279,  507.39846869])

In [17]:
kbest.pvalues_

array([2.03679579e-46, 1.47195414e-25, 2.30229464e-30])

In [18]:
kbest_results = pd.DataFrame(dict(p=kbest.pvalues_, f=kbest.scores_),
                            index=X_train.columns)
kbest_results

Unnamed: 0,p,f
exam1,2.036796e-46,1961.958524
exam2,1.471954e-25,328.395193
exam3,2.302295e-30,507.398469
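
To see where scores like these come from, the sketch below recomputes the `f_regression` F statistic by hand from the Pearson correlation between each feature and the target. It uses synthetic data rather than the grades dataset, so the numbers will not match the table above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
# Synthetic target: strong dependence on column 0, moderate on column 2, none on column 1
y = 3 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

kbest = SelectKBest(f_regression, k=2).fit(X, y)

# Pearson correlation of each feature with the target
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# F statistic for a one-feature linear regression: F = r^2 / (1 - r^2) * (n - 2)
f_manual = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_manual, kbest.scores_))
```

Because the score depends only on each feature's own correlation with the target, two highly redundant features can both score well, which is one consequence of not considering feature interactions.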


In [20]:
kbest.get_support()

array([ True, False,  True])

In [22]:
X_train.columns[kbest.get_support()]

Index(['exam1', 'exam3'], dtype='object')

In [24]:
X_train.head()

Unnamed: 0,exam1,exam2,exam3
60,70,65,78
59,73,70,75
21,70,65,78
2,85,83,87
51,70,75,78


In [25]:
kbest.transform(X_train)

array([[ 70,  78],
       [ 73,  75],
       [ 70,  78],
       [ 85,  87],
       [ 70,  78],
       [ 57,  75],
       [ 79,  85],
       [ 98,  96],
       [ 83,  86],
       [ 83,  86],
       [ 58,  70],
       [ 92,  94],
       [ 58,  70],
       [ 85,  87],
       [ 70,  78],
       [ 73,  75],
       [ 58,  70],
       [ 98,  96],
       [ 62,  79],
       [ 62,  79],
       [ 85,  87],
       [ 58,  70],
       [ 62,  79],
       [ 70,  78],
       [ 79,  85],
       [ 85,  87],
       [ 73,  75],
       [ 57,  75],
       [ 73,  75],
       [ 57,  75],
       [ 57,  75],
       [ 83,  86],
       [ 85,  87],
       [ 73,  75],
       [100,  95],
       [ 92,  94],
       [ 92,  94],
       [ 83,  86],
       [ 93,  96],
       [ 92,  94],
       [ 73,  75],
       [ 83,  86],
       [ 92,  94],
       [ 93,  96],
       [ 92,  94],
       [ 93,  96],
       [ 79,  85],
       [ 85,  87],
       [ 57,  75],
       [ 83,  86],
       [ 70,  78],
       [ 73,  75],
       ...

In [26]:
X_train_transformed = pd.DataFrame(
    kbest.transform(X_train),
    columns=X_train.columns[kbest.get_support()],
    index=X_train.index,
)

In [27]:
X_train_transformed

Unnamed: 0,exam1,exam3
60,70,78
59,73,75
21,70,78
2,85,87
51,70,78
24,57,75
57,79,85
92,98,96
81,83,86
42,83,86


<a id='RFE'></a>
## RFE

- Recursive Feature Elimination
- Progressively eliminate features based on importance to the model
- Requires a model that exposes either a `.coef_` or `.feature_importances_` attribute after fitting
- After fitting: `.ranking_`, `.get_support()`, and `.transform()`

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
# initialize the ML algorithm
lm = LinearRegression()

# create the RFE object, passing the model (lm) and the number of features to keep
rfe = RFE(lm, n_features_to_select=2)

# fit RFE on the training data
rfe.fit(X_train, y_train)

# boolean mask of the selected columns
feature_mask = rfe.support_

# names of the selected columns
rfe_feature = X_train.columns[feature_mask].tolist()

rfe_feature

['exam1', 'exam3']

In [31]:
rfe.ranking_

array([1, 2, 1])

In [33]:
rfe.get_support()

array([ True, False,  True])
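
To make the elimination loop concrete, here is a simplified re-implementation of the idea next to scikit-learn's `RFE`. It is a sketch under stated assumptions, not the library's actual code: it assumes a linear model with `.coef_`, drops one feature per round, and uses synthetic data. `manual_rfe` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
# Synthetic target whose true coefficients differ clearly in magnitude (4, 0.2, 2, 0)
y = 4 * X[:, 0] + 0.2 * X[:, 1] + 2 * X[:, 2] + rng.normal(scale=0.1, size=200)


def manual_rfe(X, y, n_features_to_select):
    """Refit the model repeatedly, dropping the smallest-coefficient feature each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(model.coef_ ** 2))  # position within `remaining`
        remaining.pop(weakest)
    return remaining


kept_manual = manual_rfe(X, y, 2)

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
kept_sklearn = np.flatnonzero(rfe.support_).tolist()

print(kept_manual, kept_sklearn, rfe.ranking_)
```

Because the ranking here rests on coefficient size, features on very different scales are not directly comparable; the synthetic columns above all share the same scale.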

## Sequential Feature Selector

- progressively adds or removes features based on cross-validated model performance
- forward: start with zero features and add the best additional feature until you have the desired number
- backward: start with all features and remove the worst-performing one until you have the desired number
- After fitting: `.support_`, `.get_support()`, and `.transform()`
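
A minimal sketch of both directions, using synthetic data rather than the grades dataset. The choice of `LinearRegression` and `cv=5` is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Synthetic target: only columns 0 and 2 carry signal
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# forward: start empty, add the feature that most improves the cross-validated score
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# backward: start with everything, drop the feature whose removal hurts the score least
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())
print(backward.get_support())

# transform keeps only the selected columns
X_selected = forward.transform(X)
```

Each step refits the model once per candidate feature per fold, so this approach is noticeably slower than SelectKBest and usually slower than RFE. Unlike RFE, it does not need `.coef_` or `.feature_importances_`, only a score.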

## Conclusion

- Simpler models (fewer features) tend to handle change and variability better
- Use RFE to narrow down your features and find the best ones. If your dataset is large (> 1 GB; check with `df.info()`), use SelectKBest instead, since it is faster
- Remember: feature engineering is much more than feature selection!
