# Feature Engineering

By some definitions, handling outliers and missing values, scaling, and encoding all count as feature engineering. Here we'll draw a distinction between data preparation, data preprocessing, and feature engineering.

- **data preparation**: the basic data cleaning necessary to get our data ready for exploration/analysis, e.g. correcting data types, fixing typos
- **data preprocessing**: further data transformation done for the sake of modeling, as opposed to exploration/analysis, e.g. scaling, imputing, encoding
- **feature engineering**: adding, combining, or removing features; usually with the help of domain knowledge

Feature engineering can happen as part of data exploration or modeling, and engineered features are also commonly explored.

Some examples of feature engineering by this definition:

- domain-based conversion (example: Fahrenheit to Celsius, BMI calculation, log transformation)
- domain-based cutoffs (example: age >= 18 -> is_adult; also dates)

- add / subtract (example: zillow dataset: beds + baths = room_count; total_sqft - 200 * bedrooms - 40 * bathrooms = living_area)
- combine booleans as a count (example: telco_churn: streaming + backups + ... = service_count)
- multiply / divide (example: tips dataset: total_bill / size = price_per_person)
- ratios (example: tips dataset: tip / total_bill = tip percentage)
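
The arithmetic examples above can be sketched in pandas. The rows below are made up for illustration: the column names mirror the tips dataset, and `age` is a hypothetical extra column added only to show a cutoff.

```python
import pandas as pd

# Hypothetical rows; column names mirror the tips dataset, `age` is invented.
tips = pd.DataFrame({
    'total_bill': [20.0, 45.0, 12.5],
    'tip': [3.0, 9.0, 2.5],
    'size': [2, 3, 1],
    'age': [17, 34, 52],
})

# divide: cost per diner (use tips['size'], since tips.size is a DataFrame attribute)
tips['price_per_person'] = tips['total_bill'] / tips['size']

# ratio: tip as a fraction of the bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']

# domain-based cutoff: boolean flag from a threshold
tips['is_adult'] = tips['age'] >= 18
```

Each new column is a plain vectorized expression, so it is cheap to add during exploration and drop again if it turns out not to help.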

Simplify!

- categorical with many unique values to top 3 + "Other"
- categorical to boolean: pool count -> has pool
- continuous -> categorical via binning (aka quantization or discretization) (example: income -> high, medium, low earner)
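
A minimal sketch of the three simplifications, assuming a small invented DataFrame (the columns `city`, `pool_count`, and `income` are hypothetical stand-ins, not one of the lesson's datasets):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'city': ['Austin', 'Dallas', 'Austin', 'Houston', 'Waco',
             'Dallas', 'Austin', 'Plano', 'Houston'],
    'pool_count': [0, 1, 0, 2, 0, 0, 1, 0, 3],
    'income': [30_000, 52_000, 75_000, 120_000, 41_000,
               98_000, 64_000, 150_000, 86_000],
})

# categorical with many unique values -> top 3 + "Other"
top3 = df['city'].value_counts().nlargest(3).index
df['city_simple'] = df['city'].where(df['city'].isin(top3), 'Other')

# categorical/count -> boolean
df['has_pool'] = df['pool_count'] > 0

# continuous -> categorical via quantile binning (three equally populated bins)
df['earner'] = pd.qcut(df['income'], q=3, labels=['low', 'medium', 'high'])
```

`pd.qcut` bins by quantile; `pd.cut` is the alternative when the cutoffs come from domain knowledge rather than from the data's distribution.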

In this lesson we'll cover some *automated* **feature selection** methods, that is, methods for determining which features are the most important.

- SelectKBest
- Recursive Feature Elimination
- Sequential Feature Selection

## Setup

In [1]:
import numpy as np
import pandas as pd
import wrangle
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector

In [None]:
df = wrangle.wrangle_grades()

In [5]:
df = pd.read_csv('student_grades.csv')


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   student_id   104 non-null    int64  
 1   exam1        103 non-null    float64
 2   exam2        104 non-null    int64  
 3   exam3        103 non-null    float64
 4   final_grade  104 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 4.2 KB


In [6]:
df = df.dropna()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 103
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   student_id   102 non-null    int64  
 1   exam1        102 non-null    float64
 2   exam2        102 non-null    int64  
 3   exam3        102 non-null    float64
 4   final_grade  102 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 4.8 KB


In [8]:
df.head()

Unnamed: 0,student_id,exam1,exam2,exam3,final_grade
0,1,100.0,90,95.0,96
1,2,98.0,93,96.0,95
2,3,85.0,83,87.0,87
3,4,83.0,80,86.0,85
4,5,93.0,90,96.0,97


In [9]:
df = df.astype(int)

In [10]:
df.head()

Unnamed: 0,student_id,exam1,exam2,exam3,final_grade
0,1,100,90,95,96
1,2,98,93,96,95
2,3,85,83,87,87
3,4,83,80,86,85
4,5,93,90,96,97


In [None]:
from sklearn.model_selection import train_test_split

# Two-step split: hold out the test set first, then carve validate
# out of what remains. The next cell calls wrangle.split_zillow.
def split_zillow(df):
    train_val, test = train_test_split(df,
                                       random_state=2013,
                                       train_size=0.7)
    train, validate = train_test_split(train_val,
                                       random_state=2013,
                                       train_size=0.8)
    return train, validate, test

In [13]:
train, validate, test = wrangle.split_zillow(df)
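
Because the second split is taken from the 70% that remains after the first, the overall shares work out to roughly 0.7 * 0.8 = 56% train, 0.7 * 0.2 = 14% validate, and 30% test. The sketch below checks this on a stand-in DataFrame with 102 rows; the single column `x` is a placeholder, not the grades data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the cleaned grades data.
df_demo = pd.DataFrame({'x': np.arange(102)})

# Same two-step split and parameters as split_zillow above.
train_val, test_demo = train_test_split(df_demo, random_state=2013, train_size=0.7)
train_demo, validate_demo = train_test_split(train_val, random_state=2013, train_size=0.8)

# Share of the full dataset that lands in each piece.
shares = {
    name: len(part) / len(df_demo)
    for name, part in [('train', train_demo),
                       ('validate', validate_demo),
                       ('test', test_demo)]
}
print(shares)
```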

In [14]:
train = train.drop(columns='student_id')
train.head()

Unnamed: 0,exam1,exam2,exam3,final_grade
45,92,89,94,93
38,70,75,78,72
17,93,90,96,97
32,92,89,94,93
29,83,80,86,85


In [17]:
X_train = train.drop(columns='final_grade')
y_train = train['final_grade']

## Select K Best

- scores each feature in isolation against the target; with `f_regression`, the score is an F-statistic derived from the feature's correlation with the target
- fastest of all approaches covered in this lesson
- doesn't consider feature interactions
- After fitting: `.scores_`, `.pvalues_`, `.get_support()`, and `.transform()`
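
To make the link to correlation concrete, here is a small sketch on synthetic data (the arrays `X_demo` and `y_demo` are invented for illustration). With its default centering, the F-statistic that `f_regression` reports for a feature can be rebuilt from that feature's Pearson correlation `r` with the target as `r**2 / (1 - r**2) * (n - 2)`.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 100
X_demo = rng.normal(size=(n, 3))
# Synthetic target: strong dependence on column 0, weak on column 1, none on column 2.
y_demo = 3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n)

# Univariate F-statistics and p-values, one per feature.
f_scores, p_values = f_regression(X_demo, y_demo)

# Pearson correlation of each feature with the target.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])

# F-statistic of a one-feature linear regression, written in terms of r.
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_scores, f_from_r))
```

Ranking features by F-score is therefore the same as ranking them by the absolute value of their correlation with the target, which is why this method cannot see interactions between features.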

In [18]:
# make the thing

kbest = SelectKBest(f_regression,k=2)

# fit the thing
_ = kbest.fit(X_train,y_train)

In [30]:
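# F-statistic for each feature (higher = stronger linear relationship with the target)
kbest.scores_
# p-value for each feature; only this last expression is displayed below
kbest.pvalues_  # exam1 and exam3 have the smallest p-values (equivalently, the largest F-scores)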

array([1.09098345e-42, 6.57896735e-30, 2.18797325e-32])

In [26]:
X_train.columns[kbest.get_support()]

Index(['exam1', 'exam3'], dtype='object')

In [27]:
kbest.transform(X_train)

array([[ 92,  94],
       [ 70,  78],
       [ 93,  96],
       [ 92,  94],
       [ 83,  86],
       [100,  95],
       [ 57,  75],
       [ 70,  78],
       [ 92,  94],
       [100,  95],
       [ 98,  96],
       [ 62,  79],
       [100,  95],
       [ 70,  78],
       [ 92,  94],
       [ 62,  79],
       [ 79,  85],
       [ 62,  79],
       [ 70,  78],
       [ 73,  75],
       [ 98,  96],
       [ 92,  94],
       [ 83,  86],
       [100,  95],
       [ 57,  75],
       [ 58,  70],
       [ 98,  96],
       [ 62,  79],
       [ 98,  96],
       [ 58,  70],
       [ 85,  87],
       [ 70,  78],
       [ 92,  94],
       [ 93,  96],
       [100,  95],
       [ 73,  75],
       [ 70,  78],
       [ 98,  96],
       [ 70,  78],
       [ 58,  70],
       [ 85,  87],
       [ 57,  75],
       [ 83,  86],
       [ 70,  78],
       [ 57,  75],
       [ 93,  96],
       [ 73,  75],
       [100,  95],
       [ 73,  75],
       [ 57,  75],
       [ 93,  96],
       [ 62,  79],
       [ 93,

In [28]:
kbest.get_support()
X_train.iloc[:,kbest.get_support()].head()

Unnamed: 0,exam1,exam3
45,92,94
38,70,78
17,93,96
32,92,94
29,83,86


In [21]:
# get_support() returns a boolean mask showing which features were selected;
# we can apply this mask to the columns of our original dataframe
kbest.get_support()

array([ True, False,  True])

In [None]:
# kbest.transform() reduces the data to the selected features,
# but it returns a plain NumPy array, so the column names
# and the index are lost
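
One way to keep the labels is to wrap the transformed array back into a DataFrame, reusing the mask from `get_support()` for the column names. This is a sketch on synthetic data; `X_demo`, `y_demo`, and the column names `a`, `b`, `c` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
# Synthetic features with a non-default index, to show that the index survives.
X_demo = pd.DataFrame(rng.normal(size=(200, 3)),
                      columns=['a', 'b', 'c'],
                      index=np.arange(100, 300))
# The target depends on 'a' and 'c' only.
y_demo = 2 * X_demo['a'] + 2 * X_demo['c'] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# Rebuild a labeled DataFrame from the bare array returned by transform().
selected = pd.DataFrame(selector.transform(X_demo),
                        columns=X_demo.columns[selector.get_support()],
                        index=X_demo.index)
print(selected.head())
```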


## RFE

- Recursive Feature Elimination
- Progressively eliminate features based on importance to the model
- Requires a model that exposes either a `.coef_` or a `.feature_importances_` attribute after fitting
- After fitting: `.ranking_`, `.get_support()`, and `.transform()`
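
The elimination loop can be sketched by hand to show what "progressively eliminate" means. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation: it refits a linear model, drops the feature with the smallest absolute coefficient, and repeats until two features remain. The features here share a scale; with unscaled features, coefficient size is not a fair measure of importance.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
# Synthetic target: columns 0 and 2 matter, column 1 barely, column 3 not at all.
y_demo = 5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + 3 * X_demo[:, 2] + rng.normal(size=200)

# Hand-rolled elimination: drop the weakest coefficient, one feature per round.
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 2:
    fitted = LinearRegression().fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(fitted.coef_)))
    remaining.pop(weakest)

# The same selection using RFE itself.
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

print(remaining)
print(rfe_demo.get_support())
```

In `.ranking_`, every selected feature gets rank 1, and larger numbers mark features that were eliminated earlier.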

In [31]:
from sklearn.linear_model import LinearRegression

In [32]:
# establish a model for RFE to use

model = LinearRegression()

In [33]:
# make an RFE thing
rfe = RFE(model, n_features_to_select=2)

In [34]:
# fit the RFE thing
rfe.fit(X_train,y_train)

In [35]:
rfe.ranking_

array([1, 2, 1])

In [36]:
pd.DataFrame({
    'rfe_ranking':rfe.ranking_
}, index=X_train.columns)

Unnamed: 0,rfe_ranking
exam1,1
exam2,2
exam3,1


In [38]:
rfe.support_  #or rfe.get_support()

array([ True, False,  True])

In [37]:
rfe.transform(X_train)

array([[ 92,  94],
       [ 70,  78],
       [ 93,  96],
       [ 92,  94],
       [ 83,  86],
       [100,  95],
       [ 57,  75],
       [ 70,  78],
       [ 92,  94],
       [100,  95],
       [ 98,  96],
       [ 62,  79],
       [100,  95],
       [ 70,  78],
       [ 92,  94],
       [ 62,  79],
       [ 79,  85],
       [ 62,  79],
       [ 70,  78],
       [ 73,  75],
       [ 98,  96],
       [ 92,  94],
       [ 83,  86],
       [100,  95],
       [ 57,  75],
       [ 58,  70],
       [ 98,  96],
       [ 62,  79],
       [ 98,  96],
       [ 58,  70],
       [ 85,  87],
       [ 70,  78],
       [ 92,  94],
       [ 93,  96],
       [100,  95],
       [ 73,  75],
       [ 70,  78],
       [ 98,  96],
       [ 70,  78],
       [ 58,  70],
       [ 85,  87],
       [ 57,  75],
       [ 83,  86],
       [ 70,  78],
       [ 57,  75],
       [ 93,  96],
       [ 73,  75],
       [100,  95],
       [ 73,  75],
       [ 57,  75],
       [ 93,  96],
       [ 62,  79],
       [ 93,

In [39]:
pd.DataFrame(rfe.transform(X_train),
             columns=X_train.columns[rfe.get_support()],
             index=X_train.index).head()

Unnamed: 0,exam1,exam3
45,92,94
38,70,78
17,93,96
32,92,94
29,83,86


## Sequential Feature Selector

- progressively adds or removes features based on cross-validated model performance
- forwards (the default): start with 0 features, add the best additional feature until you have the desired number
- backwards: start with all features, remove the worst performing until you have the desired number
- After fitting: `.support_`, `.get_support()`, and `.transform()`
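
The two directions can be compared with the `direction` parameter. The sketch below uses synthetic data (`X_demo` and `y_demo` are invented) in which two features clearly carry the signal, so both directions would be expected to agree; on real data they can disagree.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 4))
# Synthetic target: columns 0 and 2 matter, column 1 barely, column 3 not at all.
y_demo = 5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + 3 * X_demo[:, 2] + rng.normal(size=200)

# Forward: start empty and add the feature that most improves the CV score.
forward = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=2,
                                    direction='forward',
                                    cv=5).fit(X_demo, y_demo)

# Backward: start with everything and drop the feature whose removal hurts least.
backward = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction='backward',
                                     cv=5).fit(X_demo, y_demo)

print(forward.get_support())
print(backward.get_support())
```

Every candidate feature at every step is scored with a full cross-validation, which is why this is usually the slowest of the three methods.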

In [40]:
# like RFE, this needs a model, but by default it works in the opposite
# direction: it starts with no features and adds them one at a time

model = LinearRegression()
sfs = SequentialFeatureSelector(model, n_features_to_select=2)
sfs.fit(X_train,y_train)

In [42]:
sfs.transform(X_train)
sfs.get_support()

array([ True, False,  True])

In [43]:
X_train.columns[sfs.get_support()]

Index(['exam1', 'exam3'], dtype='object')

## Conclusion

- Simpler models handle change + variability better
- Use RFE to narrow down your features and find the best ones. If your dataset is large (> 1 GB; check the memory usage reported by `df.info()`), use SelectKBest instead, since it is the fastest of the three
- Remember: feature engineering is much more than feature selection!