# Feature Engineering

By some definitions, handling outliers and missing values, scaling, and encoding all count as feature engineering. Here we'll draw a distinction between data preparation, data preprocessing, and feature engineering.

- **Data preparation**: the basic data cleaning necessary to get our data ready for exploration/analysis, e.g. correcting data types, fixing typos
- **Data preprocessing**: further data transformation done for the sake of modeling, as opposed to exploration/analysis, e.g. scaling, imputing, encoding
- **Feature engineering**: adding, combining, or removing features; usually with the help of domain knowledge

Feature engineering can happen as part of data exploration or modeling, and engineered features are also commonly explored.

Some examples of feature engineering by this definition:

- Domain-based conversion (example: Fahrenheit to Celsius, BMI calculation, log transformation)
- Domain-based cutoffs (example: `age >= 18` -> `is_adult`; also dates)

- Add / subtract (example: zillow dataset: beds + baths = room_count; total_sqft - 200 * bedrooms - 40 * bathrooms = living_area)
- Combine booleans as a count (example: telco_churn: streaming + backups + ... = service_count)
- Multiply / divide (example: tips dataset: total_bill / size = price_per_person)
- Ratios (example: tips dataset: tip / total_bill = tip percentage)
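The arithmetic-style features above can be sketched in pandas. This is a minimal illustration: the two small frames below are hypothetical stand-ins with invented values, not the real tips data.

```python
import pandas as pd

# Hypothetical miniature of the tips dataset; the values are invented for illustration.
tips = pd.DataFrame({
    'total_bill': [20.0, 45.0, 12.5],
    'tip': [3.0, 9.0, 2.5],
    'size': [2, 3, 1],
})

# Multiply / divide: spread the bill across the party.
# Bracket access is needed because `tips.size` is a built-in DataFrame attribute.
tips['price_per_person'] = tips.total_bill / tips['size']

# Ratio: tip as a fraction of the total bill.
tips['tip_percentage'] = tips.tip / tips.total_bill

# Domain-based cutoff on a hypothetical age column.
people = pd.DataFrame({'age': [15, 18, 42]})
people['is_adult'] = people.age >= 18
```

Each new column is a plain vectorized expression, so features like these are cheap to try during exploration.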

Simplify!

- Categorical with many unique values to top 3 + "Other"
- Categorical to boolean: pool count -> has pool
- Continuous -> categorical via binning (aka quantization or discretization) (example: income -> high, medium, low earner)
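A minimal sketch of the three simplifications above, assuming a hypothetical `sample` frame with invented values. The income cutoffs are arbitrary and only there to show the mechanics of `pd.cut`.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
sample = pd.DataFrame({
    'city': ['Austin', 'Dallas', 'Austin', 'Houston', 'Waco',
             'Dallas', 'Austin', 'Houston', 'Laredo'],
    'pool_count': [0, 1, 2, 0, 0, 1, 0, 0, 0],
    'income': [25_000, 48_000, 130_000, 61_000, 19_000,
               95_000, 72_000, 33_000, 55_000],
})

# Many unique values -> top 3 + "Other".
top3 = sample.city.value_counts().nlargest(3).index
sample['city_simplified'] = sample.city.where(sample.city.isin(top3), 'Other')

# Categorical/count -> boolean.
sample['has_pool'] = sample.pool_count > 0

# Continuous -> categorical via binning (cutoffs are assumptions, not domain-approved).
sample['income_level'] = pd.cut(
    sample.income,
    bins=[0, 40_000, 90_000, float('inf')],
    labels=['low', 'medium', 'high'],
)
```

If evenly sized bins matter more than meaningful cutoffs, `pd.qcut` bins by quantile instead.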

In this lesson we'll cover some *automated* **feature selection** methods, that is, methods for determining which features are the most important.

Feature selection can be broken down into supervised and unsupervised methods, and supervised methods can be further broken down into intrinsic, filter, and wrapper methods.
- **Supervised:** Remove irrelevant variables
    - <u>Intrinsic</u>: Algorithms that perform automatic feature selection during training.
    - <u>Filter</u>: Select subsets of features based on their relationship with the target.
    - <u>Wrapper</u>: Search for subsets of features that perform well according to a predictive model.


- **Unsupervised:** Remove redundant variables
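The filter and wrapper categories are covered by the methods in this lesson. For the intrinsic category, one common example is lasso regression, where the L1 penalty can shrink coefficients to exactly zero during training. The sketch below uses synthetic data and an arbitrary `alpha`; both are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data, invented for illustration: only `signal` drives the target.
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=['signal', 'noise_1', 'noise_2'],
)
y_demo = 3 * X_demo.signal + rng.normal(size=200)

# The L1 penalty (alpha=0.5 is an arbitrary choice) pushes weak coefficients to exactly 0.
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
lasso_coefs = pd.Series(lasso.coef_, index=X_demo.columns)

# Features with a nonzero coefficient are the ones the model kept while training.
kept = list(lasso_coefs[lasso_coefs != 0].index)
```

Here the selection is a side effect of fitting the model, with no separate selector object involved.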


Methods reviewed in this lesson:
- SelectKBest
- Recursive Feature Elimination
- Sequential Feature Selection

## Setup

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.model_selection import train_test_split

In [None]:
def wrangle_grades():
    '''
    Read student_grades csv file into a pandas DataFrame,
    drop student_id column, replace whitespaces with NaN values,
    drop any rows with Null values, convert all columns to int64,
    return cleaned student grades DataFrame.
    '''
    # Acquire data from csv file.
    file = "https://gist.githubusercontent.com/ryanorsinger/14c8f919920e111f53c6d2c3a3af7e70/raw/07f6e8004fa171638d6d599cfbf0513f6f60b9e8/student_grades.csv"

    grades = pd.read_csv(file)

    # Replace white space values with NaN values.
    grades = grades.replace(r'^\s*$', np.nan, regex=True)

    # Drop all rows with NaN values.
    df = grades.dropna()

    # Convert all columns to int64 data types.
    df = df.astype('int')

    # drop student_id
    df = df.drop(columns=['student_id'])
    
    return df

In [None]:
df = wrangle_grades()
train_validate, test = train_test_split(df, random_state=123, train_size=.8)
train, validate = train_test_split(train_validate, random_state=123, train_size=.7)

In [None]:
train.shape, validate.shape, test.shape
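Because the second split is taken from the 80% that remains after the first, the final proportions are 0.8 * 0.7 = 56% train, 0.8 * 0.3 = 24% validate, and 20% test. A quick check on a hypothetical 1,000-row frame (the `demo_` names are placeholders so the lesson's variables are not overwritten):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 1,000-row frame, used only to make the split sizes easy to read.
demo = pd.DataFrame({'x': range(1000)})

# Same two-step split as above, applied to the demo frame.
demo_train_validate, demo_test = train_test_split(demo, random_state=123, train_size=.8)
demo_train, demo_validate = train_test_split(demo_train_validate, random_state=123, train_size=.7)

# Row counts for train, validate, and test.
split_sizes = (len(demo_train), len(demo_validate), len(demo_test))
```

With 1,000 rows, `split_sizes` comes out to 560, 240, and 200 rows.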

In [None]:
X_train = train[['exam1', 'exam2', 'exam3']]
y_train = train.final_grade
X_validate = validate[['exam1', 'exam2', 'exam3']]
y_validate = validate.final_grade
X_test = test[['exam1', 'exam2', 'exam3']]
y_test = test.final_grade

## Select K Best

- scores each feature in isolation against the target using a univariate statistical test (here `f_regression`, an F-test derived from each feature's correlation with the target)
- fastest of all approaches covered in this lesson
- doesn't consider feature interactions
- After fitting: `.scores_`, `.pvalues_`, `.get_support()`, and `.transform()`

In [None]:
# Like our other sklearn objects: create the object, then fit it on the training data
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_train, y_train)

In [None]:
kbest_results = pd.DataFrame(dict(p=kbest.pvalues_, f=kbest.scores_), index=X_train.columns)
kbest_results

In [None]:
X_train.columns[kbest.get_support()]

In [None]:
X_train_transformed = pd.DataFrame(
    kbest.transform(X_train),
    index=X_train.index,
    columns=X_train.columns[kbest.get_support()]
)
X_train_transformed.head()
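To make the link between `f_regression` and correlation concrete: with its default settings, the F statistic for each feature is r² / (1 - r²) * (n - 2), where r is the Pearson correlation between that feature and the target. The sketch below checks this on synthetic data (invented for illustration; the `_demo` names are placeholders).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data, invented for illustration.
rng = np.random.default_rng(123)
n = 200
X_demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y_demo = 3 * X_demo.strong + 0.5 * X_demo.weak + rng.normal(size=n)

# Scores as computed by sklearn (one F statistic and p-value per feature).
f_scores, p_values = f_regression(X_demo, y_demo)

# The same scores rebuilt by hand from each feature's correlation with the target.
r = X_demo.corrwith(y_demo)
f_manual = r**2 / (1 - r**2) * (n - 2)
```

Since each score depends only on one feature's correlation with the target, two highly redundant features can both score well, which is one reason this method cannot account for feature interactions.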

## RFE

- Recursive Feature Elimination
- Progressively eliminate features based on importance to the model
- Requires a model that exposes either a `.coef_` or a `.feature_importances_` attribute after fitting
- After fitting: `.ranking_`, `.get_support()`, and `.transform()`

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X_train, y_train)

In [None]:
pd.DataFrame({'rfe_ranking': rfe.ranking_}, index=X_train.columns)

In [None]:
X_train.columns[rfe.get_support()]

In [None]:
X_train_transformed = pd.DataFrame(
    rfe.transform(X_train),
    index=X_train.index,
    columns=X_train.columns[rfe.support_]
)
X_train_transformed.head()
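The elimination loop that RFE runs can be sketched by hand. This is a simplified illustration on synthetic data, assuming a linear model and one feature dropped per round; it is meant to show the idea, not to replace `RFE`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: `a` and `b` matter most, `d` not at all.
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 5 * X_demo.a + 2 * X_demo.b + 0.5 * X_demo.c + rng.normal(size=200)

# Manual loop: refit, drop the feature with the smallest squared coefficient, repeat.
remaining = list(X_demo.columns)
while len(remaining) > 2:
    fitted = LinearRegression().fit(X_demo[remaining], y_demo)
    weakest = remaining[int(np.argmin(fitted.coef_ ** 2))]
    remaining.remove(weakest)

# The sklearn selector, for comparison.
demo_rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_selected = list(X_demo.columns[demo_rfe.get_support()])
```

Because importance is read from coefficient size, features measured on very different scales may be ranked misleadingly unless they are scaled first.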

## Sequential Feature Selector

- progressively adds or removes features based on cross-validated model performance
- forward: start with 0 features, add the best additional feature until you have the desired number
- backward: start with all features, remove the worst-performing feature until you have the desired number
- After fitting: `.support_`, `.get_support()`, and `.transform()`

In [None]:
model = LinearRegression()
sfs = SequentialFeatureSelector(model, n_features_to_select=2, scoring='neg_mean_absolute_error', direction='backward')
sfs.fit(X_train, y_train)

In [None]:
X_train_transformed = pd.DataFrame(
    sfs.transform(X_train),
    index=X_train.index,
    columns=X_train.columns[sfs.support_]
)
X_train_transformed.head()
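To compare the two directions side by side, the sketch below runs forward and backward selection on the same synthetic data (invented for illustration, with a deliberately strong signal in `a` and `b`).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration.
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 5 * X_demo.a + 2 * X_demo.b + 0.5 * X_demo.c + rng.normal(size=200)

# Fit one selector per direction and record which columns each one keeps.
selected = {}
for direction in ['forward', 'backward']:
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=2,
        scoring='neg_mean_absolute_error',
        direction=direction,
    )
    selector.fit(X_demo, y_demo)
    selected[direction] = list(X_demo.columns[selector.get_support()])
```

On data this clean both directions agree, but on real data they may not, since each makes greedy choices from a different starting point. Forward selection fits fewer models when you want a small subset of many features; backward is cheaper when you are dropping only a few.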

## Conclusion

- Simpler models tend to handle change and variability better
- Use RFE to narrow down your features and find the best ones; if your dataset is large (> 1 GB; check with `df.info()`), use SelectKBest instead
- Remember: feature engineering is much more than feature selection!