# Feature Engineering I, 
### a.k.a.
* ### Feature Engineering Intro
* ### Feature Engineering with _pandas_

Two sides to feature engineering:
* Which features help me build a better model?
* How should I preprocess them to include them in the model?

Why do feature engineering:
* Encode our subject matter expertise into the data
* More relevant information gives better performance
* Feed non-numeric data into a model

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Our toy dataset to work with (None in 'price' becomes NaN, i.e. a missing value)
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

### 1. _Imputation_: filling in missing values

Q: What can we do with missing values?

A: ...
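
As one possible set of answers, the sketch below shows three common options on the toy dataset: dropping the rows, filling with a global mean, and filling with a per-group mean. The variable names (`dropped`, `filled_mean`, `filled_group`, `price_missing`) are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# The toy dataset from above, repeated so this cell runs on its own
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Option 1: drop every row that has a missing price
dropped = df.dropna(subset=['price'])

# Option 2: fill with a single global statistic (here the overall mean)
filled_mean = df['price'].fillna(df['price'].mean())

# Option 3: fill with the mean price of the same fruit
group_mean = df.groupby('fruit')['price'].transform('mean')
filled_group = df['price'].fillna(group_mean)

# Optional: keep a flag recording which values were missing (hypothetical column name)
df['price_missing'] = df['price'].isna().astype(int)
```

Which option is appropriate depends on how much data is missing and why; the per-group mean only works here because every fruit with a missing price also has observed prices.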

### 2. _One-Hot Encoding_: converting categories into numbers

`pd.factorize()`

`pd.get_dummies()`
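
A minimal sketch of the two functions on the `fruit` column follows. Note that `pd.factorize()` produces one integer code per category (often called label encoding), while `pd.get_dummies()` produces the one-hot columns. The names `codes`, `uniques`, `dummies` and `encoded` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0],
})

# factorize: one integer per category, numbered in order of first appearance
codes, uniques = pd.factorize(df['fruit'])

# get_dummies: one 0/1 column per category (one-hot encoding)
dummies = pd.get_dummies(df['fruit'], prefix='fruit', dtype=int)

# Replace the string column with its one-hot columns
encoded = pd.concat([df.drop(columns='fruit'), dummies], axis=1)
```

Integer codes suggest an ordering (banana < apple < orange < melon) that does not exist in the data, which is one reason one-hot columns are often preferred for models that treat inputs as magnitudes.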

### 3. _Scaling_: putting our variables on a common scale

_Normalization_:

_Standardization_:
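
The sketch below assumes the common usage of the two terms: normalization as min-max scaling to the range [0, 1], and standardization as subtracting the mean and dividing by the standard deviation. It uses the non-missing prices of the toy dataset; the variable names are illustrative.

```python
import pandas as pd

# Non-missing prices from the toy dataset
price = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 5.0], name='price')

# Normalization (min-max scaling): smallest value maps to 0, largest to 1
normalized = (price - price.min()) / (price.max() - price.min())

# Standardization (z-score): mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
standardized = (price - price.mean()) / price.std(ddof=0)
```

The choice of `ddof=0` is an assumption made here so that the result has a standard deviation of exactly 1; with the pandas default the values differ slightly.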

### 4. _Binning_: turning scalars into categories

`pd.cut()`

`pd.qcut()`
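
To illustrate the difference between the two: `pd.cut()` bins by value ranges that we choose, while `pd.qcut()` bins by quantiles so that each bin holds roughly the same number of rows. The bin edges and labels below are arbitrary choices for this sketch.

```python
import pandas as pd

# Non-missing prices from the toy dataset
price = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 5.0], name='price')

# cut: fixed edges, intervals are (0, 2], (2, 4], (4, 6]
by_width = pd.cut(price, bins=[0, 2, 4, 6], labels=['cheap', 'medium', 'expensive'])

# qcut: three quantile-based bins with (roughly) equal counts
by_quantile = pd.qcut(price, q=3, labels=['low', 'mid', 'high'])
```

With `pd.cut()` the bins can end up very unevenly filled (three, two and one rows here), whereas `pd.qcut()` places two rows in each bin.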

### Feature engineering best practices:

#### 1. We should try to split our dataset into training and testing sub-samples as early as we can.
   - but this is flexible: e.g. you can drop NaNs from the entire dataset before splitting, since dropping rows does not use any statistics computed from the data.
   - still, in the interest of good machine learning habits, it is better to do even this after splitting.
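
A minimal sketch of "split first, then engineer": the fill value is computed on the training rows only and then reused for the test rows. The split via `DataFrame.sample()` is an assumption made to stay within pandas; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Split first: 75% of the rows for training, the rest for testing
train = df.sample(frac=0.75, random_state=42)
test = df.drop(train.index)

# Compute the imputation statistic from the training data only ...
train_mean = train['price'].mean()

# ... and apply that same value to both sub-samples
train = train.assign(price=train['price'].fillna(train_mean))
test = test.assign(price=test['price'].fillna(train_mean))
```

Computing the mean on the full dataset instead would let information from the test rows leak into the training data.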

#### 2. We need to feature-engineer our testing data in the same way that we feature-engineered our training data.
   - otherwise the performance of our model will suffer, if it runs at all.
   - writing a function is a nice way to do this.
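
One way such a function could look is sketched below. The helper names `fit_params` and `engineer` are hypothetical, and the fixed row selection for the split is only there to keep the example deterministic. Everything the transformation needs (fill value, scaling range, list of categories) is learned from the training data and then applied unchanged to the test data.

```python
import pandas as pd

def fit_params(train):
    """Learn all values the transformation needs from the training data only."""
    return {
        'price_mean': train['price'].mean(),
        'price_min': train['price'].min(),
        'price_max': train['price'].max(),
        'fruits': sorted(train['fruit'].unique()),
    }

def engineer(data, params):
    """Apply imputation, min-max scaling and one-hot encoding with fixed parameters."""
    out = pd.DataFrame(index=data.index)
    price = data['price'].fillna(params['price_mean'])
    price_range = params['price_max'] - params['price_min']
    out['price_scaled'] = (price - params['price_min']) / price_range
    out['bio'] = data['bio']
    # Fixing the categories guarantees identical columns for train and test;
    # a fruit never seen in training gets zeros in every fruit column
    fruit = pd.Series(
        pd.Categorical(data['fruit'], categories=params['fruits']),
        index=data.index,
    )
    dummies = pd.get_dummies(fruit, prefix='fruit', dtype=int)
    return pd.concat([out, dummies], axis=1)

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0],
})
train, test = df.iloc[[0, 2, 3, 5, 6, 7]], df.iloc[[1, 4]]

params = fit_params(train)
X_train = engineer(train, params)
X_test = engineer(test, params)
```

Because the categories are fixed in `params`, `X_train` and `X_test` have the same columns in the same order even though the test rows contain only two of the four fruits.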

#### 3. Feature Engineering includes any pre-processing technique, such as:
   - imputation, dropping missing values
   - converting strings / non-numeric values into numeric values
   - scaling
   - binning
   - combining features
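
Most of the techniques in this list have dedicated pandas functions; combining features is usually plain column arithmetic. As a small illustration, the sketch below builds an interaction term from `price` and `bio`. The column name `bio_price` is a hypothetical choice, and whether such a feature helps depends on the model and the data.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Interaction term: the price for bio fruit, 0 for non-bio fruit.
# Missing prices stay missing (NaN), so impute before or after as needed.
df['bio_price'] = df['price'] * df['bio']
```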