# 2.6 Feature Engineering with Pandas 🛠




### Goals

- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.

### Definition

`Feature engineering is manually designing what the input X’s should be`

— Tomasz Malisiewicz


In [None]:
from PIL import Image

im = Image.open("feature_eng.JPG")
im

> Place of feature engineering in the machine learning workflow

Two sides to feature engineering:
* Which features help me build a better model?
* How should I preprocess them to include them into the model?

Q: Why do feature engineering? What kind of problems with data do we have?


### Fancy tricks with simple numbers

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Our toy dataset to work with
df = pd.DataFrame({
    'food': ['🍣', '🍝', '🍣', '🌮', '🌮', '🌭', '🌭', '🍕'],
    'price': [1.00, 1.50, None, 2.0, 2.50, None, 3.0, 5.0]
    
})

In [3]:
df

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,
3,🌮,2.0
4,🌮,2.5
5,🌭,
6,🌭,3.0
7,🍕,5.0


### 1. _Imputation_: filling in missing values






* `df.isna()`: check for NaNs, sum or heatmap
* `df.dropna()`: drop NaNs
* `df.fillna()`: fill NaNs

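*Tip: these methods return a new DataFrame and leave `df` unchanged. When working on the Titanic dataset, assign the result back (e.g. `df = df.fillna(...)`) or pass `inplace=True`, otherwise your changes are not saved.*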
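
As a minimal sketch of the three methods listed above, the following uses a hypothetical one-column frame (the `prices` name and its values are assumptions for illustration, not part of the toy dataset). It also shows that `fillna()` does not modify the original frame unless the result is assigned back.

```python
import pandas as pd

# Hypothetical stand-in data with two missing prices
prices = pd.DataFrame({"price": [1.0, 1.5, None, 2.0, 2.5, None, 3.0, 5.0]})

# Count missing values per column
n_missing = prices.isna().sum()

# dropna() returns a new frame without the rows that contain NaN
dropped = prices.dropna()

# fillna() also returns a new frame; `prices` itself is left unchanged
filled = prices.fillna(0)
still_missing = prices["price"].isna().sum()

print(n_missing["price"], len(dropped), still_missing)
```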

In [4]:
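# Fill NaNs with the column mean; numeric_only=True skips the non-numeric 'food' column
df.fillna(df.mean(numeric_only=True))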

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,2.5
3,🌮,2.0
4,🌮,2.5
5,🌭,2.5
6,🌭,3.0
7,🍕,5.0


In [5]:
df

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,
3,🌮,2.0
4,🌮,2.5
5,🌭,
6,🌭,3.0
7,🍕,5.0


In [6]:
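# Backward fill: each NaN takes the next valid value below it
# (fillna(method='bfill') is deprecated in recent pandas versions)
df.bfill()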

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,2.0
3,🌮,2.0
4,🌮,2.5
5,🌭,3.0
6,🌭,3.0
7,🍕,5.0


In [7]:
df

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,
3,🌮,2.0
4,🌮,2.5
5,🌭,
6,🌭,3.0
7,🍕,5.0


In [8]:
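# Forward fill: each NaN takes the last valid value above it
# (fillna(method='ffill') is deprecated in recent pandas versions)
df.ffill()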

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,1.5
3,🌮,2.0
4,🌮,2.5
5,🌭,2.5
6,🌭,3.0
7,🍕,5.0


In [9]:
df

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,
3,🌮,2.0
4,🌮,2.5
5,🌭,
6,🌭,3.0
7,🍕,5.0


In [10]:
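# Linear interpolation between neighbouring valid values;
# applied to the numeric column only, since 'food' cannot be interpolated
df.assign(price=df['price'].interpolate())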

Unnamed: 0,food,price
0,🍣,1.0
1,🍝,1.5
2,🍣,1.75
3,🌮,2.0
4,🌮,2.5
5,🌭,2.75
6,🌭,3.0
7,🍕,5.0


In [11]:
df['price_filled'] = df['price'].fillna(df.groupby('food')['price'].transform('mean'))

In [12]:
df

Unnamed: 0,food,price,price_filled
0,🍣,1.0,1.0
1,🍝,1.5,1.5
2,🍣,,1.0
3,🌮,2.0,2.0
4,🌮,2.5,2.5
5,🌭,,3.0
6,🌭,3.0,3.0
7,🍕,5.0,5.0


### 2. _One-Hot Encoding_: converting categories into numbers

`pd.factorize()`
* encodes each distinct value as an integer code (in order of first appearance)
* results in one column
* closest scikit-learn equivalent: `LabelEncoder` (which assigns codes in sorted order instead)

In [16]:
pd.factorize(df['food'])  # returns a tuple: (integer codes, unique values)

(array([0, 1, 0, 2, 2, 3, 3, 4]),
 Index(['🍣', '🍝', '🌮', '🌭', '🍕'], dtype='object'))

In [14]:
df['food_factorized'] = pd.factorize(df['food'])[0]

In [15]:
df[['food', 'food_factorized']]

Unnamed: 0,food,food_factorized
0,🍣,0
1,🍝,1
2,🍣,0
3,🌮,2
4,🌮,2
5,🌭,3
6,🌭,3
7,🍕,4
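
The sketch below illustrates two properties of `pd.factorize()` that are easy to miss: the codes can be mapped back to the original values, and they follow order of first appearance rather than sorted order. The `food` series here uses hypothetical text labels instead of the emoji above, and the categorical-dtype comparison is included only as a contrast.

```python
import pandas as pd

# Hypothetical stand-in for the 'food' column
food = pd.Series(["sushi", "pasta", "sushi", "taco", "taco"])

# Codes follow order of first appearance: sushi=0, pasta=1, taco=2
codes, uniques = pd.factorize(food)

# The unique values let us recover the original labels from the codes
decoded = uniques.take(codes)

# For contrast: categorical codes follow sorted order (pasta=0, sushi=1, taco=2)
sorted_codes = food.astype("category").cat.codes

print(list(codes), list(sorted_codes))
```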


`pd.get_dummies()`
* turns each category into its own dummy/indicator (0/1) variable
* results in as many columns as there are categories
* scikit-learn equivalent `OneHotEncoder`

In [17]:
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
pd.get_dummies(df['food'], dtype=int)

Unnamed: 0,🌭,🌮,🍕,🍝,🍣
0,0,0,0,0,1
1,0,0,0,1,0
2,0,0,0,0,1
3,0,1,0,0,0
4,0,1,0,0,0
5,1,0,0,0,0
6,1,0,0,0,0
7,0,0,1,0,0


In [19]:
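# Attach the dummy columns to the original frame (the result is not assigned, so df is unchanged).
# Optionally pass drop_first=True to get_dummies to drop one redundant column.
df.join(pd.get_dummies(df['food'], dtype=int))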

Unnamed: 0,food,price,price_filled,food_factorized,🌭,🌮,🍕,🍝,🍣
0,🍣,1.0,1.0,0,0,0,0,0,1
1,🍝,1.5,1.5,1,0,0,0,1,0
2,🍣,,1.0,0,0,0,0,0,1
3,🌮,2.0,2.0,2,0,1,0,0,0
4,🌮,2.5,2.5,2,0,1,0,0,0
5,🌭,,3.0,3,1,0,0,0,0
6,🌭,3.0,3.0,3,1,0,0,0,0
7,🍕,5.0,5.0,4,0,0,1,0,0
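
The `drop_first=True` option mentioned in the cell above can be sketched as follows, using hypothetical text labels in place of the emoji. With one column dropped, the omitted category is represented by a row of all zeros, which avoids a perfectly redundant column in linear models.

```python
import pandas as pd

# Hypothetical stand-in for the 'food' column
food = pd.Series(["sushi", "pasta", "sushi", "taco"], name="food")

# Full one-hot encoding: one column per category (columns are sorted)
full = pd.get_dummies(food, dtype=int)

# drop_first=True removes the first category ('pasta' here)
reduced = pd.get_dummies(food, drop_first=True, dtype=int)

print(list(full.columns), list(reduced.columns))
```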


### 3. _Scaling_: putting our variables on a common scale

_Normalization_:
* sensitive to outliers, since a single extreme value sets the min or max
* output range is [0, 1] for the data the min and max were computed on
* scikit-learn equivalent `MinMaxScaler()`

In [20]:
def normalize(X):
    return (X-X.min())/(X.max()-X.min())

_Standardization_:
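* less sensitive to outliers than normalization, although the mean and standard deviation are still affected by them
* output has mean 0 and standard deviation 1, but no fixed range
* scikit-learn equivalent `StandardScaler()`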

In [21]:
def standardize(X):
    return (X-X.mean())/X.std()
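
To see what the two helper functions do, here is a small sketch that applies them to a hypothetical price series (the values are assumptions chosen for illustration). Note that pandas' `.std()` uses the sample standard deviation (`ddof=1`), whereas `StandardScaler()` uses the population standard deviation, so the two can differ slightly on small samples.

```python
import pandas as pd

def normalize(X):
    # Min-max scaling: smallest value maps to 0, largest to 1
    return (X - X.min()) / (X.max() - X.min())

def standardize(X):
    # Z-score scaling: subtract the mean, divide by the (sample) standard deviation
    return (X - X.mean()) / X.std()

# Hypothetical prices with one comparatively large value
prices = pd.Series([1.0, 1.5, 1.0, 2.0, 2.5, 3.0, 3.0, 5.0])

normed = normalize(prices)
standardized = standardize(prices)

print(normed.min(), normed.max())
print(round(standardized.mean(), 6), round(standardized.std(), 6))
```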

### 4. _Binning_: turning scalars into categories

In [22]:
df

Unnamed: 0,food,price,price_filled,food_factorized
0,🍣,1.0,1.0,0
1,🍝,1.5,1.5,1
2,🍣,,1.0,0
3,🌮,2.0,2.0,2
4,🌮,2.5,2.5,2
5,🌭,,3.0,3
6,🌭,3.0,3.0,3
7,🍕,5.0,5.0,4


`pd.cut()`
* equally spaced intervals (by specifying number of bins)
* arbitrary bin edges (by specifying bin edges)

In [27]:
pd.cut(df['price_filled'], bins=[0,1,3,5], labels=['cheap', 'medium', 'expensive'])

0        cheap
1       medium
2        cheap
3       medium
4       medium
5       medium
6       medium
7    expensive
Name: price_filled, dtype: category
Categories (3, object): ['cheap' < 'medium' < 'expensive']
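
The cell above uses explicit bin edges. As a sketch of the other mode of `pd.cut()`, passing an integer splits the value range into equally wide intervals. The series and label names below are hypothetical; with skewed data, equal-width bins can end up holding very different numbers of points.

```python
import pandas as pd

# Hypothetical prices, mostly small with one large value
prices = pd.Series([1.0, 1.5, 1.0, 2.0, 2.5, 3.0, 3.0, 5.0])

# bins=3 divides the range from min to max into three equal-width intervals
binned = pd.cut(prices, bins=3, labels=["low", "mid", "high"])

# Number of data points that fall into each interval
counts = binned.value_counts()
print(counts)
```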

`pd.qcut()`
* quantile-based bins: roughly the same number of data points per bin (ties can make the counts uneven)

In [28]:
pd.qcut(df['price_filled'], q=4, labels=['cheap', 'medium', 'expensive', 'very_expensive'])

0             cheap
1            medium
2             cheap
3            medium
4         expensive
5         expensive
6         expensive
7    very_expensive
Name: price_filled, dtype: category
Categories (4, object): ['cheap' < 'medium' < 'expensive' < 'very_expensive']
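
To inspect which edges `pd.qcut()` chose, one option is the `retbins=True` argument, sketched here on a hypothetical series. The edges are the quantiles of the data, so the interval widths differ while the counts per bin stay close to each other.

```python
import pandas as pd

# Hypothetical prices (same shape as a filled price column)
prices = pd.Series([1.0, 1.5, 1.0, 2.0, 2.5, 3.0, 3.0, 5.0])

# retbins=True additionally returns the bin edges (the quartiles for q=4)
binned, edges = pd.qcut(prices, q=4, retbins=True)

# Points per bin, listed in bin order rather than by frequency
counts = binned.value_counts(sort=False)

print(edges)
print(list(counts))
```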

### Feature engineering best practices:

#### 1. We should try to split our data set into training and testing sub-samples as early as we can.
   - This is flexible, though: e.g. you can drop NaNs from the entire dataset before filling.
   - Still, in the interest of good machine learning habits, it is better to do even that after splitting.

#### 2. We need to feature engineer our testing data in the same way that we feature-engineered our training data.
   - otherwise the performance of our model will suffer, if it runs at all.
   - writing a function is a nice way to do this.
   - an even better/cleaner way is by using `ColumnTransformer` — more on that this afternoon!
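
One way to follow both practices with a plain function is sketched below. The helper names `fit_fill_values` and `engineer`, as well as the tiny train/test frames, are hypothetical. The idea is to learn the fill values on the training data only and then apply the same transformation to both splits.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train):
    """Learn per-food mean prices and a global fallback from the training data only."""
    return {
        "by_food": train.groupby("food")["price"].mean(),
        "overall": train["price"].mean(),
    }

def engineer(df, fill_values):
    """Apply the same imputation to any split, using statistics learned on train."""
    out = df.copy()
    # Look up the training mean for each row's food (NaN for unseen foods)
    group_means = out["food"].map(fill_values["by_food"])
    # Fill with the group mean first, then fall back to the overall training mean
    out["price"] = out["price"].fillna(group_means).fillna(fill_values["overall"])
    return out

# Hypothetical splits; 'pizza' never appears in the training data
train = pd.DataFrame({"food": ["sushi", "sushi", "taco", "taco"],
                      "price": [1.0, np.nan, 2.0, 3.0]})
test = pd.DataFrame({"food": ["sushi", "pizza"],
                     "price": [np.nan, np.nan]})

fill_values = fit_fill_values(train)
train_fe = engineer(train, fill_values)
test_fe = engineer(test, fill_values)

print(test_fe)
```

Because the statistics come from the training split alone, no information from the test data leaks into the features, and unseen categories still receive a sensible fallback value.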

