# Feature Engineering I, 
### a.k.a.
* ### Feature Engineering Intro
* ### Feature Engineering with _pandas_

Two sides to feature engineering:
* Which features help me build a better model?
* How should I preprocess them so they can be included in the model?

Why do feature engineering:
* Encode our subject matter expertise into the data
* More relevant information gives better performance
* Feed data that's not numbers into a model 

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Our toy dataset to work with
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1,0,1,0,1,0,1,0]
})

In [4]:
df

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,,0
6,orange,3.0,1
7,melon,5.0,0


### 1. _Imputation_: filling in missing values

Q: What can we do with missing values?

A:
* Drop:
    * drop rows with missing data
    * drop whole columns if they have a lot of data missing and aren't particularly relevant
* Replace with a value:
    * mean / median / back-fill / forward-fill / interpolate
* Use the other data to predict the missing value:
    * `KNNImputer`
    * `IterativeImputer`


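* `df.isnull()` - check for NaNs; summarize with `.sum()` or plot as a heatmap
* `df.dropna()` - drop missing values; be careful, by default it drops every row that contains any NaN
* `df.fillna()` - fill missing values

These methods return a new DataFrame and leave `df` unchanged. "In real life" you'd want to keep the result, either by assigning it back (`df = df.fillna(...)`) or by passing `inplace=True`.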

In [5]:
# numeric_only=True skips the string column 'fruit' (required in pandas >= 2.0)
df.fillna(df.mean(numeric_only=True))

Unnamed: 0,fruit,price,bio
0,banana,1.0,1
1,banana,1.5,0
2,banana,2.5,1
3,apple,2.0,0
4,apple,2.5,1
5,apple,2.5,0
6,orange,3.0,1
7,melon,5.0,0
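
The other options listed above (counting NaNs, dropping rows, filling with a different statistic) can be sketched in the same way. This is a minimal example, assuming the toy dataset from the start of the notebook, which is rebuilt here so the snippet runs on its own. The names `missing_per_column`, `complete_rows` and `filled` are illustrative only.

```python
import pandas as pd

# Rebuild the toy dataset so the snippet is self-contained
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()

# Drop only the rows where 'price' is missing,
# instead of every row that has a NaN anywhere
complete_rows = df.dropna(subset=['price'])

# Fill missing prices with the median of the observed prices;
# passing a dict restricts the fill to the named column
filled = df.fillna({'price': df['price'].median()})

print(missing_per_column)
print(complete_rows.shape)
print(filled['price'].tolist())
```

None of these calls modify `df` itself; each returns a new object.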


In [19]:
df

Unnamed: 0,fruit,price,bio,price_filled
0,banana,1.0,1,1.0
1,banana,1.5,0,1.5
2,banana,,1,
3,apple,2.0,0,2.0
4,apple,2.5,1,2.5
5,apple,,0,
6,orange,3.0,1,3.0
7,melon,5.0,0,5.0


In [47]:
# forward-fill; same result as fillna(method='ffill'), which newer pandas versions deprecate
df.ffill()

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,1.5,1,1.5,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,2.5,0,2.5,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


In [62]:
df.interpolate()

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,1.75,1,1.75,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,2.75,0,2.75,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


In [63]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


In [49]:
#df["price_filled"]=df["price"]


In [50]:
# transform returns one value per row of the original DataFrame
# (unlike .mean() on the groupby, which returns one value per group)
df.groupby("fruit")["price"].transform("mean")


0    1.25
1    1.25
2    1.25
3    2.25
4    2.25
5    2.25
6    3.00
7    5.00
Name: price, dtype: float64

In [51]:
df.groupby("fruit")["price"].mean()

fruit
apple     2.25
banana    1.25
melon     5.00
orange    3.00
Name: price, dtype: float64
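
Because `transform` keeps the original index, its result can be passed straight to `fillna` to impute each missing price with the mean of its own fruit. The sketch below assumes the same toy data. The column name `price_group_filled` is hypothetical and chosen so that it does not clash with the notebook's `price_filled`.

```python
import pandas as pd

# Rebuild the toy dataset so the snippet is self-contained
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Mean price of each fruit, repeated for every row of that fruit
group_mean = df.groupby('fruit')['price'].transform('mean')

# Use the group mean only where the original price is missing
df['price_group_filled'] = df['price'].fillna(group_mean)

print(df)
```

A group whose prices are all missing would still end up with NaN, so a global fallback may be needed on real data.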

In [52]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


### 2. _One-Hot Encoding_: converting categories into numbers

`pd.factorize()`

* encodes each category as an integer code
* results in one column
* scikit-learn equivalent: `LabelEncoder`

In [20]:
pd.factorize(df["fruit"])

(array([0, 0, 0, 1, 1, 1, 2, 3]),
 Index(['banana', 'apple', 'orange', 'melon'], dtype='object'))

In [22]:
pd.factorize(df["fruit"])[0]

array([0, 0, 0, 1, 1, 1, 2, 3])

In [23]:
df["fruit_factorized"]=pd.factorize(df["fruit"])[0]

In [24]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


`pd.get_dummies()`

In [30]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


In [37]:
pd.get_dummies(df["fruit"])

Unnamed: 0,apple,banana,melon,orange
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,1,0,0,0
4,1,0,0,0
5,1,0,0,0
6,0,0,0,1
7,0,0,1,0


In [29]:
# drop_first=True is important: it drops the first category (apple), which is redundant
# because a row with all zeros in the remaining columns already means "apple"
pd.get_dummies(df["fruit"], drop_first=True)

Unnamed: 0,banana,melon,orange
0,1,0,0
1,1,0,0
2,1,0,0
3,0,0,0
4,0,0,0
5,0,0,0
6,0,0,1
7,0,1,0


In [35]:
df.join(pd.get_dummies(df["fruit"], drop_first=True))

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized,banana,melon,orange
0,banana,1.0,1,1.0,0,1,0,0
1,banana,1.5,0,1.5,0,1,0,0
2,banana,,1,,0,1,0,0
3,apple,2.0,0,2.0,1,0,0,0
4,apple,2.5,1,2.5,1,0,0,0
5,apple,,0,,1,0,0,0
6,orange,3.0,1,3.0,2,0,0,1
7,melon,5.0,0,5.0,3,0,1,0


In [34]:
df.join(pd.get_dummies(df["fruit"]))

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized,apple,banana,melon,orange
0,banana,1.0,1,1.0,0,0,1,0,0
1,banana,1.5,0,1.5,0,0,1,0,0
2,banana,,1,,0,0,1,0,0
3,apple,2.0,0,2.0,1,1,0,0,0
4,apple,2.5,1,2.5,1,1,0,0,0
5,apple,,0,,1,1,0,0,0
6,orange,3.0,1,3.0,2,0,0,0,1
7,melon,5.0,0,5.0,3,0,0,1,0
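
As an alternative to `join`, `pd.get_dummies()` can encode a column inside the DataFrame directly through its `columns` parameter. A minimal sketch, assuming the toy data from above. The variable name `encoded` is illustrative, and `dtype=int` is passed so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Rebuild the toy dataset so the snippet is self-contained
df = pd.DataFrame({
    'fruit': ['banana', 'banana', 'banana', 'apple', 'apple', 'apple', 'orange', 'melon'],
    'price': [1.00, 1.50, None, 2.00, 2.50, None, 3.0, 5.0],
    'bio': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Replace 'fruit' with dummy columns in one step;
# the new columns are prefixed with the original column name
encoded = pd.get_dummies(df, columns=['fruit'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The original `fruit` column is removed from the result, and the apple rows show up as all zeros in the three dummy columns.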


### 3. _Scaling_: putting our variables on a common scale

_Normalization_:

* output range is [0, 1]
* does not deal well with outliers: a single extreme value squeezes all other values into a narrow range
* scikit-learn equivalent: `MinMaxScaler()`

In [38]:
def normalize(X):
    return (X - X.min()) / (X.max() - X.min())

_Standardization_:

* output range is not fixed (values are centered around 0 and rarely go much beyond about ±3)
* less distorted by outliers than normalization, although the mean and standard deviation are still affected by them
* scikit-learn equivalent: `StandardScaler()`


In [40]:
def standardize(X):
    return (X-X.mean())/X.std()
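
To see what the two functions do, they can be applied to the observed prices of the toy dataset. This is a small sketch in which both functions are repeated so it runs on its own. The names `price`, `price_norm` and `price_std` are illustrative.

```python
import pandas as pd

def normalize(X):
    # Min-max scaling: smallest value -> 0, largest value -> 1
    return (X - X.min()) / (X.max() - X.min())

def standardize(X):
    # Z-score: subtract the mean, divide by the standard deviation
    return (X - X.mean()) / X.std()

# The six observed prices from the toy dataset
price = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 5.0], name='price')

price_norm = normalize(price)
price_std = standardize(price)

print(pd.DataFrame({'price': price, 'normalized': price_norm, 'standardized': price_std}))
```

One detail to keep in mind: pandas' `.std()` uses the sample standard deviation (`ddof=1`), while `StandardScaler()` divides by the population standard deviation (`ddof=0`), so the two give slightly different numbers on small datasets.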

### 4. _Binning_: turning scalars into categories

`pd.cut()`

In [41]:
df

Unnamed: 0,fruit,price,bio,price_filled,fruit_factorized
0,banana,1.0,1,1.0,0
1,banana,1.5,0,1.5,0
2,banana,,1,,0
3,apple,2.0,0,2.0,1
4,apple,2.5,1,2.5,1
5,apple,,0,,1
6,orange,3.0,1,3.0,2
7,melon,5.0,0,5.0,3


* equally spaced intervals (by specifying the number of bins), or custom bin edges

In [42]:
pd.cut(df["price_filled"], bins=3, labels=["cheap", "medium", "expensive"])

0        cheap
1        cheap
2          NaN
3        cheap
4       medium
5          NaN
6       medium
7    expensive
Name: price_filled, dtype: category
Categories (3, object): ['cheap' < 'medium' < 'expensive']

In [54]:
pd.cut(df["price_filled"], bins=[0, 1.5, 4.5, 5.5], labels=["new_cheap", "new_medium", "new_expensive"])

0        new_cheap
1        new_cheap
2              NaN
3       new_medium
4       new_medium
5              NaN
6       new_medium
7    new_expensive
Name: price_filled, dtype: category
Categories (3, object): ['new_cheap' < 'new_medium' < 'new_expensive']

`pd.qcut()`

* same number of data points per bin

In [58]:
pd.qcut(df["price_filled"], q=4, labels=["cheap", "medium", "expensive", "very expensive"])

0             cheap
1             cheap
2               NaN
3            medium
4         expensive
5               NaN
6    very expensive
7    very expensive
Name: price_filled, dtype: category
Categories (4, object): ['cheap' < 'medium' < 'expensive' < 'very expensive']
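
When the same bins have to be applied to new data later, both `pd.cut()` and `pd.qcut()` accept `retbins=True` and then also return the bin edges they computed. A minimal sketch with `pd.cut()`, using the observed prices from the toy data. The names `prices`, `edges` and `new_prices` are illustrative.

```python
import pandas as pd

# The six observed prices from the toy dataset
prices = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 5.0])
labels = ['cheap', 'medium', 'expensive']

# retbins=True returns the binned values and the computed bin edges
binned, edges = pd.cut(prices, bins=3, labels=labels, retbins=True)

# Reuse the learned edges on new data so both are binned identically
new_prices = pd.Series([1.2, 4.0])
new_binned = pd.cut(new_prices, bins=edges, labels=labels)

print(edges)
print(new_binned.tolist())
```

New values that fall outside the learned edges come back as NaN, so the outermost edges may need widening before reuse.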

### Feature engineering best practices:

#### 1. We should try to split our data set into training and testing sub-samples as early as we can.
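   - but this is flexible, e.g. you can drop NaNs from the entire dataset before filling.
   - still, in the interest of good machine learning habits, it is better to do even this after splitting, because statistics computed on the full dataset (e.g. a fill mean) leak information from the test set into training.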

#### 2. We need to feature engineer our testing data in the same way that we feature-engineered our training data.
   - otherwise the performance of our model will suffer, if it runs at all.
   - writing a function is a nice way to do this.
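
One way such a function could look is sketched below. The helper `engineer_features`, its parameters, and the tiny `train` / `test` frames are all hypothetical and only illustrate the idea: anything that is learned (here the fill value and the list of dummy columns) comes from the training data and is then reused unchanged on the testing data.

```python
import pandas as pd

def engineer_features(data, price_fill, dummy_columns):
    """Hypothetical helper: apply identical preprocessing to any split."""
    out = data.copy()
    # Impute with a value that was computed on the training data
    out['price'] = out['price'].fillna(price_fill)
    # One-hot encode, then force the same columns in the same order;
    # categories unseen in training are dropped, missing ones are filled with 0
    dummies = pd.get_dummies(out['fruit'], dtype=int)
    dummies = dummies.reindex(columns=dummy_columns, fill_value=0)
    return out.drop(columns='fruit').join(dummies)

# Tiny made-up splits, for illustration only
train = pd.DataFrame({'fruit': ['banana', 'apple', 'apple', 'orange'],
                      'price': [1.0, 2.0, None, 3.0]})
test = pd.DataFrame({'fruit': ['banana', 'melon'],
                     'price': [None, 5.0]})

# Learn the parameters from the training data only
price_fill = train['price'].mean()
dummy_columns = sorted(train['fruit'].unique())

X_train = engineer_features(train, price_fill, dummy_columns)
X_test = engineer_features(test, price_fill, dummy_columns)

print(X_train)
print(X_test)
```

Here the fruit `melon` appears only in the testing data, so its row gets zeros in every dummy column, and both frames end up with exactly the same columns.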

#### 3. Feature Engineering includes any pre-processing techniques, such as:
   - imputation, dropping missing values
   - converting strings / non-numeric values into numeric values
   - scaling
   - binning
   - combining features