# Feature Engineering II,
### a.k.a.
* ### Advanced Feature Engineering 
* ### Feature Engineering with _scikit-learn_

(The concepts are the same as in the intro to FE; what differs is how we transform the data.)

In [299]:
# stuff you know already
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [300]:
# new stuff !!
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### 1. Get Data

In [301]:
df = pd.read_csv('all_penguins_clean.csv')
df

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Real ID,Sex
0,PAL0708,1,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,A_0,MALE
1,PAL0708,2,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,A_1,FEMALE
2,PAL0708,3,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,A_2,FEMALE
3,PAL0708,4,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,A_3,
4,PAL0708,5,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,A_4,FEMALE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,PAL0910,120,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12/1/09,,,,,G_339,
340,PAL0910,121,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,G_340,FEMALE
341,PAL0910,122,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,G_341,MALE
342,PAL0910,123,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,G_342,FEMALE


In [302]:
df.isna().sum()

studyName               0
Sample Number           0
Species                 0
Region                  0
Island                  0
Stage                   0
Individual ID           0
Clutch Completion       0
Date Egg                0
Culmen Length (mm)      2
Culmen Depth (mm)       2
Flipper Length (mm)     2
Body Mass (g)           2
Real ID                 0
Sex                    10
dtype: int64

In [303]:
df = df[~df['Individual ID'].isin(['N2A2','N38A2'])]

In [321]:
df.isna().sum()

studyName              0
Sample Number          0
Species                0
Region                 0
Island                 0
Stage                  0
Individual ID          0
Clutch Completion      0
Date Egg               0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Real ID                0
Sex                    8
dtype: int64

In [304]:
X = df[['Island','Flipper Length (mm)','Body Mass (g)','Sex']]
y = df['Species']

### 2. Train-Test Split

In [305]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

### 3. Explore the Data

In [306]:
df.isna().sum()

studyName              0
Sample Number          0
Species                0
Region                 0
Island                 0
Stage                  0
Individual ID          0
Clutch Completion      0
Date Egg               0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Real ID                0
Sex                    8
dtype: int64

In [307]:
X_train = X_train.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
X_train['Sex'] = X_train['Sex'].replace('.','FEMALE')  # assign the result back instead of using inplace=True


### 4. Feature Engineer

In [322]:
X.isna().sum()

Island                 0
Flipper Length (mm)    0
Body Mass (g)          0
Sex                    8
dtype: int64

In [308]:
X

Unnamed: 0,Island,Flipper Length (mm),Body Mass (g),Sex
0,Torgersen,181.0,3750.0,MALE
1,Torgersen,186.0,3800.0,FEMALE
2,Torgersen,195.0,3250.0,FEMALE
4,Torgersen,193.0,3450.0,FEMALE
5,Torgersen,190.0,3650.0,MALE
...,...,...,...,...
338,Biscoe,214.0,4925.0,FEMALE
340,Biscoe,215.0,4850.0,FEMALE
341,Biscoe,222.0,5750.0,MALE
342,Biscoe,212.0,5200.0,FEMALE


Q: How do we want to feature engineer our columns?

A:
* impute missing values in the `Sex` column
* one-hot-encode the `Sex` and `Island` columns
* scale the numerical values
* discretize the numerical values
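
Discretizing is the one step in this list that the notebook does not demonstrate further down, so here is a minimal sketch of what `KBinsDiscretizer` does. The body masses are made-up illustrative values, not rows from the penguin data, and the choice of three equal-width bins is an assumption made only for this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up body masses in grams (illustrative values, not taken from the dataset)
body_mass = np.array([[2900.0], [3300.0], [4000.0], [4500.0], [5200.0], [5900.0]])

# Three equal-width bins, encoded as ordinal bin indices 0, 1, 2
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bins = discretizer.fit_transform(body_mass)

print(discretizer.bin_edges_[0])  # bin edges learned from the data
print(bins.ravel())               # [0. 0. 1. 1. 2. 2.]
```

With `encode='onehot'` the same bins would come back as one-hot columns instead of a single ordinal column.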


Introducing `ColumnTransformer`: 

https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
        
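Takes as parameters:
* a list of tuples of the format `(name, transformer, columns)`
* `remainder`: what to do with columns that are not listed, `'drop'` (the default) or `'passthrough'`

`ColumnTransformer` lets us do all our feature engineering in one go: each transformer is applied to its own columns and the results are concatenated side by side.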
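
To make the `remainder` option concrete, here is a small sketch on a made-up three-column frame. The column names `flipper`, `mass` and `year` are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frame (illustrative values only)
toy = pd.DataFrame({
    'flipper': [180.0, 200.0, 220.0],
    'mass': [3000.0, 4000.0, 5000.0],
    'year': [2007, 2008, 2009],
})

# Only 'flipper' is listed; 'mass' and 'year' are the "remainder"
scale_flipper = [('scaler', MinMaxScaler(), ['flipper'])]

dropped = ColumnTransformer(scale_flipper, remainder='drop').fit_transform(toy)
kept = ColumnTransformer(scale_flipper, remainder='passthrough').fit_transform(toy)

print(dropped.shape)  # (3, 1): only the scaled column survives
print(kept.shape)     # (3, 3): unlisted columns are appended unchanged
```

Passed-through columns are placed after the transformed ones, so the column order of the output can differ from the input.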

In [309]:
numerical_columns = ['Flipper Length (mm)','Body Mass (g)']
categorical_columns = ['Island','Sex']

In [310]:
column_transformer = ColumnTransformer([
    ('sex_imputer',SimpleImputer(strategy='most_frequent'),['Sex']),
    ('island_ohe',OneHotEncoder(sparse_output=False, handle_unknown='ignore'),['Island']),  # `sparse_output` was named `sparse` before scikit-learn 1.2
    ('num_scaler',MinMaxScaler(),numerical_columns)
])

In [311]:
column_transformer.fit(X_train)
X_train_fe = column_transformer.transform(X_train)
X_test_fe = column_transformer.transform(X_test)

In [312]:
X_train_fe

array([['MALE', 1.0, 0.0, 0.0, 0.27118644067796627, 0.38888888888888884],
       ['FEMALE', 0.0, 0.0, 1.0, 0.27118644067796627,
        0.05555555555555558],
       ['MALE', 0.0, 0.0, 1.0, 0.3559322033898309, 0.2152777777777778],
       ...,
       ['FEMALE', 1.0, 0.0, 0.0, 0.15254237288135597,
        0.13194444444444442],
       ['FEMALE', 1.0, 0.0, 0.0, 0.7627118644067798, 0.6111111111111112],
       ['FEMALE', 1.0, 0.0, 0.0, 0.3559322033898309, 0.0625]],
      dtype=object)

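Note that `Sex` has been imputed but is still a string column, so the whole array has `dtype=object` and cannot be fed to a model yet.

What happens when we want to transform the same column twice, for example impute it and then one-hot-encode it?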

Introducing `Pipeline`: 

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Takes as parameters: 
* list of tuples of the format `(name, transformer)`

`Pipeline` allows us to apply sequential transformations to the same data.

In [323]:
categorical_pipeline = Pipeline([
    ('categorical_imputer',SimpleImputer(strategy='most_frequent')),
    ('categorical_ohe',OneHotEncoder(sparse_output=False, handle_unknown='error',drop='if_binary')),  # `sparse_output` was named `sparse` before scikit-learn 1.2
])

In [329]:
column_transformer = ColumnTransformer([
    ('sex_imputer',categorical_pipeline,['Sex']),
    ('island_ohe',OneHotEncoder(sparse_output=False, handle_unknown='error',drop='first'),['Island']),  # `sparse_output` was named `sparse` before scikit-learn 1.2
    ('num_scaler',MinMaxScaler(),numerical_columns)
])

In [330]:
column_transformer.fit(X_train) # learn how to do the transformation (from the training set only)
X_train_fe = column_transformer.transform(X_train) # do the actual transformation
X_test_fe = column_transformer.transform(X_test) # apply the same transformation to the test set

In [331]:
X_train_fe[0]

array([1.        , 0.        , 0.        , 0.27118644, 0.38888889])

In [317]:
# keep in mind make_pipeline and make_column_transformer: the difference is that you do not pass step names
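
As a rough sketch of that note, the transformer from above could be rebuilt with the two helper functions. The step names are then generated from the lowercased class names. The column lists mirror the ones used in this notebook; the variable names `categorical` and `transformer` are only illustrative.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# No ('name', step) tuples: just pass the steps in order
categorical = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'),
)

# Tuples are (transformer, columns) instead of (name, transformer, columns)
transformer = make_column_transformer(
    (categorical, ['Sex']),
    (MinMaxScaler(), ['Flipper Length (mm)', 'Body Mass (g)']),
)

print([name for name, _ in categorical.steps])
# ['simpleimputer', 'onehotencoder']
print([name for name, _, _ in transformer.transformers])
# ['pipeline', 'minmaxscaler']
```

Explicit names remain useful when you want to refer to a step later, for example when tuning hyperparameters.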

### 5. Train Model

In [327]:
m = LogisticRegression()
m.fit(X_train_fe,y_train) 

LogisticRegression()

### 6. Optimize

In [328]:
m.score(X_train_fe,y_train)

0.7011070110701108

In [320]:
m.score(X_test_fe,y_test)

0.8823529411764706

### 7. Calculate Test Score

**BONUS**: Figure out how to wrap your feature engineering / ColumnTransformer and your model / LogisticRegression into a single Pipeline.
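
One possible way to approach the bonus is sketched below. It is not the only solution. So that the snippet runs on its own, it uses a small made-up stand-in for `X_train` / `y_train` (`X_toy`, `y_toy`, with invented values), and `max_iter=1000` is an assumption added to give the solver room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Small made-up stand-in for X_train / y_train (values are illustrative only)
X_toy = pd.DataFrame({
    'Island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe',
               'Dream', 'Torgersen', 'Biscoe', 'Dream'],
    'Flipper Length (mm)': [181.0, 215.0, 195.0, 222.0, 198.0, 186.0, 212.0, 201.0],
    'Body Mass (g)': [3750.0, 4850.0, 3650.0, 5750.0, 3800.0, 3800.0, 5200.0, 3900.0],
    'Sex': ['MALE', 'FEMALE', np.nan, 'MALE', 'FEMALE', 'FEMALE', np.nan, 'MALE'],
})
y_toy = pd.Series(['Adelie', 'Gentoo', 'Chinstrap', 'Gentoo',
                   'Chinstrap', 'Adelie', 'Gentoo', 'Chinstrap'])

# Impute, then one-hot-encode the categorical columns
categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_ohe', OneHotEncoder(handle_unknown='ignore')),
])

column_transformer = ColumnTransformer([
    ('categorical', categorical_pipeline, ['Island', 'Sex']),
    ('num_scaler', MinMaxScaler(), ['Flipper Length (mm)', 'Body Mass (g)']),
])

# Feature engineering and model chained into a single estimator
full_pipeline = Pipeline([
    ('feature_engineering', column_transformer),
    ('model', LogisticRegression(max_iter=1000)),
])

full_pipeline.fit(X_toy, y_toy)            # fits the transformers, then the model
predictions = full_pipeline.predict(X_toy)  # transforms, then predicts
print(full_pipeline.score(X_toy, y_toy))
```

With the real data you would call `full_pipeline.fit(X_train, y_train)` and `full_pipeline.score(X_test, y_test)`, and the pipeline applies the fitted transformations to the test set for you.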