## Additional Learning Resources
Refer to [scikit-learn documentation](https://scikit-learn.org/stable/) and the [Pandas user guide](https://pandas.pydata.org/docs/) for detailed explanations of the functions used in this notebook.
For a quick refresher on splitting data:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


# Feature Engineering II

Best practice:

1. Fill missing values (imputation)
2. Everything else (one-hot encoding, binning, others)
3. Scaling
4. Fit the model
5. Apply the same steps to the test set, using only `.transform()` (without `.fit`!)
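
The five steps can also be bundled into a single object. The block below is a minimal sketch, not the approach this notebook takes: it assumes scikit-learn's `Pipeline` and `ColumnTransformer`, and it uses a hypothetical toy data set (`toy`, with made-up columns `length` and `sex`) instead of the penguin data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# hypothetical toy data (NOT the penguin data set)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "length": rng.normal(40, 5, n),
    "sex": rng.choice(["MALE", "FEMALE"], n),
})
toy["label"] = (toy["length"] > 40).astype(int)
toy.loc[::10, "length"] = np.nan  # inject some missing values

Xtr, Xte, ytr, yte = train_test_split(
    toy[["length", "sex"]], toy["label"], random_state=42
)

# steps 1-3: impute and scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler()), ["length"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# step 4: the model is fitted on the training data only
pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
pipe.fit(Xtr, ytr)

# step 5: score() only transforms the test data, nothing is re-fitted
test_acc = pipe.score(Xte, yte)
```

The rest of this notebook performs the same steps by hand, which makes it easier to see what each tool does.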


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, MinMaxScaler
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('penguins_simple.csv', sep=';')
df.head(3)

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE


In [3]:
X = df.iloc[:, 1:]
y = df['Species']

In [4]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape

((249, 5), (84, 5), (249,), (84,))

#### Feature Engineering

In [5]:
# 1. create a feature engineering tool
# note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` (`sparse` was removed in 1.4)
ohc = OneHotEncoder(sparse=False, handle_unknown='ignore')

# 2. fit with the training data (some columns of it)
ohc.fit(Xtrain[['Sex']])

# 3. transform the training data
onehot_sex = ohc.transform(Xtrain[['Sex']])
onehot_sex = pd.DataFrame(onehot_sex)
onehot_sex.head()

Unnamed: 0,0,1
0,0.0,1.0
1,1.0,0.0
2,0.0,1.0
3,1.0,0.0
4,0.0,1.0


In [7]:
# quantile strategy: bins have different widths but (roughly) the same number of penguins each
# uniform strategy: bins have the same width but different numbers of penguins each (like a histogram)
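
To make the difference between the two strategies concrete, here is a small sketch on hypothetical skewed values (nine small numbers and one outlier, not penguin data). It uses `encode="ordinal"` only so that the bin of each value is easy to count.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical skewed values: many small ones and one large outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

counts_by_strategy = {}
for strategy in ("quantile", "uniform"):
    kb = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    labels = kb.fit_transform(values).ravel().astype(int)
    # number of values that landed in each of the 5 bins
    counts_by_strategy[strategy] = np.bincount(labels, minlength=5)
    print(strategy, counts_by_strategy[strategy])
```

With `quantile`, every bin receives two values. With `uniform`, nine values share the first bin and the outlier sits alone in the last one.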

In [6]:
# 1. create a feature engineering tool
k = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')

# 2. fit with the training data (some columns of it)
k.fit(Xtrain[['Culmen Length (mm)', 'Body Mass (g)']])

# 3. transform the training data
bins = k.transform(Xtrain[['Culmen Length (mm)', 'Body Mass (g)']])
bins = pd.DataFrame(bins.todense())  # converts the sparse matrix to a dense one so that we can inspect it
# this is fine unless the data set is really big (a dense matrix needs much more memory)
bins.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [7]:
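# reset the shuffled row index to 0..n-1 so the rows line up with onehot_sex and bins in pd.concat
Xtrain = Xtrain.reset_index(drop=True)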
unmodified = Xtrain[['Flipper Length (mm)']]

In [8]:
unmodified.head()

Unnamed: 0,Flipper Length (mm)
0,228.0
1,217.0
2,190.0
3,212.0
4,205.0


In [9]:
onehot_sex.shape, bins.shape, unmodified.shape

((249, 2), (249, 10), (249, 1))

In [10]:
# we need a single DataFrame, so we concatenate the pieces column-wise (axis=1)
Xtrain_fe = pd.concat([onehot_sex, bins, unmodified], axis=1)
Xtrain_fe.shape

(249, 13)
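
Why the index had to be reset before concatenating: `pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch with hypothetical three-row frames (`shuffled` and `encoded` are made-up names):

```python
import pandas as pd

# hypothetical slice of a shuffled training set: the index is not 0..n-1
shuffled = pd.DataFrame({"flipper": [228.0, 217.0, 190.0]}, index=[42, 7, 19])
# a freshly built frame (like the encoder output) gets a default 0..n-1 index
encoded = pd.DataFrame({"is_male": [1.0, 0.0, 1.0]})

# without a reset, the index labels do not match and pandas pads with NaN
misaligned = pd.concat([encoded, shuffled], axis=1)
# with a reset, rows are matched position by position
aligned = pd.concat([encoded, shuffled.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```

The misaligned result has six rows full of `NaN` gaps, while the aligned one keeps the three original rows.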

In [11]:
# we could process this further, e.g. scaling
Xtrain_fe.head(3)

Unnamed: 0,0,1,0.1,1.1,2,3,4,5,6,7,8,9,Flipper Length (mm)
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,228.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,217.0
2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,190.0


In [12]:
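scaler = MinMaxScaler()  # scales every column independently
# recent scikit-learn versions reject DataFrames with mixed int/str column names
Xtrain_fe.columns = Xtrain_fe.columns.astype(str)
scaler.fit(Xtrain_fe)
Xtrain_scaled = scaler.transform(Xtrain_fe)  # output is a numpy array, not a df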

In [13]:
pd.DataFrame(Xtrain_scaled).head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.949153
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.762712
2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.305085
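
What `MinMaxScaler` computes per column is `(x - min) / (max - min)`, with `min` and `max` learned during `fit`. A short sketch that checks this by hand on hypothetical values (`col` and `scaler_demo` are made-up names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[172.0], [190.0], [217.0], [231.0]])  # hypothetical training values
scaler_demo = MinMaxScaler().fit(col)

# the same computation by hand, using the training minimum and maximum
by_hand = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))
matches = np.allclose(scaler_demo.transform(col), by_hand)

# a new value outside the training range is mapped outside [0, 1]
outside = scaler_demo.transform(np.array([[240.0]]))
print(matches, outside)
```

Because the minimum and maximum come from the training data only, test values can end up slightly below 0 or above 1. That is expected and not an error.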


#### Model building

In [14]:
m = LogisticRegression()
m.fit(Xtrain_scaled, ytrain)

LogisticRegression()

#### Evaluation

In [15]:
train_accuracy = m.score(Xtrain_scaled, ytrain)
train_accuracy

0.9879518072289156

### Now the same for the test data

* we already did ohc.fit()
* we only need to transform()
* NEVER FIT ANYTHING WITH TEST DATA!!!
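
A small sketch of what transform-only means for the encoder, using hypothetical data with a category (`"UNKNOWN"`) that does not occur in training. With `handle_unknown='ignore'`, as used above, the unseen category becomes a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_demo = pd.DataFrame({"Sex": ["MALE", "FEMALE", "MALE"]})
test_demo = pd.DataFrame({"Sex": ["FEMALE", "UNKNOWN"]})  # hypothetical unseen category

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_demo)                               # categories are learned from training data only
encoded_test = enc.transform(test_demo).toarray() # columns: FEMALE, MALE
print(encoded_test)
```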

In [16]:
test_ohc = ohc.transform(Xtest[['Sex']])
test_bins = k.transform(Xtest[['Culmen Length (mm)', 'Body Mass (g)']])
test_flipper = Xtest.reset_index()[['Flipper Length (mm)']]

test_ohc.shape, test_bins.shape, test_flipper.shape

((84, 2), (84, 10), (84, 1))

In [17]:
test_ohc = pd.DataFrame(test_ohc)
test_bins = pd.DataFrame(test_bins.todense())

test_ohc.shape, test_bins.shape, test_flipper.shape

((84, 2), (84, 10), (84, 1))

In [18]:
Xtest_fe = pd.concat([test_ohc, test_bins, test_flipper], axis=1)
Xtest_fe.shape

(84, 13)

In [19]:
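Xtest_fe.columns = Xtest_fe.columns.astype(str)  # recent scikit-learn versions require string column names
Xtest_scaled = scaler.transform(Xtest_fe)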
Xtest_scaled.shape

(84, 13)

In [20]:
train_accuracy

0.9879518072289156

In [21]:
test_accuracy = m.score(Xtest_scaled, ytest)
test_accuracy

0.9880952380952381

* test ≈ training and both are high: GOOD!
* test < training: **Overfitting** (the model is too flexible; take features out)
* training is low: **Underfitting** (the model is not flexible enough; add more features and/or more data)
* test > training: unusual; either a) sampling bias (luck when drawing the test set) or b) the model is heavily biased (lots of constraints added)
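
The rules of thumb above can be written as a small helper. This is only an illustrative sketch: the function name `diagnose` and both thresholds (`low`, `gap`) are arbitrary assumptions, not established cut-offs.

```python
def diagnose(train_score, test_score, low=0.7, gap=0.05):
    """Rough label for a pair of accuracy scores (thresholds are arbitrary)."""
    if train_score < low:
        return "underfitting"
    if train_score - test_score > gap:
        return "overfitting"
    if test_score - train_score > gap:
        return "test > training"
    return "good"

# the two scores computed above
print(diagnose(0.9879518072289156, 0.9880952380952381))
```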

In [24]:
# inspect the coefficients for the first class (Adelie, see m.classes_): one value per feature column
m.coef_[0].round(3)

array([-0.906,  0.906,  2.305,  1.51 , -0.27 , -1.598, -1.946,  0.571,
        0.297,  0.082, -0.103, -0.847, -1.18 ])
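
Each row of `coef_` belongs to one class (in the order of `classes_`), and each column to one input feature. A sketch of how to label them, on hypothetical synthetic data with made-up feature names `feat_a` and `feat_b`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical data: three classes, two features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(90, 2)), columns=["feat_a", "feat_b"])
y_demo = np.repeat(["Adelie", "Chinstrap", "Gentoo"], 30)
# shift each class so that the model has something to learn
X_demo.loc[y_demo == "Chinstrap", "feat_a"] += 3
X_demo.loc[y_demo == "Gentoo", "feat_b"] += 3

clf = LogisticRegression().fit(X_demo, y_demo)

# rows: classes (sorted alphabetically), columns: features
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=X_demo.columns)
print(coef_table.round(3))
```

A positive coefficient means that larger values of the feature push the prediction towards that class, a negative one pushes away from it.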

In [None]:
# Practice: implement the steps discussed above
