## Additional Learning Resources
Refer to [scikit-learn documentation](https://scikit-learn.org/stable/) and the [Pandas user guide](https://pandas.pydata.org/docs/) for detailed explanations of the functions used in this notebook.
For a quick refresher on splitting data:
```python
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


# Feature Engineering
After this encounter, you should be able to:
- apply different techniques of feature engineering to your datasets


### Why Feature Engineering?

**We want to create features that are useful for the model**

* Logistic regression (LogReg) takes multiple columns (features) as input
* LogReg assigns one coefficient per feature
* --> the number and kind of features determine the power of the model
* number: more features can give the model more information, but they only improve predictions if they carry signal
* kind: features should contain useful information and not be redundant
* all features have to be numeric; strings such as `MALE` must be encoded as numbers first

Feature Engineering is creating columns (features) that make the model better.
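
To make "one coefficient per feature" concrete, here is a minimal sketch. The names `X_toy` and `y_toy` are hypothetical: the first three rows reuse measurements from the penguin data, the last three rows and the labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples, 3 numeric features each.
# Rows 4-6 and all labels are made up for this illustration.
X_toy = np.array([
    [39.1, 18.7, 181.0],
    [39.5, 17.4, 186.0],
    [40.3, 18.0, 195.0],
    [46.1, 13.2, 211.0],
    [50.0, 16.3, 230.0],
    [48.7, 14.1, 210.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)

# For a binary problem, coef_ has shape (1, n_features):
# one learned coefficient per input column.
print(model.coef_.shape)
```

Adding a fourth column to `X_toy` would give the model a fourth coefficient to fit, which is why the choice of columns matters.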

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('penguins_simple.csv', sep=';')
df.head(3)

| Unnamed: 0 | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |


In [3]:
# convert body mass to kg
df['kg'] = df['Body Mass (g)'] / 1000

In [4]:
df.head(2)

| Unnamed: 0 | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | kg |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | 3.75 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 3.8 |
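
Unit conversion is the simplest kind of derived column. The same pattern also combines two existing columns into a new one. The sketch below rebuilds the three rows shown earlier in a small frame called `demo` and adds a ratio feature. The column name `culmen_ratio` is a hypothetical choice, and whether such a ratio helps a given model is something to check, not assume.

```python
import pandas as pd

# The culmen measurements of the three rows shown by df.head(3)
demo = pd.DataFrame({
    'Culmen Length (mm)': [39.1, 39.5, 40.3],
    'Culmen Depth (mm)': [18.7, 17.4, 18.0],
})

# Hypothetical derived feature: length relative to depth of the culmen
demo['culmen_ratio'] = demo['Culmen Length (mm)'] / demo['Culmen Depth (mm)']
print(demo)
```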


In [5]:
# encode sex as a numeric indicator: 1.0 = female, 0.0 = male
# (map returns NaN for any value that is not in the dictionary)
df['female'] = df['Sex'].map({'MALE': 0.0, 'FEMALE': 1.0})

In [6]:
df.head(2)

| Unnamed: 0 | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | kg | female |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | 3.75 | 0.0 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 3.8 | 1.0 |


In [7]:
# combine species and sex into one column
df['species_sex'] = df['Species'] + '_' + df['Sex'].str.lower()

In [8]:
df.head(3)

| Unnamed: 0 | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | kg | female | species_sex |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | 3.75 | 0.0 | Adelie_male |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 3.8 | 1.0 | Adelie_female |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 3.25 | 1.0 | Adelie_female |
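
The combined `species_sex` column still contains strings, so a model cannot use it directly. One common way to make it numeric is one-hot encoding with `pd.get_dummies`, sketched below on a small hypothetical frame called `demo`. The Gentoo row is invented so that more than one species appears.

```python
import pandas as pd

# Hypothetical mini frame; the last row is made up for illustration
demo = pd.DataFrame({
    'Species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo'],
    'Sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
})
demo['species_sex'] = demo['Species'] + '_' + demo['Sex'].str.lower()

# One indicator column per distinct combination, stored as floats
dummies = pd.get_dummies(demo['species_sex'], dtype=float)

# Attach the indicator columns to the original frame
demo = pd.concat([demo, dummies], axis=1)
print(demo.columns.tolist())
```

Each row has exactly one indicator set to 1.0, so the information in the string column is preserved in a form that logistic regression can take as input.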
