# One-Hot Encoding

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

__Converts a categorical (nominal/ordinal) column to a matrix of binary variables__

1. Execute the code 
   

2. Understand what is happening

   QUESTION: Why do we convert to multiple columns?

3. Search on internet when can be beneficial

3. Explain to the rest of the group what you did


### Import Libraries

In [43]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### Read data into a dataframe df

In [44]:
df = pd.read_csv('../data/penguins.csv')
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
337,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
338,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
339,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
340,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


### Select a categorical column/columns to one-hot-encode 

In [45]:
species_df = df[['species']]
species_df

Unnamed: 0,species
0,Adelie
1,Adelie
2,Adelie
3,Adelie
4,Adelie
...,...
337,Gentoo
338,Gentoo
339,Gentoo
340,Gentoo


### One-hot-Encode aka Dummyfying categorical columns

In [54]:
# Define the transformer
ohc_species = OneHotEncoder(drop='first', sparse=False, handle_unknown='ignore')
species_encoded = ohc_species.fit_transform(df[['species']])
species_columns = ohc_species.get_feature_names_out(['species'])



#### "Fitting" the ohc transformer
During the fit, the ohc transformer learns the unique values of the feature. This values than are stored in the "categories_" attribute of the *ohc* object.

In [55]:
ohc_species.fit(species_df)            



In [56]:
ohc_species.categories_

[array([0., 1.]), array([0., 1.]), array([0., 1.])]

#### Transforming the columns
Only during the transformation each category is converted to a column. The columns values are either 0 or 1 depending to the membership to that particular category.

In [57]:
t = ohc_species.transform(species_df)
print(t.shape)
print()
t

(342, 3)



array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [58]:
# format output as a DataFame
species_df = pd.DataFrame(species_encoded, columns=species_columns)
species_df.head()

Unnamed: 0,species_Chinstrap,species_Gentoo
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


---
__Hint__: You may have noticed that the output of the transformations with sklearn Feature Engineering methods is a numpy array. In case you want/need a DataFrame as output you can add to your code:
```python
from sklearn import set_config
set_config(transform_output="pandas")
```

### 🌶️ BONUS: include a second column

In [59]:
island_df = df[['island']]
island_df

ohc_island = OneHotEncoder(drop='first', sparse=False, handle_unknown='ignore')
island_encoded = ohc_island.fit_transform(df[['island']])
island_columns = ohc_island.get_feature_names_out(['island'])


island_df = pd.DataFrame(island_encoded, columns=island_columns)
island_new.head()



Unnamed: 0,island_Dream,island_Torgersen
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,0.0,1.0


In [60]:
new_df = pd.concat([island_df, species_df], axis=1)
new_df

Unnamed: 0,island_Dream,island_Torgersen,species_Chinstrap,species_Gentoo
0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0
...,...,...,...,...
337,0.0,0.0,0.0,1.0
338,0.0,0.0,0.0,1.0
339,0.0,0.0,0.0,1.0
340,0.0,0.0,0.0,1.0
