# Lecture 2
## Introduction to Sklearn
### Encoding in Sklearn!

<ol>
<li> Used data: Toy data (Toy data construction done in notebook)
<li> Notebook Goal: Learn how encoders work in Sklearn, what the OneHotEncoder does, and how to convert its output to dataframes
<li> Extra Exercise: Yes, see below.
</ol>

In [1]:
#Necessary Imports
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.ensemble import RandomForestClassifier

In the slides, we saw that most Sklearn estimators cannot deal with non-numerical data. When an estimator is fed non-numerical data, it first tries to convert the values to numbers itself. This can be useful when data from another source mistakenly codes an integer as a string, e.g. '1' instead of 1.
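To make the automatic conversion concrete, here is a minimal sketch with a hypothetical toy feature `x` whose integers were stored as strings. The data, the variable names, and the choice of `n_estimators=10` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the feature holds integers that were stored as strings.
# dtype=object keeps the values as plain Python strings.
X_str = pd.DataFrame({'x': ['1', '2', '3', '4']}, dtype=object)
y = [0, 0, 1, 1]

# Fitting succeeds because every string can be parsed as a number
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_str, y)
print(clf.predict(X_str))
```

The same call with strings such as 'a' or 'b' cannot be converted, which is the situation we look at next.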

Let's see what happens when we don't encode a non-numerical feature.

 (__spoiler alert__: it fails)

In [2]:
df = pd.DataFrame({
'A': ['a','b','a'],
'B': ['b','a','c']
})
df

| | A | B |
|---|---|---|
| 0 | a | b |
| 1 | b | a |
| 2 | a | c |


In [3]:
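target = df['A']
data = df[['B']]  # double brackets keep a 2D dataframe, which Sklearn expects for the features
RandomForestClassifier().fit(data, target)  # THIS FAILS: the strings in B cannot be converted to floats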

ValueError: could not convert string to float: 'b'

We now consider the OneHotEncoder. We will use it to encode the categorical variables A and B.

In [4]:
enc = OneHotEncoder()
enc.fit(df)

This successfully fits the 'enc' object to the data. We can check what the new variables are named.

In [5]:
enc.get_feature_names_out()

array(['A_a', 'A_b', 'B_a', 'B_b', 'B_c'], dtype=object)

We see that, indeed, every level of each variable gets its own indicator variable.

So far we have only fitted the encoder. Let us consider what happens when we transform our dataframe.

In [6]:
df_OHE = enc.transform(df)
df_OHE

<3x5 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

We see that after transformation, we obtain a sparse matrix.
According to the OneHotEncoder documentation, you can convert it to a regular (readable) array with the toarray() method.

Let us actually return df_OHE as a dataframe.

In [7]:
df_OHE = pd.DataFrame(data=enc.transform(df).toarray(), columns = enc.get_feature_names_out())
df_OHE

| | A_a | A_b | B_a | B_b | B_c |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


As an exercise, play around with your own data sets. In particular, what happens when you transform a dataframe that has levels which are not present in the training dataframe? Using the documentation of the OneHotEncoder, how can you change this behaviour?