### Figuring out LabelEncoder and One-hot encoder

###### Background

This is important because our features and predicted data are often skewed.

Some data, such as `location = ["paris", "paris", "tokyo", "amsterdam"]`, might not be properly processed by the model as raw strings. So, we need to make it readable for the model (sorta), i.e. turn it into numbers.

###### Before we move forward, some background on LabelEncoder and One-hot encoder

* **Label Encoding** gives numerical aliases to different classes. If I have 'eggs', 'butter' and 'milk' in my column, it will give them the codes 0, 1 and 2 (scikit-learn's `LabelEncoder` assigns them in sorted order, so 'butter' = 0, 'eggs' = 1, 'milk' = 2). The problem with this approach is that there is no relation between these three classes, yet our algorithm might consider them to be ordered, that is 0 < 1 < 2 meaning 'butter' < 'eggs' < 'milk', which doesn't make sense.

* Label Encoding seems to be fine for features with only two classes, such as *True or False*, where a made-up ordering between the classes does no harm.

* For the above reason, **One hot encoder** is used to perform *"binarization"* of the category: each class becomes its own 0/1 column, and those columns are included as features to train the model.
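
To make the two bullets above concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The `groceries` list and the `le_demo` name are made up for illustration, and the `np.eye` trick is just a hand-rolled way to show what "binarization" means, not a claim about what `OneHotEncoder` does internally.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical column, used only for illustration
groceries = ['eggs', 'butter', 'milk', 'eggs']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(groceries)

# classes_ is sorted, so the code of each class is its position in classes_
mapping = {str(name): int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)
print(codes)

# Hand-rolled binarization: row i of the identity matrix is the
# one-hot vector for integer code i
onehot_by_hand = np.eye(len(le_demo.classes_))[codes]
print(onehot_by_hand)
```

Each row of `onehot_by_hand` has exactly one 1, so no class ends up "bigger" than another.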

In [5]:
# Let's see how this works

# Import the packages
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [6]:
le = LabelEncoder()
ohe = OneHotEncoder()

data = ['big', 'bigger', 'biggest']


In [9]:
# fit the encoder to the data
le.fit(data)

LabelEncoder()

In [12]:
# view the labels
list(le.classes_)

['big', 'bigger', 'biggest']

In [13]:
testdata = ['biggest', 'big']

In [15]:
# Apply the fitted encoder to new data
le.transform(testdata)

array([2, 0], dtype=int64)
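
Two related things worth knowing about a fitted `LabelEncoder`, sketched here under the assumption that scikit-learn is installed: codes can be mapped back with `inverse_transform`, and a label that was not seen during `fit` is rejected with a `ValueError`. The `le_demo` encoder and the `'huge'` label are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder().fit(['big', 'bigger', 'biggest'])

# Map integer codes back to the original strings
decoded = le_demo.inverse_transform([2, 0])
print(list(decoded))

# A label that was not present at fit time cannot be encoded
try:
    le_demo.transform(['huge'])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print('ValueError:', err)
```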

* As we can see, big was transformed to 0 and biggest was transformed to 2

* As we mentioned earlier, our algorithm will see this as big < bigger < biggest. This happens to be true in the actual sense of 'big, bigger, biggest'.
* But if our model internally calculates the average of biggest and big, i.e. (2 + 0) / 2 = 1, does that mean it equates to bigger?
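
The averaging worry can be checked with plain arithmetic. This is only a sketch using NumPy; the integer codes are the ones `le.transform` produced above, and the one-hot vectors are built by hand with `np.eye` for illustration.

```python
import numpy as np

# Label codes from the example above: big -> 0, bigger -> 1, biggest -> 2
big, bigger, biggest = 0, 1, 2

# Averaging two label codes lands exactly on the code of a third label
mean_code = (biggest + big) / 2
print(mean_code == bigger)

# With one-hot vectors the same average is half 'big' and half 'biggest',
# which does not coincide with the vector for 'bigger'
onehot = np.eye(3)
mean_vec = (onehot[biggest] + onehot[big]) / 2
print(mean_vec)
```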



To solve that, we use OneHotEncoder

In [19]:
newdat = data.copy()

In [23]:
transdat = le.fit_transform(newdat)

In [26]:
# Gotta be a 2D array, so we reshape. Without it we get the error message:
# ValueError: Reshape your data either using array.reshape(-1, 1)
finaldat = ohe.fit_transform(transdat.reshape(-1, 1))
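
As far as I can tell, the `LabelEncoder` step is optional on newer scikit-learn: assuming version 0.20 or later, `OneHotEncoder` accepts string categories directly. A minimal sketch (the `ohe_direct`, `column` and `encoded` names are made up here):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = ['big', 'bigger', 'biggest']

# OneHotEncoder wants a 2D array of shape (n_samples, n_features)
column = np.array(data).reshape(-1, 1)

ohe_direct = OneHotEncoder()
encoded = ohe_direct.fit_transform(column)  # sparse matrix by default

# categories_ holds the learned categories for each input column
learned = list(ohe_direct.categories_[0])
print(learned)
print(encoded.toarray())
```

The order of `learned` is also the order of the output columns.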

In [27]:
# yup, we have our fitted and transformed and encoded feature
finaldat

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [39]:
finaldat.shape

(3, 3)

In [30]:
finaldat.toarray()

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [34]:
import pandas as pd

In [37]:
colnames = ['big', 'bigger', 'biggest']
new_df = pd.DataFrame(finaldat.toarray(), columns=colnames)

In [38]:
new_df

   big  bigger  biggest
0  1.0     0.0      0.0
1  0.0     1.0      0.0
2  0.0     0.0      1.0
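
Typing the column names by hand only works if they happen to match the encoder's order. A possible variation, sketched under the assumption that scikit-learn and pandas are installed, takes the names from `classes_` and uses `idxmax` to get back from the one-hot rows to the labels. The `le_demo`, `demo_df` and `recovered` names are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = ['big', 'bigger', 'biggest']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(data)
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# Use the encoder's own class order as column names instead of typing them
demo_df = pd.DataFrame(onehot, columns=le_demo.classes_)

# Each row holds a single 1.0, so idxmax along the columns recovers the label
recovered = demo_df.idxmax(axis=1).tolist()
print(recovered)
```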


In [44]:
print(new_df['big'])
print(new_df['bigger'])
print(new_df['biggest'])

0    1.0
1    0.0
2    0.0
Name: big, dtype: float64
0    0.0
1    1.0
2    0.0
Name: bigger, dtype: float64
0    0.0
1    0.0
2    1.0
Name: biggest, dtype: float64


* Apparently, you can do the one-hot encoding above with `pd.get_dummies`

In [45]:
pd.get_dummies(data)
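
For reference, a small self-contained sketch of the `pd.get_dummies` route, assuming pandas is installed. `dtype=float` is passed explicitly so the result matches the `1.0` / `0.0` frame built earlier; without it the default dtype depends on the pandas version. The `dummies` name is made up here.

```python
import pandas as pd

data = ['big', 'bigger', 'biggest']

# One column per distinct value, one row per input element
dummies = pd.get_dummies(data, dtype=float)
print(dummies)
```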