# `Label & One Hot Encoding: with SkLearn`

# <font color = red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# Purpose & Outcome:

+ Create One Hot & Label Encoding with `skLearn`
    + When to use one over the other and thoughts
+ Learn how to encode and decode your variables
    + Use single and multiple columns for encoding

`Original Data`:

| Country  	| Device 	| Version 	|
|----------	|--------	|---------	|
| Serbia   	| iPad   	| 10_3_4  	|
| Qatar    	| iPhone 	| 9_3_5   	|
| Cambodia 	| iPad   	| 12_4    	|
| Fiji     	| iPad   	| 7_1_2   	|


`Data Encoded`:  **End Result**
 
| Country  	| Device 	| Version 	| Country_LabEnc 	| Version_LabEnc 	| Device_OneHotEnc 	|
|----------	|--------	|---------	|----------------	|----------------	|------------------	|
| Serbia   	| iPad   	| 10_3_4  	| 164            	| 1              	| 0                	|
| Qatar    	| iPhone 	| 9_3_5   	| 127            	| 8              	| 1                	|
| Cambodia 	| iPad   	| 12_4    	| 29             	| 2              	| 0                	|
| Fiji     	| iPad   	| 7_1_2   	| 60             	| 7              	| 0                	|

In [None]:
import pandas as pd # DF 
from sklearn.preprocessing import LabelEncoder,OneHotEncoder  # Encoding
import faker # fake data generator
from collections import defaultdict

In [None]:
# Generate Random Countries:

g=faker.Faker()

countries=[]
for i in range(500):
    countries.append(g.country())
    


In [None]:
# Fake Devices and Device Versions:

Os_=[]
for _ in range(500):
    # Drop the ';' and split on spaces: field 0 is the device, field 4 is the OS version
    p=g.ios_platform_token().replace(';','').split(' ')
    Os_.append([p[0],p[4]])


In [4]:
# Create Dataframe:
df=pd.DataFrame(Os_,countries,columns=['Device','Version'])

# Remove 1st columns set as index which is your countries
df.reset_index(level=0, inplace=True)

# Rename the new column from index -> country
df=df.rename(columns={'index':'Country'})

In [5]:
df.head()

Unnamed: 0,Country,Device,Version
0,Anguilla,iPhone,7_1_2
1,Nicaragua,iPhone,9_3_5
2,Peru,iPad,4_2_1
3,Cote d'Ivoire,iPad,10_3_3
4,Israel,iPad,4_2_1


# Label Encoding:

+ First example, 1 column at a time
+ Second, multiple columns at once

`-------------------------------`

`Original Data`:

| Country  	| Device 	| Version 	|
|----------	|--------	|---------	|
| Serbia   	| iPad   	| 10_3_4  	|
| Qatar    	| iPhone 	| 9_3_5   	|
| Cambodia 	| iPad   	| 12_4    	|
| Fiji     	| iPad   	| 7_1_2   	|

`After encoding`: [*'Country','Version'*]

| Country  	| Device 	| Version 	| Country_LabEnc 	| Version_LabEnc 	|
|----------	|--------	|---------	|----------------	|----------------	|
| Serbia   	| iPad   	| 10_3_4  	| 164            	| 1              	|
| Qatar    	| iPhone 	| 9_3_5   	| 127            	| 8              	|
| Cambodia 	| iPad   	| 12_4    	| 29             	| 2              	|
| Fiji     	| iPad   	| 7_1_2   	| 60             	| 7              	|

In [12]:
# create label encoder
labelencoder = LabelEncoder()

# LabelEncoder sorts the unique values (alphabetically for strings) and numbers them from 0
df['Country_LabEnc']= labelencoder.fit_transform(df['Country'])
df.head()

df['Version_LabEnc']=labelencoder.fit_transform(df['Version'])
df.head()

Unnamed: 0,Country,Device,Version,Country_LabEnc,Version_LabEnc
0,Anguilla,iPhone,7_1_2,6,7
1,Nicaragua,iPhone,9_3_5,137,8
2,Peru,iPad,4_2_1,151,4
3,Cote d'Ivoire,iPad,10_3_3,45,0
4,Israel,iPad,4_2_1,93,4


# Let's Investigate the Version Label Encoding:

In [13]:
print('Version Labels:',df['Version_LabEnc'].unique())
print('Version:',df['Version'].unique())

o=list(zip(df['Version_LabEnc'].unique(),df['Version'].unique()))

sorted(o, key=lambda x: x[0])

Version Labels: [7 8 4 0 5 3 6 2 1 9]
Version: ['7_1_2' '9_3_5' '4_2_1' '10_3_3' '5_1_1' '3_1_3' '6_1_6' '12_4' '10_3_4'
 '9_3_6']


[(0, '10_3_3'),
 (1, '10_3_4'),
 (2, '12_4'),
 (3, '3_1_3'),
 (4, '4_2_1'),
 (5, '5_1_1'),
 (6, '6_1_6'),
 (7, '7_1_2'),
 (8, '9_3_5'),
 (9, '9_3_6')]
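
The same label-to-value mapping can also be read straight off the fitted encoder through its `classes_` attribute, without zipping the two columns. A minimal sketch, assuming a small hand-typed sample of versions rather than the generated DataFrame (`le`, `versions` and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample (not the notebook's random data)
versions = ['7_1_2', '9_3_5', '4_2_1', '10_3_3', '4_2_1']

le = LabelEncoder()
codes = le.fit_transform(versions)

# classes_ holds the sorted unique values; the position is the integer label
mapping = dict(enumerate(le.classes_.tolist()))

print(mapping)
print(codes)
```

Note that `'10_3_3'` gets label 0 because the strings are sorted character by character, which is the same behavior seen in the output above.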

# Reverse/Inverse Label Encoding:

In [14]:
print(labelencoder.fit_transform(df['Version'])[:5])

labelencoder.inverse_transform(labelencoder.fit_transform(df['Version']))[:5]

[7 8 4 0 4]


array(['7_1_2', '9_3_5', '4_2_1', '10_3_3', '4_2_1'], dtype=object)

# <font color=red>Consideration</font>:

+ What if we treat `Version` as `ordinal data rather than a plain label`?
    + Then the data has an order and we need to reassess the encoding! That appears to be the case here, but it all depends on how we want to analyze our data.
    
+ `LabelEncoder` does not keep a meaningful order: it sorts the unique values (alphabetically for strings) and numbers them in that sorted order. That is a problem for `Version`, so we need to adjust the data to get labels assigned in the order we want. 
    + Even after switching encodings we can still run into issues and may need to process the column further as One Hot. Just bear this in mind when you are figuring out how to handle your data. 
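
One way to get labels in a specific order is to hand that order to the encoder yourself. The sketch below uses scikit-learn's `OrdinalEncoder` with an explicit `categories` list. It is an alternative to the notebook's approach, and the four versions and their ranking are illustrative sample values chosen by hand.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative sample; this list is the ranking we want (oldest -> newest)
version_order = ['7_1_2', '9_3_5', '10_3_4', '12_4']
sample = pd.DataFrame({'Version': ['10_3_4', '9_3_5', '12_4', '7_1_2']})

# categories takes one list per column, in the order the codes should follow
ord_enc = OrdinalEncoder(categories=[version_order])
ordinal_codes = ord_enc.fit_transform(sample[['Version']]).ravel()

print(ordinal_codes)
```

The codes follow the position of each value in `version_order`, so `'12_4'` ends up with the highest code instead of landing between `'10_3_4'` and `'3_1_3'`.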

In [15]:
# If we are to consider the versions let's first evaluate something:

print(10_3_1>10_3_2)
print('Unexpected Result:',10_3_1>12_4)
print(10.3>10.2)
print('Unexpected Result:','10_3'>'9_2')
print('10_3_1'>'10_3_2')

# Why the unexpected results?
# A bare 10_3_1 is an integer literal: Python ignores the underscores,
# so 10_3_1 > 12_4 is really 1031 > 124.
# Strings are compared character by character, so '10_3' > '9_2'
# compares '1' with '9' instead of 10 with 9.

False
Unexpected Result: True
True
Unexpected Result: False
False

In [16]:
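# Comparing numeric strings after converting them to int
# (int() ignores the underscores too, so '10_3_1' becomes 1031)

int('10_3_1')>int('10_3_2')
int('10_2_1')>int('9_2_1') # only this last expression is displayed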

True

# `How Should We Proceed?`

+ Brief explanation and code at the end. Let's get into One Hot Encoding first

`--------------------------`

# One Hot Encoding:

+ `handle_unknown`: by default (`'error'`) the encoder raises an error if an unknown category shows up during transform. Setting it to `'ignore'` avoids the error and encodes the unknown category as all zeros across its one hot columns. 
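
To see what `handle_unknown='ignore'` does in practice, here is a minimal sketch. The `train` and `new` frames are made-up samples, and `'iPod'` stands in for a category the encoder never saw during fit.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Device': ['iPad', 'iPhone', 'iPad']})
new = pd.DataFrame({'Device': ['iPhone', 'iPod']})  # 'iPod' was not seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The known category gets its usual column; the unknown one becomes a row of zeros
encoded = enc.transform(new).toarray()

print(enc.categories_)
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise instead of returning the all-zero row.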

In [17]:
# Create OneHot Encoder: 
OneHt_enc = OneHotEncoder(handle_unknown='ignore')

# One hot encode 'Device': one column per category, one row per entry
oneh_arry=pd.DataFrame(OneHt_enc.fit_transform(df[['Device']]).toarray())

# Join onto the original DF by index
df.join(oneh_arry).head()

# The two new columns (0, 1) are iPad and iPhone; with more categories you get a wider matrix

Unnamed: 0,Country,Device,Version,Country_LabEnc,Version_LabEnc,0,1
0,Anguilla,iPhone,7_1_2,6,7,0.0,1.0
1,Nicaragua,iPhone,9_3_5,137,8,0.0,1.0
2,Peru,iPad,4_2_1,151,4,1.0,0.0
3,Cote d'Ivoire,iPad,10_3_3,45,0,1.0,0.0
4,Israel,iPad,4_2_1,93,4,1.0,0.0
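
The joined columns above are just named `0` and `1`. A possible refinement is to build readable names from the encoder's `categories_` attribute. This is a sketch on a tiny made-up sample, and the `Device_` prefix is only a naming choice for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

sample = pd.DataFrame({'Device': ['iPhone', 'iPad', 'iPad']})

enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(sample[['Device']]).toarray()

# categories_[0] lists the categories of the first (only) encoded column, in column order
col_names = [f'Device_{cat}' for cat in enc.categories_[0]]
onehot_df = pd.DataFrame(onehot, columns=col_names, index=sample.index)

labeled = sample.join(onehot_df)
print(labeled)
```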


# Inverse One Hot Encoding:

In [18]:
print(OneHt_enc.fit_transform(df[['Device']]).toarray()[:5])


encode_=OneHt_enc.fit_transform(df[['Device']])

# Transform back to original Devices:
OneHt_enc.inverse_transform(encode_)[:5]

[[0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]


array([['iPhone'],
       ['iPhone'],
       ['iPad'],
       ['iPad'],
       ['iPad']], dtype=object)

`----------------------------------------`

# Encode Multiple Columns at Once:

In [19]:
df_=df.iloc[:,:3]

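# Label encode every column (the one encoder is refit on each column):
enc_all_cols=df_.apply(LabelEncoder().fit_transform)


# Concatenate: but the encoded columns repeat the original names
df_w_enc=pd.concat([df_,enc_all_cols],axis=1).head()

# Rename the new encoded columns by assigning a full list of names
# (Device has two categories, so its label matches a single one hot column):
df_w_enc.columns=list(df_.columns)+['Country_LabEnc','Device_OneHt','Version_LabEnc']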

df_w_enc

Unnamed: 0,Country,Device,Version,Country_LabEnc,Device_OneHt,Version_LabEnc
0,Anguilla,iPhone,7_1_2,6,1,7
1,Nicaragua,iPhone,9_3_5,137,1,8
2,Peru,iPad,4_2_1,151,0,4
3,Cote d'Ivoire,iPad,10_3_3,45,0,0
4,Israel,iPad,4_2_1,93,0,4


# Example 02: Multiple Columns Encoded

In [20]:
# Encode:

from collections import defaultdict
d = defaultdict(LabelEncoder)

fit = df_.apply(lambda x: d[x.name].fit_transform(x))
fit.head()

Unnamed: 0,Country,Device,Version
0,6,1,7
1,137,1,8
2,151,0,4
3,45,0,0
4,93,0,4


In [21]:
# Inverse the encoded (AKA decode)
fit.apply(lambda x: d[x.name].inverse_transform(x)).head()

Unnamed: 0,Country,Device,Version
0,Anguilla,iPhone,7_1_2
1,Nicaragua,iPhone,9_3_5
2,Peru,iPad,4_2_1
3,Cote d'Ivoire,iPad,10_3_3
4,Israel,iPad,4_2_1
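
As an alternative to keeping a dictionary of `LabelEncoder` objects, scikit-learn's `OrdinalEncoder` accepts several columns at once and can decode them again. The sketch below runs on a small made-up frame and is not the method used above. Note that it returns floats rather than ints.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sample = pd.DataFrame({
    'Country': ['Peru', 'Israel', 'Peru'],
    'Device': ['iPad', 'iPhone', 'iPad'],
    'Version': ['4_2_1', '9_3_5', '4_2_1'],
})

multi_enc = OrdinalEncoder()

# One fit covers every column; each column gets its own sorted categories
encoded = multi_enc.fit_transform(sample)

# Decode back to the original strings
decoded = pd.DataFrame(multi_enc.inverse_transform(encoded), columns=sample.columns)

print(encoded)
print(decoded)
```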


# Lastly, How to deal with Ordinal Data:

+ If we consider the `Version` column to have a ranking or order, then we need to change how it is encoded. Since the versions are strings of numbers separated by underscores, the code needs a small variation.

In [33]:
k=[] # store string num literals as ints
for i in df['Version']:
    k.append(int(i))

# Encode once, after the loop has finished
u=labelencoder.fit_transform(k) # encode after converting values to int
uu=labelencoder.fit_transform(df['Version']) # encode strings as they are
    
print(u[:5])
print(df['Version'][:5])

print(int('12_4')>int('10_3_4'))
print(uu[:5])
k[:5]

[5 6 2 8 2]
0     7_1_2
1     9_3_5
2     4_2_1
3    10_3_3
4     4_2_1
Name: Version, dtype: object
False
[7 8 4 0 4]


[712, 935, 421, 1033, 421]

In [28]:
print('Almost: but not quite',float('10_3_4'),float('12_4'))
print('Need to Refine________________')

num_lit_conv=[]

# '10_3_4' -> '10.3_4' -> '10.34': the first underscore becomes the decimal point, the rest are dropped
for i in df['Version']:
    num_lit_conv.append(float(i.replace('_','.',1).replace('_','')))

print('Correct Format to Order:',sorted(set(num_lit_conv)))

Almost: but not quite 1034.0 124.0
Need to Refine________________
Correct Format to Order: [3.13, 4.21, 5.11, 6.16, 7.12, 9.35, 9.36, 10.33, 10.34, 12.4]


# `Therefore`: If we wanted to refine this (literal numeric)

+ we would have to do some formatting in order to correctly *map* as `Ordinal`
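
The float conversion orders this dataset correctly, but it could misplace versions whose later parts have more than one digit. A hypothetical `'10_10_1'` would become 10.101 and sort below `'10_2'` (10.2). A sketch of one possible refinement is to compare each part as an integer. `version_key` is an illustrative helper and the sample list is made up.

```python
def version_key(version):
    """Split '10_3_4' into (10, 3, 4) so each part compares numerically."""
    return tuple(int(part) for part in version.split('_'))

# Illustrative sample (not the notebook's random data)
versions = ['9_3_5', '10_3_4', '12_4', '7_1_2', '10_3_3']

# Rank the unique versions, then map every entry to its rank
ordered = sorted(set(versions), key=version_key)
rank = {v: i for i, v in enumerate(ordered)}
ordinal_labels = [rank[v] for v in versions]

print(ordered)
print(ordinal_labels)
```

Tuples compare element by element, so `(12, 4)` ranks above `(10, 3, 4)` without any string or decimal tricks.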

# <font color=red>`Future Work: further encoding examples with memory usage considerations` Look for this video soon</font>

In [29]:
print(labelencoder.fit_transform(num_lit_conv)[:7])
print(num_lit_conv[:7])
import numpy as np

np.unique(labelencoder.fit_transform(num_lit_conv))

# zip list to compare label and value:
lst_enc_version=list(zip(labelencoder.fit_transform(num_lit_conv),num_lit_conv))

# Correctly, labeled: and checking as a unique set to verify
set(sorted(lst_enc_version, key=lambda x: x[0]))

[4 5 1 7 1 2 1]
[7.12, 9.35, 4.21, 10.33, 4.21, 5.11, 4.21]


{(0, 3.13),
 (1, 4.21),
 (2, 5.11),
 (3, 6.16),
 (4, 7.12),
 (5, 9.35),
 (6, 9.36),
 (7, 10.33),
 (8, 10.34),
 (9, 12.4)}

In [30]:
# Bring it all together now:

correct_lab_enc_ord_vers=labelencoder.fit_transform(num_lit_conv)

# New Labels for Version 
df['Version_LabEnc_Fin']=correct_lab_enc_ord_vers

# Converted String -> float for encoding as Ordinal
df['Converted_litNum']=num_lit_conv

df.head()

Unnamed: 0,Country,Device,Version,Country_LabEnc,Version_LabEnc,Version_LabEnc_Fin,Converted_litNum
0,Anguilla,iPhone,7_1_2,6,7,4,7.12
1,Nicaragua,iPhone,9_3_5,137,8,5,9.35
2,Peru,iPad,4_2_1,151,4,1,4.21
3,Cote d'Ivoire,iPad,10_3_3,45,0,7,10.33
4,Israel,iPad,4_2_1,93,4,1,4.21



`------------------------`

# Citations & Help:

# ◔̯◔

https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn

https://stackoverflow.com/questions/38101009/changing-multiple-column-names-but-not-all-of-them-pandas-python

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

https://medium.com/analytics-vidhya/types-of-categorical-data-encoding-schemes-a5bbeb4ba02b

https://towardsdatascience.com/categorical-encoding-techniques-93ebd18e1f24