# Label Encoding & One Hot Encoding with Python:

# <font color=red>Mr Fugu Data Science</font>

# (◕‿◕✿)

# Purpose & Outcome:
+ Convert categorical variables to numeric: two examples each of *Label Encoding* and *One Hot Encoding*.
    + Explain some use cases and pitfalls.
    
+ Create Sparse Matrices

In [351]:
import pandas as pd
import random
import scipy
from sklearn import preprocessing # label encoding for 1 example
import numpy as np                # create arrays
import collections as cc          # default dict(list)

# Ex. 1) Working with a Column of Lists of Strings with Varying Lengths:

In [1282]:
random.seed(1276) # repeatability of my random data: 

moods=['friendly','kind','forgetful','spacey','humble','irascible','loquacious','greedy']

rnd_ints=random.randint(1,len(moods)-2)

random.sample(moods,rnd_ints)

Lst_diffSize_Nestedmoods=[]
for i in range(0,100):
    rnd_ints=random.randint(1,len(moods)-2) # list length between 1 and len(moods)-2 (at most 6)
    Lst_diffSize_Nestedmoods.append(random.sample(moods,rnd_ints)) 
Lst_diffSize_Nestedmoods[:3]

# random.choice(random.sample(moods,rnd_ints))

[['forgetful', 'humble', 'friendly', 'greedy', 'spacey'],
 ['humble', 'loquacious'],
 ['kind', 'loquacious', 'spacey', 'greedy']]

In [1283]:
moods_ppl=pd.DataFrame(pd.Series(Lst_diffSize_Nestedmoods),columns=['Moods'])
moods_ppl.head()

|   | Moods |
|---|-------|
| 0 | [forgetful, humble, friendly, greedy, spacey] |
| 1 | [humble, loquacious] |
| 2 | [kind, loquacious, spacey, greedy] |
| 3 | [friendly] |
| 4 | [irascible, humble, forgetful] |


# Label Encoding:


| Label Name   	| Label Code 	|
|--------------	|------------	|
| friendly     	| 0          	|
| kind         	| 1          	|
| forgetful    	| 2          	|
| spacey       	| 3          	|
| humble       	| 4          	|
| irascible    	| 5          	|
| loquacious   	| 6          	|
| greedy       	| 7          	|


Label encoding is compact and useful, but it introduces a problem of its own.

`Caution`:

The codes are numerical placeholders, not ranks. If you run an analysis that treats them as ordered quantities, your data may appear to show a relationship or precedence between categories that does not exist.
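
To make the caution concrete, here is a minimal sketch that builds the 0-based codes from the table above with a plain dictionary and then shows how arithmetic on the codes produces a meaningless answer. The names `label_codes`, `encoded` and `mean_code` are illustrative only and are not part of the notebook.

```python
moods = ['friendly', 'kind', 'forgetful', 'spacey',
         'humble', 'irascible', 'loquacious', 'greedy']

# Map each label to its position in the list (0-based, as in the table above).
label_codes = {name: code for code, name in enumerate(moods)}

encoded = [label_codes[m] for m in ['humble', 'loquacious']]
print(encoded)  # [4, 6]

# The codes are placeholders only, yet arithmetic treats them as quantities:
# the "average" of humble (4) and loquacious (6) is 5, which decodes to
# 'irascible', a result with no real meaning.
mean_code = sum(encoded) / len(encoded)
print(mean_code, moods[int(mean_code)])  # 5.0 irascible
```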


# One Hot:

Ex.) snippet:

| friendly 	| kind 	| forgetful 	| spacey 	| humble 	| irascible 	| loquacious 	| greedy 	|
|----------	|------	|-----------	|--------	|--------	|-----------	|-------------	|--------	|
| 0        	| 0    	| 0         	| 0      	| 1      	| 0         	| 1           	| 1      	|
| 0        	| 0    	| 1         	| 0      	| 1      	| 0         	| 1           	| 1      	|
| 0        	| 0    	| 0         	| 0      	| 0      	| 0         	| 1           	| 1      	|

One Hot encoding splits a categorical variable into one new column per category and marks each row with 1 (present) or 0 (absent). Think of it as creating `Dummy Variables`.

`Caution`:

+ Watch out for the `dummy variable trap`: when each row belongs to exactly one category, any one dummy column can be predicted perfectly from the remaining ones. This is a case of `multicollinearity` between variables.
+ If the variables are not independent, using `linear` regression or `logistic` regression can be problematic.
    + Use the `Variance Inflation Factor` to check for multicollinearity.
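
A small sketch of the trap, assuming a hypothetical single-label column (one mood per person, which is not the notebook's list-valued data). It shows that the full set of dummy columns always sums to 1 per row, and that `pd.get_dummies(..., drop_first=True)` drops one column to break that exact dependence.

```python
import pandas as pd

# Hypothetical data: exactly one mood per person (not the notebook's data).
one_mood = pd.Series(['kind', 'greedy', 'humble', 'kind'], name='mood')

full = pd.get_dummies(one_mood, dtype=int)
# Every row sums to 1, so any one column equals 1 minus the sum of the others.
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]

# drop_first=True removes the first category (alphabetically 'greedy'),
# which then acts as the baseline.
reduced = pd.get_dummies(one_mood, drop_first=True, dtype=int)
print(list(reduced.columns))  # ['humble', 'kind']
```

In the list-valued `Moods` column used in this notebook a row can hold several moods at once, so the rows do not sum to a constant and the exact dependence shown here need not occur; checking for multicollinearity is still worthwhile.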

In [1276]:
# Labels with index used as encoding: 

lbl_idx=[]
for i in range(len(moods)):
    lbl_idx.append((i+1,moods[i]))
lbl_idx

[(1, 'friendly'),
 (2, 'kind'),
 (3, 'forgetful'),
 (4, 'spacey'),
 (5, 'humble'),
 (6, 'irascible'),
 (7, 'loquacious'),
 (8, 'greedy')]

# Custom One-Hot Encoding: List of Lists in a DataFrame

In [1284]:
for g in moods:
    moods_ppl[g] = moods_ppl.Moods.map( lambda x: g in x )
    

moods_ppl.head()


|   | Moods | friendly | kind | forgetful | spacey | humble | irascible | loquacious | greedy |
|---|-------|----------|------|-----------|--------|--------|-----------|------------|--------|
| 0 | [forgetful, humble, friendly, greedy, spacey] | True | False | True | True | True | False | False | True |
| 1 | [humble, loquacious] | False | False | False | False | True | False | True | False |
| 2 | [kind, loquacious, spacey, greedy] | False | True | False | True | False | False | True | True |
| 3 | [friendly] | True | False | False | False | False | False | False | False |
| 4 | [irascible, humble, forgetful] | False | False | True | False | True | True | False | False |


In [1285]:
# One-Hot Encoding
pd.concat([moods_ppl.Moods,moods_ppl.iloc[:,1:].astype(int)],axis=1).head()


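|   | Moods | friendly | kind | forgetful | spacey | humble | irascible | loquacious | greedy |
|---|-------|----------|------|-----------|--------|--------|-----------|------------|--------|
| 0 | [forgetful, humble, friendly, greedy, spacey] | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | [humble, loquacious] | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | [kind, loquacious, spacey, greedy] | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | [friendly] | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | [irascible, humble, forgetful] | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |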


# `Alternate Way`: One Hot Encode using `.str.get_dummies`

+ Nested Lists (different lengths)

In [1288]:
'''
[';'.join(i) for i in moods_ppl.Moods]: creating a list of strings sep by (;)
storing our moods by row

.str.get_dummies(';') : creates dummy variables (1,0) based on some sep. value
'''

OneHt=pd.Series([';'.join(i) for i in moods_ppl.Moods]).str.get_dummies(';')

pd.concat([moods_ppl.Moods,OneHt],axis=1).head() # pay attn to axis=1 NOT 0


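|   | Moods | forgetful | friendly | greedy | humble | irascible | kind | loquacious | spacey |
|---|-------|-----------|----------|--------|--------|-----------|------|------------|--------|
| 0 | [forgetful, humble, friendly, greedy, spacey] | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | [humble, loquacious] | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | [kind, loquacious, spacey, greedy] | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 3 | [friendly] | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | [irascible, humble, forgetful] | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |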
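
Another option, not used in the original notebook, is sklearn's `MultiLabelBinarizer`, which one-hot encodes a list of lists directly. The sketch below uses only the first three rows shown earlier and passes `classes=sorted(moods)` (an assumption made here so the columns come out in the same alphabetical order as the `.str.get_dummies` output above).

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

moods = ['friendly', 'kind', 'forgetful', 'spacey',
         'humble', 'irascible', 'loquacious', 'greedy']

# First three rows of the Moods column shown earlier.
rows = [['forgetful', 'humble', 'friendly', 'greedy', 'spacey'],
        ['humble', 'loquacious'],
        ['kind', 'loquacious', 'spacey', 'greedy']]

# Fixing the classes keeps a column for every mood, even ones absent
# from this small sample (e.g. 'irascible').
mlb = MultiLabelBinarizer(classes=sorted(moods))
one_hot = pd.DataFrame(mlb.fit_transform(rows), columns=mlb.classes_)

print(one_hot.iloc[0].tolist())  # [1, 1, 1, 1, 0, 0, 0, 1]
```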


`-----------------------------`

# Label Encoding Nested Lists: By Hand

In [1290]:
d=[] # per row: (flag, code) tuples pairing each boolean column with its code 1..8
for i in moods_ppl.iloc[:,1:].values:
    d.append(list(zip(i,range(1,9))))

fg=[]  # flat list of label codes in row order (0 where the mood is absent)
for ii in range(len(d)):
    for i in d[ii]:
        if not i[0]:
            fg.append(0)
        else:
            fg.append(i[1])


In [1272]:
# Split the flat list into rows of 8 entries each to match the columns
label_enc_=pd.DataFrame([fg[x:x+8] for x in range(0, len(fg),8)],columns=moods)

# adding new columns:
pd.concat([moods_ppl.Moods,label_enc_],axis=1).head()

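|   | Moods | friendly | kind | forgetful | spacey | humble | irascible | loquacious | greedy |
|---|-------|----------|------|-----------|--------|--------|-----------|------------|--------|
| 0 | [forgetful, humble, friendly, greedy, spacey] | 1 | 0 | 3 | 4 | 5 | 0 | 0 | 8 |
| 1 | [humble, loquacious] | 0 | 0 | 0 | 0 | 5 | 0 | 7 | 0 |
| 2 | [kind, loquacious, spacey, greedy] | 0 | 2 | 0 | 4 | 0 | 0 | 7 | 8 |
| 3 | [friendly] | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | [irascible, humble, forgetful] | 0 | 0 | 3 | 0 | 5 | 6 | 0 | 0 |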
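
The same by-hand result can also be reached without explicit loops. This is only a sketch of one alternative: it multiplies the boolean membership matrix by a row of codes 1..8 using NumPy broadcasting. The two rows of flags are copied from the custom one-hot output earlier; `flags`, `codes` and `label_enc` are illustrative names.

```python
import numpy as np
import pandas as pd

moods = ['friendly', 'kind', 'forgetful', 'spacey',
         'humble', 'irascible', 'loquacious', 'greedy']

# Boolean membership flags for rows 0 and 1 of the custom one-hot table.
flags = np.array([
    [True, False, True, True, True, False, False, True],
    [False, False, False, False, True, False, True, False],
])

codes = np.arange(1, len(moods) + 1)      # one code per column: 1..8
# Broadcasting multiplies each column by its code; absent moods stay 0.
label_enc = pd.DataFrame(flags.astype(int) * codes, columns=moods)

print(label_enc.values.tolist())
# [[1, 0, 3, 4, 5, 0, 0, 8], [0, 0, 0, 0, 5, 0, 7, 0]]
```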


`---------------------------------`

# Label Encoding: With skLearn

+ Right out of the gate sklearn looks promising, but there is a problem. `LabelEncoder` only maps the values it is given to codes, so a row with three moods becomes an array of three codes. For an analysis we need a fixed matrix layout instead: one column per label, with a (0) reserved for absent values and the correct label number filled in where the mood is present.

+ The output is almost correct, but how do we deal with the zeros now? sklearn's codes start at 0, which collides with the 0 we reserve for absent values.

In [1273]:
# from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

label_encoder.fit(moods)

moods_ppl['Moods'].apply(lambda x:label_encoder.transform(x))

label_encoder.transform(moods) # indices by alphabetical order

array([1, 5, 0, 7, 3, 4, 6, 2])

In [1274]:
# Map the labels for each value found in row:
vb=moods_ppl['Moods'].apply(lambda x:label_encoder.transform(x))

# sklearn's label codes start at zero, which would collide with the zero
# used for padding, so shift every code by 1 (labels become 1..8).
ddd=[]
for i in vb:
    ddd.append(i+1)

# NOTE: padarray is not defined anywhere in the original notebook. This is an
# assumed reconstruction that right-pads with zeros, matching the output below.
def padarray(arr, size):
    return np.pad(arr, (0, size-len(arr)), mode='constant')

nn=[]
for i in ddd:
    l=padarray(i,8) # padding each array so they have 8 values/array to match labels
    nn.append(l)
    
nn[:3]


[array([1, 4, 2, 3, 8, 0, 0, 0]),
 array([4, 7, 0, 0, 0, 0, 0, 0]),
 array([6, 7, 8, 3, 0, 0, 0, 0])]

# <font color=red>There is quite a bit of change done here</font>:

+ `Since sklearn assigns labels in alphabetical order, you have to go back and change some of the code to account for this.`

+ `The columns are now in a different (alphabetical) order, and the label values reflect the new column positions. The data itself is still correct, so nothing else is affected.`

In [1277]:
moods_=sorted(moods) # sorting so I have same format as sklearn to match DF
k=[]
for i in range(len(moods_)):
    k.append((i+1,moods_[i])) # adding 1 so index is Not zero starting point

reordered_=[] # other way to save data by tuples (key, val)
labl_ppl=cc.defaultdict(list) # store keys: values[list]

for i in nn:
    for j in k:
        if j[0] in set(i): 
            
            reordered_.append([j[1],j[0]])
            labl_ppl[j[1]].append(j[0])
        else:
            reordered_.append([j[1],0])
            labl_ppl[j[1]].append(0)

pd.DataFrame(labl_ppl).head() # using dict list 


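|   | forgetful | friendly | greedy | humble | irascible | kind | loquacious | spacey |
|---|-----------|----------|--------|--------|-----------|------|------------|--------|
| 0 | 1 | 2 | 3 | 4 | 0 | 0 | 0 | 8 |
| 1 | 0 | 0 | 0 | 4 | 0 | 0 | 7 | 0 |
| 2 | 0 | 0 | 3 | 0 | 0 | 6 | 7 | 8 |
| 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 4 | 5 | 0 | 0 | 0 |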


In [None]:
# Sparse matrix: Quick example

# sorry this video took longer to make than expected.

In [1281]:
from scipy.sparse import csr_matrix

# for i in csr_matrix(nn):
#     print(i)
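
Since the sparse-matrix cell above is left commented out, here is a minimal sketch of what it might look like. It assumes the first three zero-padded arrays from `nn[:3]` as input (copied by hand into `dense`) and uses `scipy.sparse.csr_matrix`, which stores only the non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# First three padded label arrays, as printed for nn[:3] earlier.
dense = np.array([[1, 4, 2, 3, 8, 0, 0, 0],
                  [4, 7, 0, 0, 0, 0, 0, 0],
                  [6, 7, 8, 3, 0, 0, 0, 0]])

sparse = csr_matrix(dense)

print(sparse.shape)  # (3, 8)
print(sparse.nnz)    # 11 non-zero values are stored; the 13 zeros are not

# Converting back to a dense array recovers the original values.
print((sparse.toarray() == dense).all())  # True
```

The saving is small for a 3 x 8 example, but for wide one-hot matrices with mostly zeros the compressed format can reduce memory use considerably.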


`--------------------`

# Citations & Help:

# ◔̯◔

https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

https://stackoverflow.com/questions/52189126/how-to-elegantly-one-hot-encode-a-series-of-lists-in-pandas

https://stackoverflow.com/questions/38443049/how-to-convert-a-nested-list-for-the-use-in-sklearn

https://stackoverflow.com/questions/15890743/how-can-you-split-a-list-every-x-elements-and-add-those-x-amount-of-elements-to