### Dealing with categorical variables

One hot encode
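- each category becomes its own binary column: 1 if the entry has that category, 0 otherwise
- dimensionality grows linearly with the number of distinct categories, so high-cardinality features produce wide, sparse data (curse of dimensionality)
- the feature is no longer a single column; its information is spread across many columns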
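
As a minimal sketch of the column growth, the toy series below (illustrative values, not taken from `cars.csv`) is one hot encoded with `pd.get_dummies`. The number of indicator columns equals the number of distinct categories, and each row has exactly one active column.

```python
import pandas as pd

# Toy column with four distinct categories (illustrative data only)
colors = pd.Series(["Black", "Silver", "Grey", "Black", "Red"], name="color")

# One indicator column is created per distinct category
one_hot = pd.get_dummies(colors, prefix="color")

n_categories = colors.nunique()
n_columns = one_hot.shape[1]
print(n_categories, n_columns)
```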

Label encode
- each category is mapped to an integer, e.g. `0, 1, 2, 3`
- is an ordinal encoding, so it implies an order between categories even if the feature is not ordinal
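
The implied order can be seen in a small sketch. It uses `pd.Categorical` as a stand-in for a label encoder (an assumption made here to keep the example pandas-only); like `LabelEncoder`, it numbers the categories in sorted order, so a model would see Black < Blue < Grey even though colors have no natural ranking.

```python
import pandas as pd

# Illustrative values only
colors = pd.Series(["Grey", "Black", "Blue", "Black"])

# Categories are inferred in sorted order, then replaced by their position
categorical = pd.Categorical(colors)
codes = categorical.codes

# Mapping from category to integer code
mapping = {category: code for code, category in enumerate(categorical.categories)}
print(mapping)
print(list(codes))
```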

Mean encode
- replace each category with the average of the target for that category, computed on the training data
- could also use other statistics like median, quantiles or variance
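
A minimal sketch of encoding with several statistics at once, assuming a toy `train` frame with hypothetical `store` and `sales` columns. Each statistic of the target becomes its own encoded feature.

```python
import pandas as pd

# Toy training data (hypothetical column names and values)
train = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "sales": [100, 200, 600, 10, 20, 30],
})

# Per-category statistics of the target, computed on the training data only
stats = train.groupby("store")["sales"].agg(["mean", "median", "var"])

# Map each statistic back onto the rows as a separate feature
encoded = train.assign(
    store_mean=train["store"].map(stats["mean"]),
    store_median=train["store"].map(stats["median"]),
    store_var=train["store"].map(stats["var"]),
)
print(encoded)
```

The median is less sensitive than the mean to outliers such as the 600 in store A.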

Target encode
- each category is replaced with a blend of the posterior probability of the target given that category and the prior probability of the target over all the training data.
- the encodings are fitted on the training data only; they are not re-computed on the test data.
- We usually save the encodings obtained from the training set and use them to encode the same features in the test set.
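
The blend and the train/test split can be sketched by hand. The weight `n / (n + m)` used below is one common smoothing choice and is an assumption of this sketch, not necessarily the weighting `ce.TargetEncoder` applies; the helper names, the parameter `m`, and the toy columns are all hypothetical.

```python
import pandas as pd

def fit_target_encoding(categories, target, m=2.0):
    """Learn per-category encodings and the prior from training data.

    m is a hypothetical smoothing strength: a larger m pulls rare
    categories harder toward the prior.
    """
    prior = target.mean()
    grouped = target.groupby(categories).agg(["mean", "count"])
    weight = grouped["count"] / (grouped["count"] + m)
    # Blend of the per-category mean (posterior) and the overall mean (prior)
    encoding = weight * grouped["mean"] + (1 - weight) * prior
    return encoding, prior

def apply_target_encoding(categories, encoding, prior):
    # Reuse the saved encodings; unseen categories fall back to the prior
    return categories.map(encoding).fillna(prior)

train = pd.DataFrame({
    "seat": ["Leather", "Leather", "Leather", "Cloth"],
    "sold": [1, 1, 0, 0],
})
test = pd.DataFrame({"seat": ["Cloth", "Leather", "Vinyl"]})

encoding, prior = fit_target_encoding(train["seat"], train["sold"])
test_encoded = apply_target_encoding(test["seat"], encoding, prior)
print(test_encoded)
```

Nothing is computed from the test rows: `Vinyl`, which never appears in training, simply receives the prior.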

BaseN Encoding
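- In binary encoding, we convert each category's integer label into binary, i.e. base 2, with one column per digit.
- BaseN generalizes this: the integers can be converted using any base.
- ideal for columns with a large number of categories (high cardinality), since far fewer columns are needed than with one hot encoding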
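
The digit conversion itself can be sketched in plain Python. The helper names `to_base_n` and `columns_needed` are hypothetical, and the sketch only illustrates the arithmetic; it does not claim to reproduce how `category_encoders` numbers categories internally.

```python
def to_base_n(value, base, width):
    """Digits of a non-negative integer in the given base, zero-padded on the left.

    width must be large enough to hold the value.
    """
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    return digits[::-1]

def columns_needed(n_values, base):
    """Smallest number of digits that can represent n_values distinct integers."""
    width = 1
    while base ** width < n_values:
        width += 1
    return width

print(to_base_n(5, base=2, width=3))   # binary digits of 5
print(to_base_n(5, base=4, width=3))   # base-4 digits of 5
print(columns_needed(1000, base=2), columns_needed(1000, base=4))
```

For 1000 distinct values, base 2 needs 10 columns and base 4 needs 5, compared with 1000 columns for one hot encoding.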

In [3]:
import category_encoders as ce
import pandas as pd
import numpy as np 

data = pd.read_csv('./data/cars.csv',index_col=0)
data.head()

Unnamed: 0,Foreign/Local Used,color,wheel drive,Automation,seat-make,price,description,make-year,manufacturer
0,Foreign Used,Black,4,Automatic,Leather,17500000,2014 Lexus LX,2014,Lexus
1,Foreign Used,Black,4,Automatic,Leather,13000000,2012 Toyota Sequoia,2012,Toyota
2,Foreign Used,Blue,4,Automatic,Cloth,6500000,2007 Toyota FJ CRUISER,2007,Toyota
3,Foreign Used,Black,4,Automatic,Leather,4700000,2005 Lexus GX,2005,Lexus
4,Foreign Used,Grey,4,Automatic,Leather,3800000,2005 Toyota 4-Runner,2008,Toyota


In [9]:
data.color.value_counts()

Black         390
Silver        243
Grey          131
Red            84
White          81
Blue           76
Gold           57
Maroon         47
Dark Grey      32
Dark Blue      25
Dark Green     13
Green           8
Other           3
Name: color, dtype: int64

In [11]:
# Label encode the color column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # instantiate the label encoder
data['color'] = le.fit_transform(data['color'])

# Use a separate LabelEncoder for each column, since each encoder stores the mapping for one column only

In [12]:
data.head()

Unnamed: 0,Foreign/Local Used,color,wheel drive,Automation,seat-make,price,description,make-year,manufacturer
0,Foreign Used,0,4,Automatic,Leather,17500000,2014 Lexus LX,2014,Lexus
1,Foreign Used,0,4,Automatic,Leather,13000000,2012 Toyota Sequoia,2012,Toyota
2,Foreign Used,1,4,Automatic,Cloth,6500000,2007 Toyota FJ CRUISER,2007,Toyota
3,Foreign Used,0,4,Automatic,Leather,4700000,2005 Lexus GX,2005,Lexus
4,Foreign Used,7,4,Automatic,Leather,3800000,2005 Toyota 4-Runner,2008,Toyota


In [24]:
#one hot encoding for Foreign/Local Used column

# create an object of the OneHotEncoder
ce_one = ce.OneHotEncoder(cols=['Foreign/Local Used']) 

ce_one.fit_transform(data).head()

Unnamed: 0,Foreign/Local Used_1,Foreign/Local Used_2,color,wheel drive,Automation,seat-make,price,description,make-year,manufacturer
0,1,0,0,4,Automatic,Leather,17500000,2014 Lexus LX,2014,Lexus
1,1,0,0,4,Automatic,Leather,13000000,2012 Toyota Sequoia,2012,Toyota
2,1,0,1,4,Automatic,Cloth,6500000,2007 Toyota FJ CRUISER,2007,Toyota
3,1,0,0,4,Automatic,Leather,4700000,2005 Lexus GX,2005,Lexus
4,1,0,7,4,Automatic,Leather,3800000,2005 Toyota 4-Runner,2008,Toyota


In [15]:
# get_dummies: convert a categorical variable into dummy/indicator variables
pd.get_dummies(data['Foreign/Local Used']).head()

Unnamed: 0,Foreign Used,Locally Used
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


In [27]:
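# Target encoding
# create an object of the TargetEncoder
ce_te = ce.TargetEncoder(cols=['seat-make'])

# column to encode, and the target to encode it against
# (color was label encoded above, so its integer codes act as the target here)
X = data['seat-make']
y = data['color']

# fit the encoder, then transform the column
ce_te.fit(X, y)

ce_te.transform(X).head()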

Unnamed: 0,seat-make
0,5.253261
1,5.253261
2,5.918519
3,5.253261
4,5.253261


In [7]:
# make some data
example_df = pd.DataFrame({
 'class' : ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']})
# create an object of the BaseNEncoder
ce_baseN4 = ce.BaseNEncoder(cols=['class'],base=4)
# fit and transform and you will get the encoded data
ce_baseN4.fit_transform(example_df).head()

Unnamed: 0,class_0,class_1,class_2
0,0,0,1
1,0,0,2
2,0,0,1
3,0,0,2
4,0,0,3


In [35]:
# mean encode
def mean_encode(data, col, on):
    # mean of the target column `on` for each category in `col`
    means = data.groupby(col)[on].mean()

    # replace each category with its mean; values without a match
    # (e.g. missing categories) fall back to the average of the encoded column
    encoded = data[col].map(means)
    data[col] = encoded.fillna(encoded.mean())

    return data


In [42]:
#example dataframe_1
store1 = pd.DataFrame({'store': ['A'] * 3,
         'Sales': [100, 200, 300],
         'noise': [0, 0, 0]})

#example dataframe_2
store2 = pd.DataFrame(
        {'store': ['B'] * 3,
         'Sales': [10, 20, 30],
         'noise': [0, 0, 0]})

data = pd.concat([store1, store2], axis=0)  #concat dataframe
#np.testing.assert_array_equal(data.loc[:, 'store'],np.array([200, 200, 200, 20, 20, 20]))

In [43]:
data

Unnamed: 0,store,Sales,noise
0,A,100,0
1,A,200,0
2,A,300,0
0,B,10,0
1,B,20,0
2,B,30,0


In [44]:
mean_encode(data, col='store', on='Sales')

Unnamed: 0,store,Sales,noise
0,200,100,0
1,200,200,0
2,200,300,0
0,20,10,0
1,20,20,0
2,20,30,0
