# Count or frequency encoding

Categorical variables with a large number of categories are also said to have high cardinality.

If a categorical variable contains many labels (high cardinality), one hot encoding will expand the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the proportion of observations that carry that label. The two are equivalent up to scaling, because the frequency is the count divided by the total number of observations.
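As a minimal sketch of the idea, the snippet below encodes a small made-up column both ways. The series `s` and its labels are assumptions chosen for illustration and are not taken from the dataset used in the rest of this notebook.

```python
import pandas as pd

# Toy column with made-up labels (not from the Mercedes-Benz data)
s = pd.Series(["a", "b", "a", "c", "a", "b"], name="label")

counts = s.value_counts()                     # a: 3, b: 2, c: 1
frequencies = s.value_counts(normalize=True)  # a: 3/6, b: 2/6, c: 1/6

# Replace each label by its count or by its frequency
count_encoded = s.map(counts)
freq_encoded = s.map(frequencies)

print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]

# The frequency is the count divided by the number of rows,
# so scaling it back recovers the counts
print((freq_encoded * len(s)).round().astype(int).tolist())  # [3, 2, 3, 1, 3, 2]
```

Both encodings rank the labels identically, so the choice between them mostly affects the scale of the resulting feature.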

Let's see how this works:

In [2]:
import pandas as pd
import numpy as np

In [3]:
url="https://raw.githubusercontent.com/krishnaik06/Complete-Feature-Engineering/master/mercedesbenz.csv"
data=pd.read_csv(url)

In [4]:
data.head(3)

,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0


In [5]:
df=pd.read_csv(url,usecols=['X1','X2'])

In [6]:
df.head()

,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [8]:
df.shape

(4209, 2)

# One Hot Encoding

In [11]:
pd.get_dummies(df['X1']).shape

(4209, 27)

In [15]:
pd.get_dummies(df['X2']).shape

(4209, 44)

In [12]:
pd.get_dummies(df).shape

(4209, 71)

In [13]:
df_n=pd.get_dummies(df)

In [14]:
df_n.head()

,X1_a,X1_aa,X1_ab,X1_b,X1_c,X1_d,X1_e,X1_f,X1_g,X1_h,...,X2_n,X2_o,X2_p,X2_q,X2_r,X2_s,X2_t,X2_x,X2_y,X2_z
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [18]:
df_n1=pd.get_dummies(df,drop_first=True)

In [19]:
df_n1.head()

,X1_aa,X1_ab,X1_b,X1_c,X1_d,X1_e,X1_f,X1_g,X1_h,X1_i,...,X2_n,X2_o,X2_p,X2_q,X2_r,X2_s,X2_t,X2_x,X2_y,X2_z
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
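
To make the difference in width concrete, here is a minimal sketch that compares one hot encoding with count encoding on a small made-up frame. The frame `toy` and its columns `city` and `plan` are assumptions for illustration only; the same pattern should carry over to `X1` and `X2`.

```python
import pandas as pd

# Toy frame with two made-up categorical columns
toy = pd.DataFrame({
    "city": ["rome", "oslo", "lima", "rome", "cairo", "oslo"],
    "plan": ["free", "pro", "free", "free", "team", "pro"],
})

# One hot encoding: one new column per distinct label
one_hot = pd.get_dummies(toy)

# Count encoding: each column is mapped to the counts of its own labels
count_encoded = toy.apply(lambda col: col.map(col.value_counts()))

print(one_hot.shape)        # (6, 7): 4 city labels + 3 plan labels
print(count_encoded.shape)  # (6, 2): still one column per variable
```

The one hot width grows with the number of distinct labels, while the count-encoded frame keeps one column per original variable regardless of cardinality.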


In [24]:
# let's have a look at how many labels each variable has

for col in df.columns:
    print(col, ': ', df[col].nunique(), ' labels')

X1 :  27  labels
X2 :  44  labels


In [26]:
for col in df.columns:
    print(col,':',df[col].value_counts())

X1 : aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
y      23
f      23
j      22
n      19
k      17
p       9
g       6
d       3
ab      3
q       3
Name: X1, dtype: int64
X2 : as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
i       25
k       25
b       21
ao      20
ag      19
z       19
d       18
ac      13
g       12
ap      11
y       11
x       10
aw       8
h        6
at       6
al       5
q        5
an       5
av       4
ah       4
p        4
au       3
l        1
j        1
ar       1
af       1
aa       1
o        1
am       1
c        1
Name: X2, dtype: int64


In [27]:
# counts of each unique combination of X1 and X2 values
df.value_counts()

X1  X2
s   as    434
aa  as    358
l   as    295
aa  ai    195
i   as    172
         ... 
b   y       1
m   aq      1
    ap      1
l   z       1
s   r       1
Length: 255, dtype: int64

In [28]:
# let's obtain the counts for each one of the labels in variable X2
# let's capture this in a dictionary that we can use to re-map the labels

df.X2.value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'i': 25,
 'k': 25,
 'b': 21,
 'ao': 20,
 'ag': 19,
 'z': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'al': 5,
 'q': 5,
 'an': 5,
 'av': 4,
 'ah': 4,
 'p': 4,
 'au': 3,
 'l': 1,
 'j': 1,
 'ar': 1,
 'af': 1,
 'aa': 1,
 'o': 1,
 'am': 1,
 'c': 1}

In [30]:
# And now let's replace each label in X2 by its count

# first we make a dictionary that maps each label to the counts
df_frequency_map = df.X2.value_counts().to_dict()

In [31]:
# and now we replace X2 labels in the dataset df
df.X2 = df.X2.map(df_frequency_map)

df.head()

,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


Count and frequency encoding have some advantages and disadvantages:

Advantages
1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.

Disadvantages
1. If two labels have the same count, they are replaced by the same number, so the model can no longer tell them apart and valuable information may be lost. In X2, for example, the labels i and k both appear 25 times, so both are encoded as 25.
2. It adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
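
The first disadvantage can be checked directly. In the X2 counts above, the labels i and k both appear 25 times, so count encoding maps both of them to 25. The sketch below reproduces that situation on a small made-up series (the series `s` and its values are assumptions for illustration) and lists which labels end up sharing an encoded value.

```python
import pandas as pd

# Made-up labels: "i" and "k" appear equally often
s = pd.Series(["i", "k", "as", "i", "as", "k", "as"])

# Same mapping approach as in the notebook: label -> count
count_map = s.value_counts().to_dict()   # as: 3, i: 2, k: 2
encoded = s.map(count_map)

# Group the original labels by the value they were encoded to
collisions = s.groupby(encoded).unique()

print(s.nunique(), encoded.nunique())  # 3 distinct labels, 2 distinct codes
print(collisions.loc[2].tolist())      # ['i', 'k'] share the code 2
```

Comparing `s.nunique()` with `encoded.nunique()` is a quick way to see how many labels were merged by the encoding.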