# Count or frequency encoding
# High Cardinality


Variables with a large number of categories are also said to have high cardinality.

If a categorical variable has many labels (high cardinality), one hot encoding expands the feature space dramatically, because it adds one new column per label.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations that carry the label. Within a single dataset the two carry the same information, because the frequency is just the count divided by the number of rows.
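
As a quick illustration of the idea, here is a minimal sketch on a made-up column. The `toy` frame and the `colour` labels are hypothetical and only serve to show the two variants side by side.

```python
import pandas as pd

# Hypothetical toy column, not taken from the Mercedes-Benz data
toy = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# Count encoding: number of rows in which each label appears
count_map = toy["colour"].value_counts().to_dict()

# Frequency encoding: the same counts divided by the number of rows
freq_map = toy["colour"].value_counts(normalize=True).to_dict()

toy["colour_count"] = toy["colour"].map(count_map)
toy["colour_freq"] = toy["colour"].map(freq_map)
print(toy)
```

Multiplying `colour_freq` by the number of rows gives back `colour_count`, which is why the choice between the two rarely matters within one dataset.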

Let's see how this works:

In [9]:
import pandas as pd
import numpy as np
import os
os.getcwd()
os.chdir ("C:\\Users\\EARABMO\\Desktop\\ERICSSON ITEMS\\DATA SCIENTIST\\PYTHON\\.ipynb_checkpoints\\")
#https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration/data

df = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2'])
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [10]:
df.shape

(4209, 2)

# One hot Encoding

In [12]:
pd.get_dummies(df).shape

(4209, 71)

In [20]:
len(df['X1'].unique())

27

In [21]:
len(df['X2'].unique())

44

In [26]:
# let's have a look at how many labels each variable has
for col in df.columns:
    print(col, ':', len(df[col].unique()), "labels")

X1 : 27 labels
X2 : 44 labels
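
The 71 columns produced by one hot encoding above are simply 27 + 44, one dummy column per label. The sketch below checks the same relationship on a small made-up frame; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for the X1 and X2 columns
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": ["x", "y", "x", "x"],
})

# Total number of distinct labels across all columns
n_labels = toy.nunique().sum()

# Number of columns after one hot encoding: one dummy column per label
n_dummies = pd.get_dummies(toy).shape[1]

print(n_labels, n_dummies)
```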


In [31]:
# let's obtain the counts for each one of the labels in variable X1
# the result is a dictionary of the kind we can use to re-map the labels
df.X1.value_counts().to_dict()

{'aa': 833,
 's': 598,
 'b': 592,
 'l': 590,
 'v': 408,
 'r': 251,
 'i': 203,
 'a': 143,
 'c': 121,
 'o': 82,
 'w': 52,
 'z': 46,
 'u': 37,
 'e': 33,
 'm': 32,
 't': 31,
 'h': 29,
 'f': 23,
 'y': 23,
 'j': 22,
 'n': 19,
 'k': 17,
 'p': 9,
 'g': 6,
 'ab': 3,
 'd': 3,
 'q': 3}

In [37]:
# And now let's replace each label in X2 by its count

# first we make a dictionary that maps each label to the counts
df_frequency_map=df.X2.value_counts().to_dict()

In [39]:
df_frequency_map

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'i': 25,
 'k': 25,
 'b': 21,
 'ao': 20,
 'z': 19,
 'ag': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'y': 11,
 'ap': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'an': 5,
 'q': 5,
 'al': 5,
 'p': 4,
 'ah': 4,
 'av': 4,
 'au': 3,
 'c': 1,
 'am': 1,
 'ar': 1,
 'aa': 1,
 'j': 1,
 'af': 1,
 'l': 1,
 'o': 1}

In [40]:
# and now we replace X2 labels in the dataset df
df.X2 = df.X2.map(df_frequency_map)
df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


# Advantages and disadvantages

Advantages
1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.

Disadvantages
1. If two labels have the same count, they are replaced by the same number, so they can no longer be told apart and valuable information may be lost.
2. It adds somewhat arbitrary numbers, and therefore weights, to the different labels, and these may not be related to their predictive power.
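
The first disadvantage is visible in the X2 map above, where 'i' and 'k' both map to 25. A minimal sketch for spotting such collisions before encoding, using a hypothetical toy series:

```python
import pandas as pd

# Hypothetical toy column in which "b" and "c" both appear twice
s = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d"])

counts = s.value_counts()

# Labels whose count is shared with at least one other label
collisions = counts[counts.duplicated(keep=False)]
print(collisions)

# After encoding, labels with equal counts collapse into one value
encoded = s.map(counts)
print(s.nunique(), encoded.nunique())
```

If many labels collide, it may be worth comparing against another encoding, since the collapsed labels become indistinguishable to the model.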

Follow this thread in Kaggle for more information: https://www.kaggle.com/general/16927

In [1]:
#Count Or Frequency Encoding

In [1]:
import pandas as pd
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data' , header = None,index_col=None)
train_set.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# Positions of the categorical columns we want to keep
columns=[1,3,5,6,7,8,9,13]

In [3]:
#Checking the type of columns
type(columns)

list

In [9]:
# Keep only the selected columns (copy so that later assignments do not touch a view)
train_set=train_set[columns].copy()

In [10]:
train_set[columns]

Unnamed: 0,1,3,5,6,7,8,9,13
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba
...,...,...,...,...,...,...,...,...
32556,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States
32557,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
32558,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States
32559,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States


In [11]:
#Renaming the columns
train_set.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']

In [14]:
train_set.columns

Index(['Employment', 'Degree', 'Status', 'Designation', 'family_job', 'Race',
       'Sex', 'Country'],
      dtype='object')

In [15]:
#Checking the type of train_set
type(train_set)

pandas.core.frame.DataFrame

In [23]:
for feature in train_set.columns[:]:
    print(feature,":",len(train_set[feature].unique()),"labels")
    

Employment : 9 labels
Degree : 16 labels
Status : 7 labels
Designation : 15 labels
family_job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


In [26]:
len(train_set["Country"].unique())

42

In [33]:
# Map each country label to its count, then replace the labels with those counts
Country_map=train_set["Country"].value_counts().to_dict()
train_set['Country']=train_set['Country'].map(Country_map)

In [39]:
Country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Cambodia': 19,
 ' Trinadad&Tobago': 19,
 ' Laos': 18,
 ' Thailand': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}
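
Note that every key in `Country_map` starts with a space, because the fields in adult.data are separated by a comma followed by a space. This does not affect the counts, but it can cause surprises when looking labels up by name. A minimal sketch of one way to avoid it, using the `skipinitialspace` option of `pd.read_csv` on two hypothetical rows in the same layout:

```python
import io
import pandas as pd

# Two hypothetical rows in the same ", "-separated layout as adult.data
raw = "39, State-gov, United-States\n28, Private, Cuba\n"

# Default parsing keeps the space that follows each comma
as_is = pd.read_csv(io.StringIO(raw), header=None)

# skipinitialspace=True drops the spaces that follow the delimiter
stripped = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(as_is[2].tolist())
print(stripped[2].tolist())
```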

In [40]:
train_set.head(10)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170


# Advantages
Easy to use
Does not increase the feature space
# Disadvantages
Labels with the same frequency receive the same value, so the model cannot tell them apart


# Target Guided Ordinal Encoding
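Order the labels according to the mean of the target within each label.
Replace each label by its position in that ordering, or (in mean encoding) by the mean itself, that is, the fraction of observations in that label whose target is 1.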
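
Before the Titanic walkthrough, here is a minimal sketch of both replacements on made-up data. The `toy` frame, the `deck` labels and the `target` values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target
toy = pd.DataFrame({
    "deck":   ["A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "target": [0,   1,   1,   1,   0,   0,   0,   0,   1],
})

# Mean of the target per label, i.e. the share of rows with target == 1
target_mean = toy.groupby("deck")["target"].mean()

# Target guided ordinal encoding: rank labels from lowest to highest mean
ordered_labels = target_mean.sort_values().index
ordinal_map = {label: rank for rank, label in enumerate(ordered_labels)}

toy["deck_ordinal"] = toy["deck"].map(ordinal_map)

# Mean encoding: use the per-label mean itself instead of its rank
toy["deck_mean"] = toy["deck"].map(target_mean.to_dict())
print(toy)
```

Both encodings are computed from the target, so in practice the mapping would normally be learned on the training data only and then applied to any held-out data.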

In [81]:
import pandas as pd
import numpy as np
import os
os.getcwd()
os.chdir ("C:\\Users\\EARABMO\\Desktop\\ERICSSON ITEMS\\DATA SCIENTIST\\PYTHON\\.ipynb_checkpoints\\")

df=pd.read_csv("train.csv",usecols=['Cabin','Survived'])

In [82]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [83]:
df["Cabin"]=df["Cabin"].fillna("Missing")

In [84]:
df['Cabin']=df['Cabin'].astype(str).str[0]

In [85]:
type(df["Cabin"])

pandas.core.series.Series

In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  891 non-null    int64 
 1   Cabin     891 non-null    object
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [87]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [93]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [94]:
# Check that no missing values remain in the column
df.Cabin.isnull().sum()

0

In [95]:
df.groupby(["Cabin"])["Survived"].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [103]:
df.groupby(["Cabin"])["Survived"].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [104]:
ordinal_labels=df.groupby(["Cabin"])["Survived"].mean().sort_values().index

In [119]:
# enumerate yields (position, label) pairs, starting at 0
type(enumerate(ordinal_labels,0))

enumerate

In [125]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}

In [121]:
type(ordinal_labels2)

dict

In [126]:
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [128]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1


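# Mean Encoding
Replace each label with the mean of the target for that label, here the survival rate per cabin letter.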

In [129]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()

In [130]:

mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [131]:
df['mean_ordinal_encode']=df['Cabin'].map(mean_ordinal)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,mean_ordinal_encode
0,0,M,1,0.299854
1,1,C,4,0.59322
2,1,M,1,0.299854
3,1,C,4,0.59322
4,0,M,1,0.299854
