In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

* Note: Whenever you are given a problem statement, **first classify each categorical variable** as **nominal** (labels have no inherent order) or **ordinal** (labels have a meaningful order).

### High Cardinality

Variables that have a large number of categories are said to have **high cardinality**.

If a categorical variable has many labels (high cardinality), one-hot encoding expands the feature space dramatically, because every label becomes its own column. This is a problem.

One approach that is heavily used in **Kaggle competitions** is to **replace each label of the categorical variable by its count**, that is, the number of times the label appears in the dataset. Alternatively, use **the frequency**, that is, the proportion of observations that carry the label. The two carry the same information, since the frequency is just the count divided by the number of rows.
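To make the difference between a count and a frequency concrete, here is a minimal sketch on a small made-up column. The `toy` frame and the `colour` column are hypothetical and are not part of the Mercedes data loaded below.

```python
import pandas as pd

# Toy data (hypothetical), not the Mercedes dataset used in this notebook
toy = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# Count: how many times each label appears
count_map = toy["colour"].value_counts().to_dict()

# Frequency: share of rows carrying each label (count / number of rows)
freq_map = toy["colour"].value_counts(normalize=True).to_dict()

# Keep the original column and add one encoded column per variant
toy["colour_count"] = toy["colour"].map(count_map)
toy["colour_freq"] = toy["colour"].map(freq_map)
print(toy)
```

Both encoded columns rank the labels identically; they differ only by the constant factor `len(toy)`.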

In [2]:
df = pd.read_csv("data/mercedes.csv",usecols=['X1','X2'])
df

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n
...,...,...
4204,s,as
4205,o,t
4206,v,r
4207,r,e


In [3]:
df.shape

(4209, 2)

## 4. Count Frequency Encoding 

* Mostly used in **Kaggle competitions**

In [5]:
# 1) Number of unique values in a column

len(df['X1'].unique())

27

In [6]:
len(df['X2'].unique())

44

In [7]:
# or use a loop

for col in df.columns: 
    print(col,": ",len(df[col].unique()),'labels')

X1 :  27 labels
X2 :  44 labels
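With 27 and 44 labels, one-hot encoding these two columns would create up to 27 + 44 = 71 columns, while count encoding keeps two. The sketch below shows the same effect on a small made-up frame; the `toy` data and its column names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two categorical columns with 4 and 3 distinct labels
toy = pd.DataFrame({
    "A": ["a", "b", "c", "d", "a", "b"],
    "B": ["x", "y", "z", "x", "y", "z"],
})

# One-hot encoding: one new column per label
one_hot = pd.get_dummies(toy)
print(toy.shape, "->", one_hot.shape)

# Count encoding: each column is mapped to its own label counts,
# so the number of columns does not change
counted = toy.apply(lambda s: s.map(s.value_counts()))
print(counted.shape)
```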


In [8]:
# 2) obtain the counts for each one of the labels in variable X2

df.X2.value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'z': 19,
 'ag': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'an': 5,
 'q': 5,
 'al': 5,
 'ah': 4,
 'av': 4,
 'p': 4,
 'au': 3,
 'j': 1,
 'l': 1,
 'aa': 1,
 'am': 1,
 'o': 1,
 'ar': 1,
 'c': 1,
 'af': 1}

In [9]:
df.X2.head() # or df['X2'].head()

0    at
1    av
2     n
3     n
4     n
Name: X2, dtype: object

In [10]:
# 3) let's replace each label in X2 by its count

# first, we build a dictionary that maps each label to its count
df_frequency_map = df.X2.value_counts().to_dict()

In [11]:
# and now, we replace the X2 labels in the dataset df

# Series.map() looks up each label in the dictionary and returns its count

df['X2'] = df['X2'].map(df_frequency_map)

df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


### Advantages
1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.

### Disadvantages
1. If two or more labels have the same count, they are replaced by the same number and become indistinguishable, so the information that separated them is lost.
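In the X2 counts above, for example, `k` and `i` both appear 25 times, so both are encoded as 25. A minimal sketch of how such collisions could be detected, using a small made-up series (the `toy` data is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): "k" and "i" appear twice each, "m" three times
toy = pd.Series(["k", "i", "k", "i", "m", "m", "m"], name="label")
counts = toy.value_counts()
encoded = toy.map(counts)

# Distinct values before and after encoding: fewer after means labels were merged
print(toy.nunique(), "->", encoded.nunique())

# Labels whose count is shared with at least one other label
collisions = counts[counts.duplicated(keep=False)]
print(collisions)
```

If `collisions` is non-empty, the encoded column can no longer tell those labels apart.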

[Kaggle Link for more information](https://www.kaggle.com/general/16927)