<a href="https://colab.research.google.com/github/Movya777/EDA_and_Feature_Engineering/blob/main/%F0%9F%9A%98Mercedes_Benz_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset: https://www.kaggle.com/competitions/mercedes-benz-greener-manufacturing/data

**Problem:** What if a categorical feature has too many categories?

**Solution 1:** One-hot encoding of only the most frequent categories

**Solution 2:** Count/frequency encoding: replace each category with the number of times it occurs in the data

In [1]:
# import modules
import pandas as pd
import numpy as np

In [2]:
# read files
df1=pd.read_csv('train.csv')

In [3]:
print("Length of the dataset:", len(df1))
print("Number of columns:", len(df1.columns))

Length of the dataset: 4209
Number of columns: 378


In [4]:
# let's consider only a few of the categorical columns
data=pd.read_csv('train.csv', usecols=['X1','X2','X3','X4','X5','X6'])

We keep 6 of the categorical columns: X1 to X6.

In [5]:
# Let's count the unique categories in each column
for i in data.columns:
  print(i,":", len(data[i].unique()), "labels")

X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels


In [None]:
# Let's see how many columns we get after applying one-hot encoding
pd.get_dummies(data,drop_first=True).shape

(4209, 117)

One-hot encoding turns the 6 columns into 117, almost 20 times as many.
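The figure of 117 can be checked by hand: with `drop_first=True`, each column contributes one dummy fewer than its number of categories, so (27-1) + (44-1) + (7-1) + (4-1) + (29-1) + (12-1) = 117. The sketch below verifies the same rule on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": ["x", "y", "x", "x"],
})

# With drop_first=True, each column yields (number of categories - 1) dummies
expected_cols = int((toy.nunique() - 1).sum())
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]

print(expected_cols, actual_cols)
```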

What can we do instead?

# Solution 1: Consider top 10 categories

We can keep the 10 most frequent categories of each variable as 0/1 indicator columns; every other category gets 0 in all of them.
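As a minimal sketch of the idea, the helper below builds the indicator columns for one variable. The function name `top_k_one_hot` and the toy data are assumptions made for illustration; the notebook itself performs the same steps inline.

```python
import numpy as np
import pandas as pd

def top_k_one_hot(df, col, k=10):
    """Return 0/1 indicator columns for the k most frequent categories of `col`."""
    top = df[col].value_counts().head(k).index
    out = pd.DataFrame(index=df.index)
    for cat in top:
        # 1 where the row holds this category, 0 everywhere else
        out[f"{col}_{cat}"] = np.where(df[col] == cat, 1, 0)
    return out

# Hypothetical toy column: 'a' occurs 3 times, 'b' twice, 'c' once
toy = pd.DataFrame({"X": ["a", "a", "a", "b", "b", "c"]})
encoded = top_k_one_hot(toy, "X", k=2)
print(encoded)
```

With `k=2` only `a` and `b` get a column, so the row holding `c` is all zeros.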

In [6]:
# let's find the 10 most frequent categories of variable X2
t10_X2=data['X2'].value_counts().head(10)
t10_X2

X2,count
as,1659
ae,496
ai,415
m,367
ak,265
r,153
n,137
s,94
f,87
e,81


In [7]:
# let's turn the index into a list
topx2=list(t10_X2.index)
topx2

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [8]:
for i in topx2:
  data[i]=np.where(data['X2']==i,1,0)

data.head()

,X1,X2,X3,X4,X5,X6,as,ae,ai,m,ak,r,n,s,f,e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0


In [9]:
# apply the same steps for remaining columns
rem_cols=['X1','X3','X4','X5','X6']
for i in rem_cols:
  top10=[x for x in data[i].value_counts().head(10).index]
  for j in top10:
    data[i+'_'+j]=np.where(data[i]==j,1,0)

In [10]:
data.head()

,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X6_g,X6_j,X6_d,X6_i,X6_l,X6_a,X6_h,X6_k,X6_c,X6_b
0,v,at,a,d,u,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [None]:
data.columns

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'as', 'ae', 'ai', 'm', 'ak', 'r',
       'n', 's', 'f', 'e', 'X1_aa', 'X1_s', 'X1_b', 'X1_l', 'X1_v', 'X1_r',
       'X1_i', 'X1_a', 'X1_c', 'X1_o', 'X3_c', 'X3_f', 'X3_a', 'X3_d', 'X3_g',
       'X3_e', 'X3_b', 'X4_d', 'X4_a', 'X4_b', 'X4_c', 'X5_w', 'X5_v', 'X5_q',
       'X5_r', 'X5_s', 'X5_d', 'X5_n', 'X5_p', 'X5_m', 'X5_i', 'X6_g', 'X6_j',
       'X6_d', 'X6_i', 'X6_l', 'X6_a', 'X6_h', 'X6_k', 'X6_c', 'X6_b'],
      dtype='object')

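Advantage of this method:
- does not expand the feature space massively: each variable adds at most 10 indicator columns (51 here, versus 117 with full one-hot encoding)

Disadvantage of this method:
- discards the information in the ignored categories: every row outside the top 10 gets all zeros, so those categories become indistinguishable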
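One way to gauge how much is lost is to measure the share of rows that fall inside the top categories. For X2, the ten counts listed earlier sum to 3754 of 4209 rows, roughly 89%, so about 455 rows end up as all zeros. The sketch below computes that share for any column; the helper name `top_k_coverage` and the toy data are assumptions for illustration.

```python
import pandas as pd

def top_k_coverage(series, k=10):
    """Fraction of rows whose category is among the k most frequent ones."""
    counts = series.value_counts()
    return counts.head(k).sum() / counts.sum()

# Hypothetical toy column: 'a' occurs 3 times, 'b' twice, 'c' once
toy = pd.Series(["a", "a", "a", "b", "b", "c"])
coverage = top_k_coverage(toy, k=2)
print(round(coverage, 3))
```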

# Solution 2: Count/Frequency encoding

In [12]:
# let's consider only two high-cardinality categorical columns
data2=pd.read_csv('train.csv', usecols=['X1','X2'])

In [13]:
# Let's count the unique categories in each column
for i in data2.columns:
  print(i,":", len(data2[i].unique()), "labels")

X1 : 27 labels
X2 : 44 labels


In [14]:
# let's get the value counts and store them in dictionaries, which we map onto the columns later
X1_count=data2.X1.value_counts().to_dict()
X2_count=data2.X2.value_counts().to_dict()

In [16]:
# replace each category with its count
data2.X1=data2.X1.map(X1_count)

In [17]:
data2.X2=data2.X2.map(X2_count)

In [18]:
data2.head()

,X1,X2
0,408,6
1,31,4
2,52,137
3,31,137
4,408,137
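The cells above use raw counts. A closely related variant maps each category to its relative frequency instead, which `value_counts(normalize=True)` provides. The sketch below shows both side by side on made-up data (`toy` and the new column names are hypothetical).

```python
import pandas as pd

# Hypothetical toy column: 'a' occurs 3 times, 'b' twice, 'c' once
toy = pd.DataFrame({"X": ["a", "a", "a", "b", "b", "c"]})

# Raw counts per category, and the same counts as a share of all rows
count_map = toy["X"].value_counts().to_dict()
freq_map = toy["X"].value_counts(normalize=True).to_dict()

toy["X_count"] = toy["X"].map(count_map)
toy["X_freq"] = toy["X"].map(freq_map)
print(toy)
```

Both encodings preserve the same ordering of categories; the frequency version simply keeps the values between 0 and 1.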


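Advantages:
- does not add any columns, so the feature space stays the same size

Disadvantages:
- categories that occur equally often get the same value, so the model can no longer tell them apart
- large counts are not necessarily informative: how often a category occurs may have little to do with the target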
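Before relying on count encoding, it may help to check whether any categories actually share a count. The sketch below groups such categories together; the helper name `colliding_categories` and the toy data are assumptions for illustration.

```python
import pandas as pd

def colliding_categories(series):
    """Group categories that share a count and would get the same encoded value."""
    counts = series.value_counts()
    # keep=False marks every member of a tied group, not just the repeats
    shared = counts[counts.duplicated(keep=False)]
    return {int(n): sorted(shared[shared == n].index) for n in sorted(shared.unique())}

# Hypothetical toy column: 'a' and 'b' both occur twice, 'c' once
toy = pd.Series(["a", "a", "b", "b", "c"])
collisions = colliding_categories(toy)
print(collisions)
```

An empty dictionary means every category has a distinct count, so no information is merged by the encoding.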