# About the dataset
  
The dataset we will work with is the Mercedes-Benz Greener Manufacturing data. It contains 378 columns, but we will use only 6 of them.

Our aim here is to encode those 6 features.

[Source](https://www.kaggle.com/datasets/philanipro/mercedesbenz-greener-manufacturing?select=train.csv)

In [3]:
import pandas as pd
import numpy as np

In [4]:
df_train = pd.read_csv(r'F:\Data Science\Datasets\Mercedes Benz/train.csv', usecols = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])
df_train.head()

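|   | X1 | X2 | X3 | X4 | X5 | X6 |
|---|----|----|----|----|----|----|
| 0 | v | at | a | d | u | j |
| 1 | t | av | e | d | y | l |
| 2 | w | n | c | d | x | j |
| 3 | t | n | f | d | x | l |
| 4 | v | n | f | d | h | d |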


In [5]:
# number of categories/labels each feature has

for col in df_train.columns:
    print(col, ':',len(df_train[col].unique()), 'categories')

X1 : 27 categories
X2 : 44 categories
X3 : 7 categories
X4 : 4 categories
X5 : 29 categories
X6 : 12 categories


There are 123 categories in total (27 + 44 + 7 + 4 + 29 + 12). If we perform one-hot encoding, the number of columns will spike, which can eventually lead to the curse of dimensionality.  

In [6]:
# performing onehot encoding

pd.get_dummies(df_train, drop_first = True).shape

(4209, 117)

Initially there are only 6 columns; after one-hot encoding with `drop_first = True` (which drops one category per feature) we have 117 columns.  
We need a way to reduce those 117 columns.  
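The 117 can be checked by hand from the per-feature counts printed earlier: with `drop_first = True`, each feature contributes one column fewer than its number of categories. The short sketch below redoes that arithmetic with the counts copied from the output above, and then confirms the same rule on a small made-up frame (`toy` is an illustrative assumption, not part of the dataset).

```python
import pandas as pd

# Category counts copied from the notebook output above
n_categories = {"X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29, "X6": 12}

full_width = sum(n_categories.values())                     # one column per category
dropped_width = sum(n - 1 for n in n_categories.values())   # drop_first=True removes one per feature
print(full_width, dropped_width)  # 123 117

# Same rule on a tiny, made-up frame: 3 and 2 categories -> (3-1) + (2-1) = 3 columns
toy = pd.DataFrame({"A": ["x", "y", "z"], "B": ["p", "q", "p"]})
toy_width = pd.get_dummies(toy, drop_first=True).shape[1]
print(toy_width)  # 3
```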

**Solution:**    
We can reduce the number of columns by keeping only the 10 most frequently occurring categories in each column.  
The count of 10 is not fixed; it may change depending on the total number of categories in a column.
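One possible way to let the count vary per column is to keep the most frequent labels until they cover a chosen share of the rows. The sketch below is only an illustration of that idea: the helper name `top_k_for_coverage`, the `coverage` threshold and the `max_k` cap are assumptions introduced here, not something this notebook uses, and the data is made up. It assumes `0 < coverage <= 1`.

```python
import pandas as pd

def top_k_for_coverage(series, coverage=0.9, max_k=10):
    """Hypothetical helper: most frequent labels that together cover
    at least `coverage` of the rows, capped at `max_k` labels."""
    counts = series.value_counts()             # sorted by frequency, descending
    target = coverage * counts.sum()           # number of rows we want covered
    reached = (counts.cumsum() >= target).to_numpy()
    k = int(reached.argmax()) + 1              # first position where the target is met
    return counts.index[:min(k, max_k)].tolist()

# Made-up column: 'a' x5, 'b' x3, 'c' x1, 'd' x1
toy = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"])
labels = top_k_for_coverage(toy, coverage=0.75)
print(labels)  # ['a', 'b'] since 8 of 10 rows is at least 75%
```

A column with few categories then keeps few labels, while a column with a long tail is still capped at `max_k`.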

In [7]:
# finding the 20 most frequent categories of the X2 variable

df_train.X2.value_counts().sort_values(ascending = False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
ag      19
z       19
Name: X2, dtype: int64

In [8]:
# capturing the 10 most frequent categories

top_10 = [x for x in df_train.X2.value_counts().sort_values(ascending = False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [9]:
# making 10 new columns, one for each label in top_10
# wherever X2 has that specific label, the new column is filled with 1;
# otherwise it is filled with 0.

for label in top_10:
    df_train[label] = np.where(df_train['X2']==label, 1, 0)
    
df_train[['X2']+top_10].head(10)

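|   | X2 | as | ae | ai | m | ak | r | n | s | f | e |
|---|----|----|----|----|---|----|---|---|---|---|---|
| 0 | at | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | av | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | as | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | as | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | aq | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |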
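Note that rows whose X2 label is outside the top 10 (here `at`, `av` and `aq`) end up with zeros in every new column, so they become indistinguishable from one another. A minimal sketch of how such rows could be spotted is shown below; it rebuilds the ten rows above from the values printed in the outputs, and the names `x2`, `indicators` and `uncovered` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# First ten X2 values and the top-10 labels, copied from the outputs above
x2 = pd.Series(["at", "av", "n", "n", "n", "e", "e", "as", "as", "aq"], name="X2")
top_10 = ["as", "ae", "ai", "m", "ak", "r", "n", "s", "f", "e"]

# Same 0/1 encoding as in the cell above, built as a separate frame
indicators = pd.DataFrame({label: np.where(x2 == label, 1, 0) for label in top_10})

# A row is "uncovered" when none of the ten indicator columns is set
uncovered = indicators.sum(axis=1) == 0
print(x2[uncovered].tolist())  # ['at', 'av', 'aq']
```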


In [10]:
df_train.head()

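|   | X1 | X2 | X3 | X4 | X5 | X6 | as | ae | ai | m | ak | r | n | s | f | e |
|---|----|----|----|----|----|----|----|----|----|---|----|---|---|---|---|---|
| 0 | v | at | a | d | u | j | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | t | av | e | d | y | l | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | w | n | c | d | x | j | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | t | n | f | d | x | l | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | v | n | f | d | h | d | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |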


In [11]:
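# reusable helper for encoding any feature

def one_hot(data, feat, top_x):
    # adds one 0/1 column named '<feat>_<label>' for each label in top_x;
    # data is modified in place and nothing is returned
    for label in top_x:
        data[feat+'_'+label] = np.where(data[feat]==label, 1, 0)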

In [12]:
# importing the data again

df = pd.read_csv(r'F:\Data Science\Datasets\Mercedes Benz\train.csv', usecols = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])
df.head()

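|   | X1 | X2 | X3 | X4 | X5 | X6 |
|---|----|----|----|----|----|----|
| 0 | v | at | a | d | u | j |
| 1 | t | av | e | d | y | l |
| 2 | w | n | c | d | x | j |
| 3 | t | n | f | d | x | l |
| 4 | v | n | f | d | h | d |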


In [13]:
# calling the one_hot function to encode the X2 feature

one_hot(df, 'X2', top_10)

In [14]:
df.head()

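|   | X1 | X2 | X3 | X4 | X5 | X6 | X2_as | X2_ae | X2_ai | X2_m | X2_ak | X2_r | X2_n | X2_s | X2_f | X2_e |
|---|----|----|----|----|----|----|-------|-------|-------|------|-------|------|------|------|------|------|
| 0 | v | at | a | d | u | j | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | t | av | e | d | y | l | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | w | n | c | d | x | j | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | t | n | f | d | x | l | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | v | n | f | d | h | d | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |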


In [16]:
# top 20 categories of X1 feature 

df.X1.value_counts().sort_values(ascending = False).head(20)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
y      23
f      23
j      22
Name: X1, dtype: int64

In [18]:
# let's encode the X1 feature with its own 10 most frequent labels

top_10_X1 = df.X1.value_counts().sort_values(ascending = False).head(10).index
top_10_X1

one_hot(df, 'X1', top_10_X1)

In [19]:
df.head()

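|   | X1 | X2 | X3 | X4 | X5 | X6 | X2_as | X2_ae | X2_ai | X2_m | ... | X1_aa | X1_s | X1_b | X1_l | X1_v | X1_r | X1_i | X1_a | X1_c | X1_o |
|---|----|----|----|----|----|----|-------|-------|-------|------|-----|-------|------|------|------|------|------|------|------|------|------|
| 0 | v | at | a | d | u | j | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | t | av | e | d | y | l | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | w | n | c | d | x | j | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | t | n | f | d | x | l | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | v | n | f | d | h | d | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |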
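
The same two steps (find the most frequent labels, then add one 0/1 column per label) could be repeated for the remaining features in a loop. The sketch below shows one way this might look on a small made-up frame. The helper `one_hot_top`, the value `k=2` and the final `drop` of the original columns are assumptions made for illustration; on the real data one would likely use `k=10` and decide separately whether to keep the original columns.

```python
import numpy as np
import pandas as pd

def one_hot_top(data, feat, k=10):
    """Hypothetical variant of one_hot: finds the k most frequent labels
    of `feat` itself, then adds one 0/1 column per label (in place)."""
    top = data[feat].value_counts().head(k).index
    for label in top:
        data[f"{feat}_{label}"] = np.where(data[feat] == label, 1, 0)

# Made-up data standing in for the real columns
toy = pd.DataFrame({
    "X1": ["v", "t", "w", "t", "v", "v"],
    "X2": ["at", "av", "n", "n", "n", "at"],
})

features = list(toy.columns)       # capture the names before new columns are added
for feat in features:
    one_hot_top(toy, feat, k=2)

encoded = toy.drop(columns=features)   # optional: keep only the indicator columns
print(encoded.columns.tolist())  # ['X1_v', 'X1_t', 'X2_n', 'X2_at']
```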
