<h1><b>Feature Engineering</b></h1>

<h2>1.) ENCODING</h2>

<h4>One Hot Encoding for multi-categorical features</h4>

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('mercedesbenz.csv', usecols=['X1','X2','X3','X4','X5','X6']) 

In [4]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [5]:
for col in data.columns:
    distinct = len(data[col].unique())
    print('{} : {} values'.format(col, distinct))

X1 : 27 values
X2 : 44 values
X3 : 7 values
X4 : 4 values
X5 : 29 values
X6 : 12 values


In [6]:
data['X2'].value_counts().sort_values(ascending=False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
i       25
k       25
b       21
ao      20
z       19
ag      19
Name: X2, dtype: int64

In [7]:
top_10 = data['X2'].value_counts().head(10).index.tolist()  # value_counts() already sorts by descending frequency
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

<h4>For each feature, take the ten most frequent categories and one-hot encode only those; all remaining categories get no indicator column (every indicator is 0 for them)</h4>

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X1      4209 non-null   object
 1   X2      4209 non-null   object
 2   X3      4209 non-null   object
 3   X4      4209 non-null   object
 4   X5      4209 non-null   object
 5   X6      4209 non-null   object
dtypes: object(6)
memory usage: 197.4+ KB



In [20]:
# One indicator column per top-10 label: 1 where X2 equals the label, else 0
for label in top_10:
    data['X2_'+label] = np.where(data['X2']==label,1,0)

In [27]:
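def one_hot_encoder_top_10(data, column):
    """Add 0/1 indicator columns to `data` (in place) for the 10 most frequent labels of `column`."""
    # value_counts() already sorts by descending frequency
    top_10 = data[column].value_counts().head(10).index.tolist()

    for label in top_10:
        data[column + '_' + label] = np.where(data[column] == label, 1, 0)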

In [28]:
one_hot_encoder_top_10(data,'X1')

In [29]:
data

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,s,as,c,d,aa,d,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4205,o,t,d,d,aa,h,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4206,v,r,a,d,aa,g,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4207,r,e,f,d,aa,l,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
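
The helper above changes `data` in place and handles one column per call. A minimal sketch of a variant that returns a new DataFrame and loops over several columns is shown below. The function name `one_hot_top_n`, the parameter `n`, and the toy data are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def one_hot_top_n(frame, columns, n=10):
    """Return a copy of `frame` with 0/1 indicator columns for the
    n most frequent labels of each listed column (hypothetical helper)."""
    encoded = frame.copy()
    for column in columns:
        # value_counts() sorts by descending frequency by default
        top_labels = frame[column].value_counts().head(n).index
        for label in top_labels:
            encoded[f"{column}_{label}"] = np.where(frame[column] == label, 1, 0)
    return encoded

# Toy data (assumption): small stand-in for the X1/X2 columns
toy = pd.DataFrame({
    "X1": ["v", "v", "v", "t", "t", "w"],
    "X2": ["n", "n", "n", "at", "at", "av"],
})
result = one_hot_top_n(toy, ["X1", "X2"], n=2)
print(result)
```

Returning a copy leaves the original frame untouched, which makes it easier to re-run a cell without accumulating duplicate columns.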


<h2>Encoding (Theory)</h2>

<h4>Nominal variables (no inherent order) - Gender, State <br>
Ordinal variables (ordered categories) - Education (BSc, MSc, PhD)</h4>

- Nominal
    - One Hot Encoding
    - One Hot Encoding (with many categories) - keep only the ten most frequent categories
    - Mean Encoding
    - Count/Frequency Encoding (replace with count)
- Ordinal
    - Label Encoding 
    - Target Guided Encoding (use when there are many categories)

<h4>Ordinal Variables - Label Encoding:</h4><br>
PhD = 3 ; MSc = 2 ; BSc = 1
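
<p>A minimal sketch of this mapping in pandas, assuming a hypothetical toy DataFrame with an `education` column (not part of the Mercedes-Benz data):</p>

```python
import pandas as pd

# Toy data (assumption): the column name "education" is illustrative only
toy = pd.DataFrame({"education": ["BSc", "PhD", "MSc", "BSc"]})

# An explicit dictionary preserves the natural order BSc < MSc < PhD
education_rank = {"BSc": 1, "MSc": 2, "PhD": 3}
toy["education_encoded"] = toy["education"].map(education_rank)
print(toy)
```

<p>Writing the mapping by hand keeps the order under your control; an automatic encoder that sorts labels alphabetically would not necessarily respect it.</p>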

<h4>Ordinal Variables - Target Guided Encoding:</h4>
<p>Compute the mean of the target variable for each category, then assign ranks (1, 2, 3, ...) to the categories in order of those means</p>
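
<p>One way this could look in pandas is sketched below. The toy DataFrame and the column names `city` and `target` are assumptions made for illustration.</p>

```python
import pandas as pd

# Toy data (assumption): a categorical feature and a binary target
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "C"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean of the target for each category
means = toy.groupby("city")["target"].mean()

# Order categories by ascending mean and assign ranks starting at 1
ordered = means.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered, start=1)}

toy["city_rank"] = toy["city"].map(rank_map)
print(toy)
```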

<h4>Nominal Variables - Mean Encoding:</h4>
<p>Compute the mean of the target variable for each category, then replace each category with its mean</p>
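
<p>A minimal sketch of mean encoding, again on hypothetical toy data with assumed column names `city` and `target`. In practice the means would usually be computed on the training split only, so that target information does not leak into validation data.</p>

```python
import pandas as pd

# Toy data (assumption): a categorical feature and a binary target
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "C"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Map each category to the mean of the target observed for it
mean_map = toy.groupby("city")["target"].mean().to_dict()
toy["city_mean"] = toy["city"].map(mean_map)
print(toy)
```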

<h3>HIGH CARDINALITY - Categorical variable with many categories</h3>

<h4>Count/Frequency Encoding</h4>

In [30]:
df = pd.read_csv('mercedesbenz.csv', usecols=['X1','X2'])  

In [31]:
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [37]:
for col in df.columns:
    print("{} : {} ".format(col,len(df[col].unique()))) 

X1 : 27 
X2 : 44 


In [35]:
df['X2'].value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'i': 25,
 'k': 25,
 'b': 21,
 'ao': 20,
 'z': 19,
 'ag': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'q': 5,
 'al': 5,
 'an': 5,
 'ah': 4,
 'p': 4,
 'av': 4,
 'au': 3,
 'c': 1,
 'aa': 1,
 'am': 1,
 'ar': 1,
 'l': 1,
 'j': 1,
 'o': 1,
 'af': 1}

In [38]:
count_map = df['X2'].value_counts().to_dict()

In [39]:
df['X2'] = df['X2'].map(count_map) 

In [41]:
df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137
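
The mapping above uses raw counts. A possible variant, sketched here on hypothetical toy data, replaces each category with its relative frequency via `value_counts(normalize=True)`. Either way, categories with equal counts become indistinguishable after encoding: in the counts listed earlier, 'i' and 'k' both occur 25 times and therefore receive the same value.

```python
import pandas as pd

# Toy data (assumption): small stand-in for the X2 column
toy = pd.DataFrame({"X2": ["as", "as", "ae", "as", "m", "ae"]})

# Relative frequency of each category (fractions sum to 1)
freq_map = toy["X2"].value_counts(normalize=True).to_dict()

# Keep the original column and write the encoding to a new one
toy["X2_freq"] = toy["X2"].map(freq_map)
print(toy)
```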
