# Complete Feature Engineering Guide

# Types of encoding techniques

### Encoding applies to categorical variables

#### Nominal categories have no inherent order
- gender, state, etc.

#### Ordinal categories can be arranged by rank
- Education: BE, Bcom, PHD, Masters

# There are 2 types of encoding: nominal and ordinal

### Nominal Encoding
- One hot encoding (does not work well when a feature has many categories, because each category becomes a new column and this leads to the curse of dimensionality)
- One hot encoding with many categories (select the most frequent category values and apply one hot encoding only to those)
- Mean encoding (each category is replaced by the mean of the output variable for the rows in that category)
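
A minimal sketch of mean encoding, assuming a made-up toy table: the column names `city` and `bought` and their values are hypothetical and only illustrate the idea of replacing each category with the mean of the output for that category.

```python
import pandas as pd

# Hypothetical toy data: 'city' is a nominal feature, 'bought' is the binary output.
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "bought": [1, 0, 1, 1, 0, 0],
})

# Mean of the output for each category of the feature.
city_means = toy.groupby("city")["bought"].mean()

# Replace each category with the mean computed for it.
toy["city_mean_enc"] = toy["city"].map(city_means)
print(toy)
```

In practice the means would usually be computed on the training split only, so that the encoding does not leak the output of the test rows.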

### Ordinal Encoding
- Label encoding (assigns each category an integer rank, e.g. PHD-1, Masters-2, BE-3, Bcom-4)
- Target guided ordinal encoding (compute the mean of the output for each category, then rank the categories by that mean: the category with the highest mean gets the highest label, followed by the 2nd highest, and so on)
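
Target guided ordinal encoding can be sketched as follows. The data is invented for illustration (the `approved` column and its values are assumptions), and the ranking rule follows the description above: the category with the highest output mean receives the highest label.

```python
import pandas as pd

# Hypothetical toy data: 'education' is the feature, 'approved' is the binary output.
toy = pd.DataFrame({
    "education": ["BE", "PHD", "Masters", "BE", "PHD", "Bcom", "Masters", "Bcom", "Masters"],
    "approved":  [0, 1, 1, 1, 1, 0, 1, 0, 0],
})

# Categories ordered from lowest to highest output mean.
ordered_labels = toy.groupby("education")["approved"].mean().sort_values().index

# Lowest mean gets 1, highest mean gets the largest integer.
rank_map = {label: rank for rank, label in enumerate(ordered_labels, start=1)}

toy["education_enc"] = toy["education"].map(rank_map)
print(rank_map)
```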

# Feature Scaling

### Gradient descent converges faster to the minimum when the data is scaled
- Feature scaling helps linear regression trained with gradient descent, because the coefficients reach the minimum faster when all features are on a similar scale. KNN and K-means work on Euclidean distance, so a feature with a large range dominates the distance unless the features are scaled.
- Feature scaling is not necessary for tree-based ensemble techniques, because decision trees split on one feature at a time and the splits do not depend on the feature's scale.
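
Two common ways to scale features are standardization and min-max scaling. The sketch below uses plain pandas on invented numbers (the `age` and `income` columns are hypothetical); scikit-learn's `StandardScaler` and `MinMaxScaler` compute the same quantities.

```python
import pandas as pd

# Hypothetical toy data with two features on very different scales.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [20000, 40000, 60000, 80000],
})

# Standardization: subtract the mean and divide by the (population) standard deviation.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Min-max scaling: squeeze every feature into the range [0, 1].
min_max = (toy - toy.min()) / (toy.max() - toy.min())

print(standardized)
print(min_max)
```

After either transformation both columns span a comparable range, so neither dominates a Euclidean distance.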

# Handling missing values
- Delete rows with missing values (feasible when the dataset is large)
- Replace with the most frequent value (might lead to an imbalanced dataset)
- Apply a classifier algorithm (rows with no missing value are the training data, rows with a missing value are the test data, and the feature with missing values is the output to predict)
- Apply unsupervised ML (use K-means clustering to form clusters, then fill the missing value with a value taken from the cluster the row belongs to)
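
The first two options in the list above can be sketched with pandas. The small `Embarked` column here is made up for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# Option 1: delete the rows that have a missing value.
dropped = toy.dropna(subset=["Embarked"])

# Option 2: replace missing values with the most frequent value (the mode).
most_frequent = toy["Embarked"].mode()[0]
filled = toy["Embarked"].fillna(most_frequent)

print(len(dropped), most_frequent)
print(filled.value_counts())
```

Note how the second option inflates the count of the most frequent value, which is the imbalance risk mentioned above.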

# FEATURE ENGINEERING - HANDLING CATEGORICAL FEATURES

## 1) Handling categorical features with a small number of categories (one hot encoding)


In [3]:
# import libraries

import pandas as pd
import numpy as np

df= pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv",usecols=['Sex'])
df.head()


Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [4]:
# get_dummies replaces each category with its own binary (0/1) column

sex_values = list(df['Sex'])  # the value of every row, not just the unique ones

print(pd.get_dummies(sex_values))

     female  male
0         0     1
1         1     0
2         1     0
3         1     0
4         0     1
..      ...   ...
886       0     1
887       1     0
888       1     0
889       0     1
890       0     1

[891 rows x 2 columns]


In [5]:
# next, consider a feature with more categories than the previous one

In [6]:
df= pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv",usecols=['Embarked'])
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [7]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [8]:
df.dropna(inplace=True)


In [9]:
# get_dummies replaces each category with its own binary (0/1) column

embarked_values = list(df['Embarked'])  # the value of every remaining row, not just the unique ones

print(pd.get_dummies(embarked_values))

     C  Q  S
0    0  0  1
1    1  0  0
2    0  0  1
3    0  0  1
4    0  0  1
..  .. .. ..
884  0  0  1
885  0  0  1
886  0  0  1
887  1  0  0
888  0  1  0

[889 rows x 3 columns]
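
For a feature with k categories, k-1 binary columns are enough, because the dropped category is implied when all the others are 0. The sketch below shows this with the `drop_first` option of `pd.get_dummies` on a made-up four-row column; `dtype=int` is passed so the result is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column with three categories.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# One column per category.
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# One column fewer: the first category ('C') is dropped and is implied by all zeros.
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(full)
print(reduced)
```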


## 2) Handling categorical features with a large number of categories (one hot encoding)

In [11]:
import pandas as pd
import numpy as np

In [12]:
# loading the data and selecting a few categorical columns

data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2','X3','X4','X5','X6'])

In [13]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [14]:
# checking how many category values are there in each feature

for col in data.columns:
    print(col ,": " , len(data[col].unique()), " labels")

X1 :  27  labels
X2 :  44  labels
X3 :  7  labels
X4 :  4  labels
X5 :  29  labels
X6 :  12  labels


In [15]:
# data shape before one hot encoding
data.shape

(4209, 6)

In [16]:
# data shape after one hot encoding
pd.get_dummies(data, drop_first=True).shape

(4209, 117)

In [17]:
# we get 117 features after one hot encoding, which might lead to the curse of dimensionality

# So another approach is to one hot encode only the 10 most frequent category values; rows with any other category get 0 in all 10 columns

In [18]:
# viewing the 20 most frequent category values in feature X2
data.X2.value_counts().sort_values(ascending=False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
z       19
ag      19
Name: X2, dtype: int64

#### Selecting the top 10 categories of a high-cardinality categorical feature and one hot encoding only those

In [20]:
# Let's consider the X2 feature for this example
# Let's make a list of its 10 most frequent category values

top10 = list(data.X2.value_counts().sort_values(ascending=False).head(10).index)
top10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [21]:
# we manually make one binary variable for each of the 10 categories
# if the row has that label then 1 else 0

for label in top10:
    data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top10].head(20)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [22]:
# creating a function to apply this to any feature

def one_hot(df,feature,top_labels):
    
    for label in top_labels:
        # use the df argument (not the global data) so the function works on any dataframe
        df[feature+'_'+label] = np.where(df[feature]==label, 1, 0)
        
# reading the data again
data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2','X3','X4','X5','X6'])

# perform one hot encoding for the X1 feature
# note: top10 was computed from X2, so only X2 labels that also occur in X1 (e.g. 'r' and 's' below) produce a 1
one_hot(data,'X1',top10)
data.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_as,X1_ae,X1_ai,X1_m,X1_ak,X1_r,X1_n,X1_s,X1_f,X1_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,0,0,0,0,0,0,0,0
6,r,e,f,d,f,h,0,0,0,0,0,1,0,0,0,0
7,l,as,f,d,f,j,0,0,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,0,0,0,0,0,0,1,0,0
9,b,aq,c,d,f,a,0,0,0,0,0,0,0,0,0,0


In [23]:
# perform one hot encoding for the X2 feature
one_hot(data,'X2',top10)
# perform one hot encoding for the X3 feature
one_hot(data,'X3',top10)
# perform one hot encoding for the X4 feature
one_hot(data,'X4',top10)
# perform one hot encoding for the X5 feature
one_hot(data,'X5',top10)
# perform one hot encoding for the X6 feature
one_hot(data,'X6',top10)

data.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_as,X1_ae,X1_ai,X1_m,...,X6_as,X6_ae,X6_ai,X6_m,X6_ak,X6_r,X6_n,X6_s,X6_f,X6_e
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,r,e,f,d,f,h,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,l,as,f,d,f,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,b,aq,c,d,f,a,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# now we can drop the original features before training the machine learning model

data.drop(columns=['X1','X2','X3','X4','X5','X6'], inplace=True)

data.head(10)

Unnamed: 0,X1_as,X1_ae,X1_ai,X1_m,X1_ak,X1_r,X1_n,X1_s,X1_f,X1_e,...,X6_as,X6_ae,X6_ai,X6_m,X6_ak,X6_r,X6_n,X6_s,X6_f,X6_e
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# view the columns created
data.columns

Index(['X1_as', 'X1_ae', 'X1_ai', 'X1_m', 'X1_ak', 'X1_r', 'X1_n', 'X1_s',
       'X1_f', 'X1_e', 'X2_as', 'X2_ae', 'X2_ai', 'X2_m', 'X2_ak', 'X2_r',
       'X2_n', 'X2_s', 'X2_f', 'X2_e', 'X3_as', 'X3_ae', 'X3_ai', 'X3_m',
       'X3_ak', 'X3_r', 'X3_n', 'X3_s', 'X3_f', 'X3_e', 'X4_as', 'X4_ae',
       'X4_ai', 'X4_m', 'X4_ak', 'X4_r', 'X4_n', 'X4_s', 'X4_f', 'X4_e',
       'X5_as', 'X5_ae', 'X5_ai', 'X5_m', 'X5_ak', 'X5_r', 'X5_n', 'X5_s',
       'X5_f', 'X5_e', 'X6_as', 'X6_ae', 'X6_ai', 'X6_m', 'X6_ak', 'X6_r',
       'X6_n', 'X6_s', 'X6_f', 'X6_e'],
      dtype='object')
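
The walkthrough above reuses the top 10 labels of X2 for every feature, which is why most of the resulting columns are all zeros. A variant that computes the most frequent labels separately for each feature might look like the sketch below. The helper name `one_hot_top_k` and the toy data are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(df, feature, k=10):
    """Add a 0/1 column for each of the k most frequent labels of `feature`."""
    # value_counts sorts by frequency, so head(k) gives the k most frequent labels.
    top_labels = list(df[feature].value_counts().head(k).index)
    for label in top_labels:
        df[f"{feature}_{label}"] = np.where(df[feature] == label, 1, 0)
    return top_labels

# Hypothetical toy data standing in for the real columns.
toy = pd.DataFrame({
    "X1": ["v", "t", "v", "w", "t", "v"],
    "X2": ["n", "n", "as", "as", "as", "e"],
})

# Each feature gets columns for its own most frequent labels.
for col in ["X1", "X2"]:
    one_hot_top_k(toy, col, k=2)

print(toy)
```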

## 3) Count (Frequency) Encoding

### High Cardinality

### Applied when a feature has a huge number of categories
- Replace each category value with its count, i.e. the number of times that value appears in the data

In [27]:
# consider 2 columns
data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2'])


In [28]:
data.shape

(4209, 2)

In [29]:
# one hot encoding these columns is prone to the curse of dimensionality
pd.get_dummies(data).shape

(4209, 71)

In [30]:
# checking for unique labels in this feature
len(data['X1'].unique())

27

In [31]:
# checking for unique labels in this feature
len(data['X2'].unique())

44

In [32]:
# let's capture the count of each label of the X2 feature in a dictionary
data.X2.value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'z': 19,
 'ag': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'y': 11,
 'ap': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'al': 5,
 'q': 5,
 'an': 5,
 'ah': 4,
 'p': 4,
 'av': 4,
 'au': 3,
 'af': 1,
 'ar': 1,
 'j': 1,
 'am': 1,
 'aa': 1,
 'o': 1,
 'l': 1,
 'c': 1}

In [33]:
# let's store it in a variable
df_frequency_map = data.X2.value_counts().to_dict()

In [34]:
# now let's map the dictionary onto the X2 column, replacing each label with its count
data.X2 = data.X2.map(df_frequency_map)
data

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137
...,...,...
4204,s,1659
4205,o,29
4206,v,153
4207,r,81


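### Advantages
- Easy to implement
- Does not increase the feature dimension

### Disadvantages
- It does not always help: the count is useful only when how often a category occurs is related to the output
- If two labels have the same count (for example 'k' and 'i' both appear 25 times in X2), they get the same encoded value and that information is lost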
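
The collision problem can be checked directly. This is a small sketch on invented values (not the real X2 column) that count-encodes a column and then lists the labels that share a count and therefore become indistinguishable.

```python
import pandas as pd

# Hypothetical toy column: 'k' and 'i' both appear twice.
toy = pd.Series(["as", "as", "as", "k", "k", "i", "i", "z"], name="X2")

# Count of each label, then replace each label with its count.
counts = toy.value_counts()
encoded = toy.map(counts)

# Labels whose count is shared with at least one other label.
collisions = counts[counts.duplicated(keep=False)]

print(encoded.tolist())
print(collisions)
```

If many labels collide, another encoding for that feature may preserve more information.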

## 4) Ordinal number encoding or label encoding
- Replace each category with an integer that reflects its rank

In [37]:
# creating a list of 20 datetimes, going back one day at a time from today

import pandas as pd
import datetime

df_base = datetime.datetime.today()
df_date_list = [df_base - datetime.timedelta(days=x) for x in range(0,20)]
df = pd.DataFrame(df_date_list)
df.columns = ['day']
df

Unnamed: 0,day
0,2021-11-26 17:07:54.842531
1,2021-11-25 17:07:54.842531
2,2021-11-24 17:07:54.842531
3,2021-11-23 17:07:54.842531
4,2021-11-22 17:07:54.842531
5,2021-11-21 17:07:54.842531
6,2021-11-20 17:07:54.842531
7,2021-11-19 17:07:54.842531
8,2021-11-18 17:07:54.842531
9,2021-11-17 17:07:54.842531


In [38]:
# extracting the weekday name

df['day_of_week'] = df['day'].dt.strftime("%A")

In [39]:
df.head()

Unnamed: 0,day,day_of_week
0,2021-11-26 17:07:54.842531,Friday
1,2021-11-25 17:07:54.842531,Thursday
2,2021-11-24 17:07:54.842531,Wednesday
3,2021-11-23 17:07:54.842531,Tuesday
4,2021-11-22 17:07:54.842531,Monday


In [40]:
# encoding the weekday names by replacing each with its ordinal number (Monday=1 ... Sunday=7)

weekday_map = {
    'Monday':1,
    'Tuesday':2,
    'Wednesday':3,
    'Thursday':4,
    'Friday':5,
    'Saturday':6,
    'Sunday':7
}

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head(10)

Unnamed: 0,day,day_of_week,day_ordinal
0,2021-11-26 17:07:54.842531,Friday,5
1,2021-11-25 17:07:54.842531,Thursday,4
2,2021-11-24 17:07:54.842531,Wednesday,3
3,2021-11-23 17:07:54.842531,Tuesday,2
4,2021-11-22 17:07:54.842531,Monday,1
5,2021-11-21 17:07:54.842531,Sunday,7
6,2021-11-20 17:07:54.842531,Saturday,6
7,2021-11-19 17:07:54.842531,Friday,5
8,2021-11-18 17:07:54.842531,Thursday,4
9,2021-11-17 17:07:54.842531,Wednesday,3
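
For weekdays specifically, pandas can produce the same ordinal values without a hand-written dictionary. The sketch below is an alternative to the mapping above, not the notebook's own method: it uses fixed dates (starting on Monday 2021-11-22) so the result is reproducible, `dt.day_name()` for the names, and `dt.dayofweek`, which numbers Monday as 0, shifted by 1 to match the Monday=1 convention.

```python
import pandas as pd

# Seven consecutive days starting on a Monday (fixed dates for reproducibility).
days = pd.DataFrame({"day": pd.date_range("2021-11-22", periods=7, freq="D")})

# Weekday name; day_name() returns English names by default.
days["day_of_week"] = days["day"].dt.day_name()

# dayofweek is 0 for Monday ... 6 for Sunday, so add 1 to get Monday=1 ... Sunday=7.
days["day_ordinal"] = days["day"].dt.dayofweek + 1

print(days)
```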
