# Complete Feature Engineering Guide

# Types of encoding techniques

### Encoding applies to categorical variables

#### Nominal categories have no inherent order
- gender, state, etc.

#### Ordinal categories can be arranged by rank
- Education: BE, Bcom, PHD, Masters

# There are two types of encoding: nominal and ordinal

### Nominal Encoding
- One hot encoding (does not work well when the number of categories is large, because it adds one column per category and leads to the curse of dimensionality)
- One hot encoding with many categories (select the most frequent category values and apply one hot encoding only to those)
- Mean encoding (each category is replaced by the mean of the target for that category)

### Ordinal Encoding
- Label encoding (assigns a rank to each category value, e.g. PHD-1, Masters-2, BE-3, Bcom-4)
- Target guided ordinal encoding (compute the mean of the output for each category, then rank the categories by that mean: the category with the highest mean gets the highest label, followed by the second highest, and so on)

# Feature Scaling

### Gradient descent converges to the minimum faster when the data is scaled
- Feature scaling helps linear regression trained with gradient descent, because the coefficients reach the minimum faster when all features are on a similar scale.
- KNN and K-means work on Euclidean distance, so a feature with a large range dominates the distance unless the features are scaled.
- Feature scaling is not necessary for tree-based ensemble techniques, because decision trees split on one feature at a time and are unaffected by its scale.
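
A minimal sketch of two common scaling schemes, standardization and min-max scaling, using NumPy only. The toy array and the helper names `standardize` and `min_max_scale` are illustrative assumptions and do not come from the original notebook.

```python
import numpy as np

def standardize(x):
    """Rescale each column to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - x.mean(axis=0)) / std

def min_max_scale(x):
    """Rescale each column to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (x - x.min(axis=0)) / span

# hypothetical data: age in years, income in currency units
toy = np.array([[25, 30000], [35, 60000], [45, 90000]])
scaled_std = standardize(toy)
scaled_mm = min_max_scale(toy)
```

After either transformation both columns live on a comparable scale, so the income column no longer dominates a Euclidean distance.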

# Handling missing values
- Delete rows with missing values (feasible when data is huge)
- Replace with the most frequent value (might lead to an imbalanced dataset)
- Apply a classifier (rows without missing values are the training data, rows with the missing value are the test data, and the feature with missing values is the output to predict)
- Apply unsupervised ML (form clusters with K-means, then fill the missing value from the cluster the row falls into)
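
The first two options can be sketched with pandas as follows. The toy frame and its column names are hypothetical, and the median fill for the numeric column is an addition that is not in the list above.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with missing entries
toy = pd.DataFrame({
    "city": ["Pune", "Delhi", np.nan, "Pune", "Pune"],
    "age": [22.0, np.nan, 35.0, 41.0, 29.0],
})

# option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# option 2: fill a categorical column with its most frequent value (mode)
filled = toy.copy()
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# for a numeric column the median is a common substitute (assumption, see lead-in)
filled["age"] = filled["age"].fillna(filled["age"].median())
```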

# FEATURE ENGINEERING: HANDLING CATEGORICAL FEATURES

#### Topics covered
- Handling categorical features with a small number of categories (one hot encoding)
- Handling categorical features with a large number of categories (one hot encoding of the top categories)
- Count of Feature Encoding
- Ordinal number encoding or label encoding
- Count or frequency encoding
- Target Guided Ordinal Encoding
- Mean Encoding
- Probability ratio encoding

## 1) Handling categorical features with a small number of categories (one hot encoding)
- Use only when the feature has few unique category values

In [3]:
# import libraries

import pandas as pd
import numpy as np

df= pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv",usecols=['Sex'])
df.head()


Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [4]:
# get_dummies replaces each category with a binary indicator column

sex_values = list(df['Sex'])

print(pd.get_dummies(sex_values))

     female  male
0         0     1
1         1     0
2         1     0
3         1     0
4         0     1
..      ...   ...
886       0     1
887       1     0
888       1     0
889       0     1
890       0     1

[891 rows x 2 columns]


In [5]:
# now consider a feature with more category values than before

In [6]:
df= pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv",usecols=['Embarked'])
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [7]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [8]:
df.dropna(inplace=True)


In [9]:
# get_dummies replaces each category with a binary indicator column

embarked_values = list(df['Embarked'])

print(pd.get_dummies(embarked_values))

     C  Q  S
0    0  0  1
1    1  0  0
2    0  0  1
3    0  0  1
4    0  0  1
..  .. .. ..
884  0  0  1
885  0  0  1
886  0  0  1
887  1  0  0
888  0  1  0

[889 rows x 3 columns]


## 2) Handling categorical features with a large number of categories (one hot encoding)
- Keep only the most frequent category values

In [11]:
import pandas as pd
import numpy as np

In [12]:
# loading the data and selecting few categorical columns

data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2','X3','X4','X5','X6'])

In [13]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [14]:
# checking how many category values are there in each feature

for col in data.columns:
    print(col ,": " , len(data[col].unique()), " labels")

X1 :  27  labels
X2 :  44  labels
X3 :  7  labels
X4 :  4  labels
X5 :  29  labels
X6 :  12  labels


In [15]:
# data shape before one hot encoding
data.shape

(4209, 6)

In [16]:
# data shape after one hot encoding
pd.get_dummies(data, drop_first=True).shape

(4209, 117)

In [17]:
# we get 117 features after one hot encoding, which might lead to the curse of dimensionality

# another approach is to one hot encode only the 10 most frequent category values and set all remaining categories to 0

In [18]:
# viewing the top 20 categorical values in feature X2
data.X2.value_counts().sort_values(ascending=False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
z       19
ag      19
Name: X2, dtype: int64

#### Selecting the top 10 category values of a high-cardinality feature and one hot encoding only those.

In [20]:
# Let's consider the X2 feature for this example
# Let's make a list of its 10 most frequent category values

top10 = list(data.X2.value_counts().head(10).index)
top10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [21]:
# we manually create one binary variable for each of the 10 categories
# 1 if the row has that label, else 0

for label in top10:
    data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top10].head(20)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [22]:
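# creating a function to apply this to any feature

def one_hot(df, feature, top_labels):
    # add one binary column per label: 1 where the feature equals the label, else 0
    for label in top_labels:
        df[feature+'_'+label] = np.where(df[feature]==label, 1, 0)

# reading the data again
data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2','X3','X4','X5','X6'])

# perform one hot encoding for the X1 feature
# note: top10 still holds the top labels of X2, so only the X1 values that
# happen to be in that list (e.g. 'r', 's') get a 1
one_hot(data,'X1',top10)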
data.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_as,X1_ae,X1_ai,X1_m,X1_ak,X1_r,X1_n,X1_s,X1_f,X1_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,0,0,0,0,0,0,0,0
6,r,e,f,d,f,h,0,0,0,0,0,1,0,0,0,0
7,l,as,f,d,f,j,0,0,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,0,0,0,0,0,0,1,0,0
9,b,aq,c,d,f,a,0,0,0,0,0,0,0,0,0,0


In [23]:
# perform one hot encoding for the X2 feature
one_hot(data,'X2',top10)
# perform one hot encoding for the X3 feature
one_hot(data,'X3',top10)
# perform one hot encoding for the X4 feature
one_hot(data,'X4',top10)
# perform one hot encoding for the X5 feature
one_hot(data,'X5',top10)
# perform one hot encoding for the X6 feature
one_hot(data,'X6',top10)

data.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_as,X1_ae,X1_ai,X1_m,...,X6_as,X6_ae,X6_ai,X6_m,X6_ak,X6_r,X6_n,X6_s,X6_f,X6_e
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,r,e,f,d,f,h,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,l,as,f,d,f,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,b,aq,c,d,f,a,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# now we can drop the original features before training the machine learning model

data.drop(columns=['X1','X2','X3','X4','X5','X6'], inplace=True)

data.head(10)

Unnamed: 0,X1_as,X1_ae,X1_ai,X1_m,X1_ak,X1_r,X1_n,X1_s,X1_f,X1_e,...,X6_as,X6_ae,X6_ai,X6_m,X6_ak,X6_r,X6_n,X6_s,X6_f,X6_e
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# view the columns created
data.columns

Index(['X1_as', 'X1_ae', 'X1_ai', 'X1_m', 'X1_ak', 'X1_r', 'X1_n', 'X1_s',
       'X1_f', 'X1_e', 'X2_as', 'X2_ae', 'X2_ai', 'X2_m', 'X2_ak', 'X2_r',
       'X2_n', 'X2_s', 'X2_f', 'X2_e', 'X3_as', 'X3_ae', 'X3_ai', 'X3_m',
       'X3_ak', 'X3_r', 'X3_n', 'X3_s', 'X3_f', 'X3_e', 'X4_as', 'X4_ae',
       'X4_ai', 'X4_m', 'X4_ak', 'X4_r', 'X4_n', 'X4_s', 'X4_f', 'X4_e',
       'X5_as', 'X5_ae', 'X5_ai', 'X5_m', 'X5_ak', 'X5_r', 'X5_n', 'X5_s',
       'X5_f', 'X5_e', 'X6_as', 'X6_ae', 'X6_ai', 'X6_m', 'X6_ak', 'X6_r',
       'X6_n', 'X6_s', 'X6_f', 'X6_e'],
      dtype='object')
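
Note that the cells above reuse the top 10 labels of X2 for every feature, which is why most of the generated columns contain only zeros. A possible variant, sketched here on a hypothetical toy frame, computes the top labels separately for each feature. The helper name `one_hot_top` and its `k` parameter are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def one_hot_top(df, feature, k=10):
    """Add binary columns for the k most frequent labels of `feature`."""
    # value_counts() sorts by frequency in descending order
    top_labels = df[feature].value_counts().head(k).index
    for label in top_labels:
        df[f"{feature}_{label}"] = np.where(df[feature] == label, 1, 0)
    return list(top_labels)

# hypothetical toy data standing in for the Mercedes columns
toy = pd.DataFrame({"X1": ["a", "b", "a", "c", "a", "b"],
                    "X2": ["p", "q", "q", "q", "r", "p"]})

# each feature gets columns for its own two most frequent labels
for col in ["X1", "X2"]:
    one_hot_top(toy, col, k=2)
```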

## 3) Count of Feature Encoding

### High Cardinality

### Applied when a feature has a very large number of category values
- Replace each category value with its count, i.e. the number of times that value appears in the data

In [27]:
# consider 2 columns
data = pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\mercedes.csv" , usecols=['X1','X2'])


In [28]:
data.shape

(4209, 2)

In [29]:
# prone to curse of dimensionality
pd.get_dummies(data).shape

(4209, 71)

In [30]:
# checking for unique labels in this feature
len(data['X1'].unique())

27

In [31]:
# checking for unique labels in this feature
len(data['X2'].unique())

44

In [32]:
# let's capture the count of each label of the X2 feature in a dictionary
data.X2.value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'z': 19,
 'ag': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'y': 11,
 'ap': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'al': 5,
 'q': 5,
 'an': 5,
 'ah': 4,
 'p': 4,
 'av': 4,
 'au': 3,
 'af': 1,
 'ar': 1,
 'j': 1,
 'am': 1,
 'aa': 1,
 'o': 1,
 'l': 1,
 'c': 1}

In [33]:
# let's store it in a variable
df_frequency_map = data.X2.value_counts().to_dict()

In [34]:
# now map the dictionary onto the X2 column, replacing each label with its count
data.X2 = data.X2.map(df_frequency_map)
data

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137
...,...,...
4204,s,1659
4205,o,29
4206,v,153
4207,r,81


### Advantages
- easy to implement
- does not increase feature dimension

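### Disadvantages
- Results are inconsistent: it helps on some datasets and not on others
- If two labels have the same count they receive the same encoded value (for example, 'z' and 'ag' both map to 19 above), so valuable information may be lost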
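
The loss of information when labels share a count can be seen on a small example. This is a sketch on a hypothetical toy series; the variable names are illustrative.

```python
import pandas as pd

# hypothetical labels: 'x' and 'y' both appear twice
toy = pd.Series(["x", "x", "y", "y", "z"], name="label")

counts = toy.value_counts()   # occurrences per label
encoded = toy.map(counts)     # count encoding

# how many distinct labels sit behind each encoded value
labels_per_code = counts.value_counts()
```

Here `x` and `y` both become 2, so after encoding the model can no longer distinguish them.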

## 4) Ordinal number encoding or label encoding
- Assign each category value a rank that reflects its natural order or its importance for the problem

In [37]:
# creating a list of the last 20 days, counting back from today

import pandas as pd
import datetime

# getting the dates between the range and creating a dataframe
df_base = datetime.datetime.today()
df_date_list = [df_base - datetime.timedelta(days=x) for x in range(0,20)]
df = pd.DataFrame(df_date_list)
df.columns = ['day']
df

Unnamed: 0,day
0,2021-11-26 17:07:54.842531
1,2021-11-25 17:07:54.842531
2,2021-11-24 17:07:54.842531
3,2021-11-23 17:07:54.842531
4,2021-11-22 17:07:54.842531
5,2021-11-21 17:07:54.842531
6,2021-11-20 17:07:54.842531
7,2021-11-19 17:07:54.842531
8,2021-11-18 17:07:54.842531
9,2021-11-17 17:07:54.842531


In [38]:
# extracting the weekday name

df['day_of_week'] = df['day'].dt.strftime("%A")

In [39]:
df.head()

Unnamed: 0,day,day_of_week
0,2021-11-26 17:07:54.842531,Friday
1,2021-11-25 17:07:54.842531,Thursday
2,2021-11-24 17:07:54.842531,Wednesday
3,2021-11-23 17:07:54.842531,Tuesday
4,2021-11-22 17:07:54.842531,Monday


In [40]:
# Engineering the categorical variables by ordinal number replacement

weekday_map = {
    'Monday':1,
    'Tuesday':2,
    'Wednesday':3,
    'Thursday':4,
    'Friday':5,
    'Saturday':6,
    'Sunday':7
}

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head(10)

Unnamed: 0,day,day_of_week,day_ordinal
0,2021-11-26 17:07:54.842531,Friday,5
1,2021-11-25 17:07:54.842531,Thursday,4
2,2021-11-24 17:07:54.842531,Wednesday,3
3,2021-11-23 17:07:54.842531,Tuesday,2
4,2021-11-22 17:07:54.842531,Monday,1
5,2021-11-21 17:07:54.842531,Sunday,7
6,2021-11-20 17:07:54.842531,Saturday,6
7,2021-11-19 17:07:54.842531,Friday,5
8,2021-11-18 17:07:54.842531,Thursday,4
9,2021-11-17 17:07:54.842531,Wednesday,3
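
As an alternative to a hand-written dictionary, the same mapping can be sketched with an ordered pandas categorical. The toy frame is hypothetical; the ordering Monday to Sunday mirrors the dictionary above.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# hypothetical toy column of weekday names
toy = pd.DataFrame({"day_of_week": ["Friday", "Monday", "Sunday"]})

cat = pd.Categorical(toy["day_of_week"], categories=weekday_order, ordered=True)

# codes start at 0, so shift by 1 to match the dictionary;
# a label outside weekday_order gets code -1, i.e. 0 after the shift
toy["day_ordinal"] = cat.codes + 1
```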


## 5) Count or frequency encoding
- Replace the categories by the count of the observations that show that category in the dataset.

In [41]:
# importing libraries and fetching the data

import pandas as pd
import numpy as np

train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data' , header = None,index_col=None)
train_set.head()  

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [42]:
# getting the categorical columns

columns=[1,3,5,6,7,8,9,13]

In [44]:
# making a new dataset with only categorical columns

train_set=train_set[columns]
train_set.head()

Unnamed: 0,1,3,5,6,7,8,9,13
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [45]:
# renaming the columns

train_set.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']
train_set.head()

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [47]:
# checking the number of unique labels in each feature

for feature in train_set.columns:
    print(feature,': ', len(train_set[feature].unique()), ' labels')

Employment :  9  labels
Degree :  16  labels
Status :  7  labels
Designation :  15  labels
family_job :  6  labels
Race :  5  labels
Sex :  2  labels
Country :  42  labels


In [49]:
# let's count how often each country appears and store the counts in a dictionary

counts = train_set['Country'].value_counts().to_dict()
counts

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Laos': 18,
 ' Thailand': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [50]:
# creating a new feature that holds, for each row, the count of its country
# This is called count or frequency encoding

train_set['Country_count'] = train_set['Country'].map(counts)
train_set.head(10)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country,Country_count
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,Jamaica,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,United-States,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,29170


##### Advantages
- Easy to use
- Does not increase the feature space

##### Disadvantages
- Categories with the same frequency receive the same encoded value, so they get the same weight
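
When the data is split into training and test sets, the counts are usually learned from the training set only and then applied to the test set. The sketch below assumes hypothetical `train` and `test` frames and treats labels that never occur in training as count 0, which is one possible convention.

```python
import pandas as pd

# hypothetical train and test sets
train = pd.DataFrame({"Country": ["India", "India", "Cuba", "India", "Peru"]})
test = pd.DataFrame({"Country": ["Cuba", "Japan", "India"]})

# learn the counts from the training data only
counts = train["Country"].value_counts()

train["Country_count"] = train["Country"].map(counts)
# labels never seen in training map to NaN; treat them as count 0 (assumption)
test["Country_count"] = test["Country"].map(counts).fillna(0).astype(int)
```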

## 6) Target Guided Ordinal Encoding
- Compute the mean of the output for each category value, then rank the categories by that mean.
- The labels are thus ordered according to the target.
- Replace each label with its rank (replacing it with the mean itself, i.e. the probability of the target being 1, is mean encoding).


In [1]:
# importing libraries and fetching the data

import pandas as pd

df=pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv", usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [2]:
# replacing NaN with "Missing" (assigning the result back avoids chained in-place modification)
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [3]:
# keeping only the first character [0] of the string, which identifies the block the cabin belongs to

df['Cabin_block']=df['Cabin'].astype(str).str[0]
df.head(10)

Unnamed: 0,Survived,Cabin,Cabin_block
0,0,Missing,M
1,1,C85,C
2,1,Missing,M
3,1,C123,C
4,0,Missing,M
5,0,Missing,M
6,0,E46,E
7,0,Missing,M
8,1,Missing,M
9,1,Missing,M


In [4]:
# checking the unique blocks
# these are the categories we want to encode

df.Cabin_block.unique()


array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [8]:
# grouping by cabin block to get the survival rate (mean of Survived) per block
df.groupby(['Cabin_block'])['Survived'].mean()

Cabin_block
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [10]:
# sorting the blocks by survival rate in ascending order and taking the index
# the resulting order runs from the lowest to the highest survival rate

ordinal_labels = df.groupby(['Cabin_block'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin_block')

In [13]:
# mapping the labels to a number
# block D has the highest survival rate, so it receives the highest rank

dict_ordinal_labels = {k:i for i,k in enumerate(ordinal_labels, 0)}
dict_ordinal_labels

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [14]:
df['Cabin_ordinal_labels'] = df['Cabin_block'].map(dict_ordinal_labels)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_block,Cabin_ordinal_labels
0,0,Missing,M,1
1,1,C85,C,4
2,1,Missing,M,1
3,1,C123,C,4
4,0,Missing,M,1
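
The steps above can be collected into a small reusable helper. This is a sketch on hypothetical toy data; the function name `target_guided_ordinal` and the column names are assumptions.

```python
import pandas as pd

def target_guided_ordinal(df, feature, target):
    """Map each label of `feature` to its rank by mean target (0 = lowest mean)."""
    ordered = df.groupby(feature)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}

# hypothetical toy data: survival means are A = 0.5, B = 1.0, C = 0.0
toy = pd.DataFrame({
    "block": ["A", "A", "B", "B", "C", "C"],
    "survived": [0, 1, 1, 1, 0, 0],
})

mapping = target_guided_ordinal(toy, "block", "survived")
toy["block_ordinal"] = toy["block"].map(mapping)
```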


## 7) Mean Encoding
- Mean encoding replaces each category with the mean of the target for that category, i.e. the probability of the target conditional on that value of the feature. In a way, it embodies the target variable in its encoded value.

In [16]:
df.groupby(['Cabin_block'])['Survived'].mean()

Cabin_block
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [17]:
# mapping the blocks to the mean values

mean_ordinal= df.groupby(['Cabin_block'])['Survived'].mean().to_dict()
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [18]:
# inserting a new column

df['mean_ordinal_encode']= df['Cabin_block'].map(mean_ordinal)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_block,Cabin_ordinal_labels,mean_ordinal_encode
0,0,Missing,M,1,0.299854
1,1,C85,C,4,0.59322
2,1,Missing,M,1,0.299854
3,1,C123,C,4,0.59322
4,0,Missing,M,1,0.299854
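
Because mean encoding uses the target, the means are usually computed on the training data only and then applied to unseen data, so that the target of the test rows does not leak into the feature. A minimal sketch with hypothetical `train` and `test` frames follows; falling back to the global training mean for unseen labels is one possible choice, not the notebook's method.

```python
import pandas as pd

# hypothetical train and test sets
train = pd.DataFrame({"block": ["A", "A", "B", "B", "B"],
                      "survived": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"block": ["A", "B", "Z"]})

# per-category target mean, learned from the training data only
means = train.groupby("block")["survived"].mean()
global_mean = train["survived"].mean()

train["block_mean"] = train["block"].map(means)
# 'Z' never occurs in training, so it falls back to the global mean (assumption)
test["block_mean"] = test["block"].map(means).fillna(global_mean)
```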


## 8) Probability Ratio Encoding
- Probability ratio encoding is similar to Weight of Evidence (WoE); the difference is that only the ratio of the good and bad probabilities is used, without taking its logarithm.
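
Written as a formula (the notation is introduced here for illustration and is not taken from the notebook), for a category value $c$ of a feature $x$ and a binary target $y$:

```latex
\mathrm{PR}(c) = \frac{P(y = 1 \mid x = c)}{P(y = 0 \mid x = c)}
             = \frac{P(y = 1 \mid x = c)}{1 - P(y = 1 \mid x = c)}
```

For cabin block A in the Titanic data, the survival rate is 0.466667, so PR(A) = 0.466667 / 0.533333 = 0.875. The ratio is undefined when the denominator is 0, that is, when every row of a category has $y = 1$.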

In [22]:
# importing libraries and fetching the data

import pandas as pd

df=pd.read_csv(r"C:\Users\tejas\Desktop\ineuron\feature engineering\data\titanic.csv", usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [23]:
# Replacing missing values (assigning the result back avoids chained in-place modification)

df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [24]:
df.Cabin.unique()


array(['Missing', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62

In [25]:
# all cabins starting with C fall in the same block, so we keep only the first character
# the first character [0] of the string identifies the block the cabin belongs to

df['Cabin_block']=df['Cabin'].astype(str).str[0]
df.head(10)


Unnamed: 0,Survived,Cabin,Cabin_block
0,0,Missing,M
1,1,C85,C
2,1,Missing,M
3,1,C123,C
4,0,Missing,M
5,0,Missing,M
6,0,E46,E
7,0,Missing,M
8,1,Missing,M
9,1,Missing,M


In [26]:
df.Cabin_block.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [30]:
# We need the proportion of people who survived in each block

probability_df = df.groupby(['Cabin_block'])['Survived'].mean()
probability_df

Cabin_block
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [31]:
# creating a new dataframe for survived

prob_df = pd.DataFrame(probability_df)
prob_df

Cabin_block,Survived
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [32]:
# now we need the probability of not surviving in each block
# survived and not survived always sum to 1

prob_df['not_Survived'] = 1 - prob_df['Survived']
prob_df.head()

Cabin_block,Survived,not_Survived
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


In [33]:
# to get the probability ratio encoding, divide the probability of surviving by the probability of not surviving

prob_df['probability_ratio'] = prob_df['Survived'] / prob_df['not_Survived']
prob_df

Cabin_block,Survived,not_Survived,probability_ratio
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0
F,0.615385,0.384615,1.6
G,0.5,0.5,1.0
M,0.299854,0.700146,0.428274
T,0.0,1.0,0.0


In [35]:
# converting the probability_ratio to a dictionary and mapping with its index

probability_encoded = prob_df['probability_ratio'].to_dict()
probability_encoded

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'M': 0.42827442827442824,
 'T': 0.0}

In [36]:
# mapping this value to the original dataframe's blocks
df['probability_encoded'] = df['Cabin_block'].map(probability_encoded)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_block,probability_encoded
0,0,Missing,M,0.428274
1,1,C85,C,1.458333
2,1,Missing,M,0.428274
3,1,C123,C,1.458333
4,0,Missing,M,0.428274
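
One edge case is not hit by the Titanic data: if every row of a category has target 1, the denominator is 0 and the ratio becomes infinite. The sketch below, on hypothetical toy data, replaces a zero denominator with a small constant. The value `1e-6` is an arbitrary assumption; other conventions such as smoothing or capping are equally possible.

```python
import pandas as pd

# hypothetical toy data: everyone in block B survived
toy = pd.DataFrame({
    "block": ["A", "A", "B", "B", "C", "C"],
    "survived": [1, 0, 1, 1, 0, 0],
})

p_survived = toy.groupby("block")["survived"].mean()
p_died = 1 - p_survived

# replace a zero denominator with a small constant to keep the ratio finite
ratio = p_survived / p_died.replace(0, 1e-6)

toy["probability_encoded"] = toy["block"].map(ratio)
```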


#### We can now drop Cabin and Cabin_block and use probability_encoded as the feature, with Survived as the target, for our model